Original Reddit post

Been digging into the LingBot-VLA tech report (arXiv:2601.18692) and there's one finding that I think deserves way more attention than it's getting: they scaled real-world robot pre-training data from 3,000 hours all the way to 20,000 hours across 9 dual-arm robot configurations, and the downstream success rates just keep climbing with no sign of flattening out.

Let me put that in context. We've seen scaling laws for LLMs debated endlessly, and there's growing skepticism that we're hitting walls in text/code domains. But for physical robot manipulation? This is the first systematic evidence I've seen that more real-world data = better real-world performance, and it hasn't plateaued yet. The implication is kind of wild: whoever can collect and curate the most real robot hours wins.

Now here's where it gets interesting, and also where I want to push back on the hype a bit. Their best model (with depth integration) hits an average success rate of 17.30% across 100 tasks on 3 different robot platforms. So yes, it clearly outperforms π0.5 and GR00T N1.6 under controlled conditions (same data, same hyperparameters, same hardware per task). But a 17.30% success rate in absolute terms tells you that real-world dual-arm manipulation is still brutally hard. The progress score of 35.41% is more encouraging, because it means the robot is getting through about a third of each task's subtasks on average, even when it doesn't fully succeed.

The depth distillation approach is worth noting too. They use learnable queries aligned with depth embeddings from a separate depth model (LingBot-Depth), which lets the VLA implicitly reason about 3D space without needing explicit point clouds at inference time. In simulation with randomized scenes, this bumps the success rate from 76.76% (π0.5) to 86.68%.
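As I read it, the mechanism is roughly: a small set of learnable query vectors attends over the VLM's visual features, and the query outputs are pulled toward the depth model's embeddings during pre-training only, so nothing extra is needed at inference. The report doesn't spell out the exact loss, so here's a minimal numpy sketch of that idea under my own assumptions (cosine-distance alignment, single-head attention; all names hypothetical):

```python
import numpy as np

def cross_attend(queries, features):
    # Scaled dot-product attention: each learnable query attends
    # over the VLM's visual patch features.
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)              # (Q, F)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ features                               # (Q, d)

def alignment_loss(query_out, depth_emb):
    # Cosine-distance distillation target (assumed; the paper may
    # use L2 or a contrastive loss instead).
    qn = query_out / np.linalg.norm(query_out, axis=-1, keepdims=True)
    dn = depth_emb / np.linalg.norm(depth_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(qn * dn, axis=-1)))

rng = np.random.default_rng(0)
queries   = rng.normal(size=(8, 64))    # 8 learnable depth queries
features  = rng.normal(size=(196, 64))  # stand-in for 14x14 VLM patch features
depth_emb = rng.normal(size=(8, 64))    # targets from the depth model
out  = cross_attend(queries, features)
loss = alignment_loss(out, depth_emb)   # minimized during pre-training only
```

The nice property this buys you is exactly what they claim: the depth model is only a teacher, so at deployment the VLA runs from RGB alone with no point clouds or depth sensor in the loop.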
The gap is even more pronounced in randomized environments than in clean ones, which suggests depth awareness really matters when things get messy, exactly the conditions real deployment would face.

On the engineering side, their training codebase achieves 261 samples/sec/GPU on an 8-GPU setup, which they claim is 1.5x to 2.8x faster than existing VLA codebases like OpenPI, StarVLA, and Dexbotic, depending on the backbone VLM. They use FSDP with mixed precision and operator fusion via torch.compile. The scaling efficiency looks nearly linear up to 256 GPUs, which is genuinely impressive for this type of workload.

They've open-sourced the full code, base model, and benchmark data (GitHub, HuggingFace, the works). So this isn't a "trust our numbers" situation; anyone can reproduce.

Here's the tradeoff I keep thinking about, though. If the scaling curve truly doesn't saturate, then the bottleneck for embodied AI isn't architecture or algorithms, it's data collection. 20,000 hours of teleoperated dual-arm data is already a massive investment. What happens when you need 100K or 500K hours? The cost and logistics of collecting real robot data at that scale are completely different from scraping the internet for text. Does this mean only well-funded labs with robot fleets will ever compete in this space? Or does open-sourcing the model and benchmark (like they've done here) create a path where the community can collectively contribute data?

I'm also curious whether these scaling properties hold across fundamentally different embodiments. They pre-trained on 9 robot configurations but evaluated on only 3 platforms, and all of them are dual-arm tabletop setups. Would the same curve hold if you threw in mobile manipulation, single-arm systems, or legged robots?

What's your read on this? Is "just collect more real data" actually the answer for robotics, or are we going to hit a wall that simulation and synthetic data will need to fill?
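Edit: if you want to sanity-check what the throughput claim implies at scale, the back-of-envelope is simple. The 261 samples/sec/GPU figure is from the report; the efficiency factor is my own assumption, since "nearly linear" isn't quantified:

```python
PER_GPU = 261  # samples/sec/GPU, as reported on their 8-GPU reference setup

def aggregate_throughput(n_gpus, efficiency=1.0):
    # Aggregate samples/sec under an assumed scaling-efficiency factor.
    return PER_GPU * n_gpus * efficiency

eight = aggregate_throughput(8)          # 2088 samples/sec on one node
ideal = aggregate_throughput(256)        # 66816 samples/sec if perfectly linear
real  = aggregate_throughput(256, 0.95)  # ~63.5k at a guessed 95% efficiency
```

So even a few points of lost efficiency at 256 GPUs costs you thousands of samples per second, which is why the near-linear scaling claim matters as much as the per-GPU number.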

Originally posted by u/FeelingWatercress871 on r/ArtificialInteligence