Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
For example, you need to avoid crushing a human when giving a massage (while still pressing hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting through the skin itself.
More practically, in the near term it's hard to sample failure examples from YouTube videos, such as food accidentally spilling out of a pot. Studying simple tasks only through the happy path makes it hard for the robot to figure out how to keep trying until it succeeds, a problem that shows up even in relatively simple jobs like taking out the garbage.
With that said, I suppose a robot can be made to practice in real life after learning something from vision.
> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
I'm not sure that's necessarily true for a lot of tasks.
A good way to measure this in your head is this:
"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps. Though I suspect that could also be done with just vision.
> It therefore follows that robots should be able to learn with just RGB images too!
I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.
You'd use a two-step approach.
1. First create a model that can evaluate how well a task is going; the YT approach can be used here.
2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.
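To make the two steps above concrete, here's a minimal sketch of the loop, assuming PyTorch and a toy environment interface. Every name in it (VideoProgressScorer, TouchAugmentedPolicy, the obs dict) is a placeholder of mine, not anything from V-JEPA 2 or the blog post.

```python
import torch
import torch.nn as nn

class VideoProgressScorer(nn.Module):
    """Step 1: a model (pre)trained on web video to score how well a task is going (0..1)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, frame_features):              # frame_features: (B, feat_dim)
        return torch.sigmoid(self.head(frame_features))

class TouchAugmentedPolicy(nn.Module):
    """Step 2: a policy that sees vision features *and* touch/pressure readings."""
    def __init__(self, feat_dim=512, touch_dim=16, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + touch_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, frame_features, touch):
        return torch.tanh(self.net(torch.cat([frame_features, touch], dim=-1)))

def collect_episode(env, policy, scorer, horizon=100):
    """Roll out the policy on the real robot; the video-trained scorer supplies the reward."""
    trajectory = []
    obs = env.reset()                                # obs = {"vision": ..., "touch": ...} (assumed)
    for _ in range(horizon):
        action = policy(obs["vision"], obs["touch"])
        next_obs, done = env.step(action)
        reward = scorer(next_obs["vision"])          # supervision comes from the step-1 model
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory                                # feed this to your favorite RL update
```

The touch/pressure channel only ever enters in step 2, on the real robot, which is the whole point of splitting it this way.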
You're agreeing with the parent btw. You've introduced a lot more than just vision. You introduced interventional experimentation. That's a lot more than just observation
What I describe is an unsupervised system.
What you say ("interventional") sounds like it's human-supervised.
But maybe I'm interpreting it in the wrong way, so please correct me if so.
By "intervention" I mean interacting with the environment. Propose a hypothesis, test, modify, test. You can frame RL this way, though RL usually generates hypotheses that are far too naïve.
This looks like a good brief overview (I only skimmed it but wanted to give you more than "lol, google it"): http://smithamilli.com/blog/causal-ladder/
> because you as a human have really good intuition about the world.
This is the line that causes your logic to fail.
You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.
The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."
[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...
Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.
Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.
I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.
Simple concept: pick up a glass and pour its contents into a vertical hole approximately the size of your mouth. Think of all the failure modes that can be triggered in this trivial example you perform multiple times a day; doing the same from a single camera feed with no other indicators would take you hours to master, and you are already a super-intelligent being.
A routine gesture I've done every day for almost all my life: getting a glass off the shelf and into my left hand. It seems like a no-brainer: I open the cabinet with my left hand, take the glass with my right hand, throw the glass from my right hand to the left hand while closing the cabinet with my shoulder, put the glass under the faucet with my left hand, open the faucet with the right hand.
I have done this 3-second gesture, and variations of it, basically my whole life, and never noticed I was throwing the glass from one hand to the other without any visual feedback.
And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.
If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.
The point is how much non-vision senses, versus pure vision, help humans be humans. Don't you think LLMs already proved this point, that generalizability doesn't come from multi-modality but by scaling a single modality itself? And JEPA is surely designed to do a better job at that than an LLM. So there's little doubt that raw scaling plus an RL boost will kick in highly predictable and specific robotic movements.
This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization but that's not enough for what you're inferring. The best way to show this is to carefully talk to an LLM about anything you have a lot of domain expertise in. Be careful to not give it answers (information leakage can sneak in subtly) and specifically look for those small subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right but the information won't.
Also, LLMs these days aren't trained on just language
If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.
Except this is the absolute most common thing humans do, and my argument is not that it will spill water all over, but rather that it will shatter numerous glasses, knock them over, etc., all before it has even picked up the glass.
The same process will be repeated many times trying to move the glass to its "face", and then when any variable changes (plastic vs. glass, size, shape, location) all bets are off, purely because there just plainly isn't enough information.
counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. i think its really a lot of stuff considering blind people can do the vast majority of things sighted people can do, and i suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input
> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
And where does this intuition come from? It was built by also feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid. How hot/cold feels, how hard/soft feels, how things smell. Your mental model of the world is substantially informed by non-visual cues.
> It therefore follows that robots should be able to learn with just RGB images too!
That does not follow at all! It's not how you learned either.
Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.
>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
There are an infinite number of scenes that can be matched to one 2D picture. And what is a scene, really? The last time I checked, raw RGB was not a good input representation in computer vision; instead, CNNs build a compositional scene through increasing levels of gradients. None of that is particularly translatable to how an LM works with text.
Humans have innate knowledge that helps them interact with the world, and they can learn the rest from physical interaction. RGB images aren't enough.
Video games have shown that we can control characters pretty darn well in virtual worlds whose physics we have not experienced. We just look at a 2D monitor and, using a joystick/keyboard, we manage to figure it out.
a game has very limited physics. like the buttons you press are pre-tuned to perform certain actions and you arent dealing with continuous nearly infinite possibilities with large ranges of motion, pressure, speed etc. like think about how difficult the game QWOP is because you mostly just have visual feedback
I beg to disagree. I got introduced to the brand new (to me) physics of flying airplanes by MS Flight Simulator. None of the rules I knew in real life applied (gravity matters only sometimes, height can be traded for speed, etc.). Yet I learned how to fly.
And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).
Yeah, but we already have a prior conception of what physics should be, and that helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.
I mean they do, but we often have generalized (to some degree) world models. So when they do things like change gravity, flip things upside down, or make even more egregious changes, we can adapt, because we have counterfactual models. But yeah, they could change things so much that you'd really have to relearn, and that could be very, very difficult if not impossible. (I wonder if anyone has created a playable game with a physics that's impossible for humans to learn, at least without "pen and paper". I think you could do this by putting the game in higher dimensions.)
If the robot already knows "how to" do the happy path, the training difficulty falls severely, at least if it can continue after a recovery.
The tasks you do to recover from a failure are often different from the happy path. For example, the happy path of dumping garbage is carrying a garbage bag to a collection bin. The non-happy path is that the bin is overflowing and you have to put the bag on the ground, or the bag leaks and you need to move the trash to a new bag, or the bag breaks entirely and you have to pick the trash up again.
But yeah, I think a better way to put it is that sampling the happy path would indeed make the failure cases easier, but sampling just happy paths is far from sufficient for completing even some of the simplest human tasks that involve failure.
On humans, you can generally see the force they apply by looking at strain.
The error margins will be huge, and for small enough forces (like the skinning part, or handling fine mechanical stuff) there's basically zero signal.
> Pure vision will never be enough because it does not contain information
Say it louder for those in the back!
But actually there's more to this that makes the problem even harder! Lack of sensors is just the beginning. There are well-known results in physics that say:
You cannot create causal models through observation alone.
This is a real pain point for these vision world models, and most people I talk to (including a lot at the recent CVPR) just brush this off with "we just care if it works." Guess what?! Everyone pointing this out also cares that it works! We need to stop these thought-terminating clichés. We're fucking scientists.
Okay, so why isn't observation enough? It's because you can't differentiate alternative but valid hypotheses. You often have to intervene! We're all familiar with this part. You control variables and modify one or a limited set at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).
We need to mention chaos here, because it's the easiest way to understand this. There are many famous problems that fall into this category, like the double pendulum, the 3-body problem, or just fucking gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically. There are probable paths, but not deterministic ones (this same logic is what leads to multiverse theory, btw). But now suppose I was watching the molecules too, continuously recording between t0 and T. Can I predict the trajectories? Well, I don't need to, I just write them down.
Now I hear you, you're saying "Godelski, you observed!" But the problem with this set of problems is that if you don't observe the initial state you can't predict moving forwards, and if you don't have very precise observation intervals you are hit with the same problem. If you turn around while I start a double pendulum, you can have as much time as you want when you turn back around: you won't be able to model its trajectory.
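If you'd rather feel this in twenty lines of code than take my word for it, here's a tiny simulation of the same initial-condition problem. I'm using the Lorenz system instead of the double pendulum purely because its equations are shorter; the lesson is identical.

```python
# A 1e-8 error in the "observed" initial state destroys the prediction.
def lorenz_step(state, dt=0.001, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (x + sigma * (y - x) * dt,
            y + (x * (rho - z) - y) * dt,
            z + (x * y - beta * z) * dt)

true_state = (1.0, 1.0, 1.0)           # what actually happened
estimate   = (1.0 + 1e-8, 1.0, 1.0)    # your observation, off by one part in 10^8

for step in range(1, 40001):
    true_state, estimate = lorenz_step(true_state), lorenz_step(estimate)
    if step % 10000 == 0:
        err = sum((a - b) ** 2 for a, b in zip(true_state, estimate)) ** 0.5
        print(f"t = {step * 0.001:4.0f}   prediction error = {err:.3e}")

# The error blows up by orders of magnitude and saturates at the size of the
# attractor: past a short horizon the deterministic equations no longer tell you
# where the real trajectory went. That's the gas-in-a-box / double-pendulum point.
```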
But it gets worse still. There are confounding variables. There is coupling. There are hypotheses that are difficult to differentiate via causal ordering. And so, so much more. If you ever wonder why physicists do so much math, it's because doing that is a fuck ton easier than doing the whole set of tests and then reverse engineering the equations from those observations. But in physics we care about counterfactual statements. In F=ma we can propose new masses and new accelerations and rederive the results. That's what it is all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real-world environments. You have to be able to ask and answer "what happens if that kid runs into the street?"
I highly suggest people read The Relativity of Wrong [0]. It's a short essay by Isaac Asimov that can serve as a decent intro, though far from complete. I'm suggesting it because I don't want people to confuse "need a counterfactual model" with "need the right answer." If you don't get into metaphysics, these results will be baffling.[1] It is also needed to answer any confusion you might have around the aforementioned distinction.
Tldr:
if you could do it from observation alone, physics would have been solved a thousand years ago
There's a lot of complexity and depth that is easy to miss with the excitement, but it still matters.
I'm just touching the surface here too, and we're just talking about mechanics. No quantum needed, just information loss.
[0] https://hermiene.net/essays-trans/relativity_of_wrong.html
[1] maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...
I didn't understand a single word of this post or what was supposed to be solved, and had to stop reading.
Was this actually written by a human being? If so, the author(s) suffer from severe communication problems. It doesn't seem to be grounded in reality, at least not in my personal experience with robotics. But here's my real-world take:
Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.
I seriously urge the authors to use ROS/ROS2. Show us: implement your solution with ROS, push it to a repository, and allow others to verify what you solved, maybe? Suffer a bit with the framework and then write a real, hands-on post about real robotics, instead of wandering through fancy incomprehensible stuff that probably no one will ever do.
It is readily understandable if you are fluent in the jargon surrounding state of the art LLMs and deep learning. It’s completely inscrutable if you aren’t. The article is also very high level and disconnected from specifics. You can skip to FAIR’s paper and code (linked at the article’s end) for specifics: https://github.com/facebookresearch/vjepa2
If I had to guess, it seems likely that there will be a serious cultural disconnect as 20-something deep learning researchers increasingly move into robotics, not unlike the cultural disconnect that happened in natural language processing in the 2010s and early 20s. Probably lots of interesting developments, and also lots of youngsters excitedly reinventing things that were solved decades ago.
> if you are fluent in the jargon surrounding state of the art LLMs and deep learning
It is definitely not following that jargon. Maybe it follows the tech influencer blog post jargon but I can definitively say it doesn't follow jargon used in research. Which, they are summarizing a research paper. Consequently they misinterpret things and use weird phrases like "actionable physics," which is self referential. "A" physics model is necessarily actionable. It is required to be a counterfactual model. While I can understand the rephrasing to clarify to a more general audience that's a completely different thing than "being fluent in SOTA work." It's literally the opposite...
Also, it definitely doesn't help that they remove all capitalization except in nouns.
I totally agree with you. On the other hand, the theory behind it (combining image recognition with predicting outcomes based on specific physical impacts) does sound intriguing and like a somewhat newer idea.
But besides that, you're totally right. It's too "loose", since to realize that idea the process would have to be way different (and properly explained).
> Doesn't seem to be grounded at least with reality and my personal experience with robotics.
It also doesn't match my personal experience with physics nor ML, and I have degrees in both.
You cannot develop accurate world models through observation alone, full stop.
You cannot verify accurate world models through benchmarks alone, full stop.
These have been pain points in physics for centuries and have been the major pain point even before the quantum revolution. I mean if it were possible, we'd have solved physics long ago. You can find plenty of people going back thousands of years boldly claiming "there is nothing new to be learned in physics," yet it was never true and still isn't true even if we exclude quantum and relativity.
Side note: really the paper is "fine", but I wish we didn't put so much hype in academic writing. Papers should be aimed at other academics and not be advertisements (use the paper to write advertisements like IFLS or Quanta Magazine, but don't degrade the already difficult researcher-to-researcher communication). So I'm saying the experiments are fine and the work represents progress, but it is oversold and the conclusions do not necessarily follow.
Btw, the paper makes these mistakes too. It makes a very bold assumption that counterfactual models (aka a "world model") are learned. This cannot be demonstrated through benchmarking, it must be proven through interpretability.
Unfortunately, the tail is long and heavy... you don't need black swan events to disrupt these models and boy does this annoying fact make it easy to "hack" these types of models. And frankly, I don't think we want robots operating in the wild (public spaces, as opposed to controlled spaces like a manufacturing floor) if I can make it think an iPhone is an Apple with just a stickynote. Sure, you can solve that precise example but it's not hard to come up with others. It's a cat and mouse game, but remember, Jerry always wins.
It's not a scholarly article but a blog post; still, you're right to be frustrated at the very bad writing. I do get the jargon, despite myself, so I can translate: the authors of the blog post claim that machine learning for autonomous robotics is "solved" thanks to an instance of V-JEPA 2 trained on all videos on youtube. It isn't, of course, and the authors themselves point out the severe limitations of the otherwise promising approach (championed by Yann LeCun) when they say, in a notably more subdued manner:
>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.
>> long-horizon drift
>> try to plan more than a few steps ahead and the model starts hallucinating.
That is to say, not quite ready for the real world, V-JEPA 2 is.
But for those who don't get the jargon there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
In other words, some interesting results, some new SOTA, some incremental work. But lots of work for a big team of a couple dozen researchers so there's good stuff in there almost inevitably.
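For a concrete sense of what the abstract's "planning with image goals" can look like mechanically, here is a toy random-shooting planner over a latent world model. The encoder/predictor callables and all the numbers are stand-ins of mine; the paper's actual planner and architecture differ, so read this as the general idea only.

```python
import torch

def plan_action(encoder, predictor, current_image, goal_image,
                act_dim=7, horizon=8, n_candidates=256):
    """Pick the first action of the candidate action sequence whose *predicted*
    latent ends up closest to the goal image's latent."""
    with torch.no_grad():
        z = encoder(current_image).expand(n_candidates, -1)      # (N, D)
        z_goal = encoder(goal_image)                             # (1, D)
        # Sample N random action sequences of length `horizon`.
        actions = torch.randn(n_candidates, horizon, act_dim)
        z_pred = z
        for t in range(horizon):
            # Roll the latent forward, conditioned on each candidate action.
            z_pred = predictor(z_pred, actions[:, t])
        # Score candidates by distance to the goal in representation space,
        # never in pixel space.
        cost = torch.norm(z_pred - z_goal, dim=-1)
        best = torch.argmin(cost)
    return actions[best, 0]       # execute the first action, then replan
```

The goal is specified purely as an image embedding, which is why the blog's complaint about needing "a photo of a clean kitchen handy" falls out of this setup.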
"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.
I definitely saw somebody at Actuate last year talking about supplementing training videos for VLA with Youtube, but I think they actually found that "any" video of the real world helped give a better physics "understanding" to the model.
This writing style is prominent on Twitter and niche Discords. It's funny how much I've come to be able to cut right through it, but if you haven't seen much of it it's really hard to parse. That's by design, too. The vibe of this writing style is to project an air of confidence so strong that the author doesn't care if you get it or not. It's a sort of humblebrag where the writing is supposed to flex the author's understanding of the subject while also not caring if you get it or not.
As others have already covered, there's also some heavy stretching of the truth and rewriting of history going on in this post. That's also common of the extreme bravado in this style of semi-impenetrable writing: The vagueness and ambiguities allow the author to make grandiose claims but then wiggle out of them later if someone is astute enough to catch on.
For example: The blog post is written as “We…” but is the author part of the team? Or is he using “we” meaning society in general?
What's the point in writing something while "not caring" if the reader understands or not? Seems like a false confidence or false bravado to me; it reads like an attempt to project an impression, and not really an attempt to communicate.
Basically: If you understand the topic well, you’re not the target audience.
This is a type of information arbitrage where someone samples something intellectual without fully understanding it, then writes about it for a less technical audience. Their goal is to appear to be the expert on the topic, which translates into clout, social media follows, and eventually they hope job opportunities.
The primary goal of the writing isn’t to get you to understand the topic clearly, because that would diminish the sense that the author is more knowledgeable than you. The goal is to sound guru-like while making the topic feel impenetrably complex for you, while appearing playfully casual for the author.
This style of writing is very effective at convincing people in their impressionable years of a narrative or viewpoint, often one that is hard to defend with more traditional writing styles.
I hope I'm wrong, but this looks like an effort to normalize such writing style. As this happens, intelligent discourse and rhetoric become harder.
It would also be good if the perspective of the article would stay put. This "we" and "they" thing was at best confusing and at worst possibly a way to get more clicks or pretend the author had something to do with the work.
Looks like it was trained on Shaolin Drunken Fist videos. Does it look drunk because of the videos or because there's a discrepancy between videos and it not accounting for gravity and physics in general?
My guess would be lack of actuators. For instance, this robot looks like it has an ankle that can only go up and down, but not roll like a human's. Also, I wonder if there's a center of gravity issue, as it almost always appears to be leaning backwards to even out.
I think it's still pretty impressive in its recoveries, even though there's an unnaturally large number of them necessary. About 8 seconds into the video on the homepage, it almost misses and ends up slipping off the second step. I've eaten shit at missing a couple inch curb, though I don't think "graceful" has ever been used as a descriptor for me. So the fact that it just recovers and keeps going without issue is impressive to me.
> So the fact that it just recovers and keeps going without issue is impressive to me.
I'm pretty sure that's just a matter of reaction speed, and of it maintaining a constant focus/vigilance on its movement that you'd usually not apply outside of some sports, or situations pre-identified as dangerous, like concentrating on balance and not getting into a position that overstresses your joints when you know it's icy.
My mom said I was throwing away my life watching YouTube all day and clearly I just haven’t been watching YouTube enough. 1 million YouTube videos here I come!
I do not know and do not care much about robotics per se, but I wish LLM's were better with spatial reasoning. If the new insight helps with that - great!
I dabbled a bit in geolocation with LLMs recently. It is still surprising to me how good they are at finding the general area where a picture was taken. Give one a photo of a random street corner on this earth and it will likely not only tell you the correct city or town but most often even the correct quarter.
On the other hand, if you ask it for a birds eye view of a green, a brown and a white house on the north side of a one-way street (running west to east) east of an intersection running north to south, it may or may not get it right. If you want it to add an arrow going in the direction of the one-way street, it certainly has no clue at all and the result is 50/50.
> the core insight: predict in representation space, not pixels
We've been doing this since 2014? Not only that, others have been doing it at a similar scale. e.g. Nvidia's world foundation models (although those are generative).
> zero-shot generalization (aka the money shot)
This is easily beaten by flow-matching imitation learning models like what Pi has.
> accidentally solved robotics
They're doing 65% success on very simple tasks.
The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source.
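For anyone wondering what "predict in representation space, not pixels" cashes out to, here is a minimal JEPA-flavored sketch in PyTorch, with made-up module names; the real recipes add masking schedules, EMA target encoders, and so on.

```python
import torch
import torch.nn.functional as F

def latent_prediction_loss(encoder, target_encoder, predictor,
                           context_frames, future_frames):
    """target_encoder is typically a frozen/EMA copy of `encoder`, so the two
    branches can't trivially collapse to a constant."""
    z_context = encoder(context_frames)              # (B, D) online branch
    with torch.no_grad():
        z_future = target_encoder(future_frames)     # (B, D) target branch
    z_pred = predictor(z_context)                    # predict the *embedding* of the future
    return F.mse_loss(z_pred, z_future)              # regression in latent space, no pixel reconstruction
```

The loss never touches pixels; the predictor is only ever asked to regress embeddings, which is the "core insight" the blog post is excited about.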
Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these. Then they will have "solved" part of robotics.
>> Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these.
Just the hand: there are 50 things you just can't do unless you have a certain feel. Handling glass? Oh, it's greasy, now your rubber grip is screwed, now go wash it off and dry it to start again.
Indeed, there is a whole category of Adaptive Compliant Grippers for specific use-cases like handling delicate brittle objects.
Things that randomly change shape or appearance are also very difficult to interact with safely. The force sensing platform from Universal Robots is safer for users, but it has limitations like all platforms. =3
> right now, you need to show the robot pictures of what you want it to do. want it to "clean the kitchen"? better have a photo of a clean kitchen handy.
What about using Flux Kontext (or Controlnets) to turn the messy kitchen into a clean kitchen?
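That's plausible in principle: the planner just needs a goal image from somewhere. A hedged sketch of the glue code, where `edit_image` is a hypothetical stand-in for whatever instruction-following image editor (Flux Kontext, a ControlNet pipeline, ...) you actually call:

```python
def make_goal_image(edit_image, current_image, instruction):
    """Turn a language instruction into the goal image an image-goal planner expects.
    `edit_image` is a hypothetical callable wrapping your editing model of choice."""
    return edit_image(image=current_image, prompt=f"the same scene, but {instruction}")

# e.g. goal = make_goal_image(edit_image, camera.capture(), "the kitchen is clean and tidy")
#      then hand (current_image, goal) to the planner exactly as if a human had supplied the photo.
```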
Per HiQ vs. LinkedIn, it doesn't matter what their ToS says if the scraper didn't have to agree to the ToS to scrape the data. YouTube will serve videos to someone who isn't logged in. So if you've never agreed to YouTube's ToS, you can scrape the videos. If YT forced everyone to log in before they could watch a video, then anyone who wants to scrape videos would have had to agree to the ToS at some point.
It does for me, in USA on AT&T Fiber using Safari in private browsing mode. Chrome in incognito as well. And phone on mobile YouTube (though I didn't test with uninstalling/reinstalling to reset IDFA and IDFV, so it's not really a valid test)
My "lawyer" (gpt4o) claims that since YouTube is merely a non-exclusive licensee of the user content upload to their service, even if they have such restrictions in their ToS (they do), they likely would not hold up in court, citing [0]. Something about that non-exclusivity meaning they cannot constrain the copyright further on their own terms. Which I guess makes sense?
And since scraping of publicly available data is not illegal (in the US, according to the aforementioned "lawyer"), it seems like it's okay?
Who cares at this point? No one is stopping ML sets from being primarily pirated. The current power is effectively dismantling copyright for AI related work.
Anyone who has a shred of integrity. I'm not a fan of overreaching copyright laws, but they've been strictly enforced for years now. Decades, even. They've ruined many lives, like how they killed Aaron Swartz.
But now, suddenly, violating copyright is totally okay and carries no consequences whatsoever because the billionaires decided that's how they can get richer now?
If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all these developments should concern you.
There's an English phrase "hounded to death", meaning that someone was pursued and hassled until they died. It doesn't specify the cause of death, but I think the assumption would be suicide, since you can't actually die of fatigue.
Many people have dealt with the law, with copyright infringement, even with gross amounts of it, and had the book thrown at them, and survived the experience.
Swartz was ill. It is a tragedy he did not survive the experience, and indeed, trial is very stressful. But he was no more hounded than any defendant who comes under federal scrutiny and has to defend themselves in a court of law via the trial system. Kevin Mitnick spent a year in prison (first incarceration) and survived it. Swartz was offered six months and committed suicide.
I don't know how much we should change of the system to protect the Aaron Swartzs of the world; that's the mother of all Chesterton's Fences.
Many people get (for example) pneumonia and recover. Some people get pneumonia and die. The people who died of pneumonia died because of pneumonia. The fact that other people survived it doesn't mean that they didn't die of it.
Saying that we should not work on cures for pneumonia because it's a Chesterton Fence is obviously, blatantly, illogical. Saying that we should change the system so that government officials working for moneyed interests can't hound someone to death is similarly illogical.
Pneumonia doesn't have any societal benefit. The process by which we decide if the law was broken and punishment is necessary has obvious benefit. If you mean we should seek a cure for dangerous suicidal depression, I agree. But surely you are not suggesting that, for example, had Swartz been accused of embezzlement, the state should drop the charges purely because he's a suicide risk; how would that be just to the people who were stolen from?
And it's a point of semantics, but no; we generally don't say people who died by suicide died by the things going on in their life when they ended it. Everybody has stressors. The suicidal also have mental illness. Mr. Swartz had self-documented his past suicidal ideation.
Maybe someone should throw you in prison for a year on some BS made-up charges to see how well you survive it. We can use it as a data point for your argument.
This is true. In general, the harm done by crime is directed outwards from the perpetrator, not inwards to the perpetrator. In fact, the behaviors that only cause self-harm that we criminalize are relatively few.
The "Big Beautiful Bill" contains a clause that prohibits state "AI" legislation.
Trump has a "Crypto and AI czar" who is very active in promoting "AI" on his YouTube propaganda outlet. The same czar also promoted, pre-election of course, accelerated peace with Russia and then stopped talking about the subject altogether.
Fun fact: the International Bureau of Weights and Measures in Paris is the owner of a perfect 0 dB noise floor enclosed in a perfect titanium sphere (with some sheep's wool filling to avoid reflections). There is a small door on the side over which microphone capsules can be inserted for calibration.
I genuinely wish there was a cost estimation feature built into them. It doesn't even have to be remotely close to the true cost; if it's anything like the meetings I attend, there will be enough people and it will go on for long enough to make up for it.
I worked as a consultant and started billing at my normal hourly rate for meetings. You would be surprised how fast the company's desire for my participation in them decreased.
i thought all the cool data driven robotics stuff was like reinforcement learning from sensors that track moving effectors in the real world with online retraining that mimics the sensorimotor experimentation that is observed during the developmental phases of real neurobiological systems?
so you just kinda let it run for a while and it bumps and squirms around until it stands up or whatever.
I don't know. I'm not the expert, but if you've ever tried to do a backflip or anything where your toes are above your head, then you'll know that spatial awareness goes well beyond vision. Or if you throw a frisbee for the dog to catch, they don't actually look at it while running; they look, predict the position, then move in. Veni, vidi, vici. So any model that "learns physics" just through vision seems flawed from the start. What's your thought there?
This is interesting for generalized problems ("make me a sandwich") but not useful for most real world functions ("perform x within y space at z cost/speed"). I think the number of people on the humanoid bandwagon trying to implement generalized applications is staggering right now. The physics tells you they will never be as fast as purpose-built devices, nor as small, nor as cheap. That's not to say there's zero value there, but really we're - uh - grasping at straws...
For a single example, in any factory watch how humans are added as ad-hoc machines wherever a problem occurs. Machine N outputting faster than machine N+1 can accept? Have a human stack, and destack, the product between them. No matter the size, shape, or, within reason, the weight of the product. But most importantly: the process can begin within seconds of the problem occurring. No need for a programmer, developer, or maintenance worker to get involved. Just a clear order from the shift manager.
A general purpose robot with physical interfaces similar to a human would be very valuable for such environments. If it had the software to be as easy to instruct as a human.
I wonder if a generalized machine would have an advantage from scale, and then putting all the specialized stuff into software. We have seen this play out before.
Well, there’s a middle ground, kinda. Using more specialized hardware (ex: cobots) but deploy state-of-art Physical AI (ML/Computer Vision) on them. We’re building one such startup at ko-br (https://ko-br.com/) :))
Very good point! This area faces a similar misalignment of goals in that it tries to be a generic fit-all solution that is rampant with today's LLMs.
We made a sandwich, but it cost you 10x more than a human would and took longer. Slower might slowly become faster and more efficient, but by the time you get really good at it, it's simply not transferable unless the model is genuinely able to make the leap into other domains the way humans naturally do.
I'm afraid this is where the barrier between general intelligence and human intelligence lies. With enough of these geospatial motor-skill databases, we might get something that mimics humans very well but still runs into problems at the edge, and this last-mile problem really is a hindrance in so many domains where we come close but never complete.
I wonder if this will change with some sort of shift in computing, as well as in how we interface with digital systems (without mouse or keyboard); then we might be able to close that "last mile" gap.
It's an interesting comment; it has the same "compliment the OP, elaborate, raise a further question" format I've seen used by apparently LLM-generated spam accounts on HN. But the second paragraph is so incoherently structured that I have a hard time thinking an LLM produced it.
>the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
Reminds me that several years ago Tesla finally had to start explicitly extracting a 3D model from the net. Similarly, I expect that here it will get pipelined: one model extracts/builds the 3D, and the other is the actual "robot" working in that 3D. Each one can be trained on its own much better and more efficiently, with much better transfer and generalization, than a large monolithic model working from 2D video. In a pipeline approach, it is also very easy to generate synthetic 3D input data that better covers the interesting scenario space for the "robot" model.
And, for example, you can't just, without significant training, feed the large monolithic model a lidar point cloud instead of videos. Whereas in a pipelined approach, you just switch the input model of the 3D-generating pipeline.
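Roughly, the swap described above only requires the two stages to agree on an intermediate 3D representation. An illustrative sketch, with every name mine rather than from any real system:

```python
from typing import Any, Protocol, Sequence

class Scene3D:
    """Common 3D representation both front-ends must produce
    (e.g. a point cloud plus object poses)."""
    def __init__(self, points, objects):
        self.points, self.objects = points, objects

class SceneExtractor(Protocol):
    def extract(self, sensor_data: Any) -> Scene3D: ...

class VideoTo3D:
    def extract(self, frames: Sequence[Any]) -> Scene3D:
        # monocular/multi-view reconstruction would live here
        raise NotImplementedError

class LidarTo3D:
    def extract(self, point_cloud: Any) -> Scene3D:
        # mostly a format conversion; the policy below never notices the swap
        raise NotImplementedError

def act(policy, extractor: SceneExtractor, sensor_data):
    """The "robot" model only ever sees Scene3D, never raw pixels or lidar."""
    return policy(extractor.extract(sensor_data))
```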
Put tiny cams on robot arms and let it control them. They can be flimsy for safety. If it is sure something is happening say nothing, if it is 70-99% sure have it guess what is going on, if <70% have it ask what is going on.
I just wrote a reply to a comment talking about the AI tells this writing has, but it got flagged so my comment disappeared when I hit post. I'll rephrase out of spite:
My first thought upon reading this was that an LLM had been instructed to add a pithy meme joke to each paragraph. They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.
There's also a sense of incoherence in the whole piece. For instance, this section:
"- after: 22 million videos + 1 million images (now we're talking)
they basically hoovered up everything: something-something v2, kinetics, howto100m, and a billion youtube videos"
Was it a billion vids or 22m? It turns out the latter sentence is just rephrasing the list of sources in a cool casual way, and the last one is called YT-Temporal-1B. That's a billion frames of video, not a billion videos.
Also, the author of the blog "Ksagar Atharva" doesn't appear anywhere in the list of authors on the linked FB research paper with Yann LeCun as a co-author. Unless the blog author is using a heavily modified pseudonym.
The research is very real but the blog post appears to be very fake.
>> They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.
Cringely, they are. Nobody who isn't desperate to appear cool would write in that terminally grating register, including when using an LLM to do the writing.
I'm using eigenrobot's (X user) prompt for ChatGPT and the style is very recognizable. Everything lowercase, the tone, zoomer abbreviations, an esoteric style of jokes.
I don’t know, 400k people are listening to the White House streaming lo-fi hip hop on X right now with cutesy videos of Trump on one side and his executive orders streaming on the other at 4am. I think there’s plenty of people quoting doge in 2025.
If you’re in the US, you likely work with them and they have learned to studiously avoid talking about politics except in vagaries to avoid conflict.
they are referring to doge the dog meme, not the government initiative. The meme is much older and wouldn't be considered "cool" to use by the same people who write in the style of the article. Which indicates it was written by an LLM, because usually only things like ChatGPT throw in such cringe, out of date memes in an otherwise obnoxiously 2025 article
There was a clear attempt at the doge meme format, yes:
> very scientific. much engineering.
Emphasis on attempt because you're supposed to use words with grammatically incorrect modifiers, and the first one doesn't. (Even the second one doesn't seem entirely incorrect to me? I'm not a native speaker though.) "many scientific, so engineering" for example would have worked.
I assume they, or most likely their LLM, tried too hard to follow the most popular sequence (very, much, wow) and failed at it.
> some terminally online people do speak in memes, those people aren't quoting doge in 2025.
You may be surprised to find out how incorrect this is.
I can think of two popular conservative sites likely to quote Doge offhand that do this. I read all news in order not to be an insufferable ideologue. So again, off the top of my head: NotTheBee (I think affiliated with BabylonBee (the conservative The Onion)) and Twitchy. Among YouTubers, I think Asmongold, and I'm sure others like Steven Crowder, who himself is in a famous meme.
They are referring to the original doge meme of the dog, not the government initiative today. I guess "quote" isn't really the right word, more like "doing"
Not conservative but I used to love the meme before it was co-opted by musk, so I will occasionally use it as a "haha now you feel OLD" without thinking of its modern connotations.
Also I think it’s somehow important to not let fascism steal our cultural heritage, even if it’s just a meme.
In my country, far-righters are displaying the country's flag everywhere. Now you can't display a French flag without being thought of as a far-right person. That's honestly insufferable.
I know it’s less important with doge but still : before being a crypto it was just a picture of an overly innocent and enthusiastic dog. And even when it became a little crypto, it was totally assumed that it was a meme coin and wasn’t meant for speculation, the idea was that 1DOGE = 1DOGE only and people gifted them to other people who made nice contributions on the internet.
Musk broke all of this when he started to use it for gigantic pumps and dumps using his own visibility on Twitter.
We don’t have to let fascism steal all the popular symbols / memes, because they will steal them anyway.
Hello there! As a fellow gen-z douchebag, the article looks authentic, albeit a bit slim on Discord screencaps. Will be fun(?) to be proven wrong though.
Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
For example, so that you don't crush a human when doing massage (but still need to press hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting the skin itself.
Practically in the near term, it's hard to sample from failure examples with videos on Youtube, such as when food spills out of the pot accidentally. Studying simple tasks through the happy path makes it hard to get the robot to figure out how to do something until it succeeds, which can appear even in relatively simple jobs like shuffling garbage.
With that said, I suppose a robot can be made to practice in real life after learning something from vision.
> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
I'm not sure that's necessarily true for a lot of tasks.
A good way to measure this in your head is this:
"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing, perhaps. Though I suspect that could also be done with just vision.
> It therefore follows that robots should be able to learn with just RGB images too!
I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.
You'd use a two-step approach.
1. First create a model that can evaluate how well a task is going; the YT approach can be used here.
2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.
You're agreeing with the parent btw. You've introduced a lot more than just vision. You introduced interventional experimentation. That's a lot more than just observation
What I describe is an unsupervised system.
What you say ("interventional") sounds like it's human-supervised.
But maybe I'm interpreting it in the wrong way, so please correct me if so.
By "intervention" I mean interacting with the environment. Purpose a hypothesis, test, modify, test. You can frame RL this way though RL usually generates hypotheses that are far too naïve.
This looks like a good brief overview (I only skimmed it but wanted to give you more than "lol, google it") http://smithamilli.com/blog/causal-ladder/
You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.
The claim is that one can learn a world model[0] through vision. The patent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."
[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...
Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.
Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.
I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.
Simple concept, pick up a glass and pour its content into a vertical hole the approximate size of your mouth. Think of all the failure modes that can be triggered in the trivial example you do multiple times a day, to do the same from a single camera feed with no other indicators would take you hours to master and you already are a super intelligent being.
A routine gesture I've done everyday for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no brainer, I open the cabinet with my left hand, take the glass with my right hand, throw the glass from my right hand to the left hand while closing the cabinet with my shoulder. Put the glass under the faucet with left hand, open the faucet with the right hand.
I have done this 3 seconds gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.
And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.
If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.
The point is how much non-vision sensors vs pure vision, helps humans to be humans. Don't you think this point was proven by LLMs already that generalizability doesn't come from multi-modality but by scaling a single modality itself? And jepa is for sure designed to do a better job at that than an LLM. So no doubt about raw scaling + RL boost will kick-in highly predictable & specific robotic movements.
Also, LLMs these days aren't trained on just language
> generalizability doesn't come from multi-modality but by scaling a single modality itself
Could you expand on what you mean by this?
If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.
Except this is the absolutely most common thing humans do, and my argument is that that it will spill water all over but rather that it will shatter numerous glasses, knock them over etc all before it has picked up the glass.
The same process will be repeated many times trying to move the glass to its “face” and then when either variable changes, plastic vs glass, size, shape, location and all bets are off purely because there just plainly is the enough information
counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. i think its really a lot of stuff considering blind people can do the vast majority of things sighted people can do, and i suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input
> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
And where does this intuition come from? It was buily by also feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid. How hot/cold feels, how hard/soft feels, how thing smell. Your mental model of the world is substantially informed by non-visual clues.
> It therefore follows that robots should be able to learn with just RGB images too!
That does not follow at all! It's not how you learned either.
Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.
>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
There are an infinite number of scenes that can be matched to one 2d picture. And what is a scene really? The last time I checked, RGB was not a good way of input in Computer Vision and rather relied on increasing levels of gradients via CNNs to build a compositional scene. None of that is paticularly translatable to how a LM works with text.
Humans have innate knowledge that help them interact with the world and can learn from physical interaction for the rest. RGB images aren't enough.
Video games have shown that we can control pretty darn well characters in virtual worlds where we have not experienced their physics. We just look at a 2D monitor and using a joystick/keyboard we manage to figure it out.
a game has very limited physics. like the buttons you press are pre-tuned to perform certain actions and you arent dealing with continuous nearly infinite possibilities with large ranges of motion, pressure, speed etc. like think about how difficult the game QWOP is because you mostly just have visual feedback
I beg to disagree. I got introduced to brand new (to me) physics of flying airplanes by MS flight simulator. None of the rules I knew in real life applied (gravity matters only sometimes, height can be traded for speed etc). Yet learned how to fly.
And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).
Yeah but we already have a conception of what physics should be prior to that that helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.
I mean they do but we often have generalized (to some degree) world models. So when they do things like change gravity, flip things upside down, or even more egregious changes we can adapt. Because we have contractual counterfactual models. But yeah, they could change things so much that you'd really have to relearn and that could be very very difficult if not impossible (I wonder if anyone has created a playable game with a physics that's impossible for humans to learn, at least without "pen and paper". I think you could do this by putting the game in higher dimensions.)
If the robot already knows "how to" the happy path, the training difficulty falls severely at least if it can continue after a recovery.
The tasks you do to recover from the failure is often different from the happy path. For example, the happy path of dumping garbage is carrying a garbage bag to a collection bin. The non-happy path is that the bin is overflowing and you have to put the bag on the ground, or if the bag leaks and you need to move to a new bag, or if the bag breaks entirely and you have to pick up the trash again.
But yeah, I think a better way to put it is that sampling the happy path would indeed make the failure case easier, but sampling just happy paths is far from sufficient from completing even some of the simplest human tasks with failure.
On humans, you can generally see the force they apply by looking at strain.
The error margins will be huge, and for small enough force (like the skinning part or handling fine mechanical stuff) there's basically almost zero signal.
But actually there's more to this that makes the problem even harder! Lack of sensors is just the beginning. There's well known results in physics that:
This is a real pain point for these vision world models and most people I talk to (including a lot at the recent CVPR) just brush this off as "we're just care if it works." Guess what?! Everyone that is pointing this out also cares that it works! We need to stop these thought terminating cliches. We're fucking scientists.Okay, so why isn't observation enough? It's because you can't differentiate alternative but valid hypotheses. You often have to intervene! We're all familiar with this part. You control variables and modify one or a limited set at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).
We need to mention chaos here, because it's the easiest way to understand this. There are many famous problems that fall into this category, like the double pendulum, the three-body problem, or just fucking gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically. There are probable paths, but no deterministic ones (this same logic is what leads to multiverse theory, btw). But now suppose I was watching the molecules too, and I was continuously recording between t0 and T. Can I predict the trajectories? Well, I don't need to, I just write them down.
Now I hear you, you're saying "Godelski, you observed!" But the problem with this set of problems is that if you don't observe the initial state you can't predict moving forwards, and if you don't have very precise observation intervals you are hit with the same problem. If you turn around while I start a double pendulum, you can have as much time as you want when you turn back around; you won't be able to model its trajectory.
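To make that concrete, here's a toy numerical sketch (my own, not from any paper): two identical double pendulums whose starting angle differs by one billionth of a radian. The gap between their trajectories grows by many orders of magnitude over a few simulated seconds, so a late snapshot tells you essentially nothing about which path was actually taken.

    # Toy demo: two double pendulums, initial angles differ by 1e-9 rad.
    # Prints the growing gap between their angle trajectories.
    import math

    G, L1, L2, M1, M2 = 9.81, 1.0, 1.0, 1.0, 1.0

    def derivs(state):
        t1, w1, t2, w2 = state
        d = t2 - t1
        den1 = (M1 + M2) * L1 - M2 * L1 * math.cos(d) ** 2
        dw1 = (M2 * L1 * w1**2 * math.sin(d) * math.cos(d)
               + M2 * G * math.sin(t2) * math.cos(d)
               + M2 * L2 * w2**2 * math.sin(d)
               - (M1 + M2) * G * math.sin(t1)) / den1
        den2 = (L2 / L1) * den1
        dw2 = (-M2 * L2 * w2**2 * math.sin(d) * math.cos(d)
               + (M1 + M2) * (G * math.sin(t1) * math.cos(d)
                              - L1 * w1**2 * math.sin(d)
                              - G * math.sin(t2))) / den2
        return [w1, dw1, w2, dw2]

    def step(state, dt=1e-3):
        # classic RK4 integrator
        def add(s, k, h):
            return [si + h * ki for si, ki in zip(s, k)]
        k1 = derivs(state)
        k2 = derivs(add(state, k1, dt / 2))
        k3 = derivs(add(state, k2, dt / 2))
        k4 = derivs(add(state, k3, dt))
        return [s + dt / 6 * (a + 2 * b + 2 * c + e)
                for s, a, b, c, e in zip(state, k1, k2, k3, k4)]

    a = [math.radians(120), 0.0, math.radians(-10), 0.0]
    b = [math.radians(120) + 1e-9, 0.0, math.radians(-10), 0.0]  # tiny perturbation

    for t_ms in range(1, 20001):          # 20 simulated seconds
        a, b = step(a), step(b)
        if t_ms % 5000 == 0:
            gap = abs(a[0] - b[0])
            print(f"t={t_ms/1000:4.1f}s  angle gap = {gap:.2e} rad")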
But it gets worse still. There are confounding variables. There is coupling. There are hypotheses that are difficult to differentiate via causal ordering. And so, so much more. If you ever wonder why physicists do so much math, it's because doing that is a fuck-ton easier than running the whole battery of tests and then reverse-engineering the equations from those observations. But in physics we care about counterfactual statements. In F = ma we can propose new masses and new accelerations and rederive the results. That's what it is all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real-world environments. You have to be able to ask and answer "what happens if that kid runs into the street?"
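And here's an equally toy example of the confounder problem (again my own sketch, with made-up variables, nothing to do with the paper): two variables that look tightly coupled under passive observation, yet intervening on one does nothing to the other. No amount of watching distinguishes the two hypotheses; you have to act.

    # A hidden confounder Z drives both X and Y. Passive observation shows a
    # strong X-Y correlation, so a purely observational "world model" predicts
    # that pushing X changes Y. Intervening (do(X)) reveals there is no effect.
    import random

    random.seed(0)

    def observe(n=100_000):
        xs, ys = [], []
        for _ in range(n):
            z = random.gauss(0, 1)            # hidden confounder
            x = z + random.gauss(0, 0.1)      # X is mostly Z plus noise
            y = z + random.gauss(0, 0.1)      # Y is also mostly Z plus noise
            xs.append(x); ys.append(y)
        return xs, ys

    def intervene(x_value, n=100_000):
        # do(X = x_value): we force X ourselves. X has no causal effect on Y,
        # so Y is still driven only by Z and x_value never enters the equation.
        return [random.gauss(0, 1) + random.gauss(0, 0.1) for _ in range(n)]

    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        vx = sum((x - mx) ** 2 for x in xs) / n
        vy = sum((y - my) ** 2 for y in ys) / n
        return cov / (vx * vy) ** 0.5

    xs, ys = observe()
    print("observational corr(X, Y):", round(corr(xs, ys), 3))   # close to 1
    print("mean Y under do(X=-2):", round(sum(intervene(-2.0)) / 100_000, 3))  # ~0
    print("mean Y under do(X=+2):", round(sum(intervene(+2.0)) / 100_000, 3))  # ~0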
I highly suggest people read The Relativity of Wrong [0]. It's a short essay by Isaac Asimov that can serve as a decent intro, though far from complete. I'm suggesting it because I don't want people to confuse "need a counterfactual model" with "need the right answer." If you don't get into metaphysics, these results will be baffling. [1] It is also needed to answer any confusion you might have around the aforementioned distinction.
Tldr:
There's a lot of complexity and depth that is easy to miss amid the excitement, but it still matters. I'm just touching the surface here too, and we're just talking about mechanics. No quantum needed, just information loss.
[0] https://hermiene.net/essays-trans/relativity_of_wrong.html
[1] maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...
I didn't understand a single word of this post or what was supposed to have been solved, and had to stop reading.
Was this actually written by a human being? If so, the author(s) suffer from severe communication problems. It doesn't seem grounded in reality, or at least not in my personal experience with robotics. But here's my real-world take:
Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.
I seriously urge the authors to use ROS/ROS2. Show us: implement your solution with ROS, push it to a repository, and allow others to verify what you solved, maybe? Suffer a bit with the framework and then write a real post about hands-on, real robotics, and not just wander through fancy incomprehensible stuff that probably no one will ever do.
Then we can maybe start talking about robotics.
It is readily understandable if you are fluent in the jargon surrounding state of the art LLMs and deep learning. It’s completely inscrutable if you aren’t. The article is also very high level and disconnected from specifics. You can skip to FAIR’s paper and code (linked at the article’s end) for specifics: https://github.com/facebookresearch/vjepa2
If I had to guess, it seems likely that there will be a serious cultural disconnect as 20-something deep learning researchers increasingly move into robotics, not unlike the cultural disconnect that happened in natural language processing in the 2010s and early 20s. Probably lots of interesting developments, and also lots of youngsters excitedly reinventing things that were solved decades ago.
Also, it definitely doesn't help that they remove all capitalization except in nouns.
I totally agree with you. On the other hand, the theory behind it (combining image recognition with predicting outcomes of specific physical interactions) does sound intriguing, and like a somewhat newer idea.
But besides that, you're totally right. It's too "loose", since to realize that idea the process would have to be way different (and properly explained).
You cannot develop accurate world models through observation alone, full stop.
You cannot verify accurate world models through benchmarks alone, full stop.
These have been pain points in physics for centuries, and were major pain points even before the quantum revolution. I mean, if it were possible, we'd have solved physics long ago. You can find plenty of people going back thousands of years boldly claiming "there is nothing new to be learned in physics," yet it was never true and still isn't, even if we exclude quantum and relativity.
Side note: really, the paper is "fine", but I wish we didn't put so much hype in academic writing. Papers should be aimed at other academics and not be advertisements (use the paper to write advertisements like IFLS or Quanta Magazine, but don't degrade the already difficult researcher-to-researcher communication). So I'm saying the experiments are fine and the work represents progress, but it is oversold and the conclusions do not necessarily follow.
Btw, the paper makes these mistakes too. It makes a very bold assumption that counterfactual models (aka a "world model") are learned. This cannot be demonstrated through benchmarking, it must be proven through interpretability.
Unfortunately, the tail is long and heavy... you don't need black swan events to disrupt these models, and boy does this annoying fact make it easy to "hack" these types of models. And frankly, I don't think we want robots operating in the wild (public spaces, as opposed to controlled spaces like a manufacturing floor) if I can make one think an iPhone is an apple with just a sticky note. Sure, you can solve that precise example, but it's not hard to come up with others. It's a cat-and-mouse game, but remember, Jerry always wins.
It's not a scholarly article but a blog post, but you're still right to be frustrated at the very bad writing. I do get the jargon, despite myself, so I can translate: the authors of the blog post claim that machine learning for autonomous robotics is "solved" thanks to an instance of V-JEPA 2 trained on a huge number of YouTube videos. It isn't, of course, and the authors themselves point out the severe limitations of the otherwise promising approach (championed by Yann LeCun) when they say, in a notably more subdued manner:
>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.
>> long-horizon drift
>> try to plan more than a few steps ahead and the model starts hallucinating.
That is to say, not quite ready for the real world, V-JEPA 2 is.
But for those who don't get the jargon there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
https://arxiv.org/abs/2506.09985
In other words, some interesting results, some new SOTA, some incremental work. But it's a lot of work by a big team of a couple dozen researchers, so there's good stuff in there almost inevitably.
"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.
I definitely saw somebody at Actuate last year talking about supplementing training videos for VLA with Youtube, but I think they actually found that "any" video of the real world helped give a better physics "understanding" to the model.
This article contains so many falsehoods and history rewrites that it's pretty painful to read.
This was a bit hard to read. It would be good to have a narrative structure and more clear explanation of concepts.
> This was a bit hard to read.
This writing style is prominent on Twitter and in niche Discords. It's funny how easily I've come to cut right through it, but if you haven't seen much of it, it's really hard to parse. That's by design, too. The vibe of this writing style is to project an air of confidence so strong that the author doesn't care whether you get it or not. It's a sort of humblebrag where the writing is supposed to flex the author's understanding of the subject.
As others have already covered, there's also some heavy stretching of the truth and rewriting of history going on in this post. That's also common of the extreme bravado in this style of semi-impenetrable writing: The vagueness and ambiguities allow the author to make grandiose claims but then wiggle out of them later if someone is astute enough to catch on.
For example: The blog post is written as “We…” but is the author part of the team? Or is he using “we” meaning society in general?
What's the point in writing something while "not caring" if the reader understands or not? Seems like a false confidence or false bravado to me; it reads like an attempt to project an impression, and not really an attempt to communicate.
Basically: If you understand the topic well, you’re not the target audience.
This is a type of information arbitrage where someone samples something intellectual without fully understanding it, then writes about it for a less technical audience. Their goal is to appear to be the expert on the topic, which translates into clout, social media follows, and eventually they hope job opportunities.
The primary goal of the writing isn’t to get you to understand the topic clearly, because that would diminish the sense that the author is more knowledgeable than you. The goal is to sound guru-like while making the topic feel impenetrably complex for you, while appearing playfully casual for the author.
I guess "bullshitting as a career" isn't going away any time soon.
This style of writing is very effective at convincing people in their impressionable years of a narrative or viewpoint, often one that is hard to defend with more traditional writing styles.
I hope I'm wrong, but this looks like an effort to normalize such writing style. As this happens, intelligent discourse and rhetoric become harder.
Very intentional. Their response would be: “if you need narrative structure and clear explanation of concepts, yngmi”.
And the answer to that would be: WNGTI.
https://www.youtube.com/watch?v=4xmckWVPRaI
Capitalia tantum. ("Capitals only.")
It would also be good if the perspective of the article stayed put. This "we" and "they" thing was at best confusing and at worst possibly a way to get more clicks or to pretend the author had something to do with the work.
I was unable to make it through the article (now we're talking).
No you didn't, and I don't even need to click on the link to know it
IMO, VideoMimic is a better proof-of-concept
https://www.videomimic.net/
https://www.videomimic.net/page1.html
Looks like it was trained on Shaolin Drunken Fist videos. Does it look drunk because of the videos, or because of a discrepancy between the videos and its failure to account for gravity and physics in general?
My guess would be lack of actuators. For instance, this robot looks like it has an ankle that can only go up and down, but not roll like a human's. Also, I wonder if there's a center of gravity issue, as it almost always appears to be leaning backwards to even out.
I think it's still pretty impressive in its recoveries, even though there's an unnaturally large number of them necessary. About 8 seconds into the video on the homepage, it almost misses and ends up slipping off the second step. I've eaten shit at missing a couple inch curb, though I don't think "graceful" has ever been used as a descriptor for me. So the fact that it just recovers and keeps going without issue is impressive to me.
> So the fact that it just recovers and keeps going without issue is impressive to me.
I'm pretty sure that's just a matter of reaction speed, and of it maintaining a constant focus/vigilance on its movement that you'd usually not reserve for anything outside of some sports, or situations pre-identified as dangerous enough to deserve the attention, like concentrating on balance and not getting into a position that overstresses your joints when you know it's icy.
My mom said I was throwing away my life watching YouTube all day and clearly I just haven’t been watching YouTube enough. 1 million YouTube videos here I come!
I do not know and do not care much about robotics per se, but I wish LLMs were better at spatial reasoning. If the new insight helps with that - great!
I dabbled a bit in geolocation with LLMs recently. It is still surprising to me how good they are at finding the general area where a picture was taken. Give one a photo of a random street corner anywhere on earth and it will likely not only tell you the correct city or town but most often even the correct quarter.
On the other hand, if you ask it for a birds eye view of a green, a brown and a white house on the north side of a one-way street (running west to east) east of an intersection running north to south, it may or may not get it right. If you want it to add an arrow going in the direction of the one-way street, it certainly has no clue at all and the result is 50/50.
Extremely oversold article.
> the core insight: predict in representation space, not pixels
We've been doing this since 2014? Not only that, others have been doing it at a similar scale. e.g. Nvidia's world foundation models (although those are generative).
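For anyone who hasn't seen it, the idea itself fits in a few lines. Here's a hedged toy sketch of latent-space prediction with invented module sizes, not FAIR's actual V-JEPA code (the real thing uses masking, an EMA target encoder, and a far bigger video backbone):

    # Rough sketch of "predict in representation space, not pixels" (toy only).
    # Instead of regressing future pixels, encode the future frame and regress
    # its embedding from the embedding of the past frames.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D = 256                                   # made-up embedding size

    encoder = nn.Sequential(                  # stand-in for a video encoder
        nn.Flatten(), nn.Linear(3 * 64 * 64, D), nn.ReLU(), nn.Linear(D, D))
    predictor = nn.Sequential(                # predicts the next embedding
        nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

    past_frame = torch.randn(8, 3, 64, 64)    # fake batch of context frames
    future_frame = torch.randn(8, 3, 64, 64)  # fake batch of target frames

    z_past = encoder(past_frame)
    with torch.no_grad():                     # target embedding, no grad through it
        z_future = encoder(future_frame)

    loss = F.mse_loss(predictor(z_past), z_future)   # loss lives in latent space
    loss.backward()
    print("latent prediction loss:", float(loss))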
> zero-shot generalization (aka the money shot)
This is easily beaten by flow-matching imitation learning models like what Pi has.
> accidentally solved robotics
They're doing 65% success on very simple tasks.
The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source.
Solved??? Where?
Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these. Then they will have "solved" part of robotics.
>> Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these.
Someone's getting peckish :P
Someone watched 'Devs' ?
if you havent - highly recommended.
Not sure why people love this show. Really terrible writing.
Love Alex Garland but the characters ruin the show.
Do you have a link or a less generic search term?
It’s a TV show made by Alex Garland https://m.imdb.com/title/tt8134186/ It’s pretty good sci-fi IMHO
[flagged]
Do we have a “let me ChatGPT that for you..” site yet?
Solving robotics is some claim.
Spoiler: not solved
Betteridge not only applies to headlines with questions but it also works quite well with Twitter style headlines.
Indeed, the robotics edge-case problem space complexity balloons far faster than most assume.
Physics informed training is a real methodology (simple introduction to the subject: https://www.youtube.com/@Eigensteve/videos ).
However, the slop article is 80% nonsense. =3
Just with the hand, there are 50 things you simply can't do unless you have a certain feel. Handling glass? Oh, it's greasy, now your rubber grip is screwed, now go wash it off and dry it to start again.
Indeed, there is a whole category of Adaptive Compliant Grippers for specific use-cases like handling delicate brittle objects.
Things that randomly change shape or appearance are also very difficult to interact with safely. The force sensing platform from Universal Robots is safer for users, but it has limitations like all platforms. =3
Dr Fei-Fei Li talks about this as the LWM (Large World Model) during this interview: https://youtu.be/fQGu016AlVo and with https://www.worldlabs.ai/
So video gen models can basically be extrapolated to control robotics? How long until Veo3 robots take over?
> right now, you need to show the robot pictures of what you want it to do. want it to "clean the kitchen"? better have a photo of a clean kitchen handy.
What about using Flux Kontext (or Controlnets) to turn the messy kitchen into a clean kitchen?
Sure thing, let me just put the fridge in the washing machine.
Does YouTube allow massive scraping like this in their ToS?
Per HiQ vs. LinkedIn, it doesn't matter what their ToS says if the scraper didn't have to agree to the ToS to scrape the data. YouTube will serve videos to someone who isn't logged in. So if you've never agreed to YouTube's ToS, you can scrape the videos. If YT forced everyone to log in before they could watch a video, then anyone who wants to scrape videos would have had to agree to the ToS at some point.
It won't serve me videos if I'm not logged in. It tells me to sign in to prove I'm not a bot. How do these people get around this?
It does for me, in USA on AT&T Fiber using Safari in private browsing mode. Chrome in incognito as well. And phone on mobile YouTube (though I didn't test with uninstalling/reinstalling to reset IDFA and IDFV, so it's not really a valid test)
My "lawyer" (gpt4o) claims that since YouTube is merely a non-exclusive licensee of the user content upload to their service, even if they have such restrictions in their ToS (they do), they likely would not hold up in court, citing [0]. Something about that non-exclusivity meaning they cannot constrain the copyright further on their own terms. Which I guess makes sense?
And since scraping of publicly available data is not illegal (in the US, according to the aforementioned "lawyer"), it seems like it's okay?
Not legal advice.
[0] https://www.skadden.com/insights/publications/2024/05/distri...
I don't think they can legally prevent it
They don't, and neither do I allow my site - whose content I found in Gemini anyway - to be scraped.
Probably not.
Who cares at this point? No one is stopping ML datasets from being primarily pirated. The current power is effectively dismantling copyright for AI-related work.
> Who cares at this point
Anyone who has a shred of integrity. I'm not a fan of overreaching copyright laws, but they've been strictly enforced for years now. Decades, even. They've ruined many lives, like how they killed Aaron Swartz.
But now, suddenly, violating copyright is totally okay and carries no consequences whatsoever because the billionaires decided that's how they can get richer now?
If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all, these developments should concern you.
> … like how they killed Aaron Swartz.
I can’t imagine why you’d let the FBI off the hook
Aaron Swartz died of suicide, not copyright.
His death was a tragedy but it wasn't done to him.
There's an English phrase "hounded to death", meaning that someone was pursued and hassled until they died. It doesn't specify the cause of death, but I think the assumption would be suicide, since you can't actually die of fatigue.
I think that's what was done to Aaron Swartz.
Many people have dealt with the law, with copyright infringement, even with gross amounts of it, and had the book thrown at them, and survived the experience.
Swartz was ill. It is a tragedy he did not survive the experience, and indeed, trial is very stressful. But he was no more hounded than any defendant who comes under federal scrutiny and has to defend themselves in a court of law via the trial system. Kevin Mitnick spent a year in prison (first incarceration) and survived it. Swartz was offered six months and committed suicide.
I don't know how much we should change of the system to protect the Aaron Swartzs of the world; that's the mother of all Chesterton's Fences.
Many people get (for example) pneumonia and recover. Some people get pneumonia and die. The people who died of pneumonia died because of pneumonia. The fact that other people survived it doesn't mean that they didn't die of it.
Saying that we should not work on cures for pneumonia because it's a Chesterton Fence is obviously, blatantly, illogical. Saying that we should change the system so that government officials working for moneyed interests can't hound someone to death is similarly illogical.
Pneumonia doesn't have any societal benefit. The process by which we decide whether the law was broken and punishment is necessary has obvious benefit. If you mean we should seek a cure for dangerous suicidal depression, I agree. But you surely are not suggesting that, had Swartz been accused of, say, embezzlement, the state should have dropped the charges purely because he was a suicide risk; how would that be just to the people who were stolen from?
And it's a point of semantics, but no; we generally don't say people who died by suicide died by the things going on in their life when they ended it. Everybody has stressors. The suicidal also have mental illness. Mr. Swartz had self-documented his past suicidal ideation.
Maybe someone should throw you in prison for a year on some BS made-up charges to see how well you survive it. We can use it as a data point for your argument.
Crimes generally don't kill the criminal. It's the reaction by authorities that kills (perceived) criminals.
This is true. In general, the harm done by crime is directed outwards from the perpetrator, not inwards to the perpetrator. In fact, the behaviors that only cause self-harm that we criminalize are relatively few.
> If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all
Can't even pretend anymore, this season jumped the shark
> The current power is effectively dismantling copyright for AI related work.
Out of the loop apparently, could you elaborate? By "the current power" I take you mean the current US administration?
Trump fired the head of the copyright office:
https://www.heise.de/en/news/After-criticism-of-AI-training-...
The "Big Beautiful Bill" contains a clause that prohibits state "AI" legislation.
Trump has a "Crypto and AI czar" who is very active in promoting "AI" on his YouTube propaganda outlet. The same czar also promoted, pre-election of course, accelerated peace with Russia and then stopped talking about the subject altogether.
Oh wow okay, genuinely missed these. Thanks.
What ToS
https://www.youtube.com/static?template=terms ?
Friendly unit conversion man at your service: 114 years.
How much is that in football fields?
If you accept 30 years as the average lifespan of an NFL stadium, 3.8.
Good catch. Approximately 9,192,631 Turkish decibels.
Fun fact: the International Bureau of Weights and Measures in Paris is the owner of a perfect 0 dB noise floor enclosed in a perfect titanium sphere (with some sheep's wool filling to avoid reflections). There is a small door on the side over which microphone capsules can be inserted for calibration.
(/joke)
Too bad the joke doesn't work if you understand decibels.
So half a Zoom meeting... or a third of a Teams one.
I genuinely wish there was a cost estimation feature built into them. It doesn't even have to be remotely close to the true cost; if it's anything like the meetings I attend, there will be enough people and it will go on long enough to make up for it.
I worked as a consultant and started billing at my normal hourly rate for meetings. You would be surprised how fast the company's desire for my participation in them decreased.
Why would you do anything but that? If you want to just chat with me forever, the rate is the rate.
i thought all the cool data driven robotics stuff was like reinforcement learning from sensors that track moving effectors in the real world with online retraining that mimics the sensorimotor experimentation that is observed during the developmental phases of real neurobiological systems?
so you just kinda let it run for a while and it bumps and squirms around until it stands up or whatever.
seems also the future for real ai?
I don't know. I'm not an expert, but if you've ever tried to do a backflip, or anything where your toes are above your head, then you'll know that spatial awareness goes well beyond vision. Or if you throw a frisbee for a dog to catch, they don't actually look at it while running; they look, predict the position, then move in. Veni, vidi, vici. So any model that "learns physics" just through vision seems flawed from the start. What's your thought there?
This is interesting for generalized problems ("make me a sandwich") but not useful for most real world functions ("perform x within y space at z cost/speed"). I think the number of people on the humanoid bandwagon trying to implement generalized applications is staggering right now. The physics tells you they will never be as fast as purpose-built devices, nor as small, nor as cheap. That's not to say there's zero value there, but really we're - uh - grasping at straws...
The value is in the generalisation.
For a single example, in any factory watch how humans are added as ad-hoc machines wherever a problem occurs. Machine N outputting faster than machine N+1 can accept? Have a human stack, and destack, the product between them. No matter the size, the shape, or, within reason, the weight of the product. But most importantly: the process can begin within seconds of the problem occurring. No need for a programmer, developer, or maintenance worker to get involved. Just a clear order from the shift manager.
A general purpose robot with physical interfaces similar to a human would be very valuable for such environments. If it had the software to be as easy to instruct as a human.
I wonder if a generalized machine would have an advantage from scale, with all the specialized stuff then put into software. We have seen this play out before.
analogy: a CPU is more expensive, more complicated, more energy demanding than custom made circuitry, in most cases.
As the vendor you can sell it with the promise that awesomeness is coming "just around the corner" with the next software update.
You can also seek investment without committing to an actual concrete business model.
Well, there's a middle ground, kinda: using more specialized hardware (e.g., cobots) but deploying state-of-the-art Physical AI (ML/computer vision) on it. We're building one such startup at ko-br (https://ko-br.com/) :))
Quite a few startups in your space. Many deployed with customers. Good luck finding a USP!
Very good point! This area faces a misalignment of goals similar to the one rampant with today's LLMs, in that it tries to be a generic fit-all solution.
We made a sandwich, but it cost 10x more than a human would and was slower. Slower might gradually become faster and more efficient, but by the time you get really good at it, it's simply not transferable unless the model is genuinely able to make the leap across into other domains the way humans naturally do.
I'm afraid this is where the barrier between general intelligence and human intelligence lies. With enough of these geospatial motor-skill databases, we might get something that mimics humans very well but still runs into problems at the edges, and this last-mile problem really is a hindrance in so many domains where we come close but never complete.
I wonder if this will change with some sort of shift in computing, as well as in how we interface with digital systems (without mouse or keyboard); that might be able to close that 'last mile' gap.
Note that the username here is a Korean derogatory term for Chinese people.
It's an interesting comment; it has the same "compliment the OP, elaborate, raise a further question" format I've seen used by apparently LLM-generated spam accounts on HN. But the second paragraph is so incoherently structured that I have a hard time thinking an LLM produced it.
https://news.ycombinator.com/item?id=44073183
I wonder how much language this model understands. If we pan across text, will it fill in a sensible next word? How good will it be?
>camera pose sensitivity
>the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
Reminds me that several years ago Tesla finally had to start explicitly extracting a 3D model from the net. Similarly, I expect here that it would get pipelined: one model extracts/builds the 3D scene, and the other is the actual "robot" working in that 3D scene. Each one alone can be trained much better and more efficiently, with much better transfer and generalization, than a large monolithic model working from 2D video. In a pipelined approach, it is also very easy to generate synthetic 3D input data that better covers the interesting scenario space for the "robot" model.
And, for example, you can't just, without significant training, feed the large monolithic model a lidar point cloud instead of videos. Whereas in a pipelined approach, you just swap the input model of the 3D-generating stage.
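Roughly something like this hypothetical interface (all names invented for illustration): the policy only ever sees the 3D scene, so swapping video for lidar only touches the front end.

    # Hypothetical sketch of the pipelined idea: a front end turns raw sensing
    # into a 3D scene, a separate policy acts on the 3D scene.
    from dataclasses import dataclass
    from typing import List, Protocol


    @dataclass
    class Scene3D:
        # minimal stand-in for whatever 3D representation you settle on
        points: List[tuple]          # (x, y, z) world-frame points
        objects: List[str]           # detected object labels


    class SceneExtractor(Protocol):
        def extract(self, raw) -> Scene3D: ...


    class VideoTo3D:
        def extract(self, raw_frames) -> Scene3D:
            # placeholder: a real version would run multi-view / monocular 3D lifting
            return Scene3D(points=[(0.0, 0.0, 0.0)], objects=["mug"])


    class LidarTo3D:
        def extract(self, point_cloud) -> Scene3D:
            # placeholder: a real version would segment + label the point cloud
            return Scene3D(points=list(point_cloud), objects=["mug"])


    class Policy3D:
        def act(self, scene: Scene3D) -> str:
            # the "robot" model only ever sees Scene3D, never pixels or lidar
            return f"reach toward {scene.objects[0]}"


    def run(frontend: SceneExtractor, raw) -> str:
        return Policy3D().act(frontend.extract(raw))


    print(run(VideoTo3D(), None))
    print(run(LidarTo3D(), [(1.0, 2.0, 0.5)]))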
Put tiny cams on the robot arms and let it control them. They can be flimsy, for safety. If it is sure about what is happening, say nothing; if it is 70-99% sure, have it guess what is going on; if it's below 70% sure, have it ask what is going on.
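The decision rule might look something like this sketch (the 70%/99% cut-offs come from the comment above; the function and event names are made up):

    # Hypothetical sketch of the "speak up based on confidence" rule above.
    from typing import Optional

    def report(event: str, confidence: float) -> Optional[str]:
        if confidence >= 0.99:
            return None                              # sure: say nothing
        if confidence >= 0.70:
            return f"I think {event} is happening."  # 70-99%: guess out loud
        return "What is going on here?"              # <70%: ask for help

    for c in (0.995, 0.85, 0.40):
        print(c, "->", report("the gripper slipping", c))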
I just wrote a reply to a comment talking about the AI tells this writing has, but it got flagged so my comment disappeared when I hit post. I'll rephrase out of spite:
My first thought upon reading this was that an LLM had been instructed to add a pithy meme joke to each paragraph. They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.
There's also a sense of incoherence in the whole piece. For instance, this section:
"- after: 22 million videos + 1 million images (now we're talking)
they basically hoovered up everything: something-something v2, kinetics, howto100m, and a billion youtube videos"
Was it a billion vids or 22m? It turns out the latter sentence is just rephrasing the list of sources in a cool casual way, and the last one is called YT-Temporal-1B. That's a billion frames of video, not a billion videos.
Also, the author of the blog "Ksagar Atharva" doesn't appear anywhere in the list of authors on the linked FB research paper with Yann LeCun as a co-author. Unless the blog author is using a heavily modified pseudonym.
The research is very real but the blog post appears to be very fake.
It's someone explaining the research as a blog essay, right? Which is very commonly done. We = humanity.
Exactly. It's very obvious what "we" is referring to here.
Yeah, obviously LLM written. They tried to be unique by removing capitals.
>> They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.
Cringely, they are. Nobody who isn't desperate to appear cool would write in that terminally grating register, including when using an LLM to do the writing.
I'm using eigenrobot's (X user) prompt for ChatGPT and the style is very recognizable. Everything lowercase, the tone, zoomer abbreviations, esoteric style of jokes.
yup
I don’t know, 400k people are listening to the White House streaming lo-fi hip hop on X right now with cutesy videos of Trump on one side and his executive orders streaming on the other at 4am. I think there’s plenty of people quoting doge in 2025.
If you’re in the US, you likely work with them and they have learned to studiously avoid talking about politics except in vagaries to avoid conflict.
they are referring to doge the dog meme, not the government initiative. The meme is much older and wouldn't be considered "cool" to use by the same people who write in the style of the article. Which indicates it was written by an LLM, because usually only things like ChatGPT throw such cringe, out-of-date memes into an otherwise obnoxiously 2025 article.
>those people aren't quoting doge in 2025
Could you explain what this means? Is this article quoting doge?
There was a clear attempt at the doge meme format, yes:
> very scientific. much engineering.
Emphasis on attempt because you're supposed to use words with grammatically incorrect modifiers, and the first one doesn't. (Even the second one doesn't seem entirely incorrect to me? I'm not a native speaker though.) "many scientific, so engineering" for example would have worked.
I assume they, or most likely their LLM, tried too hard to follow the most popular sequence (very, much, wow) and failed at it.
"Much engineering was required" Archaic but still used a bit in articles or to give a certain vibe.
You'd think it would be easy to write "very engineering, much scientific". LLMs work in mysterious ways.
> some terminally online people do speak in memes, those people aren't quoting doge in 2025.
You may be surprised to find out how incorrect this is.
Offhand, I can think of two popular conservative sites likely to quote doge that do this. I read all news in order not to be an insufferable ideologue. So again, off the top of my head: NotTheBee (I think affiliated with the Babylon Bee (a conservative The Onion)) and Twitchy. Among YouTubers, I think Asmongold, and I'm sure others like Steven Crowder, who is himself in a famous meme.
That said… yea, you are probably right.
They are referring to the original doge meme of the dog, not the government initiative today. I guess "quote" isn't really the right word, more like "doing"
A recurring mistake in this thread. I blame Elon Musk and his boomer humor.
Aren't those sites primarily Russian bots tho?
Isn’t that just a synonym for conservative?
Not conservative, but I used to love the meme before it was co-opted by Musk, so I will occasionally use it as a "haha now you feel OLD" without thinking of its modern connotations.
Also I think it’s somehow important to not let fascism steal our cultural heritage, even if it’s just a meme.
In my country, the far right is displaying the country's flag everywhere. Now you can't display a French flag without being thought of as a far-right person. That's honestly insufferable.
I know it's less important with doge, but still: before being a crypto it was just a picture of an overly innocent and enthusiastic dog. And even when it became a crypto, it was totally assumed that it was a meme coin and wasn't meant for speculation; the idea was that 1 DOGE = 1 DOGE only, and people gifted them to other people who made nice contributions on the internet.
Musk broke all of this when he started using it to do gigantic pump-and-dumps, exploiting his own visibility on Twitter.
We don’t have to let fascism steal all the popular symbols / memes, because they will steal them anyway.
Let's see you try to recover the swastika from fascism ;)
[dead]
[flagged]
[flagged]
I have never seen "ngmi" before, I wonder in which subculture it is common
It's the second most common four-letter acronym in crypto hype threads right after hfsp.
The Urban Dictionary definition is hilarious, opens with "HFSP is an acronym used typically in the crypto community against non-belivers".
Hasn't defined the term yet and I know I'm in for a hell of a ride.
Saw it budding a lot in Ivy League hacker subculture 15 years ago when I was there.
very popular on tech twitter. Right up there with "we're back" and "we're so back"
not sure, but my college friend group uses it occasionally
> gen z douchebag
Hello there! As a fellow gen-z douchebag, the article looks authentic, albeit a bit slim on Discord screencaps. Will be fun(?) to be proven wrong though.