VGGT: Visual Geometry Grounded Transformer
31 comments
· March 25, 2025
Workaccount2
bhouston
You are the hero! Thank you! The main post link should be updated to this.
w-m
I read the paper yesterday and would recommend it. Kudos to the authors for getting to these results, and also for presenting them in a polished way. It's nice to follow the arguments about the alternating attention (global attention across all tokens vs. attention only among the tokens of each camera), the normalization (normalizing the scene scale in the data, vs. DUSt3R, which normalizes in the network), and the tokens (image tokens from DINOv2 + camera tokens + additional register tokens, with the first camera handled differently since it becomes the frame of reference). The results are amazing, and fine-tuning this model will be fun, e.g. for feed-forward 3DGS reconstruction.
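As a rough mental model of that alternating scheme (just my sketch, not the authors' code; the shapes, class name, and plain nn.MultiheadAttention blocks are my own assumptions):

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Toy sketch: alternate frame-wise and global self-attention over
    tokens shaped (batch, frames, tokens_per_frame, dim)."""
    def __init__(self, dim: int, heads: int = 8, depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, n, d = x.shape
        for i, attn in enumerate(self.layers):
            if i % 2 == 0:
                # frame-wise: tokens only attend within their own camera/frame
                t = x.reshape(b * f, n, d)
            else:
                # global: tokens attend across all frames at once
                t = x.reshape(b, f * n, d)
            out, _ = attn(t, t, t)
            x = x + out.reshape(b, f, n, d)  # residual connection
        return x

# Illustrative usage: 4 frames, 256 tokens per frame, 64-dim tokens.
tokens = torch.randn(1, 4, 256, 64)
out = AlternatingAttention(dim=64)(tokens)
```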
I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so it feels like another hit for The Bitter Lesson in the end: They used a giant bunch of [data], a year and a half of GPU time to [train] the final model, and created a model with a billion parameters that outperforms all specialized previous models.
Or in the words of the authors, from the paper:
> We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.
Fantastic to have this. But it feels... yes, somewhat bitter.
[The Bitter Lesson]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (often discussed on HN)
[data]: "Co3Dv2 [88], BlendMVS [146], DL3DV [69], MegaDepth [64], Kubric [41], WildRGB [135], ScanNet [18], HyperSim [89], Mapillary [71], Habitat [107], Replica [104], MVS-Synth [50], PointOdyssey [159], Virtual KITTI [7], Aria Synthetic Environments [82], Aria Digital Twin [82], and a synthetic dataset of artist-created assets similar to Objaverse [20]."
[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering
dleeftink
Doesn't the bitter lesson take the argument a bit too far by opposing search/learn to heuristics? Is the former not dependent on breakthroughs in the latter?
CooCooCaCha
The bitter lesson is the opposite. It argues that hand-crafted heuristics will eventually get beaten by more general learning algorithms that can take advantage of computing power.
porphyra
Indeed, even in "classical" chess engines like Stockfish, which previously relied on handcrafted heuristics at leaf nodes, the NNUE evaluation [1] has greatly outperformed the handcrafted one in recent years. Note that this is a completely different approach from the one AlphaZero takes, and modern Stockfish is significantly stronger than AlphaZero.
[1] https://stockfishchess.org/blog/2020/introducing-nnue-evalua...
dleeftink
> eventually get beaten
Brute forcing is bound to find paths beyond heuristics. What I'm getting at is that the path needs to be established first before it can be beaten, which is why I'm wondering whether one isn't an extension of the other rather than an opposing strategy.
I.e. search and heuristics both have a time and place, not so much a bitter lesson but a common filter for a next iteration to pass through.
sgnelson
I really wish someone would take this and combine it with true photogrammetry, using it to supplement rather than replace the traditional pipeline.
That would be the killer app for phone-based 3D scanners: you don't need a perfect scan, because this fills in the holes for you.
bhouston
I'm a little suspicious of many of the outdoor examples given, though. They are of famous places that are likely in the training set:
- Egyptian pyramids
- Roman Colosseum
These are the most iconic and most photographed things in the world.
That said, there are other examples that are more novel. I am just going to focus on those to judge its quality.
ed
It’s worth trying the demo - I uploaded a low-quality video of an indoor space and got decent results.
kfarr
Use the Hugging Face Space with your own data; it's very good and outputs a GLB: https://huggingface.co/spaces/facebook/vggt
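If you want to poke at that GLB programmatically, something like this works (a quick sketch with trimesh; the filename is just an example, and the exact geometry contents depend on the demo's export options):

```python
import trimesh

# Load the GLB exported from the demo (example filename).
scene = trimesh.load("vggt_output.glb", force="scene")

# List what's inside; the export typically contains the reconstructed
# point cloud, possibly alongside camera frustum geometry.
for name, geom in scene.geometry.items():
    print(name, type(geom).__name__, len(geom.vertices), "vertices")
```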
davedx
I'd love to hear what the use cases are for this. I was looking at Planet's website yesterday and although the technology is fascinating, I do sometimes struggle to understand what people actually do (commercially or otherwise) with the data? (Genuinely not snark, this stuff's just not my field!)
stevepotter
I'm working on a system that uses affordable hardware (iPhones) to make orthopedic surgery easier and more precise. Among other things, we have to track the position in space of surgical tools like drills. Models like this can play a pivotal role in that.
As someone mentioned, this is great for gaussian splatting, which we also do.
vessenes
This is a super useful utility — until this there was nothing fast and easy that you could dump say 30 quick camera photos of a (room/object/place) into and get out a dense point cloud.
Instead you had to run all these separate pipelines inferring camera location, etc. etc. before you could get any sort of 3D information out of your photos. I’d guess this is going into many many workflows where it will drop in replace a bunch of jury-rigged pipelines.
the8472
The depth maps and point clouds are useful in CGI for turning a 2D image into a 3D environment that can then be incorporated into a raytracing renderer, e.g. a CAD-based foreground object placed in a generated environment.
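For a sense of the mechanics, back-projecting a depth map into camera-space points only needs the pinhole intrinsics (a minimal sketch; the intrinsics and arrays here are made up):

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an HxW depth map (depth along +Z) into an (H*W, 3)
    point cloud in the camera frame, using a simple pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Illustrative usage with made-up intrinsics and a random depth map:
depth = np.random.uniform(1.0, 5.0, size=(480, 640))
points = depth_to_points(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```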
imbusy111
Architectural visualization is one. For example, the design phase of remodelling your house would be much easier if you had a 3D reconstruction of the current state already available.
Lerc
Seems like it would provide good data for training control nets for image generation.
This would let any of the types of data that this model can output be used as input for controlling image generation.
cluckindan
Collision meshes for Gaussian splats
porphyra
It is cool to see recent research doing this to reconstruct scenes from fewer images, essentially using a transformer to guess what the scene structure is. Previously, you needed a ton of images and had to use COLMAP. All the fancy papers like NeRF and Gaussian Splatting used COLMAP in the backend, and while it does a great job in terms of accuracy, it is slow and requires a lot of images with known calibration.
jdthedisciple
Interesting idea, I applaud it.
However, I just tried it on Hugging Face and the result was... mediocre at best:
The resulting point cloud missed about half the features from the input image.
porphyra
The final point cloud rendering quality might be down to a third-party renderer rather than VGGT itself.
vessenes
Looking at the output, which is impressive, I want to see this pipeline applied to splats. Dense point clouds lose a bunch of color and directional information needed for high quality splats, but it seems easy to imagine this method would work well for splats. I wonder if the architecture could be fine tuned for this or if you’d need to retrain an entire model.
w-m
Certainly possible - see Section 4.6, "Finetuning for Downstream Tasks", in the paper; the first subsection is "Feed-forward Novel View Synthesis". They chose to report their experiments on LVSM, which is not an explicit representation like 3D Gaussian Splatting, but they cite two feed-forward 3DGS approaches in their state-of-the-art listing.
Should be quite exciting going forward, as fine-tuning might be possible on consumer hardware / single desktop machines (like it is with LLMs). So I would expect a lot of experiments coming out in this space soon-ish. If the results hold true, it'll be pretty exciting to drop slow and cumbersome COLMAP processing and scene optimization for a single forward pass that takes a few seconds.
richard___
We need camera poses in dynamic scenes
maelito
Can it be used to build Google Earth-like 3D scenes?
amelius
Please stop using keywords from electrical engineering.
More info and demos:
https://vgg-t.github.io/