VGGT: Visual Geometry Grounded Transformer
31 comments
· March 25, 2025
Workaccount2
bhouston
You are the hero! Thank you! The main post link should be updated to this.
w-m
I read the paper yesterday and would recommend it. Kudos to the authors for getting to these results, and also for presenting them in a polished way. It's nice to follow the arguments about the alternating attention (global attention across all tokens vs. attention only among the tokens of each camera), the normalization (normalizing the scene scale in the data, vs. DUSt3R, which normalizes in the network), and the tokens (image tokens from DINOv2 + camera tokens + additional register tokens, with the first camera handled differently since it becomes the frame of reference). The results are amazing, and fine-tuning this model will be fun, e.g. for feed-forward 3DGS reconstruction.
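As a rough mental model of that alternating scheme (just my sketch, not the authors' code; the shapes, class name, and plain nn.MultiheadAttention blocks are my own assumptions):

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Toy sketch: alternate frame-wise and global self-attention over
    tokens shaped (batch, frames, tokens_per_frame, dim)."""
    def __init__(self, dim: int, heads: int = 8, depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, n, d = x.shape
        for i, attn in enumerate(self.layers):
            if i % 2 == 0:
                # frame-wise: tokens only attend within their own camera/frame
                t = x.reshape(b * f, n, d)
            else:
                # global: tokens attend across all frames at once
                t = x.reshape(b, f * n, d)
            out, _ = attn(t, t, t)
            x = x + out.reshape(b, f, n, d)  # residual connection
        return x

# Illustrative usage: 4 frames, 256 tokens per frame, 64-dim tokens.
tokens = torch.randn(1, 4, 256, 64)
out = AlternatingAttention(dim=64)(tokens)
```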
I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so it feels like another hit for The Bitter Lesson in the end: They used a giant bunch of [data], a year and a half of GPU time to [train] the final model, and created a model with a billion parameters that outperforms all specialized previous models.
Or in the words of the authors, from the paper:
> We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.
Fantastic to have this. But it feels... yes, somewhat bitter.
[The Bitter Lesson]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (often discussed on HN)
[data]: "Co3Dv2 [88], BlendMVS [146], DL3DV [69], MegaDepth [64], Kubric [41], WildRGB [135], ScanNet [18], HyperSim [89], Mapillary [71], Habitat [107], Replica [104], MVS-Synth [50], PointOdyssey [159], Virtual KITTI [7], Aria Synthetic Environments [82], Aria Digital Twin [82], and a synthetic dataset of artist-created assets similar to Objaverse [20]."
[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering
dleeftink
Doesn't the bitter lesson take the argument a bit too far by opposing search/learn to heuristics? Is the former not dependent on breakthroughs in the latter?
CooCooCaCha
The bitter lesson is the opposite. It argues that hand-crafted heuristics will eventually get beaten by more general learning algorithms that can take advantage of computing power.
porphyra
Indeed, even in "classical" chess engines like Stockfish, which previously relied on handcrafted heuristics at leaf nodes, the NNUE evaluation [1] has greatly outperformed the handcrafted one in recent years. Note that this is a completely different approach from the one AlphaZero takes, and modern Stockfish is significantly stronger than AlphaZero.
[1] https://stockfishchess.org/blog/2020/introducing-nnue-evalua...
dleeftink
> eventually get beaten
Brute forcing is bound to find paths beyond heuristics. What I'm getting at is that the path needs to be established first before it can be beaten, which is why I'm wondering whether one isn't an extension of the other rather than an opposing strategy.
I.e. search and heuristics both have a time and place, not so much a bitter lesson but a common filter for a next iteration to pass through.
sgnelson
I really wish someone would take this and combine it with true photogrammetry, using it to supplement rather than replace the traditional pipeline.
That would be the killer app for phone-based 3D scanners: you don't need a perfect scan, because this fills in the holes for you.
bhouston
I'm a little suspicious of many of the outdoor examples given, though. They are of famous places that are likely in the training set:
- Egyptian pyramids
- Roman Colosseum
These are the most iconic and most photographed things in the world.
That said, there are other examples that are more novel. I am just going to focus on those to judge its quality.
ed
It’s worth trying the demo - I uploaded a low-quality video of an indoor space and got decent results.
kfarr
Use the Hugging Face Space with your own data; it's very good and outputs a GLB: https://huggingface.co/spaces/facebook/vggt
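If you want to poke at that GLB programmatically, something like this works (a quick sketch with trimesh; the filename is just an example, and the exact geometry contents depend on the demo's export options):

```python
import trimesh

# Load the GLB exported from the demo (example filename).
scene = trimesh.load("vggt_output.glb", force="scene")

# List what's inside; the export typically contains the reconstructed
# point cloud, possibly alongside camera frustum geometry.
for name, geom in scene.geometry.items():
    print(name, type(geom).__name__, len(geom.vertices), "vertices")
```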
davedx
I'd love to hear what the use cases are for this. I was looking at Planet's website yesterday and although the technology is fascinating, I do sometimes struggle to understand what people actually do (commercially or otherwise) with the data? (Genuinely not snark, this stuff's just not my field!)
stevepotter
I'm working on a system that uses affordable hardware (iPhones) to make orthopedic surgery easier and more precise. Among other things, we have to track the position in space of surgical tools like drills. Models like this can play a pivotal role in that.
As someone mentioned, this is great for gaussian splatting, which we also do.
vessenes
This is a super useful utility — until this there was nothing fast and easy that you could dump say 30 quick camera photos of a (room/object/place) into and get out a dense point cloud.
Instead you had to run all these separate pipelines inferring camera location, etc. etc. before you could get any sort of 3D information out of your photos. I’d guess this is going into many many workflows where it will drop in replace a bunch of jury-rigged pipelines.
the8472
The depth maps and point clouds are useful in CGI for turning a 2D image into a 3D environment that can then be incorporated into a raytracing renderer, e.g. a CAD-based foreground object placed in a generated environment.
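For a sense of the mechanics, back-projecting a depth map into camera-space points only needs the pinhole intrinsics (a minimal sketch; the intrinsics and arrays here are made up):

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an HxW depth map (depth along +Z) into an (H*W, 3)
    point cloud in the camera frame, using a simple pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Illustrative usage with made-up intrinsics and a random depth map:
depth = np.random.uniform(1.0, 5.0, size=(480, 640))
points = depth_to_points(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```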
imbusy111
Architectural visualization is one. For example, the design phase of remodelling your house would be much easier if you had a 3D reconstruction of the current state already available.
Lerc
Seems like it would provide good data for training control nets for image generation.
This would let any of the types of data that this model can output be used as input for controlling image generation.
cluckindan
Collision meshes for Gaussian splats
porphyra
It is cool to see recent research doing this to reconstruct scenes from fewer images, essentially using a transformer to guess what the scene structure is. Previously, you needed a ton of images and had to use COLMAP. All the fancy papers like NeRF and Gaussian Splatting used COLMAP in the backend, and while it does a great job in terms of accuracy, it is slow and requires a lot of images with known calibration.
jdthedisciple
Interesting idea, I applaud it.
However, I just tried it on Hugging Face and the result was... mediocre at best:
The resulting point cloud missed about half the features from the input image.
porphyra
The final point cloud rendering quality might be down to a third-party renderer rather than VGGT itself.
vessenes
Looking at the output, which is impressive, I want to see this pipeline applied to splats. Dense point clouds lose a bunch of color and directional information needed for high quality splats, but it seems easy to imagine this method would work well for splats. I wonder if the architecture could be fine tuned for this or if you’d need to retrain an entire model.
w-m
Certainly possible - see Section 4.6, "Finetuning for Downstream Tasks", in the paper; the first subsection is "Feed-forward Novel View Synthesis". They chose to report their experiments on LVSM, which is not an explicit representation like 3D Gaussian Splatting, but they cite two feed-forward 3DGS approaches in their state-of-the-art listing.
Should be quite exciting going forward, as fine-tuning might be possible on consumer hardware / single desktop machines (like it is with LLMs). So I would expect a lot of experiments coming out in this space soon-ish. If the results hold true, it'll be pretty exciting to drop slow and cumbersome COLMAP processing and scene optimization for a single forward pass that takes a few seconds.
richard___
We need camera poses in dynamic scenes
maelito
Can it be used to build Google Earth-like 3D scenes?
amelius
Please stop using keywords from electrical engineering.
More info and demos:
https://vgg-t.github.io/