Hunyuan3D 2.0 – High-Resolution 3D Assets Generation

geuis

Question related to 3D mesh models in general: has any significant work been done on models oriented towards photogrammetry?

Case in point, I have a series of photos (48) that capture a small statue. The photos are high quality, the object was on a rotating platform. Lighting is consistent. The background is solid black.

These are normally ideal conditions for photogrammetry, but none of the common applications and websites I've tried produces a mesh that isn't super low-poly and/or full of holes.

I've been casually scanning huggingface for relevant models to try out but haven't really found anything.

troymc

Check out RealityCapture [1]. I think it's what's used to create the Quixel Megascans [2]. (They're both under the Epic corporate umbrella now.)

[1] https://www.capturingreality.com/realitycapture

[2] https://quixel.com/megascans/

jocaal

Recently, a lot of development in this area has been in Gaussian splatting, and from what I've seen the new methods are super effective.

https://en.wikipedia.org/wiki/Gaussian_splatting

https://www.youtube.com/watch?v=6dPBaV6M9u4

geuis

Yeah, some very impressive stuff with splats going on. But I haven't seen much about going from splats to high-quality 3D meshes. I've tried one or two tools, with pretty poor results.

tzumby

I'm not an expert, I've only dabbled in photogrammetry, but it seems to me that the crux of the problem is identifying common pixels across images in order to triangulate points in 3D space. It doesn't sound like something an LLM would be good at.
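Roughly, that classic pipeline looks like the sketch below (Python with OpenCV; the filenames and camera intrinsics are placeholder assumptions, and a real structure-from-motion tool does much more, e.g. bundle adjustment over all 48 views):

  import cv2
  import numpy as np

  # Two views of the object; hypothetical filenames.
  img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
  img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

  # 1. Find "common pixels": detect and match local features across views.
  orb = cv2.ORB_create(5000)
  kp1, des1 = orb.detectAndCompute(img1, None)
  kp2, des2 = orb.detectAndCompute(img2, None)
  matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
  pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
  pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

  # 2. Recover the relative camera pose from the matches (intrinsics K are
  #    assumed known; these values are made up).
  K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
  E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
  _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

  # 3. Triangulate each matched pixel pair into a 3D point.
  P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
  P2 = K @ np.hstack([R, t])
  pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
  pts3d = (pts4d[:3] / pts4d[3]).T  # (N, 3) sparse point cloud

None of that is language modeling; it's feature matching plus projective geometry, although learned feature matchers and neural surface reconstruction are now being applied to the same steps.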

falloon

[dead]

MikeTheRocker

Generative AI is going to drive the marginal cost of building 3D interactive content to zero. Unironically this will unlock the metaverse, cringe as that may sound. I'm more bullish than ever on AR/VR.

jsheard

I can only speak for myself, but a Metaverse consisting of infinite procedural slop sounds about as appealing as reading infinite LLM-generated books, that is, not at all. "Cost to zero" implies drinking directly from the AI firehose with no human in the loop (humans cost money), and entertainment produced in that manner is still dire, even in the relatively mature field of pure text generation.

MikeTheRocker

IMO current-generation models are capable of creating content significantly better than "slop" quality. You need only look at NotebookLM's output. As models continue to improve, this will only get better. Look at the rate of improvement of video generation models in the last 12-24 months. It's obvious to me we're rapidly approaching acceptable, or even excellent, quality for on-demand generated content.

jsheard

I feel like you're conflating quality with fidelity. Video generation models have better fidelity than they did a year ago, but they are no closer to producing any kind of compelling content without a human directing them, and the latter is what you would actually need to make the "infinite entertainment machine" happen.

The fidelity of a video generation model is comparable to an LLM's ability to nail spelling and grammar: it's a start, but there's more to being an author than that.

deeznuttynutz

This is exactly why I'm building my app now, with the expectation that these assets will get exponentially better in the short term.

jdietrich

I can only speak for myself, but a large and growing proportion of the text I read every day is LLM output. If Claude and DeepSeek produce slop, then it's a far higher calibre of slop than most human writers could aspire to.

hex4def6

I think it has its place. For 'background filler' I think it makes a lot of sense; stuff which you don't need to care about, but whose absence can make something feel less real.

To me, this takes the place of (or augments) procedural generation. NPC crowds in which none of the participants are needed for the plot, but in which each can have unique clothing, appearance, and lines, aren't "needed" for a game, but can flesh it out when done thoughtfully.

Recall the lambasting Cyberpunk 2077 got for its NPCs that cycled through a seemingly very limited number of appearances, to the point that you'd see clones right next to each other. This would solve that sort of problem, for example.

bufferoverflow

Minecraft is procedurally generated slop, yet it's insanely popular.

noch

> a Metaverse consisting of infinite procedural slop sounds about as appealing as reading infinite LLM generated books

Take a look at the ImgnAI gallery (https://app.imgnai.com/) and tell me: can you paint better and more imaginatively than that? Do you know anyone in your immediate vicinity who can?

Read this satirical speech by Claude, in French (https://x.com/pmarca/status/1881869448275177764) and in English (https://x.com/pmarca/status/1881869651329913047), and tell me: can you write fiction more entertaining or imaginative than that? Is there someone in your vicinity who can?

Perhaps that's mundane, so is there someone in your vicinity who can reason about a topic in mathematics/physics as well as this: https://x.com/hsu_steve/status/1881696226669916408 ?

Probably your answer is "yes, obviously!" to all the above.

My point: deep learning works, and the era of slop ended ages ago; some people are just still living in the past, or with some cartoon image of the state of the art.

> "Cost to zero" implies drinking directly from the AI firehose with no human in the loop

No. It means the marginal cost of production tends towards 0. If you can think it, then you can make it instantly and iterate a billion times to refine your idea with as much effort as it took to generate a single concept.

Your fixation on "content without a human directing them" is bizarre and counterproductive, and it's confounding your reasoning. Why would "no human in the loop" be a prerequisite for productivity?

Philpax

> Take a look at the ImgnAI gallery (https://app.imgnai.com/) and tell me: can you paint better and more imaginatively than that?

So while I generally agree with you, I think this was a bad example to use: a lot of these are slop, with the kind of AI sheen we've come to glaze over. I'd say less than 20% are actually artistically impressive / engaging / thought-provoking.

deadbabe

I think you're being short-sighted. Imagine feeding your favorite TV shows into a generative AI and being able to walk around in that world and talk to the characters, or explore it with other people.

bschwindHN

That's still AI slop, in my opinion.

slt2021

Do you find it interesting talking to NPCs in games?

echelon

You're too old and jaded [1]. It's for kids inventing infinite worlds to role play and adventure. They're going to have a blast.

[1] Not meant as an insult. Working professionals don't have time for this stuff.

wizzwizz4

Object permanence and a communications channel are enough for this. Give children (who get along with each other) a pile of sticks and leave them alone for half an hour, and there's half a chance their game will ignore the sticks. Most children wouldn't want to have their play mediated by the computer in the way you describe, because the ergonomics are so poor.

taejavu

Jeez, I'd love to know what Apple's R&D debt on the Vision Pro is, based on sales to date. I really, really hope they continue to push for a headset that's within reach of average people, but the hole must be so deep at this point that I wouldn't be surprised if they cut their losses.

EncomLab

As Carmack pointed out, the problem with AR/VR right now isn't the hardware, it's the software. Until the "VisiCalc" must-have killer app shows up to move the hardware, there is little incentive for general users to make the investment.

PittleyDunkin

> As Carmack pointed out, the problem with AR/VR right now isn't the hardware, it's the software.

The third option is people's expectations for AR/VR itself: it could remain a highly niche and expensive industry, unlikely to spread to the general population.

PittleyDunkin

Maybe eventually. Based on this quality, I don't see it happening any time soon.

slt2021

[flagged]

pella

Ouch; the license excludes the European Union, United Kingdom and South Korea:

  TENCENT HUNYUAN 3D 2.0 COMMUNITY LICENSE AGREEMENT
  Tencent Hunyuan 3D 2.0 Release Date: January 21, 2025
  THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW.
https://github.com/Tencent/Hunyuan3D-2?tab=License-1-ov-file

EMIRELADERO

I assume it's safe to ignore, as model weights probably aren't copyrightable.

slt2021

You don't know what kind of backdoors are hidden in the model weights.

gruez

Is this tied to EU regulations around AI models?

denkmoon

For the AI-uninitiated: is this something you could feasibly run at home, e.g. on a 4090? (How can I tell how "big" the model is from the GitHub or Hugging Face page?)

sorenjan

The hunyuan3d-dit-v2-0 model is 4.93 GB. ComfyUI support is on their roadmap, so it might be best to wait for that, although the model doesn't look complicated to use via their example code.

https://huggingface.co/tencent/Hunyuan3D-2/tree/main/hunyuan...
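If you want to check sizes yourself before downloading, one option (a sketch assuming the huggingface_hub Python package; the repo id comes from the link above) is to list the repo's file metadata:

  from huggingface_hub import HfApi

  # files_metadata=True asks the Hub API to include per-file sizes.
  info = HfApi().model_info("tencent/Hunyuan3D-2", files_metadata=True)
  for f in info.siblings:
      if f.size:
          print(f"{f.rfilename}: {f.size / 1e9:.2f} GB")

As a rough rule of thumb, fp16 weights take about 2 bytes per parameter, so a ~5 GB checkpoint should fit comfortably in a 4090's 24 GB of VRAM, with room left over for activations.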

sebzim4500

Interesting. One of the diagrams suggests that the mesh is generated with the marching cubes algorithm, but the geometry of the meshes shown above is clearly not generated that way.

GrantMoyer

To me, the bird mesh actually does look like marching cubes output. Note the abundance of almost-square triangle pairs on the front and sides. Also note that marching cubes doesn't necessarily create stairstep-like artifacts; it can generate a smooth-looking mesh from signed distance field input by slightly adjusting the locations of vertices based on the relative magnitudes of the field at the surrounding lattice points.
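That vertex adjustment is just linear interpolation along each cube edge where the field changes sign. A tiny sketch of that one step (illustrative only; the full algorithm also needs the per-cube triangulation table):

  import numpy as np

  def edge_vertex(p0, p1, d0, d1):
      """Place a vertex on the edge p0->p1 where the SDF crosses zero.
      d0 and d1 are field values at the lattice points (opposite signs)."""
      t = d0 / (d0 - d1)  # fraction of the way toward p1 where the field is 0
      p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
      return p0 + t * (p1 - p0)

  # Field is -0.2 at one corner and +0.6 at the next, so the vertex lands 25%
  # of the way along the edge instead of snapping to a corner or midpoint.
  print(edge_vertex([0, 0, 0], [1, 0, 0], -0.2, 0.6))  # [0.25 0.   0.  ]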

wumeow

The meshes generated by the huggingface demo definitely look like the product of marching cubes.

godelski

As with any generative model, trust but verify: try it yourself. Frankly, as a generative-modeling researcher myself, I can say there's a lot of reason not to trust what you see in papers and project pages.

They link a Huggingface page (great sign!): https://huggingface.co/spaces/tencent/Hunyuan3D-2

I tried to replicate the objects they show on their project page (https://3d-models.hunyuan.tencent.com/). The full prompts are there but truncated, so you can just inspect the element and grab the text.

  Here's what I got
  Leaf
     PNG: https://0x0.st/8HDL.png
     GLB: https://0x0.st/8HD9.glb
  Guitar
     PNG: https://0x0.st/8HDf.png  other view: https://0x0.st/8HDO.png
     GLB: https://0x0.st/8HDV.glb
  Google Translate of Guitar:
     Prompt: A brown guitar is centered against a white background, creating a realistic photography style. This photo captures the culture of the instrument and conveys a tranquil atmosphere.
     PNG: https://0x0.st/8HDt.png   and  https://0x0.st/8HDv.png
     Note: Weird thing on top of the guitar. But at least this time the strings aren't fusing into the sound hole. 
I haven't tested my own prompts or the Google translations of the Chinese prompts because I'm getting an over-usage error (I'll edit this comment if I get them). That said, these look pretty good. The paper and page images definitely look better, but the gap isn't like Stable Diffusion 1 paper vs. Stable Diffusion 1 reality.

But these are long and detailed prompts, with lots of prompt engineering. That should raise some suspicion: the real world has higher variance. Let's get an idea of how hard it is to use by trying some simpler things :)

  Prompt: A guitar
    PNG: https://0x0.st/8HDg.png
    Note: Not bad! Definitely overfit but does that matter here? A bit too thick for an electric guitar but too thin for acoustic.
  Prompt: A Monstera leaf
    PNG: https://0x0.st/8HD6.png  
         https://0x0.st/8HDl.png  
         https://0x0.st/8HDU.png
    Note: A bit wonkier. I picked this because it looked like the leaf in the example but this one is doing some odd things. 
          It's definitely a leaf and monstera-like, but a bit of a mutant. 
  Prompt: Mario from Super Mario Bros
    PNG: https://0x0.st/8Hkq.png
    Note: Now I'm VERY suspicious....
  Prompt: Luigi from Super Mario Bros
    PNG: https://0x0.st/8Hkc.png
         https://0x0.st/8HkT.png  
         https://0x0.st/8HkA.png
    Note: Highly overfit[0]. This is what I suspected. Luigi isn't just tall Mario. 
          Where is the tie coming from? The suspender buttons are all messed up. 
          Really went uncanny valley here. So this suggests the model is really brittle. 
  Prompt: Peach from Super Mario Bros
    PNG: https://0x0.st/8Hku.png  
         https://0x0.st/8HkM.png
    Note: I'm fucking dying over here this is so funny. It's just a peach with a cute face hahahahaha
  Prompt: Toad from Super Mario Bros
    PNG: https://0x0.st/8Hke.png 
         https://0x0.st/8Hk_.png
         https://0x0.st/8HkL.png
    Note: Lord have mercy on this toad, I think it is a mutated Squirtle.  
The paper can be found here (the arXiv badge on the page leads to a PDF in the repo, which GitHub is slow to render): https://arxiv.org/abs/2411.02293

(If you want to share images like I did, all I'm doing is `curl -F'file=@foobar.png' https://0x0.st`)

[0] Overfit is a weird thing now. Maybe the model doesn't generalize well, but sometimes that's not a problem. I think this is one of the bigger lessons we've learned with recent ML models. My viewpoint is "sometimes you want a database with a human-language interface; sometimes you want to generalize". So we have to be more context-driven here. But certainly there are a lot of things we should be careful about when we're talking about generation. These things are trained on A LOT of data. If you're more "database-like", then certainly there are potential legal ramifications...

Edit: For context, by "look pretty good" I mean in comparison to other works I've seen. I think it is likely a ways from being useful in production. I'm not sure how much human labor would be required to fix the issues.

godelski

Oops, I ran out of edit time when posting my last two:

  Prompt: A hawk flying in the sky
    PNG: https://0x0.st/8Hkw.png
         https://0x0.st/8Hkx.png
         https://0x0.st/8Hk3.png
    Note: This looks like it would need more work. I tried a few specific birds and a generic one too. They all seem to have a similar form. 
  Prompt: A hawk with the head of a dragon flying in the sky and holding a snake
    PNG: https://0x0.st/8HkE.png
         https://0x0.st/8Hk6.png
         https://0x0.st/8HkI.png
         https://0x0.st/8Hkl.png
    Note: This one really isn't great. Just a normal hawk head. Not how a bird holds a snake either...
This last one is really key for judging where the tech is at, btw. Most of the generations are assets you could download freely from the internet, and you could probably get better ones from some artist on Fiverr or something. But the last example is closer to a realistic use case: something that is relatively reasonable, probably not in the set of easy-to-download assets, and that someone might actually want. It isn't too crazy an ask given the Chimera and how similar a dragon is to a bird in the first place; this should be on the "easier" end. I'm sure you could prompt engineer your way to it, but then we have to have the discussion of what costs more: a prompt engineer or an artist? And do you need a prompt engineer who can also repair models? Because these look like they need repairs.

This can make it hard to really tell if there's progress or not. It is really easy to make compelling images in a paper and beat benchmarks while not actually creating something that is __or will become__ a usable product. All the little details matter, and little errors quickly compound... That said, I do much more on generative imagery than generative 3D objects, so take this with a grain of salt.

Keep in mind: generative models (of any kind) are incredibly difficult to evaluate. You really only have a good idea after you've generated hundreds or thousands of samples yourself and have looked at a lot of them with high scrutiny.

BigJono

Yeah, this is absolutely light years off being useful in production.

People just see fancy demos and start crapping on about the future, but look at Stable Diffusion. It's been around for how long, and which serious professional game developers are using it as a core part of their workflow? Maybe some concept artists? Consistent style is such an important thing for any half-decent game, and these generative tools shit the bed on consistency in a way that's difficult to paper over.

I've spent a lot of time thinking about game design and experimenting with SD/Flux, and the only thing I think I could now get close to production that I couldn't before is maybe an MTG-style card game, where gameplay is far more important than graphics, and flashy static artwork is far more important than consistency. That's a fucking small niche, and I don't see a lot of paths to generalisation.

Kelvin506

The first guitar has one of its strings ending at the sound hole, and six tuning knobs for five strings.

The second has similar problems: it has tuning knobs with missing winding posts, then five strings becoming four at the bridge. It also has a pickup under the fretboard.

Are these considered good capability examples?

godelski

I take back a fair amount of what I said.

It is pretty good with some easier assets that I suspect there are lots of samples of (and we're comparing to other generative models, not to what humans make; humans probably still win by a good margin). But moving beyond obvious assets that we could easily find, I'm not seeing good performance at all. Probably a lot can be done with heavy prompt engineering, but that just makes things more complicated to evaluate.

keyle

Thanks for this. After trying it myself, I find the results quite impressive.

artemonster

[flagged]