Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

lukashoel.github.io

220 points by amichail 2 years ago

A different project, perhaps, although with the speed they are popping up it's not easy to keep track, but I was just playing around in a live multiplayer 3D worldspace [1] where a text prompt to instant 360 Skybox is a really cool feature to see working as it forms all around you in realtime (cool on PC amazing in VR). It extends the pipeline of whatever Blockade Labs are using under the hood [2].

[1] https://hyperfy.io/ai-sky

[2] skybox.blockadelabs.com

avaer 2 years ago

Definitely hard to keep up with the tech, even if you're deep in it.
I presented a 3D gameplay hack of this at the recent Blockade meetup: https://youtu.be/TfRJeedTeOs
The metric depth model I used (ZoeDepth) is quite new -- most previous models were inverse relative depth, with poor scaling properties, especially for artistic worlds.
But now there is a much better depth model coming from Intel called Depth Fusion which they are adding to the Blockade API and also open sourcing (!)...
Also worth checking out what's possible with SD ControlNet: https://twitter.com/BlockadeLabs/status/1634578058287132674

wsgeorge 2 years ago

Reminds me of this project submitted yesterday [0]. I'm trying hard to keep up with the pace of projects and papers being announced. This is all very exciting!

[0] https://zero123.cs.columbia.edu/

smaddox 2 years ago

Cool. Stereoscopic diffusion images coming soon.

MayeulC 2 years ago

Pretty cool. Now, I wonder, can't you label a certain region of space with a prompt, and let the diffuser do its job? Maybe with some mathematical function to bend into another area.

The idea would be to roughly place the elements in a 3D scene, and adjust the prompt as the camera is moved around the scene.

Here, it's obvious that the "fireplace" prompt causes the model to place a new fireplace as the previous one comes out of view.

Even if you can't precisely label portions of an image, changing the prompt as the camera moves (or changing the weight coefficients for a prompt describing multiple orientations) would avoid that kind of "unnatural" result.

Regardless, impressive results! I wonder if it would perform better if it was re-trained to output a depth channel as well.

It could be useful for (artistically) filling gaps in photogrammetry projects.

I can't wait for painting or drawing styles to be applied to the output!

gs17 2 years ago

The example trajectories (https://github.com/lukasHoel/text2room/tree/main/model/traje...) seem to have different prompts for different angles, so you can definitely give a vague layout of the room.

totetsu 2 years ago

Time for a 3D run through of some classic text adventure games :D

tmilard 2 years ago

I do believe this kind of quasi-automatic 3D-Realistic-scene-generator is great. But maybe at the end useless.

So why you ask me ? - Now while the speed of 3D generating is wonderfull, and the 3D accuracy is "Ok" (but will progress more in the short future I bet). So great. But... But it lacks a unforgiving flaw : - You can NEVER correct the 3D the 'automat' has guessed for you. I mean it will go wrong in a few parts, and you can NOT do anything with it.Sadly.

I believe that once this kind of softwares really gives the user a magic tric to MANUALLY correct some part of the 3D mesh, you have got a winner and a major selling software.

tmilard 2 years ago

I tried to do one such software myself. Was kind of almost there : https://youtu.be/ufpajCHLWbg
Example result made by tool (load twice because bug) : https://free-visit.net/fr/demo01

antiatheist 2 years ago

It doesn't look like it creates any discreet models from the different parts of the room, it's just a flat mesh.

So it's a lot of work to still to create an object that makes integrateable with anything else.

bilsbie 2 years ago

When is this stuff making it into games! This would be amazing on the quest.

worldsayshi 2 years ago

Shouldn't be too hard to integrate. Just need to load the result into a Unity scene.
If nobody has tried it in a week or two I might give it a go.
- antiatheist 2 years ago
  
  That's not true at all, there's no discreet object recognition.
  You would still need to create maps for colliders, navigation etc. As well as break the flat mesh this provides down into discreet objects if you wanted any more physics integration.
  
  worldsayshi 2 years ago
  
  Sure if you want physics integration it's quite a bigger tasks. Maybe even insurmountable. And it wasn't on my mind at all.
  Unless the resulting mesh is huge it shouldn't be too hard to just build a vr viewer.
- nineteen999 2 years ago
  
  It's cool and all but all your lighting is going to be pre-baked ...
  
  andybak 2 years ago
  
  Which it usually is on Quest titles.

canadiantim 2 years ago

I can't wait to be able to generate 3d housing models from 2d floor plans. That'll probably happen sometime this year, wild how quickly all of this is progressing

jasonjamerson 2 years ago

Brilliant, been waiting for / working toward this. Is there a way to try it out?

pininja 2 years ago

They documented how to get up and running on Github. I see an example in there too. https://github.com/lukasHoel/text2room

fouc 2 years ago

Based on the title, I was randomly hoping that:

A) there was a text specification for a room and all the items in it (in terms of visuals at least).

B) that generating from this would be entirely deterministic