DreamMaker: Making High-quality Text-to-3D Generation with 3D Consistent Regularization

Anonymous Author
Affiliation

DreamMaker generates high-quality and high-resolution 3D models from given text prompts.

a DSLR photo of an exercise bike in a well-lit room
a zoomed-out DSLR photo of a recliner chair
Michelangelo style statue of dog reading news on a cellphone
a DSLR photo of a car made out of cheese
a DSLR photo of a cat wearing a bee costume
a DSLR photo of a Christmas tree with donuts as decorations
An octopus and a giraffe having cheesecake
a zoomed out DSLR photo of a rabbit cutting grass with a lawnmower
a DSLR photo of a corgi puppy
a zoomed out DSLR photo of a red rotary telephone
a DSLR photo of an ice cream sundae

Interactable Meshes

an Asian Santa Claus

a metal bunny sitting on top of a stack of chocolate cookie

an astronaut on a horse

a DSLR photo of a pug made out of metal

a DSLR photo of a tiger dressed as a doctor

a human skull with a vine growing through one of the eye sockets

Beautifully designed hyper-realistic futuristic electric vehicle for elderly people, highest poly count, highest contrast, highest detail, highest quality, UHD

Beautifully designed hyper-realistic psychedelic bee-concept futuristic fighter jet aircraft, highest contrast, highest poly count, highest detail, highest quality, UHD

An octopus and a giraffe having cheesecake


Abstract

2D diffusion-based text-to-3D generation models often suffer from the multi-face Janus problem, which arises from the lack of 3D consistency in multi-view image generation. To address this challenge, we propose DreamMaker, a novel text-to-3D generation framework. Our approach incorporates 3D-geometry-consistent regularization to enhance multi-view consistency and improve the overall quality of 3D generation. The DreamMaker pipeline combines a multi-view image generation model with a large text-to-image model, so that generation is both 3D consistent and semantically accurate. Additionally, we introduce a refinement module that leverages improved 3D scene parameterization and an adaptive camera view sampling strategy to extract high-resolution meshes and textures. Experimental evaluations show that DreamMaker reduces the occurrence of the Janus problem by approximately 60%. Comparative studies also indicate that users prefer DreamMaker over state-of-the-art approaches such as DreamFusion, Magic3D, and SJC.


Comparison with state-of-the-art models

Although Magic3D and ProlificDreamer generate high-quality objects in some cases, multi-view inconsistency still occurs in many others. For example, with Magic3D, the wheels of the vehicle in the first case point in inconsistent directions, and the airplane in the second case has multiple heads. DreamMaker introduces 3D consistency to avoid the Janus problem while still producing high-resolution meshes and high-quality textures. We use the threestudio implementation for all baselines.

Magic3D-IF-SD                                       ProlificDreamer                                       Ours            

Beautifully designed hyper-realistic futuristic electric vehicle for elderly people, highest poly count, highest contrast, highest detail, highest quality, UHD

Beautifully designed hyper-realistic psychedelic bee-concept futuristic fighter jet aircraft, highest contrast, highest poly count, highest detail, highest quality, UHD

an Asian Santa Claus

a metal bunny sitting on top of a stack of chocolate cookie


More ablation studies

We present a series of additional ablation studies that extend beyond the scope of the main manuscript. These studies examine: (1) the role of CLIP regularization in texture and shape refinement; (2) the benefits of the warm-up phase in the coarse stage; (3) the capability to generate human-centric 3D models; (4) the impact of using different initial images and random seeds for initialization; (5) the effects of training with and without alpha entropy regularization and smoothness regularization; and (6) the effects of utilizing Zero-1-to-3 during the refinement stage.

CLIP Regularization

During the coarse stage, we add a CLIP loss to refine the generative process, which noticeably improves shapes and textures. The figure below illustrates the improvement: the surfaces of the miniature schnauzer and the piglet show considerably sharper textures once the CLIP loss is applied. Notably, the piglet's facial features, such as the eyes and nose, look more realistic in the final row. We attribute this to the CLIP regularization, which optimizes the generated scene to align closely with the text prompt as measured by CLIP similarity.
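
As a rough illustration of a CLIP similarity loss, the sketch below scores one rendered view against the prompt. It assumes the image and text embeddings have already been produced by CLIP's encoders (not shown), and the `1 - cosine similarity` form is an assumption; this page does not state the exact loss used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_loss(image_embedding, text_embedding):
    """Assumed loss form: 0 when the embeddings align, 2 when they are opposite."""
    return 1.0 - cosine_similarity(image_embedding, text_embedding)
```

Minimizing this value pushes the rendering's embedding toward the prompt's embedding, which is the alignment effect described above.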

Zero-1-to-3 warmup in the coarse stage

At the beginning of the coarse stage, we run a Zero-1-to-3 warm-up with RGB guidance. During this phase, without DeepFloyd IF, the 3D object reaches a stable, rough initialization with consistent geometry. This step helps reinforce both texture fidelity and geometric stability in the final 3D models. Notably, after the warm-up, the geometry of objects such as ice cream sundaes and saguaro cacti is markedly more realistic. The Zero-1-to-3 warm-up thus lays the groundwork for robust, realistic geometry early in training.
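
The warm-up can be pictured as a simple step-based switch between guidance models. This is a minimal sketch: `warmup_steps` and the string labels are hypothetical names, and using both models together after the warm-up is an assumption based on the abstract's description of the pipeline.

```python
def select_guidance(step, warmup_steps):
    """Return the guidance models assumed active at a given coarse-stage step."""
    if step < warmup_steps:
        # Warm-up: only the multi-view prior guides the RGB renderings.
        return ("zero123",)
    # After warm-up: the text-to-image prior is added (assumed combination).
    return ("zero123", "deepfloyd_if")
```

For example, `select_guidance(0, 100)` selects only the multi-view prior, while any step from 100 onward selects both.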

Human-centric 3D Asset Generation

Our architecture can also create human-related assets. Given prompts crafted for human figures, it generates 3D faces and bodies with high fidelity and consistency across viewpoints. Remarkably, this is achieved without any prior knowledge of human anatomy. For instance, our model generates a realistic bust of an Asian Santa Claus. In addition, our generated astronaut, Deadpool, and Spiderman exhibit 3D geometry that closely follows human anatomical form.

Random Initialization

We assess the impact of using different seeds and initialization images. Our findings indicate that our method is not notably affected by the choice of random initialization. In contrast, DreamFusion and Magic3D rely heavily on the choice of random seed to achieve high-quality results and avoid the multi-face Janus problem.

Smoothness Regularization

We evaluate the impact of the smoothness loss on our coarse-stage results. Applying smoothness regularization improves texture quality and also has a modest beneficial effect on mesh quality, as visualized in the surface normal maps.
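
The exact form of the smoothness loss is not given here. One common choice, shown below purely as an assumed example, is a total-variation style penalty on a rendered normal map, stored as an `H x W` grid of 3-vectors.

```python
def smoothness_loss(normal_map):
    """Mean L1 distance between horizontally and vertically adjacent pixels."""
    height, width = len(normal_map), len(normal_map[0])
    total, pairs = 0.0, 0
    for y in range(height):
        for x in range(width):
            here = normal_map[y][x]
            if x + 1 < width:  # right-hand neighbour
                total += sum(abs(a - b) for a, b in zip(here, normal_map[y][x + 1]))
                pairs += 1
            if y + 1 < height:  # neighbour below
                total += sum(abs(a - b) for a, b in zip(here, normal_map[y + 1][x]))
                pairs += 1
    return total / pairs if pairs else 0.0
```

A perfectly flat surface scores zero, and abrupt changes between neighbouring normals raise the penalty.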

Entropy Loss

We add an entropy loss to our model to suppress floaters and blur in the background, which in turn helps mesh extraction. We compare results obtained by training with and without the entropy loss. The entropy loss effectively eliminates background floaters and blur, leading to clearer and more detailed imagery.
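
An entropy loss on opacity is typically the binary entropy of each pixel's accumulated alpha, which is lowest when alpha is close to 0 or 1. The sketch below assumes that standard form and a flat list of alpha values; the `eps` clamp is added only for numerical safety.

```python
import math


def alpha_entropy(alphas, eps=1e-6):
    """Mean binary entropy of opacity values in [0, 1]."""
    total = 0.0
    for a in alphas:
        a = min(max(a, eps), 1.0 - eps)  # keep log() finite at 0 and 1
        total += -(a * math.log(a) + (1.0 - a) * math.log(1.0 - a))
    return total / len(alphas)
```

Semi-transparent regions such as floaters have alpha near 0.5 and are penalized most, so minimizing this term pushes each pixel toward fully opaque or fully empty.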

Zero-1-to-3 in the Refinement Stage

We evaluate the effect of using Zero-1-to-3 for surface normal regularization during the refinement stage. In this variant, as in the coarse stage, we feed the surface normal rendering into Zero-1-to-3 while continuing to use Stable Diffusion for the RGB SDS loss. By comparison, using only Stable Diffusion to compute the SDS losses for both the surface normal and RGB texture renderings allows our model to generate more detailed mesh surfaces. We attribute this to Zero-1-to-3 generating only low-resolution images and providing only weak regularization at low resolution, which makes it unsuitable for the refinement stage, where more detailed textures and mesh surfaces are needed.
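
For reference, the SDS losses mentioned above are usually defined through their gradient, as introduced in DreamFusion. The notation below follows that formulation and is not taken from this page: $\theta$ are the scene parameters, $x = g(\theta)$ is a rendering (RGB or surface normal), $x_t$ is the rendering noised to timestep $t$ with noise $\epsilon$, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction under conditioning $y$, and $w(t)$ is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\bigl(\phi, x = g(\theta)\bigr)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

Under this reading, choosing between Zero-1-to-3 and Stable Diffusion changes only the denoiser $\hat{\epsilon}_\phi$ and its conditioning $y$ (a text prompt for Stable Diffusion; a reference image and relative camera pose for Zero-1-to-3).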

PBR Material Modeling

In addition to the experiments above, we investigate Physically-Based Rendering (PBR) material modeling during the refinement stage. This method decomposes the texture into three components of the material model: the diffuse term, the roughness and metallic term, and the normal variation term. PBR material modeling can be beneficial for simulation applications.
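
The three-term decomposition can be pictured as splitting a per-point texture output into named material fields. The sketch below is illustrative only: the `PBRMaterial` type, the `split_material` helper, and the 8-value layout are hypothetical and are not described on this page.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PBRMaterial:
    diffuse: Tuple[float, float, float]        # base colour (albedo), RGB
    roughness: float                           # 0 = mirror-like, 1 = fully rough
    metallic: float                            # 0 = dielectric, 1 = metal
    normal_offset: Tuple[float, float, float]  # normal variation term


def split_material(values: Sequence[float]) -> PBRMaterial:
    """Split a hypothetical 8-value vector into the three material terms."""
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    return PBRMaterial(
        diffuse=tuple(values[0:3]),
        roughness=values[3],
        metallic=values[4],
        normal_offset=tuple(values[5:8]),
    )
```

Keeping the terms separate is what lets a renderer or simulator relight the asset, rather than reusing lighting baked into a single color texture.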


Mesh exports

We can extract meshes from our fine-stage results.


Image to 3D

3D meshes generated from a given poster image.


More results

a zoomed out DSLR photo of an origami hippo in a river
a DSLR photo of a mug of hot chocolate with whipped cream and marshmallows
a zoomed out DSLR photo of a rabbit cutting grass with a lawnmower
a DSLR photo of a pug made out of metal
a DSLR photo of a red pickup truck driving across a stream
a DSLR photo of a red-eyed tree frog
a DSLR photo of a robot tiger
a DSLR photo of an iridescent steampunk patterned millipede with bison horns
a DSLR photo of a tiger dressed as a doctor
a DSLR photo of a tiger made out of yarn
a zoomed out DSLR photo of an astronaut chopping vegetables in a sunlit kitchen
a zoomed out DSLR photo of an amigurumi motorcycle
a DSLR photo of an overstuffed pastrami sandwich
a DSLR photo of an unstable rock cairn in the middle of a stream
a fox and a hare tangoing together
a human skull with a vine growing through one of the eye sockets
a llama wearing a suit
a piece of Black Forest cake with blueberries
a silver platter piled high with fruits
a sliced loaf of fresh bread
a squirrel wearing a tuxedo and holding a conductor's baton
a zoomed out DSLR photo of a pigeon standing on a manhole cover
a white mug with golden logo on it
a wide angle DSLR photo of a colorful rooster
a zoomed out DSLR photo of a bulldozer made out of toy bricks
a zoomed out DSLR photo of a cake in the shape of a train
a zoomed out DSLR photo of a chimpanzee holding a cup of hot coffee
a zoomed out DSLR photo of a construction excavator
a zoomed out DSLR photo of a corgi wearing a top hat
a zoomed out DSLR photo of a dachsund riding a unicycle
a zoomed out DSLR photo of a hermit crab with a colorful shell
a zoomed out DSLR photo of a human skeleton relaxing in a lounge chair
a zoomed out DSLR photo of a kangaroo sitting on a bench playing the accordion
a zoomed out DSLR photo of a lion's mane jellyfish
a zoomed out DSLR photo of a majestic sailboat
a zoomed out DSLR photo of a model of a house in Tudor style

Template from https://dreamfusion3d.github.io/