Abstract

Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS), which leverages pre-trained 2D diffusion models. While classifier-free guidance is widely acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the role of classifier-free guidance in score distillation and arrive at a surprising finding: the guidance alone is enough for effective text-to-3D generation. We name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This perspective offers new insights into existing techniques. We validate the effectiveness of CSD across a variety of text-to-3D tasks, including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods.
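To make the claim that the guidance alone is enough more concrete, the sketch below contrasts the per-sample update direction of SDS under classifier-free guidance with a direction that keeps only the guidance term. It is a minimal illustration rather than the released implementation: the function names are hypothetical, noise predictions are plain lists of floats instead of image tensors, the timestep weighting and the backward pass through the renderer are omitted, and the guidance convention (unconditional prediction plus scale times the difference) is one common choice that we assume here.

```python
from typing import List

Vector = List[float]


def cfg_noise(eps_cond: Vector, eps_uncond: Vector, scale: float) -> Vector:
    """Classifier-free guided noise prediction (assumed convention):
    eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def sds_direction(eps_cond: Vector, eps_uncond: Vector,
                  noise: Vector, scale: float) -> Vector:
    """SDS-style direction: guided noise prediction minus the injected noise."""
    guided = cfg_noise(eps_cond, eps_uncond, scale)
    return [g - n for g, n in zip(guided, noise)]


def csd_direction(eps_cond: Vector, eps_uncond: Vector) -> Vector:
    """Guidance-only direction: conditional minus unconditional prediction.
    The injected noise does not appear at all."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]


# Toy values standing in for the outputs of a diffusion model (hypothetical).
eps_cond = [1.0, 2.0]
eps_uncond = [0.5, 1.0]
noise = [0.25, 0.5]

sds = sds_direction(eps_cond, eps_uncond, noise, scale=2.0)
csd = csd_direction(eps_cond, eps_uncond)
```

Under this convention the SDS direction decomposes as (eps_cond - noise) + (scale - 1) * (eps_cond - eps_uncond), so at large guidance scales the second, guidance-only term dominates. The guidance-only direction is that term taken on its own.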

Example generated objects

We present examples of the 3D objects generated by our method here. For more results and comparisons with the other methods used in our evaluation, please refer to Comparison on shape generation and Comparison on texture synthesis. We train with a batch size of 1 on a single A800 GPU. We first use DeepFloyd-IF stage-I to generate a low-resolution NeRF, and then use Stable Diffusion with the DMTet representation for high-resolution mesh refinement. Generating each object takes approximately 1 hour.

Comparison on shape generation

We compare our method with other text-to-3D methods. The methods are shown in a fixed order here; the order was randomized in the user study.

Methods shown: DreamFusion, Magic3D, Ours.

Comparison on texture synthesis

We compare our method with other methods that support text-guided texture synthesis on 3D meshes. The methods are shown in a fixed order here; the order was randomized in the user study.

Geometry

Fantasia3D