🐏🐕🐁🦌 Learning the 3D Fauna of the Web

Zizhang Li1*     Dor Litvak1,2*     Ruining Li3      Yunzhi Zhang1      Tomas Jakab3     Christian Rupprecht3
Shangzhe Wu1†     Andrea Vedaldi3†     Jiajun Wu1†
1Stanford University     2UT Austin     3University of Oxford
(* Equal Contribution, † Equal Advising)
Paper Demo

Our method, 3D-Fauna, learns a pan-category deformable 3D model of more than 100 different animal species using only 2D Internet images as training data, without any prior shape models or keypoint annotations. At test time, the model can turn a single image of a quadruped instance into an articulated, textured 3D mesh in a feed-forward manner, ready for animation and rendering.


Check out the Gradio Demo page for single-view animal reconstruction.

Abstract

Learning 3D models of all animals on Earth requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that jointly learns a pan-category deformable 3D animal model for more than 100 animal species. One crucial bottleneck of modeling animals is the limited availability of training data, which we overcome by simply learning from 2D Internet images. We show that prior category-specific attempts fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM), which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model, we also contribute a new large-scale dataset of diverse animal species. At inference time, given a single image of any quadruped animal, our model reconstructs an articulated 3D mesh in a feed-forward fashion within seconds.


Overview

3D-Fauna is trained using only single-view images from the Internet. Given each input image, it first extracts a feature vector using a pre-trained unsupervised image encoder. This is then used to query a learned memory bank to produce a base shape and a DINO feature field in the canonical pose. The model also predicts the albedo, instance-specific deformation, articulated pose and lighting, and is trained via image reconstruction losses on RGB, DINO feature map and mask, as well as a mask discriminator loss, without any prior shape models or keypoint annotations.
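The bank lookup described above can be sketched as a soft-attention query over a small set of learned entries, so that visually similar species share base shapes. The sketch below is illustrative only: all names, dimensions, and the attention form are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 384, 16  # illustrative: feature dim of the image encoder, number of bank entries

# Learned parameters (randomly initialized here for the sketch):
# one key per bank entry, and one base-shape code per entry.
bank_keys = rng.normal(size=(K, d))
shape_codes = rng.normal(size=(K, 64))  # each entry stores a 64-dim shape code

def query_bank(img_feat):
    """Soft-attention lookup: returns a convex combination of base-shape codes."""
    logits = bank_keys @ img_feat / np.sqrt(d)
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return w @ shape_codes

img_feat = rng.normal(size=d)  # stand-in for a pre-trained encoder feature
code = query_bank(img_feat)    # base-shape code, to be decoded into a mesh
```

Because the output is a convex combination of a small number of shared base shapes, rare species with few images can still borrow shape structure from visually similar common species.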

Single-Image 3D Reconstruction

Given a single image of any quadruped animal, without any category information, the model reconstructs its articulated 3D shape and appearance, which can be animated and re-rendered from arbitrary viewpoints.



Video Frame Reconstruction

We can also use a single model to reconstruct different animals from video frames.


Shape Interpolation

Our trained shape bank allows for interpolation between instances reconstructed from different input images, which shows that the learned shape space is continuous and smooth.

Try it yourself: Move the slider to interpolate shapes, shown from an interpolated viewpoint (left) and a fixed viewpoint (right).

[Interactive sliders: five interpolation examples, each blending between Input Image 0 and Input Image 1.]
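One simple way such a slider can be realized is linear interpolation in the shape-code space, decoding each intermediate code into a mesh. This is a hypothetical sketch: the function names and code dimension are illustrative, and the mesh decoder is omitted.

```python
import numpy as np

def interpolate(code_a, code_b, t):
    """Linearly blend two shape codes; t=0 gives code_a, t=1 gives code_b."""
    return (1.0 - t) * code_a + t * code_b

# Stand-ins for shape codes produced from two different input images.
code_a = np.zeros(64)
code_b = np.ones(64)

# Sweep the "slider" to produce a sequence of intermediate shape codes.
steps = [interpolate(code_a, code_b, t) for t in np.linspace(0.0, 1.0, 5)]
```

The smoothness of the renders along such a sweep is what indicates that the learned shape space is continuous.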


Base Shape Bank Sampling

We can also directly sample numerous diverse shapes from the trained shape bank.
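One plausible way to sample from such a bank (not necessarily the authors' procedure) is to draw random convex weights over the bank entries and mix the corresponding base-shape codes. All names and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16
shape_codes = rng.normal(size=(K, 64))  # stand-in for the trained bank entries

def sample_shape(alpha=0.5):
    """Draw random convex weights over bank entries via a Dirichlet prior,
    yielding a random mixture of the learned base shapes."""
    w = rng.dirichlet(np.full(K, alpha))
    return w @ shape_codes

samples = [sample_shape() for _ in range(5)]
```

A small Dirichlet concentration (alpha < 1) biases samples toward a few dominant entries, while larger values produce more even blends of the base shapes.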


BibTeX

@article{li2024learning,
  title     = {Learning the 3D Fauna of the Web},
  author    = {Li, Zizhang and Litvak, Dor and Li, Ruining and Zhang, Yunzhi and Jakab, Tomas and Rupprecht, Christian and Wu, Shangzhe and Vedaldi, Andrea and Wu, Jiajun},
  journal   = {arXiv preprint arXiv:2401.02400},
  year      = {2024}
}

Acknowledgements

We are very grateful to Cristobal Eyzaguirre, Kyle Sargent, and Yunhao Ge for insightful discussions, and Chen Geng for proofreading. The work is in part supported by the Stanford Institute for Human-Centered AI (HAI), NSF RI #2211258, ONR MURI N00014-22-1-2740, the Samsung Global Research Outreach (GRO) program, Amazon, Google, and EPSRC VisualAI EP/T028572/1.

Relevant Work

Dove: Learning Deformable 3D Objects by Watching Videos. IJCV 2023.
MagicPony: Learning Articulated 3D Animals in the Wild. CVPR 2023.
Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion. 3DV 2024.
Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos. arXiv 2023.
SAOR: Single-View Articulated Object Reconstruction. arXiv 2023.
LASSIE: Learning Articulated Shape from Sparse Image Ensemble via 3D Part Discovery. NeurIPS 2022.
Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble. CVPR 2023.
ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections. NeurIPS 2023.
BANMo: Building Animatable 3D Neural Models from Many Casual Videos. CVPR 2022.
RAC: Reconstructing Animatable Categories from Videos. CVPR 2023.