- [2024.04]   I am honored to receive the Stanford Graduate Fellowship Award.
- [2024.03]   I will be joining Stanford University as a CS PhD student.
* indicates equal contributions, † indicates equal advising.
|
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
Yunzhi Zhang,
Zizhang Li,
Matt Zhou,
Shangzhe Wu,
Jiajun Wu
arXiv, 2024
Project page /
arXiv /
Code
The Scene Language is a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes.
It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene,
words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity.
|
|
3D Congealing: 3D-Aware Image Alignment in the Wild
Yunzhi Zhang,
Zizhang Li,
Amit Raj,
Andreas Engelhardt,
Yuanzhen Li,
Tingbo Hou,
Jiajun Wu,
Varun Jampani
ECCV, 2024
Project page /
arXiv
3D Congealing aligns semantically similar objects in an unposed 2D image collection to a canonical 3D representation, by fusing prior knowledge from a pre-trained image generative model with semantic information from the input images.
|
|
Learning the 3D Fauna of the Web
Zizhang Li*,
Dor Litvak*,
Ruining Li,
Yunzhi Zhang,
Tomas Jakab,
Christian Rupprecht,
Shangzhe Wu†,
Andrea Vedaldi†,
Jiajun Wu†
CVPR, 2024
Project page /
arXiv /
Code /
Video /
Demo
3D-Fauna learns a pan-category deformable 3D model of more than 100 different animal species using only 2D Internet images as training data, without any prior shape models or keypoint annotations. At test time, the model can turn a single image of a quadruped instance into an articulated, textured 3D mesh in a feed-forward manner, ready for animation and rendering.
|
|
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image
Kyle Sargent,
Zizhang Li,
Tanmay Shah,
Charles Herrmann,
Hong-Xing Yu,
Yunzhi Zhang,
Eric Ryan Chan,
Dmitry Lagun,
Li Fei-Fei,
Deqing Sun,
Jiajun Wu
CVPR, 2024
Project page /
arXiv /
Code
We train a 3D-aware diffusion model, ZeroNVS, on a mixture of scene data sources that capture object-centric, indoor, and outdoor scenes.
This enables zero-shot SDS distillation of 360-degree NeRF scenes from a single image.
Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting.
We also use the MipNeRF-360 dataset as a benchmark for single-image NVS.
|
|
RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction
Zizhang Li,
Xiaoyang Lyu,
Yuanyuan Ding,
Mengmeng Wang,
Yiyi Liao†,
Yong Liu†
ICCV, 2023
arXiv /
Code
We investigate the problems of SDF-based object compositional reconstruction under partial observation,
and propose several regularizations based on geometric priors to achieve a clean and watertight disentanglement.
|
|
Learning a Room with the Occ-SDF Hybrid: Signed Distance Function Mingled with Occupancy Aids Scene Representation
Xiaoyang Lyu,
Peng Dai,
Zizhang Li,
Dongyu Yan,
Yi Lin,
Yifan Peng,
Xiaojuan Qi
ICCV, 2023
Project page /
arXiv /
Code
We study and analyze several key observations about SDF-based volume rendering methods for indoor scene reconstruction. Based on these observations,
we propose an Occ-SDF hybrid representation for better reconstruction performance.
|
|
A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter
Kechun Xu,
Shuqi Zhao,
Zhongxiang Zhou,
Zizhang Li,
Huaijin Pi,
Yifeng Zhu,
Yue Wang,
Rong Xiong
ICRA, 2023
arXiv /
Code
We propose to jointly model vision, language, and action with object-centric representations for the task of
language-conditioned grasping in clutter.
|
|
Failure-aware Policy Learning for Self-assessable Robotics Tasks
Kechun Xu,
Runjian Chen,
Shuqi Zhao,
Zizhang Li,
Hongxiang Yu,
Ci Chen,
Yue Wang,
Rong Xiong
ICRA, 2023
arXiv
We investigate the dependency between self-assessment results and the remaining actions by learning a
failure-aware policy, and propose two policy architectures.
|
|
E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context
Zizhang Li,
Mengmeng Wang,
Huaijin Pi,
Kechun Xu,
Jianbiao Mei,
Yong Liu
ECCV, 2022
arXiv /
Code
We investigate the architecture of frame-wise implicit neural video representations and upgrade it by removing a large portion of redundant parameters and redesigning
the network architecture with a spatial-temporal disentanglement motivation.
|
|
Learning Part Segmentation through Unsupervised Domain Adaptation from Synthetic Vehicles
Qing Liu,
Adam Kortylewski,
Zhishuai Zhang,
Zizhang Li,
Mengqi Guo,
Qihao Liu,
Xiaoding Yuan,
Jiteng Mu,
Weichao Qiu,
Alan Yuille
CVPR, 2022,
oral
arXiv /
Code
We construct a synthetic multi-part dataset covering different object categories,
evaluate different UDA methods for part segmentation on this benchmark, and also provide an improved baseline.
|
|
MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation
Zizhang Li*,
Mengmeng Wang*,
Jianbiao Mei,
Yong Liu
arXiv, 2021
arXiv
We propose to regard the binary mask as a unique modality and train a tri-modal embedding space
on top of ViLT for the referring segmentation task.
|
|
Searching for TrioNet: Combining Convolution with Local and Global Self-Attention
Huaijin Pi,
Huiyu Wang,
Yingwei Li,
Zizhang Li,
Alan Yuille
BMVC, 2021
arXiv /
Code
We propose a weight-sharing NAS method that combines convolution with local and global self-attention operators.
|
- Reviewer of 3DV, AAAI, BMVC, CAI, CVPR, ECCV, ICCV, ICLR, ICML, NeurIPS.
|