I received my M.Eng from the Department of Control Science and Engineering at Zhejiang University in 2024, where I was
advised by Prof. Yong Liu in the April Lab.
I obtained my B.Eng from the same department with an honors degree from Chu Kochen Honors College in 2021.
The Scene Language is a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes.
It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene,
words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity.
3D Congealing aligns semantically similar objects in an unposed 2D image collection to a canonical 3D representation by fusing prior knowledge from a pre-trained image generative model with semantic information from the input images.
3D-Fauna learns a pan-category deformable 3D model of more than 100 different animal species using only 2D Internet images as training data, without any prior shape models or keypoint annotations. At test time, the model can turn a single image of a quadruped instance into an articulated, textured 3D mesh in a feed-forward manner, ready for animation and rendering.
We train a 3D-aware diffusion model, ZeroNVS, on a mixture of scene data sources that capture object-centric, indoor, and outdoor scenes.
This enables zero-shot SDS distillation of 360-degree NeRF scenes from a single image.
Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting.
We also use the MipNeRF-360 dataset as a benchmark for single-image NVS.
We investigate the problems that arise in SDF-based compositional object reconstruction under partial observation,
and propose regularizations based on geometric priors to achieve a clean, watertight disentanglement.
We study and analyze several key observations about SDF-based volume rendering methods for indoor scene reconstruction. Based on these observations,
we propose an Occ-SDF hybrid representation that improves reconstruction performance.
We investigate the dependency between self-assessment results and the remaining actions by learning a
failure-aware policy, and propose two policy architectures.
We investigate the architecture of frame-wise implicit neural video representations, prune a large portion of redundant parameters, and redesign
the network architecture based on spatial-temporal disentanglement.
We construct a synthetic multi-part dataset with different categories of objects,
evaluate different unsupervised domain adaptation (UDA) methods for part segmentation on this benchmark, and also provide an improved baseline.
We transform the non-differentiable AP metric into a differentiable parameterized AP (PAP) loss by utilizing Bézier curve parameterization. We further
use PPO to search the parameters and show improved performance of the PAP loss on various detectors.