Learning Digital Humans from Vision and Language
Date
2024-10-10
Publisher
ETH Zurich
Abstract
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is driven by the importance of understanding ourselves and by the pivotal role digital humans play in diverse applications, including virtual presence in AR/VR, digital fashion, entertainment, robotics, and healthcare. However, two major challenges hinder the widespread use of digital humans across disciplines: the difficulty of capture, since current methods rely on complex systems that are time-consuming, labor-intensive, and costly; and the lack of understanding, since even after digital humans are created, gaps in understanding their 3D representations and in integrating them with broader world knowledge limit their effective use. Overcoming these challenges is crucial to unlocking the full potential of digital humans in interdisciplinary research and practical applications.
To address these challenges, this thesis combines insights from computer vision, computer graphics, and machine learning to \textbf{develop scalable methods for capturing and modeling digital humans}. These methods include capturing faces, bodies, hands, hair, and clothing using accessible data such as images, videos, and text descriptions. More importantly, \textbf{we go beyond capturing to shift the research paradigm toward understanding and reasoning} by leveraging large language models (LLMs). For instance, we develop the first foundation model that not only captures 3D human poses from a single image, but also reasons about a person’s potential next actions in 3D by incorporating world knowledge. This thesis unifies scalable capture and understanding of digital humans from vision and language data, just as humans learn by observing and interpreting the world through visual and linguistic information.
Our research begins by developing a framework to capture detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby allowing animation of estimated faces with various expressions.
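To make this identity-expression disentanglement concrete, the toy PyTorch sketch below writes a face as a template mesh plus separate identity and expression offsets, in the spirit of parametric face models; the class name, basis sizes, and vertex count are illustrative assumptions, not the actual architecture of the framework described here.
\begin{verbatim}
import torch
import torch.nn as nn

class DisentangledFace(nn.Module):
    """Toy parametric face: identity and expression live in separate codes.
    Basis sizes and vertex count are placeholders, not the thesis's model."""

    def __init__(self, n_verts=5023, n_id=100, n_exp=50):
        super().__init__()
        self.template = nn.Parameter(torch.zeros(n_verts, 3))          # mean face
        self.id_basis = nn.Parameter(torch.zeros(n_id, n_verts, 3))    # identity blendshapes
        self.exp_basis = nn.Parameter(torch.zeros(n_exp, n_verts, 3))  # expression blendshapes

    def forward(self, id_code, exp_code):
        # Identity and expression add independent offsets to the template, so the
        # same identity can be re-animated by swapping only the expression code.
        id_offset = torch.einsum('bi,ivc->bvc', id_code, self.id_basis)
        exp_offset = torch.einsum('be,evc->bvc', exp_code, self.exp_basis)
        return self.template + id_offset + exp_offset

# Animate one estimated identity with two different expressions.
model = DisentangledFace()
id_code = torch.randn(1, 100)
verts_smile = model(id_code, torch.randn(1, 50))
verts_frown = model(id_code, torch.randn(1, 50))
\end{verbatim}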
Humans are more than faces, so we next develop PIXIE, a method for estimating animatable, whole-body 3D avatars with realistic facial details from a single image. By incorporating an attention mechanism, PIXIE surpasses previous methods in accuracy and enables the creation of expressive, high-quality 3D humans.
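As a rough illustration of how an attention-style fusion can help whole-body estimation, the hypothetical module below gates a part-expert feature (e.g. from a face crop) against the body feature before part parameters are regressed; the gating design and feature dimensions are assumptions and do not reproduce PIXIE's actual architecture.
\begin{verbatim}
import torch
import torch.nn as nn

class Moderator(nn.Module):
    """Hypothetical attention-style fusion: weighs a part-expert feature
    (e.g. a face crop) against the body feature. Not PIXIE's exact design;
    dimensions are placeholders."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())

    def forward(self, body_feat, part_feat):
        # The learned gate decides how much to trust the part expert versus the
        # body context, e.g. downweighting a blurry or occluded face crop.
        w = self.gate(torch.cat([body_feat, part_feat], dim=-1))
        return w * part_feat + (1.0 - w) * body_feat

moderator = Moderator()
fused = moderator(torch.randn(1, 2048), torch.randn(1, 2048))  # (1, 2048)
\end{verbatim}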
Expanding beyond the body itself, we propose SCARF and DELTA to capture the body, clothing, face, and hair separately from monocular videos using a hybrid representation. While clothing and hair are better modeled with implicit representations such as neural radiance fields (NeRFs) due to their complex topologies, human bodies are better represented with meshes. SCARF combines the strengths of both by integrating mesh-based bodies with NeRFs for clothing and hair. To enable learning from monocular videos, we introduce mesh-integrated volume rendering (sketched below), which optimizes the model directly from 2D image data without requiring 3D supervision. Thanks to the disentangled modeling, the captured avatar's clothing can be transferred to arbitrary body shapes, making it especially valuable for applications such as virtual try-on.
Building on SCARF's hybrid representation, we introduce TECA, which uses text-to-image generation models to create realistic and editable 3D avatars. TECA produces more realistic avatars than recent methods while allowing edits thanks to its compositional design. For instance, users can input descriptions like ``a slim woman with dreadlocks'' to generate a 3D head mesh with texture and a NeRF model for the hair. It also enables transferring NeRF-based hairstyles, scarves, and other accessories between avatars.
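To illustrate the idea behind mesh-integrated volume rendering, the toy function below composites NeRF samples (clothing and hair) with an opaque body mesh along a batch of rays: samples behind the body surface are discarded, and the mesh absorbs the remaining transmittance so body pixels stay sharp while hair and clothing remain volumetric. The function signature, tensor shapes, and single-function form are simplifying assumptions rather than SCARF's implementation.
\begin{verbatim}
import torch

def mesh_integrated_render(rgb, sigma, deltas, t_vals, mesh_rgb, mesh_depth):
    """Toy compositing of NeRF samples with an opaque body mesh along rays.
    rgb: (R, S, 3), sigma: (R, S), deltas: (R, S), t_vals: (R, S)
    mesh_rgb: (R, 3), mesh_depth: (R,)  (inf where the ray misses the body)."""
    # NeRF samples behind the opaque body surface contribute nothing.
    sigma = torch.where(t_vals < mesh_depth[:, None], sigma, torch.zeros_like(sigma))

    # Standard volume-rendering weights for the surviving samples.
    alpha = 1.0 - torch.exp(-sigma * deltas)                                  # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)   # (R, S+1)
    weights = alpha * trans[:, :-1]                                           # (R, S)

    # The mesh acts as a final, fully opaque sample: it receives all remaining
    # transmittance wherever the ray actually hits the body.
    hit = torch.isfinite(mesh_depth).float()                                  # (R,)
    mesh_weight = trans[:, -1] * hit                                          # (R,)

    return (weights[..., None] * rgb).sum(dim=1) + mesh_weight[:, None] * mesh_rgb

# Minimal usage with random inputs.
R, S = 4, 64
pixels = mesh_integrated_render(
    rgb=torch.rand(R, S, 3), sigma=torch.rand(R, S),
    deltas=torch.full((R, S), 0.01), t_vals=torch.linspace(0, 1, S).expand(R, S),
    mesh_rgb=torch.rand(R, 3), mesh_depth=torch.full((R,), 0.5))
\end{verbatim}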
While these methods make capturing humans more accessible, broader applications require understanding the context of human behavior. Traditional pose estimation methods often isolate subjects by cropping images, which limits their ability to interpret the full scene or reason about actions. To address this, we develop ChatPose, the first model for understanding and reasoning about 3D human poses. ChatPose builds on a multimodal large language model (LLM) with a finetuned projection layer that decodes its embeddings into 3D pose parameters, which are further decoded into 3D body meshes using the SMPL body model. By finetuning on both text-to-3D-pose and image-to-3D-pose data, ChatPose demonstrates, for the first time, that an LLM can directly reason about 3D human poses. This capability allows ChatPose to describe human behavior, generate 3D poses, and reason about potential next actions in 3D, combining perception with reasoning.
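The sketch below shows, under assumed dimensions and layer choices, the kind of projection head described above: it maps the LLM's hidden state at a dedicated pose token to SMPL pose and shape parameters. The two-layer MLP, the 6D-rotation pose size, and the token convention are illustrative assumptions, not ChatPose's exact design.
\begin{verbatim}
import torch
import torch.nn as nn

class PoseProjection(nn.Module):
    """Hypothetical projection head: maps the hidden embedding of a special
    pose token emitted by a multimodal LLM to SMPL parameters."""

    def __init__(self, llm_dim=4096, n_pose=24 * 6, n_shape=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, 1024), nn.GELU(),
            nn.Linear(1024, n_pose + n_shape),
        )
        self.n_pose = n_pose

    def forward(self, pose_token_embedding):
        out = self.mlp(pose_token_embedding)
        pose, shape = out[..., :self.n_pose], out[..., self.n_pose:]
        # `pose` (e.g. a 6D rotation per joint) and `shape` would then be fed
        # to the SMPL body model to obtain a 3D body mesh.
        return pose, shape

# Usage: decode the LLM's hidden state at the pose token.
proj = PoseProjection()
hidden = torch.randn(1, 4096)  # hidden state of the special pose token
pose_params, shape_params = proj(hidden)
\end{verbatim}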
We believe the contributions of this thesis, in scaling up digital human capture and advancing the understanding of humans in 3D, have the potential to shape the future of human-centered research and enable broader applications across diverse fields.