SentiAvatar: Towards Expressive and Interactive Digital Humans

Chuhao Jin1,2,*, Rui Zhang2,*, Qingzhe Gao2, Haoyu Shi3, Dayu Wu2,
Yichen Jiang2, Yihan Wu1, Ruihua Song1,†
1 Gaoling School of Artificial Intelligence, Renmin University of China
2 SentiPulse
3 College of Computer Science, Inner Mongolia University
* Equal contribution. Chuhao Jin led this project.  † Corresponding author.

Abstract

We present SentiAvatar, a framework for building expressive, interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Building such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To address these problems, we first build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character, with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on over 200K motion sequences, equipping it with rich action priors that extend well beyond conversational settings. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art results on both SuSuInterActs (R@1 43.64%, nearly 2× the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6 s of output in 0.3 s and supporting unlimited multi-turn streaming.

Dataset

We curate SuSuInterActs, a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Below are representative examples from our dataset, each with the corresponding action/expression label and dialogue text.

Action: Spreads both hands in helpless resignation.
What else? It's not like I can go outside.
Action: Gracefully extends both arms to the sides and spins lightly, tracing an elegant arc.
Watch closely, this is a special way to wear a limited edition.
Action: Slowly raises both arms and crosses them in front of the chest, as if embracing someone who isn't there.
All right, my improvised performance is over. Shouldn't the audience give some applause?
Action: Leans the body far to the left and right, arms swinging along with it.
Oh no, I can't seem to see it from this side either. Come on, move over a bit.
Action: Slowly raises both arms from the sides up over the head, then lets them flow back down the body like water.
Be more careful walking next time; my dance performance isn't free.
Action: Marches in place with high knee lifts, arms swinging broadly in step, like a triumphant soldier.
Of course, it's the guard in charge of chasing away all your worries.

Method

Overview of SentiAvatar. (a) Multi-modal inputs are quantized into tokens by modality-specific encoders. The body pathway employs hierarchical generation: (b) an LLM planner predicts sparse keyframe tokens that capture high-level dialogue content, followed by (c) an audio-aware Infill Transformer that performs dense, prosody-driven interpolation for fine-grained temporal synchronization.

SentiAvatar Method Overview
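The plan-then-infill idea can be sketched numerically. Below, the "planner" and "infill" stages are simple stand-ins for the paper's LLM planner and audio-aware Infill Transformer: sparse keyframe embeddings are emitted first, then dense frames are interpolated between consecutive keyframes and modulated by an audio-energy envelope. All shapes and the energy-scaling rule are assumptions for illustration.

```python
import numpy as np

def plan_keyframes(num_keyframes, token_dim, rng):
    """Stand-in for the LLM planner: sparse keyframe embeddings."""
    return rng.standard_normal((num_keyframes, token_dim))

def infill(keyframes, frames_per_gap, audio_energy):
    """Interpolate dense frames between keyframes, scaled by prosody.

    Linear interpolation replaces the learned Infill Transformer here;
    the point is the two-stage decoupling, not the interpolant.
    """
    dense = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_gap, endpoint=False):
            dense.append((1 - t) * a + t * b)
    dense = np.stack(dense)
    # Prosody-driven modulation: louder audio -> larger motion amplitude.
    return dense * audio_energy[: len(dense), None]

rng = np.random.default_rng(0)
keys = plan_keyframes(5, 8, rng)          # 5 sparse keyframes, 8-dim tokens
energy = np.ones(64)                      # flat audio-energy envelope
motion = infill(keys, 16, energy)         # 4 gaps * 16 frames = 64 frames
```

The decoupling matters for latency: the expensive planner runs only at the sparse keyframe rate, while the cheap dense stage fills in frames at the full motion frame rate.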

Qualitative Analysis

Qualitative comparison of generated motions across methods. Each row shows keyframe sequences for a given motion and speech. Texts and arrows of the same color indicate the same time step. The red arrow indicates an incorrect action.

Qualitative comparison of generated motions

Quantitative Results

Quantitative comparison on SuSuInterActs. Bold: best; ↑/↓: higher/lower is better. ESD in seconds. "†" indicates T2M-GPT variants with token-by-token autoregressive generation.

| Method | Condition | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | ESD ↓ | Diversity ↑ |
|---|---|---|---|---|---|---|---|
| Real Motion | — | 62.20 | 73.56 | 78.70 | 0.000 | 0.308 | 22.61 |
| *Audio-only methods* | | | | | | | |
| EMAGE | Audio | 5.00 | 9.40 | 13.32 | 441.6 | 0.606 | 12.92 |
| A2M-GPT | Audio | 8.72 | 15.96 | 20.08 | 13.66 | 0.477 | 22.23 |
| *Text-only methods* | | | | | | | |
| HunYuan-Motion | Text | 5.21 | 8.59 | 11.90 | 352.56 | 0.708 | 16.92 |
| T2M-GPT | Text | 23.12 | 30.49 | 35.43 | 67.78 | 0.721 | 20.65 |
| MoMask | Text | 34.55 | 46.58 | 54.29 | 36.25 | 0.471 | 22.03 |
| *Audio + Text methods* | | | | | | | |
| AT2M-GPT | Audio, Text | 27.52 | 36.11 | 41.38 | 18.491 | 0.503 | 22.36 |
| SentiAvatar (Ours) | Audio, Text | **43.64** | **54.94** | **61.84** | **8.912** | **0.456** | **22.41** |
| Improvement (%) | | +26.3 | +17.9 | +13.9 | +34.8 | +3.2 | +0.2 |
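For readers unfamiliar with the R@k columns, here is a hedged sketch of how retrieval precision is commonly computed in motion-generation evaluation: generated motions and their ground-truth counterparts are embedded in a shared feature space, each query's candidates are ranked by Euclidean distance, and R@k counts how often the true match (same index) appears in the top k. The embedding model itself is outside this sketch, and this is the standard recipe rather than a claim about the paper's exact evaluator.

```python
import numpy as np

def r_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose true match (same index) ranks in the top k.

    query_emb, gallery_emb: (N, D) arrays in a shared embedding space,
    where gallery index i is the ground-truth match for query i.
    """
    # Pairwise Euclidean distances between all queries and gallery items.
    d = np.linalg.norm(query_emb[:, None, :] - gallery_emb[None, :, :], axis=-1)
    ranks = np.argsort(d, axis=1)                # ascending distance
    hits = [i in ranks[i, :k] for i in range(len(query_emb))]
    return float(np.mean(hits))

# Sanity check: identical embeddings give perfect R@1.
emb = np.eye(4)
perfect = r_at_k(emb, emb, 1)   # 1.0
```

Under this metric, the gap between SentiAvatar's R@1 (43.64) and the Real Motion ceiling (62.20) indicates how much semantic headroom remains.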

Conclusion

In this paper, we presented SentiAvatar, a novel framework for building highly expressive, interactive 3D digital humans, demonstrated through our real-time character, SuSu. To address the scarcity of interactive conversational data and the challenges of motion-prosody synchronization, we introduced SuSuInterActs, a 37-hour multimodal dialogue corpus, and proposed a plan-then-infill architecture powered by a pre-trained Motion Foundation Model on over 200K sequences. This design effectively decouples high-level semantic planning from frame-level audio-driven interpolation, ensuring that the generated motions are both contextually accurate and rhythmically aligned with speech. Experiments demonstrate that SentiAvatar achieves state-of-the-art performance on both SuSuInterActs and BEATv2. With highly efficient generation (0.3 s for 6 s of output) and open-sourced resources, our work provides a robust foundation for future research in natural, real-time virtual human interactions.
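The efficiency claim is worth unpacking with one line of arithmetic: generating 6 s of motion in 0.3 s of wall-clock time is a 20× real-time factor, which is what allows the system to stay ahead of playback in a multi-turn streaming setting.

```python
# Real-time factor implied by the reported numbers (6 s output, 0.3 s latency).
output_seconds = 6.0
latency_seconds = 0.3
real_time_factor = output_seconds / latency_seconds  # ~20x faster than real time
```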

Citation

If you find this work useful in your research, please cite:

@misc{jin2026sentiavatarexpressiveinteractivedigital,
  title={SentiAvatar: Towards Expressive and Interactive Digital Humans},
  author={Chuhao Jin and Rui Zhang and Qingzhe Gao and Haoyu Shi and Dayu Wu and Yichen Jiang and Yihan Wu and Ruihua Song},
  year={2026},
  eprint={2604.02908},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02908},
}