SignAvatar: Sign Language 3D Motion Reconstruction and Generation

Abstract

Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model’s robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities.

BibTeX

@inproceedings{dong2024signavatar,
  title = {Signavatar: Sign language 3d motion reconstruction and generation},
  author = {Dong, Lu and Chaudhary, Lipisha and Xu, Fei and Wang, Xiao and Lary, Mason and Nwogu, Ifeoma},
  booktitle = {2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)},
  year = {2024},
  organization = {IEEE}
}

SignAvatar Overview

SignAvatar can also accept images as input. Given the image on the left, SignAvatar uses CLIP's shared text-image embedding to recognize the corresponding word, "book", and then generates the matching 3D signing motion. The upper row shows the front view and the lower row shows the side view.
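The page does not say how the word is recovered from the image embedding. One plausible reading, offered here only as a minimal sketch, is a nearest-neighbor lookup: embed the image and each candidate sign word in the same space, then pick the word with the highest cosine similarity. The function names (`cosine`, `recognize_word`) and the toy 3-D vectors below are hypothetical stand-ins for real CLIP embeddings and are not taken from the SignAvatar code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recognize_word(image_embedding, word_embeddings):
    """Return the word whose text embedding is closest to the image embedding."""
    return max(
        word_embeddings,
        key=lambda word: cosine(image_embedding, word_embeddings[word]),
    )

# Toy vectors standing in for CLIP text embeddings (hypothetical values).
word_embeddings = {
    "book": [0.9, 0.1, 0.0],
    "drink": [0.1, 0.9, 0.1],
    "chair": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the CLIP embedding of the input image.
image_embedding = [0.8, 0.2, 0.1]

predicted = recognize_word(image_embedding, word_embeddings)
print(predicted)
```

In a setup like this, the recognized word would then serve as the condition passed to the motion generator. How SignAvatar actually wires the two together is not described on this page.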
