SignAvatar: Sign Language 3D Motion Reconstruction and Generation

University at Buffalo, SUNY
SignAvatar Teaser

SignAvatar excels at two tasks: reconstructing 3D sign language motions from videos and generating them from semantic inputs (images or text). The top row shows a sign language video for "drink"; note the motion blur in some frames. The middle row shows the 3D avatar reconstruction by SignAvatar, and the bottom row demonstrates its ability to generate a 3D signing avatar from the word "drink".

Abstract

Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model’s robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities.
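
The objective behind a conditional variational autoencoder, as named in the abstract, can be sketched in a few lines. This is a minimal, framework-free illustration and not the authors' implementation: the function names, the squared-error reconstruction term, and the `kl_weight` parameter are assumptions made for this example, and the conditioning signal that a conditional VAE passes to its encoder and decoder is omitted for brevity.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (the reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def cvae_loss(target, recon, mu, logvar, kl_weight=1.0):
    """Reconstruction error plus weighted KL term.

    Squared error and kl_weight are illustrative assumptions; the page does
    not state which loss terms or weights SignAvatar uses.
    """
    recon_err = sum((t - r) ** 2 for t, r in zip(target, recon))
    return recon_err + kl_weight * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [0.5, -0.5], [0.0, 0.0]
    z = reparameterize(mu, logvar, rng)          # latent sample fed to a decoder
    print(len(z), cvae_loss([1.0, 2.0], [0.9, 2.1], mu, logvar))
```

In an autograd framework the same expressions apply to tensors. A curriculum learning strategy would sit on top of this as a training schedule, which this page does not specify.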

SignAvatar

SignAvatar Overview

SignAvatar pipeline. The upper row represents the reconstruction process, which absorbs knowledge and analyzes the relationships among its inputs, while the lower row represents the generation process, which outputs knowledge. Unlike the rigid mapping of sign language production, the input text or images in sign language generation allow greater semantic flexibility.

SignAvatar can also accept images as input. Given the image on the left, SignAvatar uses the text-image embedding of CLIP to recognize the corresponding semantics ("book") and generates the corresponding 3D signing motion. The upper row shows the front view and the lower row shows the side view.
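
One plausible way to read "recognize the corresponding semantics" is a nearest-neighbor lookup in a shared text-image embedding space. The sketch below illustrates that idea only. The toy 3-dimensional vectors stand in for real CLIP outputs, the names `cosine` and `nearest_word` are hypothetical, and the page does not say whether SignAvatar performs this lookup or conditions its decoder on the image embedding directly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(image_embedding, word_embeddings):
    """Return the vocabulary word whose embedding is most similar to the image."""
    return max(word_embeddings, key=lambda w: cosine(image_embedding, word_embeddings[w]))


# Toy stand-ins for text embeddings of sign words (assumed values, not CLIP outputs).
word_embeddings = {
    "book": [0.9, 0.1, 0.0],
    "drink": [0.0, 0.8, 0.6],
    "chair": [0.1, 0.0, 0.99],
}

# Toy stand-in for the embedding of an input image showing a book.
image_embedding = [0.8, 0.2, 0.1]

print(nearest_word(image_embedding, word_embeddings))  # prints "book"
```

With real encoders, the image and each candidate word would be embedded by the same pretrained text-image model before the comparison.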

BibTeX

@inproceedings{dong2024signavatar,
  title     = {SignAvatar: Sign Language 3D Motion Reconstruction and Generation},
  author    = {Dong, Lu and Chaudhary, Lipisha and Xu, Fei and Wang, Xiao and Lary, Mason and Nwogu, Ifeoma},
  booktitle = {2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)},
  year      = {2024},
  organization = {IEEE}
}