APSIPA Transactions on Signal and Information Processing > Vol 11 > Issue 1

American Sign Language Fingerspelling Recognition in the Wild with Iterative Language Model Construction

Wuttipong Kumwilaisak (wuttipong.kum@kmutt.ac.th), Department of Electronics and Telecommunication Engineering, King Mongkut’s University of Technology Thonburi, Thailand; Peerawat Pannattee, Department of Electronics and Telecommunication Engineering, King Mongkut’s University of Technology Thonburi, Thailand; Chatchawarn Hansakunbuntheung, Assistive Technology and Medical Devices Research Center, National Science and Technology Development Agency, Thailand; Nattanun Thatphithakkul, Assistive Technology and Medical Devices Research Center, National Science and Technology Development Agency, Thailand
 
Suggested Citation
Wuttipong Kumwilaisak, Peerawat Pannattee, Chatchawarn Hansakunbuntheung and Nattanun Thatphithakkul (2022), "American Sign Language Fingerspelling Recognition in the Wild with Iterative Language Model Construction", APSIPA Transactions on Signal and Information Processing: Vol. 11: No. 1, e22. http://dx.doi.org/10.1561/116.00000003

Publication Date: 28 Jul 2022
© 2022 W. Kumwilaisak, P. Pannattee, C. Hansakunbuntheung and N. Thatphithakkul
 
Keywords
Fingerspelling recognition, weakly supervised learning, iterative training, deep learning, Siamese network
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Architectural Description 
Automatic Frame Labeling with Weakly Supervised Learning 
Training Deep CNN with Siamese Network-Based Feature Embedding 
Fingerspelling Recognition with Iterative Model Construction 
Experimental Results 
Conclusion 
References 

Abstract

This paper proposes a novel method to improve the accuracy of American Sign Language fingerspelling recognition. Video sequences from the training set of the “ChicagoFSWild” dataset are first used to train a weakly supervised deep neural network, composed of AlexNet and an LSTM, that automatically generates frame labels from a sequence label. This trained network produces a collection of frame-labeled images from those training video sequences whose predicted sequence has a Levenshtein distance of zero from the sequence label. Negative and positive pairs of all fingerspelling gestures are then randomly formed from the collected image set. These pairs are used to train a Siamese network, consisting of ResNet-50 and a projection function, to produce efficient feature representations. The trained ResNet-50 and projection function are concatenated with a bidirectional LSTM, a fully connected layer, and a softmax layer to form a deep neural network for American Sign Language fingerspelling recognition. From the training video sequences, frames of sequences whose predicted sequence has a Levenshtein distance of zero from the sequence label are added to the collected image set. The updated image set is used to retrain the Siamese network. This process, from training the Siamese network to updating the collected image set, is iterated until recognition performance no longer improves. Experimental results on the “ChicagoFSWild” dataset show that the proposed method surpasses existing works in terms of character error rate.
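The key filtering rule in the iterative pipeline — a training sequence contributes frames to the collected image set only when its predicted sequence matches the sequence label exactly, i.e. has Levenshtein distance zero — can be sketched as follows. This is an illustrative sketch, not the authors' code: `predict`, `grow_collected_set`, and the data layout are hypothetical placeholders standing in for the trained recognizer and dataset.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two label sequences."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance between a[:i] and b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution / match
            prev = cur
    return dp[n]

def grow_collected_set(collected, sequences, predict):
    """Add to `collected` the frames of every sequence whose prediction
    matches its label exactly (Levenshtein distance zero), mirroring the
    update step of the iterative model construction."""
    for frames, label in sequences:
        if levenshtein(predict(frames), label) == 0:
            collected.append((frames, label))
    return collected
```

In each iteration, the recognizer retrained on the updated set is expected to predict more sequences perfectly, so the collected set grows until performance saturates and the loop terminates.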

DOI:10.1561/116.00000003