IJFANS International Journal of Food and Nutritional Sciences

ISSN Print: 2319-1775, Online: 2320-7876

Recognizing sign language from multiple camera angles using 2D video skeleton data


Ch. Raghava Prasad
DOI: 10.48047/IJFANS/11/S6/004

Abstract

In this study, we propose a view-oriented feature fusion (VOFF) method for multi-stream CNNs. A CNN model is trained on nine different camera perspectives, which are grouped into three subsets according to the camera's position relative to the action: far left, middle, and far right. After each subset is processed by its dense network, the three subsets are fused over a common set of features, and a prediction is made by accumulating the scores of the three softmax layers. The results demonstrate that fusing spatial features across viewpoints yields a strongly discriminative view feature vector. However, while this fusion technique produces a good view feature distribution, it fails to distinguish signs that appear visually identical across several views. To address this difficulty, we investigated a contrastive network with triplet loss embedding (CNTLE). In this framework, the viewpoints are paired into a support set of positive-class and negative-class perspectives, and the CNN networks are trained with a global cross-entropy loss together with view-specific triplet losses. Pairing views of inter-class signs with similar physical appearance as negative examples mitigates the model's earlier shortcoming, allowing it to produce satisfactory view-invariant features for classification.
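As a rough illustration of the architecture described above, the following PyTorch sketch builds three view-group streams whose softmax scores are accumulated for prediction (the VOFF fusion), and a combined objective pairing a global cross-entropy loss with a view-specific triplet loss (in the spirit of CNTLE). The stream depth, the skeleton-map input shape, the class count, the triplet margin, and the loss weight `alpha` are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the VOFF/CNTLE ideas above, assuming PyTorch. Network
# sizes, input shapes, and loss hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ViewStream(nn.Module):
    """One CNN stream for a view group (far left, middle, or far right)."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to a 64-d view embedding
        )
        self.dense = nn.Linear(64, num_classes)

    def forward(self, x):
        emb = self.features(x).flatten(1)  # view-specific embedding
        return self.dense(emb), emb        # class logits and embedding


class VOFF(nn.Module):
    """Fuses three view-group streams by accumulating their softmax scores."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.streams = nn.ModuleList(
            ViewStream(in_channels, num_classes) for _ in range(3)
        )

    def forward(self, views):
        # views: list of three tensors, one batch per view group
        outs = [stream(v) for stream, v in zip(self.streams, views)]
        logits = torch.stack([l for l, _ in outs])       # (3, B, C)
        fused = torch.softmax(logits, dim=2).sum(dim=0)  # accumulated scores
        return fused, logits.sum(dim=0), [e for _, e in outs]


# Combined objective: a global cross-entropy loss (here applied to the summed
# logits, an assumption) plus a view-specific triplet loss whose negative is
# drawn from a visually similar sign of another class.
xent = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=1.0)

def cntle_loss(global_logits, labels, anchor, positive, negative, alpha=0.5):
    return xent(global_logits, labels) + alpha * triplet(anchor, positive, negative)


model = VOFF(in_channels=3, num_classes=50)
views = [torch.randn(4, 3, 64, 64) for _ in range(3)]  # dummy skeleton maps
fused_scores, global_logits, embeddings = model(views)
prediction = fused_scores.argmax(dim=1)
```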
