Souza, João Marcelo Silva; https://orcid.org/0009-0001-5562-5337; http://lattes.cnpq.br/1431973892971280
Abstract:
In Human-Robot Interaction (HRI), the visual estimation of biosignals over time is essential for extracting human features, interpreting behaviors, and providing various forms of cyber-physical feedback and stimuli. In this context, Facial Expression Recognition (FER) systems have been developed to automate the computational analysis of human behavior, a process that requires meticulous observation and complex, integrated processing of spatiotemporal correlations. However, current FER systems and datasets predominantly explore spatial, static, or instantaneous aspects, which limits the investigation of facial muscle deformations and motion over time in real-world situations. To overcome this limitation, this work proposes an alternative to the conventional image domain, connecting the visual representation of points of interest to temporal descriptors. To this end, the points are tracked over time, normalized spatiotemporally, and converted into metrics that generate motion signatures represented as multivariate time series. This work presents: the proposed methodology, termed Visual-Temporal FER (VT-FER), along with its corresponding framework; 22 standardized face measurements based on the principles of the Facial Action Coding System (FACS); the pipeline architecture for computational systems; and a new dataset, the Facial Biosignals Time-Series (FBioT), comprising more than 21,000 seconds of real-world footage collected in uncontrolled environments from public sources. The prototype results validated the temporal hypotheses of the proposed approach, achieving accuracy levels compatible with benchmarks from the scientific community: 94% accuracy for emotion detection in controlled environments, using a neural network trained on reference data from the Extended Cohn-Kanade (CK+) dataset, and 72% for arousal detection in uncontrolled environments, using the Acted Facial Expressions In The Wild - Valence and Arousal (AFEW-VA) dataset as reference. Additionally, the FBioT dataset enabled exploration of the methodology's potential for neural network development, reaching 80% accuracy in the visual-temporal detection of emotions during conversations and 88% in visual word identification from mouth-movement analysis over time.
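To make the landmark-to-time-series idea concrete, the sketch below shows one plausible way to turn tracked facial points into a normalized multivariate motion signature. It is a minimal illustration, not the VT-FER implementation: the landmark indices follow the common 68-point face-landmark convention, the two metrics are stand-ins for the 22 FACS-based measurements described above, and the normalization choices (inter-ocular scale, per-second velocity) are assumptions for the example.

```python
import numpy as np

def motion_signature(landmarks, fps):
    """Convert tracked facial landmarks into a multivariate time series.

    landmarks: array of shape (T, N, 2) -- N 2D points tracked over T frames.
    fps: capture frame rate, used for temporal normalization.
    Returns an array of shape (T - 1, 2 * M) for M illustrative metrics.
    """
    # Spatial normalization (assumed scheme): center the points per frame and
    # divide by the inter-ocular distance (assumed eye-corner indices 36, 45)
    # so the metrics are invariant to face position and size in the image.
    left_eye, right_eye = landmarks[:, 36], landmarks[:, 45]
    scale = np.linalg.norm(right_eye - left_eye, axis=1, keepdims=True)
    center = landmarks.mean(axis=1)
    norm = (landmarks - center[:, None, :]) / scale[:, None, :]

    # Illustrative per-frame metrics, stand-ins for the 22 standardized
    # measurements: mouth opening (inner-lip gap) and brow-to-eye distance.
    mouth_open = np.linalg.norm(norm[:, 62] - norm[:, 66], axis=1)
    brow_raise = np.linalg.norm(norm[:, 19] - norm[:, 37], axis=1)
    series = np.stack([mouth_open, brow_raise], axis=1)

    # Temporal normalization (assumed): frame-to-frame velocity of each
    # metric scaled by fps, so descriptors are per-second and comparable
    # across footage captured at different frame rates.
    velocity = np.diff(series, axis=0) * fps
    return np.hstack([series[1:], velocity])

# Example with placeholder data: 300 frames of 68 points at 30 fps.
signature = motion_signature(np.random.rand(300, 68, 2), fps=30)
print(signature.shape)  # (299, 4): two metrics plus their velocities
```

A sequence of such rows is exactly the kind of multivariate time series that can feed the neural networks mentioned in the results, in place of raw image frames.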