Vol 2 No 1 (March 2026)
Articles

Evaluating the Impact of Behavioural Features on Hindi Speech Emotion Recognition: A Multimodal Deep Learning Approach

Sujata Kotian
Department of Information Technology, University of Mumbai
Santosh Singh
Department of Information Technology, University of Mumbai

Published 24-03-2026

Keywords

  • Hindi Speech
  • Emotion Recognition
  • Behavioural Speech Features
  • Multimodal Deep Learning
  • Prosody

Abstract

Context: Speech Emotion Recognition (SER) is a core component of affective computing, yet it remains ineffective for low-resource languages such as Hindi. Existing SER systems focus on low-level acoustic and prosodic features, while high-level behavioural speech features (e.g., pauses and rhythm) have received little attention despite their importance in human emotional communication.

Objective: This study examined whether explicit behavioural speech features can enhance Hindi SER performance, and how they complement acoustic and prosodic features within a multimodal deep learning system.

Method: A controlled experiment was conducted on a curated Hindi emotional speech corpus of 2,370 utterances from 25 speakers spanning seven emotion classes. Acoustic, prosodic, and behavioural features were extracted and modelled with a dual-branch multimodal deep learning framework combining CNN/transformer and BiLSTM-attention modules.
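To make the dual-branch design described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes a log-mel spectrogram input for the CNN branch and a frame-level prosodic/behavioural feature sequence for the BiLSTM-attention branch, with all layer sizes, the 16-dimensional behavioural feature vector, and the late-fusion classifier chosen purely for illustration (the transformer variant mentioned in the Method is omitted).

```python
# Minimal sketch of a dual-branch SER model (illustrative only, not the paper's code).
import torch
import torch.nn as nn


class SpectrogramCNN(nn.Module):
    """CNN branch: encodes a (batch, 1, mels, frames) log-mel spectrogram."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(64 * 4 * 4, out_dim)

    def forward(self, x):
        return self.proj(self.conv(x).flatten(1))


class BiLSTMAttention(nn.Module):
    """BiLSTM branch with additive attention over frame-level
    prosodic/behavioural feature sequences of shape (batch, frames, feat_dim)."""
    def __init__(self, feat_dim: int, hidden: int = 64, out_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):
        h, _ = self.lstm(x)                       # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)    # per-frame attention weights
        ctx = (w * h).sum(dim=1)                  # weighted context vector
        return self.proj(ctx)


class DualBranchSER(nn.Module):
    """Late fusion of the two branch embeddings into a 7-way emotion classifier."""
    def __init__(self, behav_dim: int = 16, n_classes: int = 7):
        super().__init__()
        self.acoustic = SpectrogramCNN()
        self.behavioural = BiLSTMAttention(behav_dim)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, spec, behav_seq):
        z = torch.cat([self.acoustic(spec), self.behavioural(behav_seq)], dim=-1)
        return self.classifier(z)


if __name__ == "__main__":
    model = DualBranchSER()
    spec = torch.randn(8, 1, 64, 200)   # dummy log-mel spectrograms
    behav = torch.randn(8, 200, 16)     # dummy prosodic/behavioural frame features
    print(model(spec, behav).shape)     # -> torch.Size([8, 7])
```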

Results: The full multimodal model achieved 83.9% accuracy and a macro-F1 of 0.81, significantly outperforming the acoustic-only and acoustic-prosodic baselines. Behavioural features yielded the largest gains for low-arousal emotions such as sadness and neutral, with medium to large effect sizes.
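For readers unfamiliar with the reported metrics, the snippet below shows how accuracy and macro-F1 are conventionally computed with scikit-learn; the label arrays are invented for illustration and bear no relation to the study's data. Macro-F1 averages per-class F1 scores equally, which is why it reflects gains on individual classes such as sadness and neutral.

```python
# Illustration of the evaluation metrics reported above (made-up labels).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 3, 4, 5, 6, 2]   # hypothetical 7-class emotion labels
y_pred = [0, 1, 2, 3, 4, 5, 6, 5]   # hypothetical model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
# macro-F1 gives every emotion class equal weight, regardless of its frequency
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```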

Conclusions: The results show that behavioural speech cues significantly improve the accuracy and robustness of Hindi SER. For practitioners, the findings justify deploying behaviour-aware SER systems; for researchers, they underline the need to explicitly model behavioural characteristics in low-resource and culturally diverse languages.

