Cultural dependency analysis for understanding speech emotion

Authors:

Highlights:

Abstract

Speech has been one of the major communication media for years and will remain so until video communication becomes widely available and easily accessible. Although numerous technologies have been developed to improve the effectiveness of speech communication systems, human interaction with machines and robots is still far from ideal. It is acknowledged that humans can communicate effectively with each other through the telephony system. This situation motivates many researchers to study the human communication system in depth, with emphasis on its ability to express and infer emotion for effective social communication. Understanding the interlocutors’ emotion and recognizing the listeners’ perception is the key to boosting communication effectiveness and interaction. Nonetheless, perceived emotion is subjective and depends heavily on culture, environment and the pre-emotional state of the listener. Attempts have been made to understand the influence of culture on speech emotion, and researchers have reported mixed findings, which lead us to believe that there are some common acoustical characteristics that enable similar emotions to be discriminated universally across cultures. Yet there are also unique speech attributes that facilitate emotion recognition exclusive to a particular culture. Understanding culture dependency is thus important to the performance of a speech emotion recognition system. In this paper, three speech emotion databases, namely the Berlin Emo-db, NTU_American and NTU_Asian datasets, were selected to represent European, American and Asian cultures respectively, focusing on three basic emotions (anger, happiness and sadness), with neutral acting as a reference. Different data arrangements, corresponding to varying degrees of culture dependency, were designed for the experimental setup to provide a better understanding of inter-cultural and intra-cultural effects in recognizing speech emotion.
Features were extracted using the Mel Frequency Cepstral Coefficient (MFCC) method and classified with a neural network (Multi Layer Perceptron (MLP)) and two fuzzy neural networks, namely the Adaptive Network Fuzzy Inference System (ANFIS) and the Generic Self-Organizing Fuzzy Neural Network (GenSOFNN), representing precise and linguistic fuzzy rule conjuncts respectively. The experimental results show that culture influences speech emotion recognition accuracy. An accuracy of 75% was recorded for generalized homogeneous intra-cultural experiments, whereas accuracy dropped almost to chance level (25% for 4 classes) for both homogeneous and heterogeneous mixed-cultural inter-cultural experiments. A two-stage culture-sensitive speech emotion recognition approach was subsequently proposed to discriminate culture and speech emotion. The results of the analysis show the potential of the proposed technique for recognizing culture-influenced speech emotion, which can be extended to many applications, for instance call centers and intelligent vehicles. Such analysis may help us to better understand the culture dependency of speech emotion and, as a result, boost the accuracy of speech emotion recognition systems.
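
The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the first stage predicts the culture and the second stage applies an emotion classifier trained only on that culture, which the abstract does not spell out. A simple nearest-centroid classifier stands in for the MLP, ANFIS and GenSOFNN classifiers, two-dimensional vectors stand in for MFCC features, and the culture labels "A" and "B", the class names `NearestCentroid` and `TwoStageRecognizer`, and all data values are hypothetical.

```python
from collections import defaultdict
from math import dist


class NearestCentroid:
    """Stand-in classifier (assumption): predicts the label with the closest mean vector."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for features, label in zip(X, y):
            groups[label].append(features)
        # One centroid per label: the column-wise mean of its feature vectors.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, features):
        return min(self.centroids, key=lambda label: dist(features, self.centroids[label]))


class TwoStageRecognizer:
    """Stage 1 predicts culture; stage 2 uses that culture's own emotion classifier."""

    def fit(self, X, cultures, emotions):
        self.culture_clf = NearestCentroid().fit(X, cultures)
        self.emotion_clfs = {}
        for culture in set(cultures):
            idx = [i for i, c in enumerate(cultures) if c == culture]
            self.emotion_clfs[culture] = NearestCentroid().fit(
                [X[i] for i in idx], [emotions[i] for i in idx]
            )
        return self

    def predict(self, features):
        culture = self.culture_clf.predict(features)
        emotion = self.emotion_clfs[culture].predict(features)
        return culture, emotion


# Synthetic toy data (hypothetical, not from the paper): the second feature
# signals anger in culture "A" but sadness in culture "B".
X = [
    [0.0, 1.0], [0.2, 1.2],      # culture A, anger
    [0.0, -1.0], [0.2, -1.2],    # culture A, sadness
    [10.0, -1.0], [10.2, -1.2],  # culture B, anger
    [10.0, 1.0], [10.2, 1.2],    # culture B, sadness
]
cultures = ["A"] * 4 + ["B"] * 4
emotions = ["anger", "anger", "sadness", "sadness"] * 2

model = TwoStageRecognizer().fit(X, cultures, emotions)
print(model.predict([0.1, 0.9]))
print(model.predict([9.9, 0.9]))
```

In this toy setting the same value of the second feature maps to different emotions depending on the predicted culture, which is the kind of culture-dependent cue that a single culture-blind classifier would struggle to separate.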

Keywords: Inter-cultural and Intra-cultural Assessment, Speech Emotion Recognition System, Affective space model, Mel Frequency Cepstral Coefficient, Neural Network, Fuzzy Neural Network

Article history: Available online 18 November 2011.

Official paper URL: https://doi.org/10.1016/j.eswa.2011.11.028