Multimodal sentiment analysis
Multimodal sentiment analysis extends traditional text-based sentiment analysis to additional modalities such as audio and visual data.[1] It can be bimodal, combining any two modalities, or trimodal, incorporating three modalities.[2] With the large volume of social media data available online in forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis,[3] which can be applied in the development of virtual assistants,[4] the analysis of YouTube movie reviews,[5] the analysis of news videos,[6] and emotion recognition (sometimes known as emotion detection), for example in depression monitoring,[7] among others.
As in traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is sentiment classification, which assigns sentiments to categories such as positive, negative, or neutral.[8] The complexity of analyzing text, audio, and visual features to perform this task requires the application of fusion techniques such as feature-level, decision-level, and hybrid fusion.[3] The performance of these fusion techniques and of the classification algorithms applied is influenced by the type of textual, audio, and visual features employed in the analysis.[9]
Features
Feature engineering, which involves the selection of features that are fed into machine learning algorithms, plays a key role in sentiment classification performance.[9] In multimodal sentiment analysis, a combination of different textual, audio, and visual features is employed.[3]
Textual features
Similar to the conventional text-based
Audio features
Visual features
One of the main advantages of analyzing videos rather than text alone is the presence of rich sentiment cues in visual data.[16] Visual features include facial expressions, which are of paramount importance in capturing sentiments and emotions, as they are a main channel for conveying a person's present state of mind.[3] In particular, the smile is considered to be one of the most predictive visual cues in multimodal sentiment analysis.[11] OpenFace is an open-source facial analysis toolkit available for extracting and understanding such visual features.[17]
Fusion techniques
Unlike traditional text-based sentiment analysis, multimodal sentiment analysis involves a fusion process in which data from different modalities (text, audio, or visual) are fused and analyzed together.[3] Existing approaches to data fusion in multimodal sentiment analysis can be grouped into three main categories: feature-level, decision-level, and hybrid fusion. The performance of the sentiment classification depends on which type of fusion technique is employed.[3]
Feature-level fusion
Feature-level fusion (sometimes known as early fusion) gathers all the features from each modality (text, audio, or visual) and joins them together into a single feature vector, which is eventually fed into a classification algorithm.[18] One of the difficulties in implementing this technique is the integration of the heterogeneous features.[3]
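The concatenation step described above can be illustrated with a minimal sketch. It assumes each modality has already been reduced to a fixed-length feature vector per utterance; the feature dimensions, the random data, and the helper names `zscore` and `early_fusion` are hypothetical and chosen only for illustration. The per-modality standardization shown here is one simple, assumed way of reducing scale differences between heterogeneous features, not a method prescribed by the cited sources.

```python
import numpy as np


def zscore(x):
    """Standardize each feature column to zero mean and (near) unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)


def early_fusion(text, audio, visual):
    """Join per-modality feature matrices into one matrix along the feature axis."""
    return np.concatenate([zscore(text), zscore(audio), zscore(visual)], axis=1)


rng = np.random.default_rng(0)
n_utterances = 6                               # hypothetical sample count
text = rng.normal(size=(n_utterances, 4))      # 4 made-up textual features
audio = rng.normal(size=(n_utterances, 3))     # 3 made-up audio features
visual = rng.normal(size=(n_utterances, 2))    # 2 made-up visual features

# Each row of `fused` is the single feature vector for one utterance,
# ready to be passed to any classification algorithm.
fused = early_fusion(text, audio, visual)
print(fused.shape)
```

The fused matrix has one row per utterance and one column per feature across all modalities, so a single classifier sees all modalities at once.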
Decision-level fusion
Decision-level fusion (sometimes known as late fusion) feeds data from each modality (text, audio, or visual) independently into its own classification algorithm, and obtains the final sentiment classification result by fusing the individual results into a single decision vector.[18] One of the advantages of this fusion technique is that it eliminates the need to fuse heterogeneous data, and each modality can use its most appropriate classification algorithm.[3]
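As a rough illustration of fusing per-modality results, the sketch below assumes that each modality's classifier has already produced class probabilities for every sample and combines them with a weighted average. Weighted averaging is only one possible fusion rule, and the function name `late_fusion`, the label order, and the probability values are hypothetical.

```python
import numpy as np

LABELS = ["negative", "neutral", "positive"]  # assumed label order


def late_fusion(probs_by_modality, weights=None):
    """Fuse per-modality class probabilities by a weighted average.

    probs_by_modality: list of arrays, each of shape (samples, classes).
    weights: optional per-modality weights; defaults to equal weighting.
    """
    stacked = np.stack(probs_by_modality)          # (modalities, samples, classes)
    if weights is None:
        weights = np.ones(len(probs_by_modality))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize so weights sum to 1
    fused = np.tensordot(weights, stacked, axes=1)  # (samples, classes)
    return fused, fused.argmax(axis=1)


# Made-up outputs of three independent classifiers for two samples.
text_p = np.array([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]])
audio_p = np.array([[0.1, 0.2, 0.7], [0.3, 0.4, 0.3]])
visual_p = np.array([[0.3, 0.3, 0.4], [0.5, 0.3, 0.2]])

fused, preds = late_fusion([text_p, audio_p, visual_p])
print([LABELS[i] for i in preds])
```

Because only the outputs are combined, each modality's classifier can be of a different type, which matches the advantage noted above.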
Hybrid fusion
Hybrid fusion is a combination of feature-level and decision-level fusion techniques, which exploits complementary information from both methods during the classification process.[5] It usually involves a two-step procedure: feature-level fusion is first performed between two modalities, and decision-level fusion is then applied to fuse the results of that first step with those of the remaining modality.[19][20]
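The two-step procedure can be sketched as follows. This is an illustrative assumption rather than the method of the cited works: audio and visual are arbitrarily chosen as the two modalities fused first, the "classifiers" are untrained linear models with random placeholder weights standing in for trained ones, and the equal 0.5 weighting in the second step is likewise hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def linear_classifier(features, weights):
    """Stand-in for a trained model: maps features to class probabilities."""
    return softmax(features @ weights)


rng = np.random.default_rng(0)
n_samples, n_classes = 5, 3
text = rng.normal(size=(n_samples, 4))     # made-up textual features
audio = rng.normal(size=(n_samples, 3))    # made-up audio features
visual = rng.normal(size=(n_samples, 2))   # made-up visual features

# Step 1: feature-level fusion of two modalities, then one classifier on the result.
audio_visual = np.concatenate([audio, visual], axis=1)
w_av = rng.normal(size=(audio_visual.shape[1], n_classes))  # placeholder weights
p_av = linear_classifier(audio_visual, w_av)

# The remaining modality gets its own classifier.
w_text = rng.normal(size=(text.shape[1], n_classes))        # placeholder weights
p_text = linear_classifier(text, w_text)

# Step 2: decision-level fusion of the two probability outputs.
p_final = 0.5 * p_av + 0.5 * p_text
pred = p_final.argmax(axis=1)
print(p_final.shape, pred.shape)
```

In practice the placeholder weights would be replaced by models trained on labeled data, and the choice of which modalities to fuse first is a design decision.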
Applications
Like text-based sentiment analysis, multimodal sentiment analysis can be applied in the development of different forms of recommender systems, for example in the analysis of user-generated videos of movie reviews[5] and general product reviews,[21] to predict the sentiments of customers and subsequently create product or service recommendations.[22] Multimodal sentiment analysis also plays an important role in the advancement of virtual assistants through the application of natural language processing (NLP) and machine learning techniques.[4] In the healthcare domain, multimodal sentiment analysis can be used to detect certain medical conditions such as stress, anxiety, or depression.[7] Multimodal sentiment analysis can also be applied to understanding the sentiments contained in video news programs, which is considered a complicated and challenging domain, as sentiments expressed by reporters tend to be less obvious or neutral.[23]
References
- S2CID 19491070.
- .
- ^ S2CID 205433041.
- ^ a b "Google AI to make phone calls for you". BBC News. 8 May 2018. Retrieved 12 June 2018.
- ^ S2CID 12789201.
- arXiv:1604.02612 [cs.CL].
- ^ S2CID 24408937.
- ISBN 978-1601981509.
- ^ .
- S2CID 5275807.
- ^ S2CID 1132247.
- S2CID 342649.
- S2CID 52853112.
- S2CID 2081569.
- S2CID 1257599.
- .
- )
- ^ S2CID 15287807.
- )
- .
- ^ Pérez-Rosas, Verónica; Mihalcea, Rada; Morency, Louis Philippe (1 January 2013). "Utterance-level multimodal sentiment analysis". Long Papers. Association for Computational Linguistics (ACL).
- ^ Chui, Michael; Manyika, James; Miremadi, Mehdi; Henke, Nicolaus; Chung, Rita; Nel, Pieter; Malhotra, Sankalp. "Notes from the AI frontier. Insights from hundreds of use cases". McKinsey & Company. Retrieved 13 June 2018.
- S2CID 14112246.