Summary

Understanding consumer sentiment can go a long way in helping companies position their products and services. Advances in AI are making significant strides in understanding natural language, opening doors to new insights for companies to tap into. This paper looks at how these models have evolved and what we can expect from this technology in the coming years.

Five years ago, we leveraged sentiment analysis on public Twitter data to understand the sentiment surrounding a popular tech event for one of our clients. Though the results were binary (positive or negative), the instant insights were remarkable. Since then, the field has moved well beyond binary polarity: to intensity (the strength of the emotion), to aspect-based analysis, and on to Natural Language Understanding (NLU), which derives actionable insights from text. The output is no longer flattened to a single dimension but can capture the richness and complex structure of human emotion.

Applied Machine Learning beyond Polarity

Companies are increasingly leveraging sentiment analysis to gather product feedback, and some have adopted automated systems that convert online brand mentions into tickets. This requires technology that understands not just the sentiment echoed by the user but the text at a deeper level. For instance, topic modeling techniques (e.g., Latent Dirichlet Allocation) can automatically identify the key topics in a text, which can then be mapped to the various 'aspects' of an entity to perform aspect-specific sentiment analysis, yielding more specific and actionable insights into the product, e.g., "the mobile phone display is excellent; its camera is awesome, but its battery discharges quickly."
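As a rough illustration of aspect-level scoring, here is a minimal sketch assuming the Hugging Face transformers library is installed; the clause splitting and the keyword-to-aspect mapping are deliberately naive stand-ins for topic modeling or a trained aspect extractor:

```python
# Sketch of aspect-based sentiment analysis: split a review into clauses,
# map each clause to an aspect via simple keyword matching (a real system
# would use topic modeling or a trained aspect extractor), then score each
# clause with an off-the-shelf sentiment model.
import re
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model on first run

# Hypothetical keyword-to-aspect mapping for the phone-review example
ASPECT_KEYWORDS = {"display": "display", "screen": "display",
                   "camera": "camera", "battery": "battery"}

review = ("The mobile phone display is excellent; its camera is awesome, "
          "but its battery discharges quickly.")

for clause in re.split(r"[;,.]", review):
    clause = clause.strip()
    if not clause:
        continue
    aspects = {a for kw, a in ASPECT_KEYWORDS.items() if kw in clause.lower()}
    if aspects:
        result = classifier(clause)[0]
        print(aspects, result["label"], round(result["score"], 3))
# Expected along the lines of: {'display'} POSITIVE ..., {'battery'} NEGATIVE ...
```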

Another challenge for sentiment analysis is the growing complexity of texts. They are no longer restricted to restaurant or movie reviews and now range from a few words to entire documents. With shorter texts, the model must handle spelling shortcuts, typos, emojis, and a host of related issues; with longer texts comes the challenge of memory constraints while processing larger chunks of data at once. In all cases, the model needs to seamlessly handle:

  • Syntactics: sentence boundary disambiguation, POS tagging, text chunking, lemmatization, etc.
  • Semantics: word sense disambiguation, concept extraction, named entity recognition, etc. (both layers are illustrated in the sketch after this list), and
  • The vagaries of human nature reflected in the text people generate: sarcasm, irony, humor, metaphors, etc.
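Several of these syntactic and semantic tasks are available off the shelf. A minimal sketch using the spaCy library, assuming its small English model (en_core_web_sm) is installed:

```python
# Sketch: one pass of a standard NLP pipeline (spaCy) covering several of the
# tasks above: sentence boundaries, POS tags, lemmas, and named entities.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Apple's new phone sells well. The battery, however, drains fast!")

for sent in doc.sents:                  # sentence boundary disambiguation
    print("SENTENCE:", sent.text)
for token in doc:                       # POS tagging and lemmatization
    print(token.text, token.pos_, token.lemma_)
for ent in doc.ents:                    # named entity recognition
    print("ENTITY:", ent.text, ent.label_)
```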

Modern-day applications of sentiment analysis span a wide spectrum beyond product feedback and include:

  • Analyzing public opinion – Politics, Health (e.g., COVID-19 vaccine sentiment), Law, etc.
  • Identifying toxicity in online communication
  • Applications in the medical domain – e.g., the fields of mental health, psychology, etc.
  • Applications in the field of business intelligence, e-commerce, and even Quality Assurance
  • Building better recommender systems by feeding in sentiment signals as an input
  • Financial applications, such as analyzing emotive aspects of news texts that could influence the prices and volatility of traded assets (e.g., commodities)

Rise of the Transformer Architectures

From a technology standpoint, sentiment analysis has grown significantly. We have come a long way from lexicon-based approaches to modern-day transformers like BERT (Bidirectional Encoder Representations from Transformers). Technology has overcome many challenges on this journey. For example, in earlier models, words had static representations, so the word 'bank' had the same representation in 'I went to the river bank for a walk' and 'I went to the bank and deposited a cheque.' Sometimes the context is less subtle: engine noise associated with a vehicle might be viewed in a different light by a Rolls-Royce owner than by a Harley-Davidson enthusiast. Finally, there are sarcasm and irony: the model needs to perceive the negative undertones in 'This book is an excellent read for insomniacs. A brilliant cure!' or 'The great customer support team took their own sweet time to respond.'

A landmark 2014 paper on the concept of 'Attention' in neural networks by Bahdanau et al. showed how models could be vastly improved by teaching them to pay more attention to specific words in the given text. This triggered a series of developments culminating in the transformer architecture released by Google ('Attention Is All You Need', 2017), on which the formidable language model BERT, released by Google in 2018, was built.
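At its heart, attention is a learned weighted average: each output position mixes information from all input positions, with the mixing weights computed from the data itself. A minimal numpy sketch of the scaled dot-product attention at the core of the transformer:

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., 2017):
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: attention weights
    return weights @ V                                 # weighted average of values

# Toy example: 3 tokens, 4-dimensional representations
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V))
```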

BERT ushered in the era of contextualized word representations, where a single word can have different vectors depending on the sentence it appears in. BERT was trained in an interesting way: its training was self-supervised, with two objectives. The first, Masked Language Modeling (MLM), randomly masks word tokens in the input text and makes the model "guess" the original words, forcing it to learn the context of each word. The second, Next Sentence Prediction (NSP), builds relationships across sentences, though this objective has been dropped in later language models.
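The MLM objective is easy to see in action. A small sketch using the Hugging Face transformers library and the 'bank' example from earlier (the model downloads on first run):

```python
# Sketch: BERT's masked-language-modeling objective in action.
# The model must "guess" the masked word from its context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I went to the [MASK] and deposited a cheque."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Top guesses are typically bank-like words; with 'river' in the context
# instead, the predicted distribution shifts accordingly.
```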

But the real impact of BERT was in transfer learning. Unlike computer vision, where pre-trained models could be re-used for image recognition tasks with minimal lines of code, in NLU the process was far more complicated because textual data is used in widely different contexts. Since BERT is context-aware, however, it can easily be fine-tuned on a much smaller domain-specific dataset, resulting in a final model that understands domain-specific nuances as well (say, texts related to the psychology domain). This was an inflection point in NLU, because data collection and labeling is a costly exercise. BERT makes it possible to make sense of very limited domain data: even a couple of thousand labeled data points can be used to fine-tune the 340 million weights and biases inside BERT-large so that it understands domain-specific texts with reasonable accuracy.

While the original transformer was a neural machine translation (NMT) model with an encoder and a decoder, BERT is based on the encoder half. The architecture is powerful enough to take spelling mistakes, typos, and unknown words in stride and accurately infer their meaning from context, thanks in part to subword (WordPiece) tokenization.
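A small illustration of that tokenization, assuming the transformers library; an unknown or misspelled word is broken into known pieces rather than discarded:

```python
# Sketch: WordPiece tokenization breaks unknown or misspelled words into
# known subword units instead of mapping them to a single [UNK] token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The delivary was exellent"))
# Misspelled words surface as sequences of known subword pieces (prefixed
# with '##'), which downstream layers can still interpret from context.
```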

Sentiment analysis or NLU is performed using one or more of the following approaches:

  • Direct: The language model is used to make predictions directly by attaching a simple linear classification head on top of it.
  • Indirect: The word vector representations are pulled out and plugged into any AI architecture of choice. The art and technique of doing this is a detailed subject in itself.
  • Fine-Tuned: The entire language model is trained on a (smaller) labeled dataset to acquire domain-specific knowledge using a domain-specific objective (see the sketch after this list). Care must be taken to avoid catastrophic forgetting, whereby fine-tuning alters the model's weights so much that it loses the general language understanding acquired during pre-training.
  • Pre-trained: The language model is incrementally pre-trained on the original objective of MLM but using domain-specific data.
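A minimal sketch of the fine-tuned approach using the Hugging Face Trainer API; the two-example dataset is a toy stand-in for a few thousand labeled domain examples, and the hyper-parameters are illustrative (a small learning rate is one common guard against catastrophic forgetting):

```python
# Sketch: fine-tuning a pre-trained language model for sentiment classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in for a few thousand labeled domain examples.
data = Dataset.from_dict({
    "text": ["The battery drains quickly.", "The display is gorgeous."],
    "label": [0, 1],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=32))

args = TrainingArguments(
    output_dir="sentiment-finetuned",
    learning_rate=2e-5,      # small learning rate: nudge, don't overwrite, the weights
    num_train_epochs=3,
)
Trainer(model=model, args=args, train_dataset=data).train()
```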

State of the art (as of Q1 2021)

The release of BERT was an inflection point in NLU. Several models inspired by BERT came out between 2019 and 2021, each establishing new benchmarks. In RoBERTa (Robustly Optimized BERT, Facebook, 2019), the authors trained a BERT-like model longer and on more data and dropped the NSP objective. The resulting model matched or exceeded every post-BERT model published up to that point. Even two years after its release, RoBERTa remains a strong contender as a choice of language model in many situations.

While BERT itself is no longer widely used, sentiment analysis and related NLU tasks leverage the language models that came after it. The following are some successors to BERT and RoBERTa, though the list is by no means complete:

  • T5 (Google, 2020): The authors reframe every NLP task into a unified text-to-text format where the input and output are always text strings, allowing the same model, loss function, and hyper-parameters to be reused on any NLP task. They also scale the model to 11 billion parameters, trained on the much larger C4 dataset.
  • DeBERTa (Microsoft, 2020): Each word has two vector representations, encoding its content and its position. This, along with fine-tuning refinements that improve the model's generalization, has set new benchmarks on several NLP tasks. Its relatively small size also makes it a favorite tool in many situations.
  • GPT-3 (OpenAI, 2020): Uses 175 billion parameters, leveraging the architecture of its predecessor GPT-2 but with changes to the attention patterns. Even without fine-tuning, the model achieves excellent results on several NLP tasks.

A host of other equally capable models (ELECTRA, BigBird, Reformer, and several others) were released in 2020 and 2021.

Bias in the Machine: X is to Computer Programmer as Y is to Homemaker

Language models have been around long enough for us to understand the challenges of fairness and potential algorithmic bias. While these challenges apply to NLP in general, they apply in equal measure to sentiment analysis, which relies on NLU to make sense of the data. Bias creeps into a language model through biased data sources; even curated datasets such as news corpora or Wikipedia are not totally immune. In the classic 2016 paper 'Man is to Computer Programmer as Woman is to Homemaker?', Bolukbasi et al. suggest neutralizing biased words by equalizing their distances to the stereotyped and non-stereotyped word groups. While this was a good start, it is a superficial fix because the neutralized words remain in the same cluster and company of words as before.
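The neutralization step itself is simple to sketch: estimate a gender direction from word-vector differences and remove each gender-neutral word's component along it. A simplified toy version of the Bolukbasi et al. method (real use would load pre-trained embeddings such as word2vec):

```python
# Simplified sketch of the 'neutralize' step from Bolukbasi et al. (2016):
# remove a word vector's component along an estimated gender direction.
import numpy as np

def neutralize(word_vec, gender_direction):
    g = gender_direction / np.linalg.norm(gender_direction)
    return word_vec - (word_vec @ g) * g     # project out the bias direction

# Toy vectors standing in for pre-trained embeddings
v_he, v_she = np.array([1.0, 0.2, 0.0]), np.array([-1.0, 0.3, 0.1])
v_programmer = np.array([0.8, 0.5, 0.4])

gender_direction = v_he - v_she
debiased = neutralize(v_programmer, gender_direction)
g = gender_direction / np.linalg.norm(gender_direction)
print(debiased @ g)   # ~0: no remaining component along the gender direction
```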

The data augmentation method proposed by Zhao et al. (2019) for mitigating gender bias is interesting. The augmented dataset is designed to neutralize gender bias while avoiding any corruption of the model's understanding of natural language: gender-identifying words are replaced with words of the opposite gender, and the swapped sentences are combined with the original data for training. The bias is thus balanced out, making the model neutral towards both groups.
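A toy sketch of this gender-swap augmentation; the swap dictionary is illustrative, and real pipelines also handle names and grammatical agreement:

```python
# Toy sketch of gender-swap data augmentation (in the spirit of Zhao et al.):
# each training sentence gets a counterpart with gendered words swapped,
# so the model sees both versions equally often.
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "man": "woman", "woman": "man"}

def gender_swap(sentence):
    return " ".join(SWAPS.get(w.lower(), w) for w in sentence.split())

original = ["he is a brilliant programmer", "she stayed home with the kids"]
augmented = original + [gender_swap(s) for s in original]
print(augmented)
```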

In a 2020 paper, Huang et al. show that NLP models can also pick up a variety of cultural associations and undesirable social biases from the training data; certain attributes could be rated as carrying better sentiment than others (e.g., baker versus accountant as an occupation). Systematically evaluating this phenomenon by manipulating different sensitive attribute values (e.g., country names, occupations, or person names) within a fixed context, they find that sentiment scores for the generated texts can vary substantially, indicating sentiment bias. The authors propose counterfactual data augmentation as a remedy instead of de-biasing word embeddings.

More recently, in 2021, data augmentation techniques were further refined by Manela et al. and applied to state-of-the-art language models.

De-biasing techniques are still evolving, and as of 2021 there is no single gold standard.

Way of the Future: 2,000 Kenyon Cells of a Fruit Fly or One Trillion Parameters?

The Stanford Sentiment Treebank is one of the popular benchmark datasets for sentiment analysis. A huge uptick in scores can be seen starting in 2019, with almost all top models based on transformer-like architectures. The jump from 93.2% to 97.5% accuracy should not be viewed on a linear scale; it is better appreciated as a reduction in error from 6.8% to 2.5%, i.e., (6.8 − 2.5) / 6.8 ≈ 63%. The success of NLP transfer learning, and in particular the 2020 conference papers remediating instability in large-model fine-tuning, will drive further adoption and progress in this field.

Architecture ensembles that leverage model diversity will continue to gain popularity, and wrappers will evolve towards near script-less machine learning, in the vein of AutoNLP from Hugging Face.

The heart of ML is data, and several initiatives are being organized to examine and evaluate datasets. In 2021, for example, Google introduced Know Your Data (KYD), a dataset exploration tool that currently supports a small set of image datasets. Identifying and ferreting out hidden biases in NLP datasets is expected to become more streamlined and a well-organized industry initiative in the coming months.

In early 2021, Google open-sourced Switch Transformers, which scale up to 1 trillion parameters (weights and biases). The architecture differs from its predecessors in that it does not use all the parameters simultaneously: it relies on a Mixture-of-Experts (MoE) design to select different parameters for each data point, resulting in a sparsely activated yet stable model with roughly 10,000 times the parameters of BERT-base, released barely three years earlier.
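A toy numpy sketch of the top-1 ('switch') routing idea behind this design, showing how a learned router activates only one expert, and hence only a fraction of the total parameters, per token:

```python
# Toy sketch of Switch-style Mixture-of-Experts routing: a learned router
# sends each token to exactly one expert network, so only a fraction of
# the model's parameters is active for any given token.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 5

router_W = rng.normal(size=(d, n_experts))                      # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # expert layers
tokens = rng.normal(size=(n_tokens, d))

logits = tokens @ router_W
chosen = logits.argmax(axis=-1)                                 # top-1 expert per token
gate = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)   # router softmax

outputs = np.stack([gate[i, chosen[i]] * (tokens[i] @ experts[chosen[i]])
                    for i in range(n_tokens)])
print(chosen, outputs.shape)   # which expert each token used, and the result
```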

While large language models have passed the Turing test comfortably, despite all their size and power they have been shown to be remarkably naive about real-world understanding. Lin et al., for example, show that under certain conditions BERT assigns twice the probability to a bird having four legs rather than two, or to a car having two wheels instead of four. Despite beating human benchmarks on most tasks, the core understanding remains superficial; it seems Moravec's paradox applies to NLU as well. Marvin Minsky summarized this eloquently: 'In general, we're least aware of what our minds do best, ... we're more aware of simple processes that don't work well than of complex ones that work flawlessly.' Research to combat this paradox is ongoing. Google's Meena proposes a human evaluation metric called Sensibleness and Specificity Average (SSA) in addition to perplexity (the metric for the MLM objective). By training against such an objective, language models should be able to move beyond superficial understanding towards a larger perspective. Google has so far not released Meena's source code.

An interesting development in 2021 is the release of a paper called 'Can a Fruit Fly Learn Word Embeddings?', which counters the dominant narrative of massive language models. It proposes a frugal model based on how a fruit fly's memory is organized, harnessing the power of algorithms generated by natural evolution. The paper takes us back a full decade, to when words were represented as sparse binary vectors (just 0s and 1s), yet it allows for run-time contextualization, retaining the key advantage of transformer-like models. I hope it spawns a family of frugal yet efficient models that can compete with the trending massive models, and perhaps even perform better on metrics like SSA. Until then, the full power of the written word, and the sentiment behind it, can truly be appreciated by humans alone.
