Sentiment Analysis — Introduction

Emretokyuez
3 min read · Dec 25, 2020
Source for the title picture: MIT News [1]

Ever gotten a message from a friend or your crush and not known how to read it? Is it something positive or something negative?

This is Sentiment Analysis: evaluating text or audio for the feeling it expresses. It is a subfield of Natural Language Processing (NLP).

There are many methods for solving this problem with computers. In this introductory article, I want to talk about such methods in general and build up from there.

Lexicon-based Approach

A lexicon-based approach is just that… a lexicon. Researchers either create these lexicons automatically using some algorithm or label them manually.
Words like “exciting”, “amazing”, and “beautiful” get a positive value (+1), while negative words like “unfunny”, “bad”, and “boring” get a negative value (-1). Neutral words, like pronouns, and words not in the vocabulary get a value of zero. Looking at these two examples (the zeroes are not written out but are accounted for), we can see that the method works well on concise sentences with a clear meaning, but once sentences get more complex, with negations and intensifiers, accuracy drops. We could write rules and exceptions for these cases, but that requires yet another vocabulary and gets very messy very fast. After accounting for these issues, the results can still be satisfactory, as long as you stay in the domain you built the lexicon for. [2]
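As a minimal sketch, such a scorer can be just a dictionary lookup and a sum. The word scores here are made up for illustration, not taken from any published lexicon:

```python
# Toy sentiment lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {
    "exciting": 1, "amazing": 1, "beautiful": 1,
    "unfunny": -1, "bad": -1, "boring": -1,
}

def score(sentence):
    # Neutral words and words missing from the lexicon count as 0.
    return sum(LEXICON.get(word, 0) for word in sentence.lower().split())

print(score("what an amazing and beautiful movie"))  # 2
print(score("the movie was boring and unfunny"))     # -2
print(score("the movie was not boring"))             # -1
```

Note how “not boring” still scores -1: exactly the negation problem described above, which rule-based exceptions would have to patch.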

Lexicon Example Source: Image by author

The Natural Language Toolkit (NLTK) for Python provides such a model with VADER (Valence Aware Dictionary and sEntiment Reasoner), which you can find in the nltk.sentiment module.

Another model that is quite fast and performs very well is the Naive Bayes classifier from scikit-learn. It is based on Bayes’ theorem and achieves very good accuracy while taking only seconds to train on 50 thousand movie reviews.

Naive Bayes Theorem Source: Image by author

In this example, we have 100 sentences, evenly split between positive and negative. The sentence “This movie is colorful” appears 16 times in total: 15 times among the positive sentences and once among the negative ones, probably a colorblind person. If we feed this sentence into the trained classifier so that it can predict its sentiment, it performs the following calculation:

The probability of the sentence being positive is 15 over 16, so close to 94%; the probability of it being negative is 1 over 16, so around 6%. Since the probability of the sentence being positive is higher, the classifier labels it as positive.
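Plugging the numbers above into Bayes’ theorem makes the same calculation explicit:

```python
# P(pos | s) = P(s | pos) * P(pos) / P(s), using the counts from the example.
p_pos = 50 / 100          # prior: half of the 100 sentences are positive
p_s_given_pos = 15 / 50   # the sentence occurs in 15 of the 50 positive ones
p_s = 16 / 100            # the sentence occurs 16 times out of 100

posterior_pos = p_s_given_pos * p_pos / p_s
print(round(posterior_pos, 4))  # 0.9375, i.e. 15/16 ≈ 94%
```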

But taking whole sentences is very inefficient, as language is complex and offers nearly endless ways of saying the same thing. This is where the model gets its prefix “Naive” from: it assumes that every word in the sentence is independent of the others, calculates the probabilities from earlier for each word separately, and multiplies them together (in practice, the logarithms of the probabilities are summed to avoid numerical underflow). The resulting probability is then used for the classification.
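The word-level version is a few lines with scikit-learn. The tiny training set below is made up for illustration; a real experiment would use a corpus like the 50 thousand movie reviews mentioned earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: each sentence is broken into word counts, and the
# classifier learns per-word probabilities for each class.
train_texts = [
    "amazing exciting beautiful movie",
    "great fun loved every minute",
    "boring bad unfunny film",
    "terrible waste of time",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["what a beautiful and exciting movie"]))  # ['pos']
print(model.predict(["boring and bad film"]))                  # ['neg']
```

Words the model has never seen (“what”, “and”) are simply ignored at prediction time, so the known words carry the classification.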

These two models are among the easiest anyone can run on a laptop. Next time, we will actually code examples using these approaches and compare their results.

[1] Danielle Pagano. “Machine Learning will replace tasks, not jobs, say MIT researchers”. June 30, 2018. URL: https://jwel.mit.edu/news/machine-learning-will-replace-tasks-not-jobs-say-mit-researchers%C2%A0

[2] Maite Taboada et al. “Lexicon-Based Methods for Sentiment Analysis”. In: Computational Linguistics 37.2 (Apr. 2011), pp. 267–307. ISSN: 0891-2017. DOI: 10.1162/COLI_a_00049. URL: https://doi.org/10.1162/COLI_a_00049 (visited on 06/24/2020).
