Sentiment Analysis Using the Levenshtein Distance Algorithm
Data does not always arrive in neatly structured form, such as Excel sheets or tidy reports. Much of it comes from social media content in the form of images, text, audio, and video. Machine learning, a part of artificial intelligence, plays an essential role in improving data quality through predictive analysis of text data, often called sentiment analysis.
Sentiment analysis builds on Natural Language Processing (NLP), a set of computational techniques for analyzing and representing naturally occurring text at one or more levels of linguistic analysis in order to achieve human-like language processing for a range of tasks and applications. As a branch of artificial intelligence, NLP analyzes written text automatically so that machines can understand human language. Typical problems involve processing sentences or longer texts, for example document summarization and machine translation. NLP overlaps strongly with data science and artificial intelligence.
Two algorithms are commonly used for this task: TF-IDF, a word-weighting scheme that calculates the weight of each word, and the Levenshtein Distance method, which helps classify mental illness data in the form of text. The Levenshtein Distance algorithm finds the distance between the words entered by the user and the words in the database by counting, in matrix form, the number of differences between the two strings.
TF-IDF is a method for calculating the weight of each word and is commonly used in information retrieval. It combines two concepts in the weighting calculation: how often a word occurs in a document, and the inverse of how many documents contain that word.
Term Frequency (TF) is calculated using the equation:
$$tf = tf_{ij}$$
Where,
$tf$ = Term Frequency
$tf_{ij}$ = number of occurrences of term $i$ in document $j$.
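As a minimal sketch of the term-frequency count described above, the snippet below counts raw occurrences of each term in one document. The function name `term_frequencies`, the lowercase-and-split tokenization, and the sample sentence are illustrative assumptions, not part of the original method.

```python
from collections import Counter


def term_frequencies(document: str) -> Counter:
    """Count how many times each term occurs in a single document."""
    # Assumption: tokens are lowercase words separated by whitespace.
    # A real pipeline would usually also strip punctuation and stop words.
    return Counter(document.lower().split())


# Hypothetical sample document.
tf = term_frequencies("I feel sad and I feel tired")
print(tf["feel"])   # 2
print(tf["tired"])  # 1
```

A `Counter` returns 0 for terms that never occur, which is convenient when looking up a search word that is absent from the document.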
Inverse Document Frequency (IDF) is calculated using the equation:
$$idf_i = \log\left(\frac{N}{df_i}\right)$$
Where,
$idf_i$ = Inverse Document Frequency of term $i$
$N$ = the number of documents retrieved by the system
$df_i$ = the number of documents in the collection in which term $i$ (the word being searched for) appears.
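Term Frequency-Inverse Document Frequency (TF-IDF) is calculated using the equation:
$$W_{ij} = tf_{ij} \times idf_i$$
Where,
$W_{ij}$ = weight of document $j$ with respect to term $i$
$tf_{ij}$ = number of occurrences of term $i$ in document $j$
$idf_i$ = Inverse Document Frequency of term $i$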
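The weighting steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the weight is taken as the product of the raw term count and the logarithm of N divided by the document frequency, the logarithm is base 10, tokenization is a plain whitespace split, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Assumption: idf = log10(N / df); other bases and smoothing variants exist.
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw occurrence count per term in this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


# Hypothetical sample collection.
docs = ["i feel sad", "i feel tired and sad", "i feel fine"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

Terms that occur in every document, such as "feel" here, receive a weight of 0, while a term found in only one document, such as "fine", receives the highest weight.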
The Levenshtein Distance algorithm is a string metric that measures the difference between two sequences. The smaller the distance, the more similar the two strings are. Here, distance means the minimum number of single-character modifications needed to change one string into the other.
For example, the strings “baru” and “batu” have a distance of 1, because a single operation is enough to change one into the other: “baru” becomes “batu” by substituting the character “r” with “t”. In general, the distance between two strings is the minimum number of edit operations needed to change one string into the other.
There are three primary operations in this algorithm:
- Insertion: adds a character to the string. For example, the string “bapa” becomes “bapak”.
- Deletion: removes a character from the string. For example, the string “kasur” becomes “kasu”.
- Substitution: exchanges one character for another. For example, the string “baru” becomes “batu”.
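The three operations can be sketched as a short dynamic-programming routine. This is one common way to compute the distance, keeping only two matrix rows in memory; the function name `levenshtein` is an illustrative choice, and every operation is assumed to cost 1.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    # Row for the empty source prefix: j insertions build target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i source characters into "" takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            # Substitution is free when the two characters already match.
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


print(levenshtein("bapa", "bapak"))  # 1 (insertion)
print(levenshtein("kasur", "kasu"))  # 1 (deletion)
print(levenshtein("baru", "batu"))   # 1 (substitution)
```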
An example of a Levenshtein Distance measurement table, for the words X = “baru” (columns) and Y = “batu” (rows), is as follows:

|   |   | B | A | R | U |
|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 |
| B | 1 | 0 | 1 | 2 | 3 |
| A | 2 | 1 | 0 | 1 | 2 |
| T | 3 | 2 | 1 | 1 | 2 |
| U | 4 | 3 | 2 | 2 | 1 |
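To show where the values in such a table come from, the sketch below fills the full matrix for X = "baru" and Y = "batu". The function name `levenshtein_matrix` and the layout (Y on the rows, X on the columns) are assumptions chosen to match the table above.

```python
def levenshtein_matrix(x: str, y: str) -> list[list[int]]:
    """Build the full distance matrix with y on the rows and x on the columns."""
    rows, cols = len(y) + 1, len(x) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j  # first row: distance from "" to each prefix of x
    for i in range(rows):
        d[i][0] = i  # first column: distance from each prefix of y to ""
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # step down: one more character of y
                d[i][j - 1] + 1,         # step right: one more character of x
                d[i - 1][j - 1] + cost,  # diagonal: match or substitution
            )
    return d


matrix = levenshtein_matrix("baru", "batu")
for label, row in zip(" BATU", matrix):
    print(label, *row)
print("distance:", matrix[-1][-1])  # distance: 1
```

Each cell depends only on its upper, left, and upper-left neighbours, so the final distance always ends up in the bottom-right cell.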
Based on the table above, the distance is 1, the value in the bottom-right cell. The similarity is then measured with the following formula:
$$Sim = 1 - \frac{Dis}{MaxLength}$$
Where,
Sim: similarity value
Dis: Levenshtein distance
MaxLength: length of the longer of the two strings
For “baru” and “batu”, Dis = 1 and MaxLength = 4, so Sim = 1 - 1/4 = 0.75. If the similarity value is 1, the two strings being compared are identical. Conversely, a similarity value of 0 means the two strings are entirely different, and values in between indicate partial similarity.
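As a rough sketch of how the similarity value could be used to match a user's word against stored words, the snippet below assumes the common normalization of 1 minus the distance divided by the length of the longer string. The helper names `similarity` and `closest_word`, and the small vocabulary list, are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for every operation."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                current[j - 1] + 1,
                previous[j] + 1,
                previous[j - 1] + (char_a != char_b),
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed formula: 1 - distance / length of the longer string."""
    max_length = max(len(a), len(b))
    if max_length == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(a, b) / max_length


def closest_word(query: str, vocabulary: list[str]) -> str:
    """Return the stored word with the highest similarity to the query."""
    return max(vocabulary, key=lambda word: similarity(query, word))


vocabulary = ["batu", "bapak", "kasur"]  # hypothetical stored words
print(similarity("baru", "batu"))        # 0.75
print(similarity("baru", "baru"))        # 1.0
print(closest_word("baru", vocabulary))  # batu
```

Dividing by the longer length keeps the score between 0 and 1, so scores for word pairs of different lengths can be compared directly.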