Sentiment Analysis for Cyberbullying Activities of Teenagers on Social Media During the COVID-19 Pandemic

The World Health Organization (WHO) officially declared Coronavirus Disease 2019 (COVID-19) a pandemic on March 11, 2020. The declaration has had a significant impact on many people's lifestyles: learning at schools and campuses, work, public services, worship, and other socio-cultural activities moved from in-person settings to online platforms, from the enactment of the lockdown through the Large-Scale Social Restrictions. Online platforms are convenient for many people, especially for sharing profiles and daily life and for interacting with friends or relatives, because social distancing rules do not apply in cyberspace. As a result, activity on social media has increased, including the use of Instagram.

Instagram is a social media platform that is very popular with millennial netizens globally, and especially in Indonesia, according to Hootsuite's Internet and Social Media Trend Data for 2020. The number of Instagram users at the beginning of January was 63 million. Another report, from the marketing platform Klear, shows that daily user posts on Instagram Stories increased by 15% in a week, and the number of users who view other users’ Stories increased by 21% (Burhan, 2020). Instagram is widely used because it offers many engaging features, including selfie filters, Instagram Stories, IGTV, photo and video sharing, comments, likes, Explore, emoji sliders, GIFs, and polls (Sendari, 2019).

Because everyone is confined to the home, people end up repeating the same activities, which leads to boredom. Out of that boredom, some individuals or groups seek amusement through cyberbullying on social media.
According to Willard (2005), cyberbullying is sending or uploading harmful material or engaging in social aggression using the internet and other technologies. Eight forms of behavior can serve as indicators of cyberbullying:

  1. Flaming
  2. Harassment
  3. Denigration
  4. Impersonation
  5. Outing
  6. Trickery
  7. Exclusion
  8. Cyberstalking

Of particular concern in this study are millennial Instagram users aged 15-19 because, according to Henri Kasyfi Soemartono, Secretary General of the Association of Indonesian Internet Service Providers (APJII), internet use in 2018 was dominated by the 15-19 age segment, which reached 91%. Noveliati Sabani, in a journal article entitled Millennial Generation and the Absurdity of the Kusir Virtual Debate, characterizes the millennial generation as one that almost always makes time to use social media. Unfortunately, this generation's ability to filter the information it gets from social media is weak: its members often immediately believe any content that is spread.

Thus, technology is needed to assess the level of cyberbullying. Such an assessment can later inform recommendations for preventing cyberbullying, so that its level can decrease.

Sentiment analysis is the process of understanding textual data and extracting from it information about a person’s attitude or opinion on a topic. It analyzes opinions, judgments, sentiments, attitudes, and emotions toward an entity such as a product, a service, or a specific topic, and it has attracted a large amount of research because of its considerable benefits. Sentiment analysis can be done using machine learning algorithms that classify the text of a sentence to determine whether the expressed opinion is positive or negative. In this case, the content has no formal structure, which makes sentiment analysis on social media data more difficult, so a preprocessing step is required to clean the data and make it easier to process.

Sentiment Analysis Phases

Preprocessing

The data used in this study are public opinions taken from social media and written in an unstructured style.

Therefore, preprocessing is necessary so that the data are more structured when classified. Preprocessing in this study consisted of five stages: case folding, tokenizing, normalization, stopword removal, and stemming.

  1. Case Folding
    Case folding converts uppercase letters to lowercase. In this study, it also removes punctuation marks and delimiters such as periods (.) and commas (,), as well as emoticons and other characters.
  2. Tokenizing
    Tokenizing breaks a sentence into smaller parts (tokens). This makes it easier to count words or calculate how often each word appears in the corpus.
  3. Normalization
    Normalization corrects abbreviated or misspelled words so that variants with the same meaning share a single form. This is done to get a good-quality document.
  4. Stopword Removal
    Stopword removal removes words in a document that carry little meaning, such as conjunctions and personal pronouns.
  5. Stemming
    Stemming reduces each word in a comment to its base form by removing affixes, so that it conforms to standard Indonesian.
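
The five stages above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the lookup tables, the suffix list, and the sample comment are made up for the example, and the suffix stripping is a naive stand-in for a proper Indonesian stemmer, not the method used in the study.

```python
import re

# Hypothetical, tiny lookup tables for illustration only; a real pipeline
# would load full Indonesian resources.
NORMALIZATION = {"gk": "tidak", "bgt": "banget", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "itu", "aku"}
SUFFIXES = ("nya", "kan", "lah")  # naive stand-in for a real stemmer


def case_fold(text):
    """Lowercase the text and replace every non-letter character with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split the text into tokens on whitespace."""
    return text.split()


def normalize(tokens):
    """Replace known abbreviations with their full form."""
    return [NORMALIZATION.get(t, t) for t in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(tokens):
    """Strip one known suffix if at least three letters remain."""
    stemmed = []
    for t in tokens:
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed


def preprocess(text):
    """Run the five stages in the order described above."""
    tokens = tokenize(case_fold(text))
    return stem(remove_stopwords(normalize(tokens)))


print(preprocess("Fotonya itu jelek bgt, gk suka!!! :("))
# ['foto', 'jelek', 'banget', 'tidak', 'suka']
```

Note that the negation word "tidak" is deliberately kept out of the stopword list here, since removing it would flip the meaning of the comment.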

Labeling

At this stage, the data are labeled into two classes, positive and negative. After labeling, neutral data are deleted because they could affect the accuracy of the model. Labeling in this research is done automatically using a lexicon-based approach, which uses a dictionary that assigns a weight to each word as its lexical resource. In the positive category, sentences express satisfaction, thanks, praise, and the like. In the negative category, sentences express disappointment, dissatisfaction, and the like.

The InSet lexicon is used because it has been tested sufficiently well for sentiment analysis of Indonesian data. The purpose of labeling is to let the system determine the sentiment of each sentence to be tested. InSet consists of two dictionaries: a positive lexicon, which contains 3,609 positive words, and a negative lexicon, which contains 6,609 negative words. Each word has a weight, or polarity score, between -5 and +5.
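
A minimal sketch of lexicon-based labeling might look like the following. The words and weights in `LEXICON` are invented for illustration and are not taken from InSet, and summing the weights and labeling by the sign of the total is an assumed scoring rule, since the article does not spell one out. The sketch also ignores negation.

```python
# Toy lexicon with made-up weights in the -5..+5 range; the real lexicon
# would be loaded from its positive and negative word lists.
LEXICON = {"bagus": 4, "suka": 3, "terima": 2, "jelek": -4, "benci": -5, "kecewa": -3}


def polarity_score(tokens):
    """Sum the lexicon weights of all tokens; unknown words count as 0."""
    return sum(LEXICON.get(t, 0) for t in tokens)


def label(tokens):
    """Map the total score to a class by its sign (assumed rule)."""
    score = polarity_score(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


comments = [
    ["foto", "bagus", "suka"],
    ["jelek", "benci"],
    ["foto", "baru"],
]
labeled = [(c, label(c)) for c in comments]

# Neutral comments are dropped before training, as described above.
training_data = [(c, l) for c, l in labeled if l != "neutral"]
print(training_data)
```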

TF-IDF Term Weighting

At this stage, each sentence is split into words, and each word is given a weight using the TF-IDF (Term Frequency-Inverse Document Frequency) method. TF-IDF is a numerical statistic that expresses how important a word is in each document.
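
To make the weighting concrete, here is a small sketch that computes TF-IDF by hand. It assumes one common variant, term frequency as count divided by document length and inverse document frequency as log(N / df); the article does not state which variant the study used, and library implementations often add smoothing. The sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["foto", "jelek", "banget"],
    ["foto", "bagus", "banget"],
    ["foto", "bagus", "suka"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy corpus, "foto" appears in every document and so receives a weight of zero, while a word that appears in only one document, such as "jelek", receives the highest weight in that document.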

Word2Vec Vector Weighting

In this example, the sentence “that’s as fun as the sleeping mask” is used. The sentence is first represented using one-hot encoding: each word is converted into a vector of numbers, and together these vectors form a matrix.
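
The one-hot step for this sentence can be sketched as follows. Whitespace tokenization and a vocabulary ordered by first appearance are assumptions made for the example. The sketch covers only the one-hot input representation; training the actual Word2Vec embeddings is not shown.

```python
sentence = "that's as fun as the sleeping mask"
tokens = sentence.split()

# Vocabulary in order of first appearance; "as" occurs twice but gets one index.
vocab = list(dict.fromkeys(tokens))
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector


# One row per token, one column per vocabulary word.
matrix = [one_hot(t) for t in tokens]
for token, row in zip(tokens, matrix):
    print(f"{token:>9} {row}")
```

The sentence has seven tokens but only six distinct words, so the matrix has seven rows and six columns, and the two rows for "as" are identical.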

Modeling

Support vector machines (SVM) and Naïve Bayes can be used to classify the dataset for sentiment analysis. This process uses the Python programming language, with the data divided into two sets: training and testing. Several training-to-testing ratios are used, namely 8:2, 7:3, and 9:1, to determine which split gives the best accuracy for the model used.
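
The splitting and classification step can be illustrated with a small, self-contained sketch. It implements a multinomial Naïve Bayes classifier with add-one smoothing by hand over token lists; this is an assumed variant, the dataset is made up, and the SVM side is not shown. The study itself most likely relied on a machine learning library, which the article does not name.

```python
import math
import random
from collections import Counter, defaultdict


def split(data, train_ratio, seed=0):
    """Shuffle a copy of the data and cut it at the given training ratio."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, samples):
        self.class_counts = Counter(label for _, label in samples)
        self.word_counts = defaultdict(Counter)
        for tokens, label in samples:
            self.word_counts[label].update(tokens)
        self.vocab = {t for tokens, _ in samples for t in tokens}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled comments (token lists), for illustration only.
data = [
    (["foto", "bagus", "suka"], "positive"),
    (["keren", "banget", "suka"], "positive"),
    (["bagus", "keren"], "positive"),
    (["terima", "kasih", "bagus"], "positive"),
    (["suka", "banget"], "positive"),
    (["jelek", "benci"], "negative"),
    (["foto", "jelek", "banget"], "negative"),
    (["kecewa", "jelek"], "negative"),
    (["benci", "banget", "kecewa"], "negative"),
    (["norak", "jelek"], "negative"),
]

# Try the three training-to-testing ratios mentioned above.
for ratio in (0.8, 0.7, 0.9):
    train, test = split(data, ratio)
    model = NaiveBayes().fit(train)
    correct = sum(model.predict(tokens) == label for tokens, label in test)
    print(f"train ratio {ratio}: accuracy {correct / len(test):.2f}")
```

With only ten samples the printed accuracies are not meaningful; the point is the shape of the experiment, where the same model is trained and scored once per split ratio.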

Model Evaluation

Model evaluation is carried out to determine the performance of the model. It is done by examining the accuracy of each method through its confusion matrix, together with tables of accuracy and precision for each model. After the trained model is run on the test data, the resulting accuracy values are obtained, and conclusions can be drawn from the research that has been done.
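
As a sketch of how these metrics relate, the following computes a binary confusion matrix and derives accuracy and precision from it. The label lists are made up, and treating "positive" as the positive class is an assumption for the example.

```python
def confusion_matrix(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def accuracy(cm):
    """Share of all predictions that are correct."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


def precision(cm):
    """Share of positive predictions that are truly positive."""
    predicted_positive = cm["tp"] + cm["fp"]
    return cm["tp"] / predicted_positive if predicted_positive else 0.0


y_true = ["positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "negative", "negative"]

cm = confusion_matrix(y_true, y_pred)
print(cm)             # {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 2}
print(accuracy(cm))   # 0.6
print(precision(cm))  # 0.5
```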
