3C TIC. Cuadernos de desarrollo aplicados a las TIC. ISSN: 2254 – 6529 Ed. 37 Vol. 10 N.º 2 Junio - Septiembre 2021
DEVELOPMENT OF COMPUTATIONAL LINGUISTIC
RESOURCES FOR AUTOMATED DETECTION OF TEXTUAL
CYBERBULLYING THREATS IN ROMAN URDU LANGUAGE
Amirita Dewani
Mehran University of Engineering & Technology, Jamshoro, Sindh, (Pakistan).
E-mail: amirita@faculty.muet.edu.pk ORCID: https://orcid.org/0000-0002-3816-3644
Mohsin Ali Memon
Mehran University of Engineering & Technology, Jamshoro, Sindh, (Pakistan).
E-mail: mohsin.memon@faculty.muet.edu.pk ORCID: https://orcid.org/0000-0003-2638-4252
Sania Bhatti
Mehran University of Engineering & Technology, Jamshoro, Sindh, (Pakistan).
E-mail: sania.bhatti@faculty.muet.edu.pk ORCID: https://orcid.org/0000-0002-0887-8083
Received: 20/04/2021
Accepted: 10/06/2021
Published: 29/06/2021
Suggested citation:
Dewani, A., Memon, M. A., and Bhatti, S. (2021). Development of computational linguistic resources for automated
detection of textual cyberbullying threats in Roman Urdu language. 3C TIC. Cuadernos de desarrollo aplicados a las TIC,
10(2), 101-121. https://doi.org/10.17993/3ctic.2021.102.101-121
ABSTRACT
Automatic cyberbullying detection remains a very challenging task, since social media content and
conversations are usually posted in unstructured free-text form that disregards language norms. The
major concern and gap in formulating cyberbullying detection strategies is the scarcity of linguistic
resources, particularly for newly evolved languages. Roman Urdu has emerged recently and is hence a
resource-poor language. Urdu is widely known as the national language of Pakistan. However,
because of socio-cultural and multilingual factors, Roman Urdu is widely used on the Internet by Asians,
and more specifically by Pakistanis.
To fill the above-stated gap, this research work presents guidelines for the data annotation process and
develops two linguistic resources. (i) An annotated corpus in Roman Urdu for cyberaggression
and offensive language detection. The data annotation process involved bilingual annotators instead
of crowdsourcing. This has the benefit of correctly annotating instances that constitute clear cases of
cyberbullying without compromising data quality. The developed corpus is highly balanced (with
almost negligible skew), unlike most existing corpora, even in mature languages. (ii) Domain-specific
stop words. Processing textual information for NLP tasks involves stop-word elimination as a sub-phase.
Stop words carry the least semantic information and enlarge the feature space compared with the other
tokens and index terms in corpora. We have developed domain-specific stop words for Roman Urdu,
considering all the lexical variants, specifically in the context of aggression detection and the collected
data. The work was carried out using the Python programming language and the PyCharm IDE.
KEYWORDS
Linguistic Resources, Cyberaggression, Cyberbullying, Hate Speech Detection, Abusive Language,
Automated Detection.
1. INTRODUCTION
Rapid advances in technology and the compelling needs of users have made the internet, and particularly
social networking sites (SNSs), an integral part of everyone’s life, resulting in a huge amount of
user-generated content, also known as big social media data. The escalation of social media has
completely changed the way in which people view, create, or share information and ideas (Namdeo et al., 2017).
Undeniably, Web 2.0 plays a vital role in communication, relationships, and collaboration in today’s
society. Communities belonging to different age groups (children, youngsters, and adults) interact
with each other anytime and anywhere, in diverse ways (e.g., via laptops, smartphones, and tablets) and
using a wide range of social networking platforms. The benefits of digital communication are evident,
since most internet usage is harmless. However, anonymity and freedom of speech often lead young
people to be offensive and leave them vulnerable, giving rise to an alarming threat, i.e.,
cyberbullying, cyberaggression, or hate speech (Van Hee et al., 2018). People, particularly youngsters,
have reported life-disturbing and distressing experiences, drawing the attention of researchers and
making cyberbullying and its automatic detection a growing community need and a promising area of
Natural Language Processing (NLP) (Huang et al., 2018).
Several studies show that formulating computational cyberaggression detection strategies is
extremely challenging. One of the major challenges is the scarcity of the required resources,
particularly for newly emerged languages. Moreover, most of the datasets used for cyberbullying
detection, even in mature languages, exhibit an extreme skew between hate speech and non-hate speech
content (Emmery et al., 2020). This leads to inappropriate strategies, unreliable predictive
performance (specifically for the minority class), and greater sensitivity to classification errors.
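The effect of skew on the minority class can be illustrated with a small worked example. The Python sketch below is only an illustration, not part of the original study: it assumes a hypothetical corpus of 1,000 posts of which 3% are hateful, and a degenerate classifier that always predicts the majority class. The function name `accuracy_and_recall` and all data are made up for the example.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall of the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical skewed corpus: 30 hateful posts (1) and 970 normal posts (0).
y_true = [1] * 30 + [0] * 970
# Degenerate classifier: always predicts the majority class (0).
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
# prints: accuracy=0.97, minority recall=0.00
```

Under these assumptions, accuracy looks high while not a single hateful post is detected, which is why a balanced corpus and minority-class metrics matter for this task.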
With the advent of Unicode encoding, Urdu-language content written in Roman script is growing
rapidly on social networking sites. Roman Urdu is a non-normative language. The written script of
this language does not follow any rigid set of grammatical rules or spelling standards. Survey
statistics in Shahroz et al. (2020) affirm that about 300 million people speak Urdu, of whom
approximately 11 million are in Pakistan, and that most of these users have switched to Roman
Urdu for textual communication, particularly on social media. It is a
linguistically rich and morphologically complex language (Mehmood et al., 2020).
Urdu orthography (aka imla) bears a resemblance to the Turkish, Arabic, and Persian languages. Moreover,
the cursive Arabic script and the Nastaliq writing style are used (Syed et al., 2010). Roman Urdu uses
the Roman script. An example of Roman Urdu text and its equivalents in Urdu and English is depicted in
Figure 1.
Figure 1. Script and Morphological variability in Roman Urdu, Urdu, and English Language.
Source: own elaboration.
Despite its wide prevalence worldwide (and more specifically in South Asia), Roman Urdu is
an under-resourced language. Linguistic resources for Asian languages are typically the focus of
conferences and journals such as ACM Transactions on Asian and Low-Resource Language
Information Processing (“ACM Transactions on Asian and Low-Resource Language Information
Processing”, n.d.), the International Joint Conference on Natural Language Processing (“First Task for
Automatic Cyberbullying Detection for the Polish Language | ACL Member Portal”, 2019), the Conference
of Central Asian Languages and Linguistics (“Central Asian Languages and Linguistics”, n.d.), etc., which
support a wide range of NLP tasks related to phonology, morphology, named entity recognition (NER),
language parsing, and word segmentation.
To support the development of NLP applications for Roman Urdu, particularly in the field of cyberaggression
and hate speech, this paper presents annotation guidelines, the first-ever highly balanced Roman Urdu
dataset, and the development of domain-specific stop words using the Python language.
The rest of this paper is organized as follows. Cyberaggression and existing resources are discussed in
Section II. Section III sheds light on data extraction from the Twitter social media platform. The
preparation of data annotation guidelines and the kappa weighting scheme are given in Sections IV and V,
respectively. Section VI discusses stop-word development. Finally, Section VII concludes the research work.
2. RELATED WORK
Although researchers have widely used Natural Language Processing (NLP) and Machine
Learning (ML) techniques to solve a variety of tasks based on unstructured text data (e.g.,
topic identification, opinion mining, document summarization, and text translation), their application
to the automatic detection of cybercrime-related problems is relatively new and has encountered
many challenges (Rosa et al., 2019).
The limited availability of appropriate data, the large data skew caused by the naturally uneven
distribution of hate speech content on social media, and the scarcity of NLP resources are among the
many significant issues in research on cyberbullying detection (Mahlangu et al., 2018; Gencoglu, 2020).
A handful of studies have developed resources and cyberbullying detection strategies in different
languages worldwide. In most studies, hateful instances make up only 2 to 5% of the data (Emmery et al., 2020).
Sprugnoli et al. (2018) developed a dataset from WhatsApp chats to study
offensive language among Italian students. They also presented an annotation scheme and user roles.
Fersini et al. (2018) collected misogynous and hateful tweets using a
combined approach: monitoring prospective victims of hate accounts, downloading the history of
identified haters, and filtering Twitter stream content via keywords.
Fišer et al. (2017) extracted data from an online platform that collects impulsive
reports by internet users of any material containing child sexual abuse, a special category of
cyber-aggressiveness, to develop a corpus. Validation of the corpus by experts revealed that only 3%
was illegal content and more than 40% was non-disturbing content. A hate speech corpus for the
Indonesian language was contributed by Ibrohim and Budi (2018). That work used the Twitter platform,
crowdsourced annotation, and a multi-level scheme to identify hate speech and non-hate speech
categories along with their intensity levels. Bohra et al. (2018) present a dataset comprising
Hindi-English code-mixed data. The tweets are annotated with the language at word level and the class
they belong to (Hate Speech or Normal Speech).
Van Bruwaene et al. (2020) formed an English-language dataset spanning multiple platforms
from SafeToNet’s VISR-branded child safety app for adolescents. In collaboration with expert annotators,
they used crowdsourcing and machine learning techniques to enlarge the corpus and handle skew in
an iterative manner. The work by Özel et al. (2017) is the first study performed for the Turkish language.
It contributed a Turkish corpus prepared from the Instagram and Twitter social media
platforms, and also conducted experiments using machine learning techniques.
Undeniably, English is the de facto common language among researchers at the international level; hence,
as highlighted by a review study (Poletto et al., 2020), the greater number of computational resources are
English corpora and datasets. Nevertheless, several other languages are represented too, which
is highly significant for an international community that seeks to address the worldwide social issue of
cyberbullying and hate speech spreading in many languages.
Roman Urdu has become a contemporary trend as a language of communication for
Pakistani, or more generally Asian, youth. To the best of our knowledge, this is the first study to
develop computational linguistic resources in Roman Urdu for cyberaggression. This study
presents our approach to collecting and annotating social media data to develop a cyberbullying corpus
in Roman Urdu, along with domain-specific stop words. Data extraction was a multi-phase
process to ensure high-quality data with minimum skew. The corpus encompasses a vast range of content
inciting hatred. The content also reflects bullying based on a wide range of attributes such as race,
ethnic origin, religious affiliation, sexual orientation, caste, gender, identity, serious disease or
disability, and intelligence. Since the natural distribution of social media data is heavily skewed,
which results in a scarcity of bullying instances for training, we performed the extraction in phases.
Moreover, this work used different weighting schemes for automatic identification, together with manual
input from bilingual expert annotators, to develop stop words for the cyberbullying detection problem
in Roman Urdu.
3. METHODOLOGY
3.1. DATA EXTRACTION
Twitter is one of the most popular microblogging services, with 316 million monthly active users.
Compared with other social media platforms, Twitter has attracted more attention from academic
researchers because it makes its data available for research purposes via an Application Programming
Interface (API) (Ahmed et al., 2017). To develop the cyberbullying corpus, data was scraped from Twitter
using the Python language, Tweepy, and the Twitter streaming API in multiple phases over a period of
3 months, as depicted in Figure 2. The reasons were twofold: (i) restrictions on data access imposed on
the standard API, and (ii) the highly skewed natural distribution of content.
Figure 2. Data Extraction- Tweet Count.
Source: own elaboration.
Currently, no language code is available for Roman Urdu in the API, so the queries for data collection
were formed in three ways. First, we used geo-location information, taking coordinates from Google Maps
for the areas in Pakistan where a high concentration of Roman Urdu content was expected. Second, we
extracted tweets based on insulting seed words or curse words typically used in Roman Urdu for bullying.
Third, we used hashtags for aggression and trash talking on recent topics from regions in Pakistan.
A substantial number of tweets were in English, and such content was filtered out, leaving 3K tweets.
In order to retain the writing patterns of Roman Urdu users on social media, data with embedded English
words (such as batting, topic, character, bowling, follow, design, ok, yes, no, music, video, free, hope,
player, code, development, etc.) was preserved. Some examples of such data instances are depicted in Figure 3.
Figure 3. Natural writing patterns of users on social media in Roman Urdu Language.
Source: own elaboration.
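The filtering step described above, which drops English tweets but keeps Roman Urdu tweets that contain occasional English words, could be sketched as follows. This is a minimal illustration under stated assumptions and not the procedure actually used in this work: the tiny `ENGLISH_WORDS` vocabulary, the `english_ratio` and `keep_tweet` helpers, and the 0.8 threshold are all hypothetical, and a real run would load a full English word list.

```python
import re

# Hypothetical, deliberately tiny English vocabulary for illustration only.
ENGLISH_WORDS = {
    "the", "is", "a", "this", "was", "great", "match", "and", "of", "to",
    "batting", "topic", "follow", "ok", "video", "player",
}

TOKEN_RE = re.compile(r"[a-z']+")


def english_ratio(text):
    """Fraction of alphabetic tokens that appear in the English vocabulary."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    return sum(tok in ENGLISH_WORDS for tok in tokens) / len(tokens)


def keep_tweet(text, max_english_ratio=0.8):
    """Keep a tweet unless (almost) all of its tokens are English.

    The threshold is an assumed value, not one reported in the paper.
    """
    return english_ratio(text) < max_english_ratio


# Roman Urdu tweet with one embedded English word ("batting"): kept.
print(keep_tweet("aaj ki batting bohat achi thi"))  # prints: True
# Fully English tweet: filtered out.
print(keep_tweet("this was a great match"))         # prints: False
```

A ratio-based rule of this kind preserves code-mixed tweets, whereas a rule that rejects any tweet containing an English word would discard much of the natural Roman Urdu writing pattern.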
3.2. DATA ANNOTATION PROCESS
Data annotation is a Human Intelligence Task (HIT). Undeniably, crowdsourcing has obvious
organizational advantages, especially for a time-consuming task such as the annotation of textual data,
but annotation quality may be compromised by employing non-expert annotators, particularly for recently
evolved languages and a challenging task like cyberaggression detection (Schmidt & Wiegand, 2017).
Moreover, many studies have found that a non-trivial percentage of the data collected on Amazon
Mechanical Turk (MTurk) is “dubious”, annotated either by “non-respondents” (bots instead of humans) or
by non-serious respondents (Dreyfuss et al., 2018; Ahler et al., 2019).
Instead of crowdsourcing, data was annotated by linguistic experts with bilingual expertise
(good knowledge of the Nastaliq script and Roman Urdu patterns). Annotators were provided with
guidelines on how to label social media documents for bullying. The main task of each annotator was
to label each sample with one of three possible labels:
0 – the text certainly does not contain any form of online violence, hate speech, or abusive language.
1 – the text certainly contains some form of online violence, hate speech, or abusive language.
2 – indeterminate (doubtful) case: the text cannot be identified with good certainty as either
containing or not containing any form of online violence, hate speech, or abusive language.
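The three-label scheme above can be represented in code, and agreement between two annotators can then be quantified. The sketch below is illustrative only: it assumes plain (unweighted) Cohen's kappa for exactly two annotators, which may differ from the kappa scheme applied in this work, and the `Label` enum, the `cohen_kappa` function, and the example annotations are hypothetical.

```python
from collections import Counter
from enum import IntEnum


class Label(IntEnum):
    """The three annotation labels described in the guidelines."""
    NON_BULLYING = 0
    BULLYING = 1
    INDETERMINATE = 2


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty sample set.")
    n = len(labels_a)
    # Observed agreement: share of samples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of eight samples by two annotators.
N, B, I = Label.NON_BULLYING, Label.BULLYING, Label.INDETERMINATE
annotator_a = [N, B, B, N, I, B, N, N]
annotator_b = [N, B, N, N, I, B, N, B]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.3f}")  # prints: kappa=0.579
```

In this made-up example the annotators agree on 6 of 8 samples (0.75), chance agreement is 26/64, and kappa is therefore 11/19. Samples labelled 2 by either annotator would typically need a separate resolution step, which is not shown here.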