3C TIC. Cuadernos de desarrollo aplicados a las TIC. ISSN: 2254 – 6529 Ed. 37 Vol. 10 N.º 2 Junio - Septiembre 2021
https://doi.org/10.17993/3ctic.2021.102.101-121
1. INTRODUCTION
Rapid advances in technology and the growing needs of users have made the internet, and social
networking sites (SNSs) in particular, an integral part of everyday life, resulting in a huge amount of
user-generated content, also known as big social media data. The growth of social media has completely
changed the way people view, create, and share information and ideas (Namdeo et al., 2017).
Undeniably, Web 2.0 plays a vital role in communication, relationships, and collaboration in today’s
society. Communities of different age groups (children, youngsters, and adults) interact with each
other anytime and anywhere, in diverse ways (e.g., via laptops, smartphones, and tablets) and across
a wide range of social networking platforms. The benefits of digital communication are evident, and
most users’ internet usage is harmless. However, anonymity and freedom of speech often make young
people both offensive and vulnerable, leading to one of the most alarming threats, i.e., cyberbullying,
cyberaggression, or hate speech (Van Hee et al., 2018). People, typically youngsters, have reported
disturbing, life-disrupting experiences, which has drawn the attention of researchers and made
cyberbullying and its automatic detection a growing community need and a promising area of
Natural Language Processing (NLP) (Huang et al., 2018).
Several studies have shown that developing computational strategies for cyberaggression detection
is extremely challenging. One major challenge is the scarcity of the required resources, particularly
for newly emerged languages. Moreover, most of the datasets used for cyberbullying detection, even
in mature languages, exhibit an extreme skew between hate speech and non-hate speech textual
content (Emmery et al., 2020). This leads to inappropriate strategies, unreliable predictive
performance (specifically for the minority class), and greater sensitivity to classification errors.
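
The effect of such skew on evaluation can be illustrated with a minimal sketch. The label counts below (950 non-hate, 50 hate) are invented for illustration and are not taken from any dataset cited here, and the helper name `minority_metrics` is hypothetical. The sketch scores a trivial baseline that always predicts the majority class, showing how overall accuracy can look strong while the minority (hate speech) class is never detected.

```python
def minority_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when the class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Assumed, illustrative class distribution: 0 = non-hate, 1 = hate speech.
y_true = [0] * 950 + [1] * 50
# Trivial baseline: always predict the majority (non-hate) class.
y_pred = [0] * len(y_true)

scores = minority_metrics(y_true, y_pred)
print(scores)
```

Under these assumed counts the baseline reaches 95% accuracy while its recall and F1 on the hate speech class are both zero, which is why per-class measures are usually reported alongside accuracy for skewed data.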
With the advent of Unicode encoding, Urdu content written in Roman script is growing
rapidly on social networking sites. Roman Urdu is a non-normative language. The written script of