Edición Especial Special Issue Mayo 2019
DOI: http://dx.doi.org/10.17993/3ctecno.2019.specialissue2.100-115
1. INTRODUCTION
Sindhi is a major language, spoken by about thirty to forty million people in
Pakistan. It is widely used on the internet, and the number of Sindhi
newspapers, literary websites and blogs is increasing daily. Lexicons, fonts
and common word-processing tools are available to NLP researchers, which is
evidence of the language's usage and popularity online. However, resources such
as linguistic corpora have not yet been initiated to enhance Sindhi language
processing.
Sindhi is written in Perso-Arabic, Devanagari and Roman letters. Devanagari
letters are also used for Sindhi in India. Likewise, the Roman script is gaining
popularity for Sindhi: it is used for communication on smartphones, cell phones
and the internet, although very few documents are available in it. Unfortunately,
linguistic corpora and a detailed computational lexicon have still not been
initiated, although they are essential for the development of Sindhi language
processing resources. In fact, abundant written material in Sindhi is available
both offline and online. The Sindhi corpus has been built in Perso-Arabic script
using UTF-16 encoding. The following sections discuss Sindhi orthography and the
corpus script, the issues of corpus construction and preprocessing, the results
of an initial statistical analysis, and Pakistani language corpora. Finally, the
conclusion discusses future work (Mahar & Memon, 2010).
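As a minimal illustration of what storing Perso-Arabic Sindhi text in UTF-16
involves, the sketch below encodes one Sindhi word and decodes it again. It is
not the corpus-building code; the sample word and the variable names are
assumptions chosen only for this example.

```python
# Illustrative sketch only: round-trip a Sindhi word through UTF-16.
# The sample word and all names here are hypothetical, not from the corpus.

# The word "Sindhi" in Perso-Arabic script, written with escapes so the
# example does not depend on the editor's own file encoding.
word = "\u0633\u0646\u068c\u064a"

# Python's "utf-16" codec prepends a 2-byte byte-order mark (BOM) and then
# writes each of these characters as one 2-byte code unit.
encoded = word.encode("utf-16")

# Decoding reads the BOM to pick the byte order and restores the text.
decoded = encoded.decode("utf-16")

# Every character of this word lies in the Unicode Arabic block
# (U+0600 to U+06FF), which also holds the Sindhi-specific letters.
in_arabic_block = all(0x0600 <= ord(ch) <= 0x06FF for ch in word)

print(len(word), len(encoded), decoded == word, in_arabic_block)
```

Each letter here occupies two bytes in UTF-16, so file sizes for such text are
roughly twice the character count plus the BOM.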
2. PREVIOUS WORK
As far as Sindhi language processing resources are concerned, apart from a few
digital dictionaries, keyboard designs and fonts, such resources are not
generally and publicly available. Studies or development projects for resources
such as a comprehensive computational lexicon and linguistic corpora have not
even been initiated for Sindhi. For the improvement of linguistic corpora of
various languages of Pakistan, different research organizations and individuals are