Published in 3C Tecnología. Special Issue – May 2019
This paper describes the present state of Sindhi corpus construction in detail. Issues such as corpus acquisition, tokenization, and preprocessing are analyzed and discussed with a view to enhancing the Sindhi corpus. Initial observations and results are reported for letter unigram, bigram, and trigram frequencies. The present status of the Sindhi corpus is discussed in terms of its limitations and future work. Orthography and script are also explored with reference to corpus development. The word corpus was first used by a German scholar (Das Corpus). The plural of corpus is corpora, a term used for huge collections of text, on the scale of millions or billions of text items. Natural Language Processing has been a very challenging task because resources for computational linguistics and research are scarce. Text corpora have been built for many languages in different countries; after reviewing these corpora, we are working to build a corpus for the Sindhi language.
Keywords: NLP, Corpora, Linguistic, Lexicon, Phoneme.
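The letter unigram, bigram, and trigram frequencies mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `letters_only` and `letter_ngram_counts`, the whitespace tokenization, the choice to count n-grams within word boundaries only, and the sample phrase are all assumptions made for illustration. Characters are kept only if Unicode classifies them as letters, so punctuation, digits, and combining diacritics are dropped.

```python
from collections import Counter
import unicodedata


def letters_only(token: str) -> str:
    """Keep only characters whose Unicode general category is a letter (L*).

    Assumption: punctuation, digits, and combining marks are discarded.
    """
    return "".join(ch for ch in token if unicodedata.category(ch).startswith("L"))


def letter_ngram_counts(text: str, n: int) -> Counter:
    """Count letter n-grams inside each whitespace-separated token.

    Assumption: n-grams do not span word boundaries.
    """
    counts = Counter()
    for token in text.split():
        word = letters_only(token)
        # Slide a window of width n across the cleaned word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


# Illustrative sample only ("Sindhi language"); a real run would read corpus files.
sample = "سنڌي ٻولي"

for n in (1, 2, 3):
    print(n, letter_ngram_counts(sample, n).most_common(3))
```

Dividing each count by the total number of n-grams gives relative frequencies, which are easier to compare across corpora of different sizes.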