4.2. MUTUAL INFORMATION
Mutual information (MI) is widely used in statistical language modeling. Let $A$ denote the number of documents that contain the lexeme $t$ and belong to category $c$, $B$ the number of documents that contain $t$ but do not belong to $c$, $C$ the number of documents that belong to $c$ but do not contain $t$, and $N$ the total number of documents in the corpus. The mutual information of $t$ and $c$ can then be calculated by the following formula:

$$I(t, c) = \log\frac{A \times N}{(A + C) \times (A + B)} \qquad (6)$$
A single score per lexeme is obtained by taking the maximum over all categories, $MI_{\max}(t) = \max_{1 \le i \le m} I(t, c_i)$, where $m$ is the number of categories. The words scoring below a specific threshold are removed from the original feature space, which reduces its dimension, while the words above the threshold are retained.
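To make the calculation concrete, the sketch below scores a term from the four document counts of formula (6) and combines the per-category scores by the maximum before thresholding. The counts, the threshold value, and the helper names are illustrative, not taken from the paper.

```python
import math

def mutual_information(A, B, C, N):
    """Estimate I(t, c) = log(A*N / ((A+C)*(A+B))) from document counts.

    A: documents containing term t and belonging to category c
    B: documents containing t but not belonging to c
    C: documents belonging to c but not containing t
    N: total number of documents in the corpus
    """
    if A == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log((A * N) / ((A + C) * (A + B)))

def mi_max(counts_per_category, N):
    """Combine per-category scores by taking the maximum over all m categories."""
    return max(mutual_information(A, B, C, N) for (A, B, C) in counts_per_category)

# Hypothetical counts for one term across m = 2 categories: (A, B, C) per category.
counts = [(40, 10, 60), (5, 45, 95)]
score = mi_max(counts, N=200)

threshold = 1.0  # illustrative cut-off; terms scoring below it are dropped
print(score, score >= threshold)
```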
In conclusion, although CHI and MI perform well on English text classification, their performance is far inferior to that of document frequency (DF). Careful analysis shows that this difference arises for two reasons: feature extraction methods that use category information rely on low-frequency words, and Chinese yields a feature space of higher dimension than English.
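For contrast, DF selection needs only raw document counts and naturally favors high-frequency words. A minimal sketch, with a hypothetical `min_df` threshold and toy documents:

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only terms whose document frequency meets the threshold.

    docs: list of tokenized documents (each a list of words).
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return {t for t, f in df.items() if f >= min_df}

docs = [["text", "mining", "corpus"], ["text", "feature"], ["corpus", "text"]]
print(df_filter(docs, min_df=2))  # {'text', 'corpus'}
```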
4.3. REPEATED STRING FEATURE EXTRACTION
Most clustering algorithms treat a document merely as a collection of words, completely ignoring the order and co-occurrence relationships between words, which may carry important information for document clustering. We therefore extract key repeated strings from the whole document collection as text features. To extract most of the repeated strings quickly, we introduce the notion of maximal repeated strings. Maximal repeats capture all the meaningful repeated structures in the strings in a very concise way while avoiding a large amount of unnecessary output. Non-maximal repeated strings need not be reported, because each of them must be contained in some maximal repeated string. However, not all maximal repeated strings are useful: many are only fragments of phrases, semantically incomplete and meaningless, so further filtering is needed to retain only the meaningful and interesting ones.
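A brute-force illustration of the idea, assuming small inputs (practical systems use suffix arrays or suffix trees instead): a repeated string counts as maximal only if it cannot be extended by one character on either side without losing occurrences. The function names and length bounds are illustrative.

```python
from collections import defaultdict

def repeated_substrings(s, min_len=2, max_len=8):
    """Collect every substring (within the length bounds) occurring at least twice."""
    occ = defaultdict(list)
    for i in range(len(s)):
        for L in range(min_len, min(max_len, len(s) - i) + 1):
            occ[s[i:i + L]].append(i)
    return {sub: pos for sub, pos in occ.items() if len(pos) >= 2}

def maximal_repeats(s, **kw):
    """Keep a repeat only if no single-character extension preserves all its
    occurrences; otherwise it is subsumed by a longer maximal repeated string."""
    reps = repeated_substrings(s, **kw)
    maximal = {}
    for sub, pos in reps.items():
        left = {s[i - 1] for i in pos if i > 0}
        right = {s[i + len(sub)] for i in pos if i + len(sub) < len(s)}
        left_ext = len(left) == 1 and all(i > 0 for i in pos)
        right_ext = len(right) == 1 and all(i + len(sub) < len(s) for i in pos)
        if not left_ext and not right_ext:
            maximal[sub] = len(pos)
    return maximal

# "ab" and "bc" are subsumed by "abc", which is the only maximal repeat here.
print(maximal_repeats("abcxabcyabc"))  # {'abc': 3}
```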
The algorithm first scans each document in the corpus and removes stop words and non-word symbols such as numerals and punctuation. Each document is treated as a string, and all documents are concatenated into one pseudo-document. Each word is converted to a 2-byte integer so that each English word or Chinese character can be treated as a single unit. At the same time, each position in the resulting string is mapped to the number of the document to which the character belongs, and documents are separated by a special boundary symbol that does not appear in any of the original documents. Since a substring that crosses a document boundary is obviously meaningless, we restrict the algorithm to finding repeated strings within individual documents.
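The following sketch mirrors these preprocessing steps under stated assumptions: the stop-word list is illustrative, dictionary ids stand in for the paper's 2-byte integer codes, and id 0 plays the role of the boundary symbol.

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}  # illustrative stop-word list
SEP = 0  # reserved separator id; it occurs in no document

def build_pseudo_document(docs):
    """Concatenate cleaned documents into one integer sequence.

    Returns the sequence, a per-position document index, and the vocab map.
    Tokens are mapped to integer ids (standing in for 2-byte codes), and the
    reserved separator id keeps repeats from crossing document boundaries.
    """
    vocab = {}
    seq, doc_of = [], []
    for doc_id, text in enumerate(docs):
        tokens = [t for t in re.findall(r"\w+", text.lower())
                  if t not in STOP_WORDS and not t.isdigit()]
        for t in tokens:
            seq.append(vocab.setdefault(t, len(vocab) + 1))  # ids start at 1
            doc_of.append(doc_id)
        seq.append(SEP)       # document boundary marker
        doc_of.append(doc_id)
    return seq, doc_of, vocab

seq, doc_of, vocab = build_pseudo_document(["the cat sat", "a cat ran 42"])
print(seq, doc_of, vocab)  # [1, 2, 0, 1, 3, 0] [0, 0, 0, 1, 1, 1] {...}
```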