AN OPTIMIZED DEEP NEURAL NETWORK-BASED FINANCIAL
STATEMENT FRAUD DETECTION IN TEXT MINING
Ajit Kr. Singh Yadav
Assistant Professor, Department of Computer Science and Engineering,
NERIST, Itanagar, Arunachal Pradesh (India), and Research Scholar,
Department of Computer Science and Engineering,
Rajiv Gandhi University, Itanagar, Arunachal Pradesh (India).
E-mail: ajityadav101@rediffmail.com ORCID: https://orcid.org/0000-0002-2208-0828
Marpe Sora
Associate Professor, Department of Computer Science and Engineering,
Rajiv Gandhi University, Itanagar, Arunachal Pradesh (India).
E-mail: marpe.sora@rgu.ac.in ORCID: https://orcid.org/0000-0003-0159-5416
Received: 23/07/2021 Accepted: 03/11/2021 Published: 24/11/2021
Suggested citation:
Singh, A. K., & Sora, M. (2021). An optimized deep neural network-based financial statement fraud detection in text mining. 3C Empresa. Investigación y pensamiento crítico, 10(4), 77-105. https://doi.org/10.17993/3cemp.2021.100448.77-105
ABSTRACT
Identifying Financial Statement Fraud (FSF) events is a crucial task in text mining. The research community has mostly utilized data mining methods for detecting FSF, and in this direction most studies have relied on quantitative data, i.e., financial ratios, for detecting fraud in financial statements. Little research has investigated the text itself, such as the auditors' remarks present in published reports. For this reason, this paper develops an optimized deep neural network-based FSF detection method for the qualitative data present in financial reports. The text is first pre-processed using filtering, lemmatization, and tokenization. Feature selection is then performed with the Harris Hawks Optimization (HHO) algorithm. Finally, a Deep Neural Network based on Deer Hunting Optimization (DNN-DHO) is utilized to classify each financial statement as a fraud or non-fraud report. The developed FSF detection methodology was executed in a Python environment using financial statement datasets. The developed approach achieves high classification accuracy (96%) in comparison with standard classifiers such as DNN, CART, LR, SVM, Bayes, BP-NN, and KNN, and it also provides better outcomes in all performance metrics.
KEYWORDS
Financial statements, Fraud, Non-fraud, Text mining, Deep neural network, Deer hunting optimization.
1. INTRODUCTION
Financial fraud is a major challenge for different administrations across industries and in several states, as it inflicts vast damage on business. Due to financial fraud, billions of dollars are lost every year; Bank of America, for instance, agreed to pay $16.5 billion to settle a financial fraud case (Rezaee & Kedia, 2012). Material omissions resulting from an intentional failure to report financial data in accordance with generally accepted accounting principles are termed FSF (Dalnial et al., 2014). Companies publish financial statements that include textual data, in the form of auditors' remarks, alongside records of financial ratios. The qualitative data contain indicators of fraudulent financial reporting in the form of intentionally placed idioms: the agents use adverbial phrases, selective sentence constructions, and selective adjectives to cover up fraudulent activity (Throckmorton et al., 2015; Song et al., 2014). Financial statement users and regulators expect external auditors to identify fraudulent financial reporting. Financial statements are an organization's elementary documents, reflecting its fiscal position (Kanapickienė & Grundienė, 2015).
A careful analysis of the financial accounts can indicate whether a corporation is running efficiently or is in crisis. If the corporation is in crisis, the financial accounts can show whether the most critical issue the organization faces concerns profit, cash, or something else (Perols & Lougee, 2011). Most organizations are required to publish their financial statements every quarter and every year (Gray & Debreceny, 2014).
FSF can be committed to inflate stock values or to acquire loans from banks. It may be done to distribute smaller profits to investors. Another feasible motive might be to avoid the expense of tax assessments (Manurung & Hardika, 2015). Recently, various organizations have been using fraudulent financial reports to cover up their real fiscal position and make self-interested gains at the expense of shareholders. In the detection of FSF, financial ratios are prime elements because they present a clear picture of the financial strength of the corporation (Hajek & Henriques, 2017).
FSF is an illegal activity that damages an organization's economy. When deciding whether to invest in a corporation, the investigation of financial reports helps participants in the investment market (Omar et al., 2014). The data presented in these statements conveys the performance of the company, in terms of fiscal position, to creditors, shareholders, and auditors.
In organizations worldwide, the detection and prevention of FSF have become a significant challenge (Gupta et al., 2012a). When prevention fails, detecting fraudulent financial reporting becomes a challenging issue, even though preventing FSF is the better approach (Asare et al., 2015). Internal and external auditors have a significant role to play in the discovery and prevention of FSF, but they cannot be held solely accountable for its identification and detection (Gupta et al., 2012b). Research on fraud detection and its antecedents is significant because it adds to the understanding of fraud. It has the potential to enhance the capability of auditors and regulators to identify fraud, either directly or by serving as a basis for future fraud research that does (Ravisankar et al., 2011). Better fraud detection can help defrauded organizations, and their workers, investors, and creditors, curb the costs linked with fraud and also enhance the efficiency of the market. This knowledge is of interest to auditors when providing assurance about whether financial accounts are free of material misstatements caused by fraud (Ngai et al., 2011), mainly during audit planning and client selection.
Several researchers have analysed quantitative data for the recognition of false financial reporting (Jan, 2018). In this work, therefore, a text mining technique is used to recognize fraud and non-fraud financial reports from the qualitative content of financial statements (Lin et al., 2015). Text mining is the method of extracting significant structured data from unstructured text. It can be used to find fraud or non-fraud reports, and it can also examine the words themselves (Gupta et al., 2012c). At present, extensive data is produced from different sources in the Internet-dependent world, and a vast amount of it is available in an unstructured format. Text mining and data mining methods can enable better decision making when analysing unstructured data (Kumar & Ravi, 2016). Text mining involves different types of tasks, for example, text summarization, web page classification, sentiment analysis, plagiarism detection, malware analysis, document classification, topic detection, patent analysis, etc. In financial statements, the textual data is unstructured (Dong, Liao, & Liang, 2016). Before applying any data mining approach such as classification or clustering, the text must be transformed into structured data, because its raw form is shapeless for the discovery of FSF.
The main contributions of this work are:
Finding a solution to the problem of financial report fraud discovery.
Designing a model for identifying fraudulent and non-fraudulent statements.
Using optimal feature selection approaches to achieve high accuracy.
Modelling a new hybrid classifier for financial statement fraud discovery.
The remainder of this paper is organized as follows: Section 2 reviews recent works related to this paper, Section 3 presents the proposed method to detect FSF, Section 4 provides the simulation outcomes, and the conclusion and future scope are given in Section 5.
2. RELATED WORKS
An interpretable fuzzy rule-based system was presented by Hajek (2019) for detecting FSF. The developed fuzzy rule-based detection approach combines rule extraction with feature selection to manage granularity and rule complexity. A genetic feature selection method is utilized to eliminate irrelevant features. A comparative investigation was performed against evolutionary fuzzy rule-based schemes and FURIA. The developed system offers both desirable interpretability and good accuracy, and the results carry significant implications for auditors and other operators of FSF detection systems.
Fraud detection for the financial reports of business groups was introduced by Chen et al. (2019). The article suggests a methodology for fraud discovery in the financial reports of business groups, aiming to improve the investment welfare of creditors and investors and to lessen investment losses and risks. The study proceeded through the following stages: (i) construct an effective model for fraud discovery in the financial reports of business groups, (ii) apply different fraud detection methods to the financial reports, and (iii) evaluate the developed system.
A Financial Fraudulent Statements (FFS) detection approach was developed by Temponeras et al. (2019) using a deep dense Artificial Neural Network (ANN). The system reviews the financial statements of multiple companies, and the deep dense ANN derives decisions about possible accounting fraud. To accurately classify FFS, data was obtained from 164 Greek companies. The main objective was to test a neural network structure for forecasting FFS. In the FFS classification task, the developed approach provided superior outcomes to earlier classifiers investigated on the Greek data.
CHAID, SVM (Support Vector Machine), and C5.0 were discussed by Chi et al. (2019) for FSF detection. Within an active detection scheme, the C5.0, SVM, and CHAID approaches are applied to the discovery of FSF. The research data was obtained from the Taiwan Economic Journal (TEJ). The source sample contains 28 companies involved in FSF and 84 corporations not involved in such frauds, listed on the Taipei Exchange and the Taiwan Stock Exchange during the investigation period. Before constructing the system, key variables were chosen with C5.0 and SVM. Both non-financial and financial variables are utilized to improve the precision of FSF recognition.
An application of an ensemble Random Forest (RF) classifier was presented by Patel et al. (2019) for identifying financial statement manipulation by Indian listed corporations. Investigators have recently tried different modelling methods for FFS detection. For the experiment, 92 non-FFS and 86 FFS manufacturing corporations were selected. The research data was obtained from the Bombay Stock Exchange for the period 2008-2011. The auditor's report was considered for the identification of non-FFS and FFS companies, and a T-test identified 31 significant financial ratios. The training dataset was employed to train the model, and the trained model was used for classification with good accuracy.
3. METHODOLOGY
A collection of financial statements is considered as the input to the text mining system. Here, both fraud and non-fraud types of financial reports are gathered in order to classify fake financial reports.
The financial statement fraud discovery includes four steps: text pre-processing, feature extraction, feature selection, and text classification. The workflow of the proposed approach is shown in Figure 1.
Figure 1. Overall proposed Methodology.
Source: own elaboration.
In text mining, pre-processing plays a major role: a high-quality pre-processing step yields better results. The pre-processing step includes a number of roles such as filtering, tokenization, and lemmatization, and the words in all documents are transformed into lower case during pre-processing. Then the TF-IDF, LDA, and Word2vec approaches are utilized for feature extraction, which describes the text through a set of measurable dimensions such as word frequency. Feature selection is applied to enhance the performance of the text classifier and to reduce the dimension of the feature set; here, the HHO algorithm is used. Finally, the new hybrid DNN-DHO classifier is proposed to classify the financial statements as fraud or non-fraud. In the DNN, the weights are updated using the DHO algorithm; this hybrid classifier concept minimizes the error during classification.
3.1. PROBLEM STATEMENT
FSF is a serious problem for society, and its detection is a challenging process. FSF is not a victimless crime; it leaves behind genuine economic losses that affect workers, shareholders, and investors. Lost trust in regulators, reduced confidence, and reduced reliability of financial markets are extensive costs to society, leading to high transaction costs and low efficiency. In developing markets, the challenges of doing business combined with investment pressures strengthen the incentives for manipulating financial statements and for avoiding taxes in the home country. Recently, the number of FSF cases has increased; every incident is a heavy disappointment to shareholders and investors, and it costs the public dearly. Therefore, the construction of an efficient scheme to identify FSF is a major concern.
3.2. TEXT PRE-PROCESSING
Text pre-processing is a significant and critical phase in text mining. To mine interesting, non-trivial information from amorphous text data, a pre-processing method is applied. In this phase, the basic units of the text, such as characters, words, and sentences, are recognized and delivered to all further processing phases. The pre-processing steps include a number of roles, for example, filtering, lemmatization, and tokenization.
3.2.1. TOKENIZATION
Tokenization breaks a given text into phrases, words, symbols, or other important components known as tokens. Particular characters, such as punctuation marks, may be thrown away in the process. The main application of this step is to identify the significant keywords.
3.2.2. FILTERING
This process eliminates particular words from the documents. The elimination of stop words is a common filtering approach. Stop words are frequently used common words such as 'this', 'are', and 'and'. They are not relevant for document classification and must therefore be eliminated.
3.2.3. LEMMATIZATION
This process eliminates inflectional endings and returns the base form of a word, which is named the lemma. It relies on the use of a dictionary and the morphological analysis of words.
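As an illustration, the snippet below applies the three roles to a sample auditor remark. NLTK is an assumed library choice here; the paper does not name a specific toolkit.

```python
# Illustrative pre-processing with NLTK (an assumed toolkit choice):
# tokenization, stop-word filtering, and lemmatization of a sample remark.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)  # fetch tokenizer/lemmatizer data if missing

def preprocess(text):
    tokens = word_tokenize(text.lower())              # tokenization (with lower-casing)
    tokens = [t for t in tokens if t.isalpha()]       # drop punctuation and numbers
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]    # filtering: remove stop words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization: reduce to base form

print(preprocess("The auditors noted unusual adjustments in the reported revenues."))
# -> ['auditor', 'noted', 'unusual', 'adjustment', 'reported', 'revenue']
```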
3.3. TEXT FEATURE EXTRACTION
Text feature extraction is the procedure of extracting a list of words from the textual data for feature selection in the classifier. It plays a major role in text classification because it directly impacts the classification accuracy. The following methods are utilized for extracting features from the text data.
3.3.1. TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY (TF-IDF)
TF-IDF is an important weighting method in text mining (Kalra et al., 2019). The number of times a word occurs in a text is denoted as the term frequency. The IDF measures the inverse likelihood of finding a word in a text. TF-IDF thus denotes the significance of a term in a text within a corpus. Here, a document refers to a financial report, a term refers to a single word in a statement, and a corpus refers to the collection of reports. In a document d, the TF-IDF weight of a term t is computed by:
$$\mathrm{tf}(t,d)=\frac{n_{t,d}}{\sum_{t'\in d} n_{t',d}} \tag{1}$$

$$\mathrm{idf}(t)=\log\frac{M}{\left|\{d\in D : t\in d\}\right|} \tag{2}$$

$$\mathrm{tfidf}(t,d)=\mathrm{tf}(t,d)\times\mathrm{idf}(t) \tag{3}$$

where $n_{t,d}$ is the number of occurrences of term $t$ in document $d$ and $M$ is the number of documents in the corpus $D$.
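A direct implementation of Eqs. (1)-(3) as reconstructed above might look as follows; this is a sketch of the standard TF-IDF weighting, not necessarily the exact variant used in the experiments.

```python
# Standard TF-IDF computed directly from Eqs. (1)-(3); the corpus is toy data.
import math

def tf(term, doc):
    return doc.count(term) / len(doc)             # Eq. (1): relative term frequency in d

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)  # number of documents containing the term
    return math.log(len(corpus) / df)             # Eq. (2): inverse document frequency

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)      # Eq. (3): TF-IDF weight of t in d

corpus = [["fraud", "risk", "audit"], ["audit", "report"], ["revenue", "growth"]]
print(round(tf_idf("fraud", corpus[0], corpus), 3))  # 0.366: rare term, high weight
```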
3.3.2. LATENT DIRICHLET ALLOCATION (LDA)
LDA is a topic modelling scheme (Jelodar et al., 2019). It assumes that every text can be defined as a probabilistic distribution over hidden topics. The topic distributions of all documents share a common Dirichlet prior, and the word distributions of the topics share a common Dirichlet prior as well. Assume a corpus D contains M documents, with each document d (d = 1, …, M) having N_d words. The method is based on the following generative procedure:
For each topic t (t = 1, …, T), choose a multinomial distribution φ_t from a Dirichlet distribution with parameter β.
For each document d (d = 1, …, M), select a multinomial distribution θ_d from a Dirichlet distribution with parameter α.
For each word w_n (n = 1, …, N_d) in document d, pick a topic z_n from θ_d and draw the word w_n from φ_{z_n}.
Here, the words are the only observed variables in the documents, while α and β are hyperparameters and φ and θ are hidden variables. The likelihood of the observed data D is calculated by:
$$p(D\mid\alpha,\beta)=\prod_{d=1}^{M}\int p(\theta_d\mid\alpha)\left(\prod_{n=1}^{N_d}\sum_{z_{dn}} p(z_{dn}\mid\theta_d)\,p(w_{dn}\mid z_{dn},\beta)\right)d\theta_d \tag{4}$$
The distributions of words over topics (φ) and the topic Dirichlet priors (α and β) are drawn from the Dirichlet distribution. Here, the number of topics is defined by T, the number of documents is denoted by M, and the size of the vocabulary is denoted by N. The Dirichlet-multinomial pairs (α, θ) and (β, φ) govern the corpus-level topic distributions and the topic-word distributions, respectively. The document-level variables are denoted by θ_d, and the word-level variables are represented by w_dn.
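A minimal sketch of fitting such a topic model is given below. gensim is an assumed implementation choice; its alpha and eta arguments correspond to the Dirichlet priors α and β above, and the three token lists are invented.

```python
# Fitting LDA on toy documents with gensim (an assumed library choice);
# alpha and eta are the Dirichlet priors discussed above.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["fraud", "audit", "misstatement", "audit"],
        ["revenue", "growth", "profit"],
        ["audit", "fraud", "risk"]]
dictionary = Dictionary(docs)                # maps each word to an integer id
bow = [dictionary.doc2bow(d) for d in docs]  # documents as (word_id, count) pairs
lda = LdaModel(bow, num_topics=2, id2word=dictionary,
               alpha="auto", eta="auto", random_state=0)
print(lda.get_document_topics(bow[0]))       # theta_d: topic mixture of document 0
```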
3.3.3. WORD2VEC
In this process, the representation of a word as a vector plays a significant role. The approach is helpful for discovering antonyms, synonyms, and sentences with comparable meaning, and it converts each word into a vector form (Wang, Ma, & Zhang, 2016). It contains two different models for parameter updating: the Continuous Bag of Words (CBOW) model and the skip-gram model. CBOW is used to forecast a word from the context of its surroundings, while skip-gram uses a word's data to forecast its adjacent words. Both methods use three layers: input, projection, and output. Here, the CBOW approach is considered as an instance to clarify the working of word2vec.
A sentence S is assumed as S = {w_1, w_2, …, w_T}, where w_t refers to the target term. Then the input layer is defined as follows:
$$c(v(w_t))=\{v(w_{t-k}),\ldots,v(w_{t-1}),v(w_{t+1}),\ldots,v(w_{t+k})\} \tag{5}$$
where c(v(w_t)) refers to the context of the term v(w_t) within a window of size k. Next, the projection layer is used to construct a contextual vector X_{w_t} by summing the context vectors as follows:
$$X_{w_t}=\sum_{v(w)\in c(v(w_t))} v(w) \tag{6}$$
In the output layer, each word is assigned to a Leaf Node (LN) of a Huffman tree according to its frequency of occurrence in the corpus. Every word has a single path between the Root Node (RN) and its LN. Using the logistic model, the likelihood of choosing the left or right child can be computed at every node except the leaf nodes, which is given by:
$$p\!\left(d_j^{w}\mid X_w,\theta_{j-1}^{w}\right)=\sigma\!\left(X_w^{\top}\theta_{j-1}^{w}\right)^{1-d_j^{w}}\cdot\left(1-\sigma\!\left(X_w^{\top}\theta_{j-1}^{w}\right)\right)^{d_j^{w}} \tag{7}$$
Multiplying these likelihoods at every node on the path, p(v(x_w) | c(v(x_w))) can be learned in the tree, which is given by:
$$p\!\left(v(x_w)\mid c(v(x_w))\right)=\prod_{j=2}^{l_w} p\!\left(d_j^{w}\mid X_w,\theta_{j-1}^{w}\right) \tag{8}$$
Here, the j-th digit in word w's Huffman code is defined by $d_j^w \in \{0,1\}$, and j runs over the nodes on the path of length $l_w$, excluding the LN.
By maximizing the log-likelihood, the objective function can be learned via (9). The gradient descent approach is then utilized to update θ, v(x_w), and the related word vectors.
$$L=\sum_{w}\log p\!\left(v(x_w)\mid c(v(x_w))\right) \tag{9}$$

where the sum runs over the words of the training corpus.
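The sketch below trains CBOW vectors with a hierarchical-softmax (Huffman-tree) output layer, the configuration Eqs. (5)-(9) describe. gensim is an assumed implementation choice, and the three sentences are invented.

```python
# CBOW (sg=0) with hierarchical softmax (hs=1), i.e. the Huffman-tree output
# layer of Eqs. (7)-(9); gensim is an assumed implementation choice.
from gensim.models import Word2Vec

sentences = [["auditor", "noted", "material", "misstatement"],
             ["auditor", "issued", "unqualified", "opinion"],
             ["material", "misstatement", "in", "revenue"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0, hs=1, negative=0, epochs=50)
print(model.wv["auditor"][:5])                        # first 5 dimensions of a word vector
print(model.wv.most_similar("misstatement", topn=2))  # nearest words in the learned space
```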
3.4. FEATURE SELECTION USING HHO
Feature selection is a crucial step for text classification: it is the procedure of choosing a certain subset of terms from the training set, which is then used in the subsequent classification procedure. It also lessens the size of the data, improves the classification accuracy by removing noisy features, eliminates the overfitting problem, and makes the training faster. The HHO algorithm is introduced in the feature selection process to choose the optimal, finest features for text classification. This algorithm analyses the candidate features to obtain the most relevant ones.
3.4.1. HARRIS HAWKS OPTIMIZATION ALGORITHM
HHO is inspired by the behaviour of Harris hawks as they discover prey, perform the surprise pounce, and apply different attack strategies in nature (Heidari et al., 2019). The hawks are denoted as the candidate solutions, and the finest solution is termed the prey. Using their powerful eyes, the Harris hawks attempt to track the prey and execute the surprise pounce to catch the prey once detected. In this process, three feature sets, TF-IDF, LDA, and word2vec, are taken as input. These three feature sets are not equally informative for each text; therefore, HHO is utilized to select the optimal features for the classification of the text.
Generally, HHO includes exploration and exploitation stages, and the algorithm can transition from exploration to exploitation. The exploration behaviour evolves according to the escaping energy of the prey (E), which is given by:
$$E=2E_0\left(1-\frac{t}{T}\right) \tag{10}$$

$$E_0=2r-1 \tag{11}$$
Here the present iteration is denoted by t, the maximum number of iterations is represented by T, the initial energy is defined by E_0, which lies in [-1, 1], and r denotes a random number in [0, 1].
3.4.1.1. EXPLORATION PHASE
In this phase, the location of each hawk is updated through random positions, which can be given as:
$$X(t+1)=\begin{cases}X_k(t)-r_1\left|X_k(t)-2r_2X(t)\right|, & q\geq 0.5\\[4pt]\left(X_r(t)-X_m(t)\right)-r_3\left(lb+r_4(ub-lb)\right), & q<0.5\end{cases} \tag{12}$$
Here, the location of a hawk is defined by X, the location of an arbitrarily chosen hawk is denoted as X_k, and the location of the prey is defined as X_r. The lower and upper limits of the search space are signified by lb and ub, respectively. The five independent random numbers r_1, r_2, r_3, r_4, and q lie in the range [0, 1]. The mean location of the present population of hawks is defined by X_m and is given by:
$$X_m(t)=\frac{1}{N}\sum_{n=1}^{N}X_n(t) \tag{13}$$
Here, the location of the n-th hawk is denoted as X_n and the number of hawks is defined by N.
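A NumPy sketch of this exploration update, following Eqs. (10)-(13) of Heidari et al. (2019), is given below. The binarisation step that turns hawk positions into a selected-feature subset is omitted, since the paper does not detail it.

```python
# Exploration-phase update of HHO, Eqs. (10)-(13); positions are continuous
# scores per feature, and the mapping to a binary feature mask is omitted.
import numpy as np

rng = np.random.default_rng(0)

def hho_exploration_step(X, X_prey, lb, ub, t, T):
    N, dim = X.shape
    Xm = X.mean(axis=0)                        # Eq. (13): mean hawk position
    E0 = 2 * rng.random() - 1                  # Eq. (11): initial energy in [-1, 1]
    E = 2 * E0 * (1 - t / T)                   # Eq. (10): escaping energy of the prey
    X_new = np.empty_like(X)
    for i in range(N):
        r1, r2, r3, r4, q = rng.random(5)
        k = rng.integers(N)                    # index of an arbitrarily chosen hawk
        if q >= 0.5:                           # perch based on a random hawk, Eq. (12) top
            X_new[i] = X[k] - r1 * np.abs(X[k] - 2 * r2 * X[i])
        else:                                  # perch relative to prey and mean, Eq. (12) bottom
            X_new[i] = (X_prey - Xm) - r3 * (lb + r4 * (ub - lb))
    return np.clip(X_new, lb, ub), E

X = rng.random((5, 3))                         # 5 hawks searching over 3 feature scores
X_new, E = hho_exploration_step(X, X_prey=X[0], lb=0.0, ub=1.0, t=1, T=100)
print(X_new.shape, round(E, 3))
```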