AN OPTIMIZED DEEP NEURAL NETWORK-BASED FINANCIAL
STATEMENT FRAUD DETECTION IN TEXT MINING
Ajit Kr. Singh Yadav
Assistant Professor, Department of Computer Science and Engineering,
NERIST, Itanagar, Arunachal Pradesh (India), and Research Scholar,
Department of Computer Science and Engineering,
Rajiv Gandhi University, Itanagar, Arunachal Pradesh (India).
E-mail: ajityadav101@rediffmail.com ORCID: https://orcid.org/0000-0002-2208-0828
Marpe Sora
Associate Professor, Department of Computer Science and Engineering,
Rajiv Gandhi University, Itanagar, Arunachal Pradesh (India).
E-mail: marpe.sora@rgu.ac.in ORCID: https://orcid.org/0000-0003-0159-5416
Received: 23/07/2021 Accepted: 03/11/2021 Published: 24/11/2021
Suggested citation:
Singh, A. K., & Sora, M. (2021). An optimized deep neural network-based financial statement fraud detection in text mining. 3C Empresa. Investigación y pensamiento crítico, 10(4), 77-105. https://doi.org/10.17993/3cemp.2021.100448.77-105
ABSTRACT
Identifying Financial Statement Fraud (FSF) events is a crucial task in text mining. The research community has mostly utilized data mining methods for detecting FSF, and in this direction most studies have relied on quantitative data, i.e., financial ratios, for detecting fraud in financial statements. Little research has investigated the text itself, such as the auditors' remarks present in published reports. For this reason, this paper develops an optimized deep neural network-based FSF detection method for the qualitative data present in financial reports. The text is first pre-processed using filtering, lemmatization, and tokenization. Feature selection is then performed with the Harris Hawks Optimization (HHO) algorithm. Finally, a Deep Neural Network based on Deer Hunting Optimization (DNN-DHO) is utilized to classify each financial statement as a fraud or non-fraud report. The developed FSF detection methodology was executed in a Python environment using financial statement datasets. The developed approach achieves high classification accuracy (96%) in comparison with standard classifiers such as DNN, CART, LR, SVM, Bayes, BP-NN, and KNN, and it also provides better outcomes in all performance metrics.
KEYWORDS
Financial statements, Fraud, Non-fraud, Text mining, Deep neural network, Deer hunting optimization.
1. INTRODUCTION
Financial fraud is a major challenge for different administrations across industries and in several states, as it inflicts vast damage on business. Due to financial fraud, billions of dollars are lost every year; Bank of America, for instance, agreed to pay $16.5 billion to settle a financial fraud case (Rezaee & Kedia, 2012). Material omissions resulting from an intentional failure to report financial data in accordance with generally accepted accounting principles are termed FSF (Dalnial et al., 2014). Companies publish financial statements that include textual data, in the form of auditors' remarks, alongside records of financial ratios. The qualitative data contain indicators of fraudulent financial reporting in the form of intentionally placed idioms: the agents use adverbial phrases, selective sentence constructions, and selective adjectives to cover up fraudulent activity (Throckmorton et al., 2015; Song et al., 2014). Financial statement users and regulators expect external auditors to identify fraudulent financial reporting. Financial statements are an organization's elementary documents, reflecting its fiscal position (Kanapickienė & Grundienė, 2015).
A careful analysis of the financial accounts can indicate whether a corporation is running efficiently or is in crisis. If the corporation is in crisis, the financial accounts can show whether the most critical issue the organization faces concerns profit, cash, or something else (Perols & Lougee, 2011). Most organizations are required to publish their financial statements every quarter and every year (Gray & Debreceny, 2014).
FSF can be committed to inflate stock values or to acquire loans from banks. It may be done to distribute smaller profits to investors. Another feasible motive might be to avoid the expense of tax assessments (Manurung & Hardika, 2015). Recently, various organizations have been using fraudulent financial reports to cover up their real fiscal position and make self-interested gains at the expense of shareholders. In the detection of FSF, financial ratios are prime elements because they present a clear picture of the financial strength of the corporation (Hajek & Henriques, 2017).
FSF is an illegal activity that damages an organization's economy. When deciding whether to invest in a corporation, the investigation of financial reports helps participants in the investment market (Omar et al., 2014). The data presented in these statements conveys the performance of the company, in terms of fiscal position, to creditors, shareholders, and auditors.
In organizations worldwide, the detection and prevention of FSF have become a significant challenge (Gupta et al., 2012a). When prevention fails, detecting fraudulent financial reporting becomes a challenging issue, even though preventing FSF is the better approach (Asare et al., 2015). Internal and external auditors have a significant role to play in the discovery and prevention of FSF, but they cannot be held solely accountable for its identification and detection (Gupta et al., 2012b). Research on fraud detection and its antecedents is significant because it adds to the understanding of fraud. It has the potential to enhance the capability of auditors and regulators to identify fraud, either directly or by serving as a basis for future fraud research that does (Ravisankar et al., 2011). Better fraud detection can help defrauded organizations, and their workers, investors, and creditors, curb the costs linked with fraud and also enhance the efficiency of the market. This knowledge is of interest to auditors when providing assurance about whether financial accounts are free of material misstatements caused by fraud (Ngai et al., 2011), mainly during audit planning and client selection.
Several researchers have analysed quantitative data for the recognition of false financial reporting (Jan, 2018). In this work, therefore, a text mining technique is used to recognize fraud and non-fraud financial reports from the qualitative content of financial statements (Lin et al., 2015). Text mining is the method of extracting significant structured data from unstructured text. It can be used to find fraud or non-fraud reports, and it can also examine the words themselves (Gupta et al., 2012c). At present, extensive data is produced from different sources in the Internet-dependent world, and a vast amount of it is available in an unstructured format. Text mining and data mining methods can enable better decision making when analysing unstructured data (Kumar & Ravi, 2016). Text mining involves different types of tasks, for example, text summarization, web page classification, sentiment analysis, plagiarism detection, malware analysis, document classification, topic detection, patent analysis, etc. In financial statements, the textual data is unstructured (Dong, Liao, & Liang, 2016). Before applying any data mining approach such as classification or clustering, the text must be transformed into structured data, because its raw form is shapeless for the discovery of FSF.
The main contributions of this work are:
Finding a solution to the problem of financial report fraud discovery.
Designing a model for identifying fraudulent and non-fraudulent statements.
Using optimal feature selection approaches to achieve high accuracy.
Modelling a new hybrid classifier for financial statement fraud discovery.
The remainder of this paper is organized as follows: Section 2 reviews recent works related to this paper, Section 3 presents the proposed method to detect FSF, Section 4 provides the simulation outcomes, and the conclusion and future scope are given in Section 5.
2. RELATED WORKS
An interpretable fuzzy rule-based system was presented by Hajek (2019) for detecting FSF. The developed fuzzy rule-based detection approach combines rule extraction with feature selection to manage granularity and rule complexity. A genetic feature selection method is utilized to eliminate irrelevant features. A comparative investigation was performed against evolutionary fuzzy rule-based schemes and FURIA. The developed system offers both desirable interpretability and good accuracy, and the results carry significant implications for auditors and other operators of FSF detection systems.
Fraud detection for the financial reports of business groups was introduced by Chen et al. (2019). The article suggests a methodology for fraud discovery in the financial reports of business groups, aiming to improve the investment welfare of creditors and investors and to lessen investment losses and risks. The study proceeded through the following stages: (i) construct an effective model for fraud discovery in the financial reports of business groups, (ii) apply different fraud detection methods to the financial reports, and (iii) evaluate the developed system.
A Financial Fraudulent Statements (FFS) detection approach was developed by Temponeras et al. (2019) using a deep dense Artificial Neural Network (ANN). The system reviews the financial statements of multiple companies, and the deep dense ANN derives decisions about possible accounting fraud. To accurately classify FFS, data was obtained from 164 Greek companies. The main objective was to test a neural network structure for forecasting FFS. In the FFS classification task, the developed approach provided superior outcomes to earlier classifiers investigated on the Greek data.
CHAID, SVM (Support Vector Machine), and C5.0 were discussed by Chi et al. (2019) for FSF detection. Within an active detection scheme, the C5.0, SVM, and CHAID approaches are applied to the discovery of FSF. The research data was obtained from the Taiwan Economic Journal (TEJ). The source sample contains 28 companies involved in FSF and 84 corporations not involved in such frauds, listed on the Taipei Exchange and the Taiwan Stock Exchange during the investigation period. Before constructing the system, key variables were chosen with C5.0 and SVM. Both non-financial and financial variables are utilized to improve the precision of FSF recognition.
An application of an ensemble Random Forest (RF) classifier was presented by Patel et al. (2019) for identifying financial statement manipulation by Indian listed corporations. Investigators have recently tried different modelling methods for FFS detection. For the experiment, 92 non-FFS and 86 FFS manufacturing corporations were selected. The research data was obtained from the Bombay Stock Exchange for the period 2008-2011. The auditor's report was considered for the identification of non-FFS and FFS companies, and a T-test identified 31 significant financial ratios. The training dataset was employed to train the model, and the trained model was used for classification with good accuracy.
3. METHODOLOGY
A collection of financial statements is considered as the input to the text mining system. Here, both fraud and non-fraud types of financial reports are gathered in order to classify fake financial reports.
The financial statement fraud discovery includes four steps: text pre-processing, feature extraction, feature selection, and text classification. The workflow of the proposed approach is shown in Figure 1.
Figure 1. Overall proposed Methodology.
Source: own elaboration.
In text mining, pre-processing plays a major role: a high-quality pre-processing step yields better results. The pre-processing step includes a number of roles such as filtering, tokenization, and lemmatization, and the words in all documents are transformed into lower case during pre-processing. Then the TF-IDF, LDA, and Word2vec approaches are utilized for feature extraction, which describes the text through a set of measurable dimensions such as word frequency. Feature selection is applied to enhance the performance of the text classifier and to reduce the dimension of the feature set; here, the HHO algorithm is used. Finally, the new hybrid DNN-DHO classifier is proposed to classify the financial statements as fraud or non-fraud. In the DNN, the weights are updated using the DHO algorithm; this hybrid classifier concept minimizes the error during classification.
3.1. PROBLEM STATEMENT
FSF is a serious problem for society, and its detection is a challenging process. FSF is not a victimless crime; it leaves behind genuine economic losses that affect workers, shareholders, and investors. Lost trust in regulators, reduced confidence, and reduced reliability of financial markets are extensive costs to society, leading to high transaction costs and low efficiency. In developing markets, the challenges of doing business combined with investment pressures strengthen the incentives for manipulating financial statements and for avoiding taxes in the home country. Recently, the number of FSF cases has increased; every incident is a heavy disappointment to shareholders and investors, and it costs the public dearly. Therefore, the construction of an efficient scheme to identify FSF is a major concern.
3.2. TEXT PRE-PROCESSING
Text pre-processing is a significant and critical phase in text mining. To mine interesting, non-trivial information from amorphous text data, a pre-processing method is applied. In this phase, the basic units of the text, such as characters, words, and sentences, are recognized and delivered to all further processing phases. The pre-processing steps include a number of roles, for example, filtering, lemmatization, and tokenization.
3.2.1. TOKENIZATION
Tokenization breaks a given text into phrases, words, symbols, or other important components known as tokens. Particular characters, such as punctuation marks, may be thrown away in the process. The main application of this step is to identify the significant keywords.
3.2.2. FILTERING
This process eliminates particular words from the documents. The elimination of stop words is a common filtering approach. Stop words are frequently used common words such as 'this', 'are', and 'and'. They are not relevant for document classification and must therefore be eliminated.
3.2.3. LEMMATIZATION
This process eliminates inflectional endings and returns the base form of a word, which is named the lemma. It relies on the use of a dictionary and the morphological analysis of words.
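As an illustration, the snippet below applies the three roles to a sample auditor remark. NLTK is an assumed library choice here; the paper does not name a specific toolkit.

```python
# Illustrative pre-processing with NLTK (an assumed toolkit choice):
# tokenization, stop-word filtering, and lemmatization of a sample remark.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)  # fetch tokenizer/lemmatizer data if missing

def preprocess(text):
    tokens = word_tokenize(text.lower())              # tokenization (with lower-casing)
    tokens = [t for t in tokens if t.isalpha()]       # drop punctuation and numbers
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]    # filtering: remove stop words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization: reduce to base form

print(preprocess("The auditors noted unusual adjustments in the reported revenues."))
# -> ['auditor', 'noted', 'unusual', 'adjustment', 'reported', 'revenue']
```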
3.3. TEXT FEATURE EXTRACTION
Text feature extraction is the procedure of extracting a list of words from the textual data for feature selection in the classifier. It plays a major role in text classification because it directly impacts the classification accuracy. The following methods are utilized for extracting features from the text data.
3.3.1. TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY (TF-IDF)
TF-IDF is an important weighting method in text mining (Kalra et al., 2019). The number of times a word occurs in a text is denoted as the term frequency. The IDF measures the inverse likelihood of finding a word in a text. TF-IDF thus denotes the significance of a term in a text within a corpus. Here, a document refers to a financial report, a term refers to a single word in a statement, and a corpus refers to the collection of reports. In a document d, the TF-IDF weight of a term t is computed by:
$$\mathrm{tf}(t,d)=\frac{n_{t,d}}{\sum_{t'\in d} n_{t',d}} \tag{1}$$

$$\mathrm{idf}(t)=\log\frac{M}{\left|\{d\in D : t\in d\}\right|} \tag{2}$$

$$\mathrm{tfidf}(t,d)=\mathrm{tf}(t,d)\times\mathrm{idf}(t) \tag{3}$$

where $n_{t,d}$ is the number of occurrences of term $t$ in document $d$ and $M$ is the number of documents in the corpus $D$.
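A direct implementation of Eqs. (1)-(3) as reconstructed above might look as follows; this is a sketch of the standard TF-IDF weighting, not necessarily the exact variant used in the experiments.

```python
# Standard TF-IDF computed directly from Eqs. (1)-(3); the corpus is toy data.
import math

def tf(term, doc):
    return doc.count(term) / len(doc)             # Eq. (1): relative term frequency in d

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)  # number of documents containing the term
    return math.log(len(corpus) / df)             # Eq. (2): inverse document frequency

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)      # Eq. (3): TF-IDF weight of t in d

corpus = [["fraud", "risk", "audit"], ["audit", "report"], ["revenue", "growth"]]
print(round(tf_idf("fraud", corpus[0], corpus), 3))  # 0.366: rare term, high weight
```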
3.3.2. LATENT DIRICHLET ALLOCATION (LDA)
LDA is a topic modelling scheme (Jelodar et al., 2019). It assumes that every text can be defined as a probabilistic distribution over hidden topics. The topic distributions of all documents share a common Dirichlet prior, and the word distributions of the topics share a common Dirichlet prior as well. Assume a corpus D contains M documents, with each document d (d = 1, …, M) having N_d words. The method is based on the following generative procedure:
For each topic t (t = 1, …, T), choose a multinomial distribution φ_t from a Dirichlet distribution with parameter β.
For each document d (d = 1, …, M), select a multinomial distribution θ_d from a Dirichlet distribution with parameter α.
For each word w_n (n = 1, …, N_d) in document d, pick a topic z_n from θ_d and draw the word w_n from φ_{z_n}.
Here, the words are the only observed variables in the documents, while α and β are hyperparameters and φ and θ are hidden variables. The likelihood of the observed data D is calculated by:
$$p(D\mid\alpha,\beta)=\prod_{d=1}^{M}\int p(\theta_d\mid\alpha)\left(\prod_{n=1}^{N_d}\sum_{z_{dn}} p(z_{dn}\mid\theta_d)\,p(w_{dn}\mid z_{dn},\beta)\right)d\theta_d \tag{4}$$
The distributions of words over topics (φ) and the topic Dirichlet priors (α and β) are drawn from the Dirichlet distribution. Here, the number of topics is defined by T, the number of documents is denoted by M, and the size of the vocabulary is denoted by N. The Dirichlet-multinomial pairs (α, θ) and (β, φ) govern the corpus-level topic distributions and the topic-word distributions, respectively. The document-level variables are denoted by θ_d, and the word-level variables are represented by w_dn.
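A minimal sketch of fitting such a topic model is given below. gensim is an assumed implementation choice; its alpha and eta arguments correspond to the Dirichlet priors α and β above, and the three token lists are invented.

```python
# Fitting LDA on toy documents with gensim (an assumed library choice);
# alpha and eta are the Dirichlet priors discussed above.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["fraud", "audit", "misstatement", "audit"],
        ["revenue", "growth", "profit"],
        ["audit", "fraud", "risk"]]
dictionary = Dictionary(docs)                # maps each word to an integer id
bow = [dictionary.doc2bow(d) for d in docs]  # documents as (word_id, count) pairs
lda = LdaModel(bow, num_topics=2, id2word=dictionary,
               alpha="auto", eta="auto", random_state=0)
print(lda.get_document_topics(bow[0]))       # theta_d: topic mixture of document 0
```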
3.3.3. WORD2VEC
In this process, the representation of a word as a vector plays a significant role. The approach is helpful for discovering antonyms, synonyms, and sentences with comparable meaning, and it converts each word into a vector form (Wang, Ma, & Zhang, 2016). It contains two different models for parameter updating: the Continuous Bag of Words (CBOW) model and the skip-gram model. CBOW is used to forecast a word from the context of its surroundings, while skip-gram uses a word's data to forecast its adjacent words. Both methods use three layers: input, projection, and output. Here, the CBOW approach is considered as an instance to clarify the working of word2vec.
A sentence S is assumed as S = {w_1, w_2, …, w_T}, where w_t refers to the target term. Then the input layer is defined as follows:
$$c(v(w_t))=\{v(w_{t-k}),\ldots,v(w_{t-1}),v(w_{t+1}),\ldots,v(w_{t+k})\} \tag{5}$$
where c(v(w_t)) refers to the context of the term v(w_t) within a window of size k. Next, the projection layer is used to construct a contextual vector X_{w_t} by summing the context vectors as follows:
$$X_{w_t}=\sum_{v(w)\in c(v(w_t))} v(w) \tag{6}$$
In the output layer, each word is assigned to a Leaf Node (LN) of a Huffman tree according to its frequency of occurrence in the corpus. Every word has a single path between the Root Node (RN) and its LN. Using the logistic model, the likelihood of choosing the left or right child can be computed at every node except the leaf nodes, which is given by:
$$p\!\left(d_j^{w}\mid X_w,\theta_{j-1}^{w}\right)=\sigma\!\left(X_w^{\top}\theta_{j-1}^{w}\right)^{1-d_j^{w}}\cdot\left(1-\sigma\!\left(X_w^{\top}\theta_{j-1}^{w}\right)\right)^{d_j^{w}} \tag{7}$$
Multiplying these likelihoods at every node on the path, p(v(x_w) | c(v(x_w))) can be learned in the tree, which is given by:
$$p\!\left(v(x_w)\mid c(v(x_w))\right)=\prod_{j=2}^{l_w} p\!\left(d_j^{w}\mid X_w,\theta_{j-1}^{w}\right) \tag{8}$$
Here, the j-th digit in word w's Huffman code is defined by $d_j^w \in \{0,1\}$, and j runs over the nodes on the path of length $l_w$, excluding the LN.
By maximizing the log-likelihood, the objective function can be learned via (9). The gradient descent approach is then utilized to update θ, v(x_w), and the related word vectors.
$$L=\sum_{w}\log p\!\left(v(x_w)\mid c(v(x_w))\right) \tag{9}$$

where the sum runs over the words of the training corpus.
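The sketch below trains CBOW vectors with a hierarchical-softmax (Huffman-tree) output layer, the configuration Eqs. (5)-(9) describe. gensim is an assumed implementation choice, and the three sentences are invented.

```python
# CBOW (sg=0) with hierarchical softmax (hs=1), i.e. the Huffman-tree output
# layer of Eqs. (7)-(9); gensim is an assumed implementation choice.
from gensim.models import Word2Vec

sentences = [["auditor", "noted", "material", "misstatement"],
             ["auditor", "issued", "unqualified", "opinion"],
             ["material", "misstatement", "in", "revenue"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0, hs=1, negative=0, epochs=50)
print(model.wv["auditor"][:5])                        # first 5 dimensions of a word vector
print(model.wv.most_similar("misstatement", topn=2))  # nearest words in the learned space
```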
3.4. FEATURE SELECTION USING HHO
Feature selection is a crucial step for text classification: it is the procedure of choosing a certain subset of terms from the training set, which is then used in the subsequent classification procedure. It also lessens the size of the data, improves the classification accuracy by removing noisy features, eliminates the overfitting problem, and makes the training faster. The HHO algorithm is introduced in the feature selection process to choose the optimal, finest features for text classification. This algorithm analyses the candidate features to obtain the most relevant ones.
3.4.1. HARRIS HAWKS OPTIMIZATION ALGORITHM
HHO is inspired by the behaviour of Harris hawks as they discover prey, perform the surprise pounce, and apply different attack strategies in nature (Heidari et al., 2019). The hawks are denoted as the candidate solutions, and the finest solution is termed the prey. Using their powerful eyes, the Harris hawks attempt to track the prey and execute the surprise pounce to catch the prey once detected. In this process, three feature sets, TF-IDF, LDA, and word2vec, are taken as input. These three feature sets are not equally informative for each text; therefore, HHO is utilized to select the optimal features for the classification of the text.
Generally, HHO includes exploration and exploitation stages, and the algorithm can transition from exploration to exploitation. The exploration behaviour evolves according to the escaping energy of the prey (E), which is given by:
$$E=2E_0\left(1-\frac{t}{T}\right) \tag{10}$$

$$E_0=2r-1 \tag{11}$$
Here the present iteration is denoted by t, the maximum number of iterations is represented by T, the initial energy is defined by E_0, which lies in [-1, 1], and r denotes a random number in [0, 1].
3.4.1.1. EXPLORATION PHASE
In this phase, the location of each hawk is updated through random positions, which can be given as:
$$X(t+1)=\begin{cases}X_k(t)-r_1\left|X_k(t)-2r_2X(t)\right|, & q\geq 0.5\\[4pt]\left(X_r(t)-X_m(t)\right)-r_3\left(lb+r_4(ub-lb)\right), & q<0.5\end{cases} \tag{12}$$
Here, the location of a hawk is defined by X, the location of an arbitrarily chosen hawk is denoted as X_k, and the location of the prey is defined as X_r. The lower and upper limits of the search space are signified by lb and ub, respectively. The five independent random numbers r_1, r_2, r_3, r_4, and q lie in the range [0, 1]. The mean location of the present population of hawks is defined by X_m and is given by:
$$X_m(t)=\frac{1}{N}\sum_{n=1}^{N}X_n(t) \tag{13}$$
Here, the location of the n-th hawk is denoted as X_n and the number of hawks is defined by N.
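A NumPy sketch of this exploration update, following Eqs. (10)-(13) of Heidari et al. (2019), is given below. The binarisation step that turns hawk positions into a selected-feature subset is omitted, since the paper does not detail it.

```python
# Exploration-phase update of HHO, Eqs. (10)-(13); positions are continuous
# scores per feature, and the mapping to a binary feature mask is omitted.
import numpy as np

rng = np.random.default_rng(0)

def hho_exploration_step(X, X_prey, lb, ub, t, T):
    N, dim = X.shape
    Xm = X.mean(axis=0)                        # Eq. (13): mean hawk position
    E0 = 2 * rng.random() - 1                  # Eq. (11): initial energy in [-1, 1]
    E = 2 * E0 * (1 - t / T)                   # Eq. (10): escaping energy of the prey
    X_new = np.empty_like(X)
    for i in range(N):
        r1, r2, r3, r4, q = rng.random(5)
        k = rng.integers(N)                    # index of an arbitrarily chosen hawk
        if q >= 0.5:                           # perch based on a random hawk, Eq. (12) top
            X_new[i] = X[k] - r1 * np.abs(X[k] - 2 * r2 * X[i])
        else:                                  # perch relative to prey and mean, Eq. (12) bottom
            X_new[i] = (X_prey - Xm) - r3 * (lb + r4 * (ub - lb))
    return np.clip(X_new, lb, ub), E

X = rng.random((5, 3))                         # 5 hawks searching over 3 feature scores
X_new, E = hho_exploration_step(X, X_prey=X[0], lb=0.0, ub=1.0, t=1, T=100)
print(X_new.shape, round(E, 3))
```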