Yunusov Valentin
Kazan Federal University, Kazan, (Russian Federation).
Gafarov Fail
Kazan Federal University, Kazan, (Russian Federation).
Ustin Pavel
Kazan Federal University, Kazan, (Russian Federation).
Reception: 26/10/2022 Acceptance: 10/11/2022 Publication: 29/12/2022
Suggested citation:
Valentin, Y., Fail, G., and Pavel, U. (2022). Shapley values to explain machine learning models of school student’s
academic performance during COVID-19. 3C TIC. Cuadernos de desarrollo aplicados a las TIC, 11(2),
3C TIC. Cuadernos de desarrollo aplicados a las TIC. ISSN: 2254-6529
Ed. 41 Vol. 11 N.º 2 August - December 2022
In this work we analyze the influence of the distance learning format, introduced in response to the
COVID-19 pandemic, on school students’ academic performance. The study is based on a large dataset
of school students’ grades for the 2020 academic year taken from the “Electronic education in Tatarstan
Republic” system. The analysis uses machine learning methods and a feature importance technique
implemented in the Python programming language. One of the priorities of this work is to identify the
academic factors with the strongest impact on school students’ performance. We use the Shapley values
method for this task. This method is widely used for feature importance estimation and can evaluate the
impact of every studied feature on the output of machine learning models. The studied factors include
characteristics of teachers, types and kinds of educational organization, the area of their location, and
the subjects for which marks were obtained.
Data Science, Python, education, Machine Learning, Feature Importance.
Failure to achieve educational goals is a serious problem that negatively affects society as a whole.
The problem can become most pronounced during periods of drastic change, one of which was the
introduction of distance learning during the COVID-19 pandemic. To quantify the influence of this
event on the educational system, a variety of quantitative models based on modern statistical methods
in combination with Big Data approaches can be used, as shown by Li et al. [2021].
Machine learning (ML) is a new and actively developing family of analysis methods that can "learn"
from the data they receive, which allows them to perform a wide range of tasks. ML can be used to
solve problems of detection, recognition, prediction, diagnostics, and optimization.
A large number of huge datasets have recently been accumulated in the educational system; they can be
used to analyze and then improve the educational process, as demonstrated by Park [2020]. For
example, Livieris et al. [2019] analyze a dataset describing the performance of 3716 students in
Mathematics courses over the first 5 years of secondary school. They develop two semi-supervised
machine learning algorithms to predict students’ performance in the final examinations and then
evaluate the methods’ accuracy. The authors compare these two methods with a supervised machine
learning method; the semi-supervised approaches outperform it, and the final accuracy exceeds 80%.
Jeslet et al. [2021] used two well-known machine learning algorithms, Logistic Regression and Support
Vector Machine, to predict whether a student is eligible to acquire a degree. The authors analyzed a
dataset of 1460 students’ final-year results and obtained models with accuracies of 99.27% and 99.72%.
Nuanmeeseri et al. [2022] analyzed a dataset of 1650 university students’ academic performance. After
adjusting the model’s parameters, the authors achieved an accuracy of 96.98%; their model
outperformed the other machine learning methods considered and can be used effectively to evaluate
significant academic performance factors in a period of drastic change.
In our work, we study changes in the academic performance of whole school grades using a variety of
machine learning methods, followed by feature importance analysis to identify the parameters that
affected academic performance the most after the introduction of the distance learning format due to
the COVID-19 pandemic.
Hastie et al. [2009] introduce machine learning as a set of mathematical techniques that give computer
algorithms the ability to learn. This methodology is based on the input and the required output of the
algorithms and can automate tasks the way humans carry them out, as stated by Mnih et al. [2015].
Ensemble methods are groups of algorithms that use several machine learning methods at once so that they
correct each other's errors. Bostanabad et al. [2016] define supervised learning as a type of algorithm in
which the method is supplied with example inputs along with the required outputs, which allows it to
learn a rule that maps inputs to outputs. Bengio et al. [2013] state that in unsupervised learning, by
contrast, only the inputs are supplied, and the learning algorithm is required to determine the structure of
the input and perform according to unknown characteristics [10].
In this work we use the following supervised machine learning methods: Decision Tree, Gradient Boosting,
K-nearest neighbors (KNN) Regressor, Lasso Regression, Linear Regression, MultiLayer Perceptron neural
networks, and Support Vector Regressor, together with one ensemble method, Random Forest.
In our study, we solved a regression task: predicting Cohen’s effect size, defined by Cohen [1988], which is
calculated from subsets of school grades’ marks in February and March and in April and May. Cohen’s
effect size measures the standardized difference between the mean values of two variables (Cohen [1988]).
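As an illustration of the target variable, the following is a minimal sketch of Cohen’s d for two sets of marks. It assumes the pooled-standard-deviation variant and an "after minus before" sign convention, so that a positive value means marks increased; the text does not state which variant was used. The function name `cohens_d` and the sample marks are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(before, after):
    """Cohen's d with a pooled standard deviation (after minus before).

    Assumption: pooled-SD variant; the paper does not specify the variant.
    """
    n1, n2 = len(before), len(after)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two marks")
    # Sample variances (n - 1 in the denominator).
    v1, v2 = variance(before), variance(after)
    pooled_sd = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    return (mean(after) - mean(before)) / pooled_sd


# Hypothetical marks for one school grade; not taken from the dataset.
feb_mar = [3, 4, 4, 5, 3, 4]
apr_may = [4, 4, 5, 5, 4, 5]
d = cohens_d(feb_mar, apr_may)
```

With this sign convention, swapping the two periods only flips the sign of d.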
Machine learning models are usually difficult to interpret, and it is hard to identify which features affect
their output the most. The SHAP method (Shapley additive explanations) is one of the techniques used
to solve this problem. It is based on cooperative game theory, as developed by Shapley [1953], and
is used to increase the transparency and interpretability of machine learning models. The absolute SHAP
value shows how much a single feature affected the prediction. SHAP values can represent the local
importance of features and how it changes with lower and higher feature values, as shown by Sahakyan et al.
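To make the idea concrete, here is a minimal sketch of exact Shapley values computed by brute force over all feature coalitions. It is not the implementation used in the study: it assumes that a feature "absent" from a coalition is replaced by a fixed baseline value, and the names `shapley_values`, `toy_model`, `x`, and `baseline` are hypothetical. The cost grows as 2^n in the number of features, which is why practical SHAP tools rely on approximations or model-specific shortcuts.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input.

    Assumption: features outside a coalition take their baseline value.
    """
    n = len(x)

    def value(coalition):
        # Model output when only the coalition's features come from x.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model: one additive term and one interaction term.
    return 0.5 * z[0] + 0.2 * z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the additive feature receives its full contribution, the interaction term is split equally between the two interacting features, and the values sum to the difference between the prediction at `x` and at the baseline.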
In this work, we study the influence of the COVID-19 pandemic on school students’ academic
performance by analyzing a large dataset consisting of data from all schools in the Tatarstan Republic,
introduced by Ustin et al. [2022]. The dataset includes the marks of entire school grades in the
main subjects for grades 2 to 11.
During preprocessing, the original data was transformed for the subsequent machine learning analysis
into a new dataset of features describing different parameters. These parameters included teachers’
characteristics (age, sex, and educational category), the mean mark of the grade for February and
March of 2020, school characteristics (location in or out of town, region of location, organization kind
and type), and the subject. Data was filtered to keep only school grades with at least 60 marks in each
of the time periods (February and March, April and May 2020). For every row in the dataset, Cohen’s
effect size d was calculated. Figure 1 shows histograms of d for selected grades that are representative
of the whole dataset. It should be noted that most values of d are positive, i.e., after the introduction
of the distance learning format, marks generally increased.
Fig. 1. Histograms of parameter d for: (a) 5th grade; (b) 7th grade; (c) 8th grade; (d) 11th grade.
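The filtering step in the preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the threshold of 60 is read as a minimum number of marks per period, and the record layout `(class_id, period, mark)`, the period labels, and the function name `filter_classes` are hypothetical rather than taken from the dataset.

```python
from collections import defaultdict

MIN_MARKS = 60  # threshold from the text, read here as marks per period


def filter_classes(records, periods=("feb_mar", "apr_may"), min_marks=MIN_MARKS):
    """Return class ids that have at least min_marks marks in every period."""
    counts = defaultdict(lambda: defaultdict(int))
    for class_id, period, _mark in records:
        counts[class_id][period] += 1
    return sorted(
        class_id
        for class_id, per_period in counts.items()
        if all(per_period.get(p, 0) >= min_marks for p in periods)
    )


# Hypothetical records: (class_id, period, mark).
records = (
    [("5A", "feb_mar", 4)] * 60 + [("5A", "apr_may", 5)] * 61
    + [("7B", "feb_mar", 3)] * 80 + [("7B", "apr_may", 4)] * 12
)
kept = filter_classes(records)
```

Here the hypothetical class "7B" is dropped because it has too few marks in the second period, so no effect size would be computed for it.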