Survey Paper  ·  Open Dataset

Artificial Intelligence Applications in International Large-Scale Assessments:
A Survey with LLM-Assisted Evidence Synthesis

A survey of 130 peer-reviewed studies examining how artificial intelligence and machine learning methods are applied across seven international large-scale assessment programs, with attention to methods, outcomes, and methodological rigor.

PISA · TIMSS · PIRLS · TALIS · ICCS · ICILS · PIAAC
130 Studies · 202 Findings · 1,907 Predictors

Abstract

What this survey does

"The first comprehensive evidence map of artificial intelligence applications in International Large-Scale Assessments, revealing how the field has evolved and where methodological gaps remain."

International Large-Scale Assessments (ILSAs), including PISA, TIMSS, PIRLS, TALIS, ICCS, ICILS, and PIAAC, have produced rich, large-scale datasets. Over the past decade, researchers have increasingly applied artificial intelligence and machine learning techniques to these data to predict educational outcomes, identify learner profiles, automate scoring, analyze process data, and support educational measurement.

Despite this rapid growth, the evidence has remained fragmented across assessment programs, research domains, and methodological approaches. No previous study has systematically synthesized AI applications across the full ILSA ecosystem, examined methodological practices, and transformed the literature into a structured, reusable evidence resource.

This survey addresses that gap through a PRISMA-guided review of 130 peer-reviewed studies, accompanied by an openly available structured evidence repository that enables transparent exploration, comparison, and reuse of the accumulated research.

Motivation

AI applications in International Large-Scale Assessments have grown rapidly, but evidence remains fragmented across programs, methods, and countries, limiting comprehensive evaluation and gap identification.

gap analysis
Approach

PRISMA-guided review, comprehensive full-text evaluation, structured evidence coding, thematic classification, and standardized metadata extraction across all included studies.

survey paper
Output

A reusable open evidence resource comprising 130 studies, 202 synthesized findings, 1,907 predictive records, and a CC BY 4.0 dataset supporting transparent exploration, comparison, and future research.

open dataset

Methodology

Study selection & screening process

RQ 1
How can AI applications in International Large-Scale Assessments be systematically categorized, and how are studies distributed across these computational domains between 2020 and April 2026?
RQ 2
What methodological characteristics, including model types, data sources, and the balance between predictive performance and interpretability, characterize AI applications in International Large-Scale Assessment research?
RQ 3
What gaps remain in the integration of AI methods into International Large-Scale Assessment research, and what methodological roadmap can support their adoption for generating pedagogically actionable insights?
Step 01 · Identification (n = 695)
Literature search
675 records were identified through a systematic search of Web of Science using search strings combining International Large-Scale Assessment (ILSA) program names with artificial intelligence and machine learning terminology. An additional 20 records were identified through targeted Google Scholar snowballing.
Step 02 · Screening (n = 118)
Title and abstract screening
The 675 Web of Science records were independently screened based on titles and abstracts, resulting in the exclusion of 557 irrelevant studies. The remaining 118 records advanced to full-text assessment together with the 20 studies identified through snowballing.
Step 03 · Eligibility (n = 138)
Full-text assessment
A total of 138 full-text articles were evaluated against predefined eligibility criteria, including peer-reviewed empirical studies applying artificial intelligence or machine learning methods to ILSA microdata and published between 2020 and April 2026. Eight Google Scholar records were excluded during this stage.
Step 04 · Final Corpus (n = 130)
Evidence synthesis
A total of 130 studies met the eligibility criteria and were included in the final evidence synthesis. Each study was read in full by the authors, processed through an LLM-assisted structured metadata extraction pipeline, and systematically coded into a standardized evidence framework spanning nine analytical categories.
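The record counts reported in the four steps above can be checked with simple arithmetic. The sketch below only re-derives the totals from the numbers stated on this page; the variable names are illustrative and are not part of the authors' pipeline.

```python
# Counts taken from the screening steps described above.
wos_records = 675            # Web of Science search results
snowball_records = 20        # Google Scholar snowballing
excluded_at_screening = 557  # removed after title and abstract screening
excluded_at_full_text = 8    # removed after full-text assessment

identified = wos_records + snowball_records
passed_screening = wos_records - excluded_at_screening
full_text_assessed = passed_screening + snowball_records
included = full_text_assessed - excluded_at_full_text

flow = {
    "identified": identified,
    "passed_screening": passed_screening,
    "full_text_assessed": full_text_assessed,
    "included": included,
}
print(flow)
```

Each derived total matches the figure shown next to its step (695, 118, 138, and 130).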
Inclusion Criteria
  • Peer-reviewed journal or conference publications
  • Empirical application of AI or machine learning
  • Analysis of ILSA microdata
  • Published between 2020 and April 2026
  • English-language publications
Exclusion Criteria
  • Non-ILSA or aggregated data only
  • Non-empirical or conceptual studies
  • Reports, theses, and gray literature
  • AI systems without ILSA data analysis
  • Non-English language publications
Structured Extraction Framework
  • Study metadata
  • ILSA dataset characteristics
  • AI/ML methodology
  • Outcome domains
  • Predictor taxonomy
  • Methodological design
  • Model evaluation
  • Evidence synthesis
  • Quality assessment
Quality & Reproducibility Indicators
  • Plausible value handling: Correct use of plausible values
  • Sampling weights: Survey design properly accounted for
  • Model evaluation: Performance metrics transparently reported
  • Sample documentation: Sample size and country coverage reported
  • Reproducibility: Sufficient information for replication
  • Overfitting control: Validation procedures to ensure generalizability
Dataset

Open structured research dataset

Articles (130 records, 31 columns, one record per study)
Description: One record per study containing publication metadata, AI/ML methods, study design, survey methodology, and quality indicators.
Key fields: DOI, ML techniques, ML family, PV handling, Sampling weights, Study type

Findings (202 records, 12 columns, one record per finding)
Description: One record per reported finding, including the ILSA program, assessment cycle, outcome, key predictors, model performance, and standardized evidence labels.
Key fields: DOI, Target variable, Top predictors, Performance metrics, Outcome domain

Predictors (1,907 records, 7 columns, one record per predictor)
Description: One record per predictor–study pair with standardized variable names, educational level, and controlled taxonomy labels.
Key fields: DOI, Variable name, Category, Predictor level, Predictor category
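Because DOI appears as a key field in all three tables, it can plausibly serve as the join key between them. The following is a minimal sketch of that idea using toy rows; the DOIs, the lowercase column names, and the `group_by_doi` helper are hypothetical and may not match the published files.

```python
from collections import defaultdict

# Hypothetical rows standing in for the Articles and Findings tables.
articles = [
    {"doi": "10.0000/example-a", "ml_family": "Tree-Based"},
    {"doi": "10.0000/example-b", "ml_family": "GLM"},
]
findings = [
    {"doi": "10.0000/example-a", "target_variable": "Science"},
    {"doi": "10.0000/example-a", "target_variable": "Mathematics"},
    {"doi": "10.0000/example-b", "target_variable": "Reading"},
]


def group_by_doi(rows):
    """Index a list of rows by their DOI (illustrative helper)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["doi"]].append(row)
    return grouped


findings_by_doi = group_by_doi(findings)

# Attach each study's findings to its article record (one-to-many join).
joined = [
    {**article, "findings": findings_by_doi.get(article["doi"], [])}
    for article in articles
]
finding_counts = {row["doi"]: len(row["findings"]) for row in joined}
print(finding_counts)
```

The same pattern would extend to the Predictors table, which also carries a DOI column.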
// sample extraction record
{
  "metadata": {
    "title": "ML to predict science achievement TIMSS 2019",
    "year": 2024,
    "open_access": true
  },
  "data": {
    "ml_techniques": {
      "primary": "Random Forest",
      "all": ["Random Forest", "XGBoost"]
    },
    "plausible_values_handling": "not_reported",
    "survey_design": { "student_weights_used": false },
    "main_findings": [
      {
        "target_variable": "Science (TIMSS 2019)",
        "performance_metrics": "R² = 0.71"
      }
    ]
  }
}
// controlled vocabulary taxonomy
source_category: Peer-Reviewed · Review Article · Methodology Paper
ml_family: Tree-Based · Deep Learning · GLM · Clustering
target_domain: Mathematics · Science · Reading · Non-cognitive
predictor_level: Student · School · System
pv_filter_label: Not Applicable · Rubin Rules · Single PV · Average PVs · WLE/IRT · All PVs · Not Reported
weights_filter: True · False · Unknown
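A controlled vocabulary like the one above lends itself to a simple validation pass over coded rows. The sketch below assumes the labels listed on this page are the allowed values, which may not be exhaustive (the full dataset could contain further labels); the `vocabulary_errors` helper is hypothetical.

```python
# Allowed labels as listed on this page; the real vocabulary may be larger.
VOCAB = {
    "source_category": {"Peer-Reviewed", "Review Article", "Methodology Paper"},
    "ml_family": {"Tree-Based", "Deep Learning", "GLM", "Clustering"},
    "target_domain": {"Mathematics", "Science", "Reading", "Non-cognitive"},
    "predictor_level": {"Student", "School", "System"},
    "pv_filter_label": {
        "Not Applicable", "Rubin Rules", "Single PV", "Average PVs",
        "WLE/IRT", "All PVs", "Not Reported",
    },
    "weights_filter": {"True", "False", "Unknown"},
}


def vocabulary_errors(row):
    """Return a message for every field whose value is outside the vocabulary."""
    errors = []
    for field, allowed in VOCAB.items():
        if field in row and row[field] not in allowed:
            errors.append(f"{field}: unexpected value {row[field]!r}")
    return errors


good_row = {"ml_family": "Tree-Based", "predictor_level": "Student", "weights_filter": "Unknown"}
bad_row = {"ml_family": "Random Forest", "pv_filter_label": "Rubin Rules"}

print(vocabulary_errors(good_row))
print(vocabulary_errors(bad_row))
```

In this sketch a technique name such as "Random Forest" is rejected for `ml_family`, since the taxonomy codes the family ("Tree-Based") rather than the individual algorithm.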
Key Findings

ML landscape & methodological rigor

Tree-Based / Ensemble Learning: 71%
Random Forest · XGBoost · Gradient Boosting · SHAP Interpretation
Generalized Linear Models: 11%
LASSO · Ridge · Elastic Net · Logistic Regression
Deep Learning: 9%
Neural Networks · CNN · LSTM · Automated Scoring
Other ML / Not Classified: 9%
SVM · Naive Bayes · Mixed-Method · Unspecified ML

Percentages are computed among 89 empirical ML studies. The remaining 41 Review/Methodology papers are excluded from this breakdown.
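The denominator rule described here (shares computed only among empirical ML studies, with review and methodology papers set aside) can be sketched as follows. The rows are toy data, and treating `source_category` as the field that separates empirical studies from review and methodology papers is an assumption, not something the page confirms.

```python
from collections import Counter

# Hypothetical article rows; not drawn from the actual dataset.
articles = [
    {"source_category": "Peer-Reviewed", "ml_family": "Tree-Based"},
    {"source_category": "Peer-Reviewed", "ml_family": "Tree-Based"},
    {"source_category": "Peer-Reviewed", "ml_family": "GLM"},
    {"source_category": "Peer-Reviewed", "ml_family": "Deep Learning"},
    {"source_category": "Review Article", "ml_family": None},
    {"source_category": "Methodology Paper", "ml_family": None},
]

# Assumed labels for the papers excluded from the breakdown.
EXCLUDED = {"Review Article", "Methodology Paper"}


def family_shares(rows):
    """Percentage of empirical studies per ML family, rounded to whole numbers."""
    empirical = [r for r in rows if r["source_category"] not in EXCLUDED]
    if not empirical:
        return {}
    counts = Counter(r["ml_family"] for r in empirical)
    return {family: round(100 * n / len(empirical)) for family, n in counts.items()}


shares = family_shares(articles)
print(shares)
```

With the toy rows, the two excluded papers do not enter the denominator, so the shares are taken over four studies rather than six.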

Outcome domains targeted
Other / Unspecified: 33%
Composite / Multi-Domain: 23%
Mathematics: 12%
Non-Cognitive: 12%
Reading: 9%
Problem Solving: 5%
Science: 4%
Civic Education: <1%
Policy Actionability Framework
Mean actionability score (1–5 scale): 3.21
SD across 130 studies: 0.87
High actionability (score 4–5; 45 studies): 35%

Three-dimensional rubric: inferential warrant × effect specification × population boundedness.

Methodological rigor indicators
Performance metrics reported: 86%
Sampling weights applied: 13%
Plausible values correctly handled: 29%
Sample size explicitly reported: 75%
Countries / economies specified: 79%
Cross-validation or test-set reported: 39%
Contributions

What This Survey Contributes

01
First Comprehensive Survey of AI in ILSAs
The first systematic review of artificial intelligence and machine learning applications across all seven major International Large-Scale Assessment programs, providing a unified view of the field within a consistent analytical framework.
7 ILSA programs · 130 studies · PRISMA-guided review
02
Open Structured Evidence Repository
An openly available, structured research dataset with standardized metadata, evidence records, and predictor taxonomy, designed to support secondary analyses, methodological comparisons, and reproducible research.
3 relational tables · 2,239 structured records · CC BY 4.0
03
Methodological Quality Assessment
A systematic evaluation of methodological practices, including plausible value handling, sampling weights, model validation, and reporting transparency, revealing substantial opportunities to improve methodological rigor.
Plausible values · Survey weights · Model validation
04
Research Gaps and Future Directions
A comprehensive gap analysis identifying underexplored assessment programs, application domains, methodological challenges, and emerging research opportunities to guide future AI research in International Large-Scale Assessments.
Evidence gaps · Research roadmap · Future directions
Limitations

Known constraints

English-language restriction

The search was restricted to English-language publications. Studies in Chinese, Spanish, Turkish, and other languages may represent a non-trivial share of the ILSA-ML literature and are not captured here.

Single primary database

The primary search was conducted in Web of Science, supplemented by Google Scholar snowballing (n = 20). Studies indexed exclusively outside Web of Science may be underrepresented, particularly conference proceedings and regional journals.

Search date cutoff

The corpus covers January 2020 to April 2026. Methodological advances published after this date are not reflected in the synthesis.

Reporting incompleteness

Many studies do not explicitly state their plausible value protocol or sampling weight decisions. Rigor coding for these studies relies on textual inference, introducing some classification uncertainty.

Extraction schema granularity

The schema captures the primary ML method and top-level model characteristics. Hyperparameter choices, ensemble configurations, and secondary analytical decisions are not systematically coded.

Future Directions

Multilingual extension

Systematic replication of the search in non-English databases to assess whether the patterns identified here hold across a broader linguistic corpus.

Statistical meta-analysis

Where studies report comparable performance metrics for the same outcome-domain pair, quantitative meta-analytic synthesis would enable more precise estimation of predictor importance and cross-study heterogeneity.

Formal replication with correct psychometric protocols

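A targeted replication of published ML analyses that applies Rubin's Rules across all available plausible values (ten in recent PISA cycles, five in TIMSS and PIRLS) together with appropriate variance estimation, to assess how sensitive the reported results are to these methodological choices.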
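For readers unfamiliar with Rubin's Rules, the pooling step can be sketched in a few lines. This is a generic illustration rather than the survey's own procedure: it assumes the same statistic has already been estimated once per plausible value and that each estimate comes with a sampling variance (in ILSA practice usually obtained from replicate weights, which is not shown here). The function name and the example numbers are made up.

```python
from statistics import mean, variance


def pool_rubin(estimates, sampling_variances):
    """Pool one statistic estimated separately on each plausible value.

    Returns (pooled_estimate, total_variance) where
    total = mean sampling variance + (1 + 1/M) * between-PV variance.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need one estimate and one variance per plausible value (M >= 2)")
    pooled = mean(estimates)
    within = mean(sampling_variances)  # average sampling variance
    between = variance(estimates)      # variance across PVs, divisor M - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Illustrative numbers only: five per-PV estimates of some model statistic.
pv_estimates = [0.50, 0.52, 0.48, 0.51, 0.49]
pv_variances = [0.0004] * 5

pooled, total_var = pool_rubin(pv_estimates, pv_variances)
print(pooled, total_var ** 0.5)  # pooled estimate and its standard error
```

Analysing a single plausible value or the average of the plausible values skips the between-PV term, which is why those shortcuts tend to understate uncertainty.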

Causal inference integration

Causal inference methods applied to ILSA data represent a rapidly growing subfield not fully captured in the current corpus and warrant a dedicated review.

Process data as a dedicated sub-review

Log-file and clickstream data from PISA 2022 and TIMSS 2023 represent a qualitatively distinct data type. A focused sub-review would allow more granular analysis of behavioral indicators and sequence modeling approaches.

Citation

Cite this work

BibTeX
@article{dede_cetinkaya2026ilsa_survey,
  title  = {Artificial Intelligence Applications in International Large-Scale Assessments: A Survey with LLM-Assisted Evidence Synthesis},
  author = {Dede, Merve and Çetinkaya, Ekrem},
  year   = {2026},
  note   = {Open dataset: HuggingFace Datasets},
  url    = {https://huggingface.co/datasets/dedemerve/ILSA-Survey-Dataset}
}
Access
Open Dataset on HuggingFace · View on GitHub