AI & ML in ILSA Research: Survey Paper

Abstract

What this survey does

"The first comprehensive evidence map of artificial intelligence applications in International Large-Scale Assessments, revealing how the field has evolved and where methodological gaps remain."

International Large-Scale Assessments (ILSAs) have produced rich, large-scale datasets through programs such as PISA, TIMSS, PIRLS, TALIS, ICCS, ICILS, and PIAAC. Over the past decade, researchers have increasingly applied artificial intelligence and machine learning techniques to these data to predict educational outcomes, identify learner profiles, automate scoring, analyze process data, and support educational measurement.

Despite this rapid growth, the evidence has remained fragmented across assessment programs, research domains, and methodological approaches. No previous study has systematically synthesized AI applications across the full ILSA ecosystem, examined methodological practices, and transformed the literature into a structured, reusable evidence resource.

This survey addresses that gap through a PRISMA-guided review of 130 peer-reviewed studies, accompanied by an openly available structured evidence repository that enables transparent exploration, comparison, and reuse of the accumulated research.

Motivation

AI applications in International Large-Scale Assessments have grown rapidly, but evidence remains fragmented across programs, methods, and countries, limiting comprehensive evaluation and gap identification.

gap analysis

Approach

PRISMA-guided review, comprehensive full-text evaluation, structured evidence coding, thematic classification, and standardized metadata extraction across all included studies.

survey paper

Output

A reusable open evidence resource comprising 130 studies, 202 synthesized findings, 1,907 predictive records, and a CC BY 4.0 dataset supporting transparent exploration, comparison, and future research.

open dataset

Methodology

Study selection & screening process

RQ 1

How can AI applications in International Large-Scale Assessments be systematically categorized, and how are studies distributed across these computational domains between 2020 and April 2026?

RQ 2

What methodological characteristics, including model types, data sources, and the balance between predictive performance and interpretability, characterize AI applications in International Large-Scale Assessment research?

RQ 3

What gaps remain in the integration of AI methods into International Large-Scale Assessment research, and what methodological roadmap can support their adoption for generating pedagogically actionable insights?

695

Step 01 · Identification

Literature search

675 records were identified through a systematic search of Web of Science using search strings combining International Large-Scale Assessment (ILSA) program names with artificial intelligence and machine learning terminology. An additional 20 records were identified through targeted Google Scholar snowballing.
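The exact search strings are not reproduced on this page. As a rough sketch of how program names can be combined with AI terminology into a single boolean query, assuming Web of Science's `TS=` topic field tag and a hypothetical, abbreviated term list:

```python
# Hypothetical, abbreviated term lists; the survey's actual search strings
# are not given in this section.
ILSA_PROGRAMS = ["PISA", "TIMSS", "PIRLS", "TALIS", "ICCS", "ICILS", "PIAAC"]
AI_TERMS = ["machine learning", "artificial intelligence", "deep learning"]


def or_group(terms):
    """Join terms into a parenthesized OR group, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(programs, ai_terms):
    """Require at least one program name AND at least one AI term in the topic."""
    return f"TS={or_group(programs)} AND TS={or_group(ai_terms)}"


query = build_query(ILSA_PROGRAMS, AI_TERMS)
print(query)
```

Keeping the two term groups in separate lists makes it easy to extend either side without touching the query logic.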

118

Step 02 · Screening

Title and abstract screening

The 675 Web of Science records were independently screened based on titles and abstracts, resulting in the exclusion of 557 irrelevant studies. The remaining 118 records advanced to full-text assessment together with the 20 studies identified through snowballing.

138

Step 03 · Eligibility

Full-text assessment

A total of 138 full-text articles were evaluated against predefined eligibility criteria, including peer-reviewed status, empirical application of artificial intelligence or machine learning methods to ILSA microdata, and publication between 2020 and April 2026. Eight Google Scholar records were excluded during this stage.

130

Step 04 · Final Corpus

Evidence synthesis

A total of 130 studies met the eligibility criteria and were included in the final evidence synthesis. Each study was read in full by the authors, processed through an LLM-assisted structured metadata extraction pipeline, and systematically coded into a standardized evidence framework spanning nine analytical categories.
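The counts reported across the four steps can be checked with simple arithmetic. The sketch below only restates the numbers given above; the variable names are illustrative.

```python
# Counts as reported in the four screening steps.
wos_records = 675            # Web of Science search
snowball_records = 20        # Google Scholar snowballing
excluded_at_screening = 557  # removed on title and abstract
excluded_at_full_text = 8    # removed during full-text assessment

identified = wos_records + snowball_records
# Only the Web of Science records went through title/abstract screening.
screened_in = wos_records - excluded_at_screening
# Snowballed records join the flow at the full-text stage.
full_text_assessed = screened_in + snowball_records
included = full_text_assessed - excluded_at_full_text

flow = {
    "identified": identified,
    "screened_in": screened_in,
    "full_text_assessed": full_text_assessed,
    "included": included,
}
for step, count in flow.items():
    print(f"{step}: {count}")
```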

Inclusion Criteria

Peer-reviewed journal or conference publications
Empirical application of AI or machine learning
Analysis of ILSA microdata
Published between 2020 and April 2026
English-language publications

Exclusion Criteria

Non-ILSA or aggregated data only
Non-empirical or conceptual studies
Reports, theses, and gray literature
AI systems without ILSA data analysis
Non-English language publications

Structured Extraction Framework

Study metadata
ILSA dataset characteristics
AI/ML methodology
Outcome domains
Predictor taxonomy
Methodological design
Model evaluation
Evidence synthesis
Quality assessment

Quality & Reproducibility Indicators

Plausible value handling: Correct use of plausible values
Sampling weights: Survey design properly accounted for
Model evaluation: Performance metrics transparently reported
Sample documentation: Sample size and country coverage reported
Reproducibility: Sufficient information for replication
Overfitting control: Validation procedures to ensure generalizability
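To illustrate what the first two indicators check for, here is a minimal sketch with hypothetical toy data. It computes one weighted estimate per plausible value instead of an unweighted statistic on a single value. The function and field names are assumptions for illustration, not part of the extraction framework.

```python
def weighted_mean(values, weights):
    """Mean of values where each case counts in proportion to its sampling weight."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical toy data: three students, two plausible values (PVs) each.
students = [
    {"weight": 10.0, "pvs": [480.0, 500.0]},
    {"weight": 30.0, "pvs": [520.0, 510.0]},
    {"weight": 60.0, "pvs": [450.0, 470.0]},
]

weights = [s["weight"] for s in students]
n_pvs = len(students[0]["pvs"])

# One weighted estimate per plausible value; the estimates stay separate
# so that they can be pooled afterwards.
per_pv_means = [
    weighted_mean([s["pvs"][m] for s in students], weights)
    for m in range(n_pvs)
]

# For contrast: ignoring the weights and using only the first PV.
unweighted_pv1 = sum(s["pvs"][0] for s in students) / len(students)

print(per_pv_means, round(unweighted_pv1, 2))
```

Averaging the plausible values per student before analysis understates variability, which is why the per-PV estimates are kept apart here and combined only at the end.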

Dataset

Open structured research dataset

Sheet / Table	Records	Description	Key fields
Articles	130	One record per study containing publication metadata, AI/ML methods, study design, survey methodology, and quality indicators.	DOI, ML techniques, ML family, PV handling, Sampling weights, Study type (31 columns)
Findings	202	One record per reported finding, including the ILSA program, assessment cycle, outcome, key predictors, model performance, and standardized evidence labels.	DOI, Target variable, Top predictors, Performance metrics, Outcome domain (12 columns)
Predictors	1,907	One record per predictor–study pair with standardized variable names, educational level, and controlled taxonomy labels.	DOI, Variable name, Category, Predictor level, Predictor category (7 columns)
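Since DOI appears among the key fields of all three tables, it presumably serves as the shared key. A minimal sketch of a per-study summary under that assumption follows. The rows, DOIs, and lower-case column names are hypothetical placeholders, not the dataset's actual headers.

```python
from collections import Counter

# Hypothetical miniature versions of the three tables.
articles = [
    {"doi": "10.0000/example-a", "ml_family": "Tree-Based"},
    {"doi": "10.0000/example-b", "ml_family": "GLM"},
]
findings = [
    {"doi": "10.0000/example-a", "target_variable": "Science"},
    {"doi": "10.0000/example-a", "target_variable": "Mathematics"},
    {"doi": "10.0000/example-b", "target_variable": "Reading"},
]
predictors = [
    {"doi": "10.0000/example-a", "variable_name": "home_resources"},
    {"doi": "10.0000/example-a", "variable_name": "self_efficacy"},
    {"doi": "10.0000/example-b", "variable_name": "school_climate"},
]


def count_by_doi(rows):
    """Count how many rows of a child table belong to each study."""
    return Counter(row["doi"] for row in rows)


findings_per_study = count_by_doi(findings)
predictors_per_study = count_by_doi(predictors)

# One summary row per article, joined to the child tables on DOI.
summary = [
    {
        "doi": a["doi"],
        "ml_family": a["ml_family"],
        "n_findings": findings_per_study[a["doi"]],
        "n_predictors": predictors_per_study[a["doi"]],
    }
    for a in articles
]
for row in summary:
    print(row)
```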

// sample extraction record

{
  "metadata": {
    "title": "ML to predict science achievement TIMSS 2019",
    "year": 2024,
    "open_access": true
  },
  "data": {
    "ml_techniques": {
      "primary": "Random Forest",
      "all": ["Random Forest", "XGBoost"]
    },
    "plausible_values_handling": "not_reported",
    "survey_design": {
      "student_weights_used": false
    },
    "main_findings": [
      {
        "target_variable": "Science (TIMSS 2019)",
        "performance_metrics": "R² = 0.71"
      }
    ]
  }
}

// controlled vocabulary taxonomy

source_category

Peer-Reviewed · Review Article · Methodology Paper

ml_family

Tree-Based · Deep Learning · GLM · Clustering

target_domain

Mathematics · Science · Reading · Non-cognitive

predictor_level

Student · School · System

pv_filter_label

Not Applicable · Rubin Rules · Single PV · Average PVs · WLE/IRT · All PVs · Not Reported

weights_filter

True · False · Unknown
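A controlled vocabulary like this lends itself to a simple validation step. The sketch below copies the labels listed above, which may not be the complete taxonomy, and flags fields whose value falls outside the allowed set. The function name and the sample row are hypothetical.

```python
# Allowed labels per field, copied from the lists above (possibly not exhaustive).
CONTROLLED_VOCABULARY = {
    "source_category": {"Peer-Reviewed", "Review Article", "Methodology Paper"},
    "ml_family": {"Tree-Based", "Deep Learning", "GLM", "Clustering"},
    "target_domain": {"Mathematics", "Science", "Reading", "Non-cognitive"},
    "predictor_level": {"Student", "School", "System"},
    "pv_filter_label": {
        "Not Applicable", "Rubin Rules", "Single PV", "Average PVs",
        "WLE/IRT", "All PVs", "Not Reported",
    },
    "weights_filter": {"True", "False", "Unknown"},
}


def invalid_fields(row):
    """Return the names of controlled fields in `row` whose value is not allowed."""
    return sorted(
        field
        for field, allowed in CONTROLLED_VOCABULARY.items()
        if field in row and row[field] not in allowed
    )


# Hypothetical row: "Classroom" is not among the listed predictor levels.
row = {"ml_family": "Tree-Based", "predictor_level": "Classroom"}
print(invalid_fields(row))
```

Fields that are absent from a row are skipped rather than reported, so the same check can be applied to any of the three tables.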

Key Findings

ML landscape & methodological rigor

Tree-Based / Ensemble Learning

Random Forest · XGBoost · Gradient Boosting · SHAP Interpretation

71%

Generalized Linear Models

LASSO · Ridge · Elastic Net · Logistic Regression

11%

Deep Learning

Neural Networks · CNN · LSTM · Automated Scoring

Other ML / Not Classified

SVM · Naive Bayes · Mixed-Method · Unspecified ML

Percentages are computed among 89 empirical ML studies. The remaining 41 Review/Methodology papers are excluded from this breakdown.

Outcome domains targeted

Other / Unspecified: 33%

Composite / Multi-Domain: 23%

Mathematics: 12%

Non-Cognitive: 12%

Reading

Problem Solving

Science

Civic Education: <1%

Policy Actionability Framework

Mean actionability score (1–5 scale)

3.21

SD across 130 studies

0.87

High actionability — Score 4–5 (45 studies)

35%

Three-dimensional rubric: inferential warrant × effect specification × population boundedness.

Methodological rigor indicators

Performance metrics reported

86%

Sampling weights applied

13%

Plausible values correctly handled

29%

Sample size explicitly reported

75%

Countries / economies specified

79%

Cross-validation or test-set reported

39%

Contributions

What This Survey Contributes

First Comprehensive Survey of AI in ILSAs

The first systematic review of artificial intelligence and machine learning applications across all seven major International Large-Scale Assessment programs, providing a unified view of the field within a consistent analytical framework.

7 ILSA programs · 130 studies · PRISMA-guided review

Open Structured Evidence Repository

An openly available, structured research dataset with standardized metadata, evidence records, and predictor taxonomy, designed to support secondary analyses, methodological comparisons, and reproducible research.

3 relational tables · 2,239 structured records · CC BY 4.0

Methodological Quality Assessment

A systematic evaluation of methodological practices, including plausible value handling, sampling weights, model validation, and reporting transparency, revealing substantial opportunities to improve methodological rigor.

Plausible values · Survey weights · Model validation

Research Gaps and Future Directions

A comprehensive gap analysis identifying underexplored assessment programs, application domains, methodological challenges, and emerging research opportunities to guide future AI research in International Large-Scale Assessments.

Evidence gaps · Research roadmap · Future directions

Limitations

Known constraints

English-language restriction

The search was restricted to English-language publications. Studies in Chinese, Spanish, Turkish, and other languages may represent a non-trivial share of the ILSA-ML literature and are not captured here.

Single primary database

The primary search was conducted in Web of Science, supplemented by Google Scholar snowballing (n = 20). Studies indexed exclusively outside Web of Science may be underrepresented, particularly conference proceedings and regional journals.

Search date cutoff

The corpus covers January 2020 to April 2026. Methodological advances published after this date are not reflected in the synthesis.

Reporting incompleteness

Many studies do not explicitly state their plausible value protocol or sampling weight decisions. Rigor coding for these studies relies on textual inference, introducing some classification uncertainty.

Extraction schema granularity

The schema captures the primary ML method and top-level model characteristics. Hyperparameter choices, ensemble configurations, and secondary analytical decisions are not systematically coded.

Future Directions

Multilingual extension

Systematic replication of the search in non-English databases to assess whether the patterns identified here hold across a broader linguistic corpus.

Statistical meta-analysis

Where studies report comparable performance metrics for the same outcome-domain pair, quantitative meta-analytic synthesis would enable more precise estimation of predictor importance and cross-study heterogeneity.
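As a hedged illustration of such a synthesis, the sketch below pools effect estimates with inverse-variance weights (a fixed-effect model) and reports Cochran's Q and I² as heterogeneity measures. The inputs are hypothetical numbers, not results from the corpus, and a random-effects model may well be the better choice in practice.

```python
def pool_fixed_effect(estimates, variances):
    """Inverse-variance pooling with Cochran's Q and the I-squared statistic."""
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching variances")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_variance = 1.0 / total_weight
    # Q: weighted squared deviations of the study estimates from the pooled one.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of total variation attributed to between-study heterogeneity.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_variance, q, i_squared


# Hypothetical standardized effects of one predictor from three studies.
estimates = [0.2, 0.4, 0.6]
variances = [0.01, 0.01, 0.01]
pooled, pooled_variance, q, i_squared = pool_fixed_effect(estimates, variances)
print(round(pooled, 3), round(pooled_variance, 5), round(q, 2), round(i_squared, 2))
```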

Formal replication with correct psychometric protocols

A targeted replication that applies Rubin's Rules across the full set of plausible values (for example, all ten in recent PISA cycles) and appropriate variance estimation to published ML analyses, to assess how sensitive their results are to these methodological choices.
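Rubin's Rules pool one estimate per plausible value into a single estimate whose variance reflects both sampling error and the uncertainty across plausible values. A minimal sketch, assuming the per-PV estimates and their sampling variances have already been obtained (for instance with the program's replicate weights); all numbers are hypothetical:

```python
def pool_rubin(estimates, sampling_variances):
    """Combine per-plausible-value estimates with Rubin's Rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = sum(estimates) / m
    # Within-imputation variance: average sampling variance.
    within = sum(sampling_variances) / m
    # Between-imputation variance: spread of the estimates across plausible values.
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates of a mean score from ten plausible values.
estimates = [500.0, 502.0, 498.0, 501.0, 499.0, 500.0, 503.0, 497.0, 500.0, 500.0]
sampling_variances = [4.0] * 10

pooled, total_variance = pool_rubin(estimates, sampling_variances)
print(pooled, round(total_variance, 3), round(total_variance ** 0.5, 3))
```

Using a single plausible value sets the between-imputation term to zero, so comparing the total variance with the within-imputation part alone gives a first impression of how much uncertainty that shortcut hides.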

Causal inference integration

Causal inference methods applied to ILSA data represent a rapidly growing subfield not fully captured in the current corpus and warrant a dedicated review.

Process data as a dedicated sub-review

Log-file and clickstream data from PISA 2022 and TIMSS 2023 represent a qualitatively distinct data type. A focused sub-review would allow more granular analysis of behavioral indicators and sequence modeling approaches.
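To give a sense of what a behavioral indicator derived from log files can look like, the sketch below counts consecutive action pairs (bigrams) in one event sequence, a common first step before sequence modeling. The event names are hypothetical and do not reflect the actual PISA or TIMSS log formats.

```python
from collections import Counter


def action_bigrams(actions):
    """Count each pair of consecutive actions in an ordered event sequence."""
    return Counter(zip(actions, actions[1:]))


# Hypothetical event log for one student on one item.
log = ["start", "read", "click_option", "read", "click_option", "submit"]
features = action_bigrams(log)

for pair, count in features.most_common():
    print(pair, count)
```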

Artificial Intelligence Applications in International Large-Scale Assessments:
A Survey with LLM-Assisted Evidence Synthesis
