A survey of 130 peer-reviewed studies examining how artificial intelligence and machine learning methods are applied across seven international large-scale assessment programs, with attention to methods, outcomes, and methodological rigor.
"The first comprehensive evidence map of artificial intelligence applications in International Large-Scale Assessments, revealing how the field has evolved and where methodological gaps remain."
International Large-Scale Assessments (ILSAs) have produced rich, large-scale datasets through programs such as PISA, TIMSS, PIRLS, TALIS, ICCS, ICILS, and PIAAC. Over the past decade, researchers have increasingly applied artificial intelligence and machine learning techniques to these data to predict educational outcomes, identify learner profiles, automate scoring, analyze process data, and support educational measurement.
Despite this rapid growth, the evidence has remained fragmented across assessment programs, research domains, and methodological approaches. No previous study has systematically synthesized AI applications across the full ILSA ecosystem, examined methodological practices, and transformed the literature into a structured, reusable evidence resource.
This survey addresses that gap through a PRISMA-guided review of 130 peer-reviewed studies, accompanied by an openly available structured evidence repository that enables transparent exploration, comparison, and reuse of the accumulated research.
| Sheet / Table | Records | Description | Key fields |
|---|---|---|---|
| Articles | 130 | One record per study containing publication metadata, AI/ML methods, study design, survey methodology, and quality indicators. | 31 columns · One record per study |
| Findings | 202 | One record per reported finding, including the ILSA program, assessment cycle, outcome, key predictors, model performance, and standardized evidence labels. | 12 columns · One record per finding |
| Predictors | 1,907 | One record per predictor–study pair with standardized variable names, educational level, and controlled taxonomy labels. | 7 columns · One record per predictor |
Percentages are computed among 89 empirical ML studies. The remaining 41 Review/Methodology papers are excluded from this breakdown.
Three-dimensional rubric: inferential warrant × effect specification × population boundedness.
The search was restricted to English-language publications. Studies in Chinese, Spanish, Turkish, and other languages may represent a non-trivial share of the ILSA-ML literature and are not captured here.
The primary search was conducted in Web of Science, supplemented by Google Scholar snowballing (n = 20). Studies indexed exclusively outside Web of Science may be underrepresented, particularly conference proceedings and regional journals.
The corpus covers January 2020 to April 2026. Methodological advances published after this date are not reflected in the synthesis.
Many studies do not explicitly state their plausible value protocol or sampling weight decisions. Rigor coding for these studies relies on textual inference, introducing some classification uncertainty.
The schema captures the primary ML method and top-level model characteristics. Hyperparameter choices, ensemble configurations, and secondary analytical decisions are not systematically coded.
Systematic replication of the search in non-English databases to assess whether the patterns identified here hold across a broader linguistic corpus.
Where studies report comparable performance metrics for the same outcome-domain pair, quantitative meta-analytic synthesis would enable more precise estimation of predictor importance and cross-study heterogeneity.
A targeted replication that re-runs published ML analyses with Rubin's Rules applied across all ten plausible values and with appropriate variance estimation, to assess how sensitive the reported results are to these methodological choices.
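For readers unfamiliar with the pooling step, the following is a minimal sketch of Rubin's Rules as they are commonly applied to plausible values, not the procedure of any study in the corpus. It assumes the analysis has already been run once per plausible value, and that each run's sampling variance has been estimated separately (for example, from the replicate weights the assessment provides). The function name `pool_rubin` and the numbers in the usage example are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, sampling_variances):
    """Pool per-plausible-value results with Rubin's Rules.

    estimates          -- one point estimate per plausible value
    sampling_variances -- the sampling variance of each estimate
                          (assumed to be computed beforehand)
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two estimates and one variance per estimate")

    pooled = mean(estimates)             # average of the per-PV estimates
    within = mean(sampling_variances)    # average sampling variance
    between = variance(estimates)        # variance across PVs (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical usage: ten made-up regression coefficients, one per plausible value.
coefs = [0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43, 0.41, 0.40, 0.42]
variances = [0.0025] * 10  # made-up sampling variances
estimate, std_error = pool_rubin(coefs, variances)
print(f"pooled estimate = {estimate:.3f}, SE = {std_error:.3f}")
```

The between-imputation term is what an analysis based on a single plausible value omits, so comparing the pooled standard error against the single-value one gives a direct read on how much that shortcut understates uncertainty.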
Causal inference methods applied to ILSA data represent a rapidly growing subfield not fully captured in the current corpus and warrant a dedicated review.
Log-file and clickstream data from PISA 2022 and TIMSS 2023 represent a qualitatively distinct data type. A focused sub-review would allow more granular analysis of behavioral indicators and sequence modeling approaches.