Research Infrastructure · Open Dataset · CC BY 4.0

ILSA Knowledge Infrastructure

A schema-guided framework that transforms fragmented international large-scale assessment literature into a searchable evidence infrastructure.

1,266 Studies · 11,862 Records · 130 Audited · 96% Alignment
PISA · TIMSS · PIRLS · TALIS · ICILS · ICCS · PIAAC
Research Question
Can large language models transform fragmented international large-scale assessment literature into a reliable, structured knowledge base for evidence-informed educational research?

Merve Dede & Ekrem Çetinkaya

Abstract

International Large-Scale Assessments (ILSAs) have generated thousands of studies across PISA, TIMSS, PIRLS, TALIS, PIAAC, ICILS, and ICCS. Yet methodological knowledge remains fragmented across publications, making systematic evidence synthesis, methodological comparison, and cumulative knowledge building increasingly difficult.

Problem

Thousands of ILSA studies contain valuable methodological and analytical evidence, yet this knowledge remains scattered across publications, countries, assessment cycles, and research traditions. Researchers often spend weeks reviewing literature before beginning a new analysis.

Approach

A multi-stage AI pipeline extracts methodological evidence, standardizes terminology, and links findings into a structured knowledge base, transforming heterogeneous ILSA literature into a searchable research infrastructure.

Research Infrastructure

The knowledge base holds 11,862 structured records drawn from 1,266 documents, enabling cumulative evidence synthesis across fragmented ILSA literature. The framework provides evidence-informed analytical guidance for researchers beginning new ILSA analyses.

Core contribution: This project establishes a reusable knowledge infrastructure for International Large-Scale Assessment research, transforming 1,266 studies into 11,862 structured records that support evidence synthesis, methodological discovery, and AI-assisted research at scale.

Contributions

Research contributions

01
Methodology-Aware Knowledge Extraction
A structured extraction framework designed specifically for ILSA research captures methodological decisions, analytical practices, plausible value handling, sampling weights, and study characteristics in a standardized format.
9 field categories · 7 assessment programs
02
Large-Scale Literature Transformation
A multi-stage AI workflow transforms heterogeneous ILSA literature into a structured evidence base through extraction, standardization, and knowledge integration.
1,266 studies · 11,862 records
03
Open ILSA Knowledge Base
An openly accessible structured knowledge resource spanning multiple international assessment programs, enabling cross-study methodological synthesis and evidence discovery at scale.
CC BY 4.0 · Parquet · JSON
04
Evidence-Grounded Research Assistant
A citation-grounded research assistant capable of answering methodological questions, identifying relevant studies, suggesting analytical approaches, and supporting evidence-informed ILSA research.
Citation-grounded · Evidence-based · Searchable
Why a Structured Knowledge Infrastructure Was Needed
ILSA studies differ substantially in terminology, reporting practices, analytical design, and methodological transparency. These differences make large-scale evidence synthesis difficult using conventional manual review or rule-based extraction approaches. A structured AI-assisted extraction framework enables methodological information to be represented consistently across studies.
Heterogeneous reporting
Studies report findings, variables, and methodological decisions using highly diverse formats, making direct comparison and synthesis difficult.
Non-standard descriptions
Similar analytical methods are often described using different terminology, creating challenges for systematic evidence aggregation.
Terminology variation
ESCS, HISEI, and PARED refer to overlapping constructs across programs — consistent interpretation requires contextual understanding at the study level.
Context-dependent extraction
Critical methodological decisions, such as plausible value handling or sampling-weight usage, often require interpretation within the broader analytical context.
Methods

Four-stage pipeline

Data Pipeline · Transparent Processing Hierarchy · 4 Stages
STAGE 01 · CORPUS CONSTRUCTION · 1,624
Peer-reviewed ILSA publications retrieved from four institutional databases
IEA: 308 · OECD: 591 · Scopus: 423 · WoS: 302
STAGE 02 · DEDUPLICATION · 1,266
Unique publications after DOI matching and normalized title/author comparison
DOI matching · Title normalization
STAGE 03 · HUMAN VALIDATION · 130
Articles manually read and audited from the 1,266, with field-by-field verification against source PDFs
~10% stratified sample · 96% alignment · 94% pass rate
STAGE 04 · KNOWLEDGE INFRASTRUCTURE · 11,862
Structured methodological evidence transformed into 11,862 linked records supporting evidence synthesis, methodological discovery, and AI-assisted research.
1,266 studies · 11,862 records · 7 assessment programs
Outcome: A reusable evidence infrastructure supporting systematic literature synthesis, methodological guidance, and AI-assisted ILSA research.

Corpus Construction — Literature was collected from four major sources and deduplicated using DOI matching and metadata normalization, yielding 1,266 unique ILSA-related publications.
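The deduplication step described above (DOI matching plus normalized title/author comparison) can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline code: the field names `doi`, `title`, and `first_author` and the exact normalization rules are assumptions, since the page does not specify them.

```python
import re
import unicodedata


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL/prefix forms (assumed rules)."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalize_text(text):
    """Strip accents, punctuation, and case so near-identical strings match."""
    decomposed = unicodedata.normalize("NFKD", text or "")
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).split())


def deduplicate(records):
    """Keep the first record for each DOI or normalized (title, author) key."""
    seen_dois, seen_keys, unique = set(), set(), []
    for rec in records:
        doi = normalize_doi(rec.get("doi"))
        key = (normalize_text(rec.get("title")), normalize_text(rec.get("first_author")))
        if (doi and doi in seen_dois) or key in seen_keys:
            continue  # duplicate of an earlier record
        if doi:
            seen_dois.add(doi)
        seen_keys.add(key)
        unique.append(rec)
    return unique
```

Matching on either key lets a record without a DOI (common for institutional reports) still be recognized as a duplicate of a DOI-bearing journal entry with the same title and first author.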

Structured Extraction — A schema-constrained extraction framework transformed the corpus into 11,862 structured methodological records suitable for large-scale evidence synthesis.

Equivalent analytical methods and variable definitions were aligned across studies using domain-informed terminology mapping. A domain expert adjudicated ambiguous cases and validated semantic consistency across the full knowledge base.
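As a rough illustration of terminology mapping, the sketch below assigns free-text method descriptions to method families. The family labels come from this page; the alias table and the rule of deferring unmatched or conflicting descriptions to expert review are assumptions made for the example, not the project's actual mapping.

```python
from typing import Optional

# Illustrative alias table. Family labels appear on this page;
# the alias strings themselves are assumptions.
METHOD_FAMILIES = {
    "random forest": "Tree-Based / Ensemble Learning",
    "xgboost": "Tree-Based / Ensemble Learning",
    "linear regression": "Regression-Based",
    "logistic regression": "Regression-Based",
    "neural network": "Neural Networks",
}


def map_method_family(description: str) -> Optional[str]:
    """Return the single matching family, or None to flag for expert review."""
    text = description.lower()
    matches = {family for alias, family in METHOD_FAMILIES.items() if alias in text}
    if len(matches) == 1:
        return matches.pop()
    return None  # no match or conflicting matches: defer to a domain expert
```

Returning `None` rather than guessing keeps ambiguous descriptions visible for manual adjudication.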

130 manually audited articles · ~10% stratified sample · 96% mean alignment

Researchers can query prior ILSA evidence, identify relevant variables, discover methodological precedents, generate evidence-grounded hypotheses, and obtain citation-supported analytical recommendations — all grounded in the structured knowledge base.

Study discovery · Hypothesis generation · Methodological guidance

Evidence Quality and Reliability

Controlled Methodological Vocabulary
Standardized schema constraints ensure methodological information is extracted using a consistent vocabulary across all studies.
Completeness Validation
Every record must satisfy predefined completeness criteria before entering the knowledge base. Incomplete records are rejected at the point of creation.
Human Expert Audit
130 articles were manually reviewed field-by-field against source PDFs by a domain expert. Ambiguities were resolved and schema vocabulary iteratively tightened across audit rounds — yielding 96% semantic alignment.
Author-Reported Findings Preserved
Findings are extracted directly from author-reported conclusions. The system does not infer or reinterpret outcomes.
Evidence-Bounded Responses
The research assistant generates responses only from retrieved, indexed records in the knowledge base, which constrains it against making unsupported claims.
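The controlled vocabulary and completeness checks described above could look something like the following sketch. The program list and field names are taken from the example record on this page; which fields are actually required, and how rejection is implemented in the pipeline, are assumptions.

```python
from typing import Dict, List

ILSA_PROGRAMS = {"PISA", "TIMSS", "PIRLS", "TALIS", "PIAAC", "ICILS", "ICCS"}

# Assumed set of required fields, based on the example record shown on this page.
REQUIRED_FIELDS = (
    "doi", "ilsa_program", "ml_family",
    "sampling_weight_used", "pv_correct", "primary_finding",
)
BOOLEAN_FIELDS = ("sampling_weight_used", "pv_correct")


def validate_record(record: Dict) -> List[str]:
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"missing field: {field}")
    program = record.get("ilsa_program")
    if program and program not in ILSA_PROGRAMS:
        problems.append(f"unknown ILSA program: {program}")
    for field in BOOLEAN_FIELDS:
        value = record.get(field)
        if value is not None and not isinstance(value, bool):
            problems.append(f"{field} must be true or false")
    return problems


def accept_complete(records: List[Dict]) -> List[Dict]:
    """Keep only records that pass every check (reject at creation time)."""
    return [rec for rec in records if not validate_record(rec)]
```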
Validation

Validation Results

To evaluate extraction quality, a stratified sample of 130 publications (~10% of the corpus) was manually audited against the original source documents. Validation focused on methodological accuracy, semantic consistency, source attribution, and fabrication detection.
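A stratified ~10% audit sample can be drawn along the lines of the sketch below. The page does not say which variable was used for stratification; `ilsa_program` is assumed here purely for illustration, as are the fixed seed and the at-least-one-per-stratum rule.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_key="ilsa_program", fraction=0.10, seed=0):
    """Draw roughly `fraction` of the records from each stratum."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)

    sample = []
    for name in sorted(strata):  # sorted for a deterministic draw order
        group = strata[name]
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling within each stratum keeps smaller programs represented in the audit instead of letting the largest program dominate a simple random draw.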

130
Human Validation
130 publications manually audited · ~10% stratified sample
96%
Mean Alignment Score
Average field-level semantic agreement between extracted records and source publications during expert verification.
17/18
Validation Scenarios Passed
Only one failure case was detected and corrected during audit — a causal language drift in one Korea-context record.
100%
Source Attribution Accuracy
All 1,266 records carry validated source database tags with zero misattribution detected in the audited sample.
Zero
Critical Fabrication Errors
No fabricated DOIs, program labels, or method families detected. Minor semantic drift confined to causal-language edge cases only.
Validation Coverage
ILSA program identification
Analytical method classification
Sampling weight usage
Plausible value handling
Source attribution
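To make the reported metrics concrete, the sketch below computes a mean field-level alignment score and a scenario pass rate. It assumes each audited field receives a binary agree/disagree judgement and that the five coverage areas listed above are the audited fields; the actual audit may use graded semantic judgements and a different field set. The field identifiers are hypothetical names for those five areas.

```python
# Hypothetical identifiers for the five validation coverage areas listed above.
AUDITED_FIELDS = (
    "ilsa_program",
    "analytical_method",
    "sampling_weight_usage",
    "plausible_value_handling",
    "source_attribution",
)


def article_alignment(judgements):
    """Share of audited fields where the extracted value matched the source."""
    agreed = sum(1 for field in AUDITED_FIELDS if judgements[field])
    return agreed / len(AUDITED_FIELDS)


def mean_alignment(audits):
    """Average article-level alignment across all audited articles."""
    return sum(article_alignment(a) for a in audits) / len(audits)


def pass_rate(passed, total):
    """Fraction of validation scenarios passed."""
    return passed / total
```

With the figures reported on this page, `pass_rate(17, 18)` is about 0.944, which matches the 94% pass rate shown in the pipeline summary.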
Dataset limitations
Informal method descriptions
Studies reporting "a machine learning approach" without naming the algorithm are assigned to the closest schema category; the Iterative Semantic Alignment audit flagged and resolved the majority of such cases.
Effect size heterogeneity
ILSA studies rarely report standardized effect sizes. Extracted effect direction (Positive / Negative / Null) reflects the author-reported conclusion, not a model inference.
Scope of qualitative validation
Validation is qualitative and expert-driven. The dataset is best used as a structured synthesis instrument — individual records should be verified against source publications for high-stakes decisions.
Data

Open dataset

11,862 structured records from 1,266 studies across PISA, TIMSS, PIRLS, TALIS, PIAAC, ICILS, and ICCS — released under CC BY 4.0 and hosted on HuggingFace in Parquet format. The full extraction pipeline is available on GitHub.

1,266
Studies
One record per publication — ILSA program, methodology, country coverage, and primary finding.
2,128
Key Findings
Effect sizes, top predictors, and outcome variables extracted from each study.
8,336
Covariates
Controlled variables and confounders reported by study authors, linked to each finding.
Example record — articles_master
{
  "doi": "10.1016/j.learninstr.2023.101800",
  "ilsa_program": "PISA",
  "ml_family": "Tree-Based / Ensemble Learning",
  "sampling_weight_used": true,
  "plausible_values_handling": "Correctly applied (5 PVs pooled)",
  "pv_correct": true,
  "countries_json": ["Turkey", "Germany", "Japan"],
  "primary_finding": "Socioeconomic status (SES) and prior academic achievement strongly predict mathematics outcomes across cycles, with school-level resources acting as a critical moderator."
}
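Records with the fields shown above lend themselves to simple aggregate queries. The following sketch uses a handful of made-up toy records (placeholder DOIs, not entries from the dataset) to show how the share of studies with a given flag, or the distribution of method families, might be computed once the records are loaded as Python dictionaries.

```python
from collections import Counter

# Toy records for illustration only; these are NOT entries from the dataset.
TOY_RECORDS = [
    {"doi": "10.0000/toy-1", "ilsa_program": "PISA",
     "ml_family": "Tree-Based / Ensemble Learning",
     "sampling_weight_used": True, "pv_correct": True},
    {"doi": "10.0000/toy-2", "ilsa_program": "PISA",
     "ml_family": "Regression-Based",
     "sampling_weight_used": True, "pv_correct": False},
    {"doi": "10.0000/toy-3", "ilsa_program": "TIMSS",
     "ml_family": "Tree-Based / Ensemble Learning",
     "sampling_weight_used": False, "pv_correct": True},
    {"doi": "10.0000/toy-4", "ilsa_program": "TIMSS",
     "ml_family": "Neural Networks",
     "sampling_weight_used": False, "pv_correct": False},
]


def share_true(records, flag, program=None):
    """Share of records (optionally for one program) where `flag` is true."""
    selected = [r for r in records if program is None or r["ilsa_program"] == program]
    if not selected:
        return None  # nothing to summarize for this program
    return sum(1 for r in selected if r[flag] is True) / len(selected)


def method_family_counts(records):
    """Count how often each method family appears."""
    return Counter(r["ml_family"] for r in records)


print(share_true(TOY_RECORDS, "pv_correct"))
print(method_family_counts(TOY_RECORDS).most_common(1))
```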
CC BY 4.0 Licensed under Creative Commons Attribution 4.0 International.
Explore on HuggingFace
Interactive Demo

Ask the ILSA Knowledge Base

The RAG agent retrieves evidence from 1,266 documents (11,862 structured records) and generates evidence-grounded analytical suggestions. Select an example query to see a representative response with source attribution.
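The retrieve-then-answer pattern can be illustrated with the minimal sketch below. It is not the project's agent: simple word overlap stands in for whatever retriever the system actually uses, and the "answer" is just the concatenated findings of the retrieved records rather than generated text. The point it illustrates is that a response is produced only when records are retrieved, and that every response carries its source DOIs.

```python
import re


def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, records, top_k=3):
    """Rank records by word overlap with the query (stand-in for a real retriever)."""
    query_tokens = tokenize(query)
    scored = []
    for rec in records:
        text = " ".join([rec["ilsa_program"], rec["ml_family"], rec["primary_finding"]])
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            scored.append((overlap, rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:top_k]]


def answer(query, records):
    """Answer only from retrieved records; return no answer if nothing matches."""
    hits = retrieve(query, records)
    if not hits:
        return {"answer": None, "sources": []}
    return {
        "answer": " ".join(hit["primary_finding"] for hit in hits),
        "sources": [hit["doi"] for hit in hits],
    }
```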

ILSA Knowledge Base · RAG Agent
Agent Response · KB: 1,266 studies · 130 validated
Across PISA studies [KB, n=412], the most consistently reported predictors of mathematics achievement are:

1. SES / ESCS: reported in 78% of studies as a top predictor [KB]. Most studies use the PISA ESCS composite; note that HISEI and PARED are components of that composite and should not be entered alongside it.
2. Prior academic performance: where available (longitudinal designs), it dominates variance explained [KB].
3. School-level variables: school SES, resource availability, and urbanicity consistently moderate individual effects in multilevel models [KB].
Retrieved sources
Xu et al. (2021) · 10.1016/j.learninstr.2021.101510 · [KB]
Areepattamannil (2014) · 10.1007/s10763-013-9479-5 · [KB]
Agasisti & Longobardi (2017) · 10.1080/09243453.2016.1222830 · [KB]
+ 409 additional records from knowledge base
Responses are generated from retrieved evidence and should be independently verified.
Among TIMSS studies [KB, n=218], the most frequently applied ML method families are:

1. Tree-Based / Ensemble (41%) [KB]: Random Forest and XGBoost dominate; typically used for variable importance ranking.
2. Regression-Based (33%) [KB]: Linear regression and logistic regression remain common baselines, especially in multilevel specifications.
3. Neural Networks (14%) [KB]: Increasing after 2018; often used for cross-country prediction tasks.

Note: only 38% of TIMSS ML studies in the knowledge base correctly applied sampling weights, a critical methodological gap.
Retrieved sources
Sirin (2005) · 10.3102/00346543075003417 · [KB]
Zhao et al. (2020) · 10.1007/s10648-020-09535-x · [KB]
TIMSS Technical Reports (2019) · IEA Database · [SURVEY]
+ 215 additional records from knowledge base
Based on 1,266 studies [KB] and OECD technical documentation [SURVEY], incorrect plausible value (PV) handling is the most common methodological error in ILSA research:

pv_correct = true in only ~61% of KB studies. Common errors: using only the first PV, averaging across PVs before analysis (incorrect), and ignoring PV-specific variance estimation.

Recommended approach: Run your analysis once per plausible value (5 or 10, depending on the program and cycle), pool point estimates using Rubin's rules, and combine sampling error with imputation variance. The BIFIEsurvey or survey R packages implement this correctly.
Retrieved sources
OECD PISA Technical Standards (2022) · [SURVEY]
von Davier et al. (2021) · 10.1007/978-3-030-53636-8 · [SURVEY]
KB audit results, 130-record stratified sample · [KB]
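The pooling step recommended in the response above follows Rubin's rules. As a minimal sketch, assume the analysis has already been run once per plausible value, producing one point estimate and one sampling variance per run (in ILSA practice the sampling variance typically comes from replicate weights). The pooled estimate is the mean of the per-PV estimates, and the total variance is T = U + (1 + 1/M) * B, where U is the mean sampling variance, B is the between-PV variance of the estimates, and M is the number of plausible values. The function name and input layout are illustrative.

```python
import math


def pool_plausible_values(estimates, sampling_variances):
    """Pool per-PV results with Rubin's rules; returns (estimate, standard error)."""
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need one estimate and one variance per plausible value (M >= 2)")

    pooled = sum(estimates) / m                                   # mean of per-PV estimates
    within = sum(sampling_variances) / m                          # U: average sampling variance
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1) # B: imputation variance
    total = within + (1 + 1 / m) * between                        # T = U + (1 + 1/M) * B
    return pooled, math.sqrt(total)


# Example with three hypothetical per-PV regression coefficients.
estimate, std_error = pool_plausible_values([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print(estimate, std_error)
```

Averaging the plausible values first and analyzing the average once would drop the between-PV term B entirely, which is why that shortcut understates the standard error.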
In PIRLS studies [KB, n=94], the most frequently controlled confounders are:

1. Home literacy environment [KB]: books in home, reading materials, parental reading habits; present in 71% of PIRLS studies.
2. Language spoken at home [KB]: especially relevant in multilingual country samples; controlled in 58% of studies.
3. Early childhood education [KB]: controlling for years of preschool attendance consistently reduces residual variance in reading models.
4. Teacher characteristics [KB]: experience, professional development, and instructional time; important at school level.
Retrieved sources
PIRLS 2021 International Results · [SURVEY]
Stoet & Geary (2015) · 10.1177/0956797614548444 · [KB]
Toropova et al. (2021) · 10.1016/j.tate.2020.103298 · [KB]
+ 91 additional records from knowledge base
Use Cases

Who benefits from this dataset

Researchers
  • Query which factors predict reading scores across 20 years of PISA data
  • See how plausible values were handled across hundreds of TIMSS studies — instantly
  • Spot methodological gaps before designing your next study
Policymakers
  • Find which student background factors consistently predict low performance across countries
  • See what the last decade of ILSA research says about a specific policy lever
  • Build evidence-based briefings without manually reading hundreds of studies
Data Scientists
  • Access 11,862 structured records in a single, analysis-ready Parquet file
  • Build or test retrieval systems on a real-world academic corpus
  • Reproduce or extend the extraction pipeline using the open-source code
Limitations

Known limitations

English-language bias
The corpus is restricted to English-language publications, excluding potentially relevant studies in other languages. This may under-represent research from non-English-speaking ILSA participant countries.
Publication bias
The dataset reflects the academic literature available in four institutional databases, which may over-represent statistically significant findings and under-represent null results.
Extraction uncertainty
Automated extraction may introduce errors in complex or ambiguous methodological descriptions. Individual records should be verified against source publications for high-stakes decisions.
Temporal coverage
Coverage reflects studies indexed in the four databases at collection time. Newly published studies require re-running the pipeline to be included in the knowledge base.
Citation

Cite this work

If you use this dataset or pipeline in your research, please cite:

BibTeX
@article{dede-cetinkaya2026schema,
  title  = {A Schema-Guided {LLM} Framework for Evidence Extraction and Systematic Synthesis in Large-Scale Assessment Research},
  author = {Dede, Merve and {\c{C}}etinkaya, Ekrem},
  year   = {2026},
  note   = {Manuscript under review}
}

Explore the Infrastructure

Open access — everything is freely available for research, teaching, and reuse.