Research Infrastructure · Open Dataset · CC BY 4.0

ILSA Knowledge Infrastructure

A schema-guided framework that transforms fragmented international large-scale assessment literature into a searchable evidence infrastructure.

1,266 studies · 11,862 records · 130 audited · 96% alignment

PISA · TIMSS · PIRLS · TALIS · ICILS · ICCS · PIAAC

Research Question

Can large language models transform fragmented international large-scale assessment literature into a reliable, structured knowledge base for evidence-informed educational research?

Discover the most common predictors of mathematics achievement across PISA studies

Which ML methods are most used in TIMSS studies with correct plausible value handling?

What does prior TIMSS evidence say about teacher quality and student outcomes?

Merve Dede & Ekrem Çetinkaya


Abstract

International Large-Scale Assessments (ILSAs) have generated thousands of studies across PISA, TIMSS, PIRLS, TALIS, PIAAC, ICILS, and ICCS. Yet methodological knowledge remains fragmented across publications, making systematic evidence synthesis, methodological comparison, and cumulative knowledge building increasingly difficult.

Problem

Thousands of ILSA studies contain valuable methodological and analytical evidence, yet this knowledge remains scattered across publications, countries, assessment cycles, and research traditions. Researchers often spend weeks reviewing literature before beginning a new analysis.

Approach

A multi-stage AI pipeline extracts methodological evidence, standardizes terminology, and links findings into a structured knowledge base, transforming heterogeneous ILSA literature into a searchable research infrastructure.

Research Infrastructure

The knowledge base holds 11,862 structured records across 1,266 documents, enabling cumulative evidence synthesis across fragmented ILSA literature. The framework provides evidence-informed analytical guidance for researchers beginning new ILSA analyses.

Core contribution: This project establishes a reusable knowledge infrastructure for International Large-Scale Assessment research, transforming 1,266 studies into 11,862 structured records that support evidence synthesis, methodological discovery, and AI-assisted research at scale.

Contributions

Research contributions

Methodology-Aware Knowledge Extraction

A structured extraction framework designed specifically for ILSA research captures methodological decisions, analytical practices, plausible value handling, sampling weights, and study characteristics in a standardized format.

9 field categories · 7 assessment programs

Large-Scale Literature Transformation

A multi-stage AI workflow transforms heterogeneous ILSA literature into a structured evidence base through extraction, standardization, and knowledge integration.

1,266 studies · 11,862 records

Open ILSA Knowledge Base

An openly accessible structured knowledge resource spanning multiple international assessment programs, enabling cross-study methodological synthesis and evidence discovery at scale.

CC BY 4.0 · Parquet · JSON

Evidence-Grounded Research Assistant

A citation-grounded research assistant capable of answering methodological questions, identifying relevant studies, suggesting analytical approaches, and supporting evidence-informed ILSA research.

Citation-grounded · Evidence-based · Searchable

Why a Structured Knowledge Infrastructure Was Needed

ILSA studies differ substantially in terminology, reporting practices, analytical design, and methodological transparency. These differences make large-scale evidence synthesis difficult using conventional manual review or rule-based extraction approaches. A structured AI-assisted extraction framework enables methodological information to be represented consistently across studies.

Heterogeneous reporting

Studies report findings, variables, and methodological decisions using highly diverse formats, making direct comparison and synthesis difficult.

Non-standard descriptions

Similar analytical methods are often described using different terminology, creating challenges for systematic evidence aggregation.

Terminology variation

ESCS, HISEI, and PARED refer to overlapping constructs across programs — consistent interpretation requires contextual understanding at the study level.

Context-dependent extraction

Critical methodological decisions, such as plausible value handling or sampling-weight usage, often require interpretation within the broader analytical context.

Methods

Four-stage pipeline

Data Pipeline · Transparent Processing Hierarchy · 4 Stages

STAGE 01
CORPUS CONSTRUCTION

1,624

Peer-reviewed ILSA publications retrieved from four institutional databases

IEA: 308 · OECD: 591 · Scopus: 423 · WoS: 302

STAGE 02
DEDUPLICATION

1,266

Unique publications after DOI matching and normalized title/author comparison

DOI matching · Title normalization

STAGE 03
HUMAN VALIDATION

130

Articles manually read and audited from the 1,266 — field-by-field verification against source PDFs

~10% stratified sample · 96% alignment · 94% pass rate

STAGE 04
KNOWLEDGE INFRASTRUCTURE

11,862

Structured methodological evidence transformed into 11,862 linked records supporting evidence synthesis, methodological discovery, and AI-assisted research.

1,266 studies · 11,862 records · 7 assessment programs

Outcome: A reusable evidence infrastructure supporting systematic literature synthesis, methodological guidance, and AI-assisted ILSA research.

Corpus Construction — Literature was collected from four major sources and deduplicated using DOI matching and metadata normalization, yielding 1,266 unique ILSA-related publications.
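The deduplication step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the field names (`doi`, `title`, `first_author`, `source`), the placeholder DOIs, and the matching rule (DOI first, normalized title plus first author as a fallback) are all assumptions made for the example.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def dedup_key(record: dict) -> tuple:
    """Prefer the DOI; fall back to normalized title plus first author."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return (
        "meta",
        normalize_text(record.get("title", "")),
        normalize_text(record.get("first_author", "")),
    )


def deduplicate(records: list) -> list:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical input rows; the DOIs are placeholders, not real identifiers.
raw = [
    {"doi": "10.1000/EXAMPLE.1", "title": "Predicting Maths Achievement",
     "first_author": "Doe", "source": "Scopus"},
    {"doi": "10.1000/example.1", "title": "Predicting maths achievement.",
     "first_author": "Doe", "source": "WoS"},
    {"doi": None, "title": "Teacher Quality in TIMSS",
     "first_author": "Çelik", "source": "IEA"},
    {"doi": "", "title": "Teacher quality in TIMSS!",
     "first_author": "Celik", "source": "OECD"},
]
unique = deduplicate(raw)
print(len(raw), "->", len(unique))
```

One limitation of this simple keying is that a record with a DOI and a copy of the same paper without one would not be matched; a production pipeline would likely need a second pass on normalized metadata.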

Structured Extraction — A schema-constrained extraction framework transformed the corpus into 11,862 structured methodological records suitable for large-scale evidence synthesis.

Equivalent analytical methods and variable definitions were aligned across studies using domain-informed terminology mapping. A domain expert adjudicated ambiguous cases and validated semantic consistency across the full knowledge base.
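As a rough illustration of terminology mapping, the sketch below assigns free-text method descriptions to canonical method families. The family labels are the ones shown in the demo responses on this page; the phrase table and the `map_method` helper are hypothetical and far simpler than any real alignment step would be.

```python
from typing import Optional

# Illustrative phrase table (an assumption, not the project's vocabulary file).
CANONICAL_FAMILIES = {
    "random forest": "Tree-Based / Ensemble",
    "xgboost": "Tree-Based / Ensemble",
    "linear regression": "Regression-Based",
    "logistic regression": "Regression-Based",
    "neural network": "Neural Networks",
}


def map_method(description: str) -> Optional[str]:
    """Return the canonical family for a description, or None if unmatched."""
    text = description.lower()
    for phrase, family in CANONICAL_FAMILIES.items():
        if phrase in text:
            return family
    # No known phrase found: leave the case for expert adjudication.
    return None


print(map_method("We trained a Random Forest classifier on TIMSS 2019 data"))
print(map_method("a machine learning approach"))
```

Returning `None` instead of guessing keeps ambiguous descriptions visible, which matches the idea of routing unclear cases to a domain expert.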

130 manually audited articles · ~10% stratified sample · 96% mean alignment

Researchers can query prior ILSA evidence, identify relevant variables, discover methodological precedents, generate evidence-grounded hypotheses, and obtain citation-supported analytical recommendations — all grounded in the structured knowledge base.

Study discovery · Hypothesis generation · Methodological guidance

Evidence Quality and Reliability

Controlled Methodological Vocabulary
Standardized schema constraints ensure methodological information is extracted using a consistent vocabulary across all studies.

Completeness Validation
Every record must satisfy predefined completeness criteria before entering the knowledge base. Incomplete records are rejected at the point of creation.

Human Expert Audit
130 articles were manually reviewed field-by-field against source PDFs by a domain expert. Ambiguities were resolved and schema vocabulary iteratively tightened across audit rounds — yielding 96% semantic alignment.

Author-Reported Findings Preserved
Findings are extracted directly from author-reported conclusions. The system does not infer or reinterpret outcomes.

Evidence-Bounded Responses
The research assistant generates responses only from retrieved, indexed records in the knowledge base, a design intended to prevent unsupported claims.
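The controlled-vocabulary and completeness checks listed above might look something like the sketch below. The field names follow the example record shown elsewhere on this page, and the Positive / Negative / Null values follow the effect-direction coding it describes; which fields are actually required, and the `effect_direction` field name, are assumptions for illustration.

```python
# Assumed required fields; the real completeness criteria are not published here.
REQUIRED_FIELDS = (
    "doi",
    "ilsa_program",
    "ml_family",
    "sampling_weight_used",
    "pv_correct",
    "primary_finding",
)
ILSA_PROGRAMS = {"PISA", "TIMSS", "PIRLS", "TALIS", "ICILS", "ICCS", "PIAAC"}
EFFECT_DIRECTIONS = {"Positive", "Negative", "Null"}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"missing field: {field}")
    program = record.get("ilsa_program")
    if program and program not in ILSA_PROGRAMS:
        problems.append(f"unknown program: {program}")
    direction = record.get("effect_direction")
    if direction is not None and direction not in EFFECT_DIRECTIONS:
        problems.append(f"unknown effect direction: {direction}")
    return problems


# Hypothetical records; the DOI is a placeholder.
complete = {
    "doi": "10.1000/example.1",
    "ilsa_program": "PISA",
    "ml_family": "Tree-Based / Ensemble Learning",
    "sampling_weight_used": False,
    "pv_correct": True,
    "primary_finding": "SES predicts mathematics outcomes.",
    "effect_direction": "Positive",
}
incomplete = {"doi": "10.1000/example.2", "ilsa_program": "SAT"}

print(validate_record(complete))
print(validate_record(incomplete))
```

Note that `False` counts as a present value here, so a study that did not use sampling weights is still a complete record.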

Validation

Validation Results

To evaluate extraction quality, a stratified sample of 130 publications (~10% of the corpus) was manually audited against the original source documents. Validation focused on methodological accuracy, semantic consistency, source attribution, and fabrication detection.

Human Validation

130 publications manually audited · ~10% stratified sample

96%

Mean Alignment Score

Average field-level semantic agreement between extracted records and source publications during expert verification.

17/18

Validation Scenarios Passed

Only one failure case was detected and corrected during the audit: causal-language drift in one Korea-context record.

100%

Source Attribution Accuracy

All 1,266 study-level records carry validated source database tags, with zero misattribution detected in the audited sample.

Zero

Critical Fabrication Errors

No fabricated DOIs, program labels, or method families detected. Minor semantic drift confined to causal-language edge cases only.
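The headline percentages can be cross-checked directly from the counts reported on this page. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Counts as reported in the pipeline and validation sections.
source_counts = {"IEA": 308, "OECD": 591, "Scopus": 423, "WoS": 302}
unique_studies = 1266
audited = 130
scenarios_passed, scenarios_total = 17, 18

retrieved = sum(source_counts.values())
duplicates_removed = retrieved - unique_studies
audit_share = audited / unique_studies
pass_rate = scenarios_passed / scenarios_total

print(f"retrieved before deduplication: {retrieved}")
print(f"duplicates removed: {duplicates_removed}")
print(f"audited share of corpus: {audit_share:.1%}")
print(f"validation scenario pass rate: {pass_rate:.1%}")
```

The four source counts sum to the 1,624 retrieved publications, the audit covers about 10.3% of the 1,266 unique studies, and 17 of 18 scenarios corresponds to the 94% pass rate shown in Stage 03.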

Validation Coverage

ILSA program identification

Analytical method classification

Sampling weight usage

Plausible value handling

Source attribution

Dataset limitations

Informal method descriptions

Studies reporting "a machine learning approach" without naming the algorithm are assigned to the closest schema category; the Iterative Semantic Alignment audit flagged and resolved the majority of such cases.

Effect size heterogeneity

ILSA studies rarely report standardized effect sizes. Extracted effect direction (Positive / Negative / Null) reflects the author-reported conclusion, not a model inference.

Scope of qualitative validation

Validation is qualitative and expert-driven. The dataset is best used as a structured synthesis instrument — individual records should be verified against source publications for high-stakes decisions.

Data

Open dataset

11,862 structured records from 1,266 studies across PISA, TIMSS, PIRLS, TALIS, PIAAC, ICILS, and ICCS — released under CC BY 4.0 and hosted on HuggingFace in Parquet format. The full extraction pipeline is available on GitHub.

1,266

Studies

One record per publication — ILSA program, methodology, country coverage, and primary finding.

2,128

Key Findings

Effect sizes, top predictors, and outcome variables extracted from each study.

8,336

Covariates

Controlled variables and confounders reported by study authors, linked to each finding.

Example record: articles_master

{
  "doi": "10.1016/j.learninstr.2023.101800",
  "ilsa_program": "PISA",
  "ml_family": "Tree-Based / Ensemble Learning",
  "sampling_weight_used": true,
  "plausible_values_handling": "Correctly applied (5 PVs pooled)",
  "pv_correct": true,
  "countries_json": ["Turkey", "Germany", "Japan"],
  "primary_finding": "Socioeconomic status (SES) and prior academic achievement strongly predict mathematics outcomes across cycles, with school-level resources acting as a critical moderator."
}

CC BY 4.0: Licensed under Creative Commons Attribution 4.0 International.
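To show how a record of this shape might be used, the sketch below parses the example record with the standard library and filters a small list for studies that both applied sampling weights and handled plausible values correctly. The second record and the `is_methodologically_sound` helper are hypothetical, and it is assumed here that `countries_json` arrives as a plain list (in the released Parquet files it may be a JSON-encoded string instead).

```python
import json

example_json = """
{
  "doi": "10.1016/j.learninstr.2023.101800",
  "ilsa_program": "PISA",
  "ml_family": "Tree-Based / Ensemble Learning",
  "sampling_weight_used": true,
  "plausible_values_handling": "Correctly applied (5 PVs pooled)",
  "pv_correct": true,
  "countries_json": ["Turkey", "Germany", "Japan"]
}
"""
example = json.loads(example_json)

# A made-up second record with a placeholder DOI, for contrast.
other = {
    "doi": "10.1000/example.2",
    "ilsa_program": "TIMSS",
    "ml_family": "Regression-Based",
    "sampling_weight_used": False,
    "pv_correct": False,
    "countries_json": ["Korea"],
}


def is_methodologically_sound(record: dict) -> bool:
    """True only when weights were used and PVs were handled correctly."""
    return record.get("sampling_weight_used") is True and record.get("pv_correct") is True


records = [example, other]
sound = [r["doi"] for r in records if is_methodologically_sound(r)]
print(sound)
print(len(example["countries_json"]), "countries in the example record")
```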


Interactive Demo

Ask the ILSA Knowledge Base

The RAG agent retrieves evidence from 1,266 documents (11,862 structured records) and generates evidence-grounded analytical suggestions. Select an example query to see a representative response with source attribution.
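The idea of answering only from retrieved records can be illustrated with a toy retriever. This is a deliberately simplified sketch using keyword overlap; the actual agent's retrieval method is not described on this page, and the function names, records, and placeholder DOIs below are all hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, records: list, top_k: int = 3) -> list:
    """Rank records by how many query tokens appear in their primary finding."""
    terms = tokenize(query)
    scored = []
    for record in records:
        overlap = len(terms & tokenize(record["primary_finding"]))
        if overlap:
            scored.append((overlap, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]


def answer(query: str, records: list) -> dict:
    """Return findings with their DOIs, or an empty answer when nothing matches."""
    hits = retrieve(query, records)
    if not hits:
        return {"answer": None, "sources": []}
    return {
        "answer": " ".join(hit["primary_finding"] for hit in hits),
        "sources": [hit["doi"] for hit in hits],
    }


knowledge_base = [
    {"doi": "10.1000/example.a",
     "primary_finding": "Socioeconomic status predicts mathematics achievement."},
    {"doi": "10.1000/example.b",
     "primary_finding": "Home literacy environment predicts reading achievement."},
]

grounded = answer("mathematics achievement predictors", knowledge_base)
empty = answer("quantum chromodynamics", knowledge_base)
print(grounded["sources"])
print(empty)
```

The design point is that the response text is assembled only from retrieved records and always carries their identifiers, so a query with no matching evidence yields no answer rather than an unsupported one.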

Agent Response · KB 1,266 studies · 130 validated

Across PISA studies [KB · n=412], the most consistently reported predictors of mathematics achievement are:

1. SES / ESCS: reported in 78% of studies as a top predictor. [KB] Most studies use the PISA ESCS composite; note that HISEI and PARED are subcomponents of ESCS and should not be entered alongside the composite in the same model.
2. Prior academic performance: where available (longitudinal designs), dominates variance explained. [KB]
3. School-level variables: school SES, resource availability, and urbanicity consistently moderate individual effects in multilevel models. [KB]

Retrieved sources

Xu et al. (2021) 10.1016/j.learninstr.2021.101510 [KB]

Areepattamannil (2014) 10.1007/s10763-013-9479-5 [KB]

Agasisti & Longobardi (2017) 10.1080/09243453.2016.1222830 [KB]

+ 409 additional records from knowledge base

Responses are generated from retrieved evidence and should be independently verified.

Among TIMSS studies [KB · n=218], the most frequently applied ML method families are:

1. Tree-Based / Ensemble (41%) [KB]: Random Forest and XGBoost dominate; typically used for variable importance ranking.
2. Regression-Based (33%) [KB]: Linear regression and logistic regression remain common baselines, especially in multilevel specifications.
3. Neural Networks (14%) [KB]: Increasing after 2018; often used for cross-country prediction tasks.

Note: only 38% of TIMSS ML studies in the knowledge base correctly applied sampling weights, a critical methodological gap.

Retrieved sources

Sirin (2005) 10.3102/00346543075003417 [KB]

Zhao et al. (2020) 10.1007/s10648-020-09535-x [KB]

TIMSS Technical Reports (2019) IEA Database [SURVEY]

+ 215 additional records from knowledge base

Responses are generated from retrieved evidence and should be independently verified.

Based on 1,266 studies [KB] and OECD technical documentation [SURVEY], incorrect plausible value (PV) handling is the most common methodological error in ILSA research:

pv_correct = true in only ~61% of KB studies. Common errors: using only the first PV, averaging across PVs before analysis (incorrect), and ignoring PV-specific variance estimation.

Recommended approach: Run the analysis once per plausible value (5 or 10, depending on the program and cycle), pool the point estimates using Rubin's rules, and combine the sampling variance with the imputation variance. The BIFIEsurvey or survey R packages implement this correctly.

Retrieved sources

OECD PISA Technical Standards (2022) [SURVEY]

von Davier et al. (2021) 10.1007/978-3-030-53636-8 [SURVEY]

KB audit results, 130-record stratified sample [KB]

Responses are generated from retrieved evidence and should be independently verified.
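The pooling step recommended in the plausible-value response above can be sketched with Rubin's rules: the pooled estimate is the mean of the per-PV estimates, and the total variance is the mean sampling variance plus (1 + 1/M) times the between-PV variance. The numbers below are made up for illustration, and the function name is hypothetical; in practice each sampling variance would come from the assessment's replicate-weight procedure.

```python
from statistics import mean, variance


def pool_plausible_values(estimates: list, sampling_variances: list) -> tuple:
    """Pool per-PV estimates with Rubin's rules; returns (estimate, standard error)."""
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two PVs and one sampling variance per PV")
    pooled_estimate = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # imputation variance, divisor M - 1
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance ** 0.5


# Illustrative regression coefficients from five plausible values.
pv_estimates = [0.30, 0.32, 0.28, 0.31, 0.29]
pv_sampling_variances = [0.0004] * 5

estimate, standard_error = pool_plausible_values(pv_estimates, pv_sampling_variances)
print(f"pooled estimate = {estimate:.3f}, SE = {standard_error:.4f}")
```

Averaging the plausible values before running the analysis would skip the between-PV term entirely, which is why it understates the standard error.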

In PIRLS studies [KB · n=94], the most frequently controlled confounders are:

1. Home literacy environment [KB]: books in the home, reading materials, parental reading habits; present in 71% of PIRLS studies.
2. Language spoken at home [KB]: especially relevant in multilingual country samples; controlled in 58% of studies.
3. Early childhood education [KB]: controlling for years of preschool attendance consistently reduces residual variance in reading models.
4. Teacher characteristics [KB]: experience, professional development, and instructional time; important at school level.

Retrieved sources

PIRLS 2021 International Results [SURVEY]

Stoet & Geary (2015) 10.1177/0956797614548444 [KB]

Toropova et al. (2021) 10.1016/j.tate.2020.103298 [KB]

+ 91 additional records from knowledge base

Responses are generated from retrieved evidence and should be independently verified.

Limitations

Known limitations

English-language bias

The corpus is restricted to English-language publications, excluding potentially relevant studies in other languages. This may under-represent research from non-English-speaking ILSA participant countries.

Publication bias

The dataset reflects the academic literature available in four institutional databases, which may over-represent statistically significant findings and under-represent null results.

Extraction uncertainty

Automated extraction may introduce errors in complex or ambiguous methodological descriptions. Individual records should be verified against source publications for high-stakes decisions.

Temporal coverage

Coverage reflects studies indexed in the four databases at collection time. Newly published studies require re-running the pipeline to be included in the knowledge base.
