Curation Guidelines

How we decide whether a dataset belongs in TabArena, and how we curate the ones we keep. This is the working reference for the curation log on the previous page.

Scope at a glance

We include real-world tabular datasets published for a predictive classification or regression task, where a random (IID), temporal, or grouped split is the appropriate protocol and tabular models are competitive.

We exclude:

time-series forecasting;
recommendation / ranking / click-through-rate / information-retrieval tasks;
non-predictive scientific-discovery or survey tables;
non-tabular modality data (image / text / audio) where modality-specific models clearly win;
artificial / deterministic / simulated data;
trivial tasks, irreversible data-quality issues or target leakage, and duplicates / re-uploads;
anything with ethical concerns.

The sections below give the full criteria and the reasoning behind each call.

1 · Background: IID and non-IID tabular data

We use an application-dependent definition of IID and non-IID data. Whether a dataset is IID or non-IID is decided by the appropriate train–test split — the split that most closely mirrors the original real-world application.

Illustration — the same data, two applications

Two practitioners use the same transaction data for fundamentally different goals. Alice predicts whether past transactions were fraudulent for follow-up investigations — a random (IID) split is appropriate. Bob predicts whether new transactions are fraudulent to prevent fraud in real time — a temporal (non-IID) split is appropriate, to simulate the distribution of unseen future transactions.

If Bob used a random split, he would overestimate models that exploit temporal leakage. If Alice used a temporal split, she would underestimate such models. The appropriate split depends on the application — TabArena extends this thinking from practitioners to data curation and benchmarking.

IID tabular data

Data is IID if the test samples in the associated application do not follow a particular structure, so a random hold-out is appropriate.

Non-IID tabular data

Data is non-IID when the application requires a temporal or grouped split:

Temporal split — a time index is required; test samples occur strictly after the training data, reflecting prediction of future observations (e.g., future transactions).
Grouped split — a group index is required; all samples with the same index stay together, so no group appears in both train and test. The aim is to generalize to unseen entities (groups). Grouped tasks come in two flavors:
- label-per-group — all samples in a group share one label; predict a group-level label (e.g., is a customer fraudulent, based on a collection of their transactions).
- label-per-sample — each sample has its own label; predict individual labels for sources unseen during training (e.g., a new country or hospital).

For some datasets, either a temporal or a grouped split is plausible. A temporal split does not remove group structure, and a grouped split does not make the task time-invariant: the split determines which dependency the evaluation targets, while the other dependency remains in the data.
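The three split protocols above can be sketched in plain Python. This is a minimal illustration, not TabArena's implementation: the record layout (dicts with hypothetical `time` and `group` keys), the function names, and the 0.25 test fraction are all assumptions made for the example.

```python
import random

def random_split(rows, test_frac=0.25, seed=0):
    """IID protocol: hold out a random fraction of the samples."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(rows) * test_frac))
    test_idx = set(idx[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

def temporal_split(rows, cutoff, time_key="time"):
    """Temporal protocol: test samples occur strictly after the training data."""
    train = [r for r in rows if r[time_key] <= cutoff]
    test = [r for r in rows if r[time_key] > cutoff]
    return train, test

def grouped_split(rows, test_frac=0.25, seed=0, group_key="group"):
    """Grouped protocol: hold out whole groups so none spans train and test."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Toy data: 20 transactions from 4 hypothetical customers.
rows = [{"time": t, "group": f"customer_{t % 4}", "y": t % 2} for t in range(20)]
```

Applied to the same `rows`, the three functions yield different test sets, which is the point of the Alice and Bob illustration: the data does not change, only the protocol that matches the application.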

TabArena excludes time-series forecasting data: although tabular-like and non-IID, forecasting has fundamentally different assumptions and needs a different validation protocol. During curation we carefully distinguish temporal tabular regression from time-series forecasting.

2 · Dataset curation, in brief

The curation process extends tabular benchmarking to better represent challenging real-world predictive ML applications. To keep selected datasets high-quality and representative, curation is manual and human-verified. This is labor-intensive, but crucial to scientific rigor — data quality directly affects the correctness of the benchmark's conclusions.

Candidates come from three sources: prior tabular benchmark studies (re-evaluating both accepted datasets and those previously rejected for being non-IID or too small, or for data issues); non-IID and multimodal benchmarks; and new datasets found by browsing public repositories and competition sites (UCI, OpenML, Hugging Face, Kaggle, Zindi, ASlib) and government data portals. The search is guided by popularity, task, and application domain, while avoiding obvious duplicates.

3 · Dataset selection criteria (summary)

A dataset must satisfy all of the following to enter TabArena:

Unique — the dataset and its predictive ML task are unique within the benchmark.
IID or non-IID tabular task — a random, temporal, or grouped split is the appropriate validation protocol.
Published for a predictive task — explicitly published for classification or regression.
Representative of a real application — stems from a real random distribution (not generated by a deterministic function), and tabular models are a suitable, competitive choice (e.g., not a vectorized version of ImageNet).
No ethical concerns — the dataset and task raise no (obvious) ethical concerns.

4 · Selection criteria in detail

These criteria involve subjective human interpretation; when in doubt, discuss and record the decision in the curation log.

Unique dataset

Ensure each dataset has a unique original data source. Many datasets are re-uploaded to OpenML/Kaggle under different names without attribution; gamified incentives can even push users to make data look non-duplicate. So the first priority for any candidate is to determine its original source (or confirm it is an original contribution). Sometimes this is trivial; sometimes it requires prolonged investigation, such as comparing data statistics and comparing against popular datasets from the same domain. When the source stays unclear after a thorough check, the curation team uses subjective judgment to assess authenticity and uniqueness.

How to check for duplicates (recommended). The reliable signal is the original source, not the data layout — two uploads can have different column names, dtypes, row counts, or preprocessing and still be the same dataset. So:

compare the versions and all links of the candidates (OpenML ids, Kaggle, UCI/DOI, GitHub, the originating paper) — overlap in any canonical link is a strong signal;
follow each link to its origin — a Kaggle/OpenML page usually references an upstream source; two records that bottom out at the same source are duplicates even when their direct links differ;
spot-check by name and rough field structure against datasets you already know, to recognise a re-upload under a new name (and rule out false matches).

Rule of thumb: duplicates almost always share the same source even if the data structure differs — shared provenance is strong evidence, a structural diff is weak. Conversely, two records that merely share a source can be legitimately distinct tasks (e.g. a dataset and its telemonitoring/temporal variant) — the record's comments are authoritative.
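As a rough sketch of the provenance-first check, the snippet below groups records by a normalized upstream link. The record fields (`name`, `source`), the normalization rule, and the URLs are hypothetical placeholders for illustration. A shared source only flags candidates for review; as noted above, records that share a source can still be distinct tasks.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def normalize_link(url):
    """Reduce a link to a comparable key: lowercase host (minus 'www.') plus path.

    Lowercasing the path is an assumption of this sketch; some hosts use
    case-sensitive paths, so treat matches as hints, not proof.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/").lower()
    return host + path

def shared_source_candidates(records):
    """Group record names by normalized upstream source; keep groups of 2+."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[normalize_link(rec["source"])].append(rec["name"])
    return {src: names for src, names in by_source.items() if len(names) > 1}

# Hypothetical records; the URLs are placeholders, not real dataset pages.
records = [
    {"name": "upload_a", "source": "https://www.example.org/data/42/"},
    {"name": "upload_b", "source": "http://example.org/Data/42"},
    {"name": "upload_c", "source": "https://example.org/data/7"},
]
candidates = shared_source_candidates(records)
```

Here `upload_a` and `upload_b` resolve to the same key despite different schemes, hosts, and casing, so they would be queued for a manual duplicate check.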

IID and non-IID tabular datasets

Include any tabular dataset whose original task requires a random, temporal, or grouped split. Exclude time-series forecasting tasks — representing them faithfully needs a different protocol than temporal tabular tasks, and they have their own benchmarking community.

Predictive classification or regression tasks

Exclude datasets that do not originally stem from a predictive classification/regression task. In particular, exclude scientific-discovery tasks (e.g., survey data or non-predictive tables), and tabular-adjacent tasks such as click-through-rate prediction and ranking / information-retrieval (recommender) tasks — these need specific validation and have their own communities.

Representative of real-world tabular ML

Exclude datasets representing tasks for which tabular ML models would not be used, that do not represent a real-world application, or that cannot be preprocessed to be representative. Concretely, exclude datasets that fall under any of the following:

(A) Non-tabular modality — where modality-specific models are clearly superior. This is judged per dataset (not by blanket modality exclusion): vectorized image, text, audio, or time-series data is allowed if tabular models are competitive with modality-specific solutions. Consult recent benchmarks / the introductory paper for the dataset's application.
Key exclusion: features that are an algorithmic vectorization of image/video content. If the columns are computed from an image or video to describe its content (raw pixels, HOG, colour/texture histograms, morphological / geometric / spectral descriptors extracted by a CV / image-analysis pipeline, video-derived kinematics, or remote-sensing spectra), the underlying task is a vision task and the modality-specific model is the natural tool. Such a dataset is excluded as Image / Wrong Domain / Source Modality even when it ships as a numeric / pre-extracted feature table. The feature vector does not make it tabular.
This covers recognition sets (letter, mfeat_*, gina, gtsrb_*, optdigits, pendigits, usps, semeion, texture, one_hundred_plants_*) and measured-from-image tables (magic_gamma_telescope, wdbc, image_gesture_phase_segmentation, dry_bean/raisin/pumpkin_seeds, satimage/wilt, banknote_authentication).
Narrow carve-in (stays in-scope): features that are not a vectorization of image content, that is, non-visual metadata (e.g. internet_advertisements: URL/anchor/alt-text tokens) or human-assigned semantic/clinical attributes that are themselves the instrument (the kept wbcatt; the 1–10 cytology grades a pathologist assigned in breast_w).
Audio/sonar follow the same case-by-case logic: interpretable engineered measures like jitter/shimmer can be in; raw spectral bins lean out.
Decide on the data's actual columns, not the dataset's fame.
(B) Not a real random distribution — exclude artificial data and data generated by a deterministic function. Simulated data from real applications (e.g., Higgs bosons) represents a real distribution but is still excluded here, as such tasks have dedicated domain-specific benchmarks.
(C) Trivial — practitioners would not use ML for it. A dataset is trivial if all models, without tuning, reach the same better-than-random score (no variance) or solve it perfectly (error 0).
(D) Irreversible data quality issues — exclude datasets where irreversible preprocessing leaks the target or the test feature distribution (e.g., PCA-transformed data).
(E) Insufficient information — when a prolonged investigation fails to find source information needed to judge the criteria, it is the curation team's call whether to use it.
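Criterion (C) can be turned into a quick screening check. The sketch below assumes you already have one error value per untuned model (lower is better) and the error of random guessing; the function name, the input format, and the tolerance are illustrative assumptions, and the result is a prompt for human judgment rather than an automatic verdict.

```python
def is_trivial(errors_by_model, random_baseline, tol=1e-9):
    """Screen for criterion (C).

    Trivial means every untuned model solves the task perfectly (error 0),
    or all models tie on one better-than-random score (no variance).
    """
    if not errors_by_model:
        raise ValueError("need at least one model score")
    errors = list(errors_by_model.values())
    solved_perfectly = all(e <= tol for e in errors)
    no_variance = max(errors) - min(errors) <= tol
    better_than_random = max(errors) < random_baseline
    return solved_perfectly or (no_variance and better_than_random)

# Hypothetical untuned-model errors on a binary task (random guessing = 0.5).
tied = is_trivial({"linear": 0.1, "tree": 0.1, "knn": 0.1}, random_baseline=0.5)
varied = is_trivial({"linear": 0.1, "tree": 0.2, "knn": 0.15}, random_baseline=0.5)
```

In this toy example `tied` is flagged and `varied` is not, since models that differ in score leave room for meaningful comparison.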

Ethically unambiguous tasks

Exclude datasets whose tasks pose ethical concerns. This is extended to include cases where data subjects or creators request that the data not be used for ML research (e.g., the Pima Indian Diabetes Dataset).

Adapted from the TabArena paper — Section 2 (background), Section 4 (curation & selection criteria), and Appendix B.2 (detailed criteria). Citations omitted. See the Dataset Processing (extended) tab for processing conventions.

Dataset Processing (extended)

All datasets undergo a processing procedure for consistency and reproducibility. Below are the recurring patterns; case-by-case details and comments for each dataset live in Data Foundry's notebooks.

General

Sample identifiers — drop uninformative ones (often just storage artifacts). Keep informative identifiers (e.g., a time index) and process them to represent their original meaning.
Missing values — convert proxy missing values (e.g., 999, -1) to explicit NA whenever the encoding can be reliably inferred from the data description and task context.
Skewed targets — for numerical targets with strong skew / heavy tails (e.g., housing prices), consider logarithmic scaling on a case-by-case basis.
Naming — standardize dataset names with snake_case.
Sample order — always shuffle IID and grouped data to prevent methods from exploiting implicit order leakage. For temporal data, sort chronologically by the time index.
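Several of the general conventions above can be sketched with the standard library. The helper names, the row format (a list of dicts), and the sentinel mapping are assumptions for illustration; the actual per-dataset processing lives in the notebooks mentioned above.

```python
import math
import random
import re

def to_snake_case(name):
    """Standardize a dataset name, e.g. 'Bank Marketing' -> 'bank_marketing'."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # split camelCase
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)           # other characters -> _
    return name.strip("_").lower()

def replace_proxy_missing(rows, proxies):
    """Turn documented sentinel values into explicit missing values (None).

    `proxies` maps column name -> set of sentinels; it should come from the
    data description, never from guessing.
    """
    return [
        {col: (None if val in proxies.get(col, ()) else val) for col, val in row.items()}
        for row in rows
    ]

def log_scale_target(values):
    """log1p keeps zeros valid; apply only after a case-by-case check."""
    return [math.log1p(v) for v in values]

def shuffle_rows(rows, seed=0):
    """Shuffle IID or grouped data so implicit storage order cannot leak."""
    out = list(rows)
    random.Random(seed).shuffle(out)
    return out

rows = [{"age": 999, "income": -1}, {"age": 31, "income": 999}]
cleaned = replace_proxy_missing(rows, {"age": {999}, "income": {-1}})
```

Note that `999` is only replaced in `age`, where the hypothetical description declares it a sentinel; in `income` it stays a regular value.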

Feature types (dtypes)

Transform object/string features to categorical (a fixed, finite set of values, inferred from the description/context) or string (otherwise).
Convert date features to standardized YYYY-MM-DD datetime where possible; for numerically-encoded dates, reconstruct as much of the date information as possible.
Encode all other features as numerical types.
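For the date convention, a minimal sketch using only `datetime` is shown below. The list of accepted input formats is an assumption; in practice it must come from the dataset's description, because strings like 04/05/2023 are ambiguous between day-first and month-first.

```python
from datetime import datetime

def standardize_date(raw, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")):
    """Return the date as a YYYY-MM-DD string, or None if no format matches.

    Numerically-encoded dates such as the integer 20230415 are handled by
    converting to a string first.
    """
    text = str(raw).strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values missing for manual review

examples = [standardize_date(v) for v in (20230415, "15/04/2023", "n/a")]
```

Returning `None` rather than guessing keeps unparseable values visible as missing, in line with the explicit-NA convention in the General list.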

Creating temporal tasks and splits

Manually define the prediction horizon and the associated test time points.
Verify every feature would have been available at prediction time (no target leakage from future information); filter samples and features as needed.
Analyze for grouped-temporal structure — repeated observations from the same entity over time (e.g., multiple transactions from one customer); carefully consider a grouped split during construction.
The first split of a temporal task is always the one with the most recent test time point — it also has the most training data and is deemed the most representative. Order all other splits by descending test time point.
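The split-ordering rule can be sketched as follows. The row format, the `time` key, and the choice to test on rows exactly at each test time point are assumptions of this example; the real prediction horizon is defined manually per dataset.

```python
def build_temporal_splits(rows, test_points, time_key="time"):
    """Build one split per test time point, most recent first.

    Training rows lie strictly before the test point, so the first split
    (latest test point) also has the most training data.
    """
    splits = []
    for t in sorted(test_points, reverse=True):
        train = [r for r in rows if r[time_key] < t]
        test = [r for r in rows if r[time_key] == t]
        splits.append({"test_point": t, "train": train, "test": test})
    return splits

# Toy data: one row per year; test points are given in arbitrary order.
rows = [{"time": y, "x": i} for i, y in enumerate([2018, 2019, 2020, 2021, 2022, 2023])]
splits = build_temporal_splits(rows, test_points=[2021, 2023, 2022])
```

With this toy data the test points come out as 2023, 2022, 2021, with 5, 4, and 3 training rows respectively.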

Adapted from the TabArena paper — Appendix B.3 (details on dataset processing). Citations omitted.

Curation log & workflow

How the curation log on the previous page works — the states, the AI-assisted triage, and the human verification step.

One record per dataset

Each candidate dataset is a single markdown record (structured front-matter + free-text comments) under curation/records/, edited by hand or in the dashboard. Curators triage the backlog, commit, and open a PR; the dashboard edits the same files in place, so every change is a reviewable diff.

Suggestion states

Yes: include.
TBD -> Yes: likely include, pending verification.
TBD -> 2nd Tier: plausible but secondary.
No: exclude.
Disagreement: curators genuinely disagree; not yet shipped, needs resolution.
Yes (Disagreement): shipped on purpose, but with an open disagreement to re-evaluate; counts as accepted, but stays visible under the ⚡ Disagreement filter.

Decision markers

decision_markers flag issues (duplicate, trivial, leakage, out-of-scope, …). A clean, includable dataset usually has none — their absence is the normal, good state. A marker can also be a provisional best-guess of a concern, to confirm or rule out later (recorded in the comments), so an accepted dataset may still carry a watch-item marker.

AI-assisted triage & verification

An AI assistant can draft a provisional triage for untriaged datasets — a suggestion, justified markers, and a written assessment — labelled AI (UNVERIFIED) and shown with a 🤖 status. These stay in the Review queue until a human reads the record and verifies it (the ✓ action), which records the curator as reviewer and clears the unverified flag. The AI never has the final say; every decision is human-verified.
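The verification flow can be modelled as a small state change on a record. Everything below is a hypothetical sketch: apart from `decision_markers`, the field and function names are invented for illustration and do not describe the dashboard's actual record schema.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    name: str
    suggestion: str = ""
    decision_markers: list = field(default_factory=list)
    ai_unverified: bool = False  # hypothetical flag behind the AI (UNVERIFIED) label
    reviewer: str = ""

def ai_triage(record, suggestion, markers):
    """Draft a provisional triage; the record stays unverified."""
    record.suggestion = suggestion
    record.decision_markers = list(markers)
    record.ai_unverified = True
    record.reviewer = ""
    return record

def verify(record, curator):
    """Human verification: record the reviewer and clear the unverified flag."""
    record.reviewer = curator
    record.ai_unverified = False
    return record

def review_queue(records):
    """Names of records still waiting for a human to verify them."""
    return [r.name for r in records if r.ai_unverified]

records = [Record("dataset_a"), Record("dataset_b")]
ai_triage(records[0], "TBD -> Yes", [])
ai_triage(records[1], "No", ["duplicate"])
verify(records[1], curator="curator_1")
pending = review_queue(records)
```

After these calls only `dataset_a` remains in the queue, mirroring the rule that an AI draft never counts as a decision until a curator verifies it.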

For AI / agent contributors

The full agentic curation brief — how an assistant should triage, the decision patterns, and the duplicate-checking method — lives in the repo as a skill: .claude/commands/curate.md (the /curate slash command).

Operational reference for the curation dashboard; see the Guidelines tab for the selection criteria.