TabArena — Curation Guidelines
how to decide & curate datasets

Curation Guidelines

How we decide whether a dataset belongs in TabArena, and how we curate the ones we keep. This is the working reference for the curation log on the previous page.

Scope at a glance

We include real-world tabular datasets published for a predictive classification or regression task, where a random (IID), temporal, or grouped split is the appropriate protocol and tabular models are competitive.

We exclude:

  • time-series forecasting;
  • recommendation / ranking / click-through-rate / information-retrieval tasks;
  • non-predictive scientific-discovery or survey tables;
  • non-tabular modality data (image / text / audio) where modality-specific models clearly win;
  • artificial / deterministic / simulated data;
  • trivial tasks, irreversible data-quality issues or target leakage, and duplicates / re-uploads;
  • anything with ethical concerns.

The sections below give the full criteria and the reasoning behind each call.

1 · Background: IID and non-IID tabular data

We use an application-dependent definition of IID and non-IID data. Whether a dataset is IID or non-IID is decided by the appropriate train–test split — the split that most closely mirrors the original real-world application.

Illustration — the same data, two applications

Two practitioners use the same transaction data for fundamentally different goals. Alice predicts whether past transactions were fraudulent for follow-up investigations — a random (IID) split is appropriate. Bob predicts whether new transactions are fraudulent to prevent fraud in real time — a temporal (non-IID) split is appropriate, to simulate the distribution of unseen future transactions.

If Bob used a random split, he would overestimate the performance of models that exploit temporal leakage. If Alice used a temporal split, she would underestimate those same models. The appropriate split therefore depends on the application, and TabArena extends this thinking from practitioners to data curation and benchmarking.
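
The difference between the two protocols can be sketched in a few lines. This is a minimal illustration, not TabArena's splitting code: the function names, the `timestamp` field, the 25% test fraction, and the toy transactions are all assumptions made for the example.

```python
import random
from datetime import date


def random_split(rows, test_fraction=0.25, seed=0):
    """Alice's protocol: shuffle the rows, then hold out a random fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(rows, test_fraction=0.25, time_key="timestamp"):
    """Bob's protocol: train on the past, test on the most recent rows."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical toy data: one transaction per day for 20 days.
transactions = [
    {"id": i, "timestamp": date(2024, 1, 1 + i), "is_fraud": i % 5 == 0}
    for i in range(20)
]

train_a, test_a = random_split(transactions)    # Alice: past transactions
train_b, test_b = temporal_split(transactions)  # Bob: unseen future ones
```

In the temporal split every test row is strictly later than every training row, so a model cannot borrow information from the future. The random split gives no such guarantee.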

IID tabular data

Data is IID if the test samples in the associated application do not follow a particular structure, so a random hold-out is appropriate.

Non-IID tabular data

Data is non-IID when the application requires a temporal or grouped split.

For some datasets, either a temporal or a grouped split is plausible. A temporal split does not remove group structure, and a grouped split does not make the task time-invariant. The choice of split only determines which dependency the evaluation accounts for; the other dependency is still present in the data.
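
A grouped split can be sketched as follows. This is an illustrative sketch under assumptions: the `grouped_split` helper, the `patient` group key, and the toy records are hypothetical and do not come from the TabArena code base.

```python
import random


def grouped_split(rows, group_key, test_fraction=0.25, seed=0):
    """Hold out whole groups so that no group appears in both train and test."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Hypothetical toy data: 8 patients with 3 visits each.
records = [{"patient": f"p{i % 8}", "visit": i} for i in range(24)]
train, test = grouped_split(records, "patient")
```

The held-out patients still have visits spread across the whole time range, which illustrates the point above: grouping by patient controls group leakage but leaves any temporal dependency in place.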

TabArena excludes time-series forecasting data: although tabular-like and non-IID, forecasting has fundamentally different assumptions and needs a different validation protocol. During curation we carefully distinguish temporal tabular regression from time-series forecasting.

2 · Dataset curation, in brief

The curation process extends tabular benchmarking to better represent challenging real-world predictive ML applications. To keep selected datasets high-quality and representative, curation is manual and human-verified. This is labor-intensive, but crucial to scientific rigor — data quality directly affects the correctness of the benchmark's conclusions.

Datasets are gathered from prior tabular benchmark studies (re-evaluating accepted datasets and those previously rejected for being non-IID, too small, or for data issues) and from non-IID/multimodal benchmarks, plus new datasets found by browsing public repositories and competition sites (UCI, OpenML, Hugging Face, Kaggle, Zindi, ASlib) and government data portals. Search is guided by popularity, task, and application domain, while avoiding obvious duplicates.

3 · Dataset selection criteria (summary)

A dataset must satisfy all of the following to enter TabArena:

  1. Unique — the dataset and its predictive ML task are unique within the benchmark.
  2. IID or non-IID tabular task — a random, temporal, or grouped split is the appropriate validation protocol.
  3. Published for a predictive task — explicitly published for classification or regression.
  4. Representative of a real application — stems from a real random distribution (not generated by a deterministic function), and tabular models are a suitable, competitive choice (e.g., not a vectorized version of ImageNet).
  5. No ethical concerns — the dataset and task raise no (obvious) ethical concerns.
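
One way to record the outcome of these five checks for a candidate is a small checklist structure. The sketch below is hypothetical: the class and field names are labels invented for this example, and the actual judgments remain manual, as described in the rest of this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SelectionCheck:
    """One boolean per selection criterion, filled in by a human curator."""
    unique: bool
    iid_or_non_iid_tabular: bool
    published_for_predictive_task: bool
    representative_of_real_application: bool
    no_ethical_concerns: bool

    def failed(self):
        """Names of the criteria that the candidate does not satisfy."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def accepted(self):
        """A dataset enters the benchmark only if every criterion holds."""
        return not self.failed()


kept = SelectionCheck(True, True, True, True, True)
reupload = SelectionCheck(False, True, True, True, True)
```

Recording the failed criteria by name, rather than a single accept/reject flag, keeps the reason for each rejection available for the curation log.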

4 · Selection criteria in detail

These criteria involve subjective human interpretation; when in doubt, discuss and record the decision in the curation log.

Unique dataset

Ensure each dataset has a unique original data source. Many datasets are re-uploaded to OpenML/Kaggle under different names without attribution, and gamified incentives can even push users to make data look non-duplicate. The first priority for any candidate is therefore to determine its original source (or to confirm that it is an original contribution). This is sometimes trivial and sometimes requires a prolonged investigation, such as comparing data statistics or comparing against popular datasets from the same domain. When the source stays unclear after a thorough check, the curator team uses subjective judgment to assess authenticity and uniqueness.

How to check for duplicates (recommended). The reliable signal is the original source, not the data layout — two uploads can have different column names, dtypes, row counts, or preprocessing and still be the same dataset. So:

  • compare the versions and all links of the candidates (OpenML ids, Kaggle, UCI/DOI, GitHub, the originating paper) — overlap in any canonical link is a strong signal;
  • follow each link to its origin — a Kaggle/OpenML page usually references an upstream source; two records that bottom out at the same source are duplicates even when their direct links differ;
  • spot-check by name and rough field structure against datasets you already know, to recognise a re-upload under a new name (and rule out false matches).

Rule of thumb: duplicates almost always share the same source even if the data structure differs, so shared provenance is strong evidence of duplication and a structural difference is only weak evidence against it. Conversely, two records that merely share a source can be legitimately distinct tasks (e.g. a dataset and its telemonitoring/temporal variant); in that case, the comments on the record are authoritative.
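
The "follow each link to its origin" step can be sketched as a walk over upstream references. This is a minimal sketch with assumptions: the `upstream` mapping is compiled by hand, the record identifiers are placeholders rather than real OpenML/Kaggle/UCI entries, and both helper functions are hypothetical.

```python
from collections import defaultdict


def root_source(record, upstream):
    """Follow upstream links until a record with no known upstream is reached."""
    seen = set()
    while record in upstream and record not in seen:  # 'seen' guards against cycles
        seen.add(record)
        record = upstream[record]
    return record


def duplicate_candidates(records, upstream):
    """Group records by root source; keep groups with more than one record."""
    by_root = defaultdict(list)
    for record in records:
        by_root[root_source(record, upstream)].append(record)
    return {root: rs for root, rs in by_root.items() if len(rs) > 1}


# Placeholder identifiers: which page references which upstream source.
upstream = {
    "openml/candidate-1": "uci/origin-A",
    "kaggle/candidate-2": "openml/candidate-1",
    "kaggle/candidate-3": "paper/origin-B",
}
records = ["openml/candidate-1", "kaggle/candidate-2", "kaggle/candidate-3"]
flagged = duplicate_candidates(records, upstream)
```

Here the first two candidates bottom out at the same origin even though their direct links differ. A shared root only flags a group for manual review, since records that share a source can still be distinct tasks.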

IID and non-IID tabular datasets

Include any tabular dataset whose original task requires a random, temporal, or grouped split. Exclude time-series forecasting tasks — representing them faithfully needs a different protocol than temporal tabular tasks, and they have their own benchmarking community.

Predictive classification or regression tasks

Exclude datasets that do not originally stem from a predictive classification/regression task. In particular, exclude scientific-discovery tasks (e.g., survey data or non-predictive tables), and tabular-adjacent tasks such as click-through-rate prediction and ranking / information-retrieval (recommender) tasks — these need specific validation and have their own communities.

Representative of real-world tabular ML

Exclude datasets that represent tasks for which tabular ML models would not be used, that do not represent a real-world application, or that cannot be preprocessed to be representative. Concretely, exclude datasets that fall into any of the following categories:

  • (A) Non-tabular modality — where modality-specific models are clearly superior. This is judged per dataset (not by blanket modality exclusion): vectorized image, text, audio, or time-series data is allowed if tabular models are competitive with modality-specific solutions. Consult recent benchmarks / the introductory paper for the dataset's application.
    Key exclusion: features that are an algorithmic vectorization of image/video content. If the columns are computed from an image/video to describe its content (raw pixels, HOG, colour/texture histograms, or morphological / geometric / spectral descriptors extracted by a CV / image-analysis pipeline, video-derived kinematics, remote-sensing spectra), the underlying task is a vision task and the modality-specific model is the natural tool, so it is excluded as Image / Wrong Domain / Source Modality even when the dataset ships as a numeric / pre-extracted feature table. The feature vector does not make it tabular.
    This covers recognition sets (letter, mfeat_*, gina, gtsrb_*, optdigits, pendigits, usps, semeion, texture, one_hundred_plants_*) and measured-from-image tables (magic_gamma_telescope, wdbc, image_gesture_phase_segmentation, dry_bean/raisin/pumpkin_seeds, satimage/wilt, banknote_authentication).
    Narrow carve-in (stays in-scope): features that are not a vectorization of image content, i.e. non-visual metadata (e.g. internet_advertisements: URL/anchor/alt-text tokens) or human-assigned semantic/clinical attributes that are themselves the instrument (the kept wbcatt; breast_w's 1–10 cytology grades a pathologist assigned).
    Audio/sonar follow the same case-by-case logic (interpretable engineered measures like jitter/shimmer can be in; raw spectral bins lean out).
    Decide on the data's actual columns, not the dataset's fame.
  • (B) Not a real random distribution — exclude artificial data and data generated by a deterministic function. Simulated data from real applications (e.g., Higgs bosons) represents a real distribution but is still excluded here, as such tasks have dedicated domain-specific benchmarks.
  • (C) Trivial — practitioners would not use ML for it. A dataset is trivial if all models, without tuning, reach the same better-than-random score (no variance) or solve it perfectly (error 0).
  • (D) Irreversible data quality issues — exclude datasets where irreversible preprocessing leaks the target or the test feature distribution (e.g., PCA-transformed data).
  • (E) Insufficient information — when a prolonged investigation fails to find source information needed to judge the criteria, it is the curation team's call whether to use it.
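
Criterion (C) above lends itself to a simple check once untuned models have been evaluated. The sketch below is one possible reading of that criterion, not TabArena's actual test: the function name, the error-based convention (lower is better), the tolerance `tol`, and the example numbers are assumptions.

```python
def is_trivial(errors, random_baseline_error, tol=1e-9):
    """Flag a task as trivial under criterion (C).

    errors: mapping of model name -> test error of the untuned model.
    random_baseline_error: error of random guessing on the same task.
    """
    values = list(errors.values())
    if not values:
        raise ValueError("need at least one model result")
    # Every model solves the task perfectly (error 0).
    solved_perfectly = all(e <= tol for e in values)
    # Every model reaches the same score (no variance) ...
    no_variance = max(values) - min(values) <= tol
    # ... and that score is better than random guessing.
    better_than_random = max(values) < random_baseline_error
    return solved_perfectly or (no_variance and better_than_random)


# Hypothetical results for three untuned models.
same_score = {"linear": 0.10, "tree": 0.10, "knn": 0.10}
spread = {"linear": 0.10, "tree": 0.20, "knn": 0.15}
```

With a random-guessing error of 0.5, `same_score` is flagged because the models are indistinguishable, while `spread` is not: the models differ, so the task can still separate them.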

Ethically unambiguous tasks

Exclude datasets whose tasks pose ethical concerns. This is extended to include cases where data subjects or creators request that the data not be used for ML research (e.g., the Pima Indian Diabetes Dataset).

Adapted from the TabArena paper — Section 2 (background), Section 4 (curation & selection criteria), and Appendix B.2 (detailed criteria). Citations omitted. See the Dataset Processing (extended) tab for processing conventions.