Original Reddit post

Disclosure: I built LabelSets (labelsets.ai). Sharing the technical approach behind how we score dataset quality.

THE PROBLEM

Most dataset quality issues aren’t visible until a model fails in production. Mislabeled examples, demographic coverage gaps, annotator fatigue at scale — none of this shows up in a README.

HOW LQS WORKS (Label Quality Score)

We run 7 automated checks on every dataset:

ANNOTATION ACCURACY
Spot-checks labels against a validation model trained on known-good examples. Flags statistical outliers in the label distribution that suggest systematic mislabeling.

LABEL CONSISTENCY
Checks whether identical or near-identical inputs receive consistent labels. High inconsistency = annotator disagreement or unclear guidelines.

CLASS BALANCE
Measures the Gini coefficient across label classes. Flags datasets where the top class exceeds 60% of samples without documentation.

COVERAGE
Checks for demographic and edge-case representation gaps using stratified sampling across known subgroup dimensions.

FRESHNESS
Scores based on collection date, version history, and whether the distribution matches current real-world data.

FORMAT COMPLIANCE
Validates schema consistency, null rates, encoding issues, and whether the actual format matches what’s documented.

ANNOTATION DENSITY
Measures the labels-per-sample ratio and flags sparse annotation that would degrade model performance.
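To make the class-balance check concrete, here is a minimal sketch of the idea: a Gini coefficient over class counts plus the top-class > 60% flag. This is a hypothetical illustration of the described check, not LabelSets' actual implementation; the function names and the fixed 0.60 threshold are assumptions for the example.

```python
from collections import Counter

def gini(counts):
    """Gini coefficient of class counts: 0 = perfectly balanced,
    approaches 1 as one class dominates."""
    n = len(counts)
    mean = sum(counts) / n
    diff_sum = sum(abs(a - b) for a in counts for b in counts)
    return diff_sum / (2 * n * n * mean)

def class_balance_check(labels, top_class_threshold=0.60):
    """Hypothetical class-balance audit: measures skew and flags
    datasets where the top class exceeds the threshold."""
    counts = Counter(labels)
    top_share = max(counts.values()) / len(labels)
    return {
        "gini": gini(list(counts.values())),
        "top_class_share": top_share,
        "flag_skew": top_share > top_class_threshold,
    }
```

On a 70/30 binary split this reports a top-class share of 0.7 and raises the skew flag; a perfectly balanced dataset scores a Gini of 0 and passes.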

WHAT WE FOUND

Across 140+ audited datasets, scores ranged from 61% to 97% on datasets claiming to be the same type. The dimensions that failed most often:

  • Class balance (most datasets underdocument skew)
  • Coverage (gaps almost always fall along demographic lines)
  • Consistency (drops sharply after ~50k samples — annotator fatigue is measurable)
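The consistency dimension above can be approximated by grouping identical or near-identical inputs and measuring label agreement within each group. A minimal sketch, assuming simple text normalization as the (crude) duplicate criterion — a real system would likely use embeddings or fuzzy matching:

```python
from collections import Counter, defaultdict

def consistency_score(samples):
    """samples: list of (text, label) pairs.
    Groups near-identical inputs via a crude normalization, then returns
    the mean within-group label agreement over groups with more than one
    sample (1.0 = perfectly consistent labeling)."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[text.strip().lower()].append(label)

    agreements = []
    for labels in groups.values():
        if len(labels) > 1:
            # share of labels matching the group's majority label
            majority = Counter(labels).most_common(1)[0][1]
            agreements.append(majority / len(labels))

    # No duplicate groups means nothing to disagree on
    return sum(agreements) / len(agreements) if agreements else 1.0
```

Tracking this score over contiguous annotation batches (e.g. per 10k samples) is one way to make the fatigue effect measurable, as the post describes.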

LIMITATIONS

  • Accuracy check is only as good as our validation model
  • Freshness scoring is partially manual for older datasets
  • Some dimensions are weighted equally when they probably shouldn’t be for every use case
  • Synthetic datasets score differently and are disclosed separately

LESSONS LEARNED

The hardest part wasn’t building the scoring — it was deciding what a “good” score means for different tasks. A dataset that’s great for classification is often terrible for detection. We’re still working on task-specific scoring profiles. Happy to discuss methodology, what we got wrong, or how you’d approach scoring differently.

Demo: labelsets.ai/quality-audit

Originally posted by u/plomii on r/ArtificialInteligence