I’m building a “Swiss Army knife” dataset toolkit for AI workflows and wanted some honest feedback from other people who also work with messy data.

Current features it already supports:

• Scans a dataset folder recursively
• Validates image / video / audio files (corruption checks)
• Extracts metadata (size, resolution, fps, duration, etc.)
• Hash-based deduplication
• Auto-organizes files by type
• Generates CSV + JSON manifests
• Train/val/test splitting
• Optional image embeddings (Torch)
• Builds a FAISS vector index for similarity search
• Multiprocessing pipeline
• Optional label CSV merge

The goal is basically: point it at a chaotic dataset folder → get a clean, indexed, ML-ready dataset with manifests and vectors.

I’m considering adding:

• Similarity-based (embedding/perceptual) dedup, not just hash
• Dataset audit report (class balance, stats, leakage warnings)
• Bad-sample detection (blurry / tiny / silent / broken media)
• Dataset versioning + diffing
• Export formats (COCO / YOLO / HF datasets)
• CLI + config file support
• Resume/checkpoint runs
• Symlink mode instead of moving files
• Embedding cache

For those of you who train models or curate datasets:

Would you actually use a tool like this? What features would make it a “must-have” instead of a “nice toy”? What annoys you most about existing dataset tools? Brutal honesty welcome — I’d rather build what’s actually useful for everyone than keep adding features that are only useful to me.
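To make the hash-based dedup concrete, here is a minimal stdlib-only sketch of the idea (not the toolkit's actual code; the real pipeline would add multiprocessing and media validation on top): hash every file's contents in chunks and group paths by digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in 1 MiB chunks for large media."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Scan `root` recursively and return {hash: [paths]} for groups of 2+."""
    groups = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            groups[file_hash(p)].append(p)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}
```

Note this only catches byte-identical copies; a re-encoded or resized image hashes completely differently, which is exactly why the similarity-based dedup above is on the wishlist.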
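For the similarity dedup I’m considering, the core operation is cosine similarity over image embeddings. A rough sketch under assumptions (embeddings are already computed, e.g. by the Torch step; the function name and threshold are placeholders):

```python
import numpy as np

def near_duplicates(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold.

    embeddings: (N, D) float array, one row per image.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    pairs = []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs
```

The O(N²) pairwise loop is only viable for small sets; at scale this is where the existing FAISS index earns its keep, by querying each vector's nearest neighbors instead of comparing all pairs.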
Originally posted by u/Lakshendra_Singh on r/ArtificialInteligence
