Is it a mistake to treat PII filtering as a retrieval-time step instead of an ingestion constraint in RAG?

www.reddit.com


eifachposteMB to AI (Reddit RSS) · English · 3 hours ago


RAG pipelines often seem to follow this order:

raw docs -> chunk -> embed -> retrieve -> mask output

But if the documents contain emails, phone numbers, names, employee IDs, etc., the vector index is already derived from sensitive data, and masking the output does nothing about what is stored in the chunks and embeddings. The alternative is to redact at ingestion:

docs -> docs__pii_redacted -> chunk -> embed

Invariant: unsanitized text never gets chunked or embedded.

This seems safer from a data-lineage / attack-surface perspective, especially for local or enterprise RAG systems. Or am I wrong?

Example: https://github.com/mloda-ai/rag_integration/blob/main/demo.ipynb
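A minimal sketch of the ingestion-time invariant, to make the ordering concrete. It is not taken from the linked notebook. The regex patterns, the `EMP-` employee ID format, and the names `redact_pii`, `chunk`, `ingest`, and `fake_embed` are all assumptions made up for illustration; a real pipeline would plug in a proper PII detector and an actual embedding model.

```python
import re
from typing import Callable, List, Tuple

# Illustrative patterns only (assumptions); real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
EMPLOYEE_ID_RE = re.compile(r"\bEMP-\d{4,}\b")  # hypothetical ID format


def redact_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = EMPLOYEE_ID_RE.sub("[EMPLOYEE_ID]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def chunk(text: str, size: int = 200) -> List[str]:
    """Naive fixed-size character chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model (assumption)."""
    return [float(len(text))]


def ingest(
    raw_docs: List[str],
    embed: Callable[[str], List[float]],
) -> List[Tuple[str, List[float]]]:
    """Redact first, so chunk() and embed() only ever see sanitized text."""
    index = []
    for doc in raw_docs:
        clean = redact_pii(doc)  # the invariant is enforced here
        for piece in chunk(clean):
            index.append((piece, embed(piece)))
    return index


docs = ["Contact Jane at jane.doe@example.com or +1 (555) 123-4567. Badge EMP-00421."]
for piece, _vector in ingest(docs, fake_embed):
    print(piece)
```

Note that the name "Jane" passes through untouched: regexes only catch structured identifiers, so names would need something like an NER-based detector in the `redact_pii` step. The ordering is the point here, since whatever the detector misses ends up in the index either way, but nothing it catches ever does.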

Originally posted by u/coldoven on r/ArtificialInteligence
