Original Reddit post

It seems like RAG pipelines often do this:

raw docs -> chunk -> embed -> retrieve -> mask output

But if the documents contain emails, phone numbers, names, employee IDs, etc., the vector index is already derived from sensitive data. Masking the output does not change what is stored in the chunks and embeddings.

An alternative is to redact before indexing:

docs -> docs__pii_redacted -> chunk -> embed

Invariant: unsanitized text never gets chunked or embedded.

This seems safer from a data-lineage / attack-surface perspective, especially for local or enterprise RAG systems. Or am I wrong?

Example: https://github.com/mloda-ai/rag_integration/blob/main/demo.ipynb
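
To make the redact-first ordering concrete, here is a minimal sketch in plain Python. It is not the code from the linked notebook. Everything in it is an assumption for illustration: the regex patterns, the `EMP-` employee ID format, and the function names `redact`, `contains_pii`, `chunk`, and `build_chunks` are all hypothetical. Regexes like these will not catch names, which usually need an NER-based detector, and the embedding step is left out because it would simply consume the returned chunks.

```python
import re

# Hypothetical patterns for illustration only. Order matters: employee IDs are
# replaced before phone numbers so a long numeric ID is not mistaken for a phone.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{4,}\b"),  # assumed ID format
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace every pattern match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def contains_pii(text: str) -> bool:
    """Return True if any known pattern still matches the text."""
    return any(p.search(text) for p in PII_PATTERNS.values())


def chunk(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows with some overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def build_chunks(raw_docs: list[str]) -> list[str]:
    """docs -> docs__pii_redacted -> chunk; raw text never reaches chunk()."""
    chunks = []
    for doc in raw_docs:
        clean = redact(doc)
        # Enforce the invariant: refuse to chunk text that still matches a pattern.
        if contains_pii(clean):
            raise ValueError("redaction left detectable PII in a document")
        chunks.extend(chunk(clean))
    return chunks


docs = ["Contact Jane at jane.doe@example.com or +1 (555) 123-4567. Badge EMP-00123."]
for c in build_chunks(docs):
    print(c)
```

Note that the name "Jane" survives in this sketch, which is exactly the kind of gap a regex-only redactor leaves. The point is only the ordering: the check sits between redaction and chunking, so anything downstream (embedding, indexing, retrieval) only ever sees sanitized text.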

Originally posted by u/coldoven on r/ArtificialInteligence