Show HN: VeilPhantom – Open-source on-device PII detection for AI pipelines

1 points by nakaiwilliams 25 days ago

I built VeilPhantom, a Python SDK that detects and tokenizes PII before text reaches any LLM.

The problem: AI agents processing meetings, emails, support tickets are handling raw sensitive data. Names, salaries, medical details — all flowing through cloud APIs.

The solution: Detect PII on-device, replace with tokens ([PERSON_1], [AMOUNT_1]), send safe tokens to LLM, rehydrate response locally.

Interesting finding: In benchmarks (98 scenarios, 8 verticals, Claude Haiku), accuracy went UP with PII redaction — 91.5% → 93.3%. Token-structured input seems to help models parse arguments more reliably.

Technical details: - Shade V7: 22M param PhoneticDeBERTa (DeBERTa-v3-xsmall + Double Metaphone embeddings) - Trained on 72M words of meeting/business data - 7 detection layers (NER, gazetteers, regex, NLP, contextual) - 19 PII token types - 6ms average overhead - 97.1% F1 on meeting transcripts

The phonetic embeddings help catch ASR-mangled names across cultures — "Nkosinathi" transcribed as "Ink Casino Thea" still gets detected.

pip install veil-phantom

Docs + benchmarks: https://helloveil.com/sdk GitHub: https://github.com/helloveil/veil-phantom

Apache 2.0. Happy to answer questions about the architecture or training approach.