March 3, 2026 · Krunal Sabnis
How We Pushed PII Recall from 76% to 98% — Right-Sized Models, No Fine-Tuning, No LLMs
Statistical NER + five pattern recognizers + one threshold change. No fine-tuning. No GPU. No data leaving your perimeter unmasked. A practical guide to enterprise PII detection under GDPR.
The Problem
You’re building AI-powered workflows for regulated industries — healthcare, finance, telecom. Your users type prompts containing patient names, credit card numbers, social security numbers. Before any of that reaches a cloud LLM, it needs to be detected and masked.
If your PII detection pipeline sends data to a cloud LLM, you’ve already lost. The data left your perimeter before you decided whether it was safe to leave. Under GDPR, that transfer is the compliance event — not what the provider does with it afterward.
We’re building a local-first prompt router designed for GDPR compliance. The rule is simple: sensitive data gets detected and masked before it ever leaves your infrastructure. No PII in transit. No third-party exposure. Every redaction decision auditable.
The PII detection layer uses Microsoft Presidio, an open-source engine that combines statistical NER (spaCy’s named entity recognition — a trained ML model) with pattern-based recognizers. Out of the box, it’s solid. But “solid” isn’t good enough when you’re making redaction decisions on regulated data.
The Baseline: 76.4% Recall
We created 60 synthetic prompts across three domains (healthcare, telecom, finance) — each labeled with the PII entities it contains. Names, SSNs, credit cards, phone numbers, dates of birth, medical license numbers.
First run with default Presidio configuration:
| Metric | Value |
|---|---|
| PII recall | 76.4% (42/55 entities) |
| Entities missed | 13 |
That means 1 in 4 PII entities slips through undetected. In healthcare or finance, that’s a compliance violation waiting to happen.
Diagnosing the Failures
Before reaching for a larger model, we asked: what specifically is Presidio missing, and why?
We ran every missed entity through the analyzer at score threshold 0.0 to see raw detection results:
| Entity Type | Count Missed | Root Cause |
|---|---|---|
| PHONE_NUMBER | 4 | 2 format variants unrecognized, 2 detected but scored 0.40 (below 0.6 threshold) |
| CREDIT_CARD | 3 | Luhn checksum validation — synthetic card numbers fail checksum |
| US_SSN | 2 | Same — Presidio validates SSN format strictly, synthetic numbers rejected |
| DATE_OF_BIRTH | 2 | No built-in DOB recognizer in Presidio |
| MEDICAL_LICENSE | 1 | No built-in recognizer |
| PERSON | 1 | Not a detection failure — routing error sent prompt to wrong path, detector never ran |
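The checksum failures in the table are easy to reproduce. A minimal Luhn check in plain Python (a standalone sketch, independent of Presidio's internals) shows why a standard test card number passes while an arbitrary synthetic one is rejected:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111-1111-1111-1111"))  # True  (standard Visa test number)
print(luhn_valid("4532-9876-5432-1098"))  # False (arbitrary synthetic digits)
```

A validator that requires the checksum will silently drop the second number, even though to a human it is unmistakably a card-shaped secret.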
Key insight: Most failures weren’t “the system can’t detect this.” They were:
- Checksum validation rejecting edge-case formats (SSN, credit card)
- Missing recognizers for domain-specific entities (DOB, medical license)
- Threshold filtering valid detections (phone numbers scored 0.40, threshold was 0.60)
All three are fixable without changing models or adding LLM calls.
The Fix: Five Pattern Recognizers + Threshold Tuning
Here’s what the detection stack looks like:
| Layer | What it does | ML? |
|---|---|---|
| spaCy NER (en_core_web_lg) | Statistical model trained on annotated text — recognizes PERSON, ORG, etc. from context | Yes — 400MB trained model |
| Presidio built-in recognizers | Regex + checksum validation for SSN, credit card, IBAN, etc. | No |
| Custom PatternRecognizers | Regex + context keyword boosting (what we added) | Hybrid |
| Presidio orchestrator | Combines all layers, deduplicates, scores | No |
This is what right-sizing looks like in practice — instead of routing text through a large language model, we’re running a 400MB statistical NER model purpose-built for entity recognition. The right model for the right job.
Custom recognizers we added
1. SSN Context Recognizer — Matches XXX-XX-XXXX format and boosts confidence when keywords like “SSN” or “social security” appear nearby. Skips Presidio’s strict format validation: for a redaction system, you want to catch PII-shaped data, not validate it.
2. Credit Card Context Recognizer — Matches XXXX-XXXX-XXXX-XXXX format. The 16-digit grouped pattern is unambiguous enough to flag without checksum validation.
3. Date of Birth Recognizer — Presidio has no built-in DOB recognizer. Ours matches ISO (1985-03-15), EU (15/03/1985), and written (March 15, 1985) formats when contextual keywords like “DOB” or “date of birth” appear.
4. Medical License Recognizer — Matches patterns like MD-45892, RN-12345 with medical context keywords. Domain-specific — irrelevant in telecom, critical in healthcare.
5. Broad Phone Recognizer — Catches format variants that the built-in recognizer misses: international with dashes (+1-555-987-6543), 8-digit local formats, and boosts standard 10-digit numbers when “phone” or “contact” appear nearby.
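In Presidio these are `PatternRecognizer` instances; the core mechanic (a regex with a base score, boosted when context keywords appear in a nearby window) can be sketched in plain Python. The names, scores, and window size below are illustrative, not Presidio's API:

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT_KEYWORDS = ("ssn", "social security")

def detect_ssn(text: str, base_score: float = 0.4,
               context_boost: float = 0.35, window: int = 40):
    """Find SSN-shaped spans; raise the score when a context keyword is nearby."""
    results = []
    for m in SSN_PATTERN.finditer(text):
        # Look at a window of characters on either side of the match.
        lo, hi = max(0, m.start() - window), m.end() + window
        nearby = text[lo:hi].lower()
        score = base_score
        if any(kw in nearby for kw in CONTEXT_KEYWORDS):
            score = min(1.0, score + context_boost)
        results.append({"entity": "US_SSN", "span": m.span(), "score": score})
    return results

hits = detect_ssn("Patient SSN: 123-45-6789, admitted yesterday.")
print(round(hits[0]["score"], 2))  # 0.75 ("SSN" keyword found in the window)
```

The same shape covers all five recognizers: swap the regex and keyword list, and the confidence logic stays identical.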
Threshold tuning
Lowered detection threshold from 0.60 to 0.35.
The rationale: for a detection pass, you want high recall — catch everything suspicious. The redaction step downstream can apply a separate, stricter threshold for what it actually masks. Detection and redaction are different policy decisions with different risk tolerances.
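The split can be expressed directly in code: one permissive floor for what gets surfaced, a stricter, policy-owned threshold for what gets masked. A minimal sketch (thresholds, field names, and the sample spans are illustrative):

```python
DETECTION_THRESHOLD = 0.35  # permissive: surface everything suspicious
REDACTION_THRESHOLD = 0.60  # stricter: set by compliance policy, not ML tuning

def apply_policy(detections, text):
    """Mask high-confidence entities; keep lower-confidence hits for audit review."""
    masked, audit_log = text, []
    # Process right-to-left so span offsets stay valid while replacing.
    for d in sorted(detections, key=lambda d: d["start"], reverse=True):
        if d["score"] < DETECTION_THRESHOLD:
            continue  # below the detection floor: ignore entirely
        if d["score"] >= REDACTION_THRESHOLD:
            masked = masked[:d["start"]] + f"<{d['entity']}>" + masked[d["end"]:]
        else:
            audit_log.append(d)  # detected but not masked: flag for review
    return masked, audit_log

detections = [
    {"entity": "US_SSN", "start": 7, "end": 18, "score": 0.85},
    {"entity": "PHONE_NUMBER", "start": 25, "end": 37, "score": 0.40},
]
text = "SSN is 123-45-6789, call 555-987-6543 today."
masked, audit = apply_policy(detections, text)
print(masked)  # SSN is <US_SSN>, call 555-987-6543 today.
```

The 0.40-score phone number survives detection and lands in the audit log instead of being masked, which is exactly the behaviour the two-threshold split buys you.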
No training data. No GPU. No new infrastructure. This runs on your existing IT stack — the same hardware already in your data centre.
The Result: 98.2% Recall
| Metric | Before | After | Change |
|---|---|---|---|
| PII recall | 76.4% (42/55) | 98.2% (54/55) | +21.8pp |
| Avg latency | 22.5ms | 22.3ms | No change |
| LLM calls | 0 | 0 | Still zero |
| Infrastructure | CPU-only | CPU-only | No GPU required |
The single remaining miss: a PERSON entity in a prompt that was misrouted — sent to local handling instead of the PII detection pipeline. The detector never got a chance to run. That’s a routing problem, not a detection problem — and it’s exactly the kind of ambiguous case where a small language model adds value as a routing classifier. More on that in Part 2.
What This Means for Enterprise PII Systems
1. Diagnose before you prescribe
The instinct is “recall is low, let’s throw a bigger model at it.” But 12 of our 13 misses were fixable in configuration (thresholds, missing recognizers, format coverage, checksum policy) without touching the model. Fixing configuration is cheaper, faster, and more predictable than model changes.
2. Synthetic data exposes design decisions
Our synthetic prompts used fake SSNs and credit card numbers. Presidio’s checksum validation correctly rejected them — which is proper behaviour for real data. But it revealed a design decision: do you want to detect PII-shaped patterns, or only validated PII? For redaction, you want the former. For fraud detection, the latter. That’s a policy choice, not a technical limitation.
3. The threshold is a policy decision
0.6 vs 0.35 isn’t about accuracy — it’s about organisational risk tolerance. High-regulation domains (healthcare, finance) should detect aggressively. Low-regulation domains can filter more. Your compliance team sets this number, not your ML team.
4. Domain policy packs are the enterprise differentiator
Each domain has different PII entities that matter. Medical license numbers are irrelevant in telecom. IBANs are irrelevant in US healthcare. Custom recognizers per domain means: plug in a policy pack for your vertical.
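One way to structure this (a sketch of the idea, not our production config; entity lists and thresholds are illustrative) is a per-vertical pack declaring which entities matter and how aggressively to redact:

```python
# Hypothetical per-vertical policy packs; names and values are illustrative.
POLICY_PACKS = {
    "healthcare": {
        "entities": ["PERSON", "US_SSN", "DATE_OF_BIRTH",
                     "MEDICAL_LICENSE", "PHONE_NUMBER"],
        "redaction_threshold": 0.50,  # aggressive: high-regulation domain
    },
    "telecom": {
        "entities": ["PERSON", "PHONE_NUMBER", "CREDIT_CARD"],
        "redaction_threshold": 0.60,
    },
    "finance": {
        "entities": ["PERSON", "US_SSN", "CREDIT_CARD", "IBAN_CODE"],
        "redaction_threshold": 0.50,
    },
}

def pack_for(domain: str) -> dict:
    """Select the policy pack for a vertical; fail closed on unknown domains."""
    if domain not in POLICY_PACKS:
        raise ValueError(f"No policy pack for domain: {domain}")
    return POLICY_PACKS[domain]

print("MEDICAL_LICENSE" in pack_for("healthcare")["entities"])  # True
print("MEDICAL_LICENSE" in pack_for("telecom")["entities"])     # False
```

Failing closed on an unknown domain matters here: an unconfigured vertical should refuse to route rather than silently run with no recognizers.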
What This Approach Can’t Catch — And What We’re Building Next
This approach handles structured PII — entities with known formats and patterns. It gets you to 95-98% recall. The remaining gap is where a language model genuinely adds value:
- Implicit sensitivity: “my therapist said…” (no named entity, but sensitive context)
- Multi-lingual PII: names and addresses in non-Latin scripts
- Obfuscated patterns: “my social is one two three dash…”
- Contextual judgment: is “Dr. Smith” PII or a public figure reference?
That’s where a small language model (SLM) comes in — but as a second pass on the cases the deterministic layer can’t resolve. Not as the primary detector. We’re building this layer next, and we’ll publish the three-mode comparison (rules-only vs rules+SLM vs SLM-only) with full benchmark data.
The architecture: rules for speed, NER for structure, SLM for judgment. Each layer does what it’s good at.
Part 2: Adding a small language model for the ambiguous cases — when deterministic rules say “I don’t know” and you need contextual judgment. Coming next week.