
March 3, 2026 · Krunal Sabnis

How We Pushed PII Recall from 76% to 98% — Right-Sized Models, No Fine-Tuning, No LLMs

Statistical NER + five pattern recognizers + one threshold change. No fine-tuning. No GPU. No data leaving your perimeter unmasked. A practical guide to enterprise PII detection under GDPR.

Tags: PII Detection · GDPR · NLP · Enterprise AI · Presidio · Data Privacy

The Problem

You’re building AI-powered workflows for regulated industries — healthcare, finance, telecom. Your users type prompts containing patient names, credit card numbers, social security numbers. Before any of that reaches a cloud LLM, it needs to be detected and masked.

If your PII detection pipeline sends data to a cloud LLM, you’ve already lost. The data left your perimeter before you decided whether it was safe to leave. Under GDPR, that transfer is the compliance event — not what the provider does with it afterward.

We’re building a local-first prompt router designed for GDPR compliance. The rule is simple: sensitive data gets detected and masked before it ever leaves your infrastructure. No PII in transit. No third-party exposure. Every redaction decision auditable.

The PII detection layer uses Microsoft Presidio, an open-source engine that combines statistical NER (spaCy’s named entity recognition — a trained ML model) with pattern-based recognizers. Out of the box, it’s solid. But “solid” isn’t good enough when you’re making redaction decisions on regulated data.

The Baseline: 76.4% Recall

We created 60 synthetic prompts across three domains (healthcare, telecom, finance) — each labeled with the PII entities it contains. Names, SSNs, credit cards, phone numbers, dates of birth, medical license numbers.

First run with default Presidio configuration:

| Metric | Value |
|---|---|
| PII recall | 76.4% (42/55 entities) |
| Entities missed | 13 |

That means nearly 1 in 4 PII entities slips through undetected. In healthcare or finance, that’s a compliance violation waiting to happen.

Diagnosing the Failures

Before reaching for a larger model, we asked: what specifically is Presidio missing, and why?

We ran every missed entity through the analyzer at score threshold 0.0 to see raw detection results:

| Entity Type | Count Missed | Root Cause |
|---|---|---|
| PHONE_NUMBER | 4 | 2 format variants unrecognized; 2 detected but scored 0.40 (below the 0.60 threshold) |
| CREDIT_CARD | 3 | Luhn checksum validation: synthetic card numbers fail the checksum |
| US_SSN | 2 | Same: Presidio validates SSN format strictly, so synthetic numbers are rejected |
| DATE_OF_BIRTH | 2 | No built-in DOB recognizer in Presidio |
| MEDICAL_LICENSE | 1 | No built-in recognizer |
| PERSON | 1 | Not a detection failure: a routing error sent the prompt to the wrong path, so the detector never ran |

Key insight: Most failures weren’t “the system can’t detect this.” They were:

  1. Checksum validation rejecting edge-case formats (SSN, credit card)
  2. Missing recognizers for domain-specific entities (DOB, medical license)
  3. Threshold filtering valid detections (phone numbers scored 0.40, threshold was 0.60)

All three are fixable without changing models or adding LLM calls.

The Fix: Five Pattern Recognizers + Threshold Tuning

Here’s what the detection stack looks like:

| Layer | What it does | ML? |
|---|---|---|
| spaCy NER (en_core_web_lg) | Statistical model trained on annotated text; recognizes PERSON, ORG, etc. from context | Yes (400 MB trained model) |
| Presidio built-in recognizers | Regex + checksum validation for SSN, credit card, IBAN, etc. | No |
| Custom PatternRecognizers | Regex + context keyword boosting (what we added) | Hybrid |
| Presidio orchestrator | Combines all layers, deduplicates, scores | No |

This is what right-sizing looks like in practice — instead of routing text through a large language model, we’re running a 400MB statistical NER model purpose-built for entity recognition. The right model for the right job.
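The orchestration step (combine, deduplicate, score) can be sketched in plain Python. This is an illustrative simplification, not Presidio's actual merge logic; `Detection` and `merge_detections` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    entity: str   # e.g. "US_SSN"
    start: int    # character offsets into the prompt
    end: int
    score: float  # recognizer confidence, 0.0-1.0

def merge_detections(layers: list[list[Detection]]) -> list[Detection]:
    """Combine results from all layers; when spans overlap,
    keep only the highest-scoring detection."""
    merged: list[Detection] = []
    for det in sorted((d for layer in layers for d in layer),
                      key=lambda d: (d.start, -d.score)):
        if merged and det.start < merged[-1].end:  # overlapping span
            if det.score > merged[-1].score:
                merged[-1] = det
            continue
        merged.append(det)
    return merged
```

When spaCy and a custom recognizer both fire on the same span, the orchestrator emits one detection, not two, which keeps the downstream redaction step simple.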

Custom recognizers we added

1. SSN Context Recognizer — Matches XXX-XX-XXXX format and boosts confidence when keywords like “SSN” or “social security” appear nearby. It skips Presidio’s strict format validation, because for a redaction system you want to catch PII-shaped data, not validate it.

2. Credit Card Context Recognizer — Matches XXXX-XXXX-XXXX-XXXX format. The 16-digit grouped pattern is unambiguous enough to flag without checksum validation.

3. Date of Birth Recognizer — Presidio has no built-in DOB recognizer. Ours matches ISO (1985-03-15), EU (15/03/1985), and written (March 15, 1985) formats when contextual keywords like “DOB” or “date of birth” appear.

4. Medical License Recognizer — Matches patterns like MD-45892, RN-12345 with medical context keywords. Domain-specific — irrelevant in telecom, critical in healthcare.

5. Broad Phone Recognizer — Catches format variants that the built-in recognizer misses: international with dashes (+1-555-987-6543), 8-digit local formats, and boosts standard 10-digit numbers when “phone” or “contact” appear nearby.
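To make the pattern-plus-context idea concrete, here is a minimal stdlib sketch of how a recognizer can pair a regex with keyword boosting. The patterns, base scores, window size, and function names are illustrative assumptions, not our production recognizers (those are Presidio `PatternRecognizer` instances):

```python
import re

# Illustrative base patterns for two of the recognizers above
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DOB_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|"
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4})\b"
)

def score_with_context(text: str, match: re.Match, base: float,
                       keywords: list[str], window: int = 40,
                       boost: float = 0.35) -> float:
    """Raise confidence when a context keyword appears near the match."""
    lo = max(0, match.start() - window)
    nearby = text[lo:match.end() + window].lower()
    return min(1.0, base + boost) if any(k in nearby for k in keywords) else base

text = "Patient DOB: 1985-03-15, SSN 123-45-6789."
match = DOB_RE.search(text)
score_with_context(text, match, 0.4, ["dob", "date of birth"])  # boosted above the 0.4 base
```

The same mechanism powers all five recognizers: a permissive pattern catches the shape, and nearby keywords push ambiguous matches over the detection threshold.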

Threshold tuning

Lowered detection threshold from 0.60 to 0.35.

The rationale: for a detection pass, you want high recall — catch everything suspicious. The redaction step downstream can apply a separate, stricter threshold for what it actually masks. Detection and redaction are different policy decisions with different risk tolerances.
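A minimal sketch of that two-threshold policy, assuming detections arrive as `(entity_type, score)` pairs; the function name and threshold values are illustrative:

```python
def split_by_policy(detections, detect_thr=0.35, redact_thr=0.60):
    """Apply separate thresholds for detection and redaction.

    detections: list of (entity_type, score) pairs.
    Returns (redacted, flagged): entities to mask outright, and
    lower-confidence detections kept for audit or review.
    """
    detected = [d for d in detections if d[1] >= detect_thr]
    redacted = [d for d in detected if d[1] >= redact_thr]
    flagged = [d for d in detected if d[1] < redact_thr]
    return redacted, flagged

# A phone number scored 0.40 is now detected (>= 0.35) even though
# it falls below the stricter redaction threshold.
```

Separating the two numbers means the compliance team can tighten redaction without silently dropping detections from the audit trail.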

No training data. No GPU. No new infrastructure. This runs on your existing IT stack — the same hardware already in your data centre.

The Result: 98.2% Recall

| Metric | Before | After | Change |
|---|---|---|---|
| PII recall | 76.4% (42/55) | 98.2% (54/55) | +21.8 pp |
| Avg latency | 22.5 ms | 22.3 ms | No change |
| LLM calls | 0 | 0 | Still zero |
| Infrastructure | CPU-only | CPU-only | No GPU required |

The single remaining miss: a PERSON entity in a prompt that was misrouted — sent to local handling instead of the PII detection pipeline. The detector never got a chance to run. That’s a routing problem, not a detection problem — and it’s exactly the kind of ambiguous case where a small language model adds value as a routing classifier. More on that in Part 2.

What This Means for Enterprise PII Systems

1. Diagnose before you prescribe

The instinct is “recall is low, let’s throw a bigger model at it.” But 10 of our 13 misses were configuration issues — thresholds, missing recognizers, format coverage. Fixing configuration is cheaper, faster, and more predictable than model changes.

2. Synthetic data exposes design decisions

Our synthetic prompts used fake SSNs and credit card numbers. Presidio’s checksum validation correctly rejected them — which is proper behaviour for real data. But it revealed a design decision: do you want to detect PII-shaped patterns, or only validated PII? For redaction, you want the former. For fraud detection, the latter. That’s a policy choice, not a technical limitation.
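The checksum behaviour is easy to reproduce. Below is a standard Luhn implementation applied to a made-up, PII-shaped card number; it fails the checksum, which is exactly why a validating recognizer drops it while a shape-matching recognizer keeps it:

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum: double every second digit from the
    right, subtract 9 from any result over 9, and check the total."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Synthetic, PII-shaped card number: matches the 16-digit grouped
# pattern, but fails the checksum.
luhn_valid("1234-5678-9012-3456")  # False
```

For redaction you flag it anyway; for fraud detection you would reject it. Same function, opposite policies.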

3. The threshold is a policy decision

0.6 vs 0.35 isn’t about accuracy — it’s about organisational risk tolerance. High-regulation domains (healthcare, finance) should detect aggressively. Low-regulation domains can filter more. Your compliance team sets this number, not your ML team.

4. Domain policy packs are the enterprise differentiator

Each domain has different PII entities that matter. Medical license numbers are irrelevant in telecom. IBANs are irrelevant in US healthcare. Custom recognizers per domain mean you can plug in a policy pack for your vertical.
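A policy pack can be as simple as a mapping from domain to enabled entity types. The entity lists below are illustrative, not our shipped packs:

```python
# Hypothetical policy packs: each vertical enables only the
# recognizers that matter for its regulatory scope.
POLICY_PACKS = {
    "healthcare": ["PERSON", "US_SSN", "DATE_OF_BIRTH",
                   "MEDICAL_LICENSE", "PHONE_NUMBER"],
    "finance":    ["PERSON", "US_SSN", "CREDIT_CARD",
                   "IBAN_CODE", "PHONE_NUMBER"],
    "telecom":    ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS",
                   "CREDIT_CARD"],
}

def entities_for(domain: str) -> list[str]:
    """Return the entity list for a domain, defaulting to the
    union of all packs for unknown domains (conservative choice)."""
    if domain in POLICY_PACKS:
        return POLICY_PACKS[domain]
    return sorted({e for pack in POLICY_PACKS.values() for e in pack})
```

Defaulting unknown domains to the union keeps the failure mode on the side of over-detection rather than leakage.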

What This Approach Can’t Catch — And What We’re Building Next

This approach handles structured PII — entities with known formats and patterns. It gets you to 95-98% recall. The remaining gap is where a language model genuinely adds value:

  • Implicit sensitivity: “my therapist said…” (no named entity, but sensitive context)
  • Multi-lingual PII: names and addresses in non-Latin scripts
  • Obfuscated patterns: “my social is one two three dash…”
  • Contextual judgment: is “Dr. Smith” PII or a public figure reference?

That’s where a small language model (SLM) comes in — but as a second pass on the cases the deterministic layer can’t resolve. Not as the primary detector. We’re building this layer next, and we’ll publish the three-mode comparison (rules-only vs rules+SLM vs SLM-only) with full benchmark data.

The architecture: rules for speed, NER for structure, SLM for judgment. Each layer does what it’s good at.
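That layering can be sketched as a simple escalation function. `rules_pass` and `slm_pass` are placeholder callables, since the SLM layer is still being built:

```python
from typing import Callable

def detect_pii(prompt: str,
               rules_pass: Callable[[str], tuple[list[str], bool]],
               slm_pass: Callable[[str], list[str]]) -> list[str]:
    """Layered detection sketch: the deterministic pass runs first
    and returns (entities, resolved). Only prompts the rules layer
    could not resolve ever reach the SLM."""
    entities, resolved = rules_pass(prompt)
    if resolved:
        return entities                    # fast path: no model call
    return entities + slm_pass(prompt)     # judgment pass on ambiguous cases
```

The key property: the expensive, probabilistic layer only sees the small fraction of traffic the cheap, deterministic layers could not settle.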


Part 2: Adding a small language model for the ambiguous cases — when deterministic rules say “I don’t know” and you need contextual judgment. Coming next week.