March 3, 2026 · Krunal Sabnis
How We Pushed PII Recall from 76% to 98% — Right-Sized Models, No Fine-Tuning, No LLMs
Statistical NER + five pattern recognizers + one threshold change. No fine-tuning. No GPU. No data leaving your perimeter unmasked. A practical guide to enterprise PII detection under GDPR.
The Problem
You’re building AI-powered workflows for regulated industries — healthcare, finance, telecom. Your users type prompts containing patient names, credit card numbers, social security numbers. Before any of that reaches a cloud LLM, it needs to be detected and masked.
If your PII detection pipeline sends data to a cloud LLM, you’ve already lost. The data left your perimeter before you decided whether it was safe to leave. Under GDPR, that transfer is the compliance event — not what the provider does with it afterward.
We’re building a local-first prompt router designed for GDPR compliance. The rule is simple: sensitive data gets detected and masked before it ever leaves your infrastructure. No PII in transit. No third-party exposure. Every redaction decision auditable.
The PII detection layer uses Microsoft Presidio, an open-source engine that combines statistical NER (spaCy’s named entity recognition — a trained ML model) with pattern-based recognizers. Out of the box, it’s solid. But “solid” isn’t good enough when you’re making redaction decisions on regulated data.
The Baseline: 76.4% Recall
We created 60 synthetic prompts across three domains (healthcare, telecom, finance) — each labeled with the PII entities it contains. Names, SSNs, credit cards, phone numbers, dates of birth, medical license numbers.
First run with default Presidio configuration:
| Metric | Value |
|---|---|
| PII recall | 76.4% (42/55 entities) |
| Entities missed | 13 |
That means 1 in 4 PII entities slips through undetected. In healthcare or finance, that’s a compliance violation waiting to happen.
Diagnosing the Failures
Before reaching for a larger model, we asked: what specifically is Presidio missing, and why?
We ran every missed entity through the analyzer at score threshold 0.0 to see raw detection results:
| Entity Type | Count Missed | Root Cause |
|---|---|---|
| PHONE_NUMBER | 4 | 2 format variants unrecognized, 2 detected but scored 0.40 (below 0.6 threshold) |
| CREDIT_CARD | 3 | Luhn checksum validation — synthetic card numbers fail checksum |
| US_SSN | 2 | Same — Presidio validates SSN format strictly, synthetic numbers rejected |
| DATE_OF_BIRTH | 2 | No built-in DOB recognizer in Presidio |
| MEDICAL_LICENSE | 1 | No built-in recognizer |
| PERSON | 1 | Not a detection failure — routing error sent prompt to wrong path, detector never ran |
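The checksum failures in the table are easy to reproduce. A minimal Luhn check in plain Python (a standalone sketch, independent of Presidio's internals) shows why a standard test card number passes while an arbitrary synthetic one is rejected:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111-1111-1111-1111"))  # True  (standard Visa test number)
print(luhn_valid("4532-9876-5432-1098"))  # False (arbitrary synthetic digits)
```

A validator that requires the checksum will silently drop the second number, even though to a human it is unmistakably a card-shaped secret.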
Key insight: Most failures weren’t “the system can’t detect this.” They were:
- Checksum validation rejecting edge-case formats (SSN, credit card)
- Missing recognizers for domain-specific entities (DOB, medical license)
- Threshold filtering valid detections (phone numbers scored 0.40, threshold was 0.60)
All three are fixable without changing models or adding LLM calls.
The Fix: Five Pattern Recognizers + Threshold Tuning
Here’s what the detection stack looks like:
| Layer | What it does | ML? |
|---|---|---|
| spaCy NER (en_core_web_lg) | Statistical model trained on annotated text — recognizes PERSON, ORG, etc. from context | Yes — 400MB trained model |
| Presidio built-in recognizers | Regex + checksum validation for SSN, credit card, IBAN, etc. | No |
| Custom PatternRecognizers | Regex + context keyword boosting (what we added) | Hybrid |
| Presidio orchestrator | Combines all layers, deduplicates, scores | No |
This is what right-sizing looks like in practice — instead of routing text through a large language model, we’re running a 400MB statistical NER model purpose-built for entity recognition. The right model for the right job.
Custom recognizers we added
1. SSN Context Recognizer — Matches XXX-XX-XXXX format and boosts confidence when keywords like “SSN” or “social security” appear nearby. Skips Presidio’s strict format validation: for a redaction system, you want to catch PII-shaped data, not validate it.
2. Credit Card Context Recognizer — Matches XXXX-XXXX-XXXX-XXXX format. The 16-digit grouped pattern is unambiguous enough to flag without checksum validation.
3. Date of Birth Recognizer — Presidio has no built-in DOB recognizer. Ours matches ISO (1985-03-15), EU (15/03/1985), and written (March 15, 1985) formats when contextual keywords like “DOB” or “date of birth” appear.
4. Medical License Recognizer — Matches patterns like MD-45892, RN-12345 with medical context keywords. Domain-specific — irrelevant in telecom, critical in healthcare.
5. Broad Phone Recognizer — Catches format variants that the built-in recognizer misses: international with dashes (+1-555-987-6543), 8-digit local formats, and boosts standard 10-digit numbers when “phone” or “contact” appear nearby.
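In Presidio these are `PatternRecognizer` instances; the core mechanic (a regex with a base score, boosted when context keywords appear in a nearby window) can be sketched in plain Python. The names, scores, and window size below are illustrative, not Presidio's API:

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT_KEYWORDS = ("ssn", "social security")

def detect_ssn(text: str, base_score: float = 0.4,
               context_boost: float = 0.35, window: int = 40):
    """Find SSN-shaped spans; raise the score when a context keyword is nearby."""
    results = []
    for m in SSN_PATTERN.finditer(text):
        # Look at a window of characters on either side of the match.
        lo, hi = max(0, m.start() - window), m.end() + window
        nearby = text[lo:hi].lower()
        score = base_score
        if any(kw in nearby for kw in CONTEXT_KEYWORDS):
            score = min(1.0, score + context_boost)
        results.append({"entity": "US_SSN", "span": m.span(), "score": score})
    return results

hits = detect_ssn("Patient SSN: 123-45-6789, admitted yesterday.")
print(round(hits[0]["score"], 2))  # 0.75 ("SSN" keyword found in the window)
```

The same shape covers all five recognizers: swap the regex and keyword list, and the confidence logic stays identical.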
Threshold tuning
Lowered detection threshold from 0.60 to 0.35.
The rationale: for a detection pass, you want high recall — catch everything suspicious. The redaction step downstream can apply a separate, stricter threshold for what it actually masks. Detection and redaction are different policy decisions with different risk tolerances.
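The split can be expressed directly in code: one permissive floor for what gets surfaced, a stricter, policy-owned threshold for what gets masked. A minimal sketch (thresholds, field names, and the sample spans are illustrative):

```python
DETECTION_THRESHOLD = 0.35  # permissive: surface everything suspicious
REDACTION_THRESHOLD = 0.60  # stricter: set by compliance policy, not ML tuning

def apply_policy(detections, text):
    """Mask high-confidence entities; keep lower-confidence hits for audit review."""
    masked, audit_log = text, []
    # Process right-to-left so span offsets stay valid while replacing.
    for d in sorted(detections, key=lambda d: d["start"], reverse=True):
        if d["score"] < DETECTION_THRESHOLD:
            continue  # below the detection floor: ignore entirely
        if d["score"] >= REDACTION_THRESHOLD:
            masked = masked[:d["start"]] + f"<{d['entity']}>" + masked[d["end"]:]
        else:
            audit_log.append(d)  # detected but not masked: flag for review
    return masked, audit_log

detections = [
    {"entity": "US_SSN", "start": 7, "end": 18, "score": 0.85},
    {"entity": "PHONE_NUMBER", "start": 25, "end": 37, "score": 0.40},
]
text = "SSN is 123-45-6789, call 555-987-6543 today."
masked, audit = apply_policy(detections, text)
print(masked)  # SSN is <US_SSN>, call 555-987-6543 today.
```

The 0.40-score phone number survives detection and lands in the audit log instead of being masked, which is exactly the behaviour the two-threshold split buys you.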
No training data. No GPU. No new infrastructure. This runs on your existing IT stack — the same hardware already in your data centre.
The Result: 98.2% Recall
| Metric | Before | After | Change |
|---|---|---|---|
| PII recall | 76.4% (42/55) | 98.2% (54/55) | +21.8pp |
| Avg latency | 22.5ms | 22.3ms | No change |
| LLM calls | 0 | 0 | Still zero |
| Infrastructure | CPU-only | CPU-only | No GPU required |
The single remaining miss: a PERSON entity in a prompt that was misrouted — sent to local handling instead of the PII detection pipeline. The detector never got a chance to run. That’s a routing problem, not a detection problem — and it’s exactly the kind of ambiguous case where a small language model adds value as a routing classifier. More on that in Part 2.
What This Means for Enterprise PII Systems
1. Diagnose before you prescribe
The instinct is “recall is low, let’s throw a bigger model at it.” But 12 of our 13 misses were fixable in configuration (thresholds, missing recognizers, format coverage, checksum policy) without touching the model. Fixing configuration is cheaper, faster, and more predictable than model changes.
2. Synthetic data exposes design decisions
Our synthetic prompts used fake SSNs and credit card numbers. Presidio’s checksum validation correctly rejected them — which is proper behaviour for real data. But it revealed a design decision: do you want to detect PII-shaped patterns, or only validated PII? For redaction, you want the former. For fraud detection, the latter. That’s a policy choice, not a technical limitation.
3. The threshold is a policy decision
0.6 vs 0.35 isn’t about accuracy — it’s about organisational risk tolerance. High-regulation domains (healthcare, finance) should detect aggressively. Low-regulation domains can filter more. Your compliance team sets this number, not your ML team.
4. Domain policy packs are the enterprise differentiator
Each domain has different PII entities that matter. Medical license numbers are irrelevant in telecom. IBANs are irrelevant in US healthcare. Custom recognizers per domain means: plug in a policy pack for your vertical.
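One way to structure this (a sketch of the idea, not our production config; entity lists and thresholds are illustrative) is a per-vertical pack declaring which entities matter and how aggressively to redact:

```python
# Hypothetical per-vertical policy packs; names and values are illustrative.
POLICY_PACKS = {
    "healthcare": {
        "entities": ["PERSON", "US_SSN", "DATE_OF_BIRTH",
                     "MEDICAL_LICENSE", "PHONE_NUMBER"],
        "redaction_threshold": 0.50,  # aggressive: high-regulation domain
    },
    "telecom": {
        "entities": ["PERSON", "PHONE_NUMBER", "CREDIT_CARD"],
        "redaction_threshold": 0.60,
    },
    "finance": {
        "entities": ["PERSON", "US_SSN", "CREDIT_CARD", "IBAN_CODE"],
        "redaction_threshold": 0.50,
    },
}

def pack_for(domain: str) -> dict:
    """Select the policy pack for a vertical; fail closed on unknown domains."""
    if domain not in POLICY_PACKS:
        raise ValueError(f"No policy pack for domain: {domain}")
    return POLICY_PACKS[domain]

print("MEDICAL_LICENSE" in pack_for("healthcare")["entities"])  # True
print("MEDICAL_LICENSE" in pack_for("telecom")["entities"])     # False
```

Failing closed on an unknown domain matters here: an unconfigured vertical should refuse to route rather than silently run with no recognizers.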
What This Approach Can’t Catch — And What We’re Building Next
This approach handles structured PII — entities with known formats and patterns. It gets you to 95-98% recall. The remaining gap is where a language model genuinely adds value:
- Implicit sensitivity: “my therapist said…” (no named entity, but sensitive context)
- Multi-lingual PII: names and addresses in non-Latin scripts
- Obfuscated patterns: “my social is one two three dash…”
- Contextual judgment: is “Dr. Smith” PII or a public figure reference?
That’s where a small language model (SLM) comes in — but as a second pass on the cases the deterministic layer can’t resolve. Not as the primary detector. We’re building this layer next, and we’ll publish the three-mode comparison (rules-only vs rules+SLM vs SLM-only) with full benchmark data.
The architecture: rules for speed, NER for structure, SLM for judgment. Each layer does what it’s good at.
Part 2: Adding a small language model for the ambiguous cases — when deterministic rules say “I don’t know” and you need contextual judgment. Coming next week.