PII Shield Docs
Full reference for both surfaces. The MCP plugin runs inside Claude Desktop; the CLI runs anywhere Node 22 does. Same engine, same 33 entity types, same on-disk session format. The command reference differs per surface; everything else applies to both.
Pick a surface, run three lines.
Both surfaces share ~/.pii_shield/: install one or both. Sessions, mappings, audit logs, and the GLiNER model are interchangeable.
MCP plugin (Claude Desktop)
# 1. Download pii-shield-v2.1.0-<platform>.mcpb from GitHub releases
# https://github.com/gregmos/PII-Shield/releases
# 2. In Claude Desktop:
# Settings → Extensions → Advanced Settings → Install extension
# Pick the .mcpb file
# 3. First call opens an in-chat panel to download the GLiNER model
# (~634 MB, fetched by your browser)
CLI (any terminal)
# 1. Install the npm package globally
npm install -g pii-shield
pii-shield --version
# 2. Health check (model will report missing — fix in step 3)
pii-shield doctor
# 3. Download the GLiNER model (~634 MB, one-off)
pii-shield install-model
On the first anonymize or scan, the engine installs onnxruntime-node, @xenova/transformers, and gliner at pinned versions into ~/.pii_shield/deps/installs/<hash>/. Roughly 300 MB, 1–2 min, deterministic. Subsequent runs are instant. Everything (model, deps, mappings, audit logs) survives npm uninstall -g pii-shield.
Parity at install time
| Aspect | MCP plugin | CLI |
|---|---|---|
| Node version | Bundled (macOS) / host (Win/Linux) | Host (≥22.0.0) |
| Distribution | .mcpb in GitHub releases | npm install -g pii-shield |
| Model location | ~/.pii_shield/models/gliner-pii-base-v1.0/ | Same |
| Data directory | ~/.pii_shield/ (shared) | Same |
| Audit log | ~/.pii_shield/audit/mcp_audit.log | Same |
| Session format | Identical; exports from one open in the other | Same |
Environment variables, defaults, and where data lives.
All settings are environment variables — there is no config file. Set them in your shell, in .env, or per-command (KEY=value pii-shield …). Both MCP and CLI read the same vars from the same places.
Detection sensitivity
| Variable | Default | Range | Effect |
|---|---|---|---|
| PII_MIN_SCORE | 0.30 | 0.0–1.0 | Minimum confidence for pattern recognizers (email, SSN, IBAN, …). Lower = more recall, more false positives. |
| PII_NER_THRESHOLD | 0.30 | 0.0–1.0 | Minimum confidence for NER detections (PERSON, ORG, LOCATION, NRP). 0.20 catches obscure names; 0.50 catches only obvious ones. |
# legal contracts (default tuning)
PII_MIN_SCORE=0.30 PII_NER_THRESHOLD=0.30 pii-shield anonymize contract.pdf
# medical records (higher recall — miss nothing)
PII_MIN_SCORE=0.20 PII_NER_THRESHOLD=0.20 pii-shield anonymize chart.pdf
# news clippings (higher precision — only obvious entities)
PII_NER_THRESHOLD=0.55 pii-shield anonymize article.txt
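The split between the two thresholds can be pictured with a short sketch. It assumes a detection is kept when its score is greater than or equal to the relevant threshold and that the four NER types are exactly the ones named in the table; `read_threshold` and `keep_entity` are hypothetical helper names, not part of PII Shield.

```python
import os

# Types the table above attributes to NER; everything else is treated as a pattern hit.
NER_TYPES = {"PERSON", "ORG", "LOCATION", "NRP"}

def read_threshold(name, default=0.30):
    """Read a 0.0-1.0 threshold from the environment, falling back to the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
    return value

def keep_entity(entity, min_score, ner_threshold):
    """Apply the NER threshold to NER types and the pattern threshold to the rest.
    The >= comparison is an assumption, not documented behaviour."""
    limit = ner_threshold if entity["type"] in NER_TYPES else min_score
    return entity["score"] >= limit

entities = [
    {"text": "John Smith", "type": "PERSON", "score": 0.93},
    {"text": "J. S.", "type": "PERSON", "score": 0.25},
    {"text": "a@b.example", "type": "EMAIL_ADDRESS", "score": 0.40},
]
kept = [e for e in entities if keep_entity(e, min_score=0.30, ner_threshold=0.30)]
```

With the default 0.30 thresholds the low-confidence "J. S." hit is dropped; lowering the NER threshold to 0.20 would keep it.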
Behaviour
| Variable | Default | Description |
|---|---|---|
| PII_SKIP_REVIEW | false | Set true to never open the HITL panel; useful for CI / scripting. |
| PII_MAPPING_TTL_DAYS | 7 | Sessions older than N days are deleted on next run. Bump for long matters: PII_MAPPING_TTL_DAYS=90. |
| PII_WORK_DIR | (unset) | Default working directory for relative paths. |
| PII_DEBUG | (unset) | Set true (or pass --debug) for stack traces and full audit trail. |
| PII_AUDIT_STDERR | CLI: false · MCP: true | Mirror server logs to stderr. CLI sets false so output stays clean; --debug flips to true. |
| PII_QUIET | (unset) | Set by --quiet. Disables progress bars and summary writes. |
| NO_COLOR | (unset) | Honoured per no-color.org. Disables ANSI in terminal output. |
| FORCE_COLOR | (unset) | Force ANSI even in non-TTY contexts. |
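To see which sessions a given PII_MAPPING_TTL_DAYS value would affect before the next run, a dry-run listing along these lines may help. It is only a sketch: it assumes session age is judged by file modification time, which the table does not state, and `expired_mappings` is a hypothetical helper.

```python
import time
from pathlib import Path

def expired_mappings(mappings_dir, ttl_days=7, now=None):
    """List *.json files under mappings_dir whose mtime is older than ttl_days.
    Assumption: age is measured by modification time. Nothing is deleted here."""
    now = time.time() if now is None else now
    cutoff = now - ttl_days * 86400
    return sorted(
        p for p in Path(mappings_dir).glob("*.json")
        if p.stat().st_mtime < cutoff
    )
```

Calling `expired_mappings(Path.home() / ".pii_shield" / "mappings")` returns the candidate files (mapping and review state alike) without touching them.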
Paths (rare overrides)
| Variable | Default | Description |
|---|---|---|
| PII_SHIELD_DATA_DIR | ~/.pii_shield | Root for all PII Shield state. Override to relocate everything (deps, model, audit, mappings); useful for shared/networked installs. |
| PII_SHIELD_MAPPINGS_DIR | <DATA_DIR>/mappings | Override only the mappings dir, e.g. point to a network share so a team has shared sessions. |
| PII_SHIELD_MODELS_DIR | <DATA_DIR>/models | Override only the model directory. |
| PII_SHIELD_MODEL_DOWNLOADS_DIR | (auto) | Where install-model looks for an existing gliner-pii-base-v1.0.zip (Downloads / OneDrive / Desktop / Documents are scanned by default). |
Data directory layout
~/.pii_shield/ ← PII_SHIELD_DATA_DIR
├── models/
│ └── gliner-pii-base-v1.0/ ← 634 MB, survives uninstall
│ ├── model.onnx
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ ├── special_tokens_map.json
│ └── gliner_config.json
├── deps/
│ └── installs/<hash>/ ← onnxruntime-node + transformers + gliner (pinned)
├── mappings/ ← PII_SHIELD_MAPPINGS_DIR (can be relocated)
│ ├── <session_id>.json ← placeholder → real PII (the one secret)
│ └── review_<session_id>.json ← HITL state (entities + overrides)
├── audit/
│ ├── mcp_audit.log ← every CLI command + arguments + result
│ ├── ner_init.log ← NER bootstrap trace
│ ├── ner_debug.log ← per-call NER detail
│ └── server.log ← lifecycle + errors
└── cache/
└── gliner-pii-base-v1.0.zip ← deleted after install
The mapping files are the one place real PII lives outside your source documents. They have 0o700 permissions on POSIX. Deleting a mapping makes that session impossible to deanonymize.
National IDs, tax numbers, passports — and the obvious stuff too.
17 country-specific patterns on top of generic entity detection: the fields a "find all emails" regex would never catch.
- UK (5): NIN · NHS · Passport · CRN · Driving licence
- DE (2): Tax ID · Social security
- FR (2): NIR · CNI
- IT (2): Fiscal code · VAT
- ES (2): DNI · NIE
- CY (2): TIC · National ID
- EU (2): VAT · Passport
- GLiNER zero-shot
- person
- org
- location
- NRP (nationality / religion / politics)
- all 17 jurisdiction patterns
- email · phone · URL · IP
- IBAN · credit card · crypto wallet
- US SSN · US passport · US driver licence
- medical licence · ID doc
Full enumerated list of 33 types in §10 Entity types.
One engine. Two surfaces.
MCP exposes 17 tools to Claude Desktop / Claude Code via stdio. CLI exposes 11 commands to your shell. Same engine underneath: every entity, every session format, every audit log line is identical.
17 MCP tools, four groups
Each tool is a single MCP function callable from Claude Desktop. Inputs are validated with Zod; errors return MCP-protocol-shaped { isError: true, content: [...] } rather than throwing.
Round-trip example
// 1. anonymize a file — original stays on disk; only placeholders cross the wire
→ anonymize_file({ file_path: "~/contracts/nda.docx" })
← { status: "success",
session_id: "m9q8x4-7b5a91",
entity_count: 14,
output_path: ".../nda.anon.txt",
docx_output_path: ".../nda.anon.docx",
by_type: { PERSON: 4, ORG: 3, MONEY: 2, ... } }
// 2. Claude reads output_path and drafts a memo using placeholders only
// 3. restore PII into the draft, locally
→ deanonymize_text({ text: draft, session_id: "m9q8x4-7b5a91" })
← { status: "success",
deanonymized_text: "...John Smith signed the NDA..." }
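Step 3 of the round trip is conceptually a lookup-and-replace over placeholder tokens. The sketch below assumes the mapping is a flat dict keyed by the full tag (angle brackets included) and that tags look like <TYPE_N> with an optional lowercase variant letter; the real on-disk mapping format is not shown in this document, and `restore` is a hypothetical name.

```python
import re

# Matches tags such as <PERSON_1>, <ORG_1a> or <DOC1_PERSON_1> (shape assumed from the examples).
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*_\d+[a-z]?>")

def restore(text, mapping):
    """Replace every known placeholder with its real value; unknown tags are left untouched."""
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)

draft = "<PERSON_1> signed for <ORG_1>; <PERSON_9> did not."
restored = restore(draft, {"<PERSON_1>": "John Smith", "<ORG_1>": "Acme Corp"})
```

Leaving unknown tags in place mirrors the troubleshooting advice later on: a placeholder that does not match the session is visible in the output rather than silently dropped.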
11 CLI commands, four groups
All commands share two global flags: -q, --quiet (suppresses progress + summary), --debug (full audit trail + stack traces). Session ids accept unique prefixes (git-style): pii-shield review 2026-04 works if exactly one session matches.
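Git-style prefix matching can be sketched as follows. This illustrates the rule "works if exactly one session matches"; `resolve_session_prefix` is a hypothetical helper, and letting an exact id win over longer ids that share it as a prefix is an assumption.

```python
def resolve_session_prefix(prefix, session_ids):
    """Return the single session id that starts with prefix, or raise LookupError."""
    if prefix in session_ids:  # assumption: an exact id always wins
        return prefix
    matches = [sid for sid in session_ids if sid.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no session matches prefix {prefix!r}")
    raise LookupError(f"prefix {prefix!r} is ambiguous: {', '.join(sorted(matches))}")

known = ["2026-04-29_120000_ab12", "2026-05-02_093000_cd34"]
resolved = resolve_session_prefix("2026-04", known)
```

A prefix such as "2026" would match both ids above and raise, which is why a longer prefix is needed once several sessions share a month.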
Detail by command
anonymize <files…> [-o out] [-s session] [--no-review] [--lang en] [--prefix p] [-y] [--json]
Anonymize one or more files. All files in one invocation share one session and one mapping pool — identical entities across files get the same placeholder. Supported inputs: .pdf, .docx, .txt, .md, .csv, .log, .html.
| Flag | Default | Description |
|---|---|---|
| -o, --out <dir> | <input-dir>/pii_shield_<sid>/ | Single output directory for all files in this run. |
| -s, --session <id> | (new session) | Extend an existing session: pool + placeholders are reused. |
| --no-review | (off; review opens) | Skip HITL: write outputs and exit. |
| --lang <code> | en | Language hint for NER. |
| --prefix <p> | (empty) | Prepend to placeholder tags, e.g. --prefix DOC1_ → <DOC1_PERSON_1>. |
| -y, --yes | - | Auto-confirm prompts (model download, double-anon warning). |
| --json | - | Emit structured JSON to stdout. Implies --no-review. |
pii-shield anonymize contract.pdf --no-review
pii-shield anonymize matter/*.pdf matter/*.docx --out anonymized/
pii-shield anonymize new.pdf --session 2026-04-29
pii-shield anonymize NDA.docx --json --yes
| Input | Output |
|---|---|
| .docx | <name>_anonymized.docx (formatting preserved) and <name>_anonymized.txt (extracted text) |
| .pdf | <name>_anonymized.txt (PDF write-back not supported) |
| .txt / .md / .csv | <name>_anonymized.<ext> |
Exit codes: 0 = OK · 1 = error · 2 = wrong arguments.
See also: MCP anonymize_file · §05 Workflows
deanonymize <file> [-s session] [-o out]
Restore real PII from placeholders. Works on any file containing placeholders that match a known session — even files an LLM authored externally, as long as the placeholders survived intact.
| Flag | Default | Description |
|---|---|---|
| -s, --session <id> | embedded → latest | Session id or unique prefix. Falls through .docx metadata, then latest session on machine. |
| -o, --out <path> | <name>_restored.<ext> | Explicit output path. |
- Explicit --session <id> wins.
- For .docx only: read pii_shield.session_id from docProps/custom.xml (auto-embedded by anonymize).
- Latest session on this machine.
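The lookup order can be sketched in a few lines. The property name pii_shield.session_id and the docProps/custom.xml location come from the list above; how the value is stored inside the property element is an assumption (the sketch simply reads the text of the first child), and both function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_PROPS = "docProps/custom.xml"

def embedded_session_id(docx, name="pii_shield.session_id"):
    """Return the custom document property value from a .docx (path or file object), or None."""
    with zipfile.ZipFile(docx) as zf:
        if CUSTOM_PROPS not in zf.namelist():
            return None
        root = ET.fromstring(zf.read(CUSTOM_PROPS))
    for prop in root:
        # Assumption: the value lives in the first child element of the property.
        if prop.get("name") == name and len(prop):
            return (prop[0].text or "").strip() or None
    return None

def pick_session(explicit, docx, latest):
    """Explicit id first, then .docx metadata, then the latest local session."""
    if explicit:
        return explicit
    if docx is not None:
        sid = embedded_session_id(docx)
        if sid:
            return sid
    return latest
```

For non-.docx inputs, pass `docx=None` so the function falls straight through to the latest session.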
pii-shield deanonymize contract_anonymized.docx
pii-shield deanonymize analysis.docx --session 2026-04-29_120000_ab12
pii-shield deanonymize summary.txt --session 2026-04 --out final.txt
See also: MCP deanonymize_text / deanonymize_docx
scan <file> [--json] [--lang en] [--wait-ner 30] [-y]
Detect PII without writing anything. Useful for previewing what would be anonymized, or for piping JSON into custom workflows.
| Flag | Default | Description |
|---|---|---|
| --json | - | Emit machine-readable JSON to stdout. |
| --lang <code> | en | Language hint. |
| --wait-ner <s> | 30 | Max seconds to wait for NER on cold start. |
| -y, --yes | - | Auto-confirm model-download prompt. |
{
"file": "contract.pdf",
"ext": ".pdf",
"bytes": 145823,
"char_count": 8421,
"entities": [
{"text": "John Smith", "type": "PERSON", "start": 12, "end": 22, "score": 0.93}
]
}
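Consuming this JSON from a script might look like the following sketch. It relies only on the keys shown in the sample above; treating `end` as exclusive is an assumption that happens to fit the sample ("John Smith" is 10 characters, 22 - 12 = 10), and the helper names are hypothetical.

```python
import json
from collections import Counter

def count_by_type(scan_json):
    """Count detected entities per type from `pii-shield scan --json` output."""
    report = json.loads(scan_json)
    return Counter(e["type"] for e in report["entities"])

def spans_consistent(scan_json):
    """Check that each entity's text length equals end - start (end assumed exclusive)."""
    report = json.loads(scan_json)
    return all(len(e["text"]) == e["end"] - e["start"] for e in report["entities"])

sample = """
{"file": "contract.pdf", "ext": ".pdf", "bytes": 145823, "char_count": 8421,
 "entities": [{"text": "John Smith", "type": "PERSON", "start": 12, "end": 22, "score": 0.93}]}
"""
counts = count_by_type(sample)
```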
pii-shield scan contract.pdf
pii-shield scan contract.pdf --json | jq '.entities | length'
pii-shield scan chart.pdf --json > entities.json
See also: MCP scan_text
verify <file> -s session [--json] [--lang en]
Re-detect PII on an anonymized file and fail if any non-placeholder entity is found. Use as a final compliance gate before sending output to an external LLM.
| Flag | Default | Description |
|---|---|---|
| -s, --session <id> | required | Session id or unique prefix. |
| --json | - | Emit machine-readable JSON to stdout. |
| --lang <code> | en | Language hint. |
| -y, --yes | - | Auto-confirm model-download prompt. |
- Re-runs the engine over the anonymized text.
- For each detected entity, checks if its text is a placeholder from the session's mapping (<PERSON_1>, etc.). Placeholders are skipped.
- Anything else → flagged as a possible leak (offset + 60-char context printed).
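The three steps above can be sketched as a filter over detections. Detections are taken as given here (the engine call is out of scope), the placeholder set is assumed to hold full tags, and centring the 60-character context window on the hit is a guess at how the context is cut; `find_leaks` is a hypothetical name.

```python
def find_leaks(text, detections, placeholders, context=60):
    """Return detections whose text is not a known placeholder, with surrounding context."""
    leaks = []
    for d in detections:
        span = text[d["start"]:d["end"]]
        if span in placeholders:
            continue  # a placeholder re-detected as an entity is expected, not a leak
        lo = max(0, d["start"] - context // 2)
        hi = min(len(text), d["end"] + context // 2)
        leaks.append({"offset": d["start"], "type": d["type"], "context": text[lo:hi]})
    return leaks

text = "<PERSON_1> met Jane Roe at <ORG_1>."
jane = text.index("Jane Roe")
detections = [
    {"start": 0, "end": len("<PERSON_1>"), "type": "PERSON"},
    {"start": jane, "end": jane + len("Jane Roe"), "type": "PERSON"},
]
leaks = find_leaks(text, detections, placeholders={"<PERSON_1>", "<ORG_1>"})
```

A non-empty result corresponds to exit code 1 in the command.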
pii-shield verify contract_anonymized.txt --session 2026-04-29
pii-shield verify summary.docx --session 2026-04 --json
Exit codes: 0 = clean · 1 = possible leaks (or model/session error) · 2 = usage.
See also: §05 Workflows → Compliance gate
review <session-id> [-y]
Open the HITL review UI in your default browser for an existing session. Useful when you ran anonymize --no-review initially and want to review later, or after sessions import from another machine. The CLI prints a localhost URL with a 32-hex bearer token; only your machine + browser can reach it.
| Flag | Default | Description |
|---|---|---|
| -y, --yes | - | Auto-confirm any prompts. |
pii-shield review 2026-04-29_120000_ab12
# → Review URL: http://127.0.0.1:6789/?token=4b4c77cf6d12ba7d6c8d47a1c91775cc
# → opens browser
After approval, any documents whose review carries non-empty overrides are re-anonymized in place against a fresh shared placeholder state — multi-doc consistency stays intact even after corrections.
See also: MCP start_review · §07 HITL review
doctor [--json]
Health check — verifies Node version, write permissions, model presence, deps cache, and NER status. Use as a CI gate before running anonymization in unattended pipelines.
| Flag | Default | Description |
|---|---|---|
| --json | - | Emit JSON to stdout. |
pii-shield doctor
pii-shield doctor --json
Exit code 0 if all checks pass, 1 if any fail.
install-model [--force] [-y]
Download and install the GLiNER ONNX model (~634 MB) into ~/.pii_shield/models/. If a valid gliner-pii-base-v1.0/model.onnx already exists (>100 MB), the command exits 0 without action unless --force is passed.
| Flag | Default | Description |
|---|---|---|
| --force | - | Reinstall over a present model. |
| -y, --yes | - | Skip the download confirmation prompt. |
pii-shield install-model # interactive
pii-shield install-model --yes # non-interactive (CI)
pii-shield install-model --force # reinstall over a present model
sessions list [--json]
Table of local sessions, newest first.
| Column | Meaning |
|---|---|
| session_id | Format YYYY-MM-DD_HHMMSS_xxxx |
| modified | ISO-8601 timestamp of last write |
| docs | Number of documents anonymized under this session |
| entities | Total placeholders in the mapping pool |
pii-shield sessions list
pii-shield sessions list --json
sessions show <session-id> [--json]
Detail view: every doc in the session with doc_id, source path, source SHA-256, anonymized timestamp, plus review status.
pii-shield sessions show 2026-04-29_120000_ab12
pii-shield sessions show 2026-04-29 --json # short prefix
sessions find <path> [--json]
Find which session(s) include a given source file. Linear scan over all sessions in ~/.pii_shield/mappings/.
pii-shield sessions find ~/Documents/contract.pdf
# → Found 1 session(s) including /home/user/Documents/contract.pdf:
# → 2026-04-29_120000_ab12 doc_id=lkhc91b-3e2f4a 2026-04-29T12:00:14.123Z
pii-shield sessions find /tmp/missing.txt --json
# → [] (exit 1)
Exit 0 if at least one session matches, exit 1 if none.
sessions export <session-id> -o out [-p passphrase]
Export an encrypted archive (.pii-session) for hand-off to a colleague. AES-256-GCM with scrypt key derivation (N=16384, r=8, p=1). Wrong passphrase fails loudly — never silent corruption.
| Flag | Default | Description |
|---|---|---|
| -o, --out <path> | required | Output file path. |
| -p, --passphrase <p> | (prompts) | Passphrase. If omitted, the CLI prompts (input is masked). |
- manifest.json: version + integrity hash
- mapping.json: placeholder → real-PII mapping + per-doc metadata
- review.json: HITL review state (if any)
pii-shield sessions export 2026-04-29_120000_ab12 \
--passphrase "correct horse battery staple" \
--out contract-matter.pii-session
See also: MCP export_session · §08 Cross-machine handoff
sessions import <archive> [-p passphrase] [--overwrite]
Decrypt and persist a session archive locally.
| Flag | Default | Description |
|---|---|---|
| -p, --passphrase <p> | (prompts) | Passphrase. Prompts (masked) if omitted. |
| --overwrite | - | Replace an existing session of the same id. |
pii-shield sessions import contract-matter.pii-session \
--passphrase "correct horse battery staple"
pii-shield sessions import a.pii-session --overwrite
See also: MCP import_session
Parity at a glance
Five flows the engine is built for.
Each works on either surface — sessions exported from one open in the other, with the same mapping format on disk. Pick the one that matches your tooling.
1. Anonymize → external LLM → deanonymize
The headline use case. Send placeholders to any LLM, restore real data on your machine. Works with ChatGPT, Gemini, Claude, local Llama / Mistral, internal corporate gateways — anything that accepts text.
# 1. Anonymize. Note the session id.
pii-shield anonymize NDA.docx --no-review
# → Session: 2026-04-29_120000_ab12
# 2. Take NDA_anonymized.docx to ChatGPT / Gemini / Claude / DeepSeek.
# Ask: "Summarise this NDA. Keep tokens like <PERSON_1> verbatim."
# Save the response as summary.docx.
# 3. Restore PII in the LLM output.
pii-shield deanonymize summary.docx --session 2026-04-29_120000_ab12
# → summary_restored.docx with real names back
You are reviewing an anonymized contract. Tokens of the form <PERSON_1>, <ORG_3>, <EMAIL_ADDRESS_2> are anonymous placeholders. Keep them VERBATIM in your output: do not rephrase, translate, hyphenate, or split them. Treat them as opaque proper nouns.
2. Multi-file matter (one session, many docs)
In a single anonymize invocation, all files share one session and one mapping pool. The same entity gets the same placeholder in every file, so cross-doc references stay coherent.
pii-shield anonymize matter/contract.pdf matter/sow.pdf matter/nda.pdf matter/*.docx
# → ONE session_id covers all files
# → "Acme Corp" in any file → <ORG_1> in every file
# → "John Smith" → <PERSON_1> consistently
# → bulk-mode review panel opens with one tab per file
Variant suffixes (a, b, …) come from family-based dedup: longest form wins as canonical, shorter / variant spellings get suffixed family numbers. Acme Corp → <ORG_1>, Acme Corp. → <ORG_1>, Acme Corporation → <ORG_1a>, Acme → <ORG_1b>. Every variant maps back verbatim on deanonymize.
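The shared mapping pool behind this behaviour can be approximated with a small class. This is a deliberately simplified sketch: it assigns one numbered placeholder per exact (type, text) pair and does not model the family-based variant suffixes described above; `PlaceholderPool` is a hypothetical name, not the engine's.

```python
from collections import defaultdict

class PlaceholderPool:
    """One pool per session: the same (type, text) pair always gets the same tag."""

    def __init__(self, prefix=""):
        self.prefix = prefix               # e.g. "DOC1_", as with --prefix
        self._by_entity = {}               # (type, text) -> placeholder
        self._counters = defaultdict(int)  # per-type running number

    def placeholder_for(self, entity_type, text):
        key = (entity_type, text)
        if key not in self._by_entity:
            self._counters[entity_type] += 1
            number = self._counters[entity_type]
            self._by_entity[key] = f"<{self.prefix}{entity_type}_{number}>"
        return self._by_entity[key]

    def mapping(self):
        """Placeholder -> real text, the direction needed for deanonymize."""
        return {ph: text for (_, text), ph in self._by_entity.items()}

pool = PlaceholderPool()
first = pool.placeholder_for("ORG", "Acme Corp")      # seen in contract.pdf
second = pool.placeholder_for("PERSON", "John Smith")
again = pool.placeholder_for("ORG", "Acme Corp")      # seen again in sow.pdf
```

Because the pool outlives any single file, the second sighting of "Acme Corp" reuses the tag issued for the first.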
3. Compliance gate in CI
Use verify as a fail-loud step in your pipeline. Re-runs the engine on the anonymized output; if any non-placeholder PII is detected, exit 1 and the pipeline stops.
# bash, GitHub Actions / GitLab CI / Jenkins
pii-shield anonymize "$INPUT" --no-review --json > result.json
SID=$(jq -r .session_id result.json)
OUT=$(jq -r '.results[0].output_path' result.json)
pii-shield verify "$OUT" --session "$SID"
# exit 0 → safe to send to external LLM
# exit 1 → leak detected; pipeline stops
4. Cross-machine handoff (encrypted)
A colleague needs to deanonymize the LLM output, but the anonymization happened on your machine. Send them an encrypted archive — PII never crosses the wire in plain form. Pass the passphrase out-of-band (Signal, phone, separate email) — never in the same channel as the file.
# Your machine
pii-shield sessions export 2026-04-29_120000_ab12 \
--passphrase "<some long passphrase>" \
--out matter-1234.pii-session
# Their machine
pii-shield sessions import matter-1234.pii-session \
--passphrase "<the same passphrase>"
pii-shield deanonymize summary.docx --session 2026-04-29_120000_ab12
5. Re-anonymize with HITL corrections
The HITL review lets you remove false positives and add missed entities. After approval, the engine re-anonymizes affected files in place against a fresh shared placeholder state — multi-doc consistency is preserved even after corrections.
pii-shield anonymize contracts/*.pdf
# → browser opens; click some entities to remove,
# select text to add as missed PII, click Approve
# → CLI prints "Re-anonymized 3 doc(s) with corrections" + final paths
# Reopen review later (e.g., browser closed, or after sessions import)
pii-shield review 2026-04-29_120000_ab12
Three logs, all on disk. No telemetry.
Everything PII Shield does is recorded locally. There is no telemetry endpoint to disable. If your DPO asks "what did the tool do with that file at 14:02 UTC," the answer is on your filesystem.
What leaves your machine
If you use either surface alone, nothing leaves your machine. The engine reads files locally, runs NER + pattern matching on CPU only, writes anonymized outputs locally, and (for CLI HITL review) opens a server bound to 127.0.0.1 only.
The only outbound networking happens during install-model — downloads from github.com/gregmos/PII-Shield/releases/. Everything else is local.
Audit log format
2026-04-29T12:00:01.234Z >>> CALL anonymize_text_cli({"file":"/x/contract.pdf","char_count":8421,"session_id":""})
2026-04-29T12:00:14.123Z <<< RESP anonymize_text_cli -> {"anonymized":"... <ORG_1> ..."} ... [127 chars total]
PII is truncated at 200 chars in the log. The full PII is only in mappings/<session>.json (mode 0o700). To prove no PII left the machine, read the audit log: every external operation is recorded. There are no hidden side channels.
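For answering "what did the tool do at a given time", the log lines can be parsed with a small script. The sketch assumes every line follows the two shapes shown above (timestamp, direction marker, CALL or RESP, tool name, payload); response payloads are kept as raw text because they may be truncated. `parse_audit_line` is a hypothetical helper.

```python
import json
import re
from datetime import datetime, timezone

LINE = re.compile(r"^(\S+) (>>>|<<<) (CALL|RESP) (\w+)(.*)$")

def parse_audit_line(line):
    """Split one audit line into timestamp, kind, tool name and payload; None if it does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    ts, _, kind, tool, rest = m.groups()
    rest = rest.strip()
    record = {
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc),
        "kind": kind,
        "tool": tool,
        "raw": rest,
    }
    # CALL payloads are shown as tool({...}); decode the JSON arguments when possible.
    if kind == "CALL" and rest.startswith("(") and rest.endswith(")"):
        try:
            record["args"] = json.loads(rest[1:-1])
        except json.JSONDecodeError:
            pass
    return record

call = parse_audit_line(
    '2026-04-29T12:00:01.234Z >>> CALL anonymize_text_cli'
    '({"file":"/x/contract.pdf","char_count":8421,"session_id":""})'
)
```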
Trust boundary
| Component | Trust level |
|---|---|
| Source documents | Trusted |
| ~/.pii_shield/mappings/<sid>.json | Highly sensitive: contains real PII |
| <file>_anonymized.<ext> outputs | Safe to send to LLMs / colleagues |
| .pii-session export archives | Encrypted; safe IF passphrase is strong |
| ~/.pii_shield/audit/*.log | Mostly safe: file paths only, no real PII |
| ~/.pii_shield/models/, deps/ | Public: same content as the npm package + GitHub release |
Audit log rotation
The log grows append-only. There is no built-in rotation as of v2.1.0; wipe manually for compliance, or schedule rotation via your OS (logrotate on Linux, Task Scheduler on Windows).
# keep the directory; new lines start fresh
rm ~/.pii_shield/audit/*.log
Encrypted handoff details
Session archives use AES-256-GCM with a key derived via scrypt from the passphrase (parameters: N=16384, r=8, p=1). Archive format is versioned (PII1 magic, version byte 0x01) so future PII Shield versions stay backward-compatible. Wrong passphrase = loud decryption failure (no silent corruption).
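The key-derivation half of this scheme can be reproduced with the Python standard library. The scrypt parameters below are the ones stated above; the 16-byte salt and the `derive_key` name are assumptions for illustration, and the AES-256-GCM step itself is omitted because it needs a third-party library.

```python
import hashlib
import os

def derive_key(passphrase, salt):
    """Derive a 32-byte (AES-256) key from a passphrase with scrypt N=16384, r=8, p=1."""
    return hashlib.scrypt(
        passphrase.encode("utf-8"),
        salt=salt,
        n=16384,
        r=8,
        p=1,
        maxmem=64 * 1024 * 1024,  # headroom above the ~16 MiB these parameters need
        dklen=32,
    )

salt = os.urandom(16)  # assumption: salt length is not documented here
key = derive_key("correct horse battery staple", salt)
wrong = derive_key("correct horse battery stapler", salt)
```

A different passphrase yields an unrelated key, so the GCM authentication tag fails to verify; that is the mechanism behind a loud failure rather than silently corrupted output.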
Add what was missed. Remove false positives.
A pre-flight panel that surfaces every detected entity before placeholders ship to any LLM. MCP renders it inside Claude Desktop as an iframe; CLI renders it in your default browser on a localhost port.
What you can do in the panel
- Remove false positives — click any highlighted entity. All occurrences across all documents are removed.
- Add missed entities — select text, click a type button. All word-boundary matches are added across all tabs.
- See impact live — the right pane shows the anonymized version updating as you make changes.
MCP vs CLI rendering
| Aspect | MCP | CLI |
|---|---|---|
| Where it renders | Inside Claude Desktop (MCP Apps iframe) | Default browser, localhost |
| Triggered by | start_review tool call | End of anonymize, or review <sid> |
| Network binding | n/a (in-process) | 127.0.0.1:6789 (auto 6789–6799 if port busy) |
| Auth | n/a | 32-hex bearer token in URL only, never on disk; origin-checked on every POST |
| Idle timeout | n/a | 30 minutes (browser heartbeats every 30 s) |
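The CLI column can be illustrated with a sketch of how a localhost-only review server might pick its port and token. This is not PII Shield's code: the function names are hypothetical, and only the values (127.0.0.1, ports 6789–6799, a 32-hex token) come from the table.

```python
import hmac
import secrets
import socket

def bind_review_socket(host="127.0.0.1", first=6789, last=6799):
    """Bind to the first free port in the range, loopback only; return (socket, port)."""
    for port in range(first, last + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port busy: try the next one
            continue
        sock.listen()
        return sock, port
    raise RuntimeError(f"no free port in {first}-{last}")

def new_token():
    """16 random bytes rendered as 32 hex characters, kept in memory only."""
    return secrets.token_hex(16)

def token_matches(expected, presented):
    """Constant-time comparison to avoid leaking the token through timing."""
    return hmac.compare_digest(expected, presented)
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps the panel unreachable from other machines; the token then guards against other local processes guessing the URL.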
Bulk mode (multi-doc)
When the session has ≥2 documents, the panel shows tabs at the top — one per document. Each doc keeps independent removed / added state. Approve once per tab; when all tabs are approved, the success overlay appears and the CLI continues. Adding an entity in one tab propagates word-boundary matches across all other tabs (case-insensitive). Removing is per-tab.
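Word-boundary, case-insensitive propagation can be sketched with a regular expression. The lookarounds below stand in for "word boundary" so that terms starting or ending with punctuation still work; whether the panel uses the same definition is an assumption, and both function names are hypothetical.

```python
import re

def find_matches(term, text):
    """Return (start, end) spans where term appears as a whole word, ignoring case."""
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

def propagate(term, docs):
    """Apply the same search to every document (tab) in the session."""
    return {name: find_matches(term, body) for name, body in docs.items()}

hits = propagate("Acme", {
    "contract.pdf": "ACME and Acmeville, acme.",
    "sow.pdf": "No match here.",
})
```

"Acmeville" is skipped because the match must end at a word boundary, while "ACME" and "acme" are both picked up regardless of case.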
Keyboard shortcuts
| Key | Action |
|---|---|
| Tab / Shift+Tab | Next / previous entity |
| D | Remove the focused entity |
| R | Restore the focused entity (undo remove) |
| A | Approve the document |
| / | Focus the entity filter |
| Esc | Clear focus / close dropdowns |
When NOT to use HITL
- Unattended CI / cron: set PII_SKIP_REVIEW=true or pass --no-review.
- SSH / headless servers: no browser available. Use --no-review and inspect outputs manually with verify.
- Streaming pipelines: same as above; rely on the default 0.30 thresholds.
Move a session safely between machines.
A colleague needs to deanonymize the LLM output, but the anonymization happened on your machine. The .pii-session archive packs the mapping + review state, encrypted with a passphrase. PII never transits in plain form. Both surfaces produce and consume the same archive format.
Step-by-step
# 1. Source machine — export
pii-shield sessions export 2026-04-29_120000_ab12 \
--passphrase "<long passphrase>" \
--out matter-1234.pii-session
# → matter-1234.pii-session (a few KB to a few MB)
# 2. Transfer the file via any channel.
# Pass the passphrase out-of-band (Signal, phone,
# separate email) — NEVER in the same channel as the file.
# 3. Destination machine — import
pii-shield sessions import matter-1234.pii-session \
--passphrase "<same passphrase>"
# 4. Use it as if it were created locally
pii-shield deanonymize summary.docx --session 2026-04-29_120000_ab12
pii-shield review 2026-04-29_120000_ab12 # reopen HITL
Cross-surface handoff
Sessions exported by the CLI open in MCP, and vice versa. The mapping format is version-locked at the file level, not the surface. A typical mixed flow:
- Anonymize a 30-doc matter in the CLI (faster for batches via --no-review).
- Export a session archive.
- A reviewer on a different machine imports it into Claude Desktop (MCP).
- The reviewer runs the analysis in-chat; deanonymize_docx restores PII into the final memo.
What the archive contains
- manifest.json: version + integrity hash
- mapping.json: placeholder → real-PII mapping + per-doc metadata (source SHA-256, language, timestamps)
- review.json: HITL review state (added / removed entities)
Crypto: AES-256-GCM, key derived via scrypt (N=16384, r=8, p=1). A wrong passphrase fails loudly with the error "wrong passphrase or corrupted archive"; there is no silent corruption.
Every command speaks JSON.
There is no separate Python SDK — the CLI is the integration surface. Every read-side command supports --json; exit codes are stable; subprocess spawn is the integration pattern. For high-throughput pipelines, the same backend exposes MCP via stdio.
JSON outputs by command
| Command | --json available | Top-level keys |
|---|---|---|
| anonymize | ✓ | session_id, entity_count, pool_size, ner_ready, results[] |
| deanonymize | ✗ | (path printed on stdout) |
| scan | ✓ | file, bytes, char_count, entities[] |
| verify | ✓ | ok, leaks_found, leaks[] |
| sessions list | ✓ | array of {session_id, modified, docs, entities} |
| sessions show | ✓ | session_id, entity_count, documents[], review |
| sessions find | ✓ | array of {session_id, doc_id, source_path, anonymized_at} |
| doctor | ✓ | version, node_version, checks[], ok |
--json on anonymize implies --no-review (no browser inside a non-interactive script). Exit codes: 0 = success, 1 = error / leak detected / 0 hits (for find), 2 = wrong arguments.
Minimal Python wrapper
from __future__ import annotations  # lets `str | None` annotations work on Python < 3.10

import json
import pathlib
import subprocess


class PiiShield:
    """Thin subprocess wrapper around the pii-shield CLI."""

    def __init__(self, binary: str = "pii-shield"):
        self.bin = binary

    def _run(self, *args, check_zero: bool = True) -> str:
        # Run the CLI and return stdout; raise on non-zero exit unless told otherwise.
        r = subprocess.run([self.bin, *args], capture_output=True, text=True)
        if check_zero and r.returncode != 0:
            raise RuntimeError(f"pii-shield {' '.join(args)} → {r.returncode}: {r.stderr}")
        return r.stdout

    def doctor(self) -> dict:
        # Exit 1 means a check failed; the JSON report is still on stdout.
        return json.loads(self._run("doctor", "--json", check_zero=False))

    def scan(self, path: str) -> dict:
        return json.loads(self._run("scan", path, "--json"))

    def anonymize(self, *paths: str, session: str | None = None,
                  out: str | None = None) -> dict:
        # --json implies --no-review; --yes auto-confirms prompts.
        args = ["anonymize", *paths, "--json", "--yes"]
        if session:
            args += ["--session", session]
        if out:
            args += ["--out", out]
        return json.loads(self._run(*args))

    def deanonymize(self, path: str, session: str, out: str | None = None) -> str:
        args = ["deanonymize", path, "--session", session]
        if out:
            args += ["--out", out]
        self._run(*args)
        return out or self._infer_restored_path(path)

    def verify(self, path: str, session: str) -> dict:
        # Exit 1 signals possible leaks, so do not raise on it.
        return json.loads(self._run("verify", path, "--session", session, "--json",
                                    check_zero=False))

    def session_for_file(self, path: str) -> str | None:
        # `sessions find` exits 1 with [] when nothing matches.
        hits = json.loads(self._run("sessions", "find", path, "--json",
                                    check_zero=False))
        return hits[0]["session_id"] if hits else None

    @staticmethod
    def _infer_restored_path(input_path: str) -> str:
        # Mirrors the documented default output name: <name>_restored.<ext>
        p = pathlib.Path(input_path)
        return str(p.with_name(p.stem + "_restored" + p.suffix))
Common patterns
Pipeline with verification gate:
result = shield.anonymize(*input_files, out="staging/")
for r in result["results"]:
    v = shield.verify(r["output_path"], session=result["session_id"])
    if not v["ok"]:
        raise RuntimeError(f"Leak in {r['output_path']}: {v['leaks']}")
# Safe to send result["results"] to external LLM
Idempotent re-runs (skip already-anonymized files):
sid = shield.session_for_file("contract.pdf")
if sid is None:
    result = shield.anonymize("contract.pdf")
    sid = result["session_id"]
else:
    print(f"Already anonymized in session {sid}")
Cost control before sending to a paid LLM:
scan = shield.scan(file)
if any(e["type"] == "US_SSN" for e in scan["entities"]):
    raise SystemExit("File contains SSNs; redact manually first.")
result = shield.anonymize(file)
Performance notes
- Subprocess spawn cost: ~200–500 ms cold per pii-shield invocation. For batches, prefer one anonymize call with all files over N single-file calls.
- NER deps install (~/.pii_shield/deps/): once per machine, ~1–2 min. After that all runs are warm.
- Model load (~/.pii_shield/models/): ~15 s on a fresh process. Subprocess pattern means each call pays this cost.
- For 10–50 docs in one session: subprocess pattern is fine. Spawn cost is ≪ NER inference time.
- For thousands of docs: run PII Shield as a long-lived MCP server (node dist/server.bundle.mjs) and talk JSON-RPC. Persistent process; full tool surface; no per-call boot cost. See @modelcontextprotocol/python-sdk.
33 types, grouped by source.
Authoritative list lives in nodejs-v2/src/engine/entity-types.ts (SUPPORTED_ENTITIES). NER detections come from GLiNER zero-shot; the rest are pattern recognizers.
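NER-based (4): person · org · location · NRP (nationality / religion / politics)
Generic patterns (9): email · phone · URL · IP · IBAN · credit card · crypto wallet · medical licence · ID doc
United States (3): SSN · passport · driver licence
United Kingdom (5): NIN · NHS · Passport · CRN · Driving licence
EU-wide (2): VAT · Passport
Country-specific (10): DE Tax ID, Social security · FR NIR, CNI · IT Fiscal code, VAT · ES DNI, NIE · CY TIC, National ID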
List at runtime in JSON: pii-shield scan small.txt --json | jq '.entities[].type' | sort -u (assuming the sample file has at least one of each type you want to confirm).
What's underneath.
Current release: v2.1.0 (production). v1 was Python + Presidio + spaCy and remains downloadable in gregmos/PII-Shield for legacy installs. v2.1.0 is a complete Node.js rewrite — same detection coverage, no Python.
The CLI ships from the same monorepo as the MCP server (nodejs-v2/cli/). Both surfaces share the engine in src/engine/, so version numbers stay aligned.
Things to know up front.
- HITL via browser only (CLI): there's no terminal review UI. SSH / headless setups must use --no-review and rely on default thresholds, then verify with verify.
- NER inference is single-threaded. Don't run multiple pii-shield processes in parallel against the same session; placeholder counters race. Sequential batches inside one invocation are fine.
- PDF write-back is not supported. Anonymizing a .pdf produces a .txt (text only). For redacted PDFs, anonymize → analyse → re-render through your own PDF generator.
- .docx formatting preservation is best-effort. Tracked changes (<w:ins>, <w:del>), comments, and split runs are handled. Exotic structures (SmartArt, embedded Excel objects, math equations) may not be processed.
- Audit log is append-only. Manage retention yourself via OS log rotation.
- No cross-instance locking. Two pii-shield processes on the same machine writing to the same session simultaneously will produce a "last writer wins" mapping. Use distinct PII_SHIELD_DATA_DIR per process if you must run them in parallel.
- Model is English-tuned. GLiNER is multilingual but gliner-pii-base-v1.0 was tuned on English-leaning legal corpora. Other languages work but recall is lower; bump PII_NER_THRESHOLD down to 0.20–0.25 to compensate.
- Browser panel scaling. Above ~30 docs / ~1500 entities in one review, the HITL panel can become sluggish (DOM not virtualised). Split into two sessions if you hit this.
Performance envelope
| Batch | Cold start | Warm |
|---|---|---|
| 1 short doc (5 KB text) | 2–3 min (deps + model on first ever run) | 1–2 s |
| 5 docs avg 10 KB | 4–5 min cold | ~50 s |
| 30 docs avg 10 KB | 6–8 min cold | ~5 min |
| 1 large doc (200 KB text) | 3 min cold | ~1 min |
Disable HITL (--no-review) for unattended batches. Tune PII_NER_THRESHOLD=0.4 for a 20–30% speedup if you can tolerate slightly lower recall.
Common errors and their fixes.
For deeper diagnostics, set PII_DEBUG=true on any command and tail ~/.pii_shield/audit/server.log live.
pii-shield: command not found
npm's global bin/ isn't on PATH. Run npm prefix -g to print the global prefix, then add <prefix>/bin to PATH.
On Windows: typically %AppData%\npm\.
[!] NER model loading... takes 1–2 min
First run installs deps. Look at ~/.pii_shield/audit/ner_init.log for progress. Subsequent runs are instant (the deps directory is cached and reused).
model.onnx missing in doctor output
Run pii-shield install-model. If you already downloaded the zip elsewhere, set PII_SHIELD_MODEL_DOWNLOADS_DIR=/path/to/dir and re-run — the installer will pick it up without re-downloading.
Unsupported model IR version: 9
Stale onnxruntime-node cached. Delete ~/.pii_shield/deps/ — the next anonymize call reinstalls the pinned 1.22.0 triplet.
Cannot find module '../build/Release/sharp-…'
Old build without the sharp shim. Update: npm install -g pii-shield@latest.
HTTP server fails — port in use
The HITL server tries 6789–6799. Close other PII Shield instances; check lsof -i :6789 (Linux/macOS) or netstat -ano | findstr 6789 (Windows).
Browser doesn't open during review
Copy the printed URL manually. On Linux/macOS, set BROWSER=firefox (or whichever) to override the OS default.
deanonymize silently leaves placeholders in the output
The LLM mangled them — common variants: lowercase (<person_1>), added spaces (<PERSON 1>), translated type (<PERSONA_1>).
Tighten the prompt: "Keep tokens VERBATIM. Do not modify case or spacing inside <…>." Or sed-replace the mangled forms before deanonymize:
sed -E 's/<person_([0-9]+)>/<PERSON_\1>/g' summary.txt > fixed.txt
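The sed line only covers lowercased PERSON tags. A slightly broader repair pass could look like the sketch below, which fixes case and spacing for any type you list and leaves everything else alone. The placeholder shape (<TYPE_N> with an optional lowercase variant letter) is inferred from the examples in this document, `normalize_placeholders` is a hypothetical helper, and translated type names such as <PERSONA_1> are left untouched because they cannot be mapped back safely.

```python
import re

# Loose match: letters/spaces/underscores, a separator, digits, optional variant letter.
MANGLED = re.compile(r"<\s*([A-Za-z][A-Za-z_ ]*?)[\s_]+(\d+)([A-Za-z]?)\s*>")

def normalize_placeholders(text, known_types):
    """Rewrite tags like <person_1> or <PERSON 1> to <PERSON_1> for known types only."""
    def fix(m):
        etype = re.sub(r"[\s_]+", "_", m.group(1).strip()).upper()
        if etype not in known_types:
            return m.group(0)  # unknown or translated type: do not guess
        return f"<{etype}_{m.group(2)}{m.group(3).lower()}>"
    return MANGLED.sub(fix, text)

fixed = normalize_placeholders(
    "<person_1> emailed <email address 2> about <Org_1A> and <PERSONA_1>.",
    known_types={"PERSON", "ORG", "EMAIL_ADDRESS"},
)
```

Run the result through deanonymize afterwards; any tag still left in the output is one the repair pass deliberately refused to guess.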
session 'X' not found on import
Wrong session id. List with pii-shield sessions list.
Wrong passphrase on import
Decryption fails loudly with the error "wrong passphrase or corrupted archive"; there is no silent corruption. Recheck the passphrase; the archive is intact.
Mapping disappeared after a week
TTL cleanup. Bump PII_MAPPING_TTL_DAYS=90 for long matters. Already-deleted mappings can only be recovered if you have a .pii-session export.
First-run downloads keep restarting
Network timeouts on slow connections. Run install-model directly — single-shot download with progress bar. Or use install-model.ps1 / .sh in the repo for a curl-based fallback.
Anonymize misses obvious names
NER threshold too high. Try PII_NER_THRESHOLD=0.20. If still missed, the model genuinely doesn't see them — open an issue with a redacted example.
Anonymize too eager (false positives)
Raise the thresholds to trade recall for precision: PII_MIN_SCORE=0.45 PII_NER_THRESHOLD=0.45. Or use the HITL panel to remove specific false positives; corrections are session-local.
Live logs
PII_DEBUG=true pii-shield <command> # prints stack traces on errors
tail -f ~/.pii_shield/audit/server.log # live tool calls + lifecycle events
tail -f ~/.pii_shield/audit/ner_init.log # NER bootstrap detail