PII Shield — Documentation

§ 01 / Quick install

Pick a surface, run three lines.

Both surfaces share ~/.pii_shield/: install one or both. Sessions, mappings, audit logs, and the GLiNER model are interchangeable.

MCP plugin (Claude Desktop)

# 1. Download pii-shield-v2.1.0-<platform>.mcpb from GitHub releases
#    https://github.com/gregmos/PII-Shield/releases

# 2. In Claude Desktop:
#    Settings → Extensions → Advanced Settings → Install extension
#    Pick the .mcpb file

# 3. First call opens an in-chat panel to download the GLiNER model
#    (~634 MB, fetched by your browser)

CLI (any terminal)

# 1. Install the npm package globally
npm install -g pii-shield
pii-shield --version

# 2. Health check (model will report missing — fix in step 3)
pii-shield doctor

# 3. Download the GLiNER model (~634 MB, one-off)
pii-shield install-model

First-run engine deps

On first anonymize or scan, the engine installs onnxruntime-node, @xenova/transformers, and gliner at pinned versions into ~/.pii_shield/deps/installs/<hash>/. Roughly 300 MB, 1–2 min, deterministic. Subsequent runs are instant. Everything (model, deps, mappings, audit logs) survives npm uninstall -g pii-shield.

Parity at install time

Aspect	MCP plugin	CLI
Node version	Bundled (macOS) / host (Win/Linux)	Host (≥22.0.0)
Distribution	`.mcpb` in GitHub releases	`npm install -g pii-shield`
Model location	`~/.pii_shield/models/gliner-pii-base-v1.0/`
Data directory	`~/.pii_shield/` (shared)
Audit log	`~/.pii_shield/audit/mcp_audit.log`
Session format	Identical — exports from one open in the other

§ 02 / Configuration

Environment variables, defaults, and where data lives.

All settings are environment variables — there is no config file. Set them in your shell, in .env, or per-command (KEY=value pii-shield …). Both MCP and CLI read the same vars from the same places.

Detection sensitivity

Variable	Default	Range	Effect
`PII_MIN_SCORE`	0.30	0.0–1.0	Minimum confidence for pattern recognizers (email, SSN, IBAN, …). Lower = more recall, more false positives.
`PII_NER_THRESHOLD`	0.30	0.0–1.0	Minimum confidence for NER detections (PERSON, ORG, LOCATION, NRP). 0.20 catches obscure names; 0.50 catches only obvious ones.

# legal contracts (default tuning)
PII_MIN_SCORE=0.30 PII_NER_THRESHOLD=0.30 pii-shield anonymize contract.pdf

# medical records (higher recall — miss nothing)
PII_MIN_SCORE=0.20 PII_NER_THRESHOLD=0.20 pii-shield anonymize chart.pdf

# news clippings (higher precision — only obvious entities)
PII_NER_THRESHOLD=0.55 pii-shield anonymize article.txt
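
For scripted runs, the same per-command overrides can be passed as a subprocess environment. The sketch below is illustrative only: the helper name build_invocation is hypothetical, while the variable names, their 0.0–1.0 range, and the flags are the ones documented on this page. It builds the command and environment without executing anything.

```python
import os

def build_invocation(path, min_score=0.30, ner_threshold=0.30, base_env=None):
    """Return (argv, env) for one anonymize run with per-call thresholds."""
    for name, value in (("PII_MIN_SCORE", min_score),
                        ("PII_NER_THRESHOLD", ner_threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
    # Copy the parent environment so PATH and friends survive the override.
    env = dict(os.environ if base_env is None else base_env)
    env["PII_MIN_SCORE"] = f"{min_score:.2f}"
    env["PII_NER_THRESHOLD"] = f"{ner_threshold:.2f}"
    argv = ["pii-shield", "anonymize", path, "--no-review", "--json"]
    return argv, env

# Higher-recall tuning, as in the medical-records example above.
argv, env = build_invocation("chart.pdf", min_score=0.20, ner_threshold=0.20)
print(argv, env["PII_MIN_SCORE"], env["PII_NER_THRESHOLD"])
# To execute: subprocess.run(argv, env=env, capture_output=True, text=True)
```

Keeping the thresholds in the child environment rather than the parent shell means two runs with different tuning cannot leak settings into each other.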

Behaviour

Variable	Default	Description
`PII_SKIP_REVIEW`	`false`	Set `true` to never open the HITL panel — useful for CI / scripting.
`PII_MAPPING_TTL_DAYS`	7	Sessions older than N days are deleted on next run. Bump for long matters: `PII_MAPPING_TTL_DAYS=90`.
`PII_WORK_DIR`	(unset)	Default working directory for relative paths.
`PII_DEBUG`	(unset)	Set `true` (or pass `--debug`) for stack traces and full audit trail.
`PII_AUDIT_STDERR`	CLI: `false` · MCP: `true`	Mirror server logs to stderr. CLI sets `false` so output stays clean; `--debug` flips to `true`.
`PII_QUIET`	(unset)	Set by `--quiet`. Disables progress bars and summary writes.
`NO_COLOR`	(unset)	Honoured per no-color.org. Disables ANSI in terminal output.
`FORCE_COLOR`	(unset)	Force ANSI even in non-TTY contexts.
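
Before relying on the default TTL, it can help to preview which sessions are close to cleanup. This is a rough sketch under a stated assumption: it judges age by file modification time, which may not be the rule the engine actually applies. The function name sessions_past_ttl is hypothetical; the file naming (<session_id>.json and review_<session_id>.json) follows the data directory layout on this page.

```python
import time
from pathlib import Path

def sessions_past_ttl(mappings_dir, ttl_days=7, now=None):
    """List session ids whose mapping file is older than ttl_days.

    Assumption: age is measured by modification time; the engine's
    real expiry rule is not documented here.
    """
    now = time.time() if now is None else now
    cutoff = now - ttl_days * 86400
    expired = []
    for f in sorted(Path(mappings_dir).glob("*.json")):
        if f.name.startswith("review_"):
            continue  # HITL state file, not a mapping
        if f.stat().st_mtime < cutoff:
            expired.append(f.stem)
    return expired

# Example (not executed here):
# sessions_past_ttl(Path.home() / ".pii_shield" / "mappings", ttl_days=7)
```

If the list is not empty and the matter is still open, export those sessions or raise PII_MAPPING_TTL_DAYS before the next run.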

Paths (rare overrides)

Variable	Default	Description
`PII_SHIELD_DATA_DIR`	`~/.pii_shield`	Root for all PII Shield state. Override to relocate everything (deps, model, audit, mappings) — useful for shared/networked installs.
`PII_SHIELD_MAPPINGS_DIR`	`<DATA_DIR>/mappings`	Override only the mappings dir — e.g. point to a network share so a team has shared sessions.
`PII_SHIELD_MODELS_DIR`	`<DATA_DIR>/models`	Override only the model directory.
`PII_SHIELD_MODEL_DOWNLOADS_DIR`	(auto)	Where `install-model` looks for an existing `gliner-pii-base-v1.0.zip` (Downloads / OneDrive / Desktop / Documents are scanned by default).
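
The override precedence in the table can be sketched as a small resolver, which is handy when a script needs to locate mappings or the model without calling the CLI. The function name resolve_dirs is hypothetical, and treating an empty variable as unset is an assumption; the variable names and defaults are the documented ones.

```python
import os
from pathlib import Path

def resolve_dirs(env=None):
    """Resolve PII Shield directories from the documented variables and defaults."""
    env = os.environ if env is None else env
    # Assumption: an empty value behaves like an unset variable.
    data = Path(env.get("PII_SHIELD_DATA_DIR") or Path.home() / ".pii_shield")
    return {
        "data": data,
        "mappings": Path(env.get("PII_SHIELD_MAPPINGS_DIR") or data / "mappings"),
        "models": Path(env.get("PII_SHIELD_MODELS_DIR") or data / "models"),
    }

# Relocating the data dir moves mappings and models with it,
# unless one of them is overridden separately.
dirs = resolve_dirs({"PII_SHIELD_DATA_DIR": "/srv/pii",
                     "PII_SHIELD_MAPPINGS_DIR": "/mnt/team/mappings"})
print(dirs["mappings"], dirs["models"])
```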

Data directory layout

~/.pii_shield/                     ← PII_SHIELD_DATA_DIR
├── models/
│   └── gliner-pii-base-v1.0/       ← 634 MB, survives uninstall
│       ├── model.onnx
│       ├── tokenizer.json
│       ├── tokenizer_config.json
│       ├── special_tokens_map.json
│       └── gliner_config.json
├── deps/
│   └── installs/<hash>/            ← onnxruntime-node + transformers + gliner (pinned)
├── mappings/                       ← PII_SHIELD_MAPPINGS_DIR (can be relocated)
│   ├── <session_id>.json           ← placeholder → real PII (the one secret)
│   └── review_<session_id>.json    ← HITL state (entities + overrides)
├── audit/
│   ├── mcp_audit.log               ← every CLI command + arguments + result
│   ├── ner_init.log                ← NER bootstrap trace
│   ├── ner_debug.log               ← per-call NER detail
│   └── server.log                  ← lifecycle + errors
└── cache/
    └── gliner-pii-base-v1.0.zip    ← deleted after install

The mapping files are the one place real PII lives outside your source documents. They have 0o700 permissions on POSIX. Deleting a mapping makes that session impossible to deanonymize.

§ 03 / Detection coverage

National IDs, tax numbers, passports — and the obvious stuff too.

17 country-specific patterns on top of generic entity detection. The fields a "find all emails" regex would never catch.

EU + UK patterns

UK (5): NIN · NHS · Passport · CRN · Driving licence
DE (2): Tax ID · Social security
FR (2): NIR · CNI
IT (2): Fiscal code · VAT
ES (2): DNI · NIE
CY (2): TIC · National ID
EU (2): VAT · Passport

Detection · NER

GLiNER zero-shot
person
org
location
NRP (nationality / religion / politics)

Detection · pattern

all 17 jurisdiction patterns
email · phone · URL · IP
IBAN · credit card · crypto wallet
US SSN · US passport · US driver licence
medical licence · ID doc

Full enumerated list of 33 types in §10 Entity types.

§ 04 / Command reference

One engine. Two surfaces.

MCP exposes 17 tools to Claude Desktop / Claude Code via stdio. CLI exposes 12 commands to your shell. Same engine underneath: every entity, every session format, every audit log line is identical.

17 MCP tools, four groups

Each tool is a single MCP function callable from Claude Desktop. Inputs are validated with Zod; errors return MCP-protocol-shaped { isError: true, content: [...] } rather than throwing.

Anonymize (4)

anonymize_file (core) · anonymize_text · anonymize_docx · anonymize_next_chunk

Review (5)

start_review · apply_review_overrides · get_full_anonymized_text · deanonymize_text · deanonymize_docx

Session (3)

export_session · import_session · get_mapping

Utility (5)

scan_text (preview) · list_entities · find_file · resolve_path · install_model_from_download

Round-trip example

// 1. anonymize a file — original stays on disk; only placeholders cross the wire
→ anonymize_file({ file_path: "~/contracts/nda.docx" })
← { status: "success",
    session_id: "m9q8x4-7b5a91",
    entity_count: 14,
    output_path: ".../nda.anon.txt",
    docx_output_path: ".../nda.anon.docx",
    by_type: { PERSON: 4, ORG: 3, MONEY: 2, ... } }

// 2. Claude reads output_path and drafts a memo using placeholders only

// 3. restore PII into the draft, locally
→ deanonymize_text({ text: draft, session_id: "m9q8x4-7b5a91" })
← { status: "success",
    deanonymized_text: "...John Smith signed the NDA..." }

12 CLI commands, four groups

All commands share two global flags: -q, --quiet (suppresses progress + summary), --debug (full audit trail + stack traces). Session ids accept unique prefixes (git-style): pii-shield review 2026-04 works if exactly one session matches.
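
Git-style prefix matching can be sketched as follows. This is an illustration of the rule stated above (a prefix works only if exactly one session matches), not the CLI's actual code; the function name resolve_session and the error wording are hypothetical.

```python
def resolve_session(prefix, session_ids):
    """Return the single session id starting with prefix, or raise LookupError."""
    matches = [s for s in session_ids if s.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no session matches {prefix!r}")
    raise LookupError(f"{prefix!r} is ambiguous: {', '.join(sorted(matches))}")

known = ["2026-04-29_120000_ab12", "2026-05-02_093000_cd34"]
print(resolve_session("2026-04", known))  # 2026-04-29_120000_ab12
```

A prefix such as "2026" would raise here because both ids match, which is why longer prefixes are safer in scripts.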

Core (4)

anonymize (core) · deanonymize · scan · verify (gate)

Sessions (5)

sessions list · sessions show · sessions find · sessions export · sessions import

Review (1)

review

Diagnostics (2)

doctor · install-model

Detail by command

anonymize <files…> [-o out] [-s session] [--no-review] [--lang en] [--prefix p] [-y] [--json]

Anonymize one or more files. All files in one invocation share one session and one mapping pool — identical entities across files get the same placeholder. Supported inputs: .pdf, .docx, .txt, .md, .csv, .log, .html.

Options

Flag	Default	Description
`-o, --out <dir>`	`<input-dir>/pii_shield_<sid>/`	Single output directory for all files in this run.
`-s, --session <id>`	(new session)	Extend an existing session: pool + placeholders are reused.
`--no-review`	(off — review opens)	Skip HITL — write outputs and exit.
`--lang <code>`	`en`	Language hint for NER.
`--prefix <p>`	(empty)	Prepend to placeholder tags, e.g. `--prefix DOC1_` → `<DOC1_PERSON_1>`.
`-y, --yes`	—	Auto-confirm prompts (model download, double-anon warning).
`--json`	—	Emit structured JSON to stdout. Implies `--no-review`.

Examples

pii-shield anonymize contract.pdf --no-review
pii-shield anonymize matter/*.pdf matter/*.docx --out anonymized/
pii-shield anonymize new.pdf --session 2026-04-29
pii-shield anonymize NDA.docx --json --yes

Output formats

Input	Output
`.docx`	`<name>_anonymized.docx` (formatting preserved) and `<name>_anonymized.txt` (extracted text)
`.pdf`	`<name>_anonymized.txt` (PDF write-back not supported)
`.txt` / `.md` / `.csv`	`<name>_anonymized.<ext>`

Exit codes

0 = OK · 1 = error · 2 = wrong arguments.

See also: MCP anonymize_file · §05 Workflows

deanonymize <file> [-s session] [-o out]

Restore real PII from placeholders. Works on any file containing placeholders that match a known session — even files an LLM authored externally, as long as the placeholders survived intact.

Options

Flag	Default	Description
`-s, --session <id>`	embedded → latest	Session id or unique prefix. If omitted, falls back to the id embedded in .docx metadata, then to the latest session on this machine.
`-o, --out <path>`	`<name>_restored.<ext>`	Explicit output path.

Session resolution priority

1. Explicit --session <id> wins.
2. For .docx only: read pii_shield.session_id from docProps/custom.xml (auto-embedded by anonymize).
3. Latest session on this machine.
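
To check which session a .docx would resolve to before running deanonymize, the embedded property can be read directly, since a .docx is a zip archive. A minimal sketch: the property name and part path come from the list above; the function name embedded_session_id is hypothetical, and it assumes the value is stored as the property's first child element (the usual layout for Office custom properties).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the Office Open XML custom-properties part.
CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"

def embedded_session_id(docx_path, prop_name="pii_shield.session_id"):
    """Return the custom property value from a .docx, or None if absent."""
    with zipfile.ZipFile(docx_path) as z:
        try:
            xml_bytes = z.read("docProps/custom.xml")
        except KeyError:
            return None  # document has no custom properties part
    root = ET.fromstring(xml_bytes)
    for prop in root.iter(CUSTOM_NS + "property"):
        if prop.get("name") == prop_name and len(prop):
            # Assumption: the value lives in the first child (e.g. vt:lpwstr).
            return (prop[0].text or "").strip() or None
    return None

# Example (not executed here):
# embedded_session_id("contract_anonymized.docx")
```

A None result means deanonymize would fall through to the latest session on the machine, so passing --session explicitly is the safer choice for files an LLM or another tool re-saved.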

Examples

pii-shield deanonymize contract_anonymized.docx
pii-shield deanonymize analysis.docx --session 2026-04-29_120000_ab12
pii-shield deanonymize summary.txt --session 2026-04 --out final.txt

See also: MCP deanonymize_text / deanonymize_docx

scan <file> [--json] [--lang en] [--wait-ner 30] [-y]

Detect PII without writing anything. Useful for previewing what would be anonymized, or for piping JSON into custom workflows.

Options

Flag	Default	Description
`--json`	—	Emit machine-readable JSON to stdout.
`--lang <code>`	`en`	Language hint.
`--wait-ner <s>`	30	Max seconds to wait for NER on cold start.
`-y, --yes`	—	Auto-confirm model-download prompt.

JSON shape

{
  "file": "contract.pdf",
  "ext": ".pdf",
  "bytes": 145823,
  "char_count": 8421,
  "entities": [
    {"text": "John Smith", "type": "PERSON", "start": 12, "end": 22, "score": 0.93}
  ]
}

Examples

pii-shield scan contract.pdf
pii-shield scan contract.pdf --json | jq '.entities | length'
pii-shield scan chart.pdf --json > entities.json
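
The JSON shape above is easy to post-process. The sketch below counts entities per type, optionally ignoring low-confidence hits; summarize_scan is a hypothetical helper name, and the field names (entities, type, score) are the ones shown in the JSON shape.

```python
import json
from collections import Counter

def summarize_scan(scan_json, min_score=0.0):
    """Count entities per type in `pii-shield scan --json` output."""
    report = json.loads(scan_json)
    return Counter(e["type"] for e in report["entities"] if e["score"] >= min_score)

sample = """{
  "file": "contract.pdf", "ext": ".pdf", "bytes": 145823, "char_count": 8421,
  "entities": [
    {"text": "John Smith", "type": "PERSON", "start": 12, "end": 22, "score": 0.93}
  ]
}"""
print(summarize_scan(sample))  # Counter({'PERSON': 1})
```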

verify

Options

Flag	Default	Description
`-s, --session <id>`	required	Session id or unique prefix.
`--json`	—	Emit machine-readable JSON to stdout.
`--lang <code>`	`en`	Language hint.
`-y, --yes`	—	Auto-confirm model-download prompt.

review

Options

Flag	Default	Description
`-y, --yes`	—	Auto-confirm any prompts.

doctor

Options

Flag	Default	Description
`--json`	—	Emit JSON to stdout.

install-model

Options

Flag	Default	Description
`--force`	—	Reinstall over a present model.
`-y, --yes`	—	Skip the download confirmation prompt.

sessions list

Output columns

Column	Meaning
`session_id`	Format `YYYY-MM-DD_HHMMSS_xxxx`
`modified`	ISO-8601 timestamp of last write
`docs`	Number of documents anonymized under this session
`entities`	Total placeholders in the mapping pool

sessions export

Options

Flag	Default	Description
`-o, --out <path>`	required	Output file path.
`-p, --passphrase <p>`	(prompts)	Passphrase. If omitted, the CLI prompts (input is masked).

sessions import

Options

Flag	Default	Description
`-p, --passphrase <p>`	(prompts)	Passphrase. Prompts (masked) if omitted.
`--overwrite`	—	Replace an existing session of the same id.

Parity at a glance

MCP tool	CLI command
`anonymize_file`	`pii-shield anonymize`
`anonymize_text`	`pii-shield anonymize -`
`deanonymize_text` / `deanonymize_docx`	`pii-shield deanonymize`
`scan_text`	`pii-shield scan`
`start_review`	`pii-shield review`
`export_session`	`pii-shield sessions export`
`import_session`	`pii-shield sessions import`
`list_entities`	`pii-shield sessions list` / `show`

§ 05 / Workflows

Five flows the engine is built for.

Each works on either surface — sessions exported from one open in the other, with the same mapping format on disk. Pick the one that matches your tooling.

1. Anonymize → external LLM → deanonymize

The headline use case. Send placeholders to any LLM, restore real data on your machine. Works with ChatGPT, Gemini, Claude, local Llama / Mistral, internal corporate gateways — anything that accepts text.

# 1. Anonymize. Note the session id.
pii-shield anonymize NDA.docx --no-review
# → Session: 2026-04-29_120000_ab12

# 2. Take NDA_anonymized.docx to ChatGPT / Gemini / Claude / DeepSeek.
#    Ask: "Summarise this NDA. Keep tokens like <PERSON_1> verbatim."
#    Save the response as summary.docx.

# 3. Restore PII in the LLM output.
pii-shield deanonymize summary.docx --session 2026-04-29_120000_ab12
# → summary_restored.docx with real names back

Prompt template that survives token round-trips

You are reviewing an anonymized contract. Tokens of the form <PERSON_1>, <ORG_3>, <EMAIL_ADDRESS_2> are anonymous placeholders. Keep them VERBATIM in your output — do not rephrase, translate, hyphenate, or split them. Treat them as opaque proper nouns.
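
Even with a strict prompt, it is worth checking that the tokens actually survived before running deanonymize. The sketch below compares the placeholders in the anonymized input with those in the LLM response. The token grammar is an assumption inferred from the examples on this page (uppercase type, numeric index, optional lowercase variant suffix, optional prefix such as DOC1_); placeholder_drift is a hypothetical helper name.

```python
import re

# Assumed grammar: <TYPE_N> with an optional lowercase variant suffix,
# e.g. <PERSON_1>, <ORG_1a>, <DOC1_PERSON_1>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*_\d+[a-z]?>")

def placeholder_drift(anonymized, llm_output):
    """Report tokens the LLM invented and tokens it dropped or mangled."""
    sent = set(PLACEHOLDER.findall(anonymized))
    seen = set(PLACEHOLDER.findall(llm_output))
    return {"unknown": sorted(seen - sent), "unused": sorted(sent - seen)}

source = "<PERSON_1> works at <ORG_1>."
reply = "<PERSON_1> and <PERSON_2> are employed by <org_1>."
print(placeholder_drift(source, reply))
# {'unknown': ['<PERSON_2>'], 'unused': ['<ORG_1>']}
```

An entry under "unknown" suggests a hallucinated token; an entry under "unused" is either a legitimate omission or a mangled token (here <org_1>), so it deserves a quick look.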

2. Multi-file matter (one session, many docs)

In a single anonymize invocation, all files share one session and one mapping pool. The same entity gets the same placeholder in every file, so cross-doc references stay coherent.

pii-shield anonymize matter/contract.pdf matter/sow.pdf matter/nda.pdf matter/*.docx
# → ONE session_id covers all files
# → "Acme Corp" in any file → <ORG_1> in every file
# → "John Smith" → <PERSON_1> consistently
# → bulk-mode review panel opens with one tab per file

Variant suffixes (a, b, …) come from family-based dedup: longest form wins as canonical, shorter / variant spellings get suffixed family numbers. Acme Corp → <ORG_1>, Acme Corp. → <ORG_1>, Acme Corporation → <ORG_1a>, Acme → <ORG_1b>. Every variant maps back verbatim on deanonymize.

3. Compliance gate in CI

Use verify as a fail-loud step in your pipeline. It re-runs the engine on the anonymized output; if any non-placeholder PII is detected, it exits with code 1 and the pipeline stops.

# bash, GitHub Actions / GitLab CI / Jenkins
pii-shield anonymize "$INPUT" --no-review --json > result.json
SID=$(jq -r .session_id result.json)
OUT=$(jq -r '.results[0].output_path' result.json)

pii-shield verify "$OUT" --session "$SID"
# exit 0 → safe to send to external LLM
# exit 1 → leak detected; pipeline stops

4. Cross-machine handoff (encrypted)

A colleague needs to deanonymize the LLM output, but the anonymization happened on your machine. Send them an encrypted archive — PII never crosses the wire in plain form. Pass the passphrase out-of-band (Signal, phone, separate email) — never in the same channel as the file.

# Your machine
pii-shield sessions export 2026-04-29_120000_ab12 \
  --passphrase "<some long passphrase>" \
  --out matter-1234.pii-session

# Their machine
pii-shield sessions import matter-1234.pii-session \
  --passphrase "<the same passphrase>"
pii-shield deanonymize summary.docx --session 2026-04-29_120000_ab12

5. Re-anonymize with HITL corrections

The HITL review lets you remove false positives and add missed entities. After approval, the engine re-anonymizes affected files in place against a fresh shared placeholder state — multi-doc consistency is preserved even after corrections.

pii-shield anonymize contracts/*.pdf
# → browser opens; click some entities to remove,
#   select text to add as missed PII, click Approve
# → CLI prints "Re-anonymized 3 doc(s) with corrections" + final paths

# Reopen review later (e.g., browser closed, or after sessions import)
pii-shield review 2026-04-29_120000_ab12

§ 06 / Audit & compliance

Three logs, all on disk. No telemetry.

Everything PII Shield does is recorded locally. There is no telemetry endpoint to disable. If your DPO asks "what did the tool do with that file at 14:02 UTC," the answer is on your filesystem.

What leaves your machine

If you use either surface alone, nothing leaves your machine. The engine reads files locally, runs NER + pattern matching on CPU only, writes anonymized outputs locally, and (for CLI HITL review) opens a server bound to 127.0.0.1 only.

Outbound networking is limited to setup: install-model downloads from github.com/gregmos/PII-Shield/releases/, and the one-off first-run dependency install (§01) fetches the pinned engine packages. Everything else is local.

Audit log format

2026-04-29T12:00:01.234Z >>> CALL anonymize_text_cli({"file":"/x/contract.pdf","char_count":8421,"session_id":""})
2026-04-29T12:00:14.123Z <<< RESP anonymize_text_cli -> {"anonymized":"... <ORG_1> ..."} ... [127 chars total]

PII is truncated at 200 chars in the log. The full PII is only in mappings/<session>.json (mode 0o700). To prove no PII left the machine, read the audit log: every external operation is recorded. There are no hidden side channels.
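
For audits, the CALL lines can be parsed into structured records. This sketch assumes every call line follows the shape of the sample above (timestamp, ">>> CALL", tool name, JSON arguments in parentheses); parse_call is a hypothetical helper name, and lines that do not fit, including RESP lines, are simply skipped.

```python
import json
import re

# Pattern inferred from the sample line above; treat it as an assumption.
CALL_LINE = re.compile(r"^(?P<ts>\S+) >>> CALL (?P<tool>\w+)\((?P<args>.*)\)\s*$")

def parse_call(line):
    """Parse one '>>> CALL' audit line into a dict, or return None."""
    m = CALL_LINE.match(line)
    if not m:
        return None
    try:
        args = json.loads(m.group("args"))
    except json.JSONDecodeError:
        args = None  # arguments may be truncated in the log
    return {"timestamp": m.group("ts"), "tool": m.group("tool"), "args": args}

line = ('2026-04-29T12:00:01.234Z >>> CALL anonymize_text_cli('
        '{"file":"/x/contract.pdf","char_count":8421,"session_id":""})')
print(parse_call(line)["tool"])  # anonymize_text_cli
```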

Trust boundary

Component	Trust level
Source documents	Trusted
`~/.pii_shield/mappings/<sid>.json`	Highly sensitive — contains real PII
`<file>_anonymized.<ext>` outputs	Safe to send to LLMs / colleagues
`.pii-session` export archives	Encrypted; safe IF passphrase is strong
`~/.pii_shield/audit/*.log`	Mostly safe — file paths only, no real PII
`~/.pii_shield/models/`, `deps/`	Public — same content as the npm package + GitHub release

Audit log rotation

The log grows append-only. There is no built-in rotation in 2.0.x — wipe manually for compliance, or schedule rotation via your OS (logrotate on Linux, Task Scheduler on Windows).

# keep the directory; new lines start fresh
rm ~/.pii_shield/audit/*.log

Encrypted handoff details

Session archives use AES-256-GCM with a key derived via scrypt from the passphrase (parameters: N=16384, r=8, p=1). Archive format is versioned (PII1 magic, version byte 0x01) so future PII Shield versions stay backward-compatible. Wrong passphrase = loud decryption failure (no silent corruption).
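
The key-derivation step can be reproduced with the Python standard library, which helps when reasoning about passphrase strength. This is a sketch of scrypt with the parameters quoted above, not a reader for the archive format: the salt length and where the salt sits in the archive are assumptions, and the AES-256-GCM step is omitted because it needs a third-party library.

```python
import hashlib
import os

def derive_key(passphrase, salt):
    """Derive a 32-byte (AES-256) key with scrypt N=16384, r=8, p=1."""
    return hashlib.scrypt(
        passphrase.encode("utf-8"),
        salt=salt,
        n=16384, r=8, p=1,
        maxmem=64 * 1024 * 1024,  # headroom above the ~16 MiB these parameters need
        dklen=32,
    )

salt = os.urandom(16)  # 16-byte salt is an assumption, not the documented format
key = derive_key("correct horse battery staple", salt)
print(len(key))  # 32
```

Because the key depends only on the passphrase and the salt, a weak passphrase remains the weakest link regardless of the cipher, which is why the handoff steps insist on a long one sent out-of-band.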

§ 07 / HITL review

Add what was missed. Remove false positives.

A pre-flight panel that surfaces every detected entity before placeholders ship to any LLM. MCP renders it inside Claude Desktop as an iframe; CLI renders it in your default browser on a localhost port.

What you can do in the panel

Remove false positives — click any highlighted entity. All occurrences across all documents are removed.
Add missed entities — select text, click a type button. All word-boundary matches are added across all tabs.
See impact live — the right pane shows the anonymized version updating as you make changes.

MCP vs CLI rendering

Aspect	MCP	CLI
Where it renders	Inside Claude Desktop (MCP Apps iframe)	Default browser, localhost
Triggered by	`start_review` tool call	End of `anonymize`, or `review <sid>`
Network binding	n/a (in-process)	`127.0.0.1:6789` (auto 6789–6799 if port busy)
Auth	n/a	32-hex bearer token in URL only, never on disk; origin-checked on every POST
Idle timeout	n/a	30 minutes (browser heartbeats every 30 s)

Bulk mode (multi-doc)

When the session has ≥2 documents, the panel shows tabs at the top — one per document. Each doc keeps independent removed / added state. Approve once per tab; when all tabs are approved, the success overlay appears and the CLI continues. Adding an entity in one tab propagates word-boundary matches across all other tabs (case-insensitive). Removing is per-tab.
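
The propagation rule (word-boundary, case-insensitive matches across tabs) can be approximated as below. This is a sketch of the matching idea only, not the panel's implementation; find_occurrences is a hypothetical name, and using lookarounds instead of \b is a design choice made here so that entities ending in punctuation still match.

```python
import re

def find_occurrences(entity, documents):
    """Map doc name -> list of (start, end) spans where entity is a whole word."""
    # (?<!\w) and (?!\w) reject matches embedded in a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", re.IGNORECASE)
    return {name: [m.span() for m in pattern.finditer(text)]
            for name, text in documents.items()}

docs = {
    "a": "Acme signed. ACME paid.",
    "b": "Acmera is unrelated; acme is not.",
}
print(find_occurrences("Acme", docs))
# {'a': [(0, 4), (13, 17)], 'b': [(21, 25)]}
```

Note that "Acmera" is not matched: the boundary check is what keeps a short added entity from redacting fragments of unrelated words.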

Keyboard shortcuts

Key	Action
`Tab` / `Shift+Tab`	Next / previous entity
`D`	Remove the focused entity
`R`	Restore the focused entity (undo remove)
`A`	Approve the document
`/`	Focus the entity filter
`Esc`	Clear focus / close dropdowns

When NOT to use HITL

Unattended CI / cron — set PII_SKIP_REVIEW=true or pass --no-review.
SSH / headless servers — no browser available. Use --no-review and inspect outputs manually with verify.
Streaming pipelines — same as above; rely on the default 0.30 thresholds.

§ 08 / Cross-machine handoff

Move a session safely between machines.

A colleague needs to deanonymize the LLM output, but the anonymization happened on your machine. The .pii-session archive packs the mapping + review state, encrypted with a passphrase. PII never transits in plain form. Both surfaces produce and consume the same archive format.

Step-by-step

# 1. Source machine — export
pii-shield sessions export 2026-04-29_120000_ab12 \
  --passphrase "<long passphrase>" \
  --out matter-1234.pii-session
# → matter-1234.pii-session (a few KB to a few MB)

# 2. Transfer the file via any channel.
#    Pass the passphrase out-of-band (Signal, phone,
#    separate email) — NEVER in the same channel as the file.

# 3. Destination machine — import
pii-shield sessions import matter-1234.pii-session \
  --passphrase "<same passphrase>"

# 4. Use it as if it were created locally
pii-shield deanonymize summary.docx --session 2026-04-29_120000_ab12
pii-shield review 2026-04-29_120000_ab12   # reopen HITL

Cross-surface handoff

Sessions exported by the CLI open in MCP, and vice versa. The mapping format is version-locked at the file level, not the surface. A typical mixed flow:

1. Anonymize a 30-doc matter in the CLI (faster for batches via --no-review).
2. Export a session archive.
3. A reviewer on a different machine imports it into Claude Desktop (MCP).
4. The reviewer runs the analysis in-chat; deanonymize_docx restores PII into the final memo.

What the archive contains

manifest.json — version + integrity hash
mapping.json — placeholder → real-PII mapping + per-doc metadata (source SHA-256, language, timestamps)
review.json — HITL review state (added / removed entities)

Crypto: AES-256-GCM, key derived via scrypt (N=16384, r=8, p=1). A wrong passphrase fails loudly with the error "wrong passphrase or corrupted archive"; there is no silent corruption.

§ 09 / Python / scripting integration

Every command speaks JSON.

There is no separate Python SDK — the CLI is the integration surface. Every read-side command supports --json; exit codes are stable; subprocess spawn is the integration pattern. For high-throughput pipelines, the same backend exposes MCP via stdio.

JSON outputs by command

Command	--json available	Top-level keys
`anonymize`	✓	`session_id`, `entity_count`, `pool_size`, `ner_ready`, `results[]`
`deanonymize`	✗	(path printed on stdout)
`scan`	✓	`file`, `ext`, `bytes`, `char_count`, `entities[]`
`verify`	✓	`ok`, `leaks_found`, `leaks[]`
`sessions list`	✓	array of `{session_id, modified, docs, entities}`
`sessions show`	✓	`session_id`, `entity_count`, `documents[]`, `review`
`sessions find`	✓	array of `{session_id, doc_id, source_path, anonymized_at}`
`doctor`	✓	`version`, `node_version`, `checks[]`, `ok`

--json on anonymize implies --no-review (no browser inside a non-interactive script). Exit codes: 0 = success, 1 = error / leak detected / 0 hits (for find), 2 = wrong arguments.

Minimal Python wrapper

import subprocess, json, pathlib

class PiiShield:
    def __init__(self, binary: str = "pii-shield"):
        self.bin = binary

    def _run(self, *args, check_zero: bool = True) -> str:
        r = subprocess.run([self.bin, *args], capture_output=True, text=True)
        if check_zero and r.returncode != 0:
            raise RuntimeError(f"pii-shield {' '.join(args)} → {r.returncode}: {r.stderr}")
        return r.stdout

    def doctor(self) -> dict:
        return json.loads(self._run("doctor", "--json", check_zero=False))

    def scan(self, path: str) -> dict:
        return json.loads(self._run("scan", path, "--json"))

    def anonymize(self, *paths: str, session: str | None = None,
                  out: str | None = None) -> dict:
        args = ["anonymize", *paths, "--json", "--yes"]
        if session: args += ["--session", session]
        if out:     args += ["--out", out]
        return json.loads(self._run(*args))

    def deanonymize(self, path: str, session: str, out: str | None = None) -> str:
        args = ["deanonymize", path, "--session", session]
        if out: args += ["--out", out]
        self._run(*args)
        return out or self._infer_restored_path(path)

    def verify(self, path: str, session: str) -> dict:
        return json.loads(self._run("verify", path, "--session", session, "--json",
                                    check_zero=False))

    def session_for_file(self, path: str) -> str | None:
        hits = json.loads(self._run("sessions", "find", path, "--json",
                                    check_zero=False))
        return hits[0]["session_id"] if hits else None

    @staticmethod
    def _infer_restored_path(input_path: str) -> str:
        p = pathlib.Path(input_path)
        return str(p.with_name(p.stem + "_restored" + p.suffix))

Common patterns

Pipeline with verification gate:

shield = PiiShield()  # wrapper class from "Minimal Python wrapper"
result = shield.anonymize(*input_files, out="staging/")
for r in result["results"]:
    v = shield.verify(r["output_path"], session=result["session_id"])
    if not v["ok"]:
        raise RuntimeError(f"Leak in {r['output_path']}: {v['leaks']}")
# Safe to send result["results"] to external LLM

Idempotent re-runs (skip already-anonymized files):

sid = shield.session_for_file("contract.pdf")
if sid is None:
    result = shield.anonymize("contract.pdf")
    sid = result["session_id"]
else:
    print(f"Already anonymized in session {sid}")

Cost control before sending to a paid LLM:

scan = shield.scan(file)
if any(e["type"] == "US_SSN" for e in scan["entities"]):
    raise SystemExit("File contains SSNs; redact manually first.")
result = shield.anonymize(file)

Performance notes

Subprocess spawn cost: ~200–500 ms cold per pii-shield invocation. For batches, prefer one anonymize call with all files over N single-file calls.
NER deps install (~/.pii_shield/deps/): once per machine, ~1–2 min. After that all runs are warm.
Model load (~/.pii_shield/models/): ~15 s on a fresh process. Subprocess pattern means each call pays this cost.
For 10–50 docs in one session: subprocess pattern is fine. Spawn cost is ≪ NER inference time.
For thousands of docs: run PII Shield as a long-lived MCP server (node dist/server.bundle.mjs) and talk JSON-RPC. Persistent process; full tool surface; no per-call boot cost. See the MCP Python SDK (modelcontextprotocol/python-sdk on GitHub).

§ 10 / Detected entity types

33 types, grouped by source.

Authoritative list lives in nodejs-v2/src/engine/entity-types.ts (SUPPORTED_ENTITIES). NER detections come from GLiNER zero-shot; the rest are pattern recognizers.

NER-based (4)

PERSON · ORGANIZATION · LOCATION · NRP

Generic patterns (9)

EMAIL_ADDRESS · PHONE_NUMBER · URL · IP_ADDRESS · ID_DOC · CREDIT_CARD · IBAN_CODE · CRYPTO · MEDICAL_LICENSE

United States (3)

US_SSN · US_PASSPORT · US_DRIVER_LICENSE

United Kingdom (5)

UK_NHS · UK_NIN · UK_PASSPORT · UK_CRN · UK_DRIVING_LICENCE

EU-wide (2)

EU_VAT · EU_PASSPORT

Country-specific (10)

DE_TAX_ID · DE_SOCIAL_SECURITY · FR_NIR · FR_CNI · IT_FISCAL_CODE · IT_VAT · ES_DNI · ES_NIE · CY_TIC · CY_ID_CARD

List at runtime in JSON: pii-shield scan small.txt --json | jq '.entities[].type' | sort -u (assuming the sample file has at least one of each type you want to confirm).

§ 11 / Stack & versioning

What's underneath.

Built with

Node.js 22+ · MCP (Model Context Protocol) · commander.js (CLI) · GLiNER · onnxruntime-node · @xenova/transformers · DOCX (pure JS) · pdf-parse · AES-GCM / scrypt · Zod (validation) · single-port HITL server (CLI) · npm package: pii-shield · MIT

Tested on

Ubuntu · Windows · macOS · Node 22 · Node 24 · 8 test suites · MCP protocol smoke

Versioning

Current release: v2.1.0 (production). v1 was Python + Presidio + spaCy and remains downloadable in gregmos/PII-Shield for legacy installs. v2.1.0 is a complete Node.js rewrite — same detection coverage, no Python.

The CLI ships from the same monorepo as the MCP server (nodejs-v2/cli/). Both surfaces share the engine in src/engine/, so version numbers stay aligned.

§ 12 / Limits & caveats

Things to know up front.

HITL via browser only (CLI): there is no terminal review UI. SSH / headless setups must pass --no-review and rely on default thresholds, then check the output with verify.
NER inference is single-threaded. Don't run multiple pii-shield processes in parallel against the same session — placeholder counters race. Sequential batches inside one invocation are fine.
PDF write-back is not supported. Anonymizing a .pdf produces a .txt (text only). For redacted PDFs, anonymize → analyse → re-render through your own PDF generator.
.docx formatting preservation is best-effort. Tracked changes (<w:ins>, <w:del>), comments, and split runs are handled. Exotic structures (SmartArt, embedded Excel objects, math equations) may not be processed.
Audit log is append-only. Manage retention yourself via OS log rotation.
No cross-instance locking. Two pii-shield processes on the same machine writing to the same session simultaneously will produce a "last writer wins" mapping. Use distinct PII_SHIELD_DATA_DIR per process if you must run them in parallel.
Model is English-tuned. GLiNER is multilingual but gliner-pii-base-v1.0 was tuned on English-leaning legal corpora. Other languages work but recall is lower; lower PII_NER_THRESHOLD to 0.20–0.25 to compensate.
Browser panel scaling. Above ~30 docs / ~1500 entities in one review, the HITL panel can become sluggish (DOM not virtualised). Split into two sessions if you hit this.

Performance envelope

Batch	Cold start	Warm
1 short doc (5 KB text)	2–3 min (deps + model on first ever run)	1–2 s
5 docs avg 10 KB	4–5 min cold	~50 s
30 docs avg 10 KB	6–8 min cold	~5 min
1 large doc (200 KB text)	3 min cold	~1 min

Disable HITL (--no-review) for unattended batches. Tune PII_NER_THRESHOLD=0.4 for a 20–30% speedup if you can tolerate slightly lower recall.

§ 13 / Troubleshooting

Common errors and their fixes.

For deeper diagnostics, set PII_DEBUG=true on any command and tail ~/.pii_shield/audit/server.log live.

pii-shield: command not found

npm's global bin/ isn't on PATH. Run npm prefix -g to find the global install prefix; on Linux/macOS, add its bin/ subdirectory to PATH.

On Windows: typically %AppData%\npm\.

[!] NER model loading... takes 1–2 min

First run installs deps. Look at ~/.pii_shield/audit/ner_init.log for progress. Subsequent runs are instant (the deps directory is cached and reused).

model.onnx missing in doctor output

Run pii-shield install-model. If you already downloaded the zip elsewhere, set PII_SHIELD_MODEL_DOWNLOADS_DIR=/path/to/dir and re-run — the installer will pick it up without re-downloading.

Unsupported model IR version: 9

Stale onnxruntime-node cached. Delete ~/.pii_shield/deps/ — the next anonymize call reinstalls the pinned 1.22.0 triplet.

Cannot find module '../build/Release/sharp-…'

Old build without the sharp shim. Update: npm install -g pii-shield@latest.

HTTP server fails — port in use

The HITL server tries 6789–6799. Close other PII Shield instances; check lsof -i :6789 (Linux/macOS) or netstat -ano | findstr 6789 (Windows).

Browser doesn't open during review

Copy the printed URL manually. On Linux/macOS, set BROWSER=firefox (or whichever) to override the OS default.

deanonymize silently leaves placeholders in the output

The LLM mangled them — common variants: lowercase (<person_1>), added spaces (<PERSON 1>), translated type (<PERSONA_1>).

Tighten the prompt: "Keep tokens VERBATIM. Do not modify case or spacing inside <…>." Or sed-replace the mangled forms before deanonymize:

sed -E 's/<person_([0-9]+)>/<PERSON_\1>/g' summary.txt > fixed.txt
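
The sed one-liner only handles one lowercase type. A slightly more general repair, sketched in Python, restores case and underscores for any type you list. It assumes the token grammar implied by the examples on this page; repair_placeholders and the known_types argument are hypothetical, and translated types (such as <PERSONA_1>) are deliberately left untouched because they cannot be mapped back safely.

```python
import re

# Loose pattern: "<", a type made of letters/underscores/spaces, a separator,
# digits, an optional one-letter variant suffix, ">".
_LOOSE = re.compile(r"<\s*([A-Za-z][A-Za-z_ ]*?)[_ ]+(\d+)([A-Za-z]?)\s*>")

def repair_placeholders(text, known_types):
    """Restore case and underscores in placeholders an LLM mangled."""
    def fix(m):
        type_name = re.sub(r"[_ ]+", "_", m.group(1).strip()).upper()
        if type_name not in known_types:
            return m.group(0)  # not a known type (or translated): leave as-is
        return f"<{type_name}_{m.group(2)}{m.group(3).lower()}>"
    return _LOOSE.sub(fix, text)

types = {"PERSON", "ORG", "EMAIL_ADDRESS"}
print(repair_placeholders("<person_1> met <PERSON 2> via <Email Address_3>", types))
# <PERSON_1> met <PERSON_2> via <EMAIL_ADDRESS_3>
```

Restricting repairs to known types keeps the function from rewriting unrelated angle-bracket text in the draft.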

session 'X' not found on import

Wrong session id. List with pii-shield sessions list.

Wrong passphrase on import

Decryption fails loudly with the error "wrong passphrase or corrupted archive"; there is no silent corruption. Recheck the passphrase; the archive is intact.

Mapping disappeared after a week

TTL cleanup. Bump PII_MAPPING_TTL_DAYS=90 for long matters. Already-deleted mappings can only be recovered if you have a .pii-session export.

First-run downloads keep restarting

Network timeouts on slow connections. Run install-model directly — single-shot download with progress bar. Or use install-model.ps1 / .sh in the repo for a curl-based fallback.

Anonymize misses obvious names

NER threshold too high. Try PII_NER_THRESHOLD=0.20. If still missed, the model genuinely doesn't see them — open an issue with a redacted example.

Anonymize too eager (false positives)

Raise the thresholds to trade recall for precision: PII_MIN_SCORE=0.45 PII_NER_THRESHOLD=0.45. Or use the HITL panel to remove specific false positives; corrections are session-local.

Live logs

PII_DEBUG=true pii-shield <command>     # prints stack traces on errors
tail -f ~/.pii_shield/audit/server.log     # live tool calls + lifecycle events
tail -f ~/.pii_shield/audit/ner_init.log   # NER bootstrap detail