Designing secure data pipelines for PHI claims adjudication

Where you place the PHI tokenization boundary in a streaming NCPDP claims pipeline decides whether protected identifiers ever reach application memory, logs, or a crash dump — and it is an adjudication-correctness decision before it is a security one. Tokenize too late and the raw 302-C2 Cardholder ID rides the same DataFrame the pricing engine touches; tokenize with the wrong construction and either the join key stops being stable across batches (breaking reconciliation) or a low-entropy member ID is trivially brute-forced back out of a plain hash. This page resolves exactly where the boundary sits inside a chunked adjudication run, which keyed construction to use, and how to prove — with a test — that no unprotected identifier survives past ingestion. It is the PHI-isolation contract that the rest of the Security & Compliance Boundaries for Claims Data workflows assume is already in force.

Decision: which tokenization construction, and where it runs

Two choices dominate: where in the stream the identifier is replaced, and what one-way (or reversible) transform replaces it. The placement rule is unconditional: tokenize immediately after schema validation and before any routing, pricing, or logging call reads the row, so the raw value exists only in the freshly parsed chunk and is replaced before any downstream stage sees the row. The construction is a genuine tradeoff between reversibility, brute-force resistance, and key-custody cost.

Construction	Reversible?	Brute-force resistance	Stable join key across batches	Key custody	Added latency / claim	Use when
Plain `SHA-256(id)`	No	Weak — member IDs are low-entropy, defeated by a precomputed table	Yes	None	~1 µs	Never for PHI
`HMAC-SHA-256(key, id)`	No	Strong while key stays secret	Yes	KMS / HSM	~2 µs	Default — join keys you never need to reverse
AES-SIV (deterministic AEAD)	Yes, with key	Strong	Yes	KMS / HSM	~5 µs	An authorized downstream system must recover the raw ID
KMS envelope + external token vault	Yes, via vault	Strongest — token is random, unrelated to the value	Yes (vault-issued)	KMS + vault	network hop	Tokens cross a trust boundary or must be revocable

For most adjudication joins the answer is HMAC-SHA-256 with the key held in a KMS or HSM: it is deterministic (the same 302-C2 yields the same token every run, so reconciliation and dedup still work), it is not reversible from application code, and it defeats the rainbow-table attack that makes a plain hash of a 9-digit member ID worthless. Reach for a keyed reversible construction or a token vault only when a legitimate downstream consumer must recover the original value — and even then the recovery path lives behind the vault, never in the pipeline process. Whichever you pick, the raw 301-C1 Group ID, 310-CA patient first name, and 311-CB patient last name fields are dropped from the stream entirely; only the tokenized 302-C2 survives as a join key.
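To make the brute-force argument concrete, here is a minimal sketch that recovers a member ID from a plain SHA-256 token by enumeration and fails to do so against an HMAC token when the attacker lacks the key. The ID space, the demo key, and the helper names (`plain_token`, `keyed_token`, `brute_force`) are assumptions for illustration, not part of the pipeline.

```python
import hashlib
import hmac
from typing import Callable, Iterable, Optional

# Hypothetical zero-padded ID space (000000..099999), kept small so the
# demo finishes quickly. A real 9-digit space is still enumerable.
ID_SPACE = [f"{n:06d}" for n in range(100_000)]


def plain_token(member_id: str) -> str:
    """Unkeyed hash: anyone can recompute it for every candidate ID."""
    return hashlib.sha256(member_id.encode("utf-8")).hexdigest()


def keyed_token(member_id: str, key: bytes) -> str:
    """HMAC-SHA-256: recomputing a token requires the secret key."""
    return hmac.new(key, member_id.encode("utf-8"), hashlib.sha256).hexdigest()


def brute_force(
    token: str, candidates: Iterable[str], tokenizer: Callable[[str], str]
) -> Optional[str]:
    """Return the candidate whose token matches, or None if none does."""
    for candidate in candidates:
        if tokenizer(candidate) == token:
            return candidate
    return None


secret_key = b"demo-key-held-in-kms"  # placeholder; never hard-code a real key
leaked_plain = plain_token("004217")
leaked_keyed = keyed_token("004217", secret_key)

# The plain hash falls to simple enumeration of the ID space.
recovered_plain = brute_force(leaked_plain, ID_SPACE, plain_token)

# Without the key, the attacker can only guess one; enumeration finds nothing.
recovered_keyed = brute_force(
    leaked_keyed, ID_SPACE, lambda c: keyed_token(c, b"attacker-guess")
)
```

The same loop scales to a full 9-digit space on commodity hardware, which is why the key, not the hash function, carries the protection.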

Step-by-step: a PHI-safe chunked adjudication pipeline

The pipeline streams NCPDP D.0 claims in bounded chunks, casts them against a version-pinned schema, tokenizes PHI at the boundary, resolves reject codes deterministically, and writes only tokenized rows. It never loads the full file into memory and never logs a raw claim byte. The stages below map one-to-one onto the code that follows.

1. Pin an explicit adjudication schema. Reject schema drift at the gate rather than letting a mistyped 442-E7 Quantity Dispensed reach the pricing engine as a silently coerced value. A version-controlled PyArrow schema is the contract; the same discipline governs the taxonomy in Schema Validation & Error Categorization.

2. Load the tokenization key from a managed secret — never a literal in source or an image layer.

3. Tokenize PHI before anything else reads the row. This is the boundary; everything downstream sees only the token.

4. Resolve NCPDP reject codes through a static matrix, defaulting the unmapped case to REVIEW_REQUIRED so no claim leaks past routing unaudited — the same reject-code contract consumed by Fallback Routing Logic Design.

5. Stream in bounded batches so peak heap is a function of batch_size, not file size.

python

import os
import hmac
import hashlib
import logging
import pyarrow as pa
import pyarrow.parquet as pq
from typing import Dict, Optional

# Structured logs only — never emit raw claim bytes or untokenized PHI.
logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

# Step 1: version-pinned NCPDP D.0 adjudication schema (the drift gate).
ADJUDICATION_SCHEMA = pa.schema([
    ("bin", pa.string()),            # 101-A1 BIN — switch routing
    ("pcn", pa.string()),            # 104-A4 PCN — processor identification
    ("group_id", pa.string()),       # 301-C1 Group ID — dropped after routing (PHI-adjacent)
    ("patient_id", pa.string()),     # 302-C2 Cardholder ID — tokenized at the boundary
    ("product_id", pa.string()),     # 407-D7 Product/Service ID (NDC) — safe to retain/log
    ("qty_dispensed", pa.float32()),  # 442-E7 Quantity Dispensed — pricing input
    ("reject_code", pa.string()),    # raw NCPDP reject code, normalized then dropped
    ("routing_flag", pa.string()),   # internal adjudication state
    ("processed_ts", pa.timestamp("us")),
])

# Step 4: deterministic reject-code routing matrix (NCPDP D.0). Unmapped -> REVIEW_REQUIRED.
REJECT_ROUTING: Dict[str, str] = {
    "70": "NOT_COVERED",                # Product/Service Not Covered -> formulary exception queue
    "75": "PRIOR_AUTH_REQUIRED",        # Prior Auth Required -> halt pricing, trigger PA workflow
    "76": "PLAN_LIMITATIONS_EXCEEDED",  # enforce quantity / days-supply override logic
    "90": "PLAN_EXCLUDED",              # log for sponsor reporting
    "93": "DRUG_UTILIZATION_REVIEW",    # DUR conflict
}


def _load_tokenization_key() -> bytes:
    """Step 2: load the HMAC key from a KMS-injected env var.

    In production a secrets manager (AWS Secrets Manager, HashiCorp Vault,
    or a KMS-decrypted mount) sets this, so the key is never a literal in
    source, shell history, or a container image layer. Environment variables
    remain readable by the process owner, so restrict who can inspect the
    running process. This is the custody requirement behind a
    FIPS 140-2/140-3 keyed construction.
    """
    raw = os.environ.get("PHI_TOKENIZATION_KEY")
    if not raw:
        raise RuntimeError("PHI_TOKENIZATION_KEY environment variable is not set")
    return raw.encode("utf-8")


_TOKENIZATION_KEY: bytes = _load_tokenization_key()


def tokenize_phi(value: Optional[str], key: bytes = _TOKENIZATION_KEY) -> Optional[str]:
    """Step 3: deterministic, keyed tokenization of a PHI identifier.

    HMAC-SHA-256 keeps the token stable across batches (so joins and dedup
    still work) while defeating the rainbow-table attack a plain hash of a
    low-entropy 302-C2 Cardholder ID would fall to.
    """
    if not value:
        return None
    return hmac.new(key, value.strip().encode("utf-8"), hashlib.sha256).hexdigest()


def normalize_reject_code(code: Optional[str]) -> str:
    """Step 4: map a raw NCPDP reject code to an internal routing flag."""
    if not code:
        return "CLEAN"
    return REJECT_ROUTING.get(code.strip(), "REVIEW_REQUIRED")


def process_chunk(chunk: pa.Table) -> pa.Table:
    """Enforce schema, tokenize PHI at the boundary, normalize reject codes."""
    try:
        # Step 1 enforced: safe=True (the default) raises on lossy or invalid
        # casts; safe=False would skip those checks and coerce silently.
        chunk = chunk.cast(ADJUDICATION_SCHEMA, safe=True)
    except (pa.ArrowInvalid, ValueError) as e:
        # Log the exception type only: Arrow cast messages can quote the
        # offending value, and that value may be PHI.
        logging.error("Schema validation failed for batch: %s", type(e).__name__)
        raise

    # Step 3: the boundary. 302-C2 is replaced before any routing/pricing reads it.
    # The explicit type keeps the column a string even if every value is null.
    tokenized_ids = pa.array(
        [tokenize_phi(v.as_py()) for v in chunk.column("patient_id")], type=pa.string()
    )
    chunk = chunk.set_column(chunk.column_names.index("patient_id"), "patient_id", tokenized_ids)

    # Step 4: reject codes -> routing flags.
    normalized_flags = pa.array(
        [normalize_reject_code(v.as_py()) for v in chunk.column("reject_code")], type=pa.string()
    )
    chunk = chunk.set_column(chunk.column_names.index("routing_flag"), "routing_flag", normalized_flags)

    # Drop the raw reject code and the PHI-adjacent 301-C1 Group ID post-routing.
    chunk = chunk.drop_columns(["reject_code", "group_id"])
    return chunk


def run_adjudication_pipeline(input_uri: str, output_uri: str, batch_size: int = 250_000) -> None:
    """Step 5: stream NCPDP claims through bounded chunks."""
    reader = pq.ParquetFile(input_uri)
    writer: Optional[pq.ParquetWriter] = None
    total_processed = 0

    try:
        for batch in reader.iter_batches(batch_size=batch_size):
            table = pa.Table.from_batches([batch])
            processed = process_chunk(table)

            if writer is None:
                writer = pq.ParquetWriter(output_uri, processed.schema)

            writer.write_table(processed)
            total_processed += processed.num_rows
            # Log counts and aggregates only, never identifiers.
            logging.info(f"Processed batch: {processed.num_rows} claims | cumulative: {total_processed}")
    finally:
        # Close the writer even when a batch fails, so the file handle is released.
        if writer is not None:
            writer.close()

    if writer is not None:
        logging.info(f"Pipeline complete. Output written to {output_uri}")

A note on pricing math that this stage feeds: once 442-E7 and 407-D7 reach the MAC/AWP engine, every copay, deductible, and rebate figure must be computed with decimal.Decimal, never float — a binary-float cent error compounds across a daily batch into a payer-audit finding. The tokenized identifier flows unchanged into the Tier Mapping & Copay Calculation Logic that performs those calculations.
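The float-versus-Decimal point can be shown in a few lines. This is a minimal sketch, not the pricing engine: the `to_cents` helper and the choice of `ROUND_HALF_UP` are assumptions, since the actual rounding rule comes from the plan design.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount: Decimal) -> Decimal:
    """Round to whole cents; ROUND_HALF_UP is an assumed plan rule."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


# Accumulating ten 10-cent amounts: binary float drifts, Decimal does not.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

# Rounding a half-cent value: the float literal 2.675 is stored slightly
# below 2.675, so it rounds down, while Decimal rounds the exact value.
float_rounded = round(2.675, 2)
decimal_rounded = to_cents(Decimal("2.675"))

print(float_total, decimal_total)      # 0.9999999999999999 1.00
print(float_rounded, decimal_rounded)  # 2.67 2.68
```

Constructing each `Decimal` from a string matters: `Decimal(0.10)` would inherit the binary-float error that the type is meant to avoid.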

Figure: Secure adjudication pipeline stages, ingesting NCPDP claims through PyArrow schema validation, keyed HMAC PHI tokenization, and reject_code routing into a PHI-safe audit log or dead-letter queue.

Verifying the boundary holds

The boundary is only real if a test proves it. Two invariants matter: tokenization is deterministic (or reconciliation silently breaks), and no raw identifier survives in the output. Assert both against a known fixture so the check runs in CI on every schema or key-handling change.

python

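import pyarrow as pa

# Assumes the pipeline code above is importable and that PHI_TOKENIZATION_KEY
# is set to a test-only value before import, for example:
# from adjudication_pipeline import process_chunk, tokenize_phi  # hypothetical module name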


def test_tokenization_is_deterministic():
    # Same 302-C2 -> same token across independent calls, so batch joins hold.
    assert tokenize_phi("MEMBER00042") == tokenize_phi("MEMBER00042")
    assert tokenize_phi("MEMBER00042") != tokenize_phi("MEMBER00043")


def test_boundary_strips_raw_phi():
    raw = pa.table({
        "bin": ["610279"], "pcn": ["MEDDPRIME"], "group_id": ["RX1000"],
        "patient_id": ["MEMBER00042"], "product_id": ["00093-7146-56"],
        "qty_dispensed": [30.0], "reject_code": ["75"],
        "routing_flag": [None], "processed_ts": [pa.scalar(0, pa.timestamp("us")).as_py()],
    })
    out = process_chunk(raw)
    # Raw member id must not appear in any surviving column.
    for name in out.column_names:
        assert "MEMBER00042" not in [str(v) for v in out.column(name).to_pylist()]
    # 301-C1 Group ID column is dropped entirely post-routing.
    assert "group_id" not in out.column_names
    # Reject 75 resolved to the PA flag deterministically.
    assert out.column("routing_flag").to_pylist() == ["PRIOR_AUTH_REQUIRED"]

Because both the tokenizer and the reject matrix are deterministic, a diff in either expected value is a real regression, not flakiness — which is what makes an adjudication run replayable for audit. Pair this with automated drift detection: compare each incoming payload schema against the pinned ADJUDICATION_SCHEMA at the gateway and reject mismatches, exactly as the upstream NCPDP D.0 Message Parsing Strategies contract guarantees a well-formed payload before it arrives.
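The drift check described above might look like the following sketch. It compares field names and type strings only, so its messages never contain claim values. The `schema_drift` function and the name-to-type mapping are hypothetical; with PyArrow, a mapping of this shape could be built from `{f.name: str(f.type) for f in schema}`, and the type strings shown are illustrative.

```python
from typing import Dict, List

# Hypothetical name -> type-string view of the pinned adjudication contract.
PINNED: Dict[str, str] = {
    "bin": "string",
    "pcn": "string",
    "group_id": "string",
    "patient_id": "string",
    "product_id": "string",
    "qty_dispensed": "float",
    "reject_code": "string",
    "routing_flag": "string",
    "processed_ts": "timestamp[us]",
}


def schema_drift(incoming: Dict[str, str], pinned: Dict[str, str]) -> List[str]:
    """List every difference between an incoming schema and the pinned one.

    Reports field names and types only, so the result is safe to log.
    """
    problems: List[str] = []
    for name, expected in pinned.items():
        if name not in incoming:
            problems.append(f"missing field: {name}")
        elif incoming[name] != expected:
            problems.append(
                f"type drift on {name}: expected {expected}, got {incoming[name]}"
            )
    for name in incoming:
        if name not in pinned:
            problems.append(f"unexpected field: {name}")
    return problems


# Example: a payload that lost pcn, retyped qty_dispensed, and added a field.
drifted = {k: v for k, v in PINNED.items() if k != "pcn"}
drifted["qty_dispensed"] = "string"
drifted["patient_name"] = "string"
report = schema_drift(drifted, PINNED)
```

An empty list means the payload matches the contract; anything else is a reason to reject the batch at the gateway before tokenization runs.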

Gotchas and PHI guardrails

Tokenize before you log, not after. The most common leak is a debug logging.info(row) placed for convenience above the tokenization call. Any log statement that can see a row must run after Step 3, and even then should emit counts and reject distributions only.
A plain hash of a member ID is not de-identification. 302-C2 values are low-entropy and enumerable; SHA-256 without a secret key is reversed by a precomputed table in seconds. Always key the construction (HMAC or AEAD) and hold the key in a KMS/HSM.
Never coerce identifiers or NDCs to int. Leading zeros in 302-C2 and in the 407-D7 NDC are significant; an int() anywhere upstream mis-keys both the token and the crosswalk. Keep them str end to end — the same rule the NDC to GPI Crosswalk Automation resolver depends on.
The unmapped reject case must fail closed. Defaulting an unknown code to CLEAN leaks an unadjudicated claim; default to REVIEW_REQUIRED and route it to a dead-letter queue so the audit trail stays complete.
The NDC is safe to log; the claim it rides on is not. 407-D7 and reject codes belong in structured logs; 302-C2, 310-CA, and 311-CB never do. Retention and access-control policy for those logs should follow the HIPAA Security Rule and the NCPDP Telecommunication Standard.
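Rotate the key without breaking joins. An HMAC key rotation changes every token, and because HMAC is one-way an old token cannot be converted to a new one without the raw ID. Historical joins break unless you version tokens by key generation and, during a dual-key window at ingestion where the raw value still exists, record an old-token to new-token crosswalk. Plan the rotation and the re-key window before you need it.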
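One way to version tokens by key generation is to prefix each token with a generation label, as in the sketch below. The keyring, the `v1`/`v2` labels, the `generation:digest` token format, and the helper names are all assumptions for illustration; real keys would come from the KMS, never from source.

```python
import hashlib
import hmac
from typing import Dict, Iterable

# Hypothetical keyring: generation label -> key bytes (placeholders only).
KEYRING: Dict[str, bytes] = {"v1": b"old-demo-key", "v2": b"new-demo-key"}
CURRENT_GENERATION = "v2"


def tokenize_versioned(
    value: str, generation: str = CURRENT_GENERATION, keyring: Dict[str, bytes] = KEYRING
) -> str:
    """HMAC-SHA-256 token prefixed with the key generation that produced it."""
    digest = hmac.new(
        keyring[generation], value.strip().encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"{generation}:{digest}"


def token_generation(token: str) -> str:
    """Read the key generation back off a token without touching any key."""
    return token.split(":", 1)[0]


def rekey_crosswalk(
    raw_ids: Iterable[str], old: str, new: str, keyring: Dict[str, bytes] = KEYRING
) -> Dict[str, str]:
    """Map old-generation tokens to new ones.

    Must run at ingestion, where the raw ID still exists, because an HMAC
    token cannot be reversed to re-tokenize it later.
    """
    return {
        tokenize_versioned(r, old, keyring): tokenize_versioned(r, new, keyring)
        for r in raw_ids
    }


old_token = tokenize_versioned("MEMBER00042", "v1")
new_token = tokenize_versioned("MEMBER00042", "v2")
crosswalk = rekey_crosswalk(["MEMBER00042"], "v1", "v2")
```

The crosswalk contains only tokens, so it can be stored alongside tokenized claims and used to rewrite historical join keys once the window closes.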

Related

Security & Compliance Boundaries for Claims Data: the parent workflow this PHI-isolation contract enforces
Schema Validation & Error Categorization: the drift gate that guarantees a castable payload before tokenization
Fallback Routing Logic Design: where the normalized reject flags are consumed
NDC to GPI Crosswalk Automation: the 407-D7-keyed step that follows routing
Tier Mapping & Copay Calculation Logic: the decimal.Decimal pricing stage the tokenized claim flows into
