How to map legacy NDC codes to GPI standards in Python

The exact implementation decision on this page is how to turn an inconsistent legacy National Drug Code (NDC) — the value carried in 407-D7 Product/Service ID, arriving 9-, 10-, or 11-digit with leading zeros stripped by an upstream switch — into a canonical 14-digit Generic Product Identifier (GPI) without ever guessing. Get the normalization or the fallback order wrong and the claim resolves to the wrong therapeutic class, which silently reassigns the member’s copay tier and becomes a payer-audit finding. This page implements the resolver that sits at the heart of the NDC to GPI Crosswalk Automation cluster: strict 5-4-2 normalization, a memory-bounded lookup, deterministic tiered fallback, and an NCPDP reject code plus audit event for every outcome. Every branch is deterministic — the same NDC against the same crosswalk snapshot must always produce the same GPI and the same reject code, because adjudication has to be replayable.

Why normalization is the correctness bottleneck

The NDC standard defines an 11-digit 5-4-2 structure — 5-digit labeler, 4-digit product, 2-digit package — but the value in 407-D7 is rarely delivered that way. The 10-digit FDA registration format omits a padding zero in one of the three segments, older systems strip leading zeros as if the NDC were an integer, and pharmacy switches hyphenate inconsistently. The GPI, by contrast, is a fixed 14-digit hierarchy: positions 1–10 encode therapeutic class, generic name, strength, and dosage form (this is the clinical grouping key that drives DUR edits and prior-authorization triggers), and positions 11–14 encode manufacturer and package. Because the clinical layer keys on the first 10 GPI digits, a normalization error that shifts a single NDC segment resolves to a different drug family entirely — not a near miss. That is why normalization, not lookup speed, is the correctness bottleneck, and why NDC and GPI values must be handled as str throughout: coercing either to int destroys the leading zeros the standard depends on.

Figure: The normalization funnel that makes lookup safe — three inconsistently delivered 407-D7 values are stripped, left zero-padded, and resolved to one canonical 5-4-2 key, while empty or oversized input diverts to reject 70 rather than fabricating a match. The canonical key resolves to a 14-digit GPI whose first ten positions are the clinical key that drives DUR edits and prior-authorization triggers, so a one-segment normalization slip lands on a different drug family, not a near miss.

Decision matrix: matching strategy and lookup structure

Two engineering choices dominate this resolver: the order of matching tiers, and the in-memory structure that backs the lookup. Each matching tier trades specificity for coverage and must terminate in either a GPI or an NCPDP reject code — never a silent pass.

Tier	Match key	Coverage vs precision	On hit	On miss
0 — Normalize	canonical 11-digit `5-4-2`	prerequisite for all tiers	proceed to Tier 1	reject `70` invalid format, quarantine
1 — Exact	full 11-digit `407-D7` NDC	highest precision, package-exact	`status=resolved`, reject `00`	fall through to Tier 2
2 — Package-agnostic	9-digit labeler-product prefix	broader, drops package precision	`status=degraded`, reject `00`	fall through to Tier 3
3 — Unmapped	—	none	—	reject `75` drug not on file, manual queue

A structural guard wraps every hit: a resolved GPI must itself be a valid 14-digit numeric string, or it returns reject 81 (invalid GPI) rather than propagating a malformed therapeutic class from a dirty crosswalk row. The reject codes here are the same ones the Fallback Routing Logic Design layer consumes downstream, so they must be emitted consistently.

The lookup structure choice is about surviving peak point-of-sale windows without an OOM kill. Commercial crosswalk files routinely exceed 500,000 rows.

Structure	Load approach	Peak heap (500k rows)	Lookup	Notes
`pandas` (object dtype, inferred)	default `read_csv`	~450–600 MB	vectorized, not O(1) per key	type inference triggers RAM spikes; avoid on the hot path
`pandas` (`category`/`string[pyarrow]`)	explicit dtypes	~120–180 MB	still row-scan for prefix	good for offline prep, not per-claim resolution
`polars` → `dict[str, str]`	zero-copy CSV read, materialize once	~90–130 MB	O(1) exact, O(n) prefix	recommended: fast load, cheap dict for the resolver

The pattern below reads the crosswalk once with polars for a fast, schema-pinned load, then materializes a plain dict[str, str] for O(1) exact matches on the hot path.

Step-by-step implementation

Step 1 — Normalize the `407-D7` NDC to a canonical 11-digit key

Strip everything non-numeric, then left zero-pad to recover stripped leading zeros. Reject empty input as format-invalid rather than padding it into a false match.

python

import re
import logging
from typing import Dict, Tuple, Optional
import polars as pl

# PHI-safe logging: never emit 302-C2 Cardholder ID, 310-CA Patient First Name,
# or raw claim bytes. Only taxonomy identifiers, reject codes, and timestamps.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[logging.FileHandler("ndc_gpi_mapping.log")],
)
logger = logging.getLogger(__name__)

# NCPDP reject-code constants (emitted to the fallback-routing layer)
REJ_RESOLVED = "00"      # resolved (paid path)
REJ_INVALID_DRUG = "70"  # Product/Service Not Covered — bad 407-D7 format
REJ_NOT_ON_FILE = "75"   # Drug Not on File — all tiers miss
REJ_INVALID_GPI = "81"   # crosswalk value is not a valid 14-digit GPI

def normalize_ndc(raw_ndc: str) -> Optional[str]:
    """Normalize a 407-D7 Product/Service ID to a canonical unhyphenated
    11-digit (5-4-2) NDC. Returns None for un-normalizable input so the
    caller can raise reject 70 rather than fabricate a match.

    NDC is handled as str throughout — never int — to preserve leading zeros.
    """
    digits = re.sub(r"[^0-9]", "", raw_ndc or "")
    if not digits:
        return None
    padded = digits.zfill(11)  # recover leading zeros stripped upstream
    if len(padded) > 11:
        # Oversized input is a data defect; log the shape, not the value's origin.
        logger.warning("Oversized NDC truncated to trailing 11 digits (len=%d)", len(digits))
        padded = padded[-11:]
    return padded

def _is_valid_gpi(gpi: str) -> bool:
    """A GPI is exactly 14 numeric digits; anything else is a dirty crosswalk row."""
    return bool(gpi) and len(gpi) == 14 and gpi.isdigit()

Step 2 — Load the crosswalk with a pinned schema

Pin the dtypes so polars never infers int (which would drop leading zeros), then materialize the O(1) dict once at worker startup.

python

def load_crosswalk(filepath: str) -> Dict[str, str]:
    """Load the NDC->GPI snapshot with an explicit Utf8 schema and build an
    O(1) exact-match dict. The file is a version-stamped snapshot (e.g.
    crosswalk.version = 'v2026.06'); the resolver must use the snapshot active
    at the dispensing timestamp so adjudications stay replayable under audit."""
    schema = {"NDC_11": pl.Utf8, "GPI_14": pl.Utf8}  # Utf8 preserves leading zeros
    df = pl.read_csv(filepath, schema=schema, ignore_errors=True, encoding="utf-8")
    return {row["NDC_11"]: row["GPI_14"] for row in df.iter_rows(named=True)}

Step 3 — Resolve through the tiered fallback

Each tier returns (gpi, ncpdp_code, resolution_path). Tier 2 pre-indexes the 9-digit prefix so package-agnostic matching stays fast instead of scanning the whole dict per claim.

python

def build_prefix_index(crosswalk: Dict[str, str]) -> Dict[str, str]:
    """Map each 9-digit labeler-product prefix to a representative GPI for
    Tier 2. Built once alongside the crosswalk to keep resolve_gpi O(1)."""
    prefix_index: Dict[str, str] = {}
    for ndc11, gpi in crosswalk.items():
        prefix_index.setdefault(ndc11[:9], gpi)
    return prefix_index

def resolve_gpi(
    ndc: str,
    crosswalk: Dict[str, str],
    prefix_index: Dict[str, str],
) -> Tuple[str, str, str]:
    """Deterministic tiered resolution of a 407-D7 NDC to a GPI.
    Returns (resolved_gpi, ncpdp_reject_code, resolution_path)."""
    norm = normalize_ndc(ndc)
    if not norm:  # Tier 0 failed
        return "", REJ_INVALID_DRUG, "invalid_format"

    # Tier 1 — exact 11-digit match (highest precision)
    gpi = crosswalk.get(norm)
    if gpi is not None:
        if not _is_valid_gpi(gpi):
            return "", REJ_INVALID_GPI, "invalid_gpi"
        return gpi, REJ_RESOLVED, "exact_match"

    # Tier 2 — package-agnostic (9-digit labeler-product prefix)
    base = prefix_index.get(norm[:9])
    if base is not None:
        if not _is_valid_gpi(base):
            return "", REJ_INVALID_GPI, "invalid_gpi"
        # Package precision lost -> flag degraded for downstream UoM routing
        return base, REJ_RESOLVED, "package_agnostic_degraded"

    # Tier 3 — unmapped -> manual review queue
    logger.warning("Unmapped NDC (normalized) -> NCPDP %s", REJ_NOT_ON_FILE)
    return "", REJ_NOT_ON_FILE, "unmapped_manual_queue"

The full branch structure, including the reject codes each terminal emits, is shown below.

Figure: resolve_gpi tiered fallback. Decisions run down the left spine — normalize, then Tier 1 exact, then Tier 2 prefix — while each tier hit clears the _is_valid_gpi guard before resolving. Every path terminates in an NCPDP code: 70 bad format, 00 resolved (exact or degraded), 81 invalid GPI, or 75 unmapped, never a silent pass.

Step 4 — Stream claims without buffering PHI

Read the claim feed line-by-line so a large batch never lands wholesale in RAM, and discard the transport payload immediately after extracting 407-D7.

python

def process_claims_stream(claims_file: str, crosswalk_path: str) -> None:
    """Stream inbound claims and resolve each 407-D7 NDC to a GPI.
    Only taxonomy identifiers and reject codes are logged — never the
    302-C2 Cardholder ID or any 3xx patient field."""
    crosswalk = load_crosswalk(crosswalk_path)
    prefix_index = build_prefix_index(crosswalk)
    logger.info("Loaded %d crosswalk entries", len(crosswalk))

    with open(claims_file, "r", encoding="utf-8") as f:
        header = f.readline().strip().split(",")
        ndc_idx = header.index("NDC") if "NDC" in header else 0
        for line in f:
            fields = line.rstrip("\n").split(",")
            raw_ndc = fields[ndc_idx]  # 407-D7 Product/Service ID
            gpi, rej_code, path = resolve_gpi(raw_ndc, crosswalk, prefix_index)
            # Log resolution outcome only; the raw claim line is not persisted.
            logger.info("GPI=%s | REJ=%s | PATH=%s", gpi, rej_code, path)
            # Emit (gpi, rej_code) to the adjudication router; drop the raw line here.
            del fields, line

Verification and testing pattern

Assert against fixed NDC/GPI fixtures so every tier and every reject code is pinned. The critical cases are leading-zero recovery, the package-agnostic downgrade, the invalid-GPI guard, and the unmapped path — these are exactly the branches that silently mis-tier a claim if they regress.

python

import pytest

CROSSWALK = {
    "00002323230": "27100010100310",  # exact, valid 14-digit GPI
    "00093721001": "SHORTGPI",        # deliberately dirty crosswalk row
}
PREFIX = build_prefix_index(CROSSWALK)

@pytest.mark.parametrize("raw,exp_gpi,exp_rej,exp_path", [
    ("0002-3232-30", "27100010100310", "00", "exact_match"),   # zero-padded 5-4-2
    ("00002323230",  "27100010100310", "00", "exact_match"),   # already canonical
    ("000023232",    "27100010100310", "00", "package_agnostic_degraded"),  # 9-digit prefix
    ("00093721001",  "",               "81", "invalid_gpi"),   # dirty GPI guarded
    ("99999999999",  "",               "75", "unmapped_manual_queue"),
    ("---",          "",               "70", "invalid_format"),
])
def test_resolve_gpi(raw, exp_gpi, exp_rej, exp_path):
    gpi, rej, path = resolve_gpi(raw, CROSSWALK, PREFIX)
    assert (gpi, rej, path) == (exp_gpi, exp_rej, exp_path)

def test_normalization_preserves_leading_zeros():
    # int coercion would collapse this to "2323230" and mis-key the lookup
    assert normalize_ndc("0002-3232-30") == "00002323230"

Run the same fixture set in CI on every crosswalk-snapshot bump. Because resolution is deterministic, a diff in any expected tuple is either an intended rule change or a regression — there is no flaky third state.

Gotchas and PHI guardrails

Never coerce NDC or GPI to int. A single int(ndc) upstream drops leading zeros and turns an exact match into an unmapped 75. Keep both as str end to end, including CSV load (pl.Utf8).
A package-agnostic hit is degraded, not clean. Tier 2 drops the last two package digits, so it is safe for tier placement but must be flagged so unit-of-measure and specialty routing downstream can decide whether the missing precision matters. Do not collapse it into exact_match.
Guard the crosswalk value, not just the key. A malformed GPI in the source file is a data-quality defect; returning reject 81 isolates it instead of pricing a claim against a corrupt therapeutic class.
Pin the crosswalk snapshot version. Resolve against the snapshot active at the dispensing timestamp, the same versioned-snapshot discipline required for Tier Mapping & Copay Calculation Logic; a mutable table makes an audited adjudication unreplayable.
NDC and GPI are not PHI, but the claim they ride on is. The 407-D7 NDC is safe to log; 302-C2 Cardholder ID and the 310-CA / 311-CB patient name fields are not. The logging config above records only taxonomy identifiers, reject codes, and timestamps, and the stream loop discards the raw line after routing — consistent with Designing secure data pipelines for PHI claims adjudication. For NDC formatting authority, reference the FDA National Drug Code Directory; for reject-code semantics, the NCPDP Telecommunication Standard Implementation Guide.

NDC to GPI Crosswalk Automation — the parent workflow this resolver plugs into
Fallback Routing Logic Design — where the 70/75/81 reject codes are consumed
Tier Mapping & Copay Calculation Logic — the GPI-keyed step that follows resolution
Schema Validation & Error Categorization — upstream validation that guarantees a parseable 407-D7
Designing secure data pipelines for PHI claims adjudication — the PHI-boundary rules the logging here obeys

← Back to NDC to GPI Crosswalk Automation

How to map legacy NDC codes to GPI standards in Python

Why normalization is the correctness bottleneck #

Decision matrix: matching strategy and lookup structure #

Step-by-step implementation #

Step 1 — Normalize the 407-D7 NDC to a canonical 11-digit key #

Step 2 — Load the crosswalk with a pinned schema #

Step 3 — Resolve through the tiered fallback #

Step 4 — Stream claims without buffering PHI #

Verification and testing pattern #

Gotchas and PHI guardrails #

Related #

Why normalization is the correctness bottleneck

Decision matrix: matching strategy and lookup structure

Step-by-step implementation

Step 1 — Normalize the `407-D7` NDC to a canonical 11-digit key

Step 2 — Load the crosswalk with a pinned schema

Step 3 — Resolve through the tiered fallback

Step 4 — Stream claims without buffering PHI

Verification and testing pattern

Gotchas and PHI guardrails

Related