How ML-Based Defect Pattern Classification Works for Wafer Yield Analytics

March 12, 2026 9 min read By Jonas Falk

Defect pattern classification is a computer vision problem. That sounds obvious when you say it out loud, but the semiconductor industry spent a long time treating it as a rules-based lookup problem instead — and the gap between those two approaches explains most of the frustration yield engineers have with automated classification systems today.

This post covers how Wafertune's classification engine works: the model architecture, what "spatial features" actually means in this context, why training data composition matters more than model size, and where the approach still has limits.

The Input: What a Wafer Map Actually Is

A wafer map at the classification stage is a 2D grid of pass/fail (or multi-bin) test results from electrical wafer sort (EWS). Each cell in the grid represents one die. The spatial arrangement of failing dies — not just the count — carries the signal about root cause.

A ring of failing dies near the edge of the wafer means something different from a diagonal scratch of failing dies, which means something different from a cluster of fails near the center. That's the whole problem: the same defect density can correspond to completely different process failure modes depending on where the fails are and what shape they form.

The input to Wafertune's classifier is either a structured die map (CSV or STDF-derived grid) or a rasterized PNG of the wafer map. Both paths normalize to the same internal representation: a 64×64 or 128×128 grid of per-cell bin values, with cells outside the circular wafer boundary zeroed out to match the wafer geometry. The spatial arrangement is what the model learns to read.
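As a rough illustration of that normalization step, the sketch below resamples a die grid to a fixed size with nearest-neighbor lookup and zeroes every cell outside the inscribed circle. The function name `normalize_die_map`, the nearest-neighbor resampling, and the use of 0 as the off-wafer value are assumptions made for illustration; the production preprocessing is not described at this level of detail.

```python
def normalize_die_map(die_map, size=64, off_wafer=0):
    """Resample a rectangular die grid to size x size and mask outside the wafer circle.

    die_map: list of rows, each a list of integer bin values (assumed layout).
    off_wafer: value written to cells outside the circle (assumed to be 0).
    """
    rows, cols = len(die_map), len(die_map[0])
    center = (size - 1) / 2.0
    radius = size / 2.0
    out = []
    for r in range(size):
        src_r = min(rows - 1, r * rows // size)  # nearest-neighbor source row
        row = []
        for c in range(size):
            src_c = min(cols - 1, c * cols // size)  # nearest-neighbor source column
            inside = (r - center) ** 2 + (c - center) ** 2 <= radius ** 2
            row.append(die_map[src_r][src_c] if inside else off_wafer)
        out.append(row)
    return out


# Toy 4x4 map where 1 = pass and 2 = fail
toy = [
    [1, 1, 1, 1],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
    [1, 1, 1, 1],
]
grid = normalize_die_map(toy, size=8)
for row in grid:
    print(row)
```

If bin 0 is a real test bin in your data, a separate sentinel value or an explicit mask channel would be the safer choice.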

Architecture: Why Convolutions Alone Aren't Enough

Early wafer map classifiers used plain CNNs, often ResNet-18 or ResNet-34 variants pretrained on ImageNet and fine-tuned on labeled wafer maps. That worked well for canonical patterns like edge exclusion rings and uniform scratch lines. It performed poorly on anything more spatially ambiguous: multi-cluster patterns, partial rings, or the kind of diffuse low-density contamination you see from wet-etch tools running consumables near end of life.

The issue with pure convolution is locality. A 3×3 or 5×5 kernel captures what's happening in a small neighborhood, and you build up global structure through many stacked layers. But wafer-map defects are often defined by their global spatial relationship — a RING_EDGE_EXCL pattern is only a ring if you can see the full annular geometry, and that requires attending to the entire wafer simultaneously.
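A back-of-the-envelope calculation makes the locality point concrete. For a stack of stride-1 convolutions with kernel size k, each layer widens the receptive field by k - 1 cells. The helper below (an illustrative function, not part of any library) counts how many such layers one output cell needs before it can see the full width of a 128-cell map. Real backbones use strides and pooling to grow the field faster, so treat this as an upper bound on depth, not a description of ResNet.

```python
def layers_to_cover(width, kernel=3):
    """Number of stacked stride-1 convolutions needed for a receptive field >= width."""
    receptive_field, layers = 1, 0
    while receptive_field < width:
        receptive_field += kernel - 1  # each layer adds (kernel - 1) cells
        layers += 1
    return layers


depth_3x3 = layers_to_cover(128, kernel=3)
depth_5x5 = layers_to_cover(128, kernel=5)
print(depth_3x3, depth_5x5)
```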

Wafertune's classifier uses a hybrid architecture: a convolutional backbone for local feature extraction, feeding into a spatial transformer block that learns to attend to global arrangements. Concretely:

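# Simplified architecture sketch (Wafertune internal).
# ResNetFeatureExtractor and SpatialAttentionBlock are internal modules not shown here.
import torch.nn as nn


class WaferClassifier(nn.Module):
    def __init__(self, num_classes=180, backbone='resnet18'):
        super().__init__()
        # Convolutional backbone: extracts local texture features
        self.backbone = ResNetFeatureExtractor(backbone)
        # Backbone output: [B, 512, H/16, W/16]
        self.spatial_attn = SpatialAttentionBlock(
            in_channels=512,
            num_heads=8,
            grid_size=(8, 8)   # coarse spatial grid (e.g. a 128x128 input at H/16)
        )
        # Classification head: pool over the grid, then a two-layer MLP
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        features = self.backbone(x)              # local features
        features = self.spatial_attn(features)   # global spatial relationships
        return self.classifier(features)         # raw logits, one per class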

The spatial attention block is loosely inspired by vision transformer (ViT) attention, but applied over the convolutional feature map rather than raw patches. This preserves the inductive bias of convolution (local texture matters — a scratch looks different from a cluster under a 5×5 lens) while adding the global-relational awareness that transformers provide.
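One way such a block could look, as a minimal sketch: flatten the feature map into a sequence of grid positions, add a positional embedding, run multi-head self-attention so every cell can attend to every other cell, and reshape back. This is an assumption about the general shape of the idea, not Wafertune's actual SpatialAttentionBlock. The learned positional embedding, the pre-attention LayerNorm, and the residual connection are illustrative choices.

```python
import torch
import torch.nn as nn


class SpatialAttentionBlock(nn.Module):
    """Illustrative self-attention over a conv feature map (not the production block)."""

    def __init__(self, in_channels=512, num_heads=8, grid_size=(8, 8)):
        super().__init__()
        self.grid_size = grid_size
        num_positions = grid_size[0] * grid_size[1]
        # Learned positional embedding, one vector per grid cell (assumed design)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, in_channels))
        self.norm = nn.LayerNorm(in_channels)
        self.attn = nn.MultiheadAttention(in_channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        assert (h, w) == self.grid_size, "feature map must match grid_size"
        tokens = x.flatten(2).transpose(1, 2)             # [B, H*W, C]
        tokens = tokens + self.pos_embed
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)   # each cell attends to all cells
        tokens = tokens + attended                        # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to [B, C, H, W]


block = SpatialAttentionBlock(in_channels=512, num_heads=8, grid_size=(8, 8))
out = block(torch.randn(2, 512, 8, 8))
print(out.shape)
```

Returning a tensor with the same [B, C, H, W] shape as the input keeps the block compatible with a pooling-based classification head.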

On the public WM-811K dataset (a Kaggle-sourced collection of labeled wafer maps from mixed-node fabs), this hybrid architecture achieves a macro-averaged F1 of around 0.91 on the 9-class version of the dataset (eight defect patterns plus a no-pattern class). We're not saying that WM-811K performance is the number that matters for production: WM-811K is heavily weighted toward logic and memory patterns. But it confirms that the architecture handles canonical patterns well before we shift to specialty-node pretraining.
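Macro averaging is the relevant choice here because it weights every class equally, so a rare pattern counts as much as a common one. The sketch below computes per-class F1 and the unweighted mean in plain Python. The function name and the toy labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: the rare class is missed entirely
y_true = ["RING_EDGE_EXCL"] * 8 + ["SCRATCH_LINEAR"] * 2
y_pred = ["RING_EDGE_EXCL"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case accuracy looks healthy while macro F1 drops sharply, which is the behavior you want from a metric when rare classes carry the engineering value.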

Why Specialty-Node Pretraining Changes the Results

The pretraining question is where Wafertune's approach diverges most visibly from generic wafer map classifiers.

Standard pretraining pipelines use whatever labeled data is available — which in practice means mostly 300mm logic and memory wafer maps, because those are the fabs with the largest data teams and the most mature EWS infrastructure. If you fine-tune such a model on a 200mm BCD process fab's data, it will converge, but it will miss defect classes that simply don't appear in the pretraining distribution.

BCD (Bipolar-CMOS-DMOS) processes run higher-voltage implants and have thicker oxide stacks than logic nodes. The dominant defect signatures are different: HV-MOS gate oxide pinholes generate localized cluster patterns that look superficially like random particulate contamination but have a specific size distribution and preferred location relative to the stepper field. A model pretrained only on logic will collapse these into CLUSTER_RANDOM because it has never seen the feature that distinguishes them.

Wafertune's pretraining corpus includes synthetic wafer maps generated from process physics simulations for analog, BCD, LDMOS, BiCMOS, and MEMS node families, combined with semi-supervised learning on unlabeled real wafer maps sourced from our internal validation partners. The semi-supervised loop uses a contrastive learning objective: maps with similar spatial structure are pulled toward the same representation, regardless of whether they share a process node. This builds representations that transfer across node families.
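To make "pulled toward the same representation" concrete, here is a minimal sketch of one common contrastive objective, InfoNCE, for a single anchor. The specific loss used in the pretraining loop is not named above, so InfoNCE is an assumption, as are the toy embeddings and the temperature of 0.1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: small when the anchor is much closer
    to its positive than to any negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]  # -log softmax of the positive pair


# Toy embeddings: two ring-like maps from different node families, one scratch-like map
ring_a = [1.0, 0.1, 0.0]
ring_b = [0.9, 0.2, 0.1]
scratch = [0.0, 0.1, 1.0]

good_pairing = info_nce(ring_a, ring_b, [scratch])  # similar maps paired: low loss
bad_pairing = info_nce(ring_a, scratch, [ring_b])   # dissimilar maps paired: high loss
print(good_pairing, bad_pairing)
```

Minimizing this loss over many pairs pushes maps with similar spatial structure together in embedding space, which is the property that lets representations transfer across node families.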

In our internal validation on synthetic wafer-map datasets constructed to mirror specialty-node defect distributions, adding specialty-node pretraining improves F1 on rare classes (HV_OXIDE_CLUSTER, BIPOLAR_PIPE, LDMOS_BODY_RING) by 15–25 percentage points compared to a baseline ResNet fine-tuned from ImageNet. The gains on canonical classes like SCRATCH_LINEAR and RING_EDGE_EXCL are smaller (3–7 points), which makes sense — those are the patterns well-represented in any training corpus.

Confidence Scores and Calibration

The classifier outputs a softmax probability distribution over all 180+ pattern classes. The argmax of that distribution is the predicted class, but the raw softmax value is not a reliable confidence score without calibration.

Neural networks are notoriously overconfident: a model might output 0.97 probability for SCRATCH_LINEAR when the actual pattern is ambiguous between a scratch and an edge-handling mark. Temperature scaling, which divides the logits by a single learned scalar before softmax, is the standard fix, and it works well for single-label predictions. For multi-label patterns (wafer maps that contain more than one defect type), we use a separate calibration head trained on held-out data.
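A small worked example of temperature scaling, in plain Python. The logits and the temperature of 2.0 are made up; in practice the temperature is typically fit on a held-out set by minimizing negative log-likelihood.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T = 1 leaves them unchanged)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.5, 0.5]  # hypothetical logits for three classes
raw = softmax(logits)                          # uncalibrated
calibrated = softmax(logits, temperature=2.0)  # T > 1 softens the distribution

print([round(p, 3) for p in raw])
print([round(p, 3) for p in calibrated])
```

Dividing by a positive temperature never changes which class ranks first, so calibration adjusts the reported confidence without changing the predicted class.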

The API response includes a confidence field per predicted class, a calibration_source tag (temperature_scaled vs multilabel_calibrated), and a review_recommended boolean that triggers when the top-1 confidence falls below a threshold or when the top-1 and top-2 confidences are within 0.08 of each other. That last heuristic catches ambiguous patterns, where two experienced reviewers could reasonably disagree, rather than clear model errors.
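The review rule can be sketched in a few lines. The 0.08 margin comes from the description above; the function signature, the `min_confidence` default of 0.50, and the choice of a strict comparison for the margin are assumptions, since the actual threshold is not stated.

```python
def review_recommended(confidences, min_confidence=0.50, min_margin=0.08):
    """Flag a prediction for human review.

    confidences: dict mapping class name to calibrated confidence.
    min_confidence: hypothetical top-1 threshold (not given in the text).
    min_margin: top-1 vs top-2 gap below which the call is treated as ambiguous.
    """
    ranked = sorted(confidences.values(), reverse=True)
    top1 = ranked[0]
    top2 = ranked[1] if len(ranked) > 1 else 0.0
    return top1 < min_confidence or (top1 - top2) < min_margin


clear = {"SCRATCH_LINEAR": 0.91, "RING_EDGE_EXCL": 0.05, "CLUSTER_RANDOM": 0.04}
low = {"SCRATCH_LINEAR": 0.45, "RING_EDGE_EXCL": 0.30, "CLUSTER_RANDOM": 0.25}
close = {"SCRATCH_LINEAR": 0.52, "RING_EDGE_EXCL": 0.46, "CLUSTER_RANDOM": 0.02}

for name, conf in [("clear", clear), ("low", low), ("close", close)]:
    print(name, review_recommended(conf))
```

The third case passes the top-1 threshold but is still flagged, because the two leading classes sit only 0.06 apart.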

We're not saying the confidence scores are perfectly calibrated for every process node — they're not. Calibration quality degrades on node families underrepresented in the pretraining corpus, and we track expected calibration error (ECE) per node family as part of our model versioning process. If your node family shows high ECE, that's a signal to consider fine-tuning the calibration layer on your labeled data.
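For readers who want to measure this on their own labeled data, ECE can be computed as follows: bin predictions by confidence, compare average confidence to observed accuracy in each bin, and take the sample-weighted mean of the gaps. This is a generic sketch with equal-width bins; the bin count of 10 and the toy inputs are assumptions, not Wafertune's tracking configuration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins.

    confidences: top-1 confidence per prediction, each in [0, 1].
    correct: matching booleans, True where the top-1 prediction was right.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(num_bins - 1, int(conf * num_bins))  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Four predictions at 0.95 confidence, only two of them correct
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# Four predictions at 0.75 confidence, three of them correct
well_calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, well_calibrated)
```

Running this per node family, as opposed to over the pooled data, is what exposes the underrepresented families where calibration has drifted.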

What the Model Doesn't Do Well

To be honest about limitations: the current classifier struggles with three categories of wafer map.

First, thin-film deposition defects in MEMS release etch sequences. The pattern of void formation after HF vapor etch is spatially structured, but not in a way that maps cleanly onto the ring/cluster/linear taxonomy — it looks like a stochastic spatial texture that depends on local stress gradients in the device layer. The model classifies these as UNKNOWN_STRUCTURED more often than we'd like. This is an active research area for us, and it's the reason we're building a custom MEMS defect class extension.

Second, compound patterns at high density. When a wafer has both a CMP edge ring and a lithography repeater pattern active simultaneously, the classification task becomes multi-label, and multi-label wafer maps are harder. On our synthetic validation dataset, macro F1 on compound patterns is around 0.74, versus 0.91 for single-pattern maps.

Third, very small die geometries on dense wafer grids (≥300 die across). The spatial resolution of the normalized 128×128 input grid means that patterns spanning fewer than 5 die in any dimension can be lost to quantization. This affects MEMS foundry wafers with high die-per-wafer counts more than standard analog wafers.
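The arithmetic behind that limit is simple. Assuming the die grid is resampled linearly onto the 128×128 input (the exact resampling method is not specified here), a wafer with 300 die across packs a little over two die into each grid cell, so a 5-die feature shrinks to about two cells. The helper name below is illustrative.

```python
def cells_spanned(pattern_die, die_across, grid_size=128):
    """Approximate number of grid cells a pattern covers along one axis."""
    die_per_cell = die_across / grid_size
    return pattern_die / die_per_cell


dense = cells_spanned(5, die_across=300)   # dense MEMS-style wafer
sparse = cells_spanned(5, die_across=100)  # coarser die grid
print(round(dense, 2), round(sparse, 2))
```

A feature only two cells wide is close to the scale of single-die noise, which is why small patterns on dense wafers are the ones most at risk.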

These limits are knowable and bounded. If your workflow involves any of these three categories at high frequency, the pilot conversation should start there so we're honest about fit before you build a pipeline dependency.

From Classification to Action

A classification result is only useful if it routes to a decision. The API response includes a process_origin_hint field that maps each detected pattern class to a plausible process module — not as a definitive root-cause statement, but as a starting hypothesis for the yield engineer. RING_EDGE_EXCL maps to ["CMP", "bevel_etch"]. LITHO_REPEAT maps to ["photolithography", "reticle_inspection"]. These hints are built from our taxonomy definitions, not from your process data — they're a structured vocabulary, not a diagnosis.
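Conceptually, the hint is a lookup from pattern class to candidate process modules. The sketch below uses only the two mappings quoted above. The table name, the `attach_hints` helper, and the shape of the prediction dicts are assumptions for illustration, not the documented response schema.

```python
# Illustrative subset of the taxonomy; only these two mappings appear in the text.
PROCESS_ORIGIN_HINTS = {
    "RING_EDGE_EXCL": ["CMP", "bevel_etch"],
    "LITHO_REPEAT": ["photolithography", "reticle_inspection"],
}


def attach_hints(predictions):
    """Add a process_origin_hint list to each predicted class.

    predictions: list of dicts with a "class" key (assumed shape).
    Classes missing from the table get an empty list, never a guess.
    """
    return [
        {**p, "process_origin_hint": PROCESS_ORIGIN_HINTS.get(p["class"], [])}
        for p in predictions
    ]


result = attach_hints([
    {"class": "RING_EDGE_EXCL", "confidence": 0.88},
    {"class": "UNKNOWN_STRUCTURED", "confidence": 0.41},
])
for item in result:
    print(item)
```

Because the table is keyed on taxonomy classes and not on fab data, the same hint comes back for a given class regardless of whose wafers were classified.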

The full Pattern Library documents the process origin mapping for each of the 180+ classes. If you're evaluating whether Wafertune's taxonomy covers your specific failure modes, that's the right place to check before running a pilot. And if you need a custom class that doesn't exist in the library yet, that's a conversation we're prepared to have — our fine-tuning pipeline for custom classes is documented in the API Reference under the model management endpoints.