Managing False Positives in Automated Defect Classification

December 10, 2025 · 6 min read · By Jonas Falk

Figure: Confidence score threshold visualization for defect classification false-positive management

When you deploy an automated defect classifier in a production yield monitoring pipeline, you're implicitly making a bet about the cost tradeoff. False negatives — missed yield killers that get through undetected — have one cost profile. False positives — non-events flagged as alerts, routing yield engineers and process owners to waste time investigating nothing — have a different, often underestimated cost profile.

In our experience observing pilot deployments, false positives are what erode confidence in automated systems. A yield engineer who gets three spurious alerts about RING_EDGE_EXCL on wafers that turn out to be normal CMP variation will start ignoring the alert queue within a week. That erosion of trust is harder to recover from than a missed detection.

This post is about how to tune your confidence thresholds, identify pattern classes with high false-positive rates, and design a human-review workflow that keeps the system useful without overwhelming the review queue.

Understanding What Confidence Scores Mean

The Wafertune classifier outputs a confidence score between 0 and 1 for each detected pattern class. This score is a calibrated probability — after temperature scaling, it should be approximately correct as a frequency estimate. A score of 0.80 means that, across the calibration distribution, the classifier is correct about 80% of the time when it returns that score.
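One way to check that frequency interpretation on your own labeled data is a reliability table: bin predictions by confidence and compare mean confidence with observed accuracy in each bin. The sketch below is a generic illustration, not Wafertune tooling; the function name and input shape are assumptions.

```python
from collections import defaultdict

def reliability_table(scores, correct, n_bins=10):
    """Compare mean confidence with observed accuracy per confidence bin.

    scores:  iterable of confidence values in [0, 1]
    correct: iterable of booleans, True when the classification was right
    Returns rows of (bin_low, bin_high, count, mean_confidence, accuracy).
    """
    bins = defaultdict(list)
    for score, is_correct in zip(scores, correct):
        # Equal-width bins; a score of exactly 1.0 goes in the top bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((score, is_correct))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_conf = sum(s for s, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        rows.append((idx / n_bins, (idx + 1) / n_bins, len(items), mean_conf, accuracy))
    return rows
```

For a well-calibrated class, mean confidence and accuracy should track each other in every populated bin. A bin where accuracy sits well below mean confidence suggests over-confidence for that score range.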

"Approximately correct" has limits. Calibration quality degrades on pattern classes that are sparsely represented in the pretraining corpus, and on node families that are structurally different from the training distribution. If your fab runs a process node that Wafertune hasn't been calibrated on, the raw confidence scores may be systematically over- or under-estimated for certain classes. The API response includes a calibration_source field that tells you whether calibration was applied at the class level (class_calibrated), at the temperature-scaling level (temperature_scaled), or not at all (uncalibrated) — treat uncalibrated outputs with appropriate skepticism.
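A minimal sketch of acting on calibration_source follows. The three field values come from the description above; the response being a plain dict, the handling labels, and the policy of sending uncalibrated or unrecognized values to human review are our assumptions for illustration.

```python
# Hypothetical handling policy keyed on the calibration_source value.
HANDLING = {
    "class_calibrated": "use_score",
    "temperature_scaled": "use_score_with_care",
    "uncalibrated": "human_review",
}

def handling_for(response: dict) -> str:
    """Pick a handling mode for one classification response.

    A missing or unrecognized calibration_source is treated like
    'uncalibrated' so that it never silently drives automation.
    """
    return HANDLING.get(response.get("calibration_source"), "human_review")
```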

We're not saying confidence scores should be ignored — they're useful and the calibration is genuine for well-covered classes. We're saying that a confidence score is an input to your threshold decision, not a substitute for it.

Setting Thresholds: Start With Class-Level Analysis

A single global threshold (e.g., "alert on all classifications with confidence ≥ 0.75") is a reasonable starting point but not an end state. Different pattern classes have different false-positive profiles, and a threshold that works well for SCRATCH_LINEAR may generate excessive false positives for CLUSTER_RANDOM.

The approach that works in practice is to run a calibration pass on your historical labeled data — if you have it — to measure precision-recall at different confidence thresholds for each class. If you don't have labeled historical data, start with the Wafertune default thresholds and track alert-to-confirmed-defect ratio per class over the first 30 days of operation.
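If you do have labeled history, the per-class measurement can be sketched as below. The record shape (predicted class, confidence, engineer-confirmed class) is an assumption about how your labeled data is stored, not a Wafertune export format.

```python
def precision_recall_by_threshold(records, pattern_class, thresholds):
    """Precision and recall for one pattern class at several thresholds.

    records: iterable of (predicted_class, confidence, true_class) tuples.
    Returns {threshold: (precision, recall)}; a value is None when its
    denominator is zero.
    """
    records = list(records)
    results = {}
    for t in thresholds:
        tp = fp = fn = 0
        for predicted, confidence, actual in records:
            flagged = predicted == pattern_class and confidence >= t
            if flagged and actual == pattern_class:
                tp += 1          # alert raised, defect confirmed
            elif flagged:
                fp += 1          # alert raised, nothing there
            elif actual == pattern_class:
                fn += 1          # real defect, no alert at this threshold
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        results[t] = (precision, recall)
    return results
```

Running this once per class over a grid of thresholds shows where precision starts to fall off for that class, which is the number a single global threshold hides.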

Pattern class	Recommended starting threshold	Rationale
`SCRATCH_LINEAR`	0.70	High visual distinctiveness; low ambiguity with other classes
`RING_EDGE_EXCL`	0.75	Common class with good training coverage; can distinguish from normal CMP variation at moderate confidence
`LITHO_REPEAT`	0.80	Can be confused with multi-field CMP non-uniformity; higher threshold reduces process-normal false positives
`CLUSTER_RANDOM`	0.80	Normal particulate baseline on well-maintained process can generate borderline cluster detections; tune to your fab's baseline
`SLIP_LINE_THERMAL`	0.65	Rare but high-impact; prefer lower threshold (accept more false positives to avoid missing a thermal excursion)

These are starting points for a 200mm analog/BCD context. The right thresholds for your specific process depend on: die size (affects how well the classifier can resolve spatial patterns), typical yield level (a 99% yield wafer has more random noise relative to signal than a 90% yield wafer), and process maturity (stable processes have more consistent normal-variation profiles that are easier to separate from real excursions).
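The starting thresholds in the table translate directly into a per-class lookup. This is a sketch of client-side alert logic under our own assumptions: the function name is hypothetical, and the fallback for classes not in the table reuses the 0.75 global example from earlier in this section.

```python
# Starting thresholds from the table above (200mm analog/BCD context).
STARTING_THRESHOLDS = {
    "SCRATCH_LINEAR": 0.70,
    "RING_EDGE_EXCL": 0.75,
    "LITHO_REPEAT": 0.80,
    "CLUSTER_RANDOM": 0.80,
    "SLIP_LINE_THERMAL": 0.65,
}

# Assumed fallback for classes without a tuned value.
GLOBAL_FALLBACK = 0.75

def should_alert(pattern_class, confidence, thresholds=STARTING_THRESHOLDS):
    """Return True when the confidence meets the class-specific threshold."""
    return confidence >= thresholds.get(pattern_class, GLOBAL_FALLBACK)
```

Keeping the thresholds in one mapping makes it easy to version them alongside your 30-day tracking data and adjust one class at a time.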

The Ambiguous Pattern Problem

Some patterns are genuinely ambiguous — not because the model is uncertain, but because the underlying physics can produce wafer maps where two different failure modes look nearly identical at the bin map level.

A recurring example: partial-ring patterns near the wafer edge. These can be caused by three distinct mechanisms: CMP edge loading (a CMP process issue), bevel contamination from a previous wet clean step, or edge exclusion at the chuck edge during a deposition step. At low to moderate yield impact, these three mechanisms produce nearly identical wafer map signatures. The model assigns high confidence to RING_EDGE_EXCL because that's the correct spatial category, while process_origin_hint correctly returns all three candidate origins.

For ambiguous patterns like this, automated alerting should route to human review rather than directly to a process action queue. A yield engineer looking at the map alongside other process data (deposition equipment log, CMP pad wear state, wet clean chemical concentration measurements) can disambiguate in minutes. The classifier's job is to surface the pattern; the engineer's job is to assign root cause.
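A minimal routing sketch for this case, assuming process_origin_hint arrives as a list of candidate origins (the exact field shape is our assumption here, and the queue names are hypothetical):

```python
def route_by_origin(response: dict) -> str:
    """Route a classification based on how many process origins are plausible.

    Exactly one candidate origin: safe to hand to a process action queue.
    Zero or several candidates: a person needs to assign root cause.
    """
    origins = response.get("process_origin_hint") or []
    if len(origins) == 1:
        return "process_action_queue"
    return "human_review_queue"
```

With the partial-ring example above, three candidate origins would send the wafer to review even though the spatial classification itself is high-confidence.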

Designing a Human Review Workflow

The review_recommended flag in the API response is your routing signal. When it's set to true, the classification should go to a review queue rather than triggering automated downstream actions. The flag fires when any of the following holds:

Top-1 confidence is below 0.60 (the model is genuinely uncertain)
Top-1 and top-2 confidence scores are within 0.08 of each other (the model is choosing between two plausible classes)
The detected pattern is a compound (multi-label) map with individual class confidences below 0.75
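The flag is computed for you, but it can help to see the three conditions as code, for example when replaying historical scores. The sketch below is our reading of the rules, not the service's implementation: treating "within 0.08" as a strict comparison and "below 0.75" as "any detected label below 0.75" are assumptions.

```python
from typing import Optional, Sequence

def review_recommended(
    class_scores: Sequence[float],
    compound_label_scores: Optional[Sequence[float]] = None,
) -> bool:
    """Approximate the three review conditions listed above.

    class_scores: confidence per candidate class, in any order.
    compound_label_scores: confidences of the labels detected on a
        compound (multi-label) map, or None for a single-label map.
    """
    ranked = sorted(class_scores, reverse=True)
    if not ranked:
        return True  # nothing to act on, so let a person look
    top1 = ranked[0]
    top2 = ranked[1] if len(ranked) > 1 else 0.0

    if top1 < 0.60:                 # model is uncertain
        return True
    if top1 - top2 < 0.08:          # two plausible classes
        return True
    if compound_label_scores is not None and any(
        score < 0.75 for score in compound_label_scores
    ):                              # weak label on a compound map
        return True
    return False
```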

In production deployments observed by our pilot users, the review queue typically captures 8–15% of all classified wafer maps. That's a manageable load for a yield team reviewing at the start of each shift — roughly 10–30 wafers per shift for a 200mm fab running 1,500–2,000 wafer starts per week. If your review queue exceeds this range, it usually means thresholds are too aggressive (too many marginal detections routing to review) or the model has calibration issues for your node family.
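The per-shift figure is simple arithmetic, and it is worth redoing with your own numbers. In this sketch the assumption of 10 review shifts per week, and of one classified map per wafer start, are ours; change both to match your fab.

```python
def review_load_per_shift(wafers_per_week, review_fraction, shifts_per_week):
    """Expected number of wafer maps landing in review per shift."""
    return wafers_per_week * review_fraction / shifts_per_week

# Low end: 1,500 wafers/week, 8% routed to review, 10 shifts/week (assumed).
low = review_load_per_shift(1500, 0.08, 10)    # about 12 wafers per shift

# High end: 2,000 wafers/week, 15% routed to review, 10 shifts/week (assumed).
high = review_load_per_shift(2000, 0.15, 10)   # about 30 wafers per shift
```

A fab reviewing on more shifts per week would see proportionally fewer wafers per review session for the same weekly volume.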

A well-designed review interface presents the wafer map image alongside the top-3 candidate classes with confidence scores, the process_origin_hint for each, and, if you have historical data, similar wafer maps from past reviews. The engineer confirms, overrides, or escalates. Confirmations and overrides feed back as labeled data for model fine-tuning. This feedback loop is what lets the system improve over time instead of staying fixed at its launch accuracy.
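As an illustration of that feedback step, review decisions can be reduced to training labels with a small record type. Everything here is hypothetical (the record fields, the decision strings, and the choice to keep both confirmations and overrides as labels); it sketches the idea rather than a Wafertune data format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ReviewOutcome:
    wafer_id: str
    predicted_class: str
    decision: str                         # "confirm", "override", or "escalate"
    engineer_class: Optional[str] = None  # set when the engineer overrides

def training_label(outcome: ReviewOutcome) -> Optional[Tuple[str, str]]:
    """Turn one review decision into a (wafer_id, label) pair, if usable."""
    if outcome.decision == "confirm":
        return (outcome.wafer_id, outcome.predicted_class)
    if outcome.decision == "override" and outcome.engineer_class:
        return (outcome.wafer_id, outcome.engineer_class)
    # Escalations have no settled label yet, so they are left out.
    return None
```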

When False Positives Indicate a Deeper Problem

Sustained high false-positive rates on a specific class often indicate something other than threshold misconfiguration. A few diagnostic patterns to watch:

High false-positive rate on CLUSTER_RANDOM at specific die regions may indicate that your process has a systematic contamination source that the model is correctly detecting but that your team has normalized as acceptable. What looks like a false positive from the alerting perspective may be a real defect that isn't currently driving yield fallout — but could become one if contamination levels increase.

High false-positive rate on LITHO_REPEAT often traces to CMP thickness non-uniformity at stepper-field boundaries — a pattern that is genuinely periodic (hence the LITHO_REPEAT classification) but not caused by a reticle defect. The model is pattern-matching correctly; the disambiguation requires knowing whether your CMP process has field-boundary non-uniformity on a given lot.

Sudden increase in false-positive rate after a model update is a signal to check the model version in the API response (model_version field) and review the update notes for any class threshold changes. Model updates can shift calibration slightly even for well-covered classes.
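To spot that kind of shift, it helps to track the unconfirmed-alert fraction per class and per model_version. The sketch below assumes you log each reviewed alert as a dict with model_version, pattern_class, and a confirmed boolean; that log format is our assumption.

```python
from collections import Counter

def unconfirmed_rate_by_version(reviews):
    """Fraction of reviewed alerts not confirmed, per (model_version, class).

    reviews: iterable of dicts with keys 'model_version', 'pattern_class',
             and 'confirmed' (True when the engineer confirmed the defect).
    """
    totals = Counter()
    unconfirmed = Counter()
    for review in reviews:
        key = (review["model_version"], review["pattern_class"])
        totals[key] += 1
        if not review["confirmed"]:
            unconfirmed[key] += 1
    return {key: unconfirmed[key] / totals[key] for key in totals}
```

Comparing the same class across two adjacent versions makes a calibration shift visible before it shows up as alert fatigue.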

For threshold tuning, review queue design, and per-class calibration analysis, the How It Works page covers the response schema in detail. For pattern-class-specific guidance on which classes are prone to false positives in specific node families, the Pattern Library notes the expected precision range per class. If your false-positive profile doesn't respond to threshold tuning, reach out — that's usually a signal that the model needs calibration on your specific process node, and we handle those conversations through the pilot program.