The $2 Billion Question GM Never Asked: When Two Rare Defects Meet in One Battery Cell

Madhusudhan Chellappa | March 10, 2026 | 11 min read

In 2016, the Chevrolet Bolt EV was GM's flagship answer to Tesla. Affordable. Long-range. The car that was supposed to prove a 100-year-old automaker could lead the electric transition.

By 2021, GM was telling 140,000 Bolt owners: Don't park in your garage. Don't charge overnight. Don't let the battery drop below 70 miles of range. Your car might burn your house down.

The recall cost roughly $2 billion. Thirteen confirmed fires. Production halted for months. A key supplier relationship strained almost to the breaking point.

What went wrong wasn't exotic. It was a gap in how we think about failure.

The Fires That Came From Nowhere

The first reports trickled in during 2020. Bolts were catching fire — not while driving, but while parked. In garages. Overnight while plugged in. The worst possible scenario for consumer confidence in EVs.

GM's initial response was a software update in November 2020. A BMS patch that was supposed to detect anomalies before they became dangerous. It didn't work. More fires followed.

In July 2021, after two more fires in vehicles that had already received the "fix," GM recalled about 69,000 vehicles a second time. In August, it expanded the recall to every Bolt ever manufactured. All model years. Every single one.

Two Rare Things That Weren't So Rare Together

After months of forensic investigation, GM and LG Energy Solution identified the root cause. It wasn't a single manufacturing defect. It was the simultaneous presence of two separate defects in the same battery cell:

  1. A torn anode tab — a physical defect in the electrode structure
  2. A folded separator — the insulating layer between anode and cathode was misaligned

Either defect alone was considered rare. Quality control processes were designed to catch each one individually, and the individual occurrence rates were low enough to pass statistical thresholds.

But when both defects existed in the same cell, lithium could plate during charging. Over time, this created an internal short circuit. The short circuit led to thermal runaway. The cell caught fire.

The defects occurred at LG's plants in both Korea (Ochang) and Michigan (Holland). Geography didn't matter. The process gap was systemic.

The Failure Mode Nobody Analyzed

Here's where the systems engineering lesson gets sharp.

Process FMEA at the cell manufacturing level almost certainly analyzed torn anode tabs as a failure mode. It almost certainly analyzed folded separators as a failure mode. Each would have had its own severity, occurrence, and detection ratings. Each would have had its own set of controls.

What the FMEA almost certainly did NOT do was analyze the combination: "What if both defects occur in the same cell?"

This is a well-known gap in traditional FMEA practice. Single-point failure analysis asks "what happens if X fails?" It does not systematically ask "what happens if X and Y fail simultaneously?" That question belongs to a different analytical discipline — common cause analysis, fault tree analysis, or interaction FMEA — and it is routinely skipped, especially in high-volume manufacturing where individual defect rates are low.

The mathematics of combined probability can be deceiving. If defect A occurs at a rate of 1 in 10,000 and defect B occurs at a rate of 1 in 10,000, and the two are independent, the probability of both occurring in the same cell is 1 in 100 million. Sounds impossibly rare. But when you manufacture billions of cells across multiple plants over multiple years, "1 in 100 million" stops being theoretical: at one billion cells, you should expect about ten such cells. And if the two defects share a cause, the real rate is higher than the simple product suggests.
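The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming the two defects are statistically independent and using the illustrative 1-in-10,000 rates from the text, not measured production data. The function names and the production volumes in the loop are hypothetical.

```python
import math

def joint_defect_probability(p_a: float, p_b: float) -> float:
    """Probability that a single cell has both defects, assuming independence."""
    return p_a * p_b

def prob_at_least_one(p_cell: float, n_cells: int) -> float:
    """Probability that at least one of n_cells has both defects."""
    # Equivalent to 1 - (1 - p)^n; log1p/expm1 keep precision when p is tiny.
    return -math.expm1(n_cells * math.log1p(-p_cell))

# Illustrative rates from the text: 1 in 10,000 for each defect.
p_both = joint_defect_probability(1e-4, 1e-4)

# Hypothetical production volumes, chosen only to show the scaling.
for n_cells in (10_000_000, 100_000_000, 1_000_000_000):
    expected = p_both * n_cells
    chance = prob_at_least_one(p_both, n_cells)
    print(f"{n_cells:>13,} cells: expected {expected:.1f} doubly defective, "
          f"P(at least one) = {chance:.3f}")
```

Under these assumptions, the chance of at least one doubly defective cell is already about 63% at 100 million cells, which is the point where "1 in 100 million" naively sounds like it should still be safe.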

The Software Fix That Could Never Work

There's a secondary lesson buried in the timeline. GM's first response was a software update — a BMS patch to detect anomalies. It failed because the root cause was never in the software.

The BMS couldn't detect the developing internal short circuit because the defect existed at the physical cell level, below the electrical signatures the BMS was designed to monitor. By the time electrical anomalies became detectable, thermal runaway was already inevitable.

This pattern — reaching for a software fix when the root cause is in manufacturing or hardware — is endemic in modern engineering. Software is easy to update. It doesn't require physical recall. It feels fast and responsive. But when the failure mode lives in a physical process, software is a bandage on a fracture.

The reason GM reached for software first wasn't incompetence. It was a traceability gap. Without clear traceability from cell-level process parameters to vehicle-level safety requirements, the investigation couldn't immediately identify where in the chain the failure originated. The FMEA didn't connect the manufacturing process to the vehicle fire because the specific combination failure mode didn't exist in the analysis.

What This Means for the Industry

The Bolt recall isn't a story about one company's mistake. It's a story about an industry-wide blind spot in how we analyze manufacturing process risk for high-energy systems.

Combination failure modes must be explicitly analyzed. Traditional FMEA asks "what if this one thing fails?" For battery cells — where the energy density makes any internal short circuit potentially catastrophic — the analysis must also ask "what if these two things fail together?" This requires a structured approach to identifying which failure mode pairs can interact, and it requires that the FMEA methodology actually support multi-factor analysis.
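One way to make the pairwise question systematic is to enumerate every pair of failure modes in the PFMEA and flag the ones that can physically coexist in the same region of the cell. The sketch below is hypothetical: the `location` field, the screening rule, the third failure mode, and all rates are assumptions for illustration, not GM's or LG's method and not a procedure taken from any FMEA standard.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class FailureMode:
    name: str
    location: str      # hypothetical: region of the cell where the defect sits
    occurrence: float  # illustrative per-cell rate, not measured data

def candidate_pairs(modes):
    """Return pairs of failure modes that share a location.

    The joint rate is the product of the individual rates, which is only
    valid if the two modes are independent.
    """
    pairs = []
    for a, b in combinations(modes, 2):
        if a.location == b.location:
            pairs.append((a.name, b.name, a.occurrence * b.occurrence))
    return pairs

modes = [
    FailureMode("torn anode tab", "electrode stack", 1e-4),
    FailureMode("folded separator", "electrode stack", 1e-4),
    FailureMode("pouch seal wrinkle", "pouch seal", 1e-4),  # hypothetical mode
]

for first, second, rate in candidate_pairs(modes):
    print(f"review interaction: {first} + {second} (joint rate ~{rate:.0e})")
```

A shared-location rule is only one possible screen. In practice the pairing criteria would come from the physics of the cell, and each flagged pair would then get its own severity rating rather than inheriting one from either single failure mode.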

Supplier process risk is OEM vehicle risk. GM's vehicle caught fire because of a defect in LG's manufacturing process. The cell-level PFMEA gap became a vehicle-level safety failure. Any OEM integrating cells, modules, or packs from suppliers must have visibility into — and influence over — the supplier's process FMEA.

Traceability must span the full chain. From raw material properties to cell manufacturing parameters to module assembly to pack integration to vehicle safety requirements — the chain must be traceable. When a vehicle catches fire, you need to be able to trace backward from the event to the specific process step and specific failure mode that caused it. Without this traceability, your first response will be a guess. And as GM learned, guesses cost time, money, and trust.
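Backward traceability can be pictured as a graph walk from the field event to every upstream item that could have contributed. The sketch below assumes a hypothetical link table; the node names, including the process step names, are illustrative and do not describe GM's or LG's actual data model.

```python
from collections import deque

# Hypothetical traceability links: each item maps to the upstream items it derives from.
UPSTREAM = {
    "vehicle fire event": ["pack thermal runaway"],
    "pack thermal runaway": ["cell internal short"],
    "cell internal short": ["torn anode tab", "folded separator"],
    "torn anode tab": ["tab process step"],
    "folded separator": ["stacking process step"],
}

def trace_back(item, links):
    """Return every upstream item reachable from `item`, in breadth-first order."""
    seen = {item}
    order = []
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for parent in links.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

for upstream_item in trace_back("vehicle fire event", UPSTREAM):
    print(upstream_item)
```

The walk only works if the links exist. If the combination of two failure modes was never recorded as a cause of the internal short, the trace stops at the cell and the investigation has nothing to follow.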

Statistical rarity is not safety. A defect rate of 1 in 10,000 sounds safe. But manufacturing volumes in the EV era are measured in billions of cells. At one billion cells, that rate means about 100,000 defective cells, and even far rarer combinations are expected to appear. Your FMEA must account for actual production volumes, not just theoretical occurrence rates.

The Question Worth $2 Billion

GM and LG eventually found the root cause and replaced every battery pack. The financial cost was absorbed. The reputational cost is harder to measure — the Bolt was discontinued in 2023.

The question that would have prevented all of it is deceptively simple: "What happens when two rare defects exist in the same cell?"

That question wasn't in the FMEA. It should have been. And if you're building battery systems today — whether for EVs, energy storage, or any other high-energy application — it should be in yours.

Not because GM failed. Because the physics hasn't changed. The manufacturing processes still produce defects at nonzero rates. The energy densities are higher than ever. And the only thing standing between a "rare" combination and a thermal runaway event is whether someone thought to ask the question.

Madhusudhan Chellappa

CTO & Founder at Gannet Engineering. Two decades of experience in systems engineering across automotive, aerospace, and safety-critical domains.
