Preventing Capacitor Failures in AI Server Power Systems

Mitigating High-Stress Conditions Through Passive Component Selection

(Vadym/stock.adobe.com; generated with AI)

In the race to build faster, smarter, and more powerful artificial intelligence (AI) servers, designers often focus on the headline components like graphics processing units (GPUs), tensor processing units (TPUs), and high-speed interconnects. But beneath the surface, passive components like capacitors quietly shoulder the burden of maintaining system stability. When capacitors fail, the consequences ripple through the entire system: voltage regulators falter, processors crash, and servers go offline. And yet, these failures often occur despite the capacitors having passed rigorous lab tests.

So, why is this happening? In this blog, we examine why lab-tested passive components such as capacitors may fail in the field, and consider how designers can mitigate component failure by selecting passive components specifically designed for demanding AI server environments.

Demanding AI Server Workloads

AI servers operate in environments far more demanding than the conditions under which most capacitors are tested. High temperatures, elevated humidity, and intense power densities create a perfect storm for component degradation. In data centers, ambient temperatures can reach 50–60°C, and localized hotspots near processors can push well beyond 100°C. Add to that the presence of moisture from air economizers or liquid cooling systems, and you have a recipe for accelerated wear, even for components that passed standard qualification tests.

How Standard Capacitors Fail in the Field

To understand why capacitors fail, we need to confront misconceptions surrounding them. Although capacitors are characterized in textbooks as simple components with a dielectric between conducting plates, their actual construction is far more varied and complicated.

Among the many capacitor types, three basic groups and some of their associated failure modes are:

  • So-called "wet" aluminum electrolytic capacitors, often called bulk capacitors, have relatively large values of tens to hundreds of microfarads, and sometimes thousands. Their role is to filter out ripple on DC rails and maintain a steady DC voltage despite load transients. These capacitors lose electrolyte solution over time, which increases their equivalent series resistance (ESR). Higher ESR causes more voltage ripple and heat buildup, potentially leading to a self-reinforcing cycle of failure known as thermal runaway.
  • Polymer capacitors use a conductive polymer that forms the cathode layer on the aluminum oxide dielectric, replacing the liquid electrolyte used in conventional aluminum electrolytics. Their low ESR and stable performance make them ideal for high-frequency and low-impedance applications. The primary failure mechanism is polymer oxidation, which increases ESR and reduces capacitance over time under thermal or voltage stress.
  • Multilayer ceramic capacitors (MLCCs) use stacked ceramic dielectric and metal electrode layers for compact, low-ESR capacitance. They are widely used for decoupling and filtering. Class II types can experience significant capacitance loss under DC bias and are susceptible to mechanical cracking, while Class I types offer superior temperature and voltage stability.
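
The ESR feedback loop described for electrolytic capacitors above can be made concrete with two textbook relationships: the heat dissipated inside the capacitor is roughly I_rms² × ESR, and the resistive part of the ripple voltage is roughly I_ripple × ESR. The sketch below is only an illustration. The function names are made up for this example, and the ripple current and ESR values are hypothetical rather than taken from any datasheet.

```python
def esr_self_heating_w(ripple_current_a_rms: float, esr_ohm: float) -> float:
    """Power dissipated in the capacitor's ESR, in watts (P = I_rms^2 * ESR)."""
    return ripple_current_a_rms ** 2 * esr_ohm


def esr_ripple_voltage_v(ripple_current_a_pp: float, esr_ohm: float) -> float:
    """Resistive part of the peak-to-peak ripple voltage, in volts (V = I * ESR)."""
    return ripple_current_a_pp * esr_ohm


# Hypothetical values for illustration only (not from a datasheet).
RIPPLE_A_RMS = 3.0
for label, esr_ohm in [("fresh", 0.010), ("aged, ESR doubled", 0.020)]:
    heat_w = esr_self_heating_w(RIPPLE_A_RMS, esr_ohm)
    print(f"{label}: ESR = {esr_ohm * 1000:.0f} mOhm, "
          f"self-heating = {heat_w * 1000:.0f} mW")
```

At a fixed ripple current, both quantities scale linearly with ESR, so a capacitor whose ESR has doubled runs hotter and passes more ripple to the load, which in turn speeds up further degradation.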

Admittedly, capacitors and their designation terminology can be confusing. Sometimes they are named by their conductive or dielectric materials, such as aluminum, ceramic, or plastic; at other times they are called out by their construction, such as film or multilayered. These designations overlap.

Capacitors do not necessarily fail outright or have a single failure mode. Their nominal capacitance can change significantly, and other key attributes can degrade as well, such as an increase in ESR or leakage current.

Do not be misled by their functional simplicity. As with any component, capacitors have multiple potential points of failure (Figure 1).

Figure 1: Despite its conceptual simplicity, the capacitor—like all other components—has many possible failure causes, modes, effects, and consequences, shown here for the metallized film capacitor. (Source: CERN, CC BY 4.0 http://creativecommons.org/licenses/by/4.0/)[1]

Changes in capacitor performance can lead to processor slowdowns, noise-induced issues, voltage-regulator instability, erratic system performance, or full server outages that could negatively affect uptime service level agreements (SLAs) and customer workloads. Many of these system problems are difficult to diagnose due to their intermittent aspects or lack of an obvious link between cause and associated effect.

Testing Standards vs. AI Server Environments

Capacitor reliability is typically assessed using standardized tests that simulate stress conditions, such as 105°C for 2,000 hours, but these tests are often run in dry ovens, without ripple current, and under controlled humidity. Numerous standards specify how capacitors are tested and evaluated, with details for setups and procedures before, during, and after the test. They include:

  • IEC 60384-4, an international standard for aluminum electrolytic capacitors, providing general specifications that are supplemented by detailed specifications for specific capacitor types and applications;
  • MIL-STD-202, which outlines various methods for capacitor testing, including methods for thermal shock and humidity testing;
  • MIL-PRF-55681, a general-purpose military high-reliability specification for surface-mount ceramic chip capacitors in sizes 0805 through 2225, rated at 50V and 100V;
  • MIL-PRF-123, which defines an increased reliability level over MIL-PRF-55681 for space, missile, and other high-reliability applications like medical implants or life-support equipment; and
  • EIA IS-749, used by some manufacturers to detail requirements for capacitor mounting, airflow, and defining end-of-life (EOL) criteria.

While these standards and tests are comprehensive, detailed, and valuable for benchmarking, they do not adequately reflect the chaotic reality of AI server deployments. Modern AI servers run 24/7, and these systems are not only thermally stressed but also exposed to humidity levels that can lead to condensation.
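
To put a rating such as "105°C for 2,000 hours" in context, designers often apply a rule of thumb for aluminum electrolytic capacitors: expected life roughly doubles for every 10°C drop in operating temperature. The sketch below applies that rule only. It is not a manufacturer's lifetime model, the function name is made up for this example, and it deliberately ignores ripple current and humidity, which are exactly the stresses that the dry-oven tests leave out.

```python
def estimated_life_hours(rated_life_h: float,
                         rated_temp_c: float,
                         operating_temp_c: float) -> float:
    """Temperature-only life estimate using the '10-degree rule':
    life doubles for each 10 degC below the rated temperature
    (and halves for each 10 degC above it)."""
    return rated_life_h * 2 ** ((rated_temp_c - operating_temp_c) / 10.0)


# The rating quoted in the text: 2,000 hours at 105 degC.
# The operating temperatures below are hypothetical examples.
for operating_temp_c in (105, 85, 65):
    life_h = estimated_life_hours(2000, 105, operating_temp_c)
    print(f"{operating_temp_c} degC -> about {life_h:,.0f} h")
```

Because the estimate depends so strongly on temperature, a capacitor sitting near a processor hotspot can have a far shorter expected life than one a few centimeters away, even before ripple current and moisture are considered.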

ASHRAE guidelines recommend keeping data center temperatures between 18°C and 27°C.[2] With power densities reaching 30–50kW per rack,[3] and projections suggesting clusters may soon hit 1,000kW,[4] thermal dissipation becomes a real challenge. Additionally, ASHRAE guidelines allow dew points up to 15°C, meaning moisture intrusion is a real concern. In such conditions, capacitors face challenges that lab tests simply don’t account for.
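
The dew point figure matters because any surface at or below the dew point of the surrounding air will collect condensation. As a rough illustration, the sketch below estimates dew point from air temperature and relative humidity using the Magnus approximation, with one commonly used pair of coefficients. The function names and the example conditions are assumptions made for this example, not values from the ASHRAE guidelines.

```python
import math

# Magnus approximation coefficients (one commonly used set, valid over water).
MAGNUS_A = 17.62
MAGNUS_B_C = 243.12


def dew_point_c(air_temp_c: float, relative_humidity_pct: float) -> float:
    """Approximate dew point in degC from air temperature and relative humidity."""
    if not 0 < relative_humidity_pct <= 100:
        raise ValueError("relative humidity must be in the range (0, 100]")
    gamma = (math.log(relative_humidity_pct / 100.0)
             + MAGNUS_A * air_temp_c / (MAGNUS_B_C + air_temp_c))
    return MAGNUS_B_C * gamma / (MAGNUS_A - gamma)


def condensation_risk(surface_temp_c: float,
                      air_temp_c: float,
                      relative_humidity_pct: float) -> bool:
    """True if a surface is at or below the dew point of the surrounding air."""
    return surface_temp_c <= dew_point_c(air_temp_c, relative_humidity_pct)


# Hypothetical conditions: 25 degC air at 60 percent RH, 15 degC cold surface.
print(f"dew point: {dew_point_c(25.0, 60.0):.1f} degC")
print("condensation risk:", condensation_risk(15.0, 25.0, 60.0))
```

In this hypothetical case, a chilled surface such as a liquid-cooling line at 15°C would sit below the dew point of the surrounding air, so moisture could form near adjacent components.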

Humidity and ripple current are particularly damaging. Moisture can degrade packaging materials, while ripple current stresses the internal structure of the capacitor. Together, they accelerate failure mechanisms that are rarely triggered in lab environments.

A Better Design with More Suitable Specifications

Recognizing the difficulties surrounding capacitors within data centers, YAGEO Group has introduced capacitors that are optimized and evaluated for these AI server applications. The A798 aluminum organic capacitors (AO-CAP®) are high-humidity, high-temperature solid-state aluminum capacitors built to withstand the rigors of AI server operation. With a rated voltage of 2V to 2.5V, these polarized units are available in capacitance values from 150µF to 470µF and housed in two diminutive package sizes measuring just 7.3mm × 4.3mm × 1.9mm and 7.3mm × 4.3mm × 2.8mm (L × W × H).

Their cathode is a solid conductive organic polymer, which results in very low ESR and improved capacitance retention at high frequency. Since there is no liquid electrolyte, the A798 offers long operational lifetimes and high operating temperatures. The inherent low ESR makes the capacitors suitable for handling normally detrimental high ripple currents.

The A798 construction is based on a stacking of aluminum elements, which includes the dielectric Al2O3 and the polymer counter electrode on the surface, while the external layers are built with carbon and silver (Figure 2).

Figure 2: Components in the A798 family offer high capacitance and long life under the stressful conditions of AI servers, due to their advanced materials, sophisticated design, and enhanced implementation. (Source: YAGEO Group)

Internally, several element foils are stacked and positioned within the capacitor construction, which is largely responsible for the very low ESR (Figure 3).

Figure 3: A cutaway diagram of an A798 capacitor shows the many elements that are needed to create the capacitance function. (Source: YAGEO Group)

Enhancements to the design and selected material upgrades were introduced in the A798 series to deliver 1,000 hours at 85°C and a very high 85 percent relative humidity (RH)—at the rated voltage—along with 125°C endurance life and storage. The small package size, high ripple-current capability, high operating temperature, low parasitics, and capacitance stability over life span make the A798 an ideal solution in demanding AI server applications.

Conclusion

Capacitors are essential in ensuring the reliable operation of demanding AI workloads. Failures often result not from poor quality, but from the inability of standardized lab tests to capture all the harsh realities of modern data centers. Standardized tests capture many individual stresses, yet real-world AI server operation often combines multiple factors, such as thermal cycling, ripple current, DC bias, humidity, and localized hotspots that interact in ways the tests do not fully replicate.

As system power densities continue to rise, designers must account for the full range of capacitor limitations and select components proven to perform beyond the lab and specifically designed for demanding AI server environments.

Sources

[1] http://dx.doi.org/10.5170/CERN-2015-003.45
[2] https://www.ashrae.org/file%20library/technical%20resources/bookstore/ashrae_tc0909_power_white_paper_22_june_2016_revised.pdf
[3] https://174powerglobal.com/blog/how-ai-changes-data-center-design-forever/
[4] https://www.datacenterdynamics.com/en/news/hyperscalers-prepare-for-1mw-racks-at-ocp-emea-google-announces-new-cdu/

Author

Bill Schweber is a contributing writer for Mouser Electronics and an electronics engineer who has written three textbooks on electronic communications systems, as well as hundreds of technical articles, opinion columns, and product features. In past roles, he worked as a technical website manager for multiple topic-specific sites for EE Times, as well as both the Executive Editor and Analog Editor at EDN. He has an MSEE (Univ. of Mass) and BSEE (Columbia Univ.), is a Registered Professional Engineer, and holds an Advanced Class amateur radio license. Bill has also planned, written, and presented online courses on a variety of engineering topics, including MOSFET basics, ADC selection, and driving LEDs.

About YAGEO Group

YAGEO Group makes the future possible with electrical innovations and solutions that power the world forward. With one of the broadest selections of component technologies from some of the industry’s most recognized brands, YAGEO Group components are designed to meet the diverse requirements of customers and a full range of end-market segments.