Sr Manager, AI Systems Quality & Reliability, Annapurna AI Servers and Systems

AWS Annapurna Labs is seeking a Senior Manager of Quality & Reliability (QnR) Engineering to lead the QnR function within the Trainium Manufacturing, Quality and Reliability organization. You will own quality and reliability outcomes for all Trainium AI server products, from component qualification through fleet performance, leading an engineering team across multiple concurrent chip and system generations. This role defines reliability strategy for liquid-cooled and air-cooled platforms at rapidly scaling volumes, builds quality systems across a multi-supplier global manufacturing base, drives fleet failure investigations to root cause, and establishes the reliability characterization capabilities required for next-generation technologies.

Key job responsibilities
- Lead and grow a QnR engineering team, hiring, developing, and retaining top reliability and quality engineering talent.
- Set technical direction for component qualification, reliability testing (HALT, HTOL, thermal cycling, QRV), DFMEA, and vendor quality standards across all Trainium programs.
- Own quality and reliability outcomes end-to-end — from DFM input during design through fleet reliability performance.
- Drive component-specific manufacturing process quality improvements in partnership with Manufacturing Engineering, establishing incoming quality requirements and process controls at all supplier sites.
- Build and maintain the reliability prediction and monitoring infrastructure — ensuring fleet performance is tracked against predictions, degradation trends are identified early, and corrective actions are data-driven.
- Establish systematic failure analysis processes that connect field failures back to manufacturing history, supplier data, and component-level root cause for rapid containment.
- Scale qualification processes to keep pace with multi-supplier, multi-generation production, including automation of qualification workflows and standardization of test methodologies across vendors.

About the team
Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro, Graviton, and Trainium families of processors. Machine Learning Annapurna (MLA) functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization. We are the Trainium Servers and Systems organization under MLA focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability. This position leads the Quality and Reliability Engineering function within the Manufacturing, Quality and Reliability team.