Senior Software Development Engineer, EC2 Trainium AI Infra

The Software Development Engineer will lead the team in the technical strategy, design, build, and operation of infrastructure services, including the provisioning and availability of AWS Trainium-based AI servers. This role requires expertise in architecting large-scale systems and building microservices, as well as cross-functional collaboration with teams such as capacity management, hardware engineering, and datacenter teams to manage AI/ML infrastructure.

Key job responsibilities
- Design and develop innovative technologies that power the infrastructure supporting AI workloads on UltraServers.
- Lead technical projects establishing EC2 as the pioneer in cloud computing for AI/ML workloads across diverse applications including LLMs, multimodal systems, and emerging model architectures.
- Collaborate with various teams to influence the architecture of provisioning systems and improve them to operate efficiently at scale.
- Build customer relationships by investigating complex performance challenges, developing solutions, and publishing actionable best practices through multiple channels.

About the team
The EC2 UltraServer Provisioning team is a high-performing engineering organization responsible for delivering AWS Trainium-based UltraServer infrastructure at scale. We manage end-to-end provisioning workflows from host ingestion through testing, repair, and recovery.