Job Description
HPC / GPU Infrastructure Engineer (MAAS, Slurm, NVIDIA Stack)
We are seeking experienced engineers to support the deployment of a high-performance GPU cluster (~100 NVIDIA A100 GPUs) in Mountain View, CA. This is a hands-on role focused on standing up a production-ready environment for provisioning, configuration management, workload scheduling, and observability using industry-standard tools. The engagement is short-term (1–2 months) and will require strong execution and troubleshooting capabilities.
Scope:
• Deploy and configure Canonical MAAS for bare metal provisioning, including IPMI/BMC integration and Ubuntu 22.04 LTS imaging.
• Develop and execute Ansible playbooks to fully automate system configuration across all nodes.
• Install and manage the NVIDIA software stack (drivers, CUDA, DCGM), including version control and validation.
• Configure Slurm Workload Manager, including fair-share scheduling, preemption policies, and user management (LDAP/local).
• Set up InfiniBand/RDMA networking for high-performance communication between nodes.
• Deploy Prometheus and Grafana for monitoring, including GPU telemetry via DCGM exporter.
• Troubleshoot system-level issues (GPU errors, PCIe issues, networking, performance bottlenecks).
• Deliver a fully automated, reproducible cluster setup with minimal manual intervention.
Required:
• Strong experience with Linux systems (Ubuntu) and bare metal environments.
• Hands-on expertise with Ansible for infrastructure automation.
• Experience working with GPU clusters, HPC environments, or ML infrastructure.
• Familiarity with NVIDIA stack (drivers, CUDA, DCGM).
• Experience with Slurm or similar workload schedulers.
• Understanding of networking, ideally including InfiniBand/RDMA.
• Experience deploying or managing monitoring tools (Prometheus, Grafana).
• Strong troubleshooting and debugging skills across hardware and software layers.
Pluses:
• Experience with MAAS (Metal-as-a-Service) or similar provisioning tools.
• Background in AI/ML training infrastructure.
• Experience debugging GPU-level issues (XID errors, PCIe failures).
• Familiarity with enterprise security agents (e.g., CrowdStrike, Fleetspeak/GRR).
• Prior experience in data center or large-scale cluster environments.
Estimated Min Rate: $70.00
Estimated Max Rate: $80.00+/hr, depending on work experience
What’s In It for You?
We welcome you to be a part of one of the largest and most legendary global staffing companies to meet your career aspirations. Yoh’s network of client companies has been employing professionals like you for over 65 years in the U.S., UK and Canada. Join Yoh’s extensive talent community to gain access to Yoh’s vast network of opportunities, including this exclusive opportunity. Benefit eligibility is in accordance with applicable laws and client requirements. Benefits include:
- Medical, Prescription, Dental & Vision Benefits (for employees working 20+ hours per week)
- Health Savings Account (HSA) (for employees working 20+ hours per week)
- Life & Disability Insurance (for employees working 20+ hours per week)
- MetLife Voluntary Benefits
- Employee Assistance Program (EAP)
- 401K Retirement Savings Plan
- Direct Deposit & weekly epayroll
- Referral Bonus Programs
- Certification and training opportunities
Note: Any pay ranges displayed are estimations. Actual pay is determined by an applicant's experience, technical expertise, and other qualifications as listed in the job description. All qualified applicants are welcome to apply.
Yoh, a Day & Zimmermann company, is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to any legally protected characteristic or status as a protected veteran.
Visit https://www.yoh.com/applicants-with-disabilities to contact us if you are an individual with a disability and require accommodation in the application process.
For California applicants, qualified applicants with arrest or conviction records will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. All of the material job duties described in this posting are job duties for which a criminal history may have a direct, adverse, and negative relationship potentially resulting in the withdrawal of a conditional offer of employment.
It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.
By applying and submitting your resume, you authorize Yoh to review and reformat your resume to meet Yoh’s hiring clients’ preferences. To learn more about Yoh’s privacy practices, please see our Candidate Privacy Notice: https://www.yoh.com/privacy-notice
Company Description
Yoh delivers expertise, methodology, and momentum to keep work moving forward. From strategy to execution, we deliver bold ideas and big results through consulting, staffing, and enterprise solutions. Nearly a century after our founding, Yoh remains STEM-centered, collaborative, and committed to client success. Yoh is a proud member of the Day & Zimmermann family of companies. Visit us at https://www.yoh.com/