Job Description
Location: Bethesda, MD
Category: Systems Engineer
Travel Required: No
Remote Type: Onsite
Clearance: Top Secret/SCI
Sunayu, LLC is looking for a highly skilled Systems Engineer with deep expertise in operating systems, hardware, GPUs, and high-speed networking. In this role, you will design, develop, and optimize GPU clusters that power enterprise AI for mission customers.
This is a 100% on-site position. All work must be performed at the customer site in Bethesda at the Intelligence Community Campus.
Primary Responsibilities
- GPU Cluster Engineering: Design, configure, and maintain GPU clusters. Collaborate with a multidisciplinary team to define and optimize architectures, ensuring they meet performance, power efficiency, and feature requirements.
- Operating System Integration: Work closely with AI/ML engineers to ensure smooth GPU integration with Linux-based systems. Optimize GPU drivers for compatibility, reliability, and performance. Provide regular maintenance and updates.
- Performance Optimization: Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers.
- Tooling and Automation: Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments. Leverage scripting and configuration tools such as Bash, Python, Ansible, Puppet, and Salt.
- Compliance & Documentation: Maintain technical documentation, architectural specifications, and Linux best practices. Support ATO (Authority to Operate) and ensure compliance with federal security standards.
Basic Qualifications
- Bachelor's or higher degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field with at least 12 years of related technical experience. Additional years of experience may be considered in lieu of a degree.
- 10+ years of relevant systems engineering experience
- Experience managing NVIDIA GPU data center platforms (DGX, HGX, H200, H100, L4s).
- Knowledge of enterprise server components (storage/network controllers, HBA, SSDs).
- Strong expertise with Linux distributions (RHEL, Ubuntu, Oracle Linux, and Rocky Linux).
- Excellent problem-solving skills and the ability to collaborate within a team.
- Candidate must, at a minimum, meet DoD 8140/8570 IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP, along with an appropriate computing environment (CE) certification). An IAT Level III certification is also acceptable (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, CCSP).
Clearance
- Due to the nature of the government contracts we support, US Citizenship is required.
- TS/SCI clearance with polygraph required, or TS/SCI clearance and a willingness to obtain a polygraph prior to starting.
Preferred Qualifications
- Experience with Kubernetes cluster management and AI/ML workflow orchestration (Argo, Airflow, and Kubeflow).
- Familiarity with GPU virtualization and cloud computing.
- Experience with Prometheus/Grafana for monitoring.
- Knowledge of distributed resource scheduling systems (Slurm, LSF, etc.).