AI Infrastructure Engineer

Job Description

Company Overview

At STN Inc., we're at the forefront of artificial intelligence innovation, building scalable infrastructure to power cutting-edge AI models and applications. We're seeking a talented AI Infrastructure Engineer to join our dynamic team and help optimize our high-performance computing environments for AI workloads.

Job Summary

The AI Infrastructure Engineer will be responsible for designing, deploying, and maintaining robust infrastructure systems tailored for AI and machine learning operations. This role focuses on ensuring seamless performance, scalability, and reliability in distributed computing environments. You'll collaborate with data scientists, ML engineers, and DevOps teams to support large-scale AI training and inference pipelines.

Key Responsibilities

  • Design and implement AI infrastructure solutions, including cluster management, resource allocation, and workload orchestration for high-performance computing (HPC) environments.
  • Deploy, configure, and troubleshoot containerized applications using Kubernetes across various flavors (e.g., vanilla Kubernetes, Amazon EKS, Google GKE, Azure AKS, and on-premises setups).
  • Manage job scheduling and resource management using Slurm for efficient utilization of GPU clusters in AI training workflows.
  • Optimize Ubuntu-based systems for AI workloads, including kernel tuning, security hardening, and performance monitoring.
  • Integrate and maintain NVIDIA GPU technologies, ensuring compatibility with the CUDA toolkit and AI frameworks like TensorFlow and PyTorch.
  • Monitor system performance, identify bottlenecks, and implement automation scripts for infrastructure provisioning and scaling.
  • Collaborate on disaster recovery planning, security compliance, and cost optimization for cloud and on-premises AI infrastructure.
  • Stay updated on emerging technologies in AI infrastructure and contribute to best practices documentation.

Required Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proven expertise as an Ubuntu specialist, with hands-on experience in system administration, networking, and scripting (e.g., Bash, Python) on Ubuntu servers.
  • Extensive experience with Kubernetes in all major flavors, including cluster setup, scaling, networking (e.g., CNI plugins), and security (e.g., RBAC, Pod Security Admission).
  • Strong proficiency in Slurm for managing HPC clusters, including job submission, queue configuration, and integration with GPU resources.
  • 3+ years of experience in infrastructure engineering, preferably in AI/ML or HPC environments.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and container orchestration tools.
  • Excellent problem-solving skills and ability to work in a fast-paced, collaborative environment.

Preferred Qualifications

  • NVIDIA certifications (e.g., NVIDIA Certified Professional in Data Center GPU Management or CUDA Programming) are a strong plus.
  • Experience with other HPC schedulers (e.g., PBS, LSF) or AI-specific tools like Kubeflow.
  • Knowledge of infrastructure-as-code tools (e.g., Terraform, Ansible) and CI/CD pipelines.
  • Background in AI model deployment, monitoring tools (e.g., Prometheus, Grafana), or edge computing.

What We Offer

  • Competitive salary and benefits package.
  • Opportunities for professional growth in a rapidly evolving AI field.
  • Flexible remote/hybrid work options.
  • Access to state-of-the-art AI hardware and tools.

If you're passionate about building the backbone for AI innovation, apply today!

Company Description

STN is an equal opportunity IT Solution Provider based in Pleasanton, CA.