Senior MLOps Engineer
Reports to: Head of Machine Learning
Location: Clarksville, TN
Company: XRI Global, Inc.
About XRI
XRI is an AI company specializing in research, data, and development for low-resource languages. We are dedicated to enabling speakers of low-resource languages to flourish through the development and deployment of advanced technology solutions.
About the Role
We're seeking an experienced Senior MLOps Engineer to design, deploy, and maintain our next-generation AI infrastructure. In the near term, you'll help us architect and implement a GPU-based compute cluster hosted in rented data-center racks. In this role, you'll lead the end-to-end process of hardware selection, cluster provisioning, containerized workflows, and scalable training/inference pipelines for large-scale machine learning workloads.
You’ll work closely with our research and platform teams to enable efficient, reproducible, and cost-effective AI model development and deployment at scale.
Key Responsibilities
- Infrastructure Design & Buildout
- Evaluate GPU hardware and server configurations for cost/performance.
- Plan rack layout, power, networking, and cooling requirements in collaboration with data center providers.
- Lead provisioning, configuration, and monitoring of the new GPU cluster.
- Cluster & Systems Administration
- Set up and manage compute orchestration via Kubernetes, Slurm, or Ray across multiple GPU nodes.
- Implement high-availability networking and secure access for internal research workloads.
- Build automation for node setup (e.g., PXE boot, Ansible, Terraform).
- MLOps & Workflow Automation
- Develop CI/CD pipelines for model training and deployment (GitHub Actions, Argo, MLflow, etc.).
- Implement containerized environments (Docker, Podman) and job scheduling for large distributed training runs.
- Manage data lifecycle: versioning, validation, and storage optimization for multi-TB datasets.
- Monitoring, Logging, and Optimization
- Deploy observability tools (Prometheus, Grafana, Loki) to monitor GPU utilization, network throughput, and storage.
- Benchmark and optimize training and inference performance across frameworks (PyTorch, vLLM, DeepSpeed, FSDP, etc.).
- Design cost-reporting and utilization dashboards.
- Collaboration & Research Enablement
- Work with AI researchers to productionize new models.
- Maintain reproducible training environments for long-term experimentation.
- Contribute to model-deployment strategies (REST/gRPC, Triton, vLLM, FastAPI, etc.).
Qualifications
Required:
- 5+ years of experience in MLOps, HPC, or cloud infrastructure for ML.
- Strong Linux administration skills; experience with Ubuntu and NVIDIA drivers.
- Experience managing multi-GPU systems, CUDA, and containerized ML workloads.
- Familiarity with infrastructure-as-code tools (Terraform, Ansible).
- Experience with one or more orchestration systems (Kubernetes, Ray, Slurm).
- Deep understanding of distributed training (DDP, FSDP, DeepSpeed, etc.).
- Proficiency in Python and shell scripting.
Preferred:
- Prior experience standing up GPU clusters or on-prem HPC environments.
- Familiarity with model deployment frameworks (vLLM, Triton Inference Server).
- Networking knowledge (InfiniBand, RoCE, NVLink, etc.).
- Experience setting up monitoring stacks (Prometheus, Grafana, Loki).
- Strong understanding of ML lifecycle tools (MLflow, DVC, Weights & Biases).
- Familiarity with security, access control, and GPU job scheduling policies.
Nice to Have
- Background in large-model training, quantization, and inference optimization.
- Familiarity with hybrid on-prem + cloud bursting architectures.
- Interest in helping define long-term data-center strategy and procurement.
Collaboration
- Reports to Head of ML, works with PMs and broader engineering team.
- Mentor colleagues and engage in cross-functional teams.
KPIs
- Inference uptime, cost-efficiency, and latency.
- Tool adoption, incident resolution, and observability metrics.
Languages & Travel
Fluent in English; additional languages a plus. Willing to travel internationally for field use, offsites, and events.
Location & Compensation
- Location: Hybrid in Clarksville, TN
- Compensation: Competitive salary and eligibility for annual performance bonuses
Culture and Values
XRI values innovation, compassion, problem-solving, and flexibility. Ideal candidates work autonomously, think strategically, and support team growth through knowledge-sharing.