Job Description
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.
About the Role (Job Purpose)
We are looking for a Director of Observability to lead Nscale’s global observability strategy across our AI cloud platform. You will own the design, implementation, and operation of monitoring, logging, tracing, and alerting systems that provide 360° visibility into infrastructure, orchestration, and AI workloads.
This role is highly cross-functional. You’ll work closely with SRE, infrastructure, and product engineering teams to ensure our systems are transparent, measurable, and reliable at scale. You’ll also be responsible for building and leading the observability team, defining best practices, and delivering actionable insights to drive operational excellence.
What You’ll Be Doing (Responsibilities)
- Define and execute Nscale’s observability strategy across all infrastructure and services.
- Build and manage a global observability engineering team.
- Deploy, operate, and scale observability platforms (Prometheus, Grafana, Loki, tracing/alerting systems).
- Ensure comprehensive instrumentation across GPU clusters, networking fabrics, Kubernetes (NKS/NKS Lite), and Slurm orchestration.
- Establish and track reliability metrics (SLIs, SLOs, error budgets) to guide service health.
- Integrate observability with incident management and fleet automation.
- Drive down MTTD and MTTR through proactive monitoring and automated remediation.
- Deliver executive-level reporting on system health, capacity, and reliability trends.
- Stay ahead of industry trends in observability, AIOps, and AI workload telemetry.
About You
- 10+ years of experience in large-scale infrastructure, SRE, or observability roles.
- Leadership experience managing distributed engineering teams.
- Strong expertise with observability tools (Prometheus, Grafana, Loki, alerting systems).
- Deep understanding of distributed systems, networking, and cloud architectures.
- Proficiency in automation and scripting (Python, Go, Bash).
- Hands-on experience with Kubernetes and container orchestration.
- Experience in improving incident response processes and operational reliability.
Nice to Have
- Experience with GPU/AI workload observability (e.g. DCGM, model telemetry, prompt analytics).
- Familiarity with HPC environments (Slurm, RDMA, InfiniBand).
- Knowledge of Infrastructure-as-Code (Terraform, Pulumi, Ansible).
- Awareness of sustainability and efficiency practices in data centre observability.
In all we do, our core values guide us:
Relentless Innovation
At Nscale, we constantly push the boundaries of innovation, embracing creative risks to shape the future. Our aim is to deliver products that not only meet but exceed today’s expectations, setting new standards for tomorrow.
Ownership and Accountability
Every Nscaler is fully accountable for their work, driving it with excellence and urgency. We set high standards, ensuring that our contributions are not just good but exceptional.
Openness and Transparency
We believe trust and transparency are key to our success. We maintain open communication within our teams and with stakeholders, sharing both successes and challenges. Our open-source approach allows customers to explore our technology, building trust and ensuring our solutions are innovative, secure, and reliable.
Customer-Centric Focus
Our customers are central to our mission, and we are committed to delivering impactful solutions that drive real-world success. We focus on deeply understanding their needs and challenges, striving to exceed expectations in both product quality and service.
Sustainability
We are dedicated to considering the long-term environmental and societal impacts of our technologies. By integrating sustainability into our operations and product development, we ensure that our innovations are both effective and responsible, contributing positively to the world around us.
Full-Speed Collaboration
Collaboration at Nscale is fast, efficient, and respectful. We work together seamlessly, with clear communication and mutual respect, ensuring our shared goals are met with high standards and impactful outcomes.
Equal Opportunities Statement
At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we warmly welcome applications from individuals of all backgrounds, experiences, and perspectives. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.