Senior IT Platform Engineer, Compute & Resilience

Job Description

SUMMARY

The Senior IT Platform Engineer designs, builds, modernizes, and operates the enterprise compute, virtualization, storage, and backup platforms across plants, data centers, offices, cloud environments, and remote users. This role owns the compute and resilience platform end to end, including architecture, automation, capacity management, disaster recovery, and operational performance.

The position emphasizes Infrastructure as Code, automation-first practices, AI-enabled operations, disaster recovery readiness, and reduction of technical debt to deliver resilient, scalable, and secure compute services aligned to enterprise strategy.

The full salary range for this position is $111,200 – $166,800. However, our current budget for a new hire is $111,200 – $150,000, depending on the candidate's specific experience and skills.

ESSENTIAL DUTIES AND RESPONSIBILITIES may include the following. Other duties may be assigned.

PLATFORM OWNERSHIP

  • Own the enterprise compute, virtualization, storage, and backup platforms across plants, warehouses, offices, cloud, and remote environments
  • Design for high availability, fault tolerance, scalability, and rapid recovery
  • Ensure platform reliability supports manufacturing uptime, enterprise operations, and business continuity
  • Serve as technical authority for compute architecture, virtualization standards, storage design, and resilience strategy
  • Drive modernization, standardization, and lifecycle management of servers, hypervisors, storage arrays, and backup platforms
  • Reduce technical debt and eliminate configuration drift
  • Act as a technical mentor and escalation point within the platform domain

ARCHITECTURE AND ENGINEERING

  • Design and implement resilient, secure, and scalable compute, virtualization, and storage architectures
  • Define and maintain standards, reference designs, and best practices for server builds, cluster design, hypervisor configuration, and storage layout
  • Lead platform upgrades, hypervisor migrations, storage refreshes, and modernization initiatives
  • Ensure integration with adjacent platforms such as network, security, cloud, data, and applications
  • Support hybrid environments spanning on-premises infrastructure, cloud compute platforms (Azure, AWS), and SaaS workloads
  • Design and maintain high-availability clusters and disaster recovery configurations

INFRASTRUCTURE AS CODE

  • Define compute and infrastructure configurations using code, templates, or structured configuration management tools
  • Establish version-controlled configurations as the system of record for server builds, hypervisor configurations, and storage policies
  • Enable repeatable, low-risk changes through standardized deployment models
  • Reduce manual changes and operational inconsistencies
  • Contribute to CI/CD practices for infrastructure or platform changes
  • Maintain version-controlled repositories as the authoritative source of platform configuration

AUTOMATION & OPERATIONAL EXCELLENCE

  • Automate server provisioning, patching, lifecycle management, validation, recovery, and compliance validation
  • Reduce manual operational effort through scripting and workflow automation
  • Partner with MSPs to ensure consistent execution of backup, recovery, and infrastructure runbooks
  • Improve monitoring signal quality across compute, storage, and virtualization layers
  • Design self-healing or auto-remediation capabilities where appropriate
  • Continuously optimize resource utilization, performance, and capacity planning

RESILIENCY & SECURITY FOUNDATIONS

  • Ensure compute platform resilience, redundancy, backup, and disaster recovery alignment
  • Own backup, recovery, and disaster recovery design and testing processes
  • Maintain documented recovery procedures and conduct periodic DR exercises
  • Partner with Security teams to maintain compliance, segmentation, access controls, and monitoring standards
  • Support enterprise risk management initiatives related to infrastructure stability, ransomware protection, and business continuity

AI-ENABLED OPERATIONS

  • Leverage AI-driven monitoring and analytics to detect anomalies and performance risks
  • Support predictive insights related to compute utilization, storage growth, and failure trends
  • Contribute to AI-assisted incident investigation and root cause analysis where tooling supports it
  • Identify opportunities to reduce alert fatigue and improve operational insight using intelligent tooling

COLLABORATION & ENABLEMENT

  • Partner closely with the business and other IT teams
  • Provide clear architecture diagrams, standards documentation, and operational runbooks
  • Participate in Tier-2 or Tier-3 escalation and on-call rotations within the platform domain
  • Act as secondary or tertiary responder for critical enterprise outages
  • Support cross-functional initiatives tied to modernization and transformation

SUPERVISORY RESPONSIBILITIES

This position has no supervisory responsibilities, but acts in a lead capacity to other department staff.

QUALIFICATIONS

EDUCATION/EXPERIENCE

  • Bachelor’s degree in a related field or equivalent practical experience
  • 8+ years of progressive experience in infrastructure or platform engineering, including at least 5 years in a senior-level role within a defined technology domain

SKILLS

  • Senior-level experience designing and operating enterprise compute, virtualization, storage, and backup platforms across multi-site and hybrid environments
  • Strong background in Windows Server and Linux administration, including enterprise virtualization platforms such as Nutanix AHV, VMware vSphere, or equivalent
  • Expertise in high availability, clustering, failover, disaster recovery design, and data replication
  • Experience with Azure cloud services and hybrid compute integrations
  • Proficiency in automation and scripting with PowerShell; familiarity with Python, Ansible, Terraform, or similar tools
  • Experience with Infrastructure as Code and repeatable infrastructure deployment practices
  • Familiarity with AI-driven infrastructure monitoring and AIOps platforms
  • Strong understanding of security best practices, system hardening, and patch management
  • Ability to support production systems with strict uptime, resiliency, and recovery requirements
  • Experience documenting architecture standards, recovery procedures, and operational runbooks

PREFERRED QUALIFICATIONS

  • Experience supporting manufacturing or operational technology environments
  • Experience working with MSPs providing infrastructure services
  • Relevant certifications such as MCSE, Azure Solutions Architect Expert, NCP-MCI, VCP, or equivalent

COMMUNICATION SKILLS

  • Ability to communicate complex technical concepts clearly to both technical and non-technical audiences
  • Strong written communication skills for architecture documentation, standards, and executive-level summaries

REASONING ABILITY

Ability to evaluate complex technical environments, synthesize system data, assess risk, and make sound architectural and operational decisions in dynamic or ambiguous situations. Demonstrates strong analytical thinking, structured problem-solving, and the ability to balance business impact with technical considerations.

CERTIFICATES, LICENSES, REGISTRATIONS

No certifications are required. Relevant industry certifications aligned to the platform domain, such as cloud, networking, security, or automation credentials, may strengthen candidacy.

CONFIDENTIALITY

In the course of performing this role, the Senior IT Platform Engineer may have access to confidential company, employee, operational, and technical information. The employee is expected to safeguard all such information, exercise sound judgment in its handling, and maintain strict confidentiality at all times.

TRAVEL

Travel up to 10% may be required to support corporate headquarters, plants, warehouses, or other business locations.

PHYSICAL DEMANDS

The physical demands described here represent those required to successfully perform the essential functions of this role. Reasonable accommodations may be made for individuals with disabilities.

This position is primarily office-based and requires prolonged periods of sitting, speaking, and hearing. Occasional standing, walking, reaching, bending, or lifting of equipment up to 50 pounds may be required when supporting infrastructure or on-site technology environments. Close vision and color vision may be necessary for working with hardware and cabling.

WORK ENVIRONMENT

The work environment is primarily an office or corporate setting, with occasional visits to plant, warehouse, or technical facilities. The role may involve exposure to operational or industrial environments when supporting manufacturing technology. Reasonable accommodations may be made to enable individuals with disabilities to perform essential job functions.

Petaluma, CA
Full time

Published on 03/01/2026
