Senior DevOps/SRE Engineer

Job Description

We are seeking a skilled mid-level Senior DevOps Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of enterprise services hosted across Cloud Service Providers (CSPs) and on-prem data centers. The SRE is responsible for the practical implementation of SRE principles through best practices, operations, and monitoring. The SRE team carefully balances speed and stability and acts as versatile problem solvers, filling gaps in knowledge and expertise to ensure efficient software operations.

If you are a proactive problem solver with a passion for continuous learning and innovation, join us as we endeavor to increase the dynamism and efficacy of our DevOps practices.

Applicant Requirements:

  • Must be a US citizen or be authorized to work in the United States.
  • Must have lived in the USA for three (3) of the last five (5) years.
  • Must be able to obtain a US federal government badge and be eligible for Public Trust clearance.
  • Must be able to pass a VITG background check, including a drug test.

We’re looking for candidates who:

  • Demonstrate hands-on expertise in SRE principles, with a strong understanding of maintaining the quality and stability of enterprise services in a continuous development environment
  • Have experience designing and developing solutions using various AWS services
  • Have experience developing scripts in Shell/Bash and Python and deploying them as AWS Step Functions or Lambda functions
  • Have experience using and administering monitoring and observability tools such as Splunk, Datadog, and New Relic
  • Possess extensive knowledge of troubleshooting issues using monitoring tools such as Splunk, Datadog, New Relic, and AWS services
  • Possess skill in performing and documenting root cause analysis
  • Possess a strong technical background and be able to provide clear explanations of technical concepts verbally and in writing
  • Demonstrate ability and passion to learn new technologies quickly and perform Proof of Concepts (POCs) based on project needs
  • Apply strong problem-solving skills in monitoring system performance, troubleshooting issues, crisis management, etc.
  • Produce high-quality work independently and collaboratively
  • Excel in a fast-paced environment
  • Demonstrate effective communication and collaboration, and be a team player.

Job Responsibilities:

  • Design and develop monitoring solutions leveraging approved AWS services using Infrastructure as Code (IaC) tools.
  • Develop and maintain CI/CD pipelines using GitHub and Jenkins.
  • Develop serverless functions and scripts using Python, curl, and/or Bash.
  • Leverage observability best practices to proactively identify potential software issues and implement preventive measures to minimize potential for system incidents and outages.
  • Set and monitor critical metrics to gain insights into system reliability, including latency, traffic, errors, and saturation levels.
  • Learn and adopt new technologies to perform Proof of Concepts (POCs) based on project needs.
  • Provide guidance, training, and support for external development teams to manage their infrastructure independently.
  • Develop, publish, and maintain all required documentation in the repository and ticketing system (i.e., Confluence and Jira).
  • Respond quickly and effectively to critical incidents, and conduct post-incident reviews to identify root causes and implement preventive measures.
  • Collaborate effectively with cross-functional teams and communicate SRE concepts and recommendations clearly to both technical and non-technical stakeholders.
  • Participate in reliability-based release management processes.
  • Plan, participate in, and manage on-call rotations to ensure prompt response to reported performance and reliability issues.
  • Attend ongoing and ad hoc meetings with internal and external stakeholders.
  • Stay up-to-date with the latest industry trends, technologies, and best practices related to SRE, DevOps, and infrastructure management.

Our Tech Stack (Must have):

  • CI/CD: GitHub, Jenkins, Terraform, CloudFormation, Containers, Docker
  • Cloud Infrastructure: AWS, Azure
  • Monitoring & Alerting: Datadog, AWS CloudWatch (including canaries and X-Ray), Splunk (Enterprise, ITSI, and On-Call), New Relic
  • OS: Windows servers, Amazon Linux, Red Hat, Citrix VDI

Certifications

  • AWS Certified SysOps/DevOps Associate or equivalent AWS certification (Required)
  • Splunk Core Certified Certification (Strongly )
  • Datadog Certification (Strongly )

Job Type: Full Time (No 1099 or C2C)

Salary: Based on experience (BOE)

Benefits:

  • 401(k) with employer contribution
  • Medical/Dental/Vision insurance (option for full coverage for employee)
  • Life and short-term/long-term disability (ST/LT) insurance
  • Professional development opportunities
  • Company-paid holidays and paid vacation (PTO)

Schedule:

  • 8-hour shift during core business hours
  • May include minimal after-hours support depending on on-call schedule

Work Type:

  • Currently hybrid remote in Ellicott City, MD 21043
  • Minimum 2 days in office weekly

Ellicott City, MD
Full time

Published on 12/28/2025
