Job Description

We are seeking a skilled mid-level Senior DevOps Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of enterprise services hosted across Cloud Service Providers (CSPs) and on-prem data centers. The SRE is responsible for the practical implementation of Site Reliability Engineering principles through best practices, operations, and monitoring. Speed and stability are carefully balanced, and the SRE team acts as versatile problem solvers, filling gaps in knowledge and expertise to ensure efficient software operations.

If you are a proactive problem solver with a passion for continuous learning and innovation, join us as we endeavor to increase the dynamism and efficacy of our DevOps practices.

Applicant Requirements:

Must be a US citizen or be authorized to work in the United States.
Must have lived in the USA for three (3) of the last five (5) years.
Must be able to obtain a US federal government badge and be eligible for Public Trust clearance.
Must be able to pass a VITG background check, including a drug test.

We’re looking for candidates who:

Demonstrate hands-on expertise in SRE principles, with a strong understanding of maintaining the quality and stability of enterprise services in a continuous development environment
Possess experience designing and developing solutions using various AWS services
Possess experience developing scripts in Shell/Bash and Python and deploying them as Step Functions/Lambda functions
Possess experience administering and monitoring with observability tools such as Splunk, Datadog, and New Relic
Possess extensive knowledge in troubleshooting issues while leveraging monitoring tools like Splunk, Datadog, New Relic, AWS services, etc.
Possess skill in identifying, analyzing, and documenting root causes
Possess a strong technical background and be able to provide clear explanations of technical concepts verbally and in writing
Demonstrate ability and passion to learn new technologies quickly and perform Proof of Concepts (POCs) based on project needs
Apply strong problem-solving skills in monitoring system performance, troubleshooting issues, crisis management, etc.
Produce high-quality work independently and collaboratively
Excel in a fast-paced environment
Demonstrate effective communication and collaboration as a team player

Job Responsibilities:

Design and develop monitoring solutions leveraging approved AWS services using Infrastructure as Code (IaC) tools.
Develop and maintain CI/CD pipelines using GitHub and Jenkins.
Develop serverless functions and scripts using Python, curl, and/or Bash.
Leverage observability best practices to proactively identify potential software issues and implement preventive measures to minimize potential for system incidents and outages.
Set and monitor critical metrics to gain insights into system reliability, including latency, traffic, errors, and saturation levels.
Learn and adopt new technologies to perform Proofs of Concept (POCs) based on project needs.
Provide guidance, training, and support for external development teams to manage their infrastructure independently.
Develop, publish, and maintain all required documentation in the repository and ticketing system (i.e., Confluence and Jira).
Respond quickly and effectively to critical incidents, and conduct post-incident reviews to identify root causes and implement preventive measures.
Collaborate effectively with cross-functional teams and communicate SRE concepts and recommendations clearly to both technical and non-technical stakeholders.
Participate in reliability-based release management processes.
Plan, participate in, and manage on-call rotations to ensure prompt response to reported performance and reliability issues.
Attend ongoing and ad hoc meetings with internal and external stakeholders.
Stay up-to-date with the latest industry trends, technologies, and best practices related to SRE, DevOps, and infrastructure management.

Our Tech Stack (Must have):

CI/CD: GitHub, Jenkins, Terraform, CloudFormation, Containers, Docker
Cloud Infrastructure: AWS, Azure
Monitoring & Alerting: Datadog, AWS CloudWatch (including canaries and X-Ray), Splunk (Enterprise, ITSI, and On-Call), New Relic
OS: Windows servers, Amazon Linux, Red Hat, Citrix VDI

Certifications

AWS Certified SysOps/DevOps Associate or equivalent AWS certification (Required)
Splunk Core Certified Certification (Strongly Preferred)
Datadog Certification (Strongly Preferred)

Job Type: Full Time (No 1099 or C2C)

Salary: BOE

Benefits:

401(k) with employer contribution
Medical/Dental/Vision insurance (option for full coverage for employee)
Life, ST/LT insurance
Professional development opportunities
Company-paid holidays and paid vacation (PTO)

Schedule:

8-hour shift during core business hours
May include minimal after-hours support depending on on-call schedule

Work Type:

Currently hybrid remote in Ellicott City, MD 21043
Minimum 2 days in office weekly

Senior DevOps/SRE Engineer