Job Description
We are seeking a skilled mid-level Senior DevOps Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of enterprise services hosted across Cloud Service Providers (CSPs) and on-prem data centers. The SRE is responsible for putting SRE principles into practice through best practices, operations, and monitoring. The role carefully balances speed and stability, and the SRE team acts as versatile problem solvers, filling gaps in knowledge and expertise to ensure efficient software operations.
If you are a proactive problem solver with a passion for continuous learning and innovation, join us as we endeavor to increase the dynamism and efficacy of our DevOps practices.
Applicant Requirements:
- Must be a US citizen or authorized to work in the United States.
- Must have lived in the USA for three (3) of the last five (5) years.
- Must be able to obtain a US federal government badge and be eligible for Public Trust clearance.
- Must be able to pass a VITG background check, including a drug test.
We’re looking for candidates who:
- Demonstrate hands-on expertise in SRE principles, with a strong understanding of how to maintain the quality and stability of enterprise services in a continuous development environment
- Have experience designing and developing solutions using various AWS services
- Have experience developing scripts in Shell/Bash and Python and deploying them as Step Functions or Lambda functions
- Have experience monitoring with and administering observability tools such as Splunk, Datadog, and New Relic
- Possess extensive knowledge of troubleshooting issues using monitoring tools such as Splunk, Datadog, New Relic, and AWS services
- Are skilled at performing and documenting root cause analysis
- Possess a strong technical background and can explain technical concepts clearly, both verbally and in writing
- Demonstrate the ability and passion to learn new technologies quickly and perform Proofs of Concept (POCs) based on project needs
- Apply strong problem-solving skills to monitoring system performance, troubleshooting issues, and crisis management
- Produce high-quality work independently and collaboratively
- Excel in a fast-paced environment
- Communicate and collaborate effectively as a team player
Job Responsibilities:
- Design and develop monitoring solutions leveraging approved AWS services using Infrastructure as Code (IaC) tools.
- Develop and maintain CI/CD pipelines using GitHub and Jenkins.
- Develop serverless functions and scripts using Python, curl, and/or Bash.
- Leverage observability best practices to proactively identify potential software issues and implement preventive measures to minimize the potential for system incidents and outages.
- Set and monitor critical metrics to gain insights into system reliability, including latency, traffic, errors, and saturation levels.
- Learn and adopt new technologies to perform Proofs of Concept (POCs) based on project needs.
- Provide guidance, training, and support for external development teams to manage their infrastructure independently.
- Develop, publish, and maintain all required documentation in the repository and ticketing system (i.e., Confluence and Jira).
- Respond quickly and effectively to critical incidents, and conduct post-incident reviews to identify root causes and implement preventive measures.
- Collaborate effectively with cross-functional teams and communicate SRE concepts and recommendations clearly to both technical and non-technical stakeholders.
- Participate in reliability-based release management processes.
- Plan, participate in, and manage on-call rotations to ensure prompt response to reported performance and reliability issues.
- Attend ongoing and ad hoc meetings with internal and external stakeholders.
- Stay up-to-date with the latest industry trends, technologies, and best practices related to SRE, DevOps, and infrastructure management.
Our Tech Stack (Must have):
- CI/CD: GitHub, Jenkins, Terraform, CloudFormation, Containers, Docker
- Cloud Infrastructure: AWS, Azure
- Monitoring & Alerting: Datadog, AWS CloudWatch (including canaries and X-Ray), Splunk (Enterprise, ITSI, and On-Call), New Relic
- OS: Windows servers, Amazon Linux, Red Hat, Citrix VDI
Certifications:
- AWS Certified SysOps/DevOps Associate or equivalent AWS certification (Required)
- Splunk Core certification (Strongly Preferred)
- Datadog Certification (Strongly Preferred)
Job Type: Full Time (No 1099 or C2C)
Salary: Based on experience (BOE)
Benefits:
- 401(k) with employer contribution
- Medical/Dental/Vision insurance (option for full coverage for employee)
- Life, ST/LT insurance
- Professional development opportunities
- Company-paid holidays and paid vacation (PTO)
Schedule:
- 8-hour shift during core business hours
- May include minimal after-hours support depending on on-call schedule
Work Type:
- Currently hybrid remote in Ellicott City, MD 21043
- Minimum 2 days in office weekly