Senior DevOps Site Reliability Engineer

Job Description

About the Company:

  • A leading IT services provider is seeking a Senior DevOps Site Reliability Engineer (SRE) to enhance system reliability, scalability, and performance. This is an opportunity to work in a dynamic team, driving automation, monitoring, and infrastructure optimization.

Roles & Responsibilities:

  • Lead and mentor a team of engineers in implementing best practices for reliability engineering.
  • Design and optimize scalable infrastructure, ensuring high availability and fault tolerance.
  • Develop and refine automation tools for monitoring, alerting, and incident response.
  • Manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve system performance.
  • Conduct postmortem analysis and implement preventive measures to mitigate risks.
  • Collaborate with development teams to optimize CI/CD pipelines and deployment workflows.
  • Enhance system observability by implementing and tuning logging, monitoring, and alerting solutions.
  • Perform capacity planning, performance tuning, and cost optimization for cloud environments.
  • Strengthen security and compliance in cloud-native infrastructure.
  • Participate in on-call rotations and handle escalations for critical incidents.

Requirements:

  • BSc in Computer Science, Engineering, or a related field (or equivalent experience).
  • 5+ years of experience in DevOps, SRE, or related infrastructure roles.
  • Expertise in cloud environments (AWS, Google Cloud, Azure) and container orchestration (Kubernetes, Docker Swarm).
  • Deep knowledge of infrastructure-as-code and configuration management tools such as Terraform, Ansible, or SaltStack.
  • Strong proficiency in Python, Go, or Bash for automation and scripting.
  • Experience managing distributed systems, databases (SQL/NoSQL), and caching technologies (Redis, Memcached, Varnish).
  • Solid understanding of networking, load balancing, and high-availability configurations.
  • Hands-on experience with observability tools (Prometheus, Grafana, ELK Stack, etc.).
  • Proven track record in postmortem analysis and implementing long-term reliability solutions.
  • Familiarity with incident management frameworks and ITIL practices.
  • Ability to mentor and guide junior and mid-level engineers in best practices.

Additional Benefits:

  • Hybrid work model for flexibility.
  • Competitive compensation package.
  • Monthly meal allowance.
  • Comprehensive health and life insurance plan.
  • Additional paid time off, including leave for training and education.
  • Extra day off on birthdays.
  • Friday afternoons off in July & August.
