Senior Site Reliability Engineer (SRE)

October 3, 2024

Golabs

Quesada

About the job

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our team. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our infrastructure while collaborating with development teams to improve overall system architecture. This will be a 6-month project, with the possibility of extension.

Responsibilities

Design, implement, and manage scalable, reliable, and secure cloud infrastructure on AWS.
Collaborate with development teams to build and maintain highly available systems and services, using Java and AWS.
Automate infrastructure provisioning and management using Terraform and other Infrastructure as Code (IaC) tools.
Implement, manage, and optimize containerized applications using Kubernetes and containerization best practices.
Develop and maintain scripts (e.g., Linux shell scripts) to automate operational tasks and system maintenance.
Set up and maintain monitoring and alerting systems using Grafana, Prometheus, and other monitoring tools to track system health and performance.
Troubleshoot and resolve infrastructure and application-related issues in production, ensuring minimal downtime.
Lead incident response and post-incident reviews, implementing best practices to prevent future occurrences.
Drive continuous improvements in system reliability, scalability, and operational efficiency.

Qualifications

5+ years of experience in Site Reliability Engineering or similar roles.
Extensive experience with AWS and strong knowledge of its services (e.g., EC2, S3, RDS, Lambda).
Proficiency in Java for application performance tuning and troubleshooting.
Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform.
Expertise in container orchestration using Kubernetes and Docker.
Strong Linux/shell scripting skills (additional scripting languages such as Python or Bash are a plus).
Experience with monitoring and alerting tools such as Grafana, Prometheus, or CloudWatch.
Solid understanding of CI/CD pipelines and experience integrating infrastructure into automated deployment workflows.
Experience with incident management, including root cause analysis and post-mortem processes.
Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.
Intermediate English proficiency, at least B1/B2 level.

Why Join Us?

Full-time position
Payment in US dollars
100% remote anywhere in LATAM
12 PTO days per year
Paid time off on your country's holidays
Paid day off on your birthday
Career Path
Recognition Program
Paid Leaves

If you meet these requirements and are interested in applying for this position, please let us know. We look forward to the possibility of working with you.

We regret to inform you that this job opportunity is no longer available