
Senior Site Reliability Engineer
- Madrid
- Permanent
- Full-time
- Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
- Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support rapid delivery cycles.
- Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
- Define and maintain SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
- Develop infrastructure-as-code (Terraform or similar) for repeatable and auditable provisioning.
- Build internal platform tools and automation to support provisioning, monitoring, and operational efficiency.
- Monitor infrastructure and applications to ensure a high-quality user experience.
- Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
- Act as Incident Commander while on call, coordinating cross-team responses to keep services within their SLAs.
- Drive and refine incident response processes, reducing Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR).
- Diagnose and resolve complex issues independently, minimizing the need for external escalation.
- Work closely with software engineers to embed observability, fault tolerance, and reliability principles into service design.
- Automate runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
- Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
- Contribute to security best practices, compliance automation, and cost optimization.
- Bachelor's degree (minimum) in Computer Science, or equivalent practical experience.
- 5+ years of experience as a Site Reliability Engineer or Platform Engineer with strong knowledge of software development best practices.
- Strong hands-on experience with public cloud services (AWS, GCP, Azure) and with supporting SaaS products.
- Strong programming or scripting skills (e.g., Python, Go, Bash), and experience with infrastructure-as-code (e.g., Terraform).
- Proficiency with Kubernetes, container-based deployment (e.g., Docker) and related ecosystems (e.g., Helm).
- Experience supporting multi-tenant microservices architectures.
- Experience with CI/CD pipelines and tools (e.g., Jenkins, GitHub Actions, GitLab CI, FluxCD, Crossplane).
- Experience managing monitoring solutions (e.g., Datadog).
- Comfortable participating in a rotating on-call schedule, managing critical incidents, and leading post-incident reviews.
- At ease with operating and managing production systems, striking the right balance between urgency and methodology.
- Strong system-level troubleshooting skills and a proactive mindset toward incident prevention.
- Deep understanding of Linux systems, networking, and common troubleshooting practices.
- Solid understanding of the network stack (e.g., TCP/IP, VPN), cloud architectures (VPC, subnets, firewalls, load balancers), service meshes (e.g., Istio), and storage (e.g., S3, EBS).
- Knowledge of zero-downtime deployment strategies, such as blue/green and canary releases.
- Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus.
- Experience with chaos engineering or resilience testing practices.
- Excellent problem-solving skills, collaborative mindset, and a strong grasp of agile, iterative development.
- Self-driven, highly organised, and capable of independently managing priorities.
- Curiosity to learn new things and discover new technologies.
- Strong communication, presentation, and team collaboration skills.
- Excellent written and verbal skills in English.