What You’ll Do

Reliability & Operations
- Own availability, latency, and scalability across SaaS and AI systems
- Define and enforce SLOs, SLIs, and error budgets
- Participate in a global on-call rotation (~1 week every 4 weeks)
- Lead incident response and drive blameless postmortems with systemic fixes

Platform & Infrastructure
- Architect and operate on-premise and multi-region, multi-cloud environments
- Manage large-scale Kubernetes workloads
- Build and evolve infrastructure using Terraform and Ansible
- Improve system resilience, fault isolation, and capacity planning

AI/ML & Automation
- Build and scale agentic AI systems for triage, anomaly detection, and self-healing
- Ensure reliability of model serving infrastructure
- Operate, optimize, and scale distributed systems

What You Bring
- 5+ years in SRE, Production Engineering, or Platform Engineering
- Strong experience with cloud providers (AWS/GCP/OCI), Kubernetes, and IaC (Terraform/Ansible)
- Proficiency in Python, Go, or TypeScript
- Experience with distributed systems and AI/ML platforms
- Deep understanding of SLOs, observability, and incident management
- Strong bias toward automation and system-level problem solving

Culture & Growth
- Blameless, transparent postmortems
- Current mix: ~70% operations / 30% engineering, with active investment in automation and AI-driven toil reduction
- Clear growth path into Staff/Principal technical leadership or management

Site Reliability Engineer

RCS TECH
