Svitla Systems Inc. is looking for a Senior Site Reliability Engineer for a full-time position (40 hours per week) in Mexico/Costa Rica.

Our client operates multiple online platforms providing spiritual guidance on love, relationships, and life, with a portfolio of over 20 brands in the spiritual and emotional wellness space.

You’ll build the automation that routes every incident to the right team instantly and own the metrics (MTTD, MTTR, and uptime) that keep the 24/7 marketplace dependable for the customers who rely on it. You’ll transform incident response from manual, knowledge-dependent triage into an engineered, self-improving system that lets our engineers build instead of firefighting.

Requirements
- 3+ years of experience in Site Reliability Engineering, DevOps, or production operations, including hands-on work with incident management and observability tooling.
- Hands-on experience in Site Reliability Engineering with incident management. You’ve built or operated automated paging and routing with tools like PagerDuty, New Relic (or comparable observability platforms such as Datadog, Grafana, or Splunk), and incident management workflows.
- Experience in SRE with large e-commerce platforms.
- Expertise in automation and scripting. You’re comfortable writing Python (or a similar language) and have experience working with REST APIs, webhooks, and integrations to wire tools together; familiarity with AWS and Azure DevOps pipelines is a strong plus given the stack.
- A documentation and enablement mindset. You don’t just build systems; you also have experience with the runbooks, SOPs, and knowledge base that let others operate them, and you genuinely enjoy teaching a team to operate what you’ve built.
- Self-starter who operates without a playbook. This role defines the playbook. You don’t wait for instructions; you spot the gaps yourself. You’re comfortable presenting plans and trade-offs directly to engineering leadership and the executive suite.
- Clear, data-driven communicator. You frame decisions in terms of metrics (MTTD, MTTR, uptime) and can make the case for what to automate, what to filter, and when a coverage gap is a tooling problem rather than a process problem.
- A degree in Computer Science or a related field is welcome but not required. Demonstrated experience building and running reliable systems matters more to us than credentials.

Responsibilities
- Build the automated incident routing system. Design and own the PagerDuty Service Directory.
- Integrate the alert sources into one routing pipeline. Connect New Relic (primary), Splunk, Zabbix, and Slack-reported issues into automated routing via PagerDuty.
- Make incident response metrics-driven. Stand up a dashboard tracking MTTD, MTTA, MTTR, missed-page rate, and wrong-team routing rate, and drive measurable improvement against baseline.
- Cut alert noise and after-hours burden. Tune alert thresholds and implement AIOps-style grouping so that cascading symptoms generate a single actionable page, not many.
- Codify the operating model. Author and maintain runbooks, SOPs, on-call schedules, and the knowledge base, including explicit “wake someone now vs. wait until morning” criteria for every service.
- Level up the team. Document the system and train the existing Site Operations team to own and extend the automation, runbooks, and alert tuning as the function matures.
- Enable the follow-the-sun transition. Establish coverage across the global footprint (Argentina, Paris, Berlin, Tel Aviv, Ukraine, India) so incident response occurs during normal business hours, eliminating the need for a 24/7 staffed overnight shift.

WE OFFER
- US and EU projects based on advanced technologies.
- Competitive compensation based on skills and experience.
- Regular performance appraisals to support your growth.
- Flexibility in workspace: work remotely or from our welcoming office.
- Bonuses for article writing, public talks, and other activities.
- Generous time off, including vacation, national holidays, sick leave, and family days.
- Personalized learning programs tailored to your interests and skill development.
- Free tech webinars and meetups organized by Svitla.
- Regular corporate online activities.
- Awesome team and a friendly, supportive community!
Senior Site Reliability Engineer
SVITLA SYSTEMS, INC.
Remote
Posted 4 days ago