Job Title: Platform Site Reliability Engineer (SRE)

We are seeking a Platform Site Reliability Engineer (SRE) to support the reliability, observability, and day-2 operations of modern AI platform environments running performance-sensitive workloads. This role is suited for someone with hands‑on experience in production support, monitoring, alerting, incident response, Linux troubleshooting, operational automation, system software maintenance, and GPU‑enabled platform operations across infrastructure and platform layers.

The ideal candidate has experience with Prometheus, Grafana, and logging/metrics platforms, and can work across compute, platform, DevOps, storage, and network teams to improve service health, reduce alert noise, speed up incident resolution, and strengthen overall platform reliability.

Key Responsibilities

Support reliability and day‑2 operations for production platform environments.
Build and maintain monitoring, alerting, dashboards, and operational reporting across infrastructure and platform services.
Use tools such as Prometheus, Grafana, and related observability platforms to track health, availability, capacity, and performance.
Troubleshoot issues across Linux hosts, containers, platform services, infrastructure dependencies, and GPU‑enabled Kubernetes environments.
Support incident detection, triage, root cause analysis, and post‑incident improvements.
Tune alerts and service checks to improve signal quality and reduce false positives.
Partner with platform, compute, storage, DevOps, and network teams to isolate and resolve production issues.
Automate repetitive operational tasks using Bash, Python, Ansible, or similar tools.
Perform patching and maintenance of system software across platform environments.
Use BMC and related out‑of‑band management interfaces for system reboot, diagnosis, and hardware‑level troubleshooting.
Support operational reliability of Kubernetes‑based AI/ML environments, including GPU‑related components and dependencies.
Maintain runbooks, monitoring standards, alert documentation, and operational procedures.
Contribute to continuous improvement through standardization, automation, and reliability best practices.

Must Have Skills (3–6 years)

Experience supporting production environments with a focus on uptime, stability, and day‑2 operations
Strong Linux administration and troubleshooting skills
Ability to troubleshoot issues across host, service, infrastructure, and platform layers
Hands‑on experience with monitoring and alerting in production systems
Experience with Prometheus, Grafana, or similar observability tools
Experience with incident triage, support operations, and runbook‑driven response
Experience with Kubernetes, containers, and distributed platform environments
Hands‑on experience with NVIDIA GPU plugins for Kubernetes
Solid understanding of fault‑tolerant distributed computing and storage systems, including key reliability, health, and performance metrics to monitor
Hands‑on experience using BMC or similar tools for system reboot, diagnostics, and hardware management
Experience patching system software in production or pre‑production platform environments
Basic scripting or automation experience using Bash, Python, or Ansible
Strong collaboration skills across platform, infrastructure, DevOps, storage, network, and support teams
Experience creating or maintaining dashboards, alerts, SOPs, and operational documentation
Strong adherence to Agile or Kanban ways of working, including delivering work within defined cadences or flow‑based priorities and providing consistent, proactive status updates on progress, risks, and blockers

Nice to Have Skills

Familiarity with SRE practices described in the Google SRE book
Experience writing automated tests or synthetic checks for infrastructure/platform validation
Experience with ELK, Loki, OpenSearch, or similar logging tools
Experience with NVIDIA GPU infrastructure, including DCGM, GPU Operator, or NVAIE
Exposure to hardware‑level telemetry, including BMC, firmware health, thermal monitoring, power monitoring, and lower‑level system health indicators
Exposure to telemetry, exporters, instrumentation, and service health checks
Experience with capacity monitoring, trend analysis, and performance reporting
Familiarity with RCA, postmortems, SLI/SLO concepts, and reliability improvement practices
Exposure to CI/CD pipelines and Git‑based operational workflows
Site Reliability Engineer
CIRRUSLABS
Distrito Federal, Distrito Federal
Posted 7 days ago