Company: QUALCOMM SEMICONDUCTORES Y SISTEMAS AVANZADOS DE BAJA CALIFORNIA
Job Area: Engineering Group, Engineering Group > Software Engineering

General Summary
Lead the design, build, and management of cloud infrastructure and operate large‑scale distributed systems at Qualcomm. Work across cloud, Kubernetes, CI/CD, data platforms, AI‑assisted SRE, and monitoring to deliver reliable, scalable, cost‑efficient services.

Cloud Infrastructure & Infrastructure as Code
- Design, build, and manage cloud infrastructure with a primary focus on AWS, integrated with OpenStack environments
- Build and maintain Infrastructure as Code using Terraform, Ansible, and Kubernetes manifests/Helm
- Design infrastructure solutions for scalability, high availability, performance, reliability, and cost efficiency
- Implement redundancy, failover, and disaster‑recovery patterns across services and regions
- Perform capacity planning based on performance metrics, usage trends, and utilization data

Kubernetes & Platform Reliability
- Operate and scale production Kubernetes clusters in large‑scale environments
- Partner with development and QA teams to improve system reliability and resiliency, and to automate scalability and availability mechanisms
- Apply SRE principles, including service reliability ownership, proactive failure prevention, and continuous improvement of operational processes
- Support microservices‑based and distributed system architectures

CI/CD, Automation & Operational Excellence
- Manage and evolve CI/CD pipelines (e.g., Jenkins)
- Automate infrastructure provisioning, configuration, and lifecycle management
- Write, maintain, and improve runbooks for operational processes
- Build automation to reduce manual intervention and operational toil
- Plan and execute infrastructure upgrades and maintenance activities
- Proactively identify and address technical and infrastructure debt

Data Platforms & Streaming Systems
- Operate, tune, and scale data and streaming platforms including Kafka, ZooKeeper, NiFi, Elasticsearch, MySQL, and Vertica
- Diagnose and resolve performance and stability issues across data pipelines
- Ensure data platform reliability, throughput, and resilience at scale

AI‑Assisted SRE & Intelligent Automation
- Design and maintain knowledge‑driven automated runbooks and operational bots
- Develop AI‑assisted operational workflows, including incident analysis and summarization, intelligent diagnostics and remediation suggestions, and automation of repetitive operational decision‑making
- Work with LLM‑based agent frameworks (e.g., Claude Agent SDK): integrate agents with logs, metrics, monitoring, and internal tools; implement guard‑railed, controlled‑action automation for production use; and research new AI‑driven approaches to improve reliability and efficiency

Monitoring, Reliability & Incident Management
- Design and operate monitoring and observability systems using Prometheus, Grafana, and the ELK stack
- Improve alert quality, signal‑to‑noise ratio, and troubleshooting efficiency
- Lead incident response activities, root cause analysis, and post‑incident reviews
- Support software engineers in debugging complex production issues across distributed systems
- Embed reliability, automation, and operational readiness into system design

Experience Required
- Extensive experience operating large‑scale distributed cloud systems
- Hands‑on experience with AWS in production environments
- Direct experience working with OpenStack
- Strong Linux background in large‑scale SaaS or production systems
- Ability to maintain and improve existing mission‑critical systems, prioritize and systematically reduce technical and infrastructure debt, and design for operational excellence

Required Skills
- Programming: Strong experience with Python and/or Go
- Cloud & IaC: Terraform, Ansible, CloudFormation or equivalent
- Containers: Kubernetes (production experience)
- CI/CD: Jenkins and modern CI/CD practices
- Data & Streaming: Kafka, NiFi, Elasticsearch, MySQL, Vertica, ZooKeeper
- Observability: Prometheus, Grafana, ELK
- Infrastructure: Nginx, Linux internals
- AI / Automation (advantage): Experience integrating AI or LLMs into operational workflows, familiarity with agent‑based automation concepts

Experience Guidelines
- 3+ years overall experience managing infrastructure
- 3+ years Linux administration in large‑scale environments
- 3+ years operating production systems on AWS and/or OpenStack
- 3+ years managing Kubernetes in production
- 3+ years using infrastructure as code
- 3+ years working with CI/CD systems

Minimum Qualifications
- Bachelor’s degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience; or
- Master’s degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Engineering or related work experience; or
- PhD in Engineering, Information Systems, Computer Science, or related field.
- 2+ years of academic or work experience with programming languages such as C, C++, Java, Python, etc.

Applicants
Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, Qualcomm is committed to providing an accessible process. For accommodations, contact disability‑ or call Qualcomm’s toll‑free number.
Site Reliability Engineer (SRE) – Regional Multi Project Platform
QUALCOMM
Tijuana, Tijuana
Posted 22 days ago