QUALCOMM SEMICONDUCTORES Y SISTEMAS AVANZADOS DE BAJA CALIFORNIA

Job Area: Engineering Group, Engineering Group Software Engineering

General Summary

Cloud Infrastructure & Infrastructure as Code
- Design, build, and manage cloud infrastructure with a primary focus on AWS, integrated with OpenStack environments
- Build and maintain Infrastructure as Code using:
  - Terraform
  - Ansible
  - Kubernetes (manifests / Helm)
- Design infrastructure solutions for:
  - Scalability
  - High availability
  - Performance
  - Reliability
  - Cost efficiency
- Implement redundancy, failover, and disaster‑recovery patterns across services and regions
- Perform capacity planning based on performance metrics, usage trends, and utilization data

Kubernetes & Platform Reliability
- Operate and scale production Kubernetes clusters in large‑scale environments
- Partner with development and QA teams to:
  - Improve system reliability and resiliency
  - Automate scalability and availability mechanisms
- Apply SRE principles including:
  - Service reliability ownership
  - Proactive failure prevention
  - Continuous improvement of operational processes
- Support microservices‑based and distributed system architectures

CI/CD, Automation & Operational Excellence
- Manage and evolve CI/CD pipelines (e.g., Jenkins)
- Automate infrastructure provisioning, configuration, and lifecycle management
- Write, maintain, and improve runbooks for operational processes
- Build automation to reduce manual intervention and operational toil
- Plan and execute infrastructure upgrades and maintenance activities
- Proactively identify and address technical and infrastructure debt

Data Platforms & Streaming Systems
- Operate, tune, and scale data and streaming platforms, including:
  - Kafka, Zookeeper
  - NiFi
  - Elasticsearch
  - MySQL, Vertica
- Diagnose and resolve performance and stability issues across data pipelines
- Ensure data platform reliability, throughput, and resilience at scale

AI‑Assisted SRE & Intelligent Automation
- Design and maintain knowledge‑driven automated runbooks and operational bots
- Develop AI‑assisted operational workflows, including:
  - Incident analysis and summarization
  - Intelligent diagnostics and remediation suggestions
  - Automation of repetitive operational decision‑making
- Work with LLM‑based agent frameworks (e.g., Claude Agent SDK or similar):
  - Integrate agents with logs, metrics, monitoring, and internal tools
  - Implement guard‑railed, controlled‑action automation for production use
- Research and propose new concepts, tools, and AI‑driven approaches to improve reliability and efficiency

Monitoring, Reliability & Incident Management
- Design and operate monitoring and observability systems using:
  - Prometheus
  - Grafana
  - ELK stack
- Improve alert quality, signal‑to‑noise ratio, and troubleshooting efficiency
- Lead incident response activities, root cause analysis, and post‑incident reviews
- Support software engineers in debugging complex production issues across distributed systems
- Embed reliability, automation, and operational readiness into system design

Experience Required
- Extensive experience operating large‑scale distributed cloud systems
- Hands‑on experience with AWS in production environments
- Direct experience working with OpenStack
- Strong Linux background in large‑scale SaaS or production systems
- Ability to:
  - Maintain and improve existing mission‑critical systems
  - Prioritize and systematically reduce technical and infrastructure debt
- Strong understanding of designing for operational excellence, not just greenfield solutions

Required Skills
- Programming: Strong experience with Python and/or Go
- Cloud & IaC: Terraform, Ansible, CloudFormation or equivalent
- Containers: Kubernetes (production experience)
- CI/CD: Jenkins and modern CI/CD practices
- Data & Streaming: Kafka, NiFi, Elasticsearch, MySQL, Vertica, Zookeeper
- Observability: Prometheus, Grafana, ELK
- Infrastructure: Nginx, Linux internals
- AI / Automation (advantage):
  - Experience integrating AI or LLMs into operational workflows
  - Familiarity with agent‑based automation concepts

Experience Guidelines
3+ years in:
- Overall experience managing infrastructure
- Linux administration in large‑scale environments
- Operating production systems on AWS and/or OpenStack
- Managing Kubernetes in production
- Using infrastructure as code
- Working with CI/CD systems

Minimum Qualifications
- Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.
- Master's degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Engineering or related work experience.
- PhD in Engineering, Information Systems, Computer Science, or related field.
- 2+ years of academic or work experience with a programming language such as C, C++, Java, Python, etc.

Equal Opportunity Employer Statement
Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. You may e-mail or call Qualcomm's toll-free number found here. Qualcomm is also committed to making our workplace accessible for individuals with disabilities.
Site Reliability Engineer (SRE) – Regional Multi Project Platform
QUALCOMM
Tijuana, Tijuana
Posted 7 days ago