Design, implement, and maintain scalable infrastructure using Linux and Kubernetes.
Monitor system performance using Prometheus and address potential issues proactively.
Automate operational processes to improve system reliability and efficiency.
Respond to incidents, perform root cause analysis, and implement improvements.
Collaborate with development teams to ensure smooth deployments and high availability.
Create and maintain documentation, runbooks, and operational guidelines.
Promote best practices in reliability, security, and system performance.

Requirements

Strong experience with Linux system administration and troubleshooting.
Strong expertise in Kubernetes cluster management and orchestration.
Strong experience using Prometheus for monitoring and alerting.
Proficiency in scripting languages such as Bash or Python.
Strong problem-solving and incident management skills.
Excellent written and verbal communication skills.
Ability to work independently in a remote, fast-paced environment.
Site Reliability Engineer | Remote
CROSSING HURDLES
Distrito Federal, Distrito Federal
Posted 22 days ago