Senior Site Reliability Engineer

2 weeks ago


São Paulo, Brazil Dev.Pro Full-time

Senior Site Reliability Engineer - OPS00023

We are a US‑based outsource software development company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders. Over the past few years, we have been hiring specialists all over the world, while our main development centers were in Ukraine. Now we keep expanding and are growing our centers in different parts of the world. Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who now live outside Ukraine.

We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About This Opportunity

We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you’ll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You’ll work with NVIDIA, Slurm, and Kubernetes to turn bare‑metal GPU clusters into high‑performance AI infrastructure.

What's in it for you

  • Join a fast‑scaling company shaping the future of AI infrastructure in Europe
  • Scale, optimize, and automate bare‑metal GPU clusters for some of the most compute‑intensive AI workloads
  • Collaborate with a top‑tier international team and grow through global AI and cloud events

Is that you

  • 5+ years as an SRE, DevOps, or HPC engineer in large‑scale compute environments
  • Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
  • Strong Python or Go skills for automation and observability
  • Infrastructure‑as‑code experience (Terraform, Ansible, Helm)
  • Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
  • GPU resource management knowledge (MIG, NCCL, CUDA, containers)
  • Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
  • Linux systems engineering, CI/CD, and configuration management skills
  • Strategic thinking with strong technical and business communication
  • Organization, autonomy, adaptability
  • Advanced English level

Desirable

  • Exposure to BlueField DPU, NVSwitch, or Slurm‑on‑Kubernetes hybrid orchestration

Key Responsibilities And Your Contribution

In this role, you’ll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.

  • Automate deployment, scaling, and lifecycle management of GPU clusters
  • Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
  • Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
  • Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
  • Collaborate with teams to optimize performance, resources, and fault recovery at petascale



  • São Paulo, Brazil K2 Solutions Full-time

    Hybrid work in the Pinheiros area of São Paulo, 3 days a week in the office. We are looking for a Senior Site Reliability Engineer (SRE) to join our team and play an essential role in maintaining, automating, and improving the reliability of the systems that power the company's logistics network across multiple regions. This person...


  • São Paulo, Brazil Lend Full-time

    We are looking for a Senior Site Reliability Engineer to design, operate, and evolve a credit infrastructure that will transform the Brazilian financial market. You will be responsible for ensuring that our platform is reliable, scalable, secure, and cost-efficient, directly impacting our customers and shaping how credit will be...


  • São Paulo, Brazil Canonical Full-time

    Senior Site Reliability / Gitops Engineer at Canonical. Posted 1 day ago.


  • São Paulo, Brazil Chainlink Labs Full-time

    Senior Site Reliability Engineer at Chainlink Labs. Posted 2 weeks ago. About Us: Chainlink Labs is the primary contributing developer of Chainlink, the decentralized...


  • São Paulo, Brazil Mouts TI Full-time

    At Mouts TI, we deliver solutions that drive digital transformation in an agile, efficient, and uncomplicated way. We are looking for an SRE (Site Reliability Engineer) to work on-site, focusing on infrastructure, automation, and observability in mission-critical environments. Responsibilities: Implement and manage observability solutions


  • São Paulo, Brazil PayRetailers Full-time

    Site Reliability Engineer Join PayRetailers in São Paulo. We are expanding across Latin America and Africa, building cutting‑edge payment solutions. We value creativity, growth, and collaboration. About the role Site Reliability Engineers are guardians of our reliability promise. They deliver a highly reliable, resilient, and cost‑efficient platform...


  • São Paulo, Brazil INDI Staffing Services Full-time

    Overview: We are looking for a Site Reliability Engineer to build and maintain highly reliable, scalable, and secure OpenShift/Kubernetes clusters. Approach the problem of building and maintaining production systems from a software engineering perspective with a focus on automation and reliability. Responsibilities: Build, automate, and maintain...

