Senior Site Reliability Engineer
4 days ago
We are a US-based software development outsourcing company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.
Over the past few years, we have hired specialists all over the world while our main development centers were in Ukraine. We are now expanding and growing our centers in other parts of the world. Dev.Pro is open to hiring specialists from other countries as well as Ukrainians who currently live outside Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.
About this opportunity
We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you'll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You'll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.
What's in it for you:
• Join a fast-scaling company shaping the future of AI infrastructure in Europe
• Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads
• Collaborate with a top-tier international team and grow through global AI and cloud events
Is that you?
• 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
• Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
• Strong Python or Go skills for automation and observability
• Infrastructure-as-code experience (Terraform, Ansible, Helm)
• Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
• GPU resource management knowledge (MIG, NCCL, CUDA, containers)
• Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
• Linux systems engineering, CI/CD, and configuration management skills
• Strategic thinking with strong technical and business communication
• Organization, autonomy, adaptability
• Advanced English level
Desirable:
• Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration
Key responsibilities and your contribution
In this role, you'll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.
• Automate deployment, scaling, and lifecycle management of GPU clusters
• Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
• Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
• Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
• Collaborate with teams to optimize performance, resources, and fault recovery at petascale
-
Senior Site Reliability
2 weeks ago
São Paulo, São Paulo, Brazil
Jahnel Group
Full-time
R$120.000 - R$180.000 per year
Jahnel Group's mission is to provide the absolute best environment for software creators to pursue their passion by connecting them with great clients doing meaningful work. We get to build some of the most complex and compelling applications for our clients located across the country. We're a fast-growing, Inc. 5000-recognized company, yet we still work as...
-
Site Reliability Engineer
1 week ago
São Paulo, São Paulo, Brazil
act digital
Full-time
Senior SAP Site Reliability Engineer (SRE)
Location: São Paulo – Hybrid (Morumbi Shopping), on-site 3x per week
Language: Conversational English (B2 or above)
Model: Full-time
About the opportunity
We are looking for a Senior SRE with strong experience in SAP S/4HANA operations and reliability, responsible for ensuring stability, performance, and...
-
Site Reliability Engineer
3 days ago
São Paulo, State of São Paulo, Brazil
INDI Staffing Services
Full-time
At INDI, we're passionate about empowering individuals and businesses worldwide. Our cutting-edge recruiters connect leading companies with top talent, fostering a dynamic environment where innovation thrives. Join us in shaping the future of work.
Overview of the role:
We are looking for a Site Reliability Engineer to build and maintain highly reliable,...
-
Site Reliability Engineer
5 days ago
São Paulo, State of São Paulo, Brazil
Mouts TI
Full-time
At Mouts TI, we deliver solutions that drive digital transformation in an agile, efficient, and uncomplicated way.
We are looking for an SRE (Site Reliability Engineer) to work on-site, focusing on infrastructure, automation, and observability in mission-critical environments.
Responsibilities:
Implement and manage observability solutions...
-
Site Reliability Engineer
2 weeks ago
São Paulo, São Paulo, Brazil
Thales
Full-time
R$96.000 - R$180.000 per year
Thales people architect identity management and data protection solutions at the heart of digital security. Businesses and governments rely on us to bring trust to the billions of digital interactions they have with people. Our technologies and services help banks exchange funds, people cross borders, energy become smarter and much more. More than 30,000...
-
Site Reliability Engineer
2 days ago
São Paulo, São Paulo, Brazil
DELIVER IT
Full-time
R$60.000 - R$120.000 per year
Are you someone with solid experience in reliability engineering, strategic thinking, and a collaborative profile, who constantly seeks to raise the technical level of the teams and systems you work with? Then this opportunity is for you.
We are looking for a Senior SRE (Site Reliability Engineer) to join a technical team of...
-
Senior Frontend Engineer, Reliability Experience
2 weeks ago
São Paulo, São Paulo, Brazil
Airbnb
Full-time
R$10.000 - R$20.000 per year
The Community You Will Join:
Our team, Reliability Experience, is responsible for the ideation, development, and maintenance of opinionated UX across the Reliability Engineering ecosystem at Airbnb. We chart the paved path that all platform, infra, and product engineers rely upon to effectively monitor, investigate, and debug system health across Airbnb's...
-
Sr Site Reliability Engineer
2 days ago
São Paulo, São Paulo, Brazil
Workana
Full-time
R$150.000 - R$200.000 per year
At Workana, we are looking for a Senior Site Reliability Engineer (SRE) to join the team of one of our clients and play an essential role in maintaining, automating, and improving the reliability of the systems that power its logistics network across multiple regions.
About the client:
It is a platform that manages logistics flows...
-
Site Reliability Engineer
6 days ago
São Paulo, São Paulo, Brazil
DELIVER IT
Full-time
R$80.000 - R$120.000 per year
Do you consider yourself someone who is hungry to learn, enjoys working in a team, and aims to grow in your career? Then this opportunity is for you.
We are looking for a Junior SRE (Site Reliability Engineer) to join a highly technical team committed to operational excellence. The professional will work with a focus on...
-
Senior Site Reliability Engineer
1 week ago
São Paulo, São Paulo, Brazil
Lend
Full-time
R$80.000 - R$120.000 per year
We are looking for a Senior Site Reliability Engineer to design, operate, and evolve a credit infrastructure that will transform the Brazilian financial market. You will be responsible for ensuring that our platform is reliable, scalable, secure, and cost-efficient, directly impacting our customers and shaping how credit will be...