Senior Site Reliability Engineer

18 hours ago


São Paulo, São Paulo, Brazil Dev Full-time R$80,000 - R$160,000 per year

We are a US-based software development outsourcing company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.

Over the past few years, we've been hiring specialists all over the world, while our main development centers were in Ukraine. Now we continue to expand and are growing our centers in different parts of the world. Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who now live outside Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About this opportunity

We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you'll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You'll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.

What's in it for you:


• Join a fast-scaling company shaping the future of AI infrastructure in Europe


• Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads


• Collaborate with a top-tier international team and grow through global AI and cloud events

Is that you?


• 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments


• Expertise in HPC workload managers (Slurm, PBS Pro, LSF)


• Strong Python or Go skills for automation and observability


• Infrastructure-as-code experience (Terraform, Ansible, Helm)


• Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)


• GPU resource management knowledge (MIG, NCCL, CUDA, containers)


• Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)


• Linux systems engineering, CI/CD, and configuration management skills


• Strategic thinking with strong technical and business communication


• Organization, autonomy, adaptability


• Advanced English level

Desirable:


• Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration

Key responsibilities and your contribution

In this role, you'll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.


• Automate deployment, scaling, and lifecycle management of GPU clusters


• Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity


• Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers


• Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation


• Collaborate with teams to optimize performance, resources, and fault recovery at petascale



  • São Paulo, São Paulo, Brazil act digital Full-time

    SAP Site Reliability Engineer (SRE) Senior. Location: São Paulo, hybrid (Morumbi Shopping), on-site 3x per week. Language: conversational English (B2 or above). Model: full-time. About the opportunity: We are looking for a Senior SRE with strong experience in SAP S/4HANA operations and reliability, responsible for ensuring stability, performance, and...

  • Site Reliability Engineer

    2 weeks ago


    São Paulo, São Paulo, Brazil Enter Full-time R$80,000 - R$160,000 per year

    Enter (formerly Talisman AI) was founded in 2023 with the mission of making Brazil a leading player in Artificial Intelligence. We combine human expertise with the efficiency of AI to help large Latin American companies optimize critical, high-volume processes that require intensive manual work. We began our journey by applying AI to...


  • São Paulo, São Paulo, Brazil WEX Inc. Full-time R$80,000 - R$160,000 per year

    About the Team/Role: The WEX Site Reliability Engineering (SRE) team seeks individuals passionate about developing software and solutions for observability, incident response, reliability, performance, operational excellence, and compliance. As part of the Site Reliability Engineering organization, you will support internal stakeholders and Payment Platform...


  • São Paulo, São Paulo, Brazil Loadsmart Full-time R$80,000 - R$120,000 per year

    ARE YOU INTERESTED IN JOINING AN INNOVATIVE LOGISTICS TECHNOLOGY COMPANY? Loadsmart is a growth-stage technology company valued at over $1 billion (a true Tech Unicorn). We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight by helping shippers, brokers, warehouses and...


  • São Paulo, São Paulo, Brazil Loadsmart Full-time R$120,000 - R$240,000 per year

    ARE YOU INTERESTED IN JOINING AN INNOVATIVE LOGISTICS TECHNOLOGY COMPANY? Loadsmart is a growth-stage technology company valued at over $1 billion (a true Tech Unicorn). We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight by helping shippers, brokers, warehouses and...


  • São Paulo, São Paulo, Brazil Loadsmart Full-time R$80,000 - R$120,000 per year

    ARE YOU INTERESTED IN JOINING AN INNOVATIVE LOGISTICS TECHNOLOGY COMPANY? Loadsmart is a growth-stage technology company valued at over $1 billion (a true Tech Unicorn). We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight by helping shippers, brokers, warehouses and...

  • Site Reliability Engineer

    2 weeks ago


    São Paulo, São Paulo, Brazil Thales Full-time R$80,000 - R$120,000 per year

    Thales people architect identity management and data protection solutions at the heart of digital security. Businesses and governments rely on us to bring trust to the billions of digital interactions they have with people. Our technologies and services help banks exchange funds, people cross borders, energy become smarter and much more. More than 30,000...


  • São Paulo, São Paulo, Brazil DELIVER IT Full-time R$80,000 - R$120,000 per year

    Do you consider yourself someone with a thirst for learning who enjoys teamwork and aspires to career growth? Then this opportunity is for you. We are looking for a Junior SRE (Site Reliability Engineer) to join a highly technical team committed to operational excellence. The professional will work with a focus on...

  • Site Reliability Engineer

    1 week ago


    São Paulo, São Paulo, Brazil Foxbit Group Full-time R$60,000 - R$120,000 per year

    We are looking for an SRE (Site Reliability Engineer) to help us ensure the stability, security, and scalability of one of the largest cryptocurrency exchanges in Brazil. The main goal of the SRE team is, together with Development and Security, to ensure system reliability, monitor and improve performance, and automate...


  • São Paulo, São Paulo, Brazil Housecall Pro Full-time US$7,500 - US$15,000 per year

    TO BE CONSIDERED FOR THIS ROLE, PLEASE SUBMIT AN UPDATED RESUME TRANSLATED TO ENGLISH. Who is Housecall Pro? Housecall Pro is a fintech company founded in 2013. We built a SaaS platform that helps Home Service Professionals operate their businesses. We created the application for plumbers, electricians, and other Pros in the home improvement/trades...