Senior Site Reliability Engineer

1 week ago

São Paulo, São Paulo, Brazil Dev Full-time

We are a US-based software development outsourcing company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.

Over the past few years, we've been hiring specialists all over the world while our main development centers were in Ukraine. Now we keep expanding and are growing our centers in different parts of the world. Dev.Pro is open to hiring specialists from other countries as well as Ukrainians who now live outside of Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About this opportunity

We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you'll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You'll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.

What's in it for you:

• Join a fast-scaling company shaping the future of AI infrastructure in Europe

• Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads

• Collaborate with a top-tier international team and grow through global AI and cloud events

Is that you?

• 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments

• Expertise in HPC workload managers (Slurm, PBS Pro, LSF)

• Strong Python or Go skills for automation and observability

• Infrastructure-as-code experience (Terraform, Ansible, Helm)

• Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)

• GPU resource management knowledge (MIG, NCCL, CUDA, containers)

• Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)

• Linux systems engineering, CI/CD, and configuration management skills

• Strategic thinking with strong technical and business communication

• Organization, autonomy, adaptability

• Advanced English level

Desirable:

• Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration

Key responsibilities and your contribution

In this role, you'll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.

• Automate deployment, scaling, and lifecycle management of GPU clusters

• Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity

• Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers

• Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation

• Collaborate with teams to optimize performance, resources, and fault recovery at petascale

SRE - Senior Site Reliability Engineer

2 days ago

São Paulo, São Paulo, Brazil K2 Solutions Full-time

Hybrid work in the Pinheiros area of São Paulo (SP), 3x per week in the office. We are selecting a Senior Site Reliability Engineer (SRE) to join our team and play an essential role in maintaining, automating, and improving the reliability of the systems that power the company's logistics network across multiple regions. This person...
Site Reliability Engineer

4 days ago

São Paulo, São Paulo, Brazil Truelogic Full-time

About Truelogic: At Truelogic we are a leading provider of nearshore staff augmentation services headquartered in New York. For over two decades, we've been delivering top-tier technology solutions to companies of all sizes, from innovative startups to industry leaders, helping them achieve their digital transformation goals. Our team of 600+ highly skilled...
Senior Site Reliability Engineer

1 week ago

São Paulo, São Paulo, Brazil Enumerate Full-time

Role Overview: We're looking for a Senior Site Reliability Engineer who can own the architecture, governance, and cost efficiency of our cloud and platform infrastructure. In this role you'll design and evolve our production environments, define standards and best practices, and partner with engineering and IT teams to build scalable, reliable systems that are...
Site Reliability Engineer

2 weeks ago

São Paulo, State of São Paulo, Brazil Conquest One Full-time

Position: Senior SRE. Hybrid: 2x per week on-site in Jardim Paulista (Av. Nove de Julho, São Paulo/SP) + 3x per week working from home. Contract type: CLT. Working hours: 09:00 to 18:00. We are looking for a Senior Site Reliability Engineer to play a strategic role in the transformation and evolution of our platforms! If you have...
Senior Site Reliability Engineer

2 days ago

São Paulo, São Paulo, Brazil Dev Full-time

Are you in Brazil or Argentina? Join us as we actively recruit in these locations, offering a comfortable remote environment. Submit your CV in English, and we'll get back to you. We invite a Senior Site Reliability Engineer to join our dynamic team. In this hands-on role, you'll focus on improving the stability, observability, and efficiency of our services....
Sr Site Reliability Engineer

1 week ago

São Paulo, São Paulo, Brazil Workana Full-time

At Workana, we are looking for a Senior Site Reliability Engineer (SRE) to join the team of one of our clients and play an essential role in maintaining, automating, and improving the reliability of the systems that power its logistics network across multiple regions. About the client: It is a platform that manages logistics flows...
Site Reliability Engineer

2 days ago

São Paulo, São Paulo, Brazil WSO2 Full-time

About WSO2: Founded in 2005, WSO2 is the largest independent software vendor providing open-source API management, integration, and identity and access management (IAM) products. WSO2's products and platforms, including our next-gen internal developer platform, Choreo, empower organizations to leverage the full potential of APIs for secure delivery of...
Senior Frontend Engineer, Reliability Experience

2 days ago

São Paulo, São Paulo, Brazil Airbnb Full-time R$26.666 - R$33.333

Airbnb was born in 2007 when two hosts welcomed three guests to their San Francisco home, and has since grown to over 5 million hosts who have welcomed over 2 billion guest arrivals in almost every country across the globe. Every day, hosts offer unique stays and experiences that make it possible for guests to connect with communities in a more authentic...
Senior Frontend Engineer, Reliability Experience

8 hours ago

São Paulo, São Paulo, Brazil Airbnb Full-time

Airbnb was born in 2007 when two hosts welcomed three guests to their San Francisco home, and has since grown to over 5 million hosts who have welcomed over 2 billion guest arrivals in almost every country across the globe. Every day, hosts offer unique stays and experiences that make it possible for guests to connect with communities in a more authentic...
