Reliability Engineer

7 days ago


Brasília, Brazil | Cedana | Full-time
What you will do

  • You’ll span the stack from the kernel, system and hypervisor to exploit our unique insights in compute. You’ll help increase the reliability of our system all the way from the kernel to our managed Kubernetes cloud offering.
  • You'll interact with customers on a regular basis, triaging issues and developing working relationships with points of contact across multiple organizations.
  • You'll help build out internal tooling to measure reliability and success, along with alerting infrastructure that helps us identify problems quickly, from the kernel all the way to Kubernetes.

What we are looking for

  • You don’t fit in with traditional full-stack developers because you are obsessed with understanding how every layer of compute works.
  • Interest in working in multiple domains and wearing multiple hats.
  • Ability and experience, or a strong interest, in learning the compute stack: hardware, device drivers, OS kernel and system, k8s, and distributed systems. You don’t need to know all of these coming in, but you are curious and have the intellectual bandwidth to learn them quickly.
  • Track record of solving challenging problems in systems programming (e.g., compilers, distributed systems, embedded systems, highly available systems at scale).
  • Creative problem solving and multidisciplinary experience.
  • Demonstrated ability to collaborate with others.

Required Qualifications

  • Strong understanding of Kubernetes (Controllers, Operators, CRDs).
  • Strong understanding of Linux and UNIX fundamentals (standard libraries, services, networking, kernel/user-space interaction).
  • Strong system-level programming experience (e.g., C/Rust/Go).
  • Experience or familiarity with low-level systems programming concepts.
  • Experience writing Kubernetes controllers or services from scratch.

Preferred Qualifications

  • Experience with different container runtimes (runc, docker, podman, etc.) and container orchestration.
  • Contributions to open source projects, such as those under the Cloud Native Computing Foundation (CNCF), Apache Software Foundation (ASF), or Open Source Security Foundation (OpenSSF), are a huge plus.
  • Experience with Kubernetes system administration (using Helm, Terraform, etc.).
  • Experience scaling infrastructure out as part of a platform team.
  • Experience productionizing and managing production-level Kubernetes clusters.
  • Familiarity with being on call (our founders have experience being on call, and know how rough it is).

Nice to Have

  • Experience supporting data teams with data processing infrastructure (BigQuery, OpenTelemetry, etc.) and implementing observability and monitoring best practices.
  • Experience with high performance computing (think SLURM).
  • Experience deploying and scaling ML workloads (training or inference) in production.
  • Familiarity with problems associated with deploying large-scale ML models or batch/scientific compute.

Working at Cedana

We’re building a unique and powerful system that transforms compute orchestration. Our team is pushing the boundaries of compute performance across multiple layers of the stack.

On top of building a transformative stack, our engineers dig into the Linux kernel, spend time bushwhacking around Kubernetes and runc source code, investigate novel virtualization techniques and pore over open source GPU drivers. By moving fast and shipping quickly, they also get an opportunity to improve performance in real-world, deployed production systems on behalf of our customers, which include leading companies in Computing & GPU Infrastructure, DevTools, and LLM/Foundation Models.

Our company is led by founders with extensive experience in building and scaling successful startups. Our investors include a co-founder of OpenAI, the former Chief Architect of Slack, founding members of Facebook AI, and leading VC firms.