Head of Site Reliability Engineering

2 weeks ago


Belo Horizonte, Minas Gerais, Brazil | ChipStack | Full-time

About ChipStack
Chips power everything, yet chip‑design tooling hasn't kept up with the exploding complexity. ChipStack reinvents verification with AI‑native software already in use at 10+ semiconductor innovators. Backed by Khosla Ventures, Cerberus, and Clear Ventures, our small, fast team ships at the intersection of AI, EDA, and systems engineering.

The Opportunity
We need rock‑solid, low‑latency deployments—often inside customer data centers with no internet egress. As our first dedicated reliability owner, you'll design, automate and operate these hybrid/on‑prem environments so customers experience "five nines" availability without touching the underlying plumbing.
What You'll Do

  • Own end‑to‑end reliability – architect, deploy, and monitor production clusters (on‑prem & cloud) running our Python/TypeScript micro‑services, LLM workloads and GPU back‑ends.
  • Automate the stack – build IaC pipelines (Terraform), GitOps workflows and zero‑downtime rollout strategies.
  • Observe & respond – instrument apps with Prometheus/Grafana, set SLOs/SLIs, lead incident response, perform root‑cause analysis, and harden runbooks.
  • Secure & comply – implement network segmentation, secrets management, RBAC and vulnerability scanning to satisfy strict semiconductor‑industry requirements.
  • Collaborate – pair with product engineers on performance profiling, scalability bottlenecks and customer issue triage.
  • Continually improve – champion best practices in testing, CI/CD, and chaos drills to push our "ship fast, ship quality" culture.
Must‑Have Skills
  • 5+ years building and operating production systems as an SRE / DevOps / Platform Engineer.
  • Hands‑on expertise with Kubernetes and Docker in hybrid or bare‑metal setups.
  • Strong Python for automation tooling; proficiency reading TypeScript services.
  • Deep Linux administration knowledge (kernel tuning, networking, storage, security hardening).
  • Proven track record delivering 99.9%+ uptime for latency‑sensitive services.
  • Observability stack experience (Prometheus, Grafana, Loki / ELK, Alertmanager).
  • Proficiency with Terraform (or equivalent IaC) and Git‑based workflows.
  • Excellent communication and a bias for action when facing vague, first‑of‑its‑kind problems.
Nice‑to‑Have
  • Experience running GPU workloads, ML inference or EDA toolchains in production.
  • Familiarity with air‑gapped / restricted‑network deployments and data‑center operations.
  • Exposure to security certifications (SOC 2, ISO 27001) or semiconductor customer audits.
  • Prior work at an early‑stage startup.
Our Culture (What You'll Thrive In)
  • Challenge the status quo
  • Strong opinions, loosely held
  • Ship fast, ship quality
  • Proud of our craft
Ready to harden the infrastructure that will redefine chip design? Apply now and keep ChipStack running flawlessly for the world's most advanced silicon teams.
Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Job function
  • Engineering and Information Technology
Industries
  • Software Development
