Head of Site Reliability Engineering at ChipStack

2 weeks ago

Belo Horizonte, Brazil · ChipStack · Full-time

**About ChipStack**:
Chips power everything, yet chip‑design tooling hasn’t kept up with the exploding complexity. ChipStack reinvents verification with AI‑native software already in use at 10+ semiconductor innovators. Backed by Khosla Ventures, Cerberus, and Clear Ventures, our small, fast team ships at the intersection of AI, EDA, and systems engineering.

**The Opportunity**:
We need rock‑solid, low‑latency deployments—often inside customer data centers with no internet egress. As our first dedicated reliability owner, you’ll design, automate and operate these hybrid/on‑prem environments so customers experience “five nines” availability without touching the underlying plumbing.

**What You’ll Do**:

- **Own end‑to‑end reliability** - architect, deploy, and monitor production clusters (on‑prem & cloud) running our Python/TypeScript micro‑services, LLM workloads and GPU back‑ends.
- **Automate the stack** - build IaC pipelines (Terraform), GitOps workflows and zero‑downtime rollout strategies.
- **Observe & respond** - instrument apps with Prometheus/Grafana, set SLOs/SLIs, lead incident response, perform root‑cause analysis, and harden runbooks.
- **Secure & comply** - implement network segmentation, secrets management, RBAC and vulnerability scanning to satisfy strict semiconductor‑industry requirements.
- **Collaborate** - pair with product engineers on performance profiling, scalability bottlenecks and customer issue triage.
- **Continually improve** - champion best practices in testing, CI/CD, and chaos drills to push our “ship fast, ship quality” culture.

**Must‑Have Skills**:

- 5+ years building and operating production systems as an SRE / DevOps / Platform Engineer.
- Hands‑on expertise with **Kubernetes** and **Docker** in hybrid or bare‑metal setups.
- Strong Python for automation tooling; proficiency reading TypeScript services.
- Deep Linux administration knowledge (kernel tuning, networking, storage, security hardening).
- Proven track record delivering 99.9%+ uptime for latency‑sensitive services.
- Observability stack experience (Prometheus, Grafana, Loki / ELK, Alertmanager).
- Proficiency with Terraform (or equivalent IaC) and Git‑based workflows.
- Excellent communication and a bias for action when facing vague, first‑of‑its‑kind problems.

**Nice‑to‑Have**:

- Experience running GPU workloads, ML inference or EDA toolchains in production.
- Familiarity with air‑gapped / restricted‑network deployments and data‑center operations.
- Exposure to security certifications (SOC 2, ISO 27001) or semiconductor customer audits.
- Prior work at an early‑stage startup.

**Our Culture (What You’ll Thrive In)**:

- **Challenge status‑quo**
- **Strong opinions, loosely held**
- **Ship fast, ship quality**
- **Proud of our craft**



