Senior Site Reliability Engineer

9 hours ago


Brazil | Remote, Brazil | Articul8 | Full-time
About Us

Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.

Position Overview

We are seeking an experienced Site Reliability Engineer (SRE) specializing in chaos engineering and monitoring to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As a Senior SRE and Chaos Engineering Specialist, you will create and run chaos experiments to validate our systems' resilience against real-world failures. You will also bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.

Key Responsibilities
  • Architect and maintain scalable, highly available infrastructure for our GenAI platform.

  • Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.

  • Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.

  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.

  • Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.

  • Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.

  • Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.

  • Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.

  • Implement and enforce security best practices across all systems and environments.

  • Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

Qualifications

Required
  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience

  • 5+ years of experience in DevOps, SRE, or similar roles

  • Strong experience with cloud platforms (AWS, GCP, or Azure)

  • Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)

  • Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)

  • Solid background in containerization technologies (Docker, Kubernetes)

  • Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)

  • Strong understanding of CI/CD pipelines and automation

  • Exceptional troubleshooting and problem-solving skills, including the ability to debug complex systems

  • Experience with chaos engineering tools such as Chaos Monkey, Gremlin, or similar frameworks

  • Familiarity with container orchestration platforms like Kubernetes and related chaos tools

Preferred
  • Experience supporting AI/ML systems in production

  • Knowledge of GPU infrastructure management and optimization

  • Familiarity with distributed systems and high-performance computing

  • Experience with database systems (SQL and NoSQL)

  • Certifications in cloud platforms (AWS, GCP, Azure)

  • Experience with chaos engineering and resilience testing

  • Knowledge of security best practices and compliance requirements

Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow's AI at Articul8 AI.
NOTE: This position is available via CLT contract only. Thank you.


