Senior Site Reliability Engineer

1 week ago

Remote, Brazil Articul8 Full-time

About Us

Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.

Position Overview

We are seeking an experienced Site Reliability Engineer (SRE) specializing in chaos engineering and monitoring to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As a Senior SRE and Chaos Engineering Specialist, you will create and run chaos experiments to validate our systems' resilience against real-world failures. You will also bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.

Key Responsibilities

Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

Qualifications

Required

Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional troubleshooting and problem-solving skills, including with complex systems
Experience with chaos engineering tools such as Chaos Monkey, Gremlin, or similar frameworks
Familiarity with container orchestration platforms like Kubernetes and related chaos tools

Preferred

Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements

Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow's AI at Articul8 AI.
NOTE: This position is available via CLT contract only. Thank you.

