Senior Site Reliability Engineer
9 hours ago
Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.
Position Overview
We are seeking an experienced Site Reliability Engineer (SRE) specializing in chaos engineering and monitoring to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As a Senior SRE and Chaos Engineering Specialist, you will create and run chaos experiments to validate our systems' resilience against real-world failures. You will also bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.
Key Responsibilities
Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional problem-solving skills and the ability to troubleshoot complex systems
Experience with chaos engineering tools such as Chaos Monkey, Gremlin, or similar frameworks
Familiarity with container orchestration platforms like Kubernetes and related chaos tools
Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements
Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow's AI at Articul8 AI.
NOTE: This position is available via CLT contract only. Thank you.
-
Senior Site Reliability Engineer
4 days ago
Remote, Brazil Swile Full-time
At Swile, we believe that good products can help reduce friction in daily professional life and boost employee satisfaction. Today, we provide innovative solutions in various areas such as Fintech, Travel, HR, and Employee Benefits to more than 5.5 million users in 85,000 companies in France and Brazil. Your role as a Senior Site Reliability Engineer (SRE)...
-
Site Reliability Engineer
4 weeks ago
Brazil Review ALL Full-time
About the Company
This company operates a global computing platform that enables businesses to programmatically deploy single-tenant Bare Metal instances across multiple regions worldwide. They are a team of passionate engineers working at the intersection of hardware, software, and network infrastructure, building the fastest, most developer-centric...
-
Site Reliability Engineer
4 weeks ago
Brazil MetaCTO Full-time
About Us
At MetaCTO, we specialize in helping startups and growing companies turn visionary ideas into successful digital products through expert app development and fractional CTO services. As a Site Reliability Engineer (SRE), you will play a critical role in ensuring the reliability, scalability, and security of the backend infrastructure that powers...
-
Site Reliability Engineer
2 weeks ago
Brazil Quantum World Technologies Inc. Full-time
We are seeking a Site Reliability Engineer (SRE) who is passionate about large-scale infrastructure and eager to develop deeper expertise in PostgreSQL. In this role, you will join the Database Engineering organization and help strengthen the reliability, resilience, and automation of our database platform. This position is an excellent fit for an...
-
Site Reliability Engineer
2 days ago
Remote (Brazil) Alternative Payments Full-time
At Alternative Payments, we are transforming the way service-based companies handle payments. Our innovative platform automates the entire accounts receivable process, helping businesses save time, reduce costs, and scale with confidence. We are building a global team that values innovation, impact, and collaboration. As part of a scaling FinTech company,...
-
Senior Site Reliability Engineer
1 week ago
Brazil Podium Full-time
At Podium, our mission is to help local businesses win. Our lead conversion platform, powered by AI and integrations, helps local businesses convert leads faster, communicate easier, and make more sales. Every day, thousands of local businesses utilize our review management, communication, marketing, and payments products. Our work and focus on helping...
-
Lead Site Reliability Engineer
2 weeks ago
Brazil (remote) Gympass Full-time
Your wellbeing matters. Join a company that cares. GET TO KNOW US Wellhub (formerly Gympass*) is a corporate wellness platform that connects employees to the best partners for fitness, mindfulness, therapy, nutrition, and sleep, all included in one subscription designed to cost less than each individual partner. Founded in and headquartered in NYC, we have a...
-
Staff Site Reliability Engineer
7 days ago
Brazil (remote) Gympass Full-time
Your wellbeing matters. Join a company that cares. GET TO KNOW US Wellhub (formerly Gympass*) is a corporate wellness platform that connects employees to the best partners for fitness, mindfulness, therapy, nutrition, and sleep, all included in one subscription designed to cost less than each individual partner. Founded in and headquartered in NYC, we have a...
-
Senior Site Reliability Engineer
3 days ago
Brazil Stone Full-time
Who is Stone Tech? Stone was founded with the purpose of playing a leading role in transforming the payments industry, striving to offer the best solutions for entrepreneurs in Brazil. With that in mind, we built Stone Tech: the union of the technology teams of Stone Co. and the group's financial companies, which recognize the entrepreneurial potential of...
-
Senior DevOps Engineer
3 weeks ago
Brazil Edison Smart® Full-time
Senior DevOps Engineer | Remote ±3 GMT (Brazil) | €200 per day | 12 months Role: Senior DevOps Engineer Location: Remote ±3 GMT (Brazil) Rate / Salary: €200 per day Duration: 12 months (extension likely) Language: English I'm partnering with a leading consultancy delivering a large-scale Cloud, DevOps & Data transformation for a major global...