Senior Site Reliability Engineer
1 week ago
Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.
Position Overview
We are seeking an experienced Site Reliability Engineer (SRE) specializing in chaos engineering and monitoring to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As a Senior SRE and Chaos Engineering Specialist, you will create and run chaos experiments to validate our systems' resilience against real-world failures. You will also bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.
Key Responsibilities
Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional troubleshooting and problem-solving skills, including the ability to diagnose complex systems
Experience with chaos engineering tools such as Chaos Monkey, Gremlin, or similar frameworks
Familiarity with container orchestration platforms like Kubernetes and related chaos tools
Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements
Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow's AI at Articul8 AI.
NOTE: This position is available via CLT contract only. Thank you.
-
Senior Site Reliability Engineer
2 weeks ago
Remote, Brazil Swile Full-time
At Swile, we believe that good products can help reduce friction in daily professional life and boost employee satisfaction. Today, we provide innovative solutions in various areas such as Fintech, Travel, HR, and Employee Benefits to more than 5.5 million users in 85,000 companies in France and Brazil. Your role as a Senior Site Reliability Engineer (SRE)...
-
Principal Site Reliability Engineer
6 days ago
Remote - Argentina; Remote - Brazil; Remote - Chile; Remote - Colombia; Remote - Ecuador; Remote - Mexico; Remote - Peru; Remote - Uruguay Groupon Full-time
Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms...
-
Site Reliability Engineer
4 weeks ago
Brazil Softensity Inc Full-time
Summary
We at Softensity are looking for a Site Reliability Engineer (SRE). This is a dynamic and hands-on role within a global, collaborative SRE environment. The SRE Technical Member will contribute to building resilient systems, automating operations, and ensuring the platform meets high standards for performance, reliability, and security. You will...
-
Site Reliability Engineer
4 weeks ago
Brazil, BR Softensity Inc Full-time
Summary
We at Softensity are looking for a Site Reliability Engineer (SRE). This is a dynamic and hands-on role within a global, collaborative SRE environment. The SRE Technical Member will contribute to building resilient systems, automating operations, and ensuring the platform meets high standards for performance, reliability, and security. You will be...
-
Senior Site Reliability Engineer
3 days ago
São Paulo, State of São Paulo, Brazil Sigma Software Full-time
Company Description
As a Site Reliability Engineer, you will work as an integral member of product teams, helping to build, deploy, and monitor cloud services reliably. You will contribute to complex software development projects to maintain essential, revenue-critical services. Additionally, you will actively develop code and build frameworks to monitor...
-
Site Reliability Engineer Sr
4 weeks ago
Brazil, BR Mercado Eletrônico Full-time
Mercado Eletrônico is the Latin American leader in B2B procurement management solutions. Its technologies and services for procurement departments help companies achieve greater savings, agility, governance, and collaboration. With offices in Brazil, the United States, Mexico, and Portugal, it counts more than 1 million suppliers, 10 thousand...
-
Senior DevOps Engineer/K8s expert
1 week ago
Remote, SP, Brazil Wizdaa Full-time
Job description
Level: Senior (5+ years) | Department: Foundation/Platform Engineering
Role Overview
Lead development of internal Kubernetes platform enabling scalable application deployment through GitOps. Engineer solutions for deployment complexity, database migrations, multi-environment management, and developer productivity. Drive DevOps practices...
-
Cloud Reliability Engineer
2 weeks ago
Remote Brazil Infios BR Full-time
If you are looking for a meaningful career where people work and act with passion, rethink the existing and always strive to find the best solution, you have come to the right place. We develop future technologies to relentlessly make supply chains better. We are a leader in supply chain software solutions, helping organizations streamline operations,...
-
Site Reliability Engineer
1 week ago
Descartes Brazil Descartes SmartCompliance Full-time
Descartes Unites the People and Technology that Move the World
The need for efficient, secure, and agile supply chains and logistics operations has become ever more critical and complex. By combining innovative technology, powerful trade intelligence and the reach of our network, Descartes helps get goods, information, transportation assets, and people...
-
Senior Software Engineer
5 days ago
Brazil - Remote Kuali Full-time
Senior Software Engineer (Remote Contractor, Brazil)
About the Role
We're looking for six Senior Full Stack Engineers to join our team as remote contractors. We're seeking experienced engineers based in Brazil or across Latin America who want to join a US engineering team building the next generation of our enterprise software platform for delivering amazing...