Senior Site Reliability Engineer
1 day ago
Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.
Position Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.
Key Responsibilities
Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.
Qualifications
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional problem-solving skills and the ability to troubleshoot complex systems
Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements
Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow's AI at Articul8 AI.
-
Senior Site Reliability Engineer
5 days ago
Remote, Brazil | Swile | Full-time | €80,000 - €120,000 per year
At Swile, we believe that good products can help reduce friction in daily professional life and boost employee satisfaction. Today, we provide innovative solutions in various areas such as Fintech, Travel, HR, and Employee Benefits to more than 5.5 million users in 85,000 companies in France and Brazil. Your role as a Senior Site Reliability Engineer (SRE)...
-
Senior Site Reliability Engineer
4 weeks ago
Brazil | Mercado Eletrônico | Full-time
Mercado Eletrônico is the Latin American leader in B2B procurement management solutions. Its technologies and services for procurement teams help companies achieve greater savings, agility, governance, and collaboration. With offices in Brazil, the United States, Mexico, and Portugal, it has more than 1 million suppliers, 10 thousand...
-
Site Reliability Engineer
2 days ago
Joinville, Santa Catarina, Brazil | Billor | Full-time | R$80,000 - R$120,000 per year
About Us
At Billor, short for "Bill of Rights," we are building the largest trucking ecosystem in the U.S., dedicated to supporting truck drivers. By combining FinTech, Technology, and Freight Management, we empower drivers to achieve truck ownership and a better quality of life. Our mission is rooted in freedom, responsibility, and efficiency, enabling...
-
Senior Site Reliability Engineer
6 days ago
Brazil | Podium | Full-time
At Podium, our mission is to help local businesses win. Our lead conversion platform, powered by AI and integrations, helps local businesses convert leads faster, communicate easier, and make more sales. Every day, thousands of local businesses utilize our review management, communication, marketing, and payments products. Our work and focus on helping...
-
Middle Site Reliability Engineer
5 days ago
Remote, Brazil | Lean Tech | Full-time | R$60,000 - R$120,000 per year
Description
Company Overview: Lean Tech is a rapidly expanding organization situated in Medellín, Colombia. We pride ourselves on possessing one of the most influential networks within software development and IT services for the entertainment, financial, and logistics sectors. Our corporate projections offer many opportunities for professionals to elevate...
-
Lead Site Reliability Engineer
4 days ago
Brazil (remote) | Gympass | Full-time
Your wellbeing matters. Join a company that cares. GET TO KNOW US Wellhub (formerly Gympass*) is a corporate wellness platform that connects employees to the best partners for fitness, mindfulness, therapy, nutrition, and sleep, all included in one subscription designed to cost less than each individual partner. Founded in and headquartered in NYC, we have a...
-
Staff Site Reliability Engineer
2 days ago
Brazil (remote) | Gympass | Full-time
Your wellbeing matters. Join a company that cares. GET TO KNOW US Wellhub (formerly Gympass*) is a corporate wellness platform that connects employees to the best partners for fitness, mindfulness, therapy, nutrition, and sleep, all included in one subscription designed to cost less than each individual partner. Founded in and headquartered in NYC, we have a...
-
Site Reliability Engineer Sr
6 days ago
Brazil | Mercado Eletrônico | Full-time
Mercado Eletrônico is the Latin American leader in B2B procurement management solutions. Its technologies and services for procurement teams help companies achieve greater savings, agility, governance, and collaboration. With offices in Brazil, the United States, Mexico, and Portugal, it has more than 1 million suppliers, 10 thousand...
-
Senior Site Reliability Engineer
4 weeks ago
Brazil | Signify Technology | Full-time
The Company
A well-established tech organization building advanced AI products for healthcare and clinical research. The team focuses on secure, reliable platforms that process sensitive medical data and support research and clinical workflows.
Role & Responsibilities
As a Senior SRE, you will: Design and automate infrastructure (infrastructure-as-code...