
Senior Site Reliability Engineer
20 hours ago
Engineering
/
Remote
Job Title: Senior Site Reliability Engineer
Level: Senior
Working Hours: Full Time (40h/week)
Contract: Contractor
Location: Remote
Your Team
You will report to our Head of Infrastructure and Deployment and join the Engineering team. The Site Reliability Engineering (SRE) team is dedicated to engineering, maintaining, and continuously improving the reliability, scalability, and performance of all critical Rocket.Chat systems and services. Our mission is to ensure an exceptional and uninterrupted experience for our users and customers, bridging the gap between development and operations to deliver value efficiently and automatically. On TheOrg you can view the complete structure of our organisation, including information about every team member, hiring managers, and the size of each department.
Your Responsibilities
As a Senior Site Reliability Engineer, you will play a critical role in enhancing the reliability, performance, and scalability of Rocket.Chat's entire ecosystem. You will apply software engineering principles to infrastructure and operations, proactively preventing outages, optimizing system efficiency, and ensuring that new features and services are delivered with the highest standards of stability. Your expertise will be instrumental in delivering exceptional user experiences across our core platform, internal infrastructure, and customer-facing services.
Mandatory Hard Skills
- Strong background in software engineering with expertise in large-scale distributed systems.
- Expertise in Kubernetes, including operator development, and cloud platforms (e.g., AWS, GCP, Azure, OVH).
- Proficiency in programming/scripting languages such as Go, Python, or Bash for tooling and operator development.
- Deep, hands-on experience with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki).
- Experience with Infrastructure as Code (IaC) tools like Terraform, Pulumi or Ansible and CI/CD pipelines using tools like ArgoCD.
- Solid understanding of networking fundamentals (TCP/IP, DNS, routing) and security principles.
- Familiarity with database technologies such as MongoDB or Redis.
Desirable Hard Skills
- Practical experience with chaos engineering principles and tools.
- Experience with disaster recovery planning, testing, and implementation.
- Familiarity with agile management tools such as Jira.
Soft Skills
- Proactive Mindset: Anticipate and address potential issues before they impact users.
- Collaboration: Work seamlessly with other teams, sharing knowledge and expertise to drive reliability.
- Problem-Solving: Strong troubleshooting and analytical skills to identify the root cause of complex issues across diverse technical stacks.
- Leadership: Guide and inspire team members, especially during incidents, and effectively communicate with both technical and non-technical stakeholders.
- Data-Driven Decisions: Base decisions on metrics and data to drive improvements.
- Passion: Genuine enthusiasm for what you do and how it contributes to our company's mission.
- Dream: Proactively seek out opportunities and challenges to achieve extraordinary results. If you're someone who takes initiative and is always striving to improve, you'll fit right in.
- Own: Take ownership of your work, set high standards for yourself, and be accountable for outcomes, demonstrating a strong sense of responsibility and commitment. Take full responsibility for the reliability and performance of all Rocket.Chat services and infrastructure.
- Trust: Recognize the importance of trust and support, and actively work towards a collaborative and inclusive workplace.
- Share: Communicate openly and transparently, ensuring clarity and honesty in interactions.
What You'll Do
- Engineer & Operate Deployment & Platform Services: Design, develop, and maintain the Kubernetes Operators at the core of our managed hosting offerings, ensuring their reliability, scalability, and robust error handling.
- Manage & Optimize Core Infrastructure: Oversee the reliability and performance of foundational infrastructure, including multiple Kubernetes clusters and critical services like ArgoCD, Traefik, and our monitoring stack.
- Ensure Service Reliability & Uptime: Define, monitor, and enforce SLOs for all critical services, manage error budgets, and implement robust monitoring, alerting, and logging solutions.
- Automate Operations & Reduce Toil: Develop and maintain automation frameworks for deployment, configuration, and operational tasks, building internal tools to streamline SRE workflows.
- Lead Incident Management & On-Call Response: Act as a primary responder for critical alerts, lead blameless post-mortems, and continuously improve runbook documentation and disaster recovery plans.
- Foster Cross-Functional Collaboration: Engage early in the product lifecycle to ensure reliability is built in, and collaborate with Engineering, Security, and QA to integrate reliability best practices.
- Implement Advanced Reliability Practices: Conduct proactive load testing, performance analysis, and chaos engineering experiments to identify system weaknesses and improve fault tolerance.
Benefits
- Fully Remote & Flexible Working Hours
- Flexible Paid Time Off, Holidays and Vacation
- Company Laptop
- Remote Benefit
- iTalki, Courses and Books
- Stock Options
- Multicultural Environment
- Vibrant Company Culture
Check out our handbook to dive into each of our awesome benefits. At Rocket.Chat, we have tailored base pay ranges according to work locations. This approach ensures that we can competitively and consistently compensate our employees across different geographic markets.
Note: While we define an initial seniority level and budget for each role, this can be adjusted during the hiring process. The selection process itself — including interviews and assessments — helps us better understand where the candidate fits within our career framework and which grade they should be positioned in.
About Rocket.Chat
Rocket.Chat is the world's largest open-source communications platform. Built for organizations needing more control over their communications, Rocket.Chat Secure CommsOS is a communication platform that unifies messaging, voice, video, AI, and mission-critical applications, ensuring uncompromising security, compliance, and operational efficiency for governments, defense, and critical infrastructure organizations operating in highly regulated environments.
Tens of millions of users in over 150 countries and organizations such as Deutsche Bahn, the U.S. Navy, and Credit Suisse trust Rocket.Chat every day to keep their communications completely private and secure. At Rocket.Chat, we believe in reconnecting the world, one conversation at a time.
See yourself in that? So apply now. Check out our handbook for more information about our rocket.