SRE Architect

4 weeks ago


Blumenau, Santa Catarina, Brazil EPAM Systems Full-time
Overview

We are seeking a highly skilled Site Reliability Engineering (SRE) Architect to join our innovative and fast-paced team.

In this role, you will be responsible for designing and implementing modern SRE practices to enhance the reliability and scalability of our enterprise-grade Generative AI (GenAI) integration platform. You will play a vital role in driving operational excellence by adopting advanced methodologies and tools while collaborating with key stakeholders across technical and business units.

Responsibilities
  • Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to establish reliability standards and monitor system health
  • Architect resilient production systems using methodologies like canary deployments, shadow traffic, and testing-in-production
  • Develop incident management strategies and automate on-call operations to minimize downtime and improve system stability
  • Enhance observability frameworks with logging, tracing, and monitoring for real-time visibility and proactive troubleshooting
  • Automate tasks related to scalability, performance optimization, and operational processes for improved efficiency
  • Collaborate with engineering teams to integrate SRE principles into system design and development
  • Provide strategic leadership for implementing site reliability solutions in multi-cloud, multi-tenant environments for enterprise applications
  • Advise executive stakeholders with insights and recommendations to align SRE strategies with organizational goals
  • Promote a culture of innovation and operational reliability through mentoring and industry-leading best practices
  • Ensure the platform's infrastructure supports high availability and scalability in partnership with architecture and DevOps teams
  • Drive continuous improvement by identifying opportunities for process innovation and optimization
Requirements
  • 10+ years of professional experience in SRE, DevOps, or related areas, including managing production systems
  • Expertise in SRE practices such as SLOs, SLIs, canary testing, and incident management
  • Proficiency with cloud technologies like AWS, Google Cloud Platform, or Azure, with hands-on experience in multi-cloud setups
  • Background in observability tools such as Prometheus, Grafana, or ELK Stack, as well as monitoring distributed systems
  • Skills in automation platforms such as Terraform, Ansible, or Kubernetes, enabling infrastructure-as-code adoption
  • Familiarity with programming languages like Python, Go, or Bash for building automation solutions
  • Strong understanding of CI/CD pipelines, containerization technologies, and orchestration frameworks
  • Competency in system architecture for fault tolerance, redundancy, and performance optimization
  • History of collaborating effectively with diverse stakeholders, from technical teams to executive management
  • Background in managing enterprise-scale systems and multi-tenant platform deployments
Nice to have
  • Knowledge of Generative AI platforms and integration techniques
  • Understanding of managed database services, including Amazon RDS, Google Spanner, or Azure SQL
  • Familiarity with security practices for enterprise platforms and multi-cloud infrastructures
  • Background in contributing to technical roadmaps for distributed systems at scale
  • Capability to lead initiatives involving Chaos Engineering or disaster recovery strategies
We offer
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Job function
  • Information Technology, Engineering, and Business Development
Industries
  • Software Development, IT Services and IT Consulting, and Venture Capital and Private Equity Principals

