
Machine Learning Engineer
5 days ago
Who we are:
CloudWalk is a fintech company reimagining the future of financial services. We are building intelligent infrastructure powered by AI, blockchain, and thoughtful design. Our products serve millions of entrepreneurs across Brazil and the US every day, helping them grow with tools that are fast, fair, and built for how business actually works. Learn more at .
Who We're Looking For:
We're looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models. You'll work inside our GPU cluster to help researchers train and scale foundation models using frameworks like Hugging Face Transformers, Accelerate, DeepSpeed, FSDP, and others. Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale.
This role is not research; it's about building and scaling the systems that let researchers move fast and models grow big. You'll work closely with MLOps, infra, and model developers to make our training runs efficient, resilient, and reproducible.
What You'll Do:
- Own the architecture and maintenance of our distributed training pipeline;
- Train LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate;
- Design and debug multi-node/multi-GPU training runs (Kubernetes-based);
- Optimize training performance: memory usage, speed, throughput, and cost;
- Help manage experiment tracking, artifact storage, and resume logic;
- Build reusable, scalable training templates for internal use;
- Collaborate with researchers to bring their training scripts into production shape.
What We're Looking For:
- Expertise in distributed training: Experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups;
- Strong PyTorch background: Comfortable writing custom training loops, schedulers, or callbacks;
- Hugging Face stack experience: Transformers, Datasets, Accelerate - you know the ecosystem and how to bend it;
- Infra literacy: You understand how GPUs, containers, and job schedulers work together. You can debug cluster issues, memory bottlenecks, or unexpected slowdowns;
- Resilience mindset: You write code that can checkpoint, resume, log correctly, and keep running when things go wrong;
- Collaborative builder: You don't mind digging into other people's scripts, making them robust, and helping everyone train faster.
Bonus Points:
- Experience with Kubernetes-based GPU clusters and Ray;
- Experience with experiment tracking (MLflow, W&B);
- Familiarity with mixed precision, ZeRO stages, model parallelism;
- Comfort with CLI tooling, profiling, logging, and telemetry;
- Experience with dataloading bottlenecks and dataset streaming.
How We Hire:
- Online assessment: technical logic and fundamentals (Math/Calculus, Statistics, Probability, Machine Learning/Deep Learning, Code)
- Technical interview: deep dive into distributed training theory and reasoning (no code)
- Cultural interview
- If you are not willing to take an online quiz, do not apply.
If you've trained LLMs before - or helped others do it better - this role is for you.
Even if you don't check every box, if you're confident working with distributed compute and real-world LLM workloads, we want to hear from you.
-
Machine Learning Engineer
4 weeks ago
São Paulo, São Paulo, Brazil
BairesDev
Full-time
Overview: Join to apply for the Machine Learning Engineer - Remote Work role at BairesDev. We are looking for an outstanding Machine Learning Engineer to join our team. This professional will use data and machine learning techniques to help the business automate and scale decision-making, mainly by producing advanced data analysis reports and data-based models...
-
Senior Machine Learning Engineer
3 weeks ago
São Paulo, São Paulo, Brazil
Onzze
Full-time
Overview: Join to apply for the Senior Machine Learning Engineer role at Onzze. Are you an experienced software developer with a passion for artificial intelligence and scalable agent platforms? This role involves working on a next-generation AI Automation Platform, driving automation, intelligence, and scalability on a global scale. Hiring for a...
-
Machine Learning Engineer
4 weeks ago
São Paulo, São Paulo, Brazil
Blue Orange Digital
Full-time
Company overview: Blue Orange Digital is a boutique data & AI consultancy that delivers enterprise-grade results. We design and build modern data platforms, analytics, and ML/AI Agent solutions for mid-market and enterprise clients across Private Equity, Financial Services, Healthcare, and Retail. Our teams work with technologies like...
-
Machine Learning Engineer
2 weeks ago
São Paulo, São Paulo, Brazil
Blue Orange Digital
Full-time
R$113,220 - R$114,588 per year
Company overview: Blue Orange Digital is a boutique data & AI consultancy that delivers enterprise-grade results. We design and build modern data platforms, analytics, and ML/AI Agent solutions for mid-market and enterprise clients across Private Equity, Financial Services, Healthcare, and Retail. Our teams work with technologies like Databricks, Snowflake,...
-
Machine learning
4 weeks ago
São Paulo, São Paulo, Brazil
Netvagas
Full-time
Overview: Join to apply for the Machine learning role at Netvagas. We are looking for a Machine Learning Engineer to join our AI Solutions team and be responsible for applying machine learning techniques to innovative projects, helping build intelligent solutions for our products and clients. If you have...
-
Machine Learning Engineer
1 week ago
São Paulo, São Paulo, Brazil
Capgemini
Full-time
R$80,000 - R$120,000 per year
Are you passionate about technology and innovation, and do you want to be part of an inclusive, collaborative, and constantly evolving environment? Then this opportunity is for you. At Capgemini, we value work-life balance. That is why we offer flexible work models, which may vary between remote, hybrid, or on-site,...
-
Machine Learning Engineer
3 weeks ago
São Paulo, São Paulo, Brazil
Tata Consultancy Services
Full-time
Come to one of the biggest IT services companies in the world. Here you can transform your career. Why join TCS? Here at TCS we believe that people make the difference; that's why we live a culture of unlimited learning, full of opportunities for improvement and mutual development. The ideal scenario to expand ideas through the right tools, contributing to...
-
Senior Machine Learning Engineer
1 week ago
São Paulo, São Paulo, Brazil
Onzze
Full-time
R$90,000 - R$120,000 per year
Hiring for a company located in Sweden. Relocation to Stockholm is required. Are you an experienced software developer with a passion for artificial intelligence and scalable agent platforms? Do you thrive in fast-paced environments where innovation and impact go hand in hand? One of our customers is building a next-generation AI...