Machine Learning Infrastructure Engineer

Mind Robotics
Palo Alto, On-site
Employment
Full-time

The Role

At Mind Robotics, we’re building generalized physical AI—robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure.

We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering everything from experimentation to production deployment.

Responsibilities

  • Design and implement scalable systems for training large ML models

  • Enable efficient workflows for data ingestion, training, and iteration

  • Develop and optimize distributed training systems across hundreds of GPUs

  • Implement strategies for parallelization, sharding, and efficient compute utilization

  • Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management

  • Partner closely with modeling teams to accelerate iteration speed and reduce training costs

  • Build internal tools for experiment tracking, monitoring, and debugging

  • Implement systems for tracking training performance, failures, and resource utilization

  • Debug and resolve bottlenecks across the training stack

  • Provide lightweight infrastructure support for deploying and running models in production environments

  • Optimize inference performance and reliability where needed

  • Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)

  • Manage compute resources efficiently across training jobs

Qualifications

  • Strong experience building infrastructure for large-scale ML training

  • Deep understanding of how modern LLM/VLM systems are trained and scaled

  • Proven experience setting up and scaling distributed training across hundreds of GPUs

  • Strong understanding of parallelization strategies (data, model, and pipeline parallelism)

  • Strong proficiency in Python

  • Expert-level proficiency in PyTorch and/or JAX

  • Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage

Nice to Have

  • Experience supporting inference systems in production

  • Familiarity with robotics or embodied AI workloads

  • Experience building tools for experiment management and researcher productivity
