Helix AI Engineer, Training Infrastructure

San Jose (posted 1w ago)

About the role

<div>Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The company's goal is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA.<br><br>Figure's vision is to deploy autonomous humanoids at a global scale. Our Helix team is looking for an experienced Training Infrastructure Engineer to take our infrastructure to the next level. This role focuses on managing the training cluster and implementing distributed training algorithms, data loaders, and developer tools for AI researchers.<br><br><strong>Responsibilities</strong></div> <ul> <li>Design, deploy, and maintain Figure's training clusters</li> <li>Architect, optimize, and maintain scalable deep learning frameworks for training on massive robot datasets</li> <li>Collaborate with AI researchers to train new model architectures at large scale</li> <li>Implement distributed training, advanced parallelization strategies, and high-performance data loaders to shorten model development cycles</li> <li>Profile, identify, and eliminate training bottlenecks at the hardware and software levels to maximize Model FLOPs Utilization (MFU)</li> <li>Implement tooling for data processing, model experimentation, and continuous integration</li> </ul> <div><strong>Requirements</strong></div> <ul> <li>Strong software engineering fundamentals</li> <li>Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field</li> <li>Extensive professional experience with Python and PyTorch</li> <li>Proven track record of personally scaling and running large-scale training experiments on 800+ GPUs</li> <li>Experience managing HPC clusters for deep neural network training</li> <li>Minimum of 4 years of professional, full-time experience building reliable backend systems and infrastructure</li> </ul> <div><strong>Bonus Qualifications</strong></div> <div> <ul> <li>Experience contributing to or maintaining open-source distributed training frameworks (Megatron-LM, DeepSpeed, TorchTitan)</li> <li>Experience managing cloud infrastructure (AWS, Azure, GCP)</li> <li>Experience with job scheduling and orchestration tools (SLURM, Kubernetes, LSF, etc.)</li> <li>Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)</li> <li>Deep understanding of CUDA and hands-on experience writing custom GPU kernels to optimize training</li> </ul> <p>The US base salary range for this full-time position is $150,000 to $350,000 annually.</p> <p>The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components or benefits depending on the specific role. This information will be shared if an employment offer is extended.</p> </div>
