
LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Connect Limited

Hong Kong · 1mo ago

About the role

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to ensure efficient and reliable training processes.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).
