Machine Learning Performance Engineer

Jane Street

New York · 1mo ago

About the role

We are looking for an engineer with experience in low-level systems programming and optimisation to join our growing ML team. <a href="https://www.janestreet.com/join-jane-street/machine-learning/">Machine learning</a> is a critical pillar of Jane Street's global business. Our ever-evolving trading environment serves as a unique, rapid-feedback platform for ML experimentation, allowing us to incorporate new ideas with relatively little friction.

Your part here is optimising the performance of our models – both training and inference. We care about efficient large-scale training, low-latency inference in real-time systems and high-throughput inference in research. Part of this is improving straightforward CUDA, but the interesting part needs a whole-systems approach, including storage systems, networking and host- and GPU-level considerations. Zooming in, we also want to ensure our platform makes sense even at the lowest level – is all that throughput actually goodput? Does loading that vector from the L2 cache really take that long?

If you’ve never thought about a career in finance, you’re in good company. Many of us were in the same position before working here. If you have a curious mind and a passion for solving interesting problems, we have a feeling you’ll fit right in.
There’s no fixed set of skills, but here are some of the things we’re looking for:
<ul>
<li>An understanding of modern ML techniques and toolsets</li>
<li>The experience and systems knowledge required to debug a training run’s performance end to end</li>
<li>Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores and the memory hierarchy</li>
<li>Debugging and optimisation experience using tools like CUDA GDB, Nsight Systems and Nsight Compute</li>
<li>Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS</li>
<li>Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads</li>
<li>Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters</li>
<li>An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI</li>
<li>An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools</li>
<li>Fluent in English</li>
</ul>

If you're a recruiting agency and want to partner with us, please reach out to <a href="mailto:agency-partnerships@janestreet.com">agency-partnerships@janestreet.com</a>.
