Principal Systems Engineer

Nscale Operations UK Ltd

Houston; New York; San Francisco; Seattle; US

1d ago

Seniority: Staff

About the role

<h1><strong>Principal Systems Engineer – GPU Supercluster Bringup</strong></h1>
<h2><strong>About Us</strong></h2>
<p>We are building AI infrastructure for frontier-scale workloads. Our platform is designed for high-density, high-performance GPU clusters that push the limits of power, networking, and distributed compute.</p>
<p>As a startup, we move fast, operate with ownership, and expect technical leaders to define standards, not just follow them.</p>
<h2><strong>The Role</strong></h2>
<p>We are hiring a <strong>Principal Systems Engineer</strong> to architect and lead the bringup of large-scale GPU clusters (hundreds to thousands of GPUs). This is a technical leadership role responsible for defining how we deploy, validate, and scale AI superclusters across sites.</p>
<p>You will own the full deployment lifecycle, from rack design and fabric architecture to cluster validation frameworks and production readiness standards. You will set the bar for performance, reliability, and operational excellence.</p>
<p>This role combines deep hands-on expertise with system-level thinking and cross-functional leadership.</p>
<h2><strong>What You’ll Do</strong></h2>
<h3><strong>End-to-End Supercluster Bringup Ownership</strong></h3>
<ul>
  <li>Define the technical standards for node, rack, and full-cluster bringup.</li>
  <li>Lead large-scale GPU cluster deployments (multi-rack, multi-pod environments).</li>
  <li>Architect high-performance network fabrics (InfiniBand, RoCE, Ethernet) optimized for AI workloads.</li>
  <li>Establish cluster-level acceptance criteria and validation frameworks.</li>
</ul>
<h3><strong>Performance &amp; Fabric Architecture</strong></h3>
<ul>
  <li>Tune and validate NCCL, RDMA, GPUDirect, and collective operations at scale.</li>
  <li>Identify and eliminate performance bottlenecks across hardware, topology, and firmware layers.</li>
  <li>Drive congestion control and fabric optimization strategies.</li>
  <li>Define performance benchmarking methodology for AI training workloads.</li>
</ul>
<h3><strong>Deployment Strategy &amp; Scalability</strong></h3>
<ul>
  <li>Design repeatable deployment models for multi-site expansion.</li>
  <li>Build automation frameworks for provisioning and cluster validation.</li>
  <li>Establish deployment SLAs, quality gates, and operational readiness standards.</li>
  <li>Reduce time-to-capacity while increasing reliability.</li>
</ul>
<h3><strong>Technical Leadership</strong></h3>
<ul>
  <li>Serve as the escalation point for complex bringup and performance issues.</li>
  <li>Mentor senior engineers and shape infrastructure best practices.</li>
  <li>Influence hardware selection, rack topology, and data center design decisions.</li>
  <li>Partner with executive leadership on infrastructure scaling strategy.</li>
</ul>
<h2><strong>What We’re Looking For</strong></h2>
<h3><strong>Required</strong></h3>
<ul>
  <li>10+ years of experience in large-scale infrastructure or HPC environments.</li>
  <li>Proven experience bringing up large GPU clusters (hundreds of GPUs or more).</li>
  <li>Deep expertise in high-speed networking (InfiniBand, RoCE, Ethernet fabrics).</li>
  <li>Strong understanding of server architecture (PCIe, NUMA, memory hierarchy).</li>
  <li>Experience debugging performance issues across compute and network layers.</li>
  <li>Strong automation and systems-level thinking.</li>
</ul>
<h3><strong>Strongly Preferred</strong></h3>
<ul>
  <li>Experience scaling AI training clusters for frontier models.</li>
  <li>Experience with liquid cooling or ultra-high-density deployments.</li>
  <li>Knowledge of distributed storage systems (Lustre, Ceph, NVMe-oF).</li>
  <li>Experience defining infrastructure standards in a fast-growing organization.</li>
</ul>
<h2><strong>What Success Looks Like</strong></h2>
<ul>
  <li>Superclusters are brought online quickly, predictably, and at peak performance.</li>
  <li>Deployment processes scale from first cluster to multi-site expansion.</li>
  <li>Infrastructure becomes a competitive advantage.</li>
  <li>You define the technical blueprint for how we scale AI infrastructure.</li>
</ul>
<div class="content-pay-transparency">
  <div class="pay-input">
    <div class="description">
      <p>The range below reflects the base salary for the position. Actual compensation may vary based on job-related factors such as skill set, experience, education, and location. In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs. Nscale may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.</p>
    </div>
    <div class="title">Salary Range</div>
    <div class="pay-range"><span>$175,000</span><span class="divider">–</span><span>$225,000 USD</span></div>
  </div>
</div>
<div class="content-conclusion">
  <p><em>For information on how Nscale handles candidate personal data, please see our Employee &amp; Candidate Privacy Notice: <a href="https://drive.google.com/file/d/1QK5Yg04WHD9K9IAtJgQWubJZC9oLvatK/view?usp=sharing" target="_blank">Here.</a></em></p>
</div>

Perks & benefits

Paid Time Off
Equity Compensation
