Principal Deployment Engineer

Nscale Operations UK Ltd

US · 3d ago
Seniority
Staff

About the role

<h1><strong>Principal Deployment Engineer – GPU Infrastructure Bringup</strong></h1> <p><strong>Location:</strong> United States (Travel Required)<br><strong>Team:</strong> Infrastructure<br><strong>Reports to:</strong> Head of Infrastructure</p> <h2><strong>About Us</strong></h2> <p>We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable GPU clusters purpose-built for large-scale AI training and inference.</p> <p>As a startup, we operate with urgency, ownership, and a bias toward action. We are assembling the foundational infrastructure that will power frontier AI workloads, and we’re looking for engineers who want to build it from zero to scale.</p> <h2><strong>The Role</strong></h2> <p>We are hiring a <strong>Principal Deployment Engineer</strong> to lead hands-on bringup of GPU clusters across our data center environments. You will own the execution of node, rack, and network deployment, ensuring clusters are validated, performant, and production-ready.</p> <p>This role is deeply technical and execution-focused. 
You will be in the details (cabling racks, validating firmware, tuning fabrics, debugging performance) and will help us build repeatable processes as we scale.</p> <h2><strong>What You’ll Do</strong></h2> <h3><strong>Cluster Deployment &amp; Bringup</strong></h3> <ul> <li>Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.</li> <li>Validate BIOS/BMC/firmware configurations and GPU health.</li> <li>Perform rack-level integration including power, cabling, and airflow validation.</li> <li>Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet).</li> </ul> <h3><strong>Network &amp; Performance Validation</strong></h3> <ul> <li>Configure and validate leaf/spine network connectivity.</li> <li>Run cluster-wide burn-in and stress testing.</li> <li>Validate GPU-to-GPU and node-to-node performance (NCCL, RDMA, GPUDirect).</li> <li>Troubleshoot hardware, firmware, and fabric-level issues.</li> </ul> <h3><strong>Automation &amp; Process</strong></h3> <ul> <li>Contribute to automation for provisioning and cluster validation.</li> <li>Improve deployment playbooks and documentation.</li> <li>Identify reliability issues early and drive corrective actions.</li> <li>Help turn ad hoc deployments into repeatable systems.</li> </ul> <h3><strong>Cross-Functional Collaboration</strong></h3> <ul> <li>Work closely with networking, systems software, and data center teams.</li> <li>Coordinate with hardware vendors to resolve bringup issues.</li> <li>Support rapid capacity expansion as we scale.</li> </ul> <h2><strong>What We’re Looking For</strong></h2> <h3><strong>Required</strong></h3> <ul> <li>7–8+ years in infrastructure engineering, hardware deployment, or data center operations.</li> <li>Hands-on experience deploying GPU servers (HGX/DGX or similar platforms).</li> <li>Experience with 
high-speed networking (InfiniBand, RoCE, Ethernet fabrics).</li> <li>Strong Linux systems knowledge.</li> <li>Experience troubleshooting distributed systems performance issues.</li> <li>Comfortable working onsite in data center environments as needed.</li> </ul> <h3><strong>Strongly Preferred</strong></h3> <ul> <li>Experience in AI/ML infrastructure or HPC environments.</li> <li>Familiarity with NCCL, CUDA, RDMA.</li> <li>Automation experience (Python, Ansible, Terraform, Bash).</li> <li>Experience in high-density power and cooling environments.</li> </ul> <h2><strong>What Success Looks Like</strong></h2> <ul> <li>Clusters are brought online quickly and correctly.</li> <li>Performance baselines meet or exceed expectations.</li> <li>Deployment processes become faster and more reliable over time.</li> </ul> <p>You help build the foundation for scaled infrastructure growth.</p><div class="content-conclusion"><p><em>For information on how Nscale handles candidate personal data, please see our Employee &amp; Candidate Privacy Notice: <a href="https://drive.google.com/file/d/1QK5Yg04WHD9K9IAtJgQWubJZC9oLvatK/view?usp=sharing" target="_blank">Here.</a></em></p></div>
