Senior Systems Engineer
nscaleoperationsukltd
AMER · 3d ago
- Seniority
- Senior
About the role
<h1><strong>Senior Deployment Engineer – GPU Infrastructure Bringup</strong></h1>
<p><strong>Location:</strong> United States (Travel Required)<br> <strong>Team:</strong> Infrastructure<br> <strong>Reports to:</strong> Head of Infrastructure</p>
<h2><strong>About Us</strong></h2>
<p>We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable GPU clusters purpose-built for large-scale AI training and inference.</p>
<p>As a startup, we operate with urgency, ownership, and a bias toward action. We are assembling the foundational infrastructure that will power frontier AI workloads—and we’re looking for engineers who want to build it from zero to scale.</p>
<h2><strong>The Role</strong></h2>
<p>We are hiring a <strong>Senior Deployment Engineer</strong> to lead hands-on bringup of GPU clusters across our data center environments. You will own the execution of node, rack, and network deployment, ensuring clusters are validated, performant, and production-ready.</p>
<p>This role is deeply technical and execution-focused. You will be in the details—cabling racks, validating firmware, tuning fabrics, debugging performance—and helping us build repeatable processes as we scale.</p>
<h2><strong>What You’ll Do</strong></h2>
<h3><strong>Cluster Deployment & Bringup</strong></h3>
<ul>
<li>Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.<br><br></li>
<li>Validate BIOS/BMC/firmware configurations and GPU health.<br><br></li>
<li>Perform rack-level integration including power, cabling, and airflow validation.<br><br></li>
<li>Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet).<br><br></li>
</ul>
<h3><strong>Network & Performance Validation</strong></h3>
<ul>
<li>Configure and validate leaf/spine network connectivity.<br><br></li>
<li>Run cluster-wide burn-in and stress testing.<br><br></li>
<li>Validate GPU-to-GPU and node-to-node performance (NCCL, RDMA, GPUDirect).<br><br></li>
<li>Troubleshoot hardware, firmware, and fabric-level issues.<br><br></li>
</ul>
<h3><strong>Automation & Process</strong></h3>
<ul>
<li>Contribute to automation for provisioning and cluster validation.<br><br></li>
<li>Improve deployment playbooks and documentation.<br><br></li>
<li>Identify reliability issues early and drive corrective actions.<br><br></li>
<li>Help turn ad hoc deployments into repeatable systems.<br><br></li>
</ul>
<h3><strong>Cross-Functional Collaboration</strong></h3>
<ul>
<li>Work closely with networking, systems software, and data center teams.<br><br></li>
<li>Coordinate with hardware vendors to resolve bringup issues.<br><br></li>
<li>Support rapid capacity expansion as we scale.<br><br></li>
</ul>
<h2><strong>What We’re Looking For</strong></h2>
<h3><strong>Required</strong></h3>
<ul>
<li>5–8+ years in infrastructure engineering, hardware deployment, or data center operations.<br><br></li>
<li>Hands-on experience deploying GPU servers (HGX/DGX or similar platforms).<br><br></li>
<li>Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics).<br><br></li>
<li>Strong Linux systems knowledge.<br><br></li>
<li>Experience troubleshooting distributed systems performance issues.<br><br></li>
<li>Comfortable working onsite in data center environments as needed.<br><br></li>
</ul>
<h3><strong>Strongly Preferred</strong></h3>
<ul>
<li>Experience in AI/ML infrastructure or HPC environments.<br><br></li>
<li>Familiarity with NCCL, CUDA, RDMA.<br><br></li>
<li>Automation experience (Python, Ansible, Terraform, Bash).<br><br></li>
<li>Experience in high-density power and cooling environments.<br><br></li>
</ul>
<h2><strong>What Success Looks Like</strong></h2>
<ul>
<li>Clusters are brought online quickly and correctly.<br><br></li>
<li>Performance baselines meet or exceed expectations.<br><br></li>
<li>Deployment processes become faster and more reliable over time.<br><br></li>
</ul>
<p>You help build the foundation for scaled infrastructure growth.</p>
<div class="content-conclusion"><p><em>For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice: <a href="https://drive.google.com/file/d/1QK5Yg04WHD9K9IAtJgQWubJZC9oLvatK/view?usp=sharing" target="_blank">Here.</a></em></p></div>