Staff HPC Systems Software Engineer
nscaleoperationsukltd
US · 2d ago
- Seniority
- Staff
About the role
<h2>About Nscale</h2>
<p>Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.</p>
<p>We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, in an environment where everyone is inspired to do their best work. If you join our team, you’ll contribute to building the technology that powers the future.</p>
<h2>About the Role</h2>
<p>We’re hiring a <strong><strong class="textBold">Staff HPC Systems Software Engineer</strong></strong> to define the technical direction and evolution of a core HPC platform domain at Nscale.</p>
<p>In this role, you will operate beyond a single team, shaping how multiple teams build, automate, and run <strong><strong class="textBold">Slurm-based capabilities</strong></strong> within Nscale’s wider <strong><strong class="textBold">cloud-native platform</strong></strong>. You’ll work across engineering boundaries to bring coherence to architecture, interfaces, lifecycle models, and operational approaches, while partnering closely with teams working on platform tooling, infrastructure APIs, identity systems, and Kubernetes-adjacent systems.</p>
<p>This is a high-impact staff-level role for someone who combines deep hands-on software engineering with strong systems judgement. Your work will help ensure Nscale’s HPC services are robust, supportable, and maintainable, while creating leverage through shared patterns, reusable implementations, and clear technical direction across ambiguous, business-critical problem spaces.</p>
<h2>What you'll be doing</h2>
<p><strong><strong class="textBold">Domain Architecture & Technical Direction</strong></strong></p>
<ul>
<li value="1"><strong><strong class="textBold">Own</strong></strong> and evolve the technical direction for a defined HPC systems domain, such as Slurm platform architecture, scheduler integrations, cluster lifecycle, workload environments, or service automation.</li>
<li value="2"><strong><strong class="textBold">Make</strong></strong> architectural decisions that balance software quality, operational realities, customer needs, and long-term maintainability.</li>
<li value="3"><strong><strong class="textBold">Define</strong></strong> how proven Slurm implementations should be packaged, automated, and exposed as a service.</li>
<li value="4"><strong><strong class="textBold">Resolve</strong></strong> ambiguity around ownership, interfaces, lifecycle boundaries, and operating models across teams.</li>
<li value="5"><strong><strong class="textBold">Act</strong></strong> as the technical escalation point for the most complex issues within the domain.</li>
</ul>
<p><strong><strong class="textBold">Cross-Team Engineering Leverage</strong></strong></p>
<ul>
<li value="1"><strong><strong class="textBold">Establish</strong></strong> shared patterns and standards for automation, service lifecycle management, observability, reliability, and supportability across the HPC platform.</li>
<li value="2"><strong><strong class="textBold">Drive</strong></strong> cross-team design for integrations between Slurm, Kubernetes-adjacent systems, infrastructure APIs, identity systems, and platform tooling.</li>
<li value="3"><strong><strong class="textBold">Create</strong></strong> reusable modules, automation, deployment patterns, and reference implementations that increase engineering leverage.</li>
<li value="4"><strong><strong class="textBold">Identify</strong></strong> and correct avoidable technical divergence, duplicated effort, and fragile operating models.</li>
<li value="5"><strong><strong class="textBold">Ensure</strong></strong> domain designs reflect the realities of GPU scheduling, HPC networking, performance isolation, and production operations.</li>
</ul>
<p><strong><strong class="textBold">Delivery, Reliability & Influence</strong></strong></p>
<ul>
<li value="1"><strong><strong class="textBold">Lead</strong></strong> technically critical initiatives spanning <strong><strong class="textBold">2–4 teams</strong></strong> or a defined HPC platform area.</li>
<li value="2"><strong><strong class="textBold">Unblock</strong></strong> delivery by clarifying technical direction and reducing ambiguity in complex system design problems.</li>
<li value="3"><strong><strong class="textBold">Contribute</strong></strong> hands-on where needed to de-risk or accelerate critical work.</li>
<li value="4"><strong><strong class="textBold">Influence</strong></strong> engineering teams without formal authority through strong judgement, design clarity, and practical solutions.</li>
<li value="5"><strong><strong class="textBold">Partner</strong></strong> with adjacent cloud-native software engineers so HPC implementations build on shared platform patterns rather than separate ones.</li>
</ul>
<h2>KPIs</h2>
<ul>
<li value="1"><strong><strong class="textBold">Technical direction across a defined HPC domain</strong></strong></li>
<li value="2"><strong><strong class="textBold">Delivery of critical initiatives across 2–4 teams</strong></strong></li>
<li value="3"><strong><strong class="textBold">Reduction in technical divergence and duplicated effort</strong></strong></li>
<li value="4"><strong><strong class="textBold">Reliability and supportability of Slurm-based HPC services</strong></strong></li>
</ul>
<h2>About You</h2>
<ul>
<li value="1"><strong><strong class="textBold">Extensive experience</strong></strong> designing and building production software and automation for HPC systems, especially <strong><strong class="textBold">Slurm-based environments</strong></strong>.</li>
<li value="2">Strong track record of writing <strong><strong class="textBold">maintainable, testable, and resilient software</strong></strong> in <strong><strong class="textBold">Go, Python, or similar languages</strong></strong>.</li>
<li value="3">Proven ability to <strong><strong class="textBold">define technical direction</strong></strong> across a domain spanning multiple teams or services.</li>
<li value="4">Strong understanding of <strong><strong class="textBold">Slurm internals, scheduler behaviour, cluster lifecycle concerns, and operational trade-offs</strong></strong>.</li>
<li value="5">Strong practical understanding of <strong><strong class="textBold">GPU-backed infrastructure</strong></strong> and <strong><strong class="textBold">HPC networking</strong></strong>, including <strong><strong class="textBold">InfiniBand, RoCE, RDMA</strong></strong>, and performance-sensitive workload characteristics.</li>
<li value="6">Experience integrating <strong><strong class="textBold">HPC systems with cloud-native platforms, APIs, or service delivery models</strong></strong>.</li>
<li value="7">Experience creating engineering leverage through <strong><strong class="textBold">standards, reusable patterns, shared tooling, and architectural clarity</strong></strong>.</li>
<li value="8">Strong judgement in balancing <strong><strong class="textBold">short-term delivery</strong></strong> with <strong><strong class="textBold">long-term platform health and supportability</strong></strong>.</li>
<li value="9">Strong written and verbal communication skills, with the ability to align multiple teams around a coherent technical direction.</li>
<li value="10">Experience with other schedulers or batch systems such as <strong><strong class="textBold">Kueue</strong></strong> is valuable.</li>
</ul>
<h2>What we can offer you</h2>
<p>At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.</p>
<ul>
<li>Highly competitive US compensation package (base + bonus + equity), with performance reviews every 12 months. 🚀</li>
<li>Join one of the fastest-growing AI infrastructure companies — your chance to directly shape how global AI capacity is planned and deployed. ✨</li>
<li>Expect a dynamic progression plan tailored to your ambitions. Grow by leading critical cross-functional initiatives and shaping capital strategy — always with our full support.</li>
<li>Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.</li>
</ul>
<h2>Equal Opportunities Statement</h2>
<p>We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.</p>
<p>If there’s anything we can do to accommodate your specific situation, please let us know.</p>
<p>The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.</p>
<h2>Salary Range</h2>
<p>The range below reflects the base salary for the position. Actual compensation may vary based on job-related factors such as skill set, experience, education, and location. In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs. Nscale may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.</p>
<div class="content-pay-transparency"><div class="pay-input"><div class="pay-range"><span>$225,000</span><span class="divider">–</span><span>$275,000 USD</span></div></div></div>
<div class="content-conclusion"><p><em>For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice: <a href="https://drive.google.com/file/d/1QK5Yg04WHD9K9IAtJgQWubJZC9oLvatK/view?usp=sharing" target="_blank">Here.</a></em></p></div>
Perks & benefits
- Paid Time Off
- Equity Compensation