Senior HPC Infrastructure Engineer
Firmus Technologies
Sydney · 1d ago
- Seniority: Senior
About the role
<p><strong>Role Summary</strong></p>
<p>Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.</p>
<p>You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.</p>
<p> </p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li>Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.</li>
<li>Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.</li>
<li>Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.</li>
<li>Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.</li>
<li>Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.</li>
<li>Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.</li>
<li>Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.</li>
<li>Document architecture designs, operational procedures, and performance results.</li>
<li>Collaborate with L2 SREs, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.</li>
<li>Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.</li>
<li>Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.</li>
<li>Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.</li>
</ul>
<p> </p>
<p><strong>Skills & Experience</strong></p>
<ul>
<li>Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.</li>
<li>Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.</li>
<li>Deep knowledge of Kubernetes internals, including CRDs, controllers, operators, and cluster lifecycle management.</li>
<li>Strong understanding of Slurm configuration, and experience compiling AI and HPC applications.</li>
<li>Strong understanding of GPU systems (NVIDIA H100/H200 SXM platforms), CUDA/NCCL, and GPU topology (NVLink, NVSwitch, PCIe).</li>
<li>Familiarity with container runtimes for compute workloads, including Docker, Enroot, Singularity, and Podman.</li>
<li>Experience with benchmarking and performance validation for AI, HPC, or distributed training workloads.</li>
<li>Practical Linux systems engineering experience, including kernel, cgroups, system services, networking, and drivers.</li>
<li>Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.</li>
<li>Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.</li>
<li>Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.</li>
<li>Excellent documentation skills with strong attention to detail.</li>
<li>Experience participating in an on-call rotation supporting production services.</li>
<li>Proactive self-starter with a drive for continuous technical improvement.</li>
</ul>
<p> </p>
<p><strong>Key Competencies</strong></p>
<ul>
<li>Systems Architecture: Ability to design and integrate bare-metal, GPU, RDMA, and Kubernetes/Slurm platforms.</li>
<li>Infrastructure Automation: Skilled in automated provisioning and lifecycle management of hardware and clusters.</li>
<li>GPU and HPC Performance: Understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.</li>
<li>Technical Communication: Ability to communicate technical concepts effectively across diverse engineering and operations teams.</li>
<li>Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.</li>
</ul>
<p> </p>
<p><strong>Success Metrics</strong></p>
<ul>
<li>Reliable provisioning of Kubernetes and Slurm AI clusters.</li>
<li>Performance validation and optimisation.</li>
<li>Improved operational efficiency.</li>
<li>High-quality documentation and effective knowledge transfer.</li>
</ul>
<p> </p>
<p><strong>Location & Reporting</strong></p>
<ul>
<li>Australia (Sydney, NSW or Launceston, TAS)</li>
<li>Reporting to Senior Manager, Software Defined Infrastructure</li>
</ul>
<p> </p>
<p><strong>Employment Basis</strong></p>
<p>Full-time</p>
<p> </p>
<p><strong>Diversity</strong></p>
<p>At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.</p>
<p>Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.</p>