Senior HPC Infrastructure Engineer

Firmus Technologies

Sydney · 1d ago

Seniority: Senior

About the role

Role Summary

Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation. You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

Key Responsibilities

- Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
- Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
- Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
- Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.
- Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
- Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
- Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
- Document architecture designs, operational procedures, and performance results.
- Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
- Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
- Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.

Skills & Experience

- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
- Deep knowledge of Kubernetes internals, including CRDs, controllers, operators, and cluster lifecycle management.
- Strong understanding of Slurm configuration and compiling AI and HPC applications.
- Strong understanding of GPU systems (NVIDIA H100/H200 SXM platforms), CUDA/NCCL, and GPU topology (NVLink, NVSwitch, PCIe).
- Familiarity with container runtimes for compute workloads, including Docker, Enroot, Singularity, and Podman.
- Experience with benchmarking and performance validation for AI, HPC, or distributed training workloads.
- Practical Linux systems engineering experience, including kernel, cgroups, system services, networking, and drivers.
- Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
- Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
- Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
- Excellent documentation skills with strong attention to detail.
- Experience participating in an on-call rotation supporting production services.
- Proactive self-starter with a drive for continuous technical improvement.

Key Competencies

- Systems Architecture: Ability to design and integrate bare-metal, GPU, RDMA, and Kubernetes/Slurm platforms.
- Infrastructure Automation: Skilled in automated provisioning and lifecycle management of hardware and clusters.
- GPU and HPC Performance: Understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.
- Technical Communication: Ability to communicate technical concepts effectively across diverse engineering and operations teams.
- Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.

Success Metrics

- Reliable provisioning of Kubernetes and Slurm AI clusters.
- Performance validation and optimisation.
- Improved operational efficiency.
- High-quality documentation and effective knowledge transfer.

Location & Reporting

- Australia (Sydney, NSW or Launceston, TAS)
- Reporting to Senior Manager, Software Defined Infrastructure

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.
