Senior AI Infrastructure Engineer (Virtualisation)

Firmus Technologies

Australia · 1d ago

Seniority: Senior

About the role

Role Summary

Firmus is seeking a highly skilled and driven Senior Engineer to play a key role in designing, building, and operating software-defined infrastructure, including high-performance AI storage platforms. You will help evolve our Software Defined Infrastructure by building reliable, scalable solutions that power some of the world’s largest and most innovative AI workloads. You will be instrumental in ensuring the stability, performance, and continuous improvement of our mission-critical control plane and storage infrastructure.

Key Responsibilities

<ul>
<li>Design and implement a highly scalable, multi-tenant control plane that supports Firmus’ growing AI and infrastructure needs.</li>
<li>Contribute to the development of exabyte-scale, S3-compatible object storage, distributed file systems, and high-performance filesystems.</li>
<li>Work with bare-metal provisioning tools such as Base Command Manager, Warewulf, Ironic, MaaS, and similar platforms.</li>
<li>Apply a deep understanding of operating systems, computer networks, software-defined storage, and high-performance applications.</li>
<li>Work with technologies including RDMA, GPU Direct Storage, RoCE, InfiniBand, DPDK, Ceph, Weka, DAOS, and others.</li>
<li>Collaborate with operations teams to monitor, analyse, and optimise internal clusters and storage platforms.</li>
<li>Document architecture designs, operational procedures, and performance results.</li>
<li>Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.</li>
<li>Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.</li>
<li>Apply knowledge of Kubernetes and composable storage clusters.</li>
<li>Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks to optimise AI workload performance for large-scale GPU cluster commissioning.</li>
</ul>

Skills & Experience

<ul>
<li>Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.</li>
<li>6–10 years of experience in infrastructure engineering and/or storage engineering.</li>
<li>Hands-on experience with bare-metal provisioning.</li>
<li>Ability to operate software-defined storage platforms such as Ceph, Weka, Vast Data, DAOS, or Lustre.</li>
<li>Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architectures.</li>
<li>Strong debugging and problem-solving skills in distributed, high-performance environments.</li>
<li>Practical Linux systems engineering experience (kernel, cgroups, system services, networking, drivers).</li>
<li>Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.</li>
<li>Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.</li>
<li>Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.</li>
<li>Excellent documentation skills with strong attention to detail.</li>
<li>Experience participating in an on-call rotation supporting production services.</li>
<li>Proactive self-starter with a drive for continuous technical improvement.</li>
</ul>

Key Competencies

<ul>
<li>Systems Architecture: Ability to design and integrate virtualisation, bare-metal, GPU, storage, and Kubernetes/Slurm platforms.</li>
<li>Infrastructure Automation: Expertise in automated provisioning and lifecycle management of hardware and clusters.</li>
<li>GPU and HPC Performance: Strong understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.</li>
<li>Technical Communication: Ability to communicate complex technical concepts effectively across engineering and operations teams.</li>
<li>Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.</li>
</ul>

Success Metrics

<ul>
<li>Reliable provisioning and benchmarking of scalable, high-performance storage systems.</li>
<li>Performance validation and optimisation.</li>
<li>Operational efficiency improvements.</li>
<li>High-quality documentation and effective knowledge transfer.</li>
</ul>

Location & Reporting

<ul>
<li>Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)</li>
<li>Reporting to Senior Manager, Software Defined Infrastructure</li>
</ul>

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.
