Senior AI Infrastructure Engineer (Virtualisation)
Firmus Technologies
Australia · 1d ago
- Seniority: Senior
About the role
<p><strong>Role Summary</strong></p>
<p>Firmus is seeking a highly skilled and driven Senior Engineer to play a key role in designing, building, and operating software-defined infrastructure, including high-performance AI storage platforms. You will help evolve our Software Defined Infrastructure by building reliable, scalable solutions that power some of the world’s largest and most innovative AI workloads.</p>
<p>You will be instrumental in ensuring the stability, performance, and continuous improvement of our mission-critical control plane and storage infrastructure.</p>
<p> </p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li>Design and implement a highly scalable, multi-tenant control plane that supports Firmus’ growing AI and infrastructure needs.</li>
<li>Contribute to the development of exabyte-scale, S3-compatible object storage, distributed file systems, and high-performance filesystems.</li>
<li>Work with bare-metal provisioning tools such as Base Command Manager, Warewulf, Ironic, MaaS, and similar platforms.</li>
<li>Apply a deep understanding of operating systems, computer networks, software-defined storage, and high-performance applications.</li>
<li>Work with technologies including RDMA, GPU Direct Storage, RoCE, InfiniBand, DPDK, Ceph, Weka, DAOS, and others.</li>
<li>Collaborate with operations teams to monitor, analyse, and optimise internal clusters and storage platforms.</li>
<li>Document architecture designs, operational procedures, and performance results.</li>
<li>Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.</li>
<li>Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.</li>
<li>Apply knowledge of Kubernetes and composable storage clusters.</li>
<li>Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance and support large-scale GPU cluster commissioning.</li>
</ul>
<p> </p>
<p><strong>Skills & Experience</strong></p>
<ul>
<li>Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.</li>
<li>6–10 years of experience in infrastructure engineering and/or storage engineering.</li>
<li>Hands-on experience with bare-metal provisioning.</li>
<li>Ability to operate software-defined storage platforms such as Ceph, Weka, Vast Data, DAOS, or Lustre.</li>
<li>Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architectures.</li>
<li>Strong debugging and problem-solving skills in distributed, high-performance environments.</li>
<li>Practical Linux systems engineering experience (kernel, cgroups, system services, networking, drivers).</li>
<li>Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.</li>
<li>Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.</li>
<li>Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.</li>
<li>Excellent documentation skills with strong attention to detail.</li>
<li>Experience participating in an on-call rotation supporting production services.</li>
<li>Proactive self-starter with a drive for continuous technical improvement.</li>
</ul>
<p> </p>
<p><strong>Key Competencies</strong></p>
<ul>
<li>Systems Architecture: Ability to design and integrate virtualisation, bare-metal, GPU, storage, and Kubernetes/Slurm platforms.</li>
<li>Infrastructure Automation: Expertise in automated provisioning and lifecycle management of hardware and clusters.</li>
<li>GPU and HPC Performance: Strong understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.</li>
<li>Technical Communication: Ability to communicate complex technical concepts effectively across engineering and operations teams.</li>
<li>Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.</li>
</ul>
<p> </p>
<p><strong>Success Metrics</strong></p>
<ul>
<li>Reliable provisioning and benchmarking of scalable, high-performance storage systems.</li>
<li>Performance validation and optimisation.</li>
<li>Operational efficiency improvements.</li>
<li>High-quality documentation and effective knowledge transfer.</li>
</ul>
<p> </p>
<p><strong>Location & Reporting</strong></p>
<ul>
<li>Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)</li>
<li>Reports to the Senior Manager, Software Defined Infrastructure</li>
</ul>
<p> </p>
<p><strong>Employment Basis</strong></p>
<p>Full-time</p>
<p> </p>
<p><strong>Diversity</strong></p>
<p>At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.</p>
<p>Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.</p>
<p> </p>