Senior HPC Infrastructure Engineer
firmus - Sydney, Australia
Posted Feb 9, 2026
Benefits
- Parental leave
- Not verified
- Non-birth-parent leave
- Not verified
- Family-building benefits
- Fertility benefits: Not verified
- Adoption assistance: Not verified
- Surrogacy assistance: Not verified
- Mental health support
- Not verified
- Relocation assistance
- Not verified
- Childcare support
- Not verified
- Learning budget
- Not verified
- Verification
- Not verified
- Salary
- Not verified
Schedule
- Shift type
- Not verified
- Weekend work
- Not verified
Application
- Cover letter
- Not verified
- Assessment
- Not verified
- Deadline
- Not stated
Where they hire
State eligibility is not yet verified.
About this role
Senior HPC Infrastructure Engineer
Sydney, Australia
Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation. You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
- Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
- Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
- Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
- Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.
- Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
- Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
- Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
- Document architecture designs, operational procedures, and performance results.
- Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
- Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
- Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.
Skills & Experience
- Bachelor's or Master's
Read the full description at job-boards.greenhouse.io. FewerJobs shows a source-linked preview and links to the original posting.
Apply link not verified; last-live date unavailable.
What verified means
Verified means a displayed claim has a recorded source field, a source URL when available, and a timestamp showing when FewerJobs checked or enriched the evidence.