Senior AI Infrastructure Engineer - Training Platform
Scale AI - San Francisco, CA; Seattle, WA; New York, NY
Posted Apr 28, 2026
Benefits
- Parental leave: Not verified
- Non-birth-parent leave: Not verified
- Family-building benefits
- Fertility benefits: Not verified
- Adoption assistance: Not verified
- Surrogacy assistance: Not verified
- Mental health support: Not verified
- Relocation assistance: Not verified
- Childcare support: Not verified
- Learning budget: Not verified
- Verification: Not verified
- Salary: Not verified (source not recorded; timestamp not recorded)
- 401(k) match: Not verified
Schedule
- Shift type: Not verified
- Weekend work: Not verified
Application
- Cover letter: Not verified
- Assessment: Not verified
- Deadline: Not stated
Where they hire
State eligibility is not yet verified.
About this role
As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world's most advanced models.

The ideal candidate is a systems expert who thrives on solving the orchestration, networking, and reliability challenges that emerge at massive scale. You will partner closely with researchers to build a seamless, resilient environment that transforms raw compute into breakthrough AI.

You will:
- Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
- Design and implement scheduling primitives to optimize the lifecycle of training jobs.
- Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
- Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
- Work closely with Finance and Procurement teams to drive our capacity planning process.
- Participate in our team's on call process to ensure the availability of our services.
- Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment.

Ideally you'd
Read the full description at job-boards.greenhouse.io. FewerJobs shows a source-linked preview and links to the original posting.
Apply link not verified; last-live date unavailable.
What verified means
Verified means a displayed claim has a recorded source field, a source URL when available, and a timestamp showing when FewerJobs checked or enriched the evidence.