Senior AI Infrastructure Engineer - Training Platform

Scale AI - San Francisco, CA; Seattle, WA; New York, NY

Posted Apr 28, 2026

Benefits

Parental leave: Not verified
Non-birth-parent leave: Not verified
Family-building benefits:
Fertility benefits: Not verified
Adoption assistance: Not verified
Surrogacy assistance: Not verified
Mental health support: Not verified
Relocation assistance: Not verified
Childcare support: Not verified
Learning budget: Not verified
Verification: Not verified
Salary: Not verified (source not recorded; timestamp not recorded)
401(k) match: Not verified

Schedule

Shift type: Not verified
Weekend work: Not verified

Application

Cover letter: Not verified
Assessment: Not verified
Deadline: Not stated

Where they hire

State eligibility is not yet verified.

About this role

As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world's most advanced models.

The ideal candidate is a systems expert who thrives on solving the orchestration, networking, and reliability challenges that emerge at massive scale. You will partner closely with researchers to build a seamless, resilient environment that transforms raw compute into breakthrough AI.

You will:

- Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
- Design and implement scheduling primitives to optimize the lifecycle of training jobs.
- Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
- Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
- Work closely with Finance and Procurement teams to drive our capacity planning process.
- Participate in our team's on-call process to ensure the availability of our services.
- Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment.

Ideally you'd

Read the full description at job-boards.greenhouse.io. FewerJobs shows a source-linked preview and links to the original posting.

Apply at job-boards.greenhouse.io

Apply link not verified; last-live date unavailable.

What verified means

Verified means a displayed claim has a recorded source field, a source URL when available, and a timestamp showing when FewerJobs checked or enriched the evidence.

Related jobs

Hardware System and Board Failure Analysis Technical Lead
Cisco - Milpitas, California, US

Sr. Staff System Architect
Northrop Grumman - United States-Illinois-Rolling Meadows

Senior Project Manager - Product Implementation
Deluxe CORP - 2 Locations

Sr. Staff Product Operations Manager, Product Lifecycle (Remote)
Cisco - Coral Gables, Florida, US
