Software Engineer, Platform Systems
OpenAI - San Francisco, California, United States
Posted Jan 23, 2026
Benefits
- Parental leave
- Not verified
- Non-birth-parent leave
- Not verified
- Family-building benefits
-
- Fertility benefits: Not verified
- Adoption assistance: Not verified
- Surrogacy assistance: Not verified
- Mental health support
- Not verified
- Relocation assistance
- Not verified
- Childcare support
- Not verified
- Learning budget
- Not verified
- Verification
- Not verified
- Salary
- Not verified
- 401(k) match
- Not verified
Was this benefit information wrong? Tell us.
Schedule
- Shift type
- Not verified
- Weekend work
- Not verified
Application
- Cover letter
- Not verified
- Assessment
- Not verified
- Deadline
- Not stated
Where they hire
State eligibility is not yet verified.
About this role
Software Engineer, Platform Systems San Francisco, California, United States About the Team The Platform Systems team at OpenAI operates at the intersection of cutting-edge AI and large-scale distributed systems. We build the engineering and research infrastructure required to train OpenAI's flagship models on some of the world's largest, custom-built supercomputers. Our team develops core model training software and works deep in the stack - spanning collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we build are foundational to OpenAI's research velocity, enabling reliable, efficient training at frontier scale. We collaborate closely with researchers across the organization, continuously incorporating learnings from across OpenAI into the evolution of our training platform. About the Role As a Software Engineer, Platform Systems, you will design and build distributed systems that provide visibility into large-scale training workloads and help operate them reliably at scale. You'll work on failure detection, tracing, and observability systems that identify slow or faulty nodes, surface performance bottlenecks, and help engineers understand and optimize massive distributed training jobs. This infrastructure is critical to operating OpenAI's training stack and is actively evolving to support new use cases and increasingly complex workloads. This role sits at the core of our training infrastructure, blending systems engineering, performance analysis, and large-scale debugging. In This Role, You Will - Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs - Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system
Read the full description at jobs.ashbyhq.com. FewerJobs shows a source-linked preview and links to the original posting.
Apply link not verified; last-live date unavailable.
What verified means
Verified means a displayed claim has a recorded source field, a source URL when available, and a timestamp showing when FewerJobs checked or enriched the evidence.
Related jobs
-
Systems Engineer - (Execution) - Level 3/4
Northrop Grumman - United States-Alabama-Huntsville
-
Business Analyst (Top Secret cleared)
ICF International INC - Washington, DC
-
Engineering Project Specialist II (Full Time) - United State
Cisco - San Jose, California, US
-
Automation AI Ops Engineer
Cisco - 2 Locations