FewerJobs.

Software Engineer, Frontier Clusters Infrastructure

OpenAI - San Francisco, California, United States

Posted Nov 7, 2024

Benefits

Parental leave: Not verified
Non-birth-parent leave: Not verified
Family-building benefits
  • Fertility benefits: Not verified
  • Adoption assistance: Not verified
  • Surrogacy assistance: Not verified
Mental health support: Not verified
Relocation assistance: Not verified
Childcare support: Not verified
Learning budget: Not verified
Verification: Not verified
Salary: Not verified
401(k) match: Not verified

Schedule

Shift type: Not verified
Weekend work: Not verified

Application

Cover letter: Not verified
Assessment: Not verified
Deadline: Not stated

Where they hire

State eligibility is not yet verified.

About this role

About the Team

The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world that OpenAI uses for its most cutting-edge model training. We take data center designs, turn them into real, working systems, and build any software needed to run large-scale frontier model training. Our mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient during the training of frontier models.

About the Role

We are looking for engineers to operate the next generation of compute clusters that power OpenAI's frontier research. This role blends distributed systems engineering with hands-on infrastructure work in our largest data centers. You will scale Kubernetes clusters to massive size, automate bare-metal bring-up, and build the software layer that hides the complexity of a vast number of nodes across multiple data centers.

You will work at the intersection of hardware and software, where speed and reliability are critical. Expect to manage fast-moving operations, quickly diagnose and fix issues when things are on fire, and continuously raise the bar for automation and uptime.

In this role, you will:
  • Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
  • Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
  • Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
  • Improve operational metrics such as reducing cluster

Read the full description at jobs.ashbyhq.com. FewerJobs shows a source-linked preview and links to the original posting.

Apply at jobs.ashbyhq.com

Apply link not verified; last-live date unavailable.

What verified means

Verified means a displayed claim has a recorded source field, a source URL when available, and a timestamp showing when FewerJobs checked or enriched the evidence.
