Senior Software Engineer - Together Cloud Infrastructure
Together AI - San Francisco
Posted Jun 3, 2025
Benefits
- Parental leave: Not verified
- Non-birth-parent leave: Not verified
- Family-building benefits
  - Fertility benefits: Not verified
  - Adoption assistance: Not verified
  - Surrogacy assistance: Not verified
- Mental health support: Not verified
- Relocation assistance: Not verified
- Childcare support: Not verified
- Learning budget: Not verified
- Verification: Not verified
- Salary: Not verified (source not recorded; timestamp not recorded)
- 401(k) match: Not verified
Schedule
- Shift type: Not verified
- Weekend work: Not verified
Application
- Cover letter: Not verified
- Assessment: Not verified
- Deadline: Not stated
Where they hire
State eligibility is not yet verified.
About this role
Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure.

As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform - a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and enables state-of-the-art ML practitioners with self-serve AI cloud services, such as on-demand + managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world.

Responsibilities
- Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning.
- Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs.
- Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining.
- Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining.
- Perform architecture and research work for decentralized AI workloads
- Work on the core, open-source Together AI platform
- Create services, tools, and developer documentation
- Create testing frameworks for robustness and fault-tolerance

To be successful, you'll need to be deeply technical and possess excellent communication, collaboration, and
Read the full description at job-boards.greenhouse.io. FewerJobs shows a source-linked preview and links to the original posting.
Apply link not verified; last-live date unavailable.
What verified means
Verified means a displayed claim has a recorded source field, a source URL when available, and a timestamp showing when FewerJobs checked or enriched the evidence.