Staff Machine Learning Engineer, GenAI Platform

🇺🇸Reddit

Remote - United StatesRemote0 applicants

Full TimeLead

Job Description

Reddit is a community of communities. It’s built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 121 million daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit www.redditinc.com . Who We Are: The Machine Learning Platform team at Reddit is a high-impact organization that owns the infrastructure powering recommendations, content discovery, and user quantification. As Generative AI becomes a strategic priority for Reddit, we are expanding our platform to meet the unique demands of foundation models. We are building the foundational infrastructure to support massive-scale, long-running LLM workloads, enabling teams across Growth, Ads, Feeds, and Core ML to move fast on shared, robust GenAI infrastructure. What You’ll Do: As a Staff Software Engineer on the Machine Learning Platform team, you will be a key technical leader architecting and scaling our Generative AI and LLM platform capabilities. Training and deploying foundation models places unprecedented demands on our systems. You will define the technical strategy and build the core infrastructure that enables machine learning engineers and researchers to seamlessly train, evaluate, and iterate on large language models at Reddit scale. Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors. Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters. You will tackle challenges related to automated recovery, cluster-scale health monitoring, and advanced checkpointing to ensure optimal compute efficiency. Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO). You will abstract away the complexity of distributed tra

Read original posting

Required Skills

RustRRESTMachine LearningLLM