Scale AI, Inc.
Full-time|$218.4K/yr - $273K/yr|On-site|San Francisco, CA; Seattle, WA; New York, NY
Key Responsibilities:
- Design, profile, and enhance our training and inference framework.
- Work collaboratively with ML teams to expedite their research and development processes, empowering them to create next-generation models and data curation strategies.
- Investigate and incorporate cutting-edge technologies to refine our ML system.
Preferred Qualifications:
- A strong enthusiasm for system optimization.
- Hands-on experience with multi-node LLM training and inference.
- Proven experience in developing large-scale distributed ML systems.
- Robust software engineering capabilities, with proficiency in frameworks and tools like CUDA, PyTorch, Transformers, Flash Attention, etc.
- Excellent written and verbal communication skills with the ability to thrive in a cross-functional team environment.
Desirable Skills:
- Demonstrated expertise in post-training methodologies and/or innovative use cases for large language models, including instruction tuning, RLHF (Reinforcement Learning from Human Feedback), tool usage, reasoning, agents, and multimodal applications.
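For illustration, a minimal sketch of the kind of profiling work this role involves, assuming a PyTorch stack; the model, batch shape, and optimizer below are placeholders rather than details from the posting:

```python
# Illustrative sketch: profiling one training step with torch.profiler,
# the kind of measurement used when optimizing a training/inference framework.
# The model, shapes, and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.randn(8, 128, 512, device=device)  # (batch, seq_len, d_model)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    out = model(batch)
    loss = out.float().pow(2).mean()  # dummy loss, for illustration only
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Show the operators that dominate the step.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

Sorting the resulting table by device time is a common first step for spotting the kernels worth optimizing or fusing.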
About the job
Join Scale AI's ML platform team (RLXF) as a Machine Learning Research Engineer, where you will play a pivotal role in developing our advanced distributed framework for training and inference of large language models. This platform is vital for enabling machine learning engineers, researchers, data scientists, and operators to conduct rapid and automated training, as well as evaluation of LLMs and data quality.
At Scale, we occupy a unique position in the AI landscape, serving as an essential provider of training and evaluation data along with comprehensive solutions for the entire ML lifecycle. You will collaborate closely with Scale's ML teams and researchers to enhance the foundational platform that underpins our ML research and development initiatives. Your contributions will be crucial in optimizing the platform to support the next generation of LLM training, inference, and data curation.
If you are passionate about driving the future of AI through groundbreaking innovations, we want to hear from you!
About Scale AI, Inc.
Scale AI is a leader in the AI sector, providing indispensable training and evaluation data as well as comprehensive end-to-end solutions for the machine learning lifecycle. Our platform empowers researchers and engineers to push the boundaries of AI technology.
About Sygaldry Technologies
Sygaldry Technologies develops quantum-accelerated AI servers in San Francisco, focusing on faster AI training and inference. By combining quantum technology with artificial intelligence, the team addresses challenges in computing costs and energy efficiency. Their AI servers integrate multiple qubit types within a fault-tolerant system, aiming for a balance of cost, scalability, and speed. The company values optimism, rigor, and a drive to solve complex problems in physics, engineering, and AI.
Role Overview: ML Infrastructure Engineer
The ML Infrastructure Engineer joins the AI & Algorithms team, which includes research scientists, applied mathematicians, and quantum algorithm specialists. This role centers on building and maintaining the compute infrastructure that powers advanced research. The systems you build will support reliable GPU access, reproducible experiments, and scalable workloads, so researchers can focus on their core work without needing deep cloud expertise. Expect to design and manage compute platforms for a range of tasks, including quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation. These workloads span multiple cloud providers and on-premises GPU servers.
Key Responsibilities
- Develop compute abstractions for diverse workloads, such as GPU-accelerated simulations, distributed training, high-throughput CPU jobs, and interactive analyses using frameworks like PyTorch and JAX.
- Set up infrastructure to support experiment tracking and reproducibility.
- Create developer tools that make cloud computing feel local, streamlining environment setup, job submission, monitoring, and artifact management.
- Scale experiments from single-GPU prototypes to large, multi-node production runs.
Multi-Cloud GPU Orchestration
- Design orchestration strategies for workloads across multiple cloud providers, optimizing job routing for cost, availability, and capability.
- Monitor and improve cloud spending, keeping track of credit balances, burn rates, and expiration dates.
At Sciforium, we are at the forefront of AI infrastructure, dedicated to the development of advanced multimodal AI models and an innovative serving platform that emphasizes high efficiency. With substantial funding and direct collaboration from AMD, our team is rapidly expanding to create the complete stack for pioneering AI models and dynamic real-time applications.
Role Overview
This position provides a distinct opportunity to engage with the fundamental systems that drive Sciforium's multimodal AI models. You will play a crucial role in constructing the model serving platform, working with C++, Python, runtime execution, and distributed infrastructure to design a swift, dependable engine for real-time AI applications.
You will acquire practical experience in performance engineering, discover how large AI models are optimized and deployed at scale, and collaborate closely with ML researchers and seasoned systems engineers. If you thrive in low-level programming and are passionate about performance, this role offers both impactful contributions and significant growth opportunities.
About Delphina
In today's data-driven world, data scientists face numerous challenges, from tedious data wrangling to the slow process of model development. The reliance on outdated tools, such as Jupyter notebooks and Pandas, hinders progress and innovation.
At Delphina, we are dedicated to transforming the landscape of data science. Our mission is to empower users to harness data more effectively for insights into the present and future. As an AI Agent for Data Science, Delphina combines generative AI, large-scale optimization, and advanced infrastructure to automate the essential yet time-consuming tasks required to build robust machine learning models swiftly. We streamline the process of identifying relevant data, cleaning it, training models, and deploying pipelines.
Our team comprises seasoned professionals who have successfully led large data science and machine learning initiatives, established startups, and developed impactful enterprise ML tools. We are proud to be backed by leading AI investors, including Fei-Fei Li, Radical VC, and Costanoa VC.
Your Role
We are seeking a skilled ML Infrastructure Engineer to join our Technical Staff at Delphina. As a pivotal early team member, you will collaborate closely with our team to shape the product's direction and influence key technical decisions. Your contributions will significantly impact the technology, product development, and company culture.
Your responsibilities will include:
- Creating platforms that enable scientists, researchers, and developers to efficiently execute ML jobs at scale with modern technologies.
- Designing solutions to manage and support extensive data workflows through stages such as ingestion, indexing/mining, transformation, machine learning, and deployment.
- Establishing a consistent continuous integration/deployment framework to promote self-service application testing, deployment, and operations across cross-functional development teams.
- Leading and influencing cross-functional initiatives to align the team on preferred technologies and methodologies.
Full-time|$292K/yr - $417.2K/yr|Hybrid|San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote
About the Role:
The Machine Learning team at Tubi is at the forefront of transforming user experiences through cutting-edge technology. With the industry's largest inventory and a vast audience of millions, we are dedicated to solving complex challenges in recommendations, search, content understanding, and ad optimization, shaping the future of streaming.
We are on the lookout for a Director of Machine Learning Engineering and Infrastructure to spearhead a hybrid team that merges advanced ML engineering with exceptional infrastructure design. In this pivotal role, you will define the strategic vision and implementation for scaling our machine learning capabilities, ensuring our distributed systems and infrastructure can foster innovation on a grand scale. You will blend technical expertise with outstanding leadership to guide teams in delivering robust ML systems and high-performance distributed services.
As a Machine Learning Infrastructure Engineer at Physical Intelligence, you will play a vital role in enhancing and optimizing our training systems and core model code. You will take ownership of critical infrastructure for large-scale training, which includes managing GPU/TPU compute, orchestrating jobs, and developing reusable and efficient JAX training pipelines. Collaborating closely with researchers and model engineers, you will help transform innovative ideas into experiments and subsequently into production training runs. This position is hands-on and offers significant leverage at the intersection of machine learning, software engineering, and scalable infrastructure.
The Team
Our ML Infrastructure team is dedicated to supporting and accelerating Physical Intelligence's core modeling initiatives by building systems that ensure large-scale training is reliable, reproducible, and efficient. The team collaborates with research, data, and platform engineers to guarantee that models can seamlessly transition from prototype to production-grade training runs.
Key Responsibilities
- Manage training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, which includes scheduling, job management, checkpointing, and performance metrics/logging.
- Expand distributed training: Collaborate with researchers to efficiently scale JAX-based training across TPU and GPU clusters.
- Enhance performance: Profile and optimize memory usage, device utilization, throughput, and distributed synchronization to maximize efficiency.
- Facilitate rapid iteration: Develop abstractions for launching, monitoring, debugging, and reproducing experiments.
- Oversee compute resources: Ensure optimal allocation and utilization of cloud-based GPU/TPU compute resources while managing costs effectively.
- Collaborate with researchers: Translate research requirements into infrastructure capabilities and promote best practices for large-scale training.
- Contribute to core training code: Evolve the JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.
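For illustration, a minimal sketch of a data-parallel JAX training step of the kind this team scales across TPU and GPU clusters; the toy linear model, learning rate, and shapes are placeholder assumptions, not details from the posting:

```python
# Illustrative sketch of a data-parallel JAX training step: parameters are
# replicated across local devices, the batch is sharded along the leading
# axis, and gradients are averaged with an all-reduce (pmean).
# The toy model and hyperparameters are placeholders.
import functools

import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@functools.partial(jax.pmap, axis_name="batch")
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Keep replicas in sync by averaging gradients across devices.
    grads = jax.lax.pmean(grads, axis_name="batch")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
    return params, loss


n_devices = jax.local_device_count()
key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (4, 1)), "b": jnp.zeros((1,))}

# Replicate parameters onto every local device and shard a synthetic batch.
replicated_params = jax.device_put_replicated(params, jax.local_devices())
x = jax.random.normal(key, (n_devices, 32, 4))
y = jnp.zeros((n_devices, 32, 1))

replicated_params, loss = train_step(replicated_params, x, y)
print(loss)  # one loss value per device
```

In production settings the same pattern is typically wrapped with job scheduling, checkpointing, and metric logging of the kind described in the responsibilities above.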
Join Decagon as a Staff Software Engineer specializing in Machine Learning Infrastructure. In this role, you will play a crucial part in enhancing and optimizing our machine learning systems. You will collaborate with a talented team of engineers to build scalable and efficient infrastructure that supports our AI-driven initiatives.
As a key contributor, you will leverage your expertise in software engineering and machine learning to solve complex challenges and drive innovation. Your work will impact various projects and help shape the future of our technology.
At Runway ML, we are revolutionizing the intersection of art and science through innovative AI technology. Our mission is to build sophisticated world models that transcend traditional artificial intelligence limitations. We believe that to tackle the most pressing challenges—such as robotics, disease, and scientific breakthroughs—we need systems that can learn from experiences just like humans do. By simulating these experiences, we can expedite progress in ways that were previously unimaginable.
Our diverse and driven team consists of creative thinkers who are passionate about pushing boundaries and achieving the extraordinary. If you share this ambition and are eager to contribute to our groundbreaking work, we invite you to join us.
About the Role
We are open to hiring remotely across North America. We also have offices in NYC, San Francisco, and Seattle.
We are on the lookout for a highly skilled and intellectually inquisitive Technical Accounting Manager to be our go-to authority on intricate accounting issues. This position offers significant visibility and is ideal for a professional adept at interpreting complex accounting guidelines, formulating sound conclusions, and translating technical insights into practical accounting practices.
Join Our Team at Air Apps
At Air Apps, we are on a mission to revolutionize resource management through innovative technology. Founded in 2018 in Lisbon, Portugal, we have expanded our reach with offices in both Lisbon and San Francisco, boasting over 100 million downloads globally. Our vision is to create the world’s first AI-powered Personal & Entrepreneurial Resource Planner (PRP), and we are looking for passionate individuals to help us achieve this goal.
Our commitment to challenging the status quo drives us to push the boundaries of AI-driven solutions that make a real impact. Here, you will have the opportunity to be a creative force, developing products that empower individuals worldwide.
Join us as we embark on this journey to redefine how people plan, work, and live.
Role Overview
Voxel is hiring a Senior or Staff Software Engineer focused on Machine Learning Infrastructure in San Francisco, CA. This position centers on building and maintaining scalable infrastructure that supports the company’s machine learning products and services.
What You Will Do
- Design, develop, and maintain machine learning infrastructure for production systems
- Work with teams across engineering, product, and data to streamline ML workflows
- Optimize systems for performance, reliability, and operational efficiency
Collaboration
This role involves frequent collaboration with colleagues from multiple disciplines to ensure machine learning solutions are robust and scalable.
Full-time|$225K/yr - $315K/yr|Remote|San Francisco
About the Company
Lavendo is a pioneering publicly traded company leading the charge in the AI revolution. With an AI-centric cloud platform, we are transforming the artificial intelligence landscape. Our state-of-the-art infrastructure, including extensive GPU clusters and advanced cloud services, supports developers in harnessing the explosive growth of the global AI industry, catering to Fortune 1000 firms, innovative startups, and AI researchers alike.
Company type: Publicly traded
Industry: AI/ML, Cloud Computing, Infrastructure-as-Code
Candidate Location: Remote U.S.
Our mission is to democratize AI infrastructure access and empower organizations to innovate, optimize, and deploy AI solutions seamlessly at any scale. By simplifying the complexities of AI development, we provide a comprehensive full-stack AI platform that marries robust hardware with easy-to-use tools and services.
The Opportunity
We are on the lookout for a Senior AI/ML Specialist Solutions Architect to become a crucial part of our client's dynamic team. This role presents an exciting opportunity to design and implement scalable AI solutions tailored for AI-centric clients, leveraging cutting-edge technologies and contributing to one of the most powerful commercially available supercomputers.
What You'll Do
- Architect and enhance distributed training and inference systems for large-scale AI models.
- Design and deliver customer-centric solutions that optimize performance and drive business value.
- Lead the migration of ML pipelines from Proof of Concept to scalable production environments.
- Foster long-term relationships with clients, ensuring satisfaction and alignment with their strategic objectives.
- Produce whitepapers, conduct technical presentations, and facilitate webinars to disseminate insights and best practices.
- Provide technical guidance and mentorship to teams regarding AI infrastructure and deployment strategies.
- Collaborate with engineering and product teams to prioritize customer feedback and shape product roadmaps.
About Gridware
Gridware is an innovative technology firm based in San Francisco, committed to safeguarding and enhancing the electrical grid. We have pioneered an advanced class of grid management known as Active Grid Response (AGR), which focuses on monitoring the electrical, physical, and environmental aspects of the grid to improve reliability and safety. Our cutting-edge AGR platform utilizes high-precision sensors to identify potential issues early, enabling proactive maintenance and fault mitigation. This all-encompassing strategy enhances safety, minimizes outages, and promotes efficient grid operations. Supported by climate-tech and Silicon Valley investors, we are at the forefront of transforming grid management. For further details, visit www.Gridware.io.
Role Overview
In the role of Senior Machine Learning Infrastructure Engineer, you will collaborate closely with the Automation organization and the core ML, Operations, and Analytics teams to enhance and develop the infrastructure surrounding model deployment and monitoring. This position is crucial for amplifying the time-saving benefits that Gridware provides to its customers.
Full-time|$155.6K/yr - $320.3K/yr|Remote|San Francisco, CA, US; Remote, US
About tvScientific
tvScientific is the premier CTV advertising platform exclusively tailored for performance marketers. Our innovative approach harnesses vast data and state-of-the-art science to automate and enhance TV advertising, ultimately driving impactful business results. Our platform seamlessly integrates media buying, optimization, measurement, and attribution into one powerful, efficient solution. Developed by industry veterans with extensive backgrounds in programmatic advertising, digital media, and ad verification, our CTV performance platform is designed to help advertisers confidently scale their business.
We are currently seeking a Senior MLOps Engineer to join our dynamic, distributed engineering team focused on our Connected TV ad-buying platform, as we expand our Machine Learning capabilities. Having successfully optimized TV ad campaigns, we are poised for massive growth, and we need your expertise to ensure our scalability is both sustainable and effective.
As a proud member of Idealab, tvScientific was co-founded by leaders deeply rooted in programmatic advertising and digital media. We empower our clients to purchase ads across the expansive CTV landscape, including platforms such as Hulu, PlutoTV, and the ad-supported tiers of Disney+ and HBO Max. Following our acquisition by Pinterest, we are intensifying our focus on CTV to enhance the performance of search and social advertising.
About Us
At Lemurian Labs, we are dedicated to democratizing AI technology while prioritizing sustainability. Our mission is to create solutions that minimize environmental impact, ensuring that artificial intelligence serves humanity positively. We are committed to responsible innovation and the sustainable growth of AI.
We are in the process of developing a state-of-the-art, portable compiler that empowers developers to 'build once, deploy anywhere.' This technology ensures seamless cross-platform integration, allowing for model training in the cloud and deployment at the edge, all while maximizing resource efficiency and scalability.
If you are passionate about scaling AI sustainably and are eager to make AI development more powerful and accessible, we invite you to join our team at Lemurian Labs. Together, we can build a future that is innovative and responsible.
The Role
We are seeking a Senior ML Performance Engineer to take charge of designing and leading our Performance Testing Platform from inception. In this pivotal role, you will be recognized as the technical expert in measuring, validating, and enhancing the performance of large language models (including Llama 3.2 70B, DeepSeek, and others) prior to and following compiler optimization on cutting-edge GPU architectures.
This is a critical position that will significantly impact our product quality and customer success. You will work at the intersection of Machine Learning systems, GPU architecture, and performance engineering, constructing the infrastructure that substantiates the value of our compiler.
Full-time|$308K/yr - $423.5K/yr|On-site|San Francisco, CA
About Faire
Faire is a cutting-edge online wholesale marketplace driven by the belief that the future is local. Independent retailers around the world generate more revenue than giants like Walmart and Amazon combined, yet individually, they often struggle against these behemoths. At Faire, we harness the power of technology, data, and machine learning to connect this vibrant community of entrepreneurs globally. Imagine your favorite local boutique; we empower them to discover and sell exceptional products from around the world. With the right tools and insights, we aim to level the playing field, allowing small businesses to compete effectively with large retail chains and e-commerce platforms.
By fostering the growth of independent businesses, Faire is making a positive economic impact in local communities worldwide. We’re in search of intelligent, resourceful, and passionate individuals to join us in driving the shop-local movement. If you share our belief in community, we would love to welcome you to ours.
About this Role:
We are on the lookout for a Principal ML / AI Engineer to serve as a company-wide technical thought leader and practitioner in shaping the future of Data and AI at Faire. This unique opportunity allows you to influence broad technical strategies across data, engineering, and product while engaging directly with pioneering AI research and applications. This role will report directly to the CTO of Faire.
Your Responsibilities:
- Shape the AI Vision – Collaborate with product, design, strategy & analytics, machine learning, and the wider engineering leadership to define how AI can unlock transformational value for Faire’s retailers and brands. Provide thought leadership to guide company-wide priorities, particularly focusing on product strategy and key investment areas.
- Prototype and Unblock – Lead the development and implementation of AI systems (such as LLM fine-tuning, RLHF, agent frameworks, etc.) that illustrate what’s achievable and promote adoption across teams. Act as a “super individual contributor” who can delve deeply into technical challenges, enabling the engineering organization to advance quickly with AI and amplify both development and impact.
- Architect the AI-Ready Stack – Design Faire’s technical ecosystem, encompassing event logging, data warehouses, feature stores, and model serving, to ensure our infrastructure is AI-ready, scalable, and optimized for rapid experimentation.
Full-time|$227.2K/yr - $417K/yr|Hybrid|San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote
About the Role:
Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate intimately with the Machine Learning and Product teams to construct top-tier machine learning inference platforms. These cutting-edge platforms drive vital services such as personalized recommendations, search functionalities, and content comprehension at Tubi.
Your primary focus will be on the development and maintenance of low-latency ML model serving systems that cater to Deep Learning, LLM, and Search models. This will include the creation of self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.
In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. Enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.
We are currently hiring for two positions:
- Staff Software Engineer
- Principal Software Engineer
Additional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.
Embrace the Future of Commerce with Whatnot!
Whatnot stands as North America and Europe’s premier live shopping platform, dedicated to transforming the way you buy, sell, and discover your favorite items. We are on a mission to redefine e-commerce by seamlessly merging community engagement, shopping, and entertainment into a unique experience tailored just for you. As part of a remote, co-located team, we thrive on innovation while being firmly rooted in our core values. With operational hubs across the US, UK, Germany, Ireland, and Poland, we are collaboratively shaping the future of online marketplaces.
Our live auctions span a diverse range of categories from fashion and beauty to electronics and collectibles, including trading cards, comic books, and even live plants. There’s truly something for everyone!
And this is just the beginning! As one of the fastest-growing marketplaces, we are in search of bold, innovative problem solvers across all functional areas. Stay updated with the latest Whatnot news through our news and engineering blogs, and join us in empowering individuals to transform their passions into thriving businesses, fostering connections through commerce.
Your Role
We are seeking hands-on leaders—intellectually curious and technically proficient individuals ready to influence the future of AI and ML at Whatnot. In this pivotal role, you will spearhead the development and scaling of the foundational infrastructure that supports machine learning and self-hosted large language model applications across our organization. Collaborating closely with machine learning scientists, you will drive the implementation of innovative models powered by near-real-time features, enhancing product experiences. This entails building robust systems that ensure advanced ML is both reliable and efficient at scale—from low-latency deep learning model serving and streaming feature ingestion to distributed training and high-throughput GPU inference. Although this is a managerial role, a strong technical foundation is essential, and candidates should be enthusiastic about diving deep into the details. You will elevate architectural discussions, provide insightful technical feedback, and dedicate at least one day a week to coding.
Your Responsibilities:
- Lead the infrastructure supporting AI and ML models across critical business areas, enhancing growth, recommendations, trust and safety, fraud detection, seller tooling, and more.
- Oversee the prototyping, deployment, and productionization of innovative ML architectures, ensuring they align with our strategic objectives.
About Our Team
The Infrastructure Engineering team operates within the IT department, dedicated to the reliable construction, deployment, and management of critical on-premises and hybrid environments that empower our internal services and vital research and development projects.
This newly established team is committed to implementing rigorous Site Reliability Engineering (SRE) practices in environments where uptime, safety, recoverability, and security are paramount. We aim to replace unique, one-off infrastructure with standardized infrastructure-as-code components that enhance reliability and operational efficiency as OpenAI continues to grow.
About This Role
We are in search of an Infrastructure Engineering Lead who will architect, build, and maintain reliable, secure, and scalable infrastructure that supports identity, access, endpoint, and shared platform services throughout the organization.
You will take full ownership of infrastructure and identity systems from conceptual design and provisioning to policy enforcement, upgrades, recovery, and ongoing operations. Your goal will be to develop robust, production-grade platforms that minimize operational hurdles, enforce security by default, and empower teams to work more effectively and confidently.
This position is ideal for a senior engineer who excels in navigating ambiguity, relishes the challenge of overseeing complex systems from start to finish, and enhances reliability and security by transforming fragile implementations into standardized, repeatable infrastructure.
This role is based at our San Francisco headquarters and requires in-office attendance.
Key Responsibilities:
- Define and refine infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab settings.
- Establish standardized, production-grade deployment and operational models that replace custom-built solutions.
- Collaborate with IT, Security, Identity, and Network teams to ensure infrastructure is designed to meet reliability, security, and access standards.
- Design and enhance the production architecture for Identity and Access Management (IAM) adjacent platforms, such as Microsoft Entra, utilizing SRE principles.
- Develop common management protocols and shared resources within Azure subscriptions to ensure uniformity and policy compliance in operations.
About Our Team
At OpenAI, our Hardware organization is pioneering the development of cutting-edge silicon and system-level solutions tailored to meet the distinctive needs of advanced AI workloads. We are dedicated to building the next generation of AI silicon, collaborating closely with software engineers and research partners to co-design hardware that integrates seamlessly with our AI models. Our mission includes not only delivering high-quality, production-grade silicon for OpenAI's supercomputing infrastructure but also creating custom design tools and methodologies that foster innovation and enable hardware optimized specifically for AI applications.
About the Role
We are on the lookout for a talented Research Hardware Co-Design Engineer to operate at the intersection of model research and silicon/system architecture. In this role, you will play a critical part in shaping the numerics, architecture, and technological strategies for the future of OpenAI's silicon in collaboration with both Research and Hardware teams.
Your responsibilities will include diagnosing discrepancies between theoretical performance and real-world measurements, writing quantization kernels, assessing the risks associated with numerics through model evaluations, quantifying system architecture trade-offs, and implementing innovative numeric RTL. This is a hands-on position for individuals who are passionate about tackling challenging problems, seeking practical solutions, and driving them to production. Strong prioritization and transparent communication skills are vital for success in this role.
Location: San Francisco, CA (Hybrid: 3 days/week onsite)
Relocation assistance available.
Key Responsibilities:
- Enhance our roofline simulator to monitor evolving workloads and deliver analyses that quantify the impact of architectural decisions, supporting technology exploration.
- Identify and resolve discrepancies between performance simulations and actual measurements; effectively communicate root causes, bottlenecks, and incorrect assumptions.
- Develop emulation kernels for low-precision numerics and lossy compression techniques, equipping Research with the insights needed to balance efficiency with model quality.
- Prototype numeric modules by advancing RTL through synthesis; either hand off innovative numeric solutions cleanly or occasionally take ownership of an RTL module from start to finish.
- Proactively engage with new ML workloads, prototype them using rooflines and/or functional simulations, and initiate evaluations of new opportunities or risks.
- Gain a holistic understanding of the transition from ML science to hardware optimization, breaking down this comprehensive objective into actionable short-term deliverables.
- Foster collaborative relationships across diverse teams with varying goals and expertise, ensuring that progress remains unimpeded.
- Clearly articulate design trade-offs with explicit assumptions and rationale.
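For context, the roofline analysis referenced in the responsibilities above bounds a kernel's attainable throughput by the lesser of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch follows; the hardware figures are placeholder assumptions, not numbers from the posting:

```python
# Minimal roofline-model sketch: attainable FLOP/s is capped either by peak
# compute or by memory bandwidth times arithmetic intensity (FLOPs per byte).
# The peak figures below are placeholder assumptions for illustration only.
PEAK_FLOPS = 100e12       # assumed accelerator peak, FLOP/s
PEAK_BANDWIDTH = 2e12     # assumed memory bandwidth, bytes/s

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline bound for a kernel with the given FLOPs-per-byte ratio."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

for ai in (1, 10, 50, 200):  # FLOPs per byte moved
    bound = attainable_flops(ai)
    regime = "memory-bound" if bound < PEAK_FLOPS else "compute-bound"
    print(f"AI={ai:>4} FLOP/byte -> {bound / 1e12:6.1f} TFLOP/s ({regime})")
```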
About Us
At Sierra, we are revolutionizing the way businesses engage with their customers by building a cutting-edge platform that harnesses the power of AI. Our headquarters is located in the vibrant city of San Francisco, with additional offices expanding in Atlanta, New York, London, France, Singapore, and Japan.
Our company culture is deeply rooted in our core values: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and foster an environment where innovation thrives.
Sierra was co-founded by visionary leaders Bret Taylor, who currently serves as the Board Chair of OpenAI and has a rich history with Salesforce and Facebook, and Clay Bavor, who previously led Google Labs and spearheaded initiatives like Google Lens and Project Starline.
Your Role
As a Software Engineer focusing on Infrastructure at Sierra, you will play a pivotal role in designing, constructing, and maintaining the foundational systems that empower our AI platform. Your expertise will ensure that our infrastructure is not only secure and reliable but also scalable, allowing product teams to execute their work with agility and confidence.
- Guarantee the reliability, scalability, and performance of our platform and LLM inference serving in response to increasing traffic demands.
- Develop and oversee cloud infrastructure using Terraform to create secure, scalable, and reproducible environments.
- Establish and manage a self-service infrastructure platform to empower engineering teams in deploying and operating services independently.
- Take ownership of and improve CI/CD pipelines and release management processes, facilitating rapid and reliable deployments across Sierra’s platform.
- Design and manage distributed systems utilizing distributed databases, retrieval systems, and machine learning models.
- Develop and sustain core data serving abstractions along with essential authentication and security features (SSO, RBAC, authentication controls).
- Effectively navigate and integrate our technology stack with enterprise customer environments in a scalable and maintainable manner.