Responsibilities:
Architect multi-cloud systems and abstractions to facilitate the SGP platform's seamless operation across existing cloud providers.
Develop custom integrations between Scale AI's platform and client data environments (cloud platforms, data warehouses, internal APIs).
Collaborate directly with platform and product teams, as well as clients, to create and implement scalable infrastructure that meets evolving demands.
Deliver high-quality experiments rapidly to engage and satisfy our customers.
Manage the complete product lifecycle from initial concept through to production.
Multitask effectively and rapidly pick up new technologies as needed.
Qualifications:
A minimum of 4 years of full-time engineering experience post-graduation.
Proven experience scaling products in hyper-growth startup environments.
Hands-on experience with LLMs, vector databases, and other cutting-edge AI technologies.
Proficiency in Python or JavaScript/TypeScript, and SQL.
Familiarity with Kubernetes.
Experience with major cloud platforms (AWS, Azure, GCP).
Strong communication skills, capable of conveying technical concepts to both technical and non-technical stakeholders.
About the job
Join Scale AI's innovative team as an Infrastructure Software Engineer for our Enterprise Generative AI Platform (SGP). In this dynamic role, you will help design and enhance our enterprise-grade AI platform, which offers robust APIs for knowledge retrieval, inference, evaluation, and more. We're seeking an exceptional engineer who thrives in fast-paced environments and is eager to contribute to the scaling of our core infrastructure.
The ideal candidate will possess a solid foundation in software engineering principles and extensive experience with large-scale distributed systems. Your role will involve implementing solutions across various cloud providers (GCP, Azure, AWS) for clients in highly regulated sectors, including healthcare, telecommunications, finance, and retail.
About Scale AI
Scale AI is at the forefront of artificial intelligence, providing an enterprise-grade generative AI platform that empowers businesses with advanced APIs and tools for effective knowledge management and data analytics. Our mission is to drive innovation across various industries, enabling clients to harness the power of AI effectively.
Similar jobs
About Us
At Parallel, we are a pioneering web infrastructure company dedicated to empowering businesses across various sectors, including sales, marketing, insurance, and software development. Our innovative products enable organizations to create cutting-edge AI agents with robust and flexible programmatic access to the web. Having successfully raised $130 million from esteemed investors such as Kleiner Perkins, Index Ventures, and Spark Capital, our mission is to reshape the web for AI applications. We are assembling a talented team of engineers, designers, marketers, and operational experts to help us achieve this vision.
Job Overview: As a member of our technical staff, you will play a crucial role in building, operating, and scaling our infrastructure, particularly around large language models. Your responsibilities will include ensuring system reliability and cost-efficiency as we expand, anticipating potential bottlenecks, evolving our architecture to meet growing demands, and developing the tools that enhance engineering productivity.
About You: You possess a deep understanding of distributed systems, cloud platforms, performance optimization, and scalable architecture. You are adept at balancing trade-offs between cost, reliability, and speed, and you are passionate about enabling teams to innovate rapidly and confidently while supporting products that serve millions of users seamlessly.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are dedicated to establishing vLLM as the premier AI inference engine, propelling advancements in AI by making inference both cost-effective and expeditious. Founded by the original creators and key maintainers of vLLM, we occupy a unique position at the convergence of models and hardware—an achievement that has taken years to realize.Role OverviewWe are seeking a talented Infrastructure Engineer to develop the distributed systems that facilitate inference on a global scale. In this role, you will design and implement essential layers that allow vLLM to deploy models across thousands of accelerators with minimal latency and maximum reliability. Our vision is to make deploying cutting-edge models at scale as simple as launching a serverless database. The complexities will be seamlessly integrated into the robust infrastructure you will be creating.
Senior Infrastructure & Performance Engineer
As a Senior Infrastructure & Performance Engineer, you will take charge of enhancing the performance, reliability, and scalability of Nash's foundational infrastructure. Collaborating closely with the Engineering Leadership and both platform and product engineering teams, you will design and manage low-latency, mission-critical systems that facilitate real-time logistics for some of the world's largest retailers. This is a key senior role focused on elastic capacity, high availability, cloud-native architectures, Postgres performance, and enterprise-grade CI/CD for multi-region deployments. You will define the technical roadmap, establish best practices, and implement systems that support the essential workflows of major retailers.
Key Responsibilities
Oversee infrastructure performance and reliability for Nash's production environments, ensuring low latency, high throughput, and consistent performance under load.
Design, develop, and enhance AWS infrastructure, utilizing managed services with a focus on ECS/Fargate.
Lead initiatives in Postgres performance engineering, including query optimization, indexing strategies, connection management, replication, cluster design, and failover.
Architect and maintain multi-region, highly available systems with robust resiliency and guaranteed disaster recovery.
Design and refine enterprise-grade CI/CD pipelines that enable safe, repeatable, and rapid deployments across environments and regions.
Establish observability standards (metrics, logs, tracing, SLOs) to proactively identify and resolve performance bottlenecks.
Collaborate with application engineers to inform system design choices that influence scalability, latency, and reliability.
Lead incident response efforts and postmortems, emphasizing root cause analysis, systemic improvements, and long-term resilience.
Set best practices for infrastructure and performance while mentoring engineers throughout the organization.
Qualifications
6+ years of experience in building and managing high-scale production infrastructure for mission-critical systems.
Proficiency with AWS, particularly with ECS/Fargate, and experience with cloud-native architecture.
Strong background in Postgres performance tuning and optimization.
Deep understanding of CI/CD practices and experience in multi-region deployments.
Exceptional analytical and problem-solving skills, with a proactive approach to performance management.
Full-time|$138K/yr - $259.4K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC
Scale AI is on the lookout for an exceptionally talented and driven Software Engineer, Frontier AI Infrastructure to become an integral part of our innovative Public Sector Engineering team. In this role, you will take charge of the model inference layer, enabling cutting-edge AI models, troubleshooting the latest AI tools, managing networking tasks, addressing latency issues, and monitoring pricing and usage metrics for AI models. You will spearhead technical discussions with cloud vendors and clients to fulfill critical contracts and resolve platform challenges. Additionally, you will collaborate closely with Product teams to anticipate feature requirements, transitioning from reactive 'infra-only debugging' to proactive integration testing.
Your Responsibilities Include:
Designing and implementing secure, scalable backend systems tailored for Public Sector clients, utilizing Scale's advanced cloud-native AI infrastructure.
Owning services or systems while defining long-term health objectives and enhancing the health of related components.
Redesigning the architecture to operate in compliant or restrictive environments, which entails creating swappable components (authentication, storage, logging) to adhere to government and security regulations without compromising product integrity.
Collaborating with Product teams to develop integration tests that identify issues early, shifting focus from 'infra-only debugging' to preventing upstream failures.
Actively participating in customer engagements, liaising with stakeholders to comprehend requirements and deliver innovative solutions.
Contributing to the platform roadmap and product strategy for Scale AI's Public Sector division, playing a vital role in shaping the future trajectory of our offerings.
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY
Join our dynamic Machine Learning Infrastructure team as a Senior AI Infrastructure Engineer, where you will play a pivotal role in designing and constructing platforms that ensure the scalable, reliable, and efficient serving of Large Language Models (LLMs). Our innovative platform supports a range of cutting-edge research and production systems, catering to both internal and external applications across diverse environments.The ideal candidate will possess a solid foundation in machine learning principles coupled with extensive experience in backend system architecture. You will thrive in a collaborative environment that bridges research and engineering, working diligently to provide seamless experiences for our customers and accelerating innovation across the organization.
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY
Join Scale AI's innovative team as an Infrastructure Software Engineer for our Enterprise Generative AI Platform (SGP). In this dynamic role, you will help design and enhance our enterprise-grade AI platform, which offers robust APIs for knowledge retrieval, inference, evaluation, and more. We're seeking an exceptional engineer who thrives in fast-paced environments and is eager to contribute to the scaling of our core infrastructure. The ideal candidate will possess a solid foundation in software engineering principles and extensive experience with large-scale distributed systems. Your role will involve implementing solutions across various cloud providers (GCP, Azure, AWS) for clients in highly regulated sectors, including healthcare, telecommunications, finance, and retail.
Full-time|$236.5K/yr - $295.9K/yr|On-site|San Francisco, CA; New York, NY
As Scale AI continues to broaden its product offerings and customer base, we are actively seeking talented DevOps Engineers in the Public Sector who will take a leading role in enhancing our Continuous Integration/Continuous Deployment (CI/CD) pipelines. Your contribution will be vital in optimizing our Software Development Life Cycle (SDLC), transitioning from manual, fragmented deployments to a cohesive and automated system.
In this position, you will develop an in-depth understanding of our core product architecture, allowing you to deploy and manage systems effectively. A key responsibility will involve integrating various machine learning (ML) tasks and updates into our SDLC, transforming isolated ML components into an integrated and automated workflow. Although direct ML experience is not mandatory, a genuine interest in learning and incorporating ML elements into our processes is essential.
Your Responsibilities:
Design, develop, and maintain efficient CI/CD pipelines for our low-side and high-side products.
Work collaboratively with product and engineering teams to enhance existing application code for better compatibility and streamlined integration within automated pipelines.
Contribute innovative ideas to improve the architecture and design of our deployment systems, increasing efficiency and reliability.
Troubleshoot complex deployment issues, ensuring minimal disruption to development cycles.
Gain a comprehensive understanding of our product and machine learning architectures to facilitate seamless integration and deployment.
Document pipeline processes and configurations for maintainability and knowledge transfer.
Integrate security best practices into every stage of the CI/CD pipeline, ensuring security is a foundational element of our development processes.
Encourage standardization and collaboration across various product teams to achieve a unified and efficient SDLC.
Full-time|On-site|San Francisco, CA; Seattle, WA; New York, NY
Scale AI is seeking a Senior AI Infrastructure Engineer to help build and refine the company’s Training Platform. This position centers on designing, implementing, and improving infrastructure that supports machine learning teams as they train and deploy models.
Role overview
This engineer will work closely with colleagues across different functions to create solutions that make AI systems more efficient. The focus is on enabling faster, more reliable model training and deployment.
Key responsibilities
Design and build infrastructure for AI model training
Implement and optimize systems to support machine learning workflows
Collaborate with teams throughout the company to improve platform capabilities
Locations
This role is based in San Francisco, Seattle, or New York.
Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California
At Databricks, we are dedicated to empowering data teams to tackle some of the most challenging problems in the world. We achieve this by creating and managing a leading data and AI infrastructure platform that enables our clients to leverage deep data insights for business enhancement. Our commitment to pushing the limits of data and AI technology is matched by our focus on resilience, security, and scalability, which are essential for our customers' success on our platform. Databricks operates one of the largest-scale software platforms, comprising millions of virtual machines that generate terabytes of logs and process exabytes of data daily. Given our scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must adeptly protect our customers from these issues. As a Senior Performance Engineer, you will collaborate with various teams throughout the organization to assess product and feature performance, pinpoint performance bottlenecks, and partner with engineers to address performance and scalability challenges. This includes setting performance goals for different software releases, guiding teams in developing performance benchmarks, conducting competitive benchmark analyses for various Databricks products, and performing in-depth analyses to identify and resolve performance issues.
Full-time|$237.6K/yr - $297K/yr|On-site|San Francisco, CA; Seattle, WA; New York, NY
Join Scale AI as a talented Infrastructure Security Engineer, where you'll play a pivotal role in safeguarding the integrity and security of our platform. This position focuses on securing expansive cloud environments, managing and fortifying various compute clusters, and reviewing infrastructure as code. Your proficiency in cloud security, infrastructure automation, and advanced security practices will be crucial in upholding and advancing our security framework.
Your responsibilities include:
Securing infrastructure across major cloud hosting platforms (e.g., AWS, Azure, GCP).
Implementing and maintaining comprehensive security configurations and policies for cloud environments.
Conducting regular security assessments and audits to identify vulnerabilities and propose enhancements.
Developing and enforcing security best practices for infrastructure automation and orchestration.
Collaborating with Developer Experience, IT, and product teams to integrate security into every phase of the infrastructure lifecycle.
Reviewing and securing infrastructure as code (e.g., Terraform, CloudFormation).
Mentoring team members on infrastructure security best practices and emerging threats.
Full-time|$180K/yr - $210K/yr|On-site|San Francisco, CA
About Sigma Computing
Sigma Computing builds AI-powered apps and analytics tools that connect directly to cloud data warehouses. Teams use Sigma to create applications, automate workflows, and analyze live data through a spreadsheet interface, SQL and Python editors, visual builders, and integrated AI features. The platform supports everything from interactive analyses to reports and embedded data experiences.
Role Overview: Senior Product Manager - Platform Performance & Infrastructure
Sigma is growing to serve larger enterprises with demanding, complex workloads. The Senior Product Manager for Platform Performance & Infrastructure will guide the development of core backend systems that keep Sigma responsive and reliable as usage scales. This role focuses on driving improvements in:
Workbook performance
Query lifecycle management
Compute and caching strategies
Metadata services
Compiler components
New warehouse connectors
These systems are essential for Sigma’s ability to deliver consistent, high-quality performance to enterprise customers.
What You Will Do
Define and prioritize product enhancements for backend platform performance and scalability
Work closely with platform engineering and cross-functional teams to address technical challenges
Translate performance and scalability needs into clear product requirements and measurable objectives
Ensure Sigma’s infrastructure can support enterprise clients with reliability and speed
Who We’re Looking For
Experienced Senior Product Manager with strong technical background
Comfortable working hands-on with backend systems and infrastructure
Skilled at collaborating with engineering and cross-functional partners
Focused on delivering measurable improvements for customers
Location & On-Site Requirement
This position is based in San Francisco, CA. It requires working on-site at the Sigma office at least four days per week.
Join worldlabs as a Research Engineer focused on scaling multimodal data. In this dynamic role, you will leverage cutting-edge technologies and methodologies to enhance data processing capabilities. You will be responsible for developing innovative solutions that integrate various data types and drive impactful research outcomes.
Role overview
The Performance Modeling Engineer II position at OpenAI centers on building and applying performance models to enhance the efficiency of advanced AI systems. Based in San Francisco, this role contributes to the reliability and speed of OpenAI’s technologies.
What you will do
Develop and implement performance models for AI systems
Collaborate with data scientists and engineers to refine performance metrics
Support the efficiency and rigorous standards of OpenAI’s technologies
Full-time|$138K/yr - $292.6K/yr|On-site|Boston, MA; Honolulu, HI; San Diego, CA; San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC
Join Scale AI as a passionate and skilled Mission Software Engineer within our innovative Federal Engineering team. In this pivotal position, you will directly contribute to enhancing our government clients' capabilities by designing and implementing tailored solutions. Our advanced, scalable platform is essential for these solutions, and your technical expertise will be crucial in ensuring seamless integrations with existing client infrastructures, enhancing their workflows.
Your Responsibilities Include:
Engaging directly with clients to identify their needs and translating them into functional features within Scale’s platform.
Being prepared for over 50% travel or potential relocation to key client sites.
Working collaboratively with diverse teams to define and implement backend solutions that cater to the specific requirements of government agencies operating in secure environments.
Executing comprehensive data integrations, ensuring synchronization of client data with Scale’s platform.
Deploying and maintaining Scale software at client locations.
Creating features requested by clients, working closely to ensure high satisfaction.
Building robust and reliable backend systems that serve as standalone products, empowering clients to advance their AI initiatives.
Actively participating in client engagements and collaborating with stakeholders to capture requirements and deliver cutting-edge solutions.
This role requires an active TS/SCI security clearance or the capability to obtain one.
Join Crusoe as a Senior Systems Performance Engineer, where you will play a crucial role in optimizing and enhancing our systems for superior performance. You will be responsible for diagnosing performance bottlenecks, implementing solutions, and ensuring that our infrastructure can scale efficiently. Work in a dynamic environment that encourages innovation and professional growth.
About Our Team
The Infrastructure Engineering team operates within the IT department, dedicated to the reliable construction, deployment, and management of critical on-premises and hybrid environments that empower our internal services and vital research and development projects. This newly established team is committed to implementing rigorous Site Reliability Engineering (SRE) practices in environments where uptime, safety, recoverability, and security are paramount. We aim to replace unique, one-off infrastructure with standardized infrastructure-as-code components that enhance reliability and operational efficiency as OpenAI continues to grow.
About This Role
We are in search of an Infrastructure Engineering Lead who will architect, build, and maintain reliable, secure, and scalable infrastructure that supports identity, access, endpoint, and shared platform services throughout the organization. You will take full ownership of infrastructure and identity systems, from conceptual design and provisioning to policy enforcement, upgrades, recovery, and ongoing operations. Your goal will be to develop robust, production-grade platforms that minimize operational hurdles, enforce security by default, and empower teams to work more effectively and confidently.
This position is ideal for a senior engineer who excels in navigating ambiguity, relishes the challenge of overseeing complex systems from start to finish, and enhances reliability and security by transforming fragile implementations into standardized, repeatable infrastructure. This role is based at our San Francisco headquarters and requires in-office attendance.
Key Responsibilities:
Define and refine infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab settings.
Establish standardized, production-grade deployment and operational models that replace custom-built solutions.
Collaborate with IT, Security, Identity, and Network teams to ensure infrastructure is designed to meet reliability, security, and access standards.
Design and enhance the production architecture for Identity and Access Management (IAM) adjacent platforms, such as Microsoft Entra, utilizing SRE principles.
Develop common management protocols and shared resources within Azure subscriptions to ensure uniformity and policy compliance in operations.
About Us
At Lemurian Labs, we are dedicated to democratizing AI technology while prioritizing sustainability. Our mission is to create solutions that minimize environmental impact, ensuring that artificial intelligence serves humanity positively. We are committed to responsible innovation and the sustainable growth of AI. We are in the process of developing a state-of-the-art, portable compiler that empowers developers to 'build once, deploy anywhere.' This technology ensures seamless cross-platform integration, allowing for model training in the cloud and deployment at the edge, all while maximizing resource efficiency and scalability. If you are passionate about scaling AI sustainably and are eager to make AI development more powerful and accessible, we invite you to join our team at Lemurian Labs. Together, we can build a future that is innovative and responsible.
The Role
We are seeking a Senior ML Performance Engineer to take charge of designing and leading our Performance Testing Platform from inception. In this pivotal role, you will be recognized as the technical expert in measuring, validating, and enhancing the performance of large language models (including Llama 3.2 70B, DeepSeek, and others) prior to and following compiler optimization on cutting-edge GPU architectures. This is a critical position that will significantly impact our product quality and customer success. You will work at the intersection of Machine Learning systems, GPU architecture, and performance engineering, constructing the infrastructure that substantiates the value of our compiler.
Role overview
Scale AI seeks a Database Engineer to strengthen and refine its data infrastructure. The position centers on designing, building, and maintaining database systems that deliver high availability and dependable performance.
What you will do
Design and implement database solutions that align with business requirements
Maintain and tune database systems to ensure reliability and speed
Collaborate with engineering, product, and operations teams to improve data processing and management
Location
This role is based in San Francisco, CA or New York, NY.
Full-time|$179.4K/yr - $224.3K/yr|On-site|San Francisco, CA; New York, NY
In a world where software is rapidly evolving, artificial intelligence (AI) is at the forefront, transforming how we interact with technology. At Scale AI, we recognize the immense potential of AI to enhance human capabilities, offering personalized support across various aspects of life, from coaching and tutoring to shopping and travel guidance. As enterprises, startups, and governments rush to integrate large language models (LLMs) into their operations, it is crucial to ensure these systems are safe, aligned, and effective. This involves rigorous human evaluation and reinforcement learning from human feedback (RLHF) during all stages of model development.
Our innovative products, including the Generative AI Data Engine, SGP, and Donovan, are designed to empower the most advanced LLMs and generative models globally. By leveraging world-class RLHF, human data generation, model evaluation, safety, and alignment, we are shaping the future of human-AI interaction.
As a member of our Platform Engineering team, you will play a pivotal role in designing and developing the foundational platforms that support Scale's operations. Your responsibilities will include architecting our core cloud infrastructure, enhancing our data lifecycle, and transforming the software development process for engineers at Scale. You will gain invaluable insights into the AI landscape as it develops within diverse sectors.
About Us
At Sierra, we are revolutionizing the way businesses engage with their customers by building a cutting-edge platform that harnesses the power of AI. Our headquarters is located in the vibrant city of San Francisco, with additional offices expanding in Atlanta, New York, London, France, Singapore, and Japan. Our company culture is deeply rooted in our core values: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and foster an environment where innovation thrives. Sierra was co-founded by visionary leaders Bret Taylor, who currently serves as the Board Chair of OpenAI and has a rich history with Salesforce and Facebook, and Clay Bavor, who previously led Google Labs and spearheaded initiatives like Google Lens and Project Starline.
Your Role
As a Software Engineer focusing on Infrastructure at Sierra, you will play a pivotal role in designing, constructing, and maintaining the foundational systems that empower our AI platform. Your expertise will ensure that our infrastructure is not only secure and reliable but also scalable, allowing product teams to execute their work with agility and confidence.
Guarantee the reliability, scalability, and performance of our platform and LLM inference serving in response to increasing traffic demands.
Develop and oversee cloud infrastructure using Terraform to create secure, scalable, and reproducible environments.
Establish and manage a self-service infrastructure platform to empower engineering teams in deploying and operating services independently.
Take ownership of and improve CI/CD pipelines and release management processes, facilitating rapid and reliable deployments across Sierra’s platform.
Design and manage distributed systems utilizing distributed databases, retrieval systems, and machine learning models.
Develop and sustain core data serving abstractions along with essential authentication and security features (SSO, RBAC, authentication controls).
Effectively navigate and integrate our technology stack with enterprise customer environments in a scalable and maintainable manner.
Oct 15, 2025