Cloud Orchestration Engineer At Inferact San Francisco jobs in San Francisco – Browse 11,447 openings on RoboApply Jobs
Open roles matching “Cloud Orchestration Engineer At Inferact San Francisco” with location signals for San Francisco. 11,447 active listings on RoboApply Jobs.
Cloud Orchestration Engineer at Inferact | San Francisco
Qualifications
Minimum Qualifications: Bachelor's degree or equivalent experience in computer science, engineering, or a related field. Extensive experience with Kubernetes and large-scale container orchestration. Proficient in designing and implementing custom Kubernetes operators. Strong programming skills in Python, Rust, or Go, along with experience in infrastructure-as-code tools such as Terraform and Helm. Experience in managing GPU clusters and troubleshooting hardware issues. Ability to work across various cloud platforms (AWS, GCP, Azure) as well as on-premise infrastructure.
Preferred Qualifications: Familiarity with ML-specific orchestration tools like Ray or Slurm. Understanding of GPU scheduling, multi-tenancy, and resource optimization. Knowledge of vLLM deployment patterns and configurations. Proven track record of enhancing operational reliability for machine learning systems.
Bonus Points: Experience deploying inference systems on large-scale GPU clusters (1,000+ nodes).
About the job
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, aiming to propel AI advancements by making inference processes more efficient and cost-effective. Our company is founded by the original creators and core maintainers of vLLM, placing us at a unique intersection of models and hardware, a position we have cultivated over many years.
About the Role
We are seeking a talented Cloud Orchestration Engineer to develop and maintain the operational framework that ensures vLLM operates seamlessly at an extensive scale. In this role, you will be responsible for designing systems for cluster management, deployment automation, and production monitoring, enabling teams across the globe to deploy AI models effortlessly. Your work will guarantee that vLLM deployments are not only observable and debuggable but also recoverable, transforming operational complexities into reliable infrastructure that operates smoothly.
About Inferact
Inferact is committed to transforming AI inference through vLLM, making it accessible and efficient for everyone. Our focus is to innovate at the confluence of AI models and hardware, fostering an environment where technology meets practicality to drive the future of AI.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
Join Handshake as a Senior Cloud Engineer, where you'll play a pivotal role in designing and implementing scalable cloud solutions. Your expertise will help us enhance our cloud infrastructure while ensuring high availability and security. We are looking for a passionate and driven individual who thrives in a fast-paced environment, is eager to tackle complex challenges, and enjoys collaborating with cross-functional teams. If you have a deep understanding of cloud architectures and services, we want to hear from you!
Role Overview
Crusoe Technologies is seeking a Senior Staff Software Engineer focused on Managed Orchestration to help shape the direction of our cloud infrastructure. This position is based in San Francisco, CA.
What You Will Do
Design and implement scalable orchestration solutions for cloud infrastructure. Lead a team of engineers, providing technical guidance and mentorship. Work closely with cross-functional groups to integrate services and products smoothly. Apply deep technical expertise to drive the development of new technologies that improve operational efficiency and customer experience.
About Crusoe Technologies
Crusoe Technologies builds innovative solutions for cloud computing with a focus on efficiency and sustainability.
Full-time|$180K/yr - $210K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to accelerate the fusion of energy and intelligence. We are building the infrastructure that empowers individuals to innovate boldly with AI, ensuring that our advancements come without compromises in scale, speed, or sustainability. Join us in revolutionizing AI with sustainable technology at Crusoe, where you will spearhead impactful innovations, contribute to meaningful projects, and collaborate with a team that is reshaping the future of responsible cloud infrastructure.
About the Role:
We are looking for a talented Senior Software Engineer to join our cloud software team. Your role will be pivotal in enhancing our state-of-the-art infrastructure. You will leverage your expertise to design and scale our carbon-reducing operating model while managing essential hardware, software, and networking components. In this position, you will write and review code, develop proposals, and contribute to architectural documents. You will assess tools and frameworks, weighing their implications on reliability, scalability, operational costs, and ease of implementation. Your knowledge of orchestration and optimization will be crucial in advancing our managed Kubernetes and AI training clusters, ensuring they maintain a competitive edge in reliability and performance.
What You'll Be Working On:
Develop scalable and resilient software solutions that align with the strategic goals outlined in the Crusoe Cloud roadmap. Collaborate with tech leads and engineers to foster an environment of creativity and technical excellence, driving the development of innovative cloud solutions. Stay updated on the latest cloud software trends and techniques, integrating these insights to keep Crusoe’s innovations at the forefront of the industry. Although you won’t have formal management responsibilities, you will mentor your colleagues by sharing knowledge and guiding technical discussions.
What You'll Bring to the Team:
5-7 years of experience in software engineering, with strong expertise in Systems Engineering. 2+ years of programming experience in GoLang. Experience with Kubernetes and Linux engineering, including debugging capabilities. A proactive attitude towards problem-solving and continuous learning.
About Us
Braintrust is at the forefront of AI observability, seamlessly integrating evaluations and observability into a single workflow. Our platform empowers innovators by providing them with the critical insights needed to understand AI performance in production environments and the tools required to enhance it. Recognized by leading companies such as Notion, Stripe, Zapier, Vercel, and Ramp, Braintrust enables teams to compare AI models, test prompts, and detect regressions, transforming production data into superior AI with each iteration.
Role Overview
We are seeking a talented Cloud Infrastructure Engineer to join our team and contribute to the development of a robust and scalable infrastructure. You will provide developers with a premium platform to deploy code efficiently and confidently. Your role will involve leading initiatives across Terraform, Kubernetes, CI/CD, observability, and support, significantly impacting Braintrust's internal operations and the self-hosted experiences of our customers. This position is pivotal as you will manage our AWS environment while assisting customers in deploying our infrastructure on AWS, Azure, and GCP.
Your Responsibilities
Develop and maintain Terraform modules for both internal infrastructure and customer deployments. Engage directly with customers via Slack to assist with self-hosting and troubleshoot infrastructure challenges, creating tools to simplify their support process. Take ownership of our CI/CD pipeline, aiming to reduce build times, enhance failure visibility, and facilitate safer, quicker releases. Centralize and scale observability through logs, metrics, dashboards, and alerts. Collaborate with engineering teams to create and enhance a secure, developer-friendly infrastructure platform. Support multi-cloud deployment strategies, primarily in AWS, while also extending support for Azure and GCP for our enterprise clientele. Implement tools and automation to bolster deployment, rollback, and infrastructure reliability.
Ideal Candidate Profile
A minimum of 5 years of experience in DevOps, SRE, or Infrastructure Engineering roles. In-depth knowledge of Terraform and experience with at least one major cloud provider, preferably AWS. Proficient in Kubernetes, with capabilities in deploying, debugging, and scaling real workloads. Strong programming skills in scripting languages like Python, Typescript, or Go. Experience in supporting production systems and managing incidents effectively. Comfortable working closely with customers in a support or deployment capacity. Bonus: Familiarity with monitoring and logging tools, as well as knowledge of security best practices.
At Judgment Labs, we are pioneering the way that Agent Behavior Monitoring (ABM) is approached. Unlike conventional observability methods that primarily focus on logging exceptions and latency, our innovative ABM technology identifies behavioral anomalies such as instruction drifts and context retrieval losses in large-scale production environments. Our platform is trusted by numerous teams developing autonomous agents, enabling them to gain insights into system behavior post-deployment. By moving beyond reactive incident management, our users can analyze patterns across conversations and workflows, correlate regressions to specific interaction types, and accurately identify where reliability issues arise within their operational context.
Recent funding success: We have successfully raised over $30M across two funding rounds within the last five months, attracting notable investors such as Lightspeed, SV Angel, Valor Equity Partners, and more.
Join Perplexity as a Senior Cloud Security Engineer and play a pivotal role in transforming how users search and interact with the internet. As a key member of our innovative security team, you will spearhead initiatives to construct and sustain secure and scalable cloud infrastructure, enabling our engineers to innovate swiftly and securely.
Core Responsibilities
Collaborate with infrastructure and engineering teams to embed security measures into development processes and advocate for secure-by-default practices. Develop Terraform modules that incorporate essential security features, including logging, encryption, and automated threat detection. Implement cloud-native detection capabilities utilizing AWS GuardDuty, Security Hub, and tailor-made detection rules to uncover credential breaches, crypto-mining, and lateral movements. Ensure compliance with SOC 2 Type II and ISO 27001 by automating the collection of cloud control evidence. Conduct security assessments of cloud resource configurations using tools like AWS Config and Open Policy Agent, addressing discrepancies in line with CIS Benchmarks and internal security policies. Fortify CI/CD and supply chain pipelines through controls such as artifact signing, secret scanning, and dependency monitoring. Implement zero trust principles via stringent network segmentation, authentication, and authorization across cloud environments. Engage in security on-call rotation, responding to security alerts and incidents for prompt resolution and root cause analysis.
Full-time|$115K/yr - $135K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to foster a future abundant with energy and intelligence. We are building the infrastructure that enables ambitious AI creations without compromising on scale, speed, or sustainability. Join us in leading the AI revolution through sustainable technology at Crusoe. In this role, you will drive significant innovations, contribute to impactful solutions, and collaborate with a team that is at the forefront of responsible, transformative cloud infrastructure.
About This Role
We are looking for an Atlassian Cloud Engineer who will be the primary architect and strategic leader of our Atlassian ecosystem. This role combines in-depth technical knowledge with strong project management skills to ensure that Jira, Confluence, and related tools are optimized for collaboration, delivery, and service management. As the platform owner and change agent, you will drive innovation, implement best practices, and facilitate adoption as our business scales.
What You’ll Be Working On
Managing the daily administration and long-term strategy for the Atlassian suite, including Jira Software, Jira Service Management, Confluence, Opsgenie, Product Discovery, and Statuspage. Customizing workflows, permissions, dashboards, and automation to enhance project execution, visibility, and inter-team collaboration. Collaborating with project managers, business leaders, and IT teams to integrate effective project management and IT service management practices within the Atlassian tools. Designing sophisticated reporting and analytics using eazyBI, JQL, filters, CQL, and dashboards to enable real-time decision-making. Serving as the escalation point for complex platform issues, working closely with Atlassian Support and our internal Service Desk teams. Leading enhancements of the platform by staying updated with Atlassian’s roadmap and promoting the adoption of new features. Advancing AI initiatives within the Atlassian ecosystem, including the development of custom agents using Atlassian Rovo and the implementation of the Rovo MCP server. Collaborating across IT to ensure system reliability, security, and alignment with broader business objectives.
What You’ll Bring to the Team
A minimum of 3 years of experience in administering Atlassian Cloud environments in enterprise settings. Demonstrated ownership of Jira and Confluence customization, site administration, and daily operations. Proficient in advanced reporting and analytics tools. Strong project management skills and the ability to collaborate effectively across teams.
Full-time|On-site|San Francisco, California, United States
Join code-metal as a Senior Platform DevOps Engineer, where you will play a pivotal role in enhancing our cloud and on-premises infrastructure. You will be responsible for deploying, managing, and optimizing systems to ensure high availability and performance. This position offers an exciting opportunity to work with cutting-edge technologies and collaborate within a dynamic team.
Join our innovative team at litellm as a DevOps Engineer. In this role, you will be instrumental in enhancing our development and operations processes, ensuring seamless integration and delivery of our services. Collaborate with cross-functional teams to design, implement, and manage scalable infrastructure solutions. We are looking for a passionate individual with a strong foundation in cloud technologies, automation, and continuous integration/continuous deployment (CI/CD) practices. Your expertise will help us drive efficiency and reliability in our software delivery lifecycle.
Join our dynamic team at leverdemo-8 as a Software Engineer specializing in Cloud Infrastructure. We are passionate about reimagining the hiring landscape and are looking for talented engineers to enhance our YugaByte DB for enterprise applications. Your expertise will contribute to optimizing orchestration support across major public clouds including AWS, Google Cloud, and Azure, as well as Kubernetes services and private data centers. You'll play a crucial role in the control and manageability plane of YugaByte and collaborate with tools such as Prometheus and Alert Manager to ensure seamless infrastructure management. Please note that this position is part of Lever's testing environment; we kindly ask you not to apply for this role.
The Opportunity
Join rowspace as an Infrastructure Engineer and play a pivotal role in constructing and safeguarding the core of our cutting-edge AI data platform. In this position, you'll engineer systems capable of managing extensive volumes of sensitive financial information while adhering to rigorous security and compliance standards. Your work will involve real-time integration of public data with private, tenant-isolated customer data at scale.
Key Responsibilities
Design and implement scalable infrastructure to support our AI-driven knowledge engine that processes both structured and unstructured financial data. Establish a security-first architecture for private cloud environments, ensuring data governance aligns with financial services regulations. Create resilient data ingestion pipelines that accommodate a variety of data sources, from CapIQ feeds (structured data) to internal SharePoint documents (unstructured data). Develop comprehensive monitoring and alerting systems for our BYOC platform. Enforce access controls and maintain audit trails to ensure that AI interactions can be traced back to primary sources. Collaborate with our AI Research and Product teams to enhance infrastructure for LLM inference and training workloads, as well as agent infrastructure development. Establish CI/CD practices and infrastructure-as-code for swift, reliable deployments across multiple cloud providers.
Join our innovative team at Crusoe as a Staff Product Manager for Orchestration. In this pivotal role, you will lead our efforts in enhancing product orchestration strategies, ensuring seamless integration and execution of our technology solutions. Your expertise will guide cross-functional teams, drive product vision, and ultimately contribute to our mission of transforming the technology landscape.
Full-time|$133.2K/yr - $159.8K/yr|On-site|San Francisco, CA
At Fastly, we empower individuals to forge deeper connections with the things they cherish. Our cutting-edge edge cloud platform enables clients to swiftly, securely, and reliably craft exceptional digital experiences by processing, serving, and safeguarding their applications as close to their end-users as possible, right at the edge of the Internet. This platform is tailored to leverage the modern internet, is highly programmable, and supports agile software development methodologies. Our clientele includes renowned global brands, such as GitHub, Yelp, Paramount, and JetBlue. Join us in our mission to create a more trustworthy Internet.
Posting Open Date: March 13th, 2026
Anticipated Posting Close Date*: May 30th, 2026
*Job posting may close early due to the volume of applicants.
Data Engineer
As part of Fastly's Analytics team, you will empower leaders across the organization with actionable data that drives essential business decisions. We are focused on expanding and enhancing our premier internal data platform. In your role as a Data Engineer, you will play a pivotal part in transforming our data infrastructure, optimizing data pipelines on Google Cloud Platform (GCP), scaling the ingestion of complex data sources, and adhering to best practices for performance and reliability. This is your chance to contribute to significant projects within a dynamic, collaborative, and innovative workspace, supporting data scientists, analysts, and business analysts across our organization.
Full-time|$180K/yr - $220K/yr|On-site|San Francisco, CA - US
At Crusoe, we are on a mission to revolutionize the future by accelerating the abundance of energy and intelligence. We are building the foundational engine that empowers individuals to create bold innovations with AI while ensuring sustainability, speed, and scalability. Join us in the forefront of the AI revolution with cutting-edge sustainable technology. You will play a pivotal role in driving meaningful innovation, making a significant impact, and collaborating with a team that is leading the way in responsible, transformative cloud infrastructure.
About the Role
As a Senior Staff Cloud Support Engineer, you will serve as a technical expert within Crusoe Cloud and significantly enhance the efforts of our Customer Experience, SRE, Networking, Fleet, and Product teams. Your role transcends basic ticket resolution; you will design reliability frameworks, influence architectural decisions, mentor senior engineers, and safeguard revenue by averting large-scale incidents. With profound expertise in Linux systems, Kubernetes, networking, and AI/ML infrastructure, you will apply your knowledge with a strong focus on customer satisfaction. You will be comfortable navigating uncertainty, leading incident responses, and shaping the global scaling of high-performance AI infrastructure.
Key Responsibilities
Act as the top escalation point for complex P1/P0 incidents. Lead cross-functional investigations into root causes involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers. Collaborate with SRE and Software teams (Storage, Networking, Compute, K8) to devise systemic solutions rather than temporary fixes.
Reliability Architecture
Design and enhance node validation, burn-in processes, performance baselining, and release readiness. Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability. Minimize MTTR and prevent incident recurrence through structural enhancements.
AI/ML Infrastructure Expertise
Troubleshoot NCCL, IB, GPU driver/firmware issues, and distributed training failures. Support complex AI workloads (training + inference) through performance tuning and observability enhancements.
Customer-Facing Authority
Act as a senior technical advisor during high-stakes customer incidents.
At Greptile, we are on a mission to develop intelligent agents that autonomously verify code modifications. Our current focus involves utilizing AI to analyze pull requests on GitHub, effectively identifying bugs and enforcing coding standards. With our technology, we review nearly 1 billion lines of code each month for over 3,000 companies.
Challenges We Are Excited To Tackle
Developing agents that can learn coding standards through experience, similar to how new hires adapt. Determining customer-specific preferences for pull request feedback using sample-efficient reinforcement learning to enhance signal-to-noise ratios. Implementing automated deployments of feature branches and leveraging agents to stress-test the application for bug detection.
Our Growth Trajectory
Serving over 7,000 customers. Successfully raised $30 million from prominent investors including Benchmark, Y Combinator, Paul Graham, and Initialized.
Our Team
We have curated a highly skilled team that has successfully scaled vital functions at leading companies such as Stripe, Google, Figma, and others.
Key Responsibilities
Design and implement resilient infrastructure to accommodate Greptile's expanding user base. Collaborate with our largest enterprise clients to facilitate the deployment of Greptile within their environments. Streamline the on-premise deployment process to support smaller clients with minimal hands-on intervention.
About Sieve
Sieve stands as a pioneering AI research lab dedicated solely to video data. Our innovative approach integrates exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a myriad of data sources, creating unparalleled datasets that redefine video modeling. With video accounting for 80% of global internet traffic, it has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data. With strategic partnerships with leading AI labs, our team of just 12 has achieved remarkable financial success, generating $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.
About the Role
As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth. We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of: system throughput and stability; monitoring and incident management; security principles, including least-privilege design; and minimizing operational burdens for the entire engineering team. You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts. This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems. If a system failure occurs, you take it personally, thriving under the weight of responsibility.
What You'll Be Doing
Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads. Develop and manage Terraform-based multi-cloud deployments. Enhance cloud and data security (SSO, IAM, least privilege access, auditability). Lead incident response efforts and strengthen systems against failures. Create CI/CD systems to minimize user errors and maximize safety. Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics).
Role Overview
At Variance, we are at the forefront of teaching machines to execute high-stakes judgment calls on a large scale. This involves developing AI agents that navigate the complex domains of risk investigations, fraud detection, and identity verifications. Our San Francisco-based team is small yet exceptionally talented, comprising former founders and specialists from leading AI laboratories. We cater to an impressive clientele, including Fortune 500 companies, global marketplaces, and regulated financial institutions. If you are passionate about taking ownership, working swiftly, and collaborating closely with founders, you will thrive in our environment. We are seeking a Security Engineer to help establish a robust security foundation. You will collaborate across product, infrastructure, and internal systems to ensure that Variance is secure by design, enabling us to meet the rigorous standards needed to deploy AI in critical workflows for the world’s largest corporations.
About Flow Engineering
At Flow Engineering, we are pioneering an AI-native requirements platform that empowers cutting-edge engineering teams to collaborate seamlessly with AI agents. Our mission is to facilitate the design, validation, and evolution of complex systems with unmatched speed and precision. Following our successful Series A funding, we are on an exciting trajectory to scale our product from thousands to hundreds of thousands of users, all while upholding the highest standards of reliability and performance.
About the Role
We are seeking a passionate Infrastructure Software Engineer to join our dynamic team. In this role, you will be instrumental in constructing and expanding the core platform that underpins Flow. You will manage services and infrastructure that empower "agentic systems engineers" and product teams to leverage Flow in their daily tasks. You will become a key member of a small, senior team that prioritizes speed, ownership, and solid engineering principles: delivering version 1 products swiftly, learning, and iterating effectively.
Your Responsibilities
Design, develop, and maintain backend services and platform primitives that facilitate complex engineering workflows and large-scale collaboration. Enhance Flow’s capacity from thousands to hundreds of thousands of users, focusing on performance, reliability, observability, and security across the entire stack. Take ownership of CI/CD pipelines, testing infrastructure, and internal tools to enable rapid and safe product releases. Collaborate with frontend and AI engineers to establish robust APIs, data models, and integration points that are easy to adapt and evolve. Contribute to architectural decisions and the technical roadmap as our product and customer base expands.
Your Profile
A minimum of 3 years of software engineering experience in building and maintaining production systems within a cloud environment (e.g., AWS or GCP). Deep understanding of systems design, distributed systems, and best practices for reliability, observability, and security. Proficiency with containerization and infrastructure-as-code tools (e.g., Docker, Terraform, etc.). Ability to take ownership of projects end-to-end in a fast-paced environment and make pragmatic decisions amidst ambiguity. A collaborative mindset with low ego, eager to work closely with product, design, and customer-facing teams.
Our Technology Stack
Utilization of TypeScript, Node.js, and React for application development, with a strong emphasis on type safety. Employing Postgres and various managed cloud services for data persistence and messaging.
Join Our Team as a Network Software Engineer
At blaxel, we are seeking an exceptional Network Software Engineer to spearhead the architecture and development of the networking infrastructure for the world's first AI-native cloud. In this role, you will design and implement a comprehensive virtual networking stack for Blaxel’s Virtual Private Cloud (VPC), our cutting-edge Software-Defined Networking (SDN) fabric tailored for agentic AI workloads. Your contributions will encompass ultra-high-performance dataplane technologies, including VPP, DPDK, and eBPF, alongside distributed control-plane systems and traffic services developed in Rust and Go. You will be responsible for enhancing our globally distributed networking suite, which includes our load balancer and HTTP proxies built in Rust, traffic routing systems, and data plane components that currently handle over 1.5 million requests daily. Your objective is to develop a VPC that significantly outperforms conventional cloud solutions in terms of throughput, latency, visibility, and security. If you are a systems-oriented engineer with a passion for Rust, Go, networking concepts, and building cloud infrastructure from the ground up, this is the perfect opportunity for you.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, aiming to propel AI advancements by making inference processes more efficient and cost-effective. Our company is founded by the original creators and core maintainers of vLLM, placing us at a unique intersection of models and hardware, a position we have cultivated over many years.About the RoleWe are seeking a talented Cloud Orchestration Engineer to develop and maintain the operational framework that ensures vLLM operates seamlessly at an extensive scale. In this role, you will be responsible for designing systems for cluster management, deployment automation, and production monitoring, enabling teams across the globe to deploy AI models effortlessly. Your work will guarantee that vLLM deployments are not only observable and debuggable but also recoverable, transforming operational complexities into reliable infrastructure that operates smoothly.
Join Handshake as a Senior Cloud Engineer, where you'll play a pivotal role in designing and implementing scalable cloud solutions. Your expertise will help us enhance our cloud infrastructure while ensuring high availability and security.
We are looking for a passionate and driven individual who thrives in a fast-paced environment, is eager to tackle complex challenges, and enjoys collaborating with cross-functional teams. If you have a deep understanding of cloud architectures and services, we want to hear from you!
Role Overview
Crusoe Technologies is seeking a Senior Staff Software Engineer focused on Managed Orchestration to help shape the direction of our cloud infrastructure. This position is based in San Francisco, CA.
What You Will Do
Design and implement scalable orchestration solutions for cloud infrastructure.
Lead a team of engineers, providing technical guidance and mentorship.
Work closely with cross-functional groups to integrate services and products smoothly.
Apply deep technical expertise to drive the development of new technologies that improve operational efficiency and customer experience.
About Crusoe Technologies
Crusoe Technologies builds innovative solutions for cloud computing with a focus on efficiency and sustainability.
Full-time|$180K/yr - $210K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to accelerate the fusion of energy and intelligence. We are building the infrastructure that empowers individuals to innovate boldly with AI, ensuring that our advancements come without compromises in scale, speed, or sustainability.
Join us in revolutionizing AI with sustainable technology at Crusoe, where you will spearhead impactful innovations, contribute to meaningful projects, and collaborate with a team that is reshaping the future of responsible cloud infrastructure.
About the Role:
We are looking for a talented Senior Software Engineer to join our cloud software team. Your role will be pivotal in enhancing our state-of-the-art infrastructure. You will leverage your expertise to design and scale our carbon-reducing operating model while managing essential hardware, software, and networking components.
In this position, you will write and review code, develop proposals, and contribute to architectural documents. You will assess tools and frameworks, weighing their implications on reliability, scalability, operational costs, and ease of implementation. Your knowledge of orchestration and optimization will be crucial in advancing our managed Kubernetes and AI training clusters, ensuring they maintain a competitive edge in reliability and performance.
What You'll Be Working On:
Develop scalable and resilient software solutions that align with the strategic goals outlined in the Crusoe Cloud roadmap.
Collaborate with tech leads and engineers to foster an environment of creativity and technical excellence, driving the development of innovative cloud solutions.
Stay updated on the latest cloud software trends and techniques, integrating these insights to keep Crusoe’s innovations at the forefront of the industry.
Although you won’t have formal management responsibilities, you will mentor your colleagues by sharing knowledge and guiding technical discussions.
What You'll Bring to the Team:
5-7 years of experience in software engineering, with strong expertise in Systems Engineering.
2+ years of programming experience in GoLang.
Experience with Kubernetes and Linux engineering, including debugging capabilities.
A proactive attitude towards problem-solving and continuous learning.
About Us
Braintrust is at the forefront of AI observability, seamlessly integrating evaluations and observability into a single workflow. Our platform empowers innovators by providing them with the critical insights needed to understand AI performance in production environments and the tools required to enhance it.
Recognized by leading companies such as Notion, Stripe, Zapier, Vercel, and Ramp, Braintrust enables teams to compare AI models, test prompts, and detect regressions, transforming production data into superior AI with each iteration.
Role Overview
We are seeking a talented Cloud Infrastructure Engineer to join our team and contribute to the development of a robust and scalable infrastructure. You will provide developers with a premium platform to deploy code efficiently and confidently. Your role will involve leading initiatives across Terraform, Kubernetes, CI/CD, observability, and support, significantly impacting Braintrust's internal operations and the self-hosted experiences of our customers.
This position is pivotal as you will manage our AWS environment while assisting customers in deploying our infrastructure on AWS, Azure, and GCP.
Your Responsibilities
Develop and maintain Terraform modules for both internal infrastructure and customer deployments.
Engage directly with customers via Slack to assist with self-hosting and troubleshoot infrastructure challenges, creating tools to simplify their support process.
Take ownership of our CI/CD pipeline, aiming to reduce build times, enhance failure visibility, and facilitate safer, quicker releases.
Centralize and scale observability through logs, metrics, dashboards, and alerts.
Collaborate with engineering teams to create and enhance a secure, developer-friendly infrastructure platform.
Support multi-cloud deployment strategies, primarily in AWS, while also extending support for Azure and GCP for our enterprise clientele.
Implement tools and automation to bolster deployment, rollback, and infrastructure reliability.
Ideal Candidate Profile
A minimum of 5 years of experience in DevOps, SRE, or Infrastructure Engineering roles.
In-depth knowledge of Terraform and experience with at least one major cloud provider, preferably AWS.
Proficient in Kubernetes, with capabilities in deploying, debugging, and scaling real workloads.
Strong programming skills in scripting languages like Python, Typescript, or Go.
Experience in supporting production systems and managing incidents effectively.
Comfortable working closely with customers in a support or deployment capacity.
Bonus: Familiarity with monitoring and logging tools, as well as knowledge of security best practices.
At Judgment Labs, we are pioneering the way that Agent Behavior Monitoring (ABM) is approached. Unlike conventional observability methods that primarily focus on logging exceptions and latency, our innovative ABM technology identifies behavioral anomalies such as instruction drifts and context retrieval losses in large-scale production environments.
Our platform is trusted by numerous teams developing autonomous agents, enabling them to gain insights into system behavior post-deployment. By moving beyond reactive incident management, our users can analyze patterns across conversations and workflows, correlate regressions to specific interaction types, and accurately identify where reliability issues arise within their operational context.
Recent funding success: We have successfully raised over $30M across two funding rounds within the last five months, attracting notable investors such as Lightspeed, SV Angel, Valor Equity Partners, and more.
Join Perplexity as a Senior Cloud Security Engineer and play a pivotal role in transforming how users search and interact with the internet. As a key member of our innovative security team, you will spearhead initiatives to construct and sustain secure and scalable cloud infrastructure, enabling our engineers to innovate swiftly and securely.
Core Responsibilities
Collaborate with infrastructure and engineering teams to embed security measures into development processes and advocate for secure-by-default practices.
Develop Terraform modules that incorporate essential security features, including logging, encryption, and automated threat detection.
Implement cloud-native detection capabilities utilizing AWS GuardDuty, Security Hub, and tailor-made detection rules to uncover credential breaches, crypto-mining, and lateral movements.
Ensure compliance with SOC 2 Type II and ISO 27001 by automating the collection of cloud control evidence.
Conduct security assessments of cloud resource configurations using tools like AWS Config and Open Policy Agent, addressing discrepancies in line with CIS Benchmarks and internal security policies.
Fortify CI/CD and supply chain pipelines through controls such as artifact signing, secret scanning, and dependency monitoring.
Implement zero trust principles via stringent network segmentation, authentication, and authorization across cloud environments.
Engage in security on-call rotation, responding to security alerts and incidents for prompt resolution and root cause analysis.
Full-time|$115K/yr - $135K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to foster a future abundant with energy and intelligence. We are building the infrastructure that enables ambitious AI creations without compromising on scale, speed, or sustainability.
Join us in leading the AI revolution through sustainable technology at Crusoe. In this role, you will drive significant innovations, contribute to impactful solutions, and collaborate with a team that is at the forefront of responsible, transformative cloud infrastructure.
About This Role
We are looking for an Atlassian Cloud Engineer who will be the primary architect and strategic leader of our Atlassian ecosystem. This role combines in-depth technical knowledge with strong project management skills to ensure that Jira, Confluence, and related tools are optimized for collaboration, delivery, and service management. As the platform owner and change agent, you will drive innovation, implement best practices, and facilitate adoption as our business scales.
What You’ll Be Working On
Managing the daily administration and long-term strategy for the Atlassian suite, including Jira Software, Jira Service Management, Confluence, Opsgenie, Product Discovery, and Statuspage.
Customizing workflows, permissions, dashboards, and automation to enhance project execution, visibility, and inter-team collaboration.
Collaborating with project managers, business leaders, and IT teams to integrate effective project management and IT service management practices within the Atlassian tools.
Designing sophisticated reporting and analytics using eazyBI, JQL, filters, CQL, and dashboards to enable real-time decision-making.
Serving as the escalation point for complex platform issues, working closely with Atlassian Support and our internal Service Desk teams.
Leading enhancements of the platform by staying updated with Atlassian’s roadmap and promoting the adoption of new features.
Advancing AI initiatives within the Atlassian ecosystem, including the development of custom agents using Atlassian Rovo and the implementation of the Rovo MCP server.
Collaborating across IT to ensure system reliability, security, and alignment with broader business objectives.
What You’ll Bring to the Team
A minimum of 3 years of experience in administering Atlassian Cloud environments in enterprise settings.
Demonstrated ownership of Jira and Confluence customization, site administration, and daily operations.
Proficient in advanced reporting and analytics tools.
Strong project management skills and the ability to collaborate effectively across teams.
Full-time|On-site|San Francisco, California, United States
Join code-metal as a Senior Platform DevOps Engineer, where you will play a pivotal role in enhancing our cloud and on-premises infrastructure. You will be responsible for deploying, managing, and optimizing systems to ensure high availability and performance. This position offers an exciting opportunity to work with cutting-edge technologies and collaborate within a dynamic team.
Join our innovative team at litellm as a DevOps Engineer. In this role, you will be instrumental in enhancing our development and operations processes, ensuring seamless integration and delivery of our services. Collaborate with cross-functional teams to design, implement, and manage scalable infrastructure solutions.
We are looking for a passionate individual with a strong foundation in cloud technologies, automation, and continuous integration/continuous deployment (CI/CD) practices. Your expertise will help us drive efficiency and reliability in our software delivery lifecycle.
Join our dynamic team at leverdemo-8 as a Software Engineer specializing in Cloud Infrastructure. We are passionate about reimagining the hiring landscape and are looking for talented engineers to enhance our YugaByte DB for enterprise applications. Your expertise will contribute to optimizing orchestration support across major public clouds including AWS, Google Cloud, and Azure, as well as Kubernetes services and private data centers. You'll play a crucial role in the control and manageability plane of YugaByte and collaborate with tools such as Prometheus and Alert Manager to ensure seamless infrastructure management.
Please note that this position is part of Lever's testing environment; we kindly ask you not to apply for this role.
The Opportunity
Join rowspace as an Infrastructure Engineer and play a pivotal role in constructing and safeguarding the core of our cutting-edge AI data platform. In this position, you'll engineer systems capable of managing extensive volumes of sensitive financial information while adhering to rigorous security and compliance standards. Your work will involve real-time integration of public data with private, tenant-isolated customer data at scale.
Key Responsibilities
Design and implement scalable infrastructure to support our AI-driven knowledge engine that processes both structured and unstructured financial data.
Establish a security-first architecture for private cloud environments, ensuring data governance aligns with financial services regulations.
Create resilient data ingestion pipelines that accommodate a variety of data sources, from CapIQ feeds (structured data) to internal SharePoint documents (unstructured data).
Develop comprehensive monitoring and alerting systems for our BYOC platform.
Enforce access controls and maintain audit trails to ensure that AI interactions can be traced back to primary sources.
Collaborate with our AI Research and Product teams to enhance infrastructure for LLM inference and training workloads, as well as agent infrastructure development.
Establish CI/CD practices and infrastructure-as-code for swift, reliable deployments across multiple cloud providers.
Join our innovative team at Crusoe as a Staff Product Manager for Orchestration. In this pivotal role, you will lead our efforts in enhancing product orchestration strategies, ensuring seamless integration and execution of our technology solutions. Your expertise will guide cross-functional teams, drive product vision, and ultimately contribute to our mission of transforming the technology landscape.
Full-time|$133.2K/yr - $159.8K/yr|On-site|San Francisco, CA
At Fastly, we empower individuals to forge deeper connections with the things they cherish. Our cutting-edge edge cloud platform enables clients to swiftly, securely, and reliably craft exceptional digital experiences by processing, serving, and safeguarding their applications as close to their end-users as possible — right at the edge of the Internet. This platform is tailored to leverage the modern internet, is highly programmable, and supports agile software development methodologies. Our clientele includes renowned global brands, such as GitHub, Yelp, Paramount, and JetBlue.
Join us in our mission to create a more trustworthy Internet.
Posting Open Date: March 13th, 2026
Anticipated Posting Close Date*: May 30th, 2026
*Job posting may close early due to the volume of applicants.
Data Engineer
As part of Fastly's Analytics team, you will empower leaders across the organization with actionable data that drives essential business decisions. We are focused on expanding and enhancing our premier internal data platform. In your role as a Data Engineer, you will play a pivotal part in transforming our data infrastructure, optimizing data pipelines on Google Cloud Platform (GCP), scaling the ingestion of complex data sources, and adhering to best practices for performance and reliability. This is your chance to contribute to significant projects within a dynamic, collaborative, and innovative workspace, supporting data scientists, analysts, and business analysts across our organization.
Full-time|$180K/yr - $220K/yr|On-site|San Francisco, CA - US
At Crusoe, we are on a mission to revolutionize the future by accelerating the abundance of energy and intelligence. We are building the foundational engine that empowers individuals to create bold innovations with AI while ensuring sustainability, speed, and scalability.
Join us in the forefront of the AI revolution with cutting-edge sustainable technology. You will play a pivotal role in driving meaningful innovation, making a significant impact, and collaborating with a team that is leading the way in responsible, transformative cloud infrastructure.
About the Role
As a Senior Staff Cloud Support Engineer, you will serve as a technical expert within Crusoe Cloud and significantly enhance the efforts of our Customer Experience, SRE, Networking, Fleet, and Product teams. Your role transcends basic ticket resolution; you will design reliability frameworks, influence architectural decisions, mentor senior engineers, and safeguard revenue by averting large-scale incidents. With profound expertise in Linux systems, Kubernetes, networking, and AI/ML infrastructure, you will apply your knowledge with a strong focus on customer satisfaction. You will be comfortable navigating uncertainty, leading incident responses, and shaping the global scaling of high-performance AI infrastructure.
Key Responsibilities
Act as the top escalation point for complex P1/P0 incidents.
Lead cross-functional investigations into root causes involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers.
Collaborate with SRE and Software teams (Storage, Networking, Compute, K8) to devise systemic solutions rather than temporary fixes.
Reliability Architecture
Design and enhance node validation, burn-in processes, performance baselining, and release readiness.
Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability.
Minimize MTTR and prevent incident recurrence through structural enhancements.
AI/ML Infrastructure Expertise
Troubleshoot NCCL, IB, GPU driver/firmware issues, and distributed training failures.
Support complex AI workloads (training + inference) through performance tuning and observability enhancements.
Customer-Facing Authority
Act as a senior technical advisor during high-stakes customer incidents.
At Greptile, we are on a mission to develop intelligent agents that autonomously verify code modifications. Our current focus involves utilizing AI to analyze pull requests on GitHub, effectively identifying bugs and enforcing coding standards. With our technology, we review nearly 1 billion lines of code each month for over 3,000 companies.
Challenges We Are Excited To Tackle
Developing agents that can learn coding standards through experience, similar to how new hires adapt.
Determining customer-specific preferences for pull request feedback using sample-efficient reinforcement learning to enhance signal-to-noise ratios.
Implementing automated deployments of feature branches and leveraging agents to stress-test the application for bug detection.
Our Growth Trajectory
Serving over 7,000 customers.
Successfully raised $30 million from prominent investors including Benchmark, Y Combinator, Paul Graham, and Initialized.
Our Team
We have curated a highly skilled team that has successfully scaled vital functions at leading companies such as Stripe, Google, Figma, and others.
Key Responsibilities
Design and implement resilient infrastructure to accommodate Greptile's expanding user base.
Collaborate with our largest enterprise clients to facilitate the deployment of Greptile within their environments.
Streamline the on-premise deployment process to support smaller clients with minimal hands-on intervention.
About Sieve
Sieve stands as a pioneering AI research lab dedicated solely to video data. Our innovative approach integrates exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a myriad of data sources, creating unparalleled datasets that redefine video modeling. With video accounting for 80% of global internet traffic, it has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data.
With strategic partnerships with leading AI labs, our team of just 12 has achieved remarkable financial success, generating $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.
About the Role
As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth.
We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of:
System throughput and stability
Monitoring and incident management
Security principles, including least-privilege design
Minimizing operational burdens for the entire engineering team
You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts.
This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems.
If a system failure occurs, you take it personally, thriving under the weight of responsibility.
What You'll Be Doing
Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads
Develop and manage Terraform-based multi-cloud deployments
Enhance cloud and data security (SSO, IAM, least privilege access, auditability)
Lead incident response efforts and strengthen systems against failures
Create CI/CD systems to minimize user errors and maximize safety
Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics)
Role Overview
At Variance, we are at the forefront of teaching machines to execute high-stakes judgment calls on a large scale. This involves developing AI agents that navigate the complex domains of risk investigations, fraud detection, and identity verifications.
Our San Francisco-based team is small yet exceptionally talented, comprising former founders and specialists from leading AI laboratories. We cater to an impressive clientele, including Fortune 500 companies, global marketplaces, and regulated financial institutions. If you are passionate about taking ownership, working swiftly, and collaborating closely with founders, you will thrive in our environment.
We are seeking a Security Engineer to help establish a robust security foundation. You will collaborate across product, infrastructure, and internal systems to ensure that Variance is secure by design, enabling us to meet the rigorous standards needed to deploy AI in critical workflows for the world’s largest corporations.
About Flow Engineering
At Flow Engineering, we are pioneering an AI-native requirements platform that empowers cutting-edge engineering teams to collaborate seamlessly with AI agents. Our mission is to facilitate the design, validation, and evolution of complex systems with unmatched speed and precision. Following our successful Series A funding, we are on an exciting trajectory to scale our product from thousands to hundreds of thousands of users, all while upholding the highest standards of reliability and performance.
About the Role
We are seeking a passionate Infrastructure Software Engineer to join our dynamic team. In this role, you will be instrumental in constructing and expanding the core platform that underpins Flow. You will manage services and infrastructure that empower "agentic systems engineers" and product teams to leverage Flow in their daily tasks.
You will become a key member of a small, senior team that prioritizes speed, ownership, and solid engineering principles: delivering version 1 products swiftly, learning, and iterating effectively.
Your Responsibilities
Design, develop, and maintain backend services and platform primitives that facilitate complex engineering workflows and large-scale collaboration.
Enhance Flow’s capacity from thousands to hundreds of thousands of users, focusing on performance, reliability, observability, and security across the entire stack.
Take ownership of CI/CD pipelines, testing infrastructure, and internal tools to enable rapid and safe product releases.
Collaborate with frontend and AI engineers to establish robust APIs, data models, and integration points that are easy to adapt and evolve.
Contribute to architectural decisions and the technical roadmap as our product and customer base expands.
Your Profile
A minimum of 3 years of software engineering experience in building and maintaining production systems within a cloud environment (e.g., AWS or GCP).
Deep understanding of systems design, distributed systems, and best practices for reliability, observability, and security.
Proficiency with containerization and infrastructure-as-code tools (e.g., Docker, Terraform, etc.).
Ability to take ownership of projects end-to-end in a fast-paced environment and make pragmatic decisions amidst ambiguity.
A collaborative mindset with low ego, eager to work closely with product, design, and customer-facing teams.
Our Technology Stack
Utilization of TypeScript, Node.js, and React for application development, with a strong emphasis on type safety.
Employing Postgres and various managed cloud services for data persistence and messaging.
Mar 3, 2026