Key Responsibilities
Design, build, and maintain the cloud infrastructure for our distributed build acceleration platform.
Automate everything: Develop deployment pipelines and streamline monitoring and recovery processes.
Manage scalability and reliability for high-throughput, low-latency systems.
Implement and maintain observability through logging, metrics, tracing, and alerting.
Collaborate with product and engineering teams to integrate reliability into every feature.
Quickly diagnose and resolve production incidents and provide feedback to enhance system design.
Optimize cost, performance, and resilience across multi-cloud environments.
Qualifications
Minimum of 4 years of experience in SRE, DevOps, or Production Engineering roles.
Proven experience managing Kubernetes in a production environment.
Solid background in cloud infrastructure (preferably GCP or AWS) and Infrastructure as Code (Terraform preferred).
Strong understanding of networking and security principles and practices.
Experience with monitoring and logging systems such as Prometheus, Grafana, or ELK stack.
Excellent problem-solving skills and ability to work in a fast-paced environment.
About the job
Join Our Team at EngFlow
EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.
Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.
We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.
About EngFlow
EngFlow is a cutting-edge technology company dedicated to improving the software development lifecycle. Our focus on cloud-based solutions allows engineering teams to enhance their workflows and deliver high-quality software faster. We pride ourselves on our innovative approach and commitment to excellence, fostering a culture that values collaboration and continuous improvement.
Similar jobs
Foundations Engineer (Deep Infrastructure)|On-site|San Francisco
About Rox
At Rox, we are pioneering the development of an AI-native revenue operating system that transforms how enterprises interact with technology. Unlike traditional software designed for human dashboard operators, Rox is engineered for agents managing complex systems.
We eliminate static workflows, enabling continuous decision-making processes powered by real-time insights from across the enterprise landscape. Our agents are equipped to analyze signals, reason through them, and autonomously execute actions.
To support this innovation, we are constructing a robust infrastructure that integrates:
Distributed data platforms
Real-time decision-making systems
Agent execution frameworks
Low-latency context retrieval
Backed by prominent investors like Sequoia, GV, and General Catalyst, we are assembling a talented team of engineers eager to tackle deep technical challenges that have a tangible impact on the world.
About the Foundations Team
The Foundations team is responsible for developing the core infrastructure that powers Rox agents.
Our work focuses on:
Real-time context ingestion
Agent execution and orchestration
Ensuring reliability for long-term AI tasks
Low-latency decision-making across distributed systems
If you have experience with:
Streaming compute platforms
Distributed query engines
Real-time OLAP systems
Matching engines
Large-scale data infrastructure
Many challenges you encounter here will resonate with your past work but will be applied to a novel category of software. At Rox, agents continuously:
Retrieve context
Make decisions
Trigger actions
Update state
The Foundations team builds the infrastructure that ensures these feedback loops are reliable, swift, and observable.
The Role
We are on the lookout for a Foundations Engineer (Deep Infrastructure) to design and oversee the systems that power Rox's agent runtime.
Join Us as Our First Marketing Lead
Foundation is on the lookout for our inaugural marketing leader to propel our vision of revolutionizing homebuilding and enhancing the journey of buying, selling, and owning a home.
About Foundation
With approximately $6.8M in backing from top-tier venture capitalists, including Y Combinator, Foundation is composed of a dynamic team formerly from Opendoor, dedicated to reshaping the future of residential real estate.
Our flagship product is a cutting-edge customer experience platform designed specifically for homebuilders; think of it as the "Shopify for Homebuilders." We collaborate with large-scale homebuilders to deliver a modern digital experience, significantly boosting customer satisfaction and team productivity. In just two years, we've achieved remarkable product-market fit and impressive growth, all without a dedicated marketing team.
Our Growth Journey
We are currently navigating the first of three interconnected growth phases:
AI-Driven SaaS for Homebuilding: A transformative opportunity with public-scale potential.
Real Estate Enterprise Ecosystem: Homebuilders drive this ecosystem, which fosters collaboration among adjacent trillion-dollar sectors such as lending, title, home insurance, and retail.
AI Native Home Operating System: This will enable seamless home buying and ownership through our platform.
Your Role as Our First Marketer
We seek a hands-on, results-driven marketer passionate about transforming a key sector of the U.S. economy and redefining marketing in the age of AI.
Key Responsibilities
You will be pivotal in steering Foundation's next growth phases by integrating AI with marketing and real estate innovation. Your primary objectives will include:
Accelerating Growth: Drive rapid expansion of our core AI-driven product line for homebuilders.
Join Our Team as a Customer Success Manager/Lead!
At Foundation, we're on a mission to revolutionize the homebuilding industry and enhance the experience of buying, selling, and owning homes. We are looking for a dedicated Customer Success Manager to join our dynamic team and drive our core business forward.
About Us
With $6.8M in funding from top-tier venture capitalists, including Y Combinator, Foundation is comprised of a talented team formerly from Opendoor, focused on transforming residential real estate. Our flagship product is a cutting-edge customer experience platform tailored for homebuilders; envision it as the 'Shopify for Homebuilders'. We partner with large-scale homebuilders to provide a modern digital experience that significantly boosts customer satisfaction and team productivity. Within just two years, we have achieved remarkable growth and established a strong product-market fit.
The Role
As a Customer Success Manager, you will play a pivotal role in fostering and expanding relationships with our diverse portfolio of clients. You will be the primary contact for our customers post-onboarding, ensuring they maximize their use of our platform and derive long-term value. This position requires a proactive approach in managing multiple accounts, identifying risks and opportunities, and collaborating closely with our Product, Engineering, and Operations teams.
This is an ideal opportunity for someone who thrives in a fast-paced environment, enjoys tackling challenges, and seeks a meaningful role within a scaling startup. You will have the autonomy to influence our customer engagement strategies and drive significant impact.
Full-time|$125K/yr - $195K/yr|On-site|San Francisco Office
About Atomic Semi
Atomic Semi is pioneering the development of a compact and agile semiconductor fabrication facility.
With today’s technology, alongside a few innovative simplifications, we are capable of realizing this vision. We will create our own tools, allowing for rapid iterations and enhancements.
Our goal is to assemble a small, exceptional team of hands-on engineers to drive this initiative forward. Our team is composed of experts in mechanical, electrical, hardware, computer, and process engineering. We will manage the entire stack, from atoms to architecture, with a forward-thinking approach that pushes the boundaries of technology.
Our philosophy emphasizes that smaller, faster, and self-built systems are superior.
We are confident that our team and lab can create anything we envision. Equipped with 3D printers, diverse microscopes, e-beam writers, and general fabrication tools, we are committed to inventing whatever tools we may need along the way.
Founded by Sam Zeloof and Jim Keller, Atomic Semi combines Sam's garage chip-making prowess with Jim's extensive 40-year leadership in the semiconductor industry.
About the Role
We are in search of an Infrastructure & Site Reliability Engineer to design, construct, deploy, and oversee the on-premises backend infrastructure that drives our rapid semiconductor fabrication process.
This multifaceted role encompasses all elements of backend infrastructure and services.
Our infrastructure philosophy prioritizes minimalism, clarity, on-site operations, and proximity to hardware. Expect a focus on bare-metal Linux, systemd, and single-file binaries rather than extensive use of Docker, cloud services, or Kubernetes. Proficiency in Rust, Go, and Python will be beneficial.
We welcome candidates from various experience levels, ranging from outstanding early-career engineers to seasoned professionals. We are not fixed on a specific background; what is paramount is your proven ability to build real systems, enthusiasm for hands-on engineering, and a strong display of engineering excellence. If you are passionate about performance engineering, developing complex features from the ground up, and swiftly mastering new domains, this is an exciting opportunity for you.
A portfolio or GitHub account is generally required to apply: demonstrate the projects you’ve undertaken!
Join Our Team as a Lead Designer
Foundation is on the lookout for an innovative Lead Designer to advance our core mission: modernizing the homebuilding industry and enhancing the experience of buying, selling, and owning homes.
About Foundation
With $6.8 million in support from prominent venture capitalists including Y Combinator, Foundation is revolutionizing residential real estate, coming from a team formerly at Opendoor.
Our flagship product is a comprehensive customer experience platform for homebuilders, akin to “Shopify for Homebuilders.” We collaborate with large-scale homebuilders to provide a modern digital customer experience, significantly improving customer satisfaction and team productivity. In just two years, we have achieved evident product-market fit and rapid growth, all with contract design support.
Growth Phases
We are currently navigating the first of three key growth phases:
Transformative, AI-driven vertical SaaS for homebuilding.
The enterprise ecosystem for real estate that fosters collaboration and unlocks opportunities across related trillion-dollar industries.
The AI-native home operating system that supports a seamless experience for homebuyers and homeowners.
Role Overview
As our first hands-on design leader, you will collaborate closely with our founders, customers, and engineering team. We are seeking a full-stack, impact-driven designer passionate about transforming a vital sector of the U.S. economy and redefining design for the AI era.
Your contributions will be pivotal in propelling Foundation into its next growth stages, focusing on the convergence of design, AI, and real estate innovation.
Site Reliability Engineer
Location: San Francisco, CA (5 Days In-Office)
As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.
What Makes a Great Engineer at Latent
We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.
Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.
Your Responsibilities
As our SRE, you will take full ownership of the production environment and enhance the developer experience:
Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.
Join Our Team as a Site Reliability Engineer
Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.
In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.
This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.
Your Responsibilities
Working closely with our founders, infrastructure team, and development team, leveraging AI for maximum efficiency, you will architect and manage the systems that keep Blaxel fast, resilient, and secure.
Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.
Who You Are
Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.
Full-time|$170K/yr - $250K/yr|On-site|San Francisco, CA
Supported by prominent investors from Silicon Valley, Peregrine Technologies is dedicated to assisting public safety organizations, governmental entities at state and local levels, federal agencies, and private institutions tackle societal challenges with remarkable efficiency and precision. Our AI-driven platform converts fragmented and isolated data into actionable intelligence, providing instant access to crucial information that facilitates enhanced decision-making and improves outcomes across various touchpoints. Currently, Peregrine serves hundreds of clients across over 30 states and two countries, positively impacting more than 125 million lives, with plans to further amplify our influence as we expand into the enterprise sector and internationally.
Role Overview
We are seeking a Senior Infrastructure Engineer to become an integral part of our expanding team, where you will design and implement the infrastructure that underpins the Peregrine platform.
Our engineering team believes that empathy fuels better solutions. Observing users interact with our product is essential to identifying the right solutions. Engineers will have the chance to collaborate closely with our team on-site to comprehend the diverse applications that Peregrine supports. We champion strong ownership while promoting collaboration and continuous feedback.
In your role as Senior Infrastructure Engineer, you will utilize a broad array of technologies to scale our platform, ensuring secure connections to customer networks while managing vast amounts of data. You will confront a variety of intricate challenges, including:
Building highly available and secure production systems
Designing scalable and secure network architectures
Enhancing the ingest engine for substantial data processing
Speeding up the entire engineering lifecycle
Our technology stack is continuously advancing, built on AWS GovCloud, Kubernetes, Docker, Terraform, Pulumi, PostgreSQL, Redshift, Elasticsearch, and more.
About the Foundations Retrieval Team
The Foundations Research group at OpenAI explores new approaches that could shape artificial intelligence for years to come. The team focuses on improving the science and data behind model training and scaling, especially for future advanced models. Areas of focus include data utilization, scaling laws, optimization strategies, model architectures, and efficiency improvements. Within Foundations, the Search team builds agentic search solutions. This group works closely with others to design interfaces between models and the core search stack, serving, indexing, and retrieval, so model intent leads to reliable, real-world results. The team develops large-scale systems to transform and index massive information sources, enabling models to reason over global knowledge. Close collaboration with researchers helps move new modeling ideas into production quickly, changing how intelligent systems discover and synthesize information at scale.
Role Overview
OpenAI is hiring a Software Engineer with expertise in retrieval system development and scalability for its San Francisco office. This role involves working with researchers and engineers to build infrastructure that lets models access the right information when needed. Responsibilities include designing and operating indexing systems, retrieval pipelines, and serving layers. Work in this role will directly improve retrieval capabilities across OpenAI’s research and products, with a strong influence on system performance, reliability, and scalability.
What You’ll Do
Develop and scale retrieval infrastructure, including indexing, serving, and query execution.
Build low-latency, high-throughput systems for real-time model interactions.
Work with research teams to bring embedding and retrieval methods into production.
Support dense, sparse, and hybrid retrieval pipelines.
Maintain system performance, reliability, and observability at scale.
Collaborate with Pretraining, Inference, and Product teams to deliver end-to-end retrieval solutions.
Help develop model-system interfaces for agentic workflows.
Who We’re Looking For
Experience building and scaling distributed systems.
Background in developing high-performance, low-latency systems.
Hands-on work with indexing and retrieval techniques.
Familiarity with hybrid retrieval systems.
Comfort working collaboratively across multiple teams.
About Chalk
At Chalk, we are revolutionizing the data platform that drives the future of machine learning applications. Our mission is to eliminate the complexity, latency, and scalability issues that have historically limited ML capabilities. Our platform seamlessly integrates Rust-speed performance with user-friendly tools that developers adore. Renowned companies trust Chalk to combat fraudulent credit card transactions, verify identities, and enhance clean energy utilization. Recently, we secured a $50 million Series A funding, spearheaded by Felicis.
About the Role
We are on the lookout for talented engineers to join our Infrastructure team. This is a unique opportunity to become one of our early hires and significantly impact a fast-growing startup. You will have the autonomy to solve complex engineering challenges and take ownership of your projects.
We seek a platform engineer with a solid background in infrastructure engineering. At Chalk, we are tackling problems related to DBMS query planning, optimization, compilers, and distributed analytical data processing systems.
Chalk employs dynamic and static analysis of Python code to optimize arbitrary user Python code, orchestrate the necessary infrastructure implied by that code, and track metadata regarding data flow through our systems.
Our team works in the office five days a week. We are flexible with unavoidable conflicts, but this is not a hybrid position.
What You Will Do
Develop code to automate the orchestration and provisioning of infrastructure to implement Chalk technology for our customers and prospects.
Create a robust platform for managing our hosted services and deploying Chalk into customer-owned cloud environments across AWS and GCP.
Collaborate closely with our Engineering and Sales teams.
Contribute to interviewing and expanding the Engineering team.
What We’re Looking For
Minimum of 2 years of experience in software development for automated infrastructure management.
Proficiency in Python, Go, and/or Terraform.
Hands-on experience with AWS and/or GCP.
Strong collaborative skills in both technical and non-technical teams.
The Opportunity
Join rowspace as an Infrastructure Engineer and play a pivotal role in constructing and safeguarding the core of our cutting-edge AI data platform. In this position, you'll engineer systems capable of managing extensive volumes of sensitive financial information while adhering to rigorous security and compliance standards. Your work will involve real-time integration of public data with private, tenant-isolated customer data at scale.
Key Responsibilities
Design and implement scalable infrastructure to support our AI-driven knowledge engine that processes both structured and unstructured financial data.
Establish a security-first architecture for private cloud environments, ensuring data governance aligns with financial services regulations.
Create resilient data ingestion pipelines that accommodate a variety of data sources, from CapIQ feeds (structured data) to internal SharePoint documents (unstructured data).
Develop comprehensive monitoring and alerting systems for our BYOC platform.
Enforce access controls and maintain audit trails to ensure that AI interactions can be traced back to primary sources.
Collaborate with our AI Research and Product teams to enhance infrastructure for LLM inference and training workloads, as well as agent infrastructure development.
Establish CI/CD practices and infrastructure-as-code for swift, reliable deployments across multiple cloud providers.
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.
Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students, by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.
We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.
Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.
About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.
Your Responsibilities
Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.
Who We Seek
The ideal candidate will have:
Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco
At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.
About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.
The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.
At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.
As an SRE, your responsibilities will include:
Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.
Full-time|$150K/yr - $300K/yr|On-site|San Francisco
About Vibecode
At Vibecode, we are revolutionizing the way software is created. Our innovative platform empowers anyone to articulate an idea and instantly transform it into a fully functional application, with no coding skills required.
We are tackling one of the most significant challenges in computing: aligning human intent with software execution. This endeavor necessitates groundbreaking advancements in AI reasoning, code generation, and user experience design.
Our impressive seed funding comes from some of the top investors globally, including Alexis Ohanian (776), Arielle Zuckerberg, Cyan Banister (Long Journey), Ali Partovi, Suzanne Xie (Neo), and numerous esteemed angels from Google, Expo, OpenAI, and beyond.
About the Role
Are you eager to be at the cutting edge of infrastructure design for a consumer product that will reach millions? If so, this opportunity is perfect for you.
We seek an Infrastructure Engineer to develop the foundational systems that support millions of AI-generated applications. You will design a platform capable of securely hosting thousands of user-created applications concurrently while ensuring optimal performance and unwavering reliability.
Your Responsibilities:
Develop and implement secure sandbox environments for executing untrusted AI-generated code at scale.
Create orchestration systems for stateless containers capable of launching over 10,000 applications simultaneously.
Architect backend API services for real-time code generation, compilation, and deployment.
Establish monitoring and observability systems for complex, multi-tenant application infrastructures.
Design auto-scaling solutions to manage unpredictable traffic patterns from viral consumer applications.
Build security-focused infrastructure that isolates user applications while preserving performance.
This is not conventional infrastructure work. You will face unique challenges related to large-scale code execution, develop systems that are yet to be created, and establish infrastructure paradigms suited for the AI-native era.
Full-time|On-site|San Francisco, California, United States
We are seeking a highly skilled Senior Infrastructure Engineer to join our dynamic team at Bitgo. In this pivotal role, you will design, implement, and maintain our infrastructure systems, ensuring optimal performance and security. You will collaborate with cross-functional teams to develop innovative solutions and enhance our operational efficiency.
Your expertise will be crucial in managing complex infrastructure environments and leveraging automation to streamline processes. If you are passionate about infrastructure engineering and are looking to make a significant impact in a fast-paced, cutting-edge environment, we want to hear from you!
About Chalk Inc.
Chalk Inc. is revolutionizing the data platform landscape to empower the next generation of machine learning applications. We are dismantling traditional barriers of complexity, latency, and scalability that have historically limited ML capabilities. Our platform delivers Rust-speed performance paired with developer-friendly tools, making it the preferred choice for leading companies tackling issues such as preventing fraudulent credit card transactions, identity verification, and optimizing clean energy generation. Recently, we secured a $50 million Series A funding round, led by Felicis.
About the Role
We are seeking exceptional engineers to join our Infrastructure team. This is a unique opportunity to become an early employee in a high-growth startup, where you can make a significant impact. In this role, you will address complex engineering challenges with a high level of autonomy and ownership.
Note: Our team works in the office five days a week. While we are flexible during unavoidable conflicts, this position is not hybrid.
We are looking for a platform engineer with a solid foundation in infrastructure engineering. At Chalk, we tackle challenges from database management systems (DBMS), query planning and optimization, compilers, and distributed analytical data processing systems. Our team utilizes both dynamic and static analysis of Python programs to optimize arbitrary user Python code, orchestrate infrastructure based on the code structure, and track metadata concerning data flow through these systems.
What You Will Do
Develop automation code for orchestrating and provisioning infrastructure to execute Chalk technology for our customers and prospects.
Create a management platform for our hosted services and oversee the deployment of Chalk in customer-owned cloud environments across AWS and GCP.
Collaborate closely with our Engineering and Sales teams.
Assist in interviewing and expanding the Engineering team.
What We Are Looking For
At least 2 years of experience in writing software for automating infrastructure management.
Proficient in Python, Go, and/or Terraform.
Experience with AWS and/or GCP.
Strong collaboration skills.
At Greptile, we are on a mission to develop intelligent agents that autonomously verify code modifications. Our current focus involves utilizing AI to analyze pull requests on GitHub, effectively identifying bugs and enforcing coding standards. With our technology, we review nearly 1 billion lines of code each month for over 3,000 companies.
Challenges We Are Excited To Tackle
Developing agents that can learn coding standards through experience, similar to how new hires adapt.
Determining customer-specific preferences for pull request feedback using sample-efficient reinforcement learning to enhance signal-to-noise ratios.
Implementing automated deployments of feature branches and leveraging agents to stress-test the application for bug detection.
Our Growth Trajectory
Serving over 7,000 customers.
Successfully raised $30 million from prominent investors including Benchmark, Y Combinator, Paul Graham, and Initialized.
Our Team
We have curated a highly skilled team that has successfully scaled vital functions at leading companies such as Stripe, Google, Figma, and others.
Key Responsibilities
Design and implement resilient infrastructure to accommodate Greptile's expanding user base.
Collaborate with our largest enterprise clients to facilitate the deployment of Greptile within their environments.
Streamline the on-premise deployment process to support smaller clients with minimal hands-on intervention.
Mar 11, 2026