Experience Level
Senior
Qualifications
Proven experience in Site Reliability Engineering or similar roles. Strong knowledge of cloud services (AWS, Azure, Google Cloud). Proficiency in programming languages such as Python, Go, or Java. Experience with container orchestration tools like Kubernetes. Excellent problem-solving skills and a collaborative mindset.
About the job
Pinterest is hiring a Senior Site Reliability Engineer in Toronto, ON, Canada. The focus of this role is to ensure that Pinterest’s services remain reliable, scalable, and perform well as the platform grows. Working closely with software engineers, this position involves designing and implementing solutions that strengthen system reliability and efficiency.
Key responsibilities
Partner with engineering teams to maintain and enhance the reliability of Pinterest’s services
Design and implement improvements to support scalability and performance
Troubleshoot and resolve service issues to reduce downtime
Requirements
Extensive experience in site reliability engineering or a closely related field
Strong technical background with proven problem-solving abilities
Comfort working alongside software engineers to improve systems
This position is located in Toronto, ON, Canada.
About Pinterest, Inc.
Pinterest is a visual discovery engine that helps people find inspiration for their next projects and interests. Join us to make a meaningful impact in the daily lives of millions around the globe.
Similar jobs
Full-time|CA$243K/yr - CA$297K/yr|On-site|Toronto, ON
At Relay, we empower self-made business owners with a digital banking platform that transforms financial management into a source of clarity, confidence, and control. Our mission is to replace financial uncertainty with genuine visibility, enabling entrepreneurs to convert their hard work into enduring success. By alleviating the stress of cash flow management, we provide the tools necessary for owners to operate robust and resilient businesses.
As Relay continues its growth trajectory, the reliability, performance, and resilience of our platform have become integral to both our customer experience and overall business success.
This senior leadership position is crucial in steering a team of Site Reliability Engineers while shaping how reliability strategies influence engineering and product decisions throughout the organization. You will determine the future direction of the SRE function, promote operational excellence, and assist the company in anticipating and managing scale challenges before they pose risks.
If you thrive on tackling complex systems, leading organizations, and building resilient platforms that customers depend on daily, we are eager to connect with you!
Key Responsibilities
Lead and enhance Relay’s Site Reliability Engineering function, establishing strategic direction as the company scales.
Define and implement a long-term reliability roadmap, making informed trade-offs under real business and capacity constraints.
Act as the senior reliability voice in discussions involving engineering and product leadership.
Influence the integration of reliability considerations into product planning, architectural decisions, and delivery processes.
Serve as a senior escalation point during critical production incidents, ensuring effective communication and thorough follow-up actions.
Enhance Relay’s observability, performance, and operational maturity practices across teams.
Establish and uphold standards concerning SLOs, operational readiness, incident management, and continuous improvement.
Collaborate with stakeholders in Engineering, Product, Data, and Finance to balance velocity, risk, performance, and cost.
Build and nurture a high-performing SRE organization capable of supporting future growth.
Full-time|CA$144K/yr - CA$200K/yr|Hybrid|Montreal; Toronto
The Storage Layer Services (SLS) team at MongoDB is embarking on an innovative journey to re-architect our cloud storage layer, forming the core of our next-generation cloud storage architecture. This newly established team is dedicated to creating high-performance, multi-tenant distributed storage services that not only enhance our current Atlas storage stack but also enable more efficient customer workloads. As a Senior Site Reliability Engineer, you will collaborate closely with teams responsible for these storage services to establish Service Level Objectives (SLOs), develop capacity plans, and guarantee the reliability, durability, and operational safety of the foundational storage layer supporting Atlas. By joining our small team of seasoned SREs, you will play an integral role in executing a multi-year roadmap for MongoDB’s cloud storage architecture. This position is open to candidates based in our Toronto or Montreal offices or those working remotely from anywhere in Canada, provided they are located in the Eastern or Central time zones.
About Us:
At Cohere, we are dedicated to scaling intelligence to enhance human experience. We specialize in training and deploying cutting-edge AI models for developers and businesses, empowering them to create extraordinary applications such as content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. We believe our innovative work is pivotal in driving the adoption of AI across various sectors.
Our team is passionate and meticulous about what we create. Every team member plays a crucial role in enhancing our models' capabilities and the value they deliver to our clients. We prioritize hard work and agility to serve our customers effectively.
Cohere comprises a diverse team of researchers, engineers, designers, and industry experts, all committed to excellence in their respective fields. We understand that a variety of perspectives is essential for developing outstanding products.
Join us in our mission to shape the future of AI!
Why This Position Matters:
If you thrive on building high-performance, scalable, and reliable machine learning systems, and you are excited about defining the future of AI platforms that power advanced NLP applications, we want you on our Model Serving team at Cohere. As a Site Reliability Engineer, you will be instrumental in developing, deploying, and managing our AI platform, which delivers Cohere's large language models via user-friendly API endpoints. You will collaborate with multiple teams to deploy optimized NLP models in environments characterized by low latency, high throughput, and high availability. This role also offers the chance to engage with customers and create tailored deployments that address their unique requirements.
Your Responsibilities:
Design and build self-service systems that streamline the management, deployment, and operation of services.
Develop custom Kubernetes operators that facilitate language model deployments.
Automate observability and resilience within the environment, empowering developers to troubleshoot and resolve issues efficiently.
Ensure adherence to defined Service Level Objectives (SLOs), which includes participating in an on-call rotation.
Foster strong relationships with internal developers and help guide the Infrastructure team’s roadmap based on their feedback.
Contribute to the development of our team through knowledge sharing and an active review process.
Full-time|CA$144K/yr - CA$200K/yr|Hybrid|Toronto; Vancouver
The Team
At MongoDB, our Platform Engineering division within Site Reliability Engineering (SRE) is tasked with managing essential infrastructure and operational functions that empower our engineering teams. This includes our robust, multi-cloud Kubernetes infrastructure, deployment systems, and advanced observability and alerting mechanisms.
The Fabric team is at the forefront of enabling secure communication across systems and from the public internet. Our responsibilities involve designing network architecture, implementing service mesh solutions, and optimizing edge load balancing to ensure the safety of customer data in transit. This team is vital in developing and maintaining a dependable and globally connected multi-cloud network that underpins MongoDB products.
This position can be based in our Toronto or Vancouver offices, or you can work completely remotely from anywhere in North America. We provide flexible hybrid work arrangements for those in our offices.
Join Tenstorrent as a Site Reliability Engineer, where you will play a crucial role in ensuring the reliability and performance of our cutting-edge systems. As a member of our dedicated engineering team, you will work on innovative solutions to enhance our infrastructure and streamline operations. Your expertise will help us deliver exceptional service and uptime to our customers.
Full-time|$211.5K/yr - $258.5K/yr|On-site|Toronto, ON
At Relay, we are revolutionizing the way self-made business owners manage their finances through our cutting-edge digital banking platform. Our mission is to empower entrepreneurs with the tools and knowledge they need to achieve financial clarity, confidence, and control over their earnings. By transforming cash flow management from a source of stress into a clear, actionable insight, we help our customers build stronger and more resilient businesses.
As we continue to grow, the reliability, performance, and resilience of our platform have become critical components of our customer experience and overall business success.
We are currently seeking an Engineering Manager to lead our Site Reliability Engineering (SRE) team. In this pivotal role, you will oversee the scalability, reliability, and robustness of Relay's systems. This position transcends infrastructure management and incident response; it is a leadership opportunity that sits at the nexus of technology, team dynamics, and business strategy. You will mentor and manage a talented SRE team, influence how reliability is integrated across the organization, and ensure our systems can safely scale in response to increasing customer demands and complexity.
If you thrive in technically demanding environments and are passionate about fostering strong teams, a healthy workplace culture, and effective cross-functional collaboration, this position is designed for you.
Join our innovative team at Newton as a Site Reliability Engineer, where you'll play a crucial role in ensuring the reliability and performance of our systems. In this fully remote position, you will collaborate with engineering and operations teams to develop solutions that enhance system uptime and efficiency.
Your expertise will help us transition and maintain our infrastructure, ensuring our services are resilient and scalable. This is an exciting opportunity to contribute to a company that values innovation and teamwork.
Momentum Financial Services Group (MFSG) is the company behind Money Mart, Canada’s largest non-bank branch network. With over four decades of experience, MFSG delivers financial solutions for underserved communities, including short-term loans, money transfers, and prepaid cards. Each year, millions of customers rely on these services for timely financial support.
Role Overview: Site Reliability Engineer
The Site Reliability Engineer plays a key role in keeping MFSG’s digital banking and financial services platforms available, responsive, and resilient. This position centers on automating operational tasks, setting and maintaining service-level objectives, and engineering systems to withstand and recover from failures. Daily work involves close collaboration with engineering, DevOps, QA, cybersecurity, and compliance teams to ensure platform reliability meets both technical and regulatory requirements. The role also emphasizes proactive monitoring, incident response, and ongoing improvements to the software delivery process to reduce production risk.
Why Join Momentum Financial Services Group?
Competitive compensation that reflects experience and current market rates
Annual bonus based on individual and company achievements
Comprehensive benefits including health and dental coverage with premiums fully paid, plus Employee Assistance Program access
Retirement planning support to help prepare for the future
Hybrid work model offering flexibility between remote work and in-office collaboration at the Toronto headquarters
Employee perks such as tuition reimbursement, professional development, Perkopolis discounts, and recognition programs
Location
Toronto, Canada (hybrid work model)
At Movable Ink, we empower marketers with cutting-edge content personalization through data-driven content creation and AI-driven decision-making. Our innovative platform is trusted by top global brands to enhance revenue, streamline workflows, and increase marketing agility. With our headquarters in New York City and a talented team of nearly 600 employees, Movable Ink has a presence across North America, Central America, Europe, Australia, and Japan.
As a Lead Site Reliability Engineer, you will leverage your technical expertise and leadership skills to oversee infrastructure and software development initiatives. You will play a pivotal role in designing and evolving key systems within our multi-cloud, multi-region content serving platform, which handles over 25 billion requests daily. By fostering architectural vision, cross-team collaboration, and mentorship, you will spearhead reliability initiatives and define the technical strategies necessary for scaling our platform to accommodate 50 billion requests per day and beyond.
Empower Every Identity, from AI to Human
Identity is the cornerstone of unlocking AI's potential. At Okta, we secure AI by creating a trustworthy, neutral infrastructure that allows organizations to confidently navigate this transformative era. This mission demands an unwavering commitment to addressing intricate challenges with significant real-world implications. We seek innovative builders who act with speed and urgency and execute with exceptional proficiency.
This is your chance to engage in work that can define your career. We are fully dedicated to this mission. If you share this passion, we want to hear from you.
Join Us in Securing Every Identity, from AI to Human
Okta is at the forefront of providing a superior authentication experience for hundreds of millions globally. Our focus on reliability forms the bedrock of our product, with a strong commitment to surpassing customer expectations for availability being a fundamental engineering priority. As a Senior Site Reliability Engineer, you will be part of our SRE team, ensuring our production systems are not only fully operational but also resilient, scalable, and poised for remarkable growth. This role goes beyond mere maintenance; it is about playing a significant role in enhancing the core robustness and resilience of our platform. You will be a proactive builder, developing solutions that inherently boost our system's reliability.
Your Responsibilities:
Craft and develop custom software in Go to bolster the platform’s reliability and resilience.
Collaborate with engineering teams to integrate reliability principles, enhancing the availability, performance, and observability of our services.
Utilize your profound understanding of infrastructure and observability to pinpoint improvement opportunities within the product and implement effective solutions.
Participate in our on-call rotation, providing swift, effective responses to critical incidents and utilizing your expertise to troubleshoot, mitigate, or accurately escalate production issues.
Enhance our SRE tooling and processes, focusing on automation and operational efficiency.
Establish, document, and promote reliability best practices throughout the organization.
About Rootly
At Rootly, we are dedicated to revolutionizing how organizations manage incidents. Our mission is to provide a reliable incident management platform that empowers companies to respond swiftly and effectively when challenges arise. Our innovative approach has established us as leaders in a new multi-billion dollar segment, and we are seeking exceptional talent to help us achieve our ambitious goals. Our customers, including industry giants like NVIDIA, Figma, Canva, and Tripadvisor, trust Rootly for their critical incident management needs. They appreciate our user-friendly platform and unique partnership approach, which has garnered us a stellar 5-star rating on G2. Join us in creating a reliable future for organizations worldwide. Backed by prestigious investors from Y Combinator to key operators in tech, we prioritize transparency and team involvement in our financial health. We conduct monthly business reviews and share updates through our weekly changelog.
About the Role
As a Senior Site Reliability Engineer at Rootly, you will play a crucial role in shaping our technical infrastructure. You will thrive in a dynamic environment where each day presents new challenges and opportunities for growth. This position is perfect for individuals who seek ownership, enjoy tackling complex technical problems, and are driven by a mission to enhance reliability. While the work will be demanding, it promises to be one of the most rewarding experiences in your career.
Collaborate with product teams to enhance the observability, reliability, and performance of services.
Take ownership of our CI/CD pipelines, observability tools, monitoring systems, and incident response processes.
Develop tools and automation to reduce manual toil, enhance engineering velocity, and improve developer experience and system reliability.
Engage deeply with engineering teams to gain insights into system performance and identify cross-functional reliability and scaling concerns.
Design and scale our infrastructure while ensuring top-notch performance and operational excellence.
A Few Important Notes:
Join a Profitable B2B SaaS company with teams primarily located in North America.
This position is predominantly remote, with a requirement to meet in Toronto once a month.
Candidates must possess the legal right to work in Canada; we are unable to provide visa sponsorship.
As our platform continues to expand, we are actively seeking a Senior Site Reliability Engineer (SRE) / Cloud Engineer.
Experience with Azure is highly prioritized as it is our primary cloud platform.
About Our Company:
We are recognized as one of the leading retail analytics platforms, empowering marketing teams and brands to decode retail data and execute targeted media campaigns without the need for coding. Our services enhance client understanding of customer behavior and maximize ROI on marketing campaigns, with notable clients including Home Depot.
Utilize a modern cloud stack, with a focus on Azure, CI/CD, containerization, and distributed computing technologies.
About You:
We are in search of a dynamic and skilled Senior SRE/Cloud Engineer who is eager to take on a pivotal role in managing our Cloud Operations, ensuring uptime, reliability, and automation.
Key Responsibilities:
Collaborate with software engineering teams to design, implement, and maintain CI/CD pipelines for rapid and reliable software releases.
Automate and optimize infrastructure provisioning, configuration, and management processes utilizing industry-standard tools and methodologies.
Implement and manage containerization and orchestration technologies to enhance scalability and resource efficiency.
Own the end-to-end availability and performance of our cloud infrastructure; proactively identify potential issues and implement automation to mitigate recurrence.
Participate in an on-call rotation to ensure system stability and responsiveness during off-hours.
Lead the development and implementation of service-level objectives crucial for maintaining product reliability.
Veeva Systems is a mission-driven leader in industry cloud technology, dedicated to accelerating the delivery of therapies to patients in the life sciences sector. As one of the fastest-growing SaaS companies ever, we surpassed $2 billion in revenue last fiscal year with significant growth prospects ahead.
Central to Veeva's mission are our core values: Do the Right Thing, Customer Success, Employee Success, and Speed. Notably, we made history in 2021 by becoming a public benefit corporation (PBC), which legally commits us to balance the interests of our customers, employees, society, and investors.
As a Work Anywhere company, we empower you to choose your work environment, whether it's from home or in our office, enabling you to excel in your preferred setting.
Be part of our journey in transforming the life sciences industry and making a positive impact on our customers, employees, and communities.
The Role
We are seeking a talented Senior Site Reliability Engineer to join our Vault Platform team. In this role, you will be instrumental in ensuring the scalability and reliability of our enterprise applications. You will face complex challenges on a global scale, leveraging your extensive knowledge of Java and modern open-source technologies to create a meaningful impact on our production systems.
The ideal candidate will possess substantial experience with Java applications and cutting-edge open-source technologies, particularly within the context of enterprise software development or a high-growth tech environment. As a Senior SRE, you should have a natural curiosity and a strong aptitude for problem-solving. Your unique engineering perspective will be critical as you understand how systems integrate in production to function efficiently on a global scale, supporting hundreds of customers across North America, Europe, and Asia.
At Veeva Systems, we are driven by a mission to revolutionize the life sciences industry, empowering companies to bring therapies to patients at an accelerated pace. As one of the fastest-growing SaaS companies in history, we achieved over $2 billion in revenue last fiscal year and possess immense growth potential.
Our core values - Do the Right Thing, Customer Success, Employee Success, and Speed - define who we are. In 2021, we made history by becoming a public benefit corporation (PBC), committed to balancing the interests of our customers, employees, society, and investors.
As a Work Anywhere organization, we offer the flexibility for you to work remotely or from our office, allowing you to thrive in your preferred environment.
Join us in transforming the life sciences sector and making a positive impact on our customers, employees, and communities.
Cerebras Systems is at the forefront of AI technology, having developed the world's largest AI chip, which is 56 times larger than traditional GPUs. Our revolutionary wafer-scale architecture delivers unparalleled AI compute power equivalent to dozens of GPUs on a single chip, combined with the ease of programming as if it were a single device. This innovative approach enables us to achieve industry-leading training and inference speeds, allowing machine learning practitioners to run extensive ML applications effortlessly, without the complexities associated with managing numerous GPUs or TPUs. Cerebras is trusted by leading model labs, global enterprises, and pioneering AI-native startups. Notably, OpenAI recently announced a multi-year partnership with Cerebras, aimed at deploying 750 megawatts of scale, revolutionizing critical workloads with ultra high-speed inference. Thanks to our groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution globally, exceeding GPU-based hyperscale cloud inference services by more than 10 times. This significant enhancement in speed is redefining the user experience of AI applications, facilitating real-time iterations and amplifying intelligence through enhanced agentic computation.
About The Role
As a member of the inference performance team, you will work at the critical intersection of hardware and software, enhancing end-to-end model inference speed and throughput. Your focus will encompass low-level kernel performance debugging and optimization, system-level performance analysis, performance modeling, and the creation of tools for performance diagnostics and projections.
Responsibilities
Develop performance models (kernel-level, end-to-end) to forecast the performance of state-of-the-art and client ML models.
Optimize and troubleshoot our kernel micro code and compiler algorithms to enhance ML model inference speed, throughput, and compute utilization on the Cerebras WSE.
Analyze and debug runtime performance at the system and cluster level.
Create tools and infrastructure to visualize performance data collected from the Wafer Scale Engine and our compute cluster.
Cerebras Systems is at the forefront of AI innovation, creating the world’s largest AI chip, a staggering 56 times larger than traditional GPUs. Our revolutionary wafer-scale architecture delivers the computational power of dozens of GPUs within a single chip, paired with the simplicity of a unified programming interface. This unique approach enables us to achieve unparalleled training and inference speeds, empowering machine learning practitioners to execute large-scale ML applications effortlessly, without the complexities associated with hundreds of GPUs or TPUs.
Among our esteemed clientele are leading model laboratories, global enterprises, and pioneering AI-native startups. Recently, OpenAI announced a multi-year collaboration with Cerebras, aiming to leverage 750 megawatts of scale to revolutionize key workloads through ultra-high-speed inference.
Thanks to our groundbreaking wafer-scale architecture, Cerebras Inference provides the fastest Generative AI inference solution available today, boasting speeds over ten times faster than GPU-based hyperscale cloud services. This extraordinary increase in speed is reshaping the user experience of AI applications, enabling real-time iterations and enhancing intelligence through advanced agentic computation.
About The Role
Join our inference model team, dedicated to advancing state-of-the-art models by numerically validating and accelerating innovative concepts on our wafer-scale hardware. In this role, you will prototype architectural enhancements, construct performance evaluation pipelines, and translate quantitative insights into actionable changes that drive production success.
Key Responsibilities
Prototype and benchmark innovative concepts such as new attention mechanisms, mixture of experts (MoE), speculative decoding, and other emerging advancements.
Create agent-driven automation tools that design experiments, schedule runs, triage regressions, and prepare pull requests.
Collaborate closely with compiler, runtime, and silicon teams, gaining a unique perspective on the complete software/hardware innovation stack.
Stay current with the latest open- and closed-source models; execute them on wafer scale first to identify new optimization opportunities.
Cerebras Systems is revolutionizing AI technology with the world's largest AI chip, which is 56 times larger than traditional GPUs. Our innovative wafer-scale architecture combines the immense computational power of multiple GPUs into a single chip while maintaining unparalleled programming simplicity. This allows us to provide extraordinary training and inference speeds, empowering machine learning users to seamlessly execute large-scale ML applications without the complexities of managing numerous GPUs or TPUs. We proudly serve a diverse clientele, including leading model laboratories, global corporations, and innovative AI-centric startups. Notably, OpenAI recently formed a multi-year partnership with Cerebras, committing to deploy 750 megawatts of scale to enhance critical workloads with ultra-high-speed inference. Thanks to our groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution globally, achieving speeds over 10 times faster than GPU-based hyperscale cloud inference services. This remarkable speed transformation enhances user experiences and facilitates real-time iterations while augmenting intelligence through advanced agentic computation.
About The Role
The Inference Core Platform team is integral to Cerebras’ mission of delivering the world’s fastest AI inference. Our engineers develop the core software and hardware infrastructure that enables low-latency, high-speed, and high-throughput deployment on the Cerebras Wafer-Scale Engine (WSE). We oversee the entire stack, from model compilation and scheduling to custom hardware kernels and driver development.
The Platform Benchmarking team is crucial in enhancing the performance and scalability of AI inference on one of the most advanced computing systems ever developed. We spearhead the establishment of core inference capabilities and implement performance improvements at every development phase, from initial prototyping to full production deployment.
We seek enthusiastic engineers eager to redefine the boundaries of AI inference. If you're passionate about developing systems that measure, analyze, and optimize performance on a large scale, this is your chance to make a transformative impact on the future of AI.
Join Opendoor as an Infrastructure Engineer and be at the forefront of revolutionizing the real estate industry through technology. You will collaborate with cross-functional teams to design, implement, and maintain scalable infrastructure solutions that enhance our platform's reliability and performance.
About the Role
Momentum Financial Services Group is looking for a Senior Infrastructure Engineer in Toronto. This position focuses on designing, implementing, and maintaining infrastructure solutions that support the company’s goals. The role calls for hands-on work with systems and close collaboration with teams across the organization.
What You Will Do
Design and implement infrastructure solutions to meet business needs
Maintain and improve system performance and security
Partner with cross-functional teams to support ongoing projects
Lead efforts to enhance infrastructure capabilities
Location
This role is based in Toronto, Canada.
Apr 20, 2026