About the job
For over 25 years, Realtor.com® has been the premier online platform trusted by real estate professionals, connecting buyers, sellers, and renters with the insights and expert advice they need to find their ideal home. Our comprehensive suite of tools not only transforms the real estate landscape, but also helps consumers navigate one of life's most significant decisions, making the process simple, intuitive, and empowering.
Join us in our mission to enable more individuals to find their way home by dismantling barriers, fostering meaningful connections, and instilling confidence with expert guidance.
About the Role
We are looking for a Staff Site Reliability Engineer to join our newly established Operations Excellence organization, reporting directly to the Director of Operations Excellence. This pivotal position will define the reliability, observability, and operational excellence of the platform infrastructure that serves millions of users. As a Staff SRE, you will take on a technical leadership role: mentoring others, establishing best practices, and influencing architectural decisions so that our team of 600+ engineers can deliver outstanding customer experiences.
You will engage with crucial platform systems, including EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack, all while implementing chaos engineering practices and spearheading cost optimization initiatives that yield measurable ROI.
We are committed to using the best tools to speed up problem-solving. You will be expected to use AI coding assistants and LLMs effectively to increase development speed, generate boilerplate code, and debug complex issues. Beyond basic usage, this role demands the critical judgment to assess AI-generated output for security, performance, and accuracy. You should be comfortable incorporating AI tools into your daily work to minimize repetitive tasks, so you can concentrate on high-impact architectural and strategic engineering challenges.
What You'll Do
Platform Reliability & Infrastructure
- Design and maintain highly available AWS infrastructure, including EKS clusters, Fargate (ECS), and multi-region architectures.
- Take ownership of the reliability of essential services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and associated infrastructure.
- Establish SLIs, SLOs, and error budgets for Tier 1/2/3 systems; lead architectural reviews focused on reliability and cost-efficiency.
- Drive...

