Qualifications
Key Responsibilities:
Ensure system reliability and performance across multi-cloud, multi-region platforms using first-principles thinking.
Build and maintain comprehensive observability solutions (OpenTelemetry, New Relic, Grafana, Prometheus) that deliver actionable insights into system health and performance.
Automate infrastructure provisioning and deployments using Terraform and infrastructure-as-code practices.
Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) that align with business-critical Service Level Agreements (SLAs) and foster accountability for reliability.
Manage and optimize Kubernetes clusters (EKS, GKE) focusing on security hardening, performance, and operational excellence.
Lead incident response initiatives, troubleshoot complex system issues, restore services promptly, and conduct thorough root cause analyses.
Implement preventive measures and reliability enhancements based on insights gained from incidents and system behavior patterns.
Collaborate with platform engineers and developers to integrate reliability best practices into system architecture and delivery pipelines.
Proactively scale infrastructure capacity based on growth forecasts and traffic patterns.
Contribute to architecture reviews with a strong focus on reliability, performance, and operational sustainability.
Encourage a culture of continuous improvement, systematic problem-solving, and operational excellence.
About the job
This position does not support visa sponsorship or transfers, including H-1B, F-1, OPT, STEM-OPT, or TN visas, nor is it available for corp-to-corp arrangements.
This is a hybrid position. Candidates are required to work in our Fort Mill, SC office three days a week, Tuesday through Thursday, and may work remotely on Mondays and Fridays.
The Red Platform - Platform Engineering (RPPE) team at Red Ventures is actively seeking a dedicated Site Reliability Engineer. In this pivotal role, you will ensure that our platforms and applications are resilient, scalable, and capable of performing under high load. Your focus will be on engineering reliability from the ground up, incorporating observability, automation, and proactive operational practices designed to prevent failures rather than merely react to them.
As part of a small, high-impact team, you will manage enterprise-scale systems across AWS, GCP, and Kubernetes environments, with a strong emphasis on uptime. This role will require you to build reliability guardrails, establish comprehensive monitoring, and implement automation that allows our organization to operate with confidence and agility.