About the job
Join our Digital Resiliency Engineering (DRE) team, where we fuse software and systems engineering to create and manage large-scale, distributed systems built for the Singapore Government. Our mission is to ensure that Government services are dependable, performant, and tailored to meet user needs.
We are seeking talented individuals with a robust background in DevOps, Infrastructure Engineering, or Site Reliability Engineering (SRE) who have experience managing critical production infrastructure at scale. If you are eager to collaborate with a team of skilled practitioners and industry leaders, we invite you to apply.
As a Platform Engineer, you will develop essential services for the observability and automation of infrastructure services. You will participate in an on-call rotation with fellow engineers, responding swiftly to significant incidents affecting critical Government services. Your role will involve offering technical leadership to the team while closely collaborating with technical leads to maintain highly available solutions. You will also mentor team members on managing the availability and performance of mission-critical services, developing automation, and establishing monitoring solutions to prevent recurring issues.
In this capacity, you will oversee project priorities, timelines, and deliverables. You will lead the design of key components, systems, and features aimed at improving the availability, scalability, latency, and efficiency of services designed and implemented by the Government.
Key Responsibilities:
- Establish Service Level Indicators (SLIs), Service Level Objectives (SLOs), Error Budgets, and post-mortem incident processes.
- Participate in an on-call roster to ensure the reliability and performance of critical Government services, providing operational support for large-scale distributed systems and resolving incidents effectively.
- Analyze metrics and logs from operating systems and applications for capacity planning, performance tuning, and fault isolation.
- Develop automation to manage services, infrastructure, and applications.
- Enhance the reliability and quality of services through proactive monitoring.
- Continuously measure and optimize system performance, advancing SRE practices.
- Create an SRE playbook for government-wide reference.
- Identify and evaluate emerging technologies that can foster innovation for the Government.
- Collaborate within a cross-functional service team comprising software engineers, infrastructure engineers, DevOps, and other specialists.

