About the job
At Tecsys, we recognize the transformative effect of remote work on employee well-being and the environment. Our commitment to remote work improves employee morale and productivity and reduces commuting time. We are proud to be a remote-first organization, supported by cutting-edge technologies and programs that create a fantastic foundation for our team. Our flexible remote environment, complemented by well-located offices and collaborative workspaces, empowers our staff to work in ways that maximize their productivity.
About Tecsys
Tecsys is a rapidly growing innovator in supply chain solutions for leading healthcare systems, hospitals, pharmacies, distributors, retailers, and 3PLs. We collaborate with industry leaders to transform their supply chains through technology. If you thrive on tackling challenges and seek continuous learning opportunities, we invite you to join our dynamic team!
Position Overview
We are looking for an Infrastructure Reliability Engineer to join our Network Operations and Security Center (NOC) team, which is pivotal to the reliability of our critical SaaS platforms. In this role, you will help maintain and optimize the systems that drive our cloud infrastructure on AWS and Kubernetes, and ensure their reliability and performance. The role places a strong focus on automation, observability, and continuous improvement.
This position combines reliability engineering with incident management, placing you in a key role responsible for availability, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and the continuous enhancement of resilience through automation and engineering.
Your Responsibilities
- Collaborate with engineering teams to support services prior to their launch through activities such as systems design consultation, platform and software framework development, capacity planning, and launch reviews.
- Continuously innovate by identifying weaknesses, proposing creative solutions, and driving initiatives that simplify, scale, and strengthen the platform.
- Maintain services post-launch by measuring and monitoring availability, latency, and overall system health.
- Strengthen observability: enhance and expand monitoring and alerting using Datadog, define SLOs/SLIs, and create actionable dashboards that drive reliability improvements.
- Develop and enhance...

