About this role
Position: SRE Lead
Location: Toronto, ON (On-site)
Employment Type: Full-Time
Total Experience: 10+ years
Job Summary:
We are seeking an experienced SRE Lead to drive reliability, observability, automation, and operational excellence across complex enterprise platforms. The ideal candidate will have deep expertise in cloud-native and on-premises ecosystems, advanced observability practices, and large-scale transaction processing environments.
Roles & Responsibilities:
· Lead and execute SRE roadmap initiatives, capability assessments, and reliability improvement programs.
· Design, implement, and optimize observability solutions across applications, infrastructure, platforms, and networks.
· Serve as the SME for Dynatrace, including DQL, Grail, Gen3 Dashboards, ActiveGate, SRG Workflows, and Business Events.
· Drive end-to-end troubleshooting and root cause analysis across distributed enterprise systems.
· Build and enhance monitoring frameworks leveraging Metrics, Events, Logs, and Traces (MELT).
· Implement SRE best practices, platform engineering capabilities, self-service tooling, and policy-as-code frameworks.
· Develop automation solutions using Python, Node.js, AWS Lambda, ECS, and backend integrations.
· Establish cloud observability standards across AWS services including CloudWatch, API Gateway, Lambda, and Application Signals.
· Design monitoring strategies for highly integrated enterprise and financial systems, including middleware and AI-driven platforms.
Required Skills & Qualifications:
· 10+ years of experience in Site Reliability Engineering, Production Support, Platform Engineering, or Observability Engineering.
· Strong expertise in Dynatrace and enterprise observability platforms.
· Hands-on experience with AWS cloud services and monitoring ecosystems.
· Proficiency in Python and/or Node.js for automation and operational tooling.
· Deep understanding of distributed systems, performance engineering, and reliability practices.
· Experience supporting large-scale financial services or other mission-critical enterprise environments.
· Strong leadership, stakeholder management, and strategic planning capabilities.
Preferred Qualifications:
· Experience with IBM DataPower, API platforms, and enterprise integration technologies.
· Knowledge of Google SRE principles and modern platform engineering practices.
· Experience monitoring AI/ML-driven applications and services.