Role Summary
A people leader who oversees a team of AI-focused automation analysts within Expedia Group’s AI Resiliency Centre (ARC), the central hub for global IT operations, which provides always-on monitoring, triage, and remediation across eCommerce and corporate services. This manager builds part of the follow-the-sun capability across hubs, orchestrating human and agentic AI responders to reduce noise, cut Mean Time to Detect/Restore (MTTD/MTTR), and prevent customer-impacting incidents before they occur.
In this role, you will:
Lead a 24/7 global reliability operations function that monitors, supports, and improves production systems, ensuring high availability, resilience, and rapid incident response across multiple services and domains.
Own and mature incident management practices, including detection, triage, escalation, communication, and post‑incident review processes, driving reductions in mean time to detect (MTTD) and mean time to restore (MTTR).
Partner closely with engineering, SRE, and product teams to define and evolve operational standards, runbooks, and readiness criteria, including low‑level system design (LLD), API integration considerations, and data modeling that support reliable operations.
Develop and manage observability strategies (monitoring, alerting, logging, and dashboards) to proactively identify reliability risks and drive data‑driven improvements to system stability and performance.
Build, coach, and mentor a high‑performing reliability operations team, fostering a culture of continuous improvement, operational excellence, and accountability across multiple technical domains and platforms.
Safely integrate and operate AI/ML‑enabled solutions that improve incident detection, noise reduction, capacity forecasting, and operational workflows, drawing on familiarity with AI‑driven systems, tools, or workflows and on experience applying AI/ML concepts to real‑world products.
Minimum Qualifications:
Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience in operating large‑scale, customer‑facing systems.
Substantial experience in reliability operations, SRE, production support, or related fields, including leading 24/7 operational teams and owning reliability for multiple services or a broad technical domain.
Proven track record implementing and operating incident management, on‑call, and observability practices (monitoring, alerting, logging, dashboards) for distributed systems, including collaboration with engineering on low‑level system design (LLD), API integration, and data modeling.
Demonstrated ability to use operational and performance data to drive decisions, prioritize reliability improvements, and manage trade‑offs between stability, velocity, and cost at scale.
Hands‑on familiarity with AI‑driven or automation‑focused operational tools (for example, intelligent alerting, anomaly detection, or automated remediation) and ability to ensure they are integrated and operated safely in production.
Experience with automation tools and at least one programming or scripting language (Python preferred).
Experience with monitoring and observability tools such as Datadog, Splunk, Catchpoint, PagerDuty, or similar platforms.
Strong incident response mindset, including the ability to analyze outages, identify root causes, and proactively recommend and implement automation-driven solutions to prevent recurrence.
Preferred Qualifications:
Experience leading reliability operations for complex, high‑traffic, globally distributed systems, including coordination across multiple engineering and product teams and ownership of multi‑service or multi‑domain reliability outcomes.
Demonstrated success defining and evolving operational architectures and runbooks in partnership with engineering, including low‑level system design, API design for operability, and data models that support effective monitoring, alerting, and incident analysis.
Strong track record driving operational excellence: improving incident response processes, leading blameless post‑incident reviews, reducing recurring incidents, and implementing long‑term reliability improvements grounded in data.
Experience scaling AI‑ or ML‑enabled capabilities within reliability operations, such as intelligent incident triage, predictive capacity and reliability modeling, or AI‑assisted runbooks, with clear governance and safety controls.
Depth in using AI‑driven observability or AIOps platforms to correlate signals across logs, metrics, and traces, and to continuously refine alerting and automation strategies that improve reliability outcomes.