Primary Submission Category: Applications in Physical Sciences, Engineering, Environment and Miscellaneous Applications
Causal Inference for AIOps: Root Cause Analysis in Microservices Incidents
Authors: Leandro Siqueira, Victor Medeiros, Willian Honorato, Kauê França, Alvino Júnior
Presenting Author: Leandro Siqueira*
Microservice architectures propagate latency in complex, non‑linear ways, making root cause analysis of tail latency degradation difficult when relying solely on correlations. In banking production environments, P95 latency is a key indicator of user‑perceived impact during incidents. This work presents a causal AI approach based on graphical causal models (GCM/DAGs) applied to real distributed trace data to explain increases in P95 (ΔP95) in an interpretable and operationally actionable manner. The method is explicitly route aware: instead of relying on an aggregated service topology, it constructs just‑in‑time causal DAGs conditioned on the exact customer journeys actually traversed.
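As an illustration of what route awareness can mean in practice, the minimal sketch below groups traces by the exact set of caller-to-callee edges they traversed, so that each distinct journey gets its own graph instead of sharing one aggregated topology. The span layout (trace_id, span_id, parent_span_id, service) and the service names are assumptions made for the example; they are not taken from the production data or from the authors' pipeline.

```python
from collections import defaultdict

# Hypothetical span records: (trace_id, span_id, parent_span_id, service).
# Field layout and service names are illustrative assumptions.
spans = [
    ("t1", "a", None, "bff"),
    ("t1", "b", "a", "auth"),
    ("t1", "c", "a", "core"),
    ("t2", "d", None, "bff"),
    ("t2", "e", "d", "auth"),
    ("t2", "f", "d", "core"),
    ("t3", "g", None, "bff"),
    ("t3", "h", "g", "cards"),
]


def call_edges_by_trace(spans):
    """Map each trace to the set of (caller, callee) service edges it used."""
    service_of = {(trace, span): svc for trace, span, _, svc in spans}
    edges = defaultdict(set)
    for trace, _span, parent, svc in spans:
        edges[trace]  # register the trace even if it has a single span
        if parent is not None:
            caller = service_of[(trace, parent)]
            if caller != svc:  # ignore spans internal to one service
                edges[trace].add((caller, svc))
    return edges


def group_by_route(spans):
    """Group trace ids by route, where a route is the exact edge set traversed."""
    routes = defaultdict(list)
    for trace, edges in call_edges_by_trace(spans).items():
        routes[frozenset(edges)].append(trace)
    return routes


routes = group_by_route(spans)
for route, trace_ids in routes.items():
    print(sorted(route), sorted(trace_ids))
```

Keying routes by a frozen set of edges means two traces share a graph only when they traversed exactly the same journey, which is the property a per-route causal DAG relies on.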
Of the 72 routes observed in production, only four exhibited significant degradation and were selected for detailed causal analysis, so that the model focuses on real, impactful user behavior. For each impacted route, the operational call graph is inverted to model upstream services as candidate causes of end‑to‑end latency at the BFF. Exclusive service latency defines each node’s causal mechanism, which is learned from baseline traces and contrasted against anomalous traces isolated via Isolation Forest. ΔP95 is then attributed to services using Shapley values over mechanism changes, producing responsibility scores that disentangle true drivers from propagated effects.
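The attribution step can be sketched in simplified form as follows. This is not the authors' implementation: it assumes that end-to-end latency on a route is the sum of per-service exclusive latencies, that baseline and anomalous samples are already separated (the Isolation Forest step is omitted), and that samples are paired by index. All names and numbers are hypothetical. Each service's "mechanism" is its exclusive-latency sample, and the value of a coalition is the route P95 obtained when only the services in that coalition switch to their anomalous mechanism.

```python
from itertools import permutations
from math import factorial


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def route_p95(baseline, anomalous, shifted):
    """P95 of summed exclusive latencies; services in `shifted` use anomalous data."""
    services = list(baseline)
    n = len(baseline[services[0]])
    totals = [
        sum((anomalous if s in shifted else baseline)[s][i] for s in services)
        for i in range(n)
    ]
    return p95(totals)


def shapley_delta_p95(baseline, anomalous):
    """Exact Shapley values of each service's mechanism change for delta P95."""
    services = list(baseline)
    scores = dict.fromkeys(services, 0.0)
    for order in permutations(services):
        shifted = set()
        previous = route_p95(baseline, anomalous, shifted)
        for s in order:
            shifted.add(s)
            current = route_p95(baseline, anomalous, shifted)
            scores[s] += current - previous  # marginal contribution of s
            previous = current
    n_orders = factorial(len(services))
    return {s: total / n_orders for s, total in scores.items()}


# Hypothetical exclusive latencies in milliseconds (illustrative only).
baseline = {
    "bff": [10] * 20,
    "auth": [20] * 20,
    "core": [30 + i for i in range(20)],
}
anomalous = {
    "bff": list(baseline["bff"]),               # unchanged mechanism
    "auth": [x + 5 for x in baseline["auth"]],  # mild shift
    "core": [x + 50 for x in baseline["core"]], # dominant shift
}

scores = shapley_delta_p95(baseline, anomalous)
delta_p95 = route_p95(baseline, anomalous, set(baseline)) - route_p95(
    baseline, anomalous, set()
)
print(scores, delta_p95)
```

By the efficiency property of Shapley values, the scores sum to the total ΔP95 of the route, so each score can be read as that service's share of the degradation. Exact enumeration over permutations is only practical for routes with a handful of services; longer routes would need a sampled approximation.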
Across the four impacted routes in a 52‑service system, ΔP95 consistently concentrates in a small subset of services, predominantly Core and
