Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
Prior approaches mainly focus on additive noise and collapse stability into a single metric without conditioning on predictions. FASS addresses these limitations through prediction-invariant evaluation, multi-axis stability measurement, and diverse perturbation testing.
Each input image is first subjected to controlled perturbations, including geometric (rotation, translation), photometric (brightness, noise), and compression (JPEG). Only pairs where the predicted class remains unchanged pass through a prediction-invariance filter, ensuring that explanation instability is not conflated with changes in the model's decision.
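The filtering step can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the toy threshold "classifier" and the brightness perturbation are stand-ins for a real model and the perturbation families described above.

```python
import numpy as np

def prediction_invariant_pairs(model, images, perturb):
    """Return (original, perturbed) pairs whose predicted class is unchanged.

    `model` maps an image array to a class index; `perturb` maps an image
    to its perturbed version. Both are placeholders for the real pipeline.
    """
    kept = []
    for x in images:
        x_pert = perturb(x)
        if model(x) == model(x_pert):  # prediction-invariance filter
            kept.append((x, x_pert))
    return kept

# Toy stand-ins (assumptions for illustration only):
toy_model = lambda x: int(x.mean() > 0.5)          # threshold "classifier"
brightness = lambda x: np.clip(x + 0.1, 0.0, 1.0)  # photometric perturbation

images = [np.full((4, 4), v) for v in (0.2, 0.45, 0.8)]
pairs = prediction_invariant_pairs(toy_model, images, brightness)
# The 0.45 image crosses the decision boundary after brightening,
# so it is filtered out; the other two pairs survive.
```

Only the surviving pairs proceed to stability scoring, which is what keeps explanation fragility separate from model sensitivity.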
Filtered pairs are evaluated across three axes: spatial coherence, rank-order preservation, and top-k salient feature overlap. The resulting FASS score is the mean of these three metrics, providing a multi-dimensional view of explanation stability.
Four post-hoc attribution methods—Grad-CAM, Integrated Gradients, GradientSHAP, and LIME—are evaluated across multiple architectures and datasets. This modular setup allows direct comparison of method robustness under diverse perturbation families.
Evaluation shows that Grad-CAM consistently achieves the highest stability across datasets and architectures, with a global mean FASS of 0.718. Other methods score lower: IG 0.509, GradientSHAP 0.479, LIME 0.407. Geometric perturbations are most destabilizing (mean FASS 0.484), compared to photometric changes (0.546) and JPEG compression (0.511). Prediction-invariant filtering is essential, as many perturbations would otherwise alter the predicted class.
Method choice dominates architecture effects: the stability gap between Grad-CAM and other methods (≈0.21) is roughly twice the largest within-method architecture variation (≈0.09). These results highlight that attribution method selection is critical for explanation reliability, and FASS provides a unified benchmark to measure and compare stability under diverse perturbations.