Fig. 4: TWIX effectively mitigates explanation bias exhibited by SAIS against surgeons.

Reliability of attention-based explanations, stratified across surgeon sub-cohorts, when assessing the skill level of a, needle handling and b, needle driving (see Supplementary Tables 3-6 for the number of samples in each sub-cohort). Effect of TWIX on the reliability of AI-based explanations for the disadvantaged surgeon sub-cohort (worst-case AUPRC) when assessing the skill level of c, needle handling and d, needle driving. AI-based explanations come in the form of attention placed on frames by SAIS or a direct estimate of frame importance by TWIX (see Methods). In all panels, we do not report caseload for SAH due to insufficient samples from one sub-cohort. Note that SAIS is trained exclusively on data from USC and then deployed on data from USC, SAH, and HMH. Results are averaged across 10 Monte Carlo cross-validation folds, and error bars reflect the 95% confidence interval.
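As a rough illustration of how the quantities in this caption relate, the sketch below summarises per-fold AUPRC values for each sub-cohort and picks out the worst-case (lowest mean) sub-cohort. It is not the authors' code: the function names, the sub-cohort labels, and the AUPRC values are hypothetical placeholders, and the normal-approximation interval (1.96 times the standard error) is an assumption, since the caption does not state how the 95% confidence interval was computed.

```python
import numpy as np


def mean_and_ci95(values):
    """Mean and half-width of a normal-approximation 95% CI across folds.

    Assumption: the source does not say how its interval was derived.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    sem = values.std(ddof=1) / np.sqrt(values.size)
    return mean, 1.96 * sem


def worst_case_cohort(auprc_by_cohort):
    """Return (name, mean, half_width) for the sub-cohort with the lowest mean AUPRC."""
    summaries = {name: mean_and_ci95(v) for name, v in auprc_by_cohort.items()}
    worst = min(summaries, key=lambda name: summaries[name][0])
    return (worst, *summaries[worst])


# Illustrative placeholder values only (NOT results from the study):
# 10 per-fold AUPRC values for each of two hypothetical sub-cohorts.
rng = np.random.default_rng(0)
auprc_by_cohort = {
    "sub-cohort A": rng.uniform(0.60, 0.70, size=10),
    "sub-cohort B": rng.uniform(0.45, 0.55, size=10),
}

name, mean, half_width = worst_case_cohort(auprc_by_cohort)
print(f"{name}: {mean:.3f} +/- {half_width:.3f}")
```

Under these assumptions, comparing the worst-case mean obtained with SAIS attention against the one obtained with TWIX frame-importance estimates would correspond to the comparison shown in panels c and d.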