Fig. 5: TWIX’ benefits persist across different experimental settings.

We show the effect of TWIX across different experimental settings (ablation studies) on a, the reliability of explanations generated by SAIS, quantified via the AUPRC, and b, the explanation bias, quantified via the improvement in the worst-case AUPRC (see Supplementary Tables 3-6 for the number of samples in each sub-cohort). The default experimental setting, RGB + Flow, was used throughout this study. The other settings withhold optical flow from SAIS (RGB) or formulate skill assessment as a multi-class task (Multi-Skill). c–f, SAIS can be used today to provide feedback to surgical trainees. c, AI-based explanations often align with those provided by human experts. d, SAIS exhibits an explanation bias against male surgical trainees. e, TWIX mitigates this explanation bias by improving the reliability of explanations provided to male surgical trainees and f, improves SAIS' performance in assessing the skill level of needle handling. Note that SAIS is trained exclusively on live data from USC and then deployed on data from the training environment. Results are shown for all 10 Monte Carlo cross-validation folds.