Background and Objective: Clinical assessments of hand function are used as outcome measures to evaluate new therapies after stroke, but they do not capture true performance in natural environments. Existing deep learning methods for quantifying hand usage after stroke from at-home egocentric video cannot consider behaviour across multiple different activities of daily living (ADLs). This study presents a novel multi-video architecture for improved hand impairment estimation.
Methods: Late fusion (majority voting, fully-connected network) and intermediate fusion (concatenation, Markov chain) strategies were investigated for building multi-video architectures for impairment detection and classification, using SlowFast as the base feature extractor. Models were developed for both cropped and full-frame inputs.
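To illustrate the intermediate-fusion-by-concatenation strategy named above, the following minimal sketch joins per-video feature vectors before a single classification head. All dimensions, the random stand-in features, and the linear head are illustrative assumptions, not details from the paper; in the actual architecture the features would come from a trained SlowFast backbone and the head would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper): each of
# N_VIDEOS clips yields one FEAT_DIM-dimensional vector from a shared
# feature extractor (SlowFast in the paper).
N_VIDEOS, FEAT_DIM, N_CLASSES = 4, 256, 2

# Random stand-ins for per-video features.
per_video_feats = [rng.standard_normal(FEAT_DIM) for _ in range(N_VIDEOS)]

# Intermediate fusion by concatenation: merge per-video features into
# one vector BEFORE the classifier, so the decision can use evidence
# from all videos jointly (unlike late fusion, which votes per video).
fused = np.concatenate(per_video_feats)  # shape: (N_VIDEOS * FEAT_DIM,)

# Hypothetical linear classification head (weights would be learned).
W = rng.standard_normal((N_CLASSES, fused.size)) * 0.01
b = np.zeros(N_CLASSES)
logits = W @ fused + b
pred = int(np.argmax(logits))  # binary impairment detection: 0 or 1
```

By contrast, a late-fusion design would classify each video independently and combine the per-video decisions afterwards, e.g. by majority voting.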
Results: Impairment detection models with intermediate fusion by concatenation achieved an F1-score of 0.778±0.129 for cropped inputs and 0.796±0.102 for full-frame inputs. Both significantly outperformed their single-video counterparts.
Conclusion: A novel multi-video architecture is beneficial for estimating hand impairment from egocentric video after stroke.