Clinical datasets often comprise multiple data points, or trials, sampled from each participant. When these datasets are used to train machine learning models, the method used to partition the data into training and test sets must be chosen carefully. Under the standard machine learning approach (random-wise split), different trials from the same participant may appear in both the training and test sets. This concern has motivated schemes that segregate all data points from the same participant into a single set (subject-wise split). Past investigations have demonstrated that models trained under subject-wise splits underperform those trained under random-wise splits. Additional training on a small subset of trials, known as calibration, bridges the performance gap between split schemes; however, the number of calibration trials required to achieve strong model performance remains unclear. This study therefore investigates the relationship between calibration training set size and prediction accuracy on the calibration test set. A deep-learning classifier was developed using a database of 30 young, healthy adults who performed multiple walking trials across nine different surfaces while fitted with inertial measurement unit sensors on the lower limbs. For subject-wise trained models, calibration on a single gait cycle per surface yielded a 70% increase in F1-score, the harmonic mean of precision and recall, while 10 gait cycles per surface were sufficient to match the performance of a random-wise trained model. Code to generate calibration curves may be found at https://github.com/GuillaumeLam/PaCalC.
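
For readers unfamiliar with the two split schemes, the minimal Python sketch below illustrates the distinction using scikit-learn on synthetic data. It is not the PaCalC implementation; the array shapes, participant counts, and variable names are illustrative assumptions chosen to mirror the dataset described above.

```python
# Contrast of the two split schemes on synthetic gait-trial data (a sketch,
# not the authors' code). train_test_split mixes trials freely (random-wise);
# GroupShuffleSplit keeps every participant's trials in one set (subject-wise).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_participants, trials_per_participant, n_features = 30, 20, 12  # illustrative sizes

X = rng.normal(size=(n_participants * trials_per_participant, n_features))
y = rng.integers(0, 9, size=len(X))  # 9 walking-surface labels
groups = np.repeat(np.arange(n_participants), trials_per_participant)

# Random-wise split: trials from the same participant can land in both sets,
# which is the leakage that inflates test performance.
Xtr_r, Xte_r, ytr_r, yte_r = train_test_split(X, y, test_size=0.2, random_state=0)

# Subject-wise split: participants, not trials, are assigned to a single set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
Xtr_s, Xte_s = X[train_idx], X[test_idx]

# No participant appears in both sets under the subject-wise scheme.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

Under the subject-wise scheme the final assertion always holds, whereas the random-wise split typically scatters a participant's trials across both sets. The calibration-curve code itself lives in the PaCalC repository linked above.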