Building large, powerful artificial intelligence models through pre-training has become the dominant paradigm in well-understood areas such as natural language processing and computer vision, leading to foundation models such as ChatGPT, GPT-4, Llama, and MocoV3, among many others. Within the sensor domain, these strategies are less understood: although they have been applied before, no one has yet successfully built a foundation model for sensor applications. In this work, we first explore which of the two most broadly used pre-training paradigms, masked language modeling and contrastive learning, is better suited to the sensor domain. Our results show that masked language modeling builds more generalizable representations for both classification and regression sensor tasks, and we provide insight into why instance discrimination, the basis of contrastive learning objectives, may fall short for common sensor applications such as inertial odometry. We then apply these strategies to create ADAPT, a foundation model that can process time-series data collected from different physical dimensions and is the first model to be successfully pre-trained on a wide range of sensor data simultaneously without significant degradation of accuracy. ADAPT demonstrates state-of-the-art performance on several classification tasks and data types. We provide further evidence that it is the diverse physical characteristics of sensor data that have previously inhibited many-to-one pre-training. This work has the potential to make a profound impact on the construction of pre-trained models, in particular foundation models, within the sensor domain.
Keywords: