Reconfigurable Architectures for Deep Learning

Rapid increases in available training data and compute capabilities have fueled the development of deep learning (DL) models that achieve unprecedented results on tasks once considered unsuitable for machines, leading to the adoption of DL in myriad systems. Specialized hardware is often necessary to meet the performance requirements of state-of-the-art DL models, but the rapid pace of change in models and the wide variety of systems integrating DL make it infeasible to create custom chips for all but the largest markets. Reconfigurable computing architectures, among which field-programmable gate arrays (FPGAs) are the most prominent, uniquely combine flexibility and direct hardware execution, making them well suited to accelerating DL inference. In the first part of this thesis, we evaluate the security and performance of current FPGAs for DL inference acceleration. We demonstrate that the inherent resilience of DL models to computation errors can provide additional protection for FPGA-based DL accelerators against timing faults induced by voltage attacks. Next, we re-architect the neural processing unit (NPU) inference overlay to use the fabric-embedded tensor blocks in new DL-optimized FPGAs. We show that by exploiting the FPGA’s flexibility to directly chain functional units and its tremendous on-chip memory bandwidth, our tensor-block-enhanced NPU achieves an order of magnitude higher performance and energy efficiency than same-generation DL-optimized GPUs. Then, we present Koios, a suite of 40 DL-targeted benchmarks that enables FPGA computer-aided design (CAD) and architecture research. In the second part of this thesis, we focus on the architecture exploration of novel reconfigurable acceleration devices (RADs) that combine programmable fabrics, packet-switched networks-on-chip (NoCs), and application-specialized accelerators.
We develop a suite of tools for evaluating these architectures, including a cycle-accurate performance simulator and tools for modeling the area and timing of different RAD components. We showcase these tools through several case studies that highlight their utility for architecture/application co-design, studying the performance/cost tradeoff of different RAD candidates, optimizing NoC placement, and modeling homogeneous and heterogeneous 3D-stacked devices. Our case study on accelerating DL recommendation models shows that future RADs could offer an order of magnitude higher performance than current conventional FPGAs.