Distributed learning has gained attention as a way to accelerate large-scale learning tasks through parallel computing and distributed storage. However, varying network conditions and heterogeneous computational capabilities among workers can lead to high latency, imbalanced workloads, and degraded learning performance. This thesis studies a distributed learning system in which workers share limited communication resources. The goal is to minimize the overall training time across different learning paradigms, from server-based to decentralized peer-to-peer learning, and from offline to online solutions.
We first consider server-based distributed learning, where the server updates the model by aggregating local information from the workers. The training time per iteration is limited by the straggler, i.e., the last worker to send its data. To reduce idle time at synchronization points, we formulate online bandwidth allocation and batch size tuning as distributed online min-max optimization, aiming to minimize the pointwise maximum of time-varying, monotone cost functions without prior knowledge of them. We propose two novel algorithms: Distributed Online resource Re-Allocation (DORA), in which non-stragglers share their resources with stragglers, and Distributed Online Load Balancing with rIsk-averse assistancE (DOLBIE), in which underloaded workers assist the most overloaded ones. Notably, DORA and DOLBIE avoid gradient calculations and projection operations, significantly reducing communication and computation overhead in large-scale networks.
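The reallocation idea can be illustrated with a toy simulation. The donor criterion, step size, and workload model below are illustrative assumptions, not the actual DORA update rule; the point is that the update uses only observed finish times, with no gradients or projections:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_rounds = 8, 50
bandwidth = np.full(n_workers, 1.0)       # equal initial allocation
step = 0.05                               # reallocation step size (assumed)

for t in range(n_rounds):
    work = rng.uniform(0.5, 1.5, n_workers)   # unknown, time-varying workloads
    finish = work / bandwidth                 # per-worker completion time
    straggler = int(np.argmax(finish))        # iteration time = max_i finish[i]
    donors = finish < finish.mean()           # workers with slack (assumed rule)
    give = step * bandwidth[donors]           # each donor releases a fraction ...
    bandwidth[donors] -= give
    bandwidth[straggler] += give.sum()        # ... and the straggler receives it
```

The total bandwidth budget is conserved at every round, since the straggler receives exactly what the donors release; only per-round finish times are observed, matching the online, gradient-free setting.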
We next consider decentralized learning, where each worker updates its model using a weighted average of its own model and those received from its neighbors. The weights that each worker assigns to its neighbors form a consensus weight matrix, and the overall training time is influenced by both the network topology and the communication speed. We propose a novel algorithm, Communication-Efficient Network Topology (CENT), which reduces training latency by removing unnecessary communication links, enforcing sparsity on the consensus weight matrix. CENT uses a fixed trade-off factor to balance convergence and sparsity, while its adaptive version (CENT-A) adjusts this factor based on objective feedback. Both CENT and CENT-A maintain the training convergence rate and outperform state-of-the-art algorithms in real-world scenarios.
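One consensus averaging step and a toy link-removal rule can be sketched as follows. The Metropolis-Hastings weight construction is standard; the simple thresholding below merely stands in for CENT's actual sparsity-inducing criterion, which also preserves the convergence rate:

```python
import numpy as np

def metropolis_weights(adj):
    """Standard Metropolis-Hastings consensus weights for an undirected graph."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

def sparsify(W, thresh):
    """Toy link removal: zero weak off-diagonal weights and return the
    removed mass to the diagonal so each row still sums to one."""
    W = W.copy()
    off_diag = ~np.eye(len(W), dtype=bool)
    W[off_diag & (W < thresh)] = 0.0
    np.fill_diagonal(W, 0.0)
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

# A 5-worker graph: worker 0 is a hub, plus one extra link between 1 and 2.
adj = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]:
    adj[i, j] = adj[j, i] = 1
W = metropolis_weights(adj)
Ws = sparsify(W, thresh=0.25)   # removes the four weak hub links (weight 1/5)
x = np.random.default_rng(1).normal(size=(5, 3))   # one 3-parameter model per worker
x_next = Ws @ x                                    # one consensus averaging step
```

Both the original and the sparsified matrices remain row-stochastic, so the averaging step is still a convex combination of neighboring models; the removed links simply no longer incur communication cost.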
We further consider practical systems in which workers have heterogeneous computation capacities and communication channel conditions that may vary dynamically. We tackle the problem of jointly designing the consensus weight matrix and the bandwidth allocation in an unpredictable, time-varying network. We propose Dynamic Communication-Efficient Network Topology (DCENT), an algorithm that adaptively adjusts the consensus weight matrix, eliminates poor communication links, and compensates important but low-quality links with additional bandwidth. DCENT guarantees bounded dynamic regret and ensures the convergence of decentralized training. Experiments on real-world machine learning tasks demonstrate the efficacy of the proposed solution and its performance advantage over state-of-the-art algorithms.
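A toy sketch of the joint adaptation loop follows; the quality and weight thresholds and the proportional compensation rule are hypothetical stand-ins, since DCENT's actual criteria are regret-based and maintain stronger consensus-matrix properties than the simple row renormalization used here:

```python
import numpy as np

rng = np.random.default_rng(2)
n, budget = 6, 1.0                        # workers and total bandwidth (assumed)
W = np.full((n, n), 1.0 / n)              # start from a dense consensus matrix
off_diag = ~np.eye(n, dtype=bool)

for t in range(20):
    # time-varying symmetric link qualities, unknown in advance
    q = rng.uniform(0.1, 1.0, (n, n))
    quality = (q + q.T) / 2
    # eliminate links that are both low-quality and low-weight (toy thresholds)
    drop = off_diag & (quality < 0.3) & (W < 0.2)
    W = W * ~drop
    W = W / W.sum(axis=1, keepdims=True)  # keep rows stochastic (toy rule)
    # surviving low-quality links get a proportionally larger bandwidth share
    active = off_diag & (W > 0)
    share = np.where(active, 1.0 / np.maximum(quality, 1e-3), 0.0)
    total = share.sum()
    bandwidth = budget * share / total if total > 0 else np.zeros_like(share)
```

Per round, links that are weak in both quality and consensus weight are pruned, while important links that happen to have poor channels receive a larger share of the fixed bandwidth budget, mirroring the compensation behavior described above.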