Online job scheduling for distributed machine learning in optical circuit switch networks

Authors:

Highlights:

Abstract

Networking is a well-known performance bottleneck for distributed machine learning (DML). Although many works have focused on accelerating DML communication, they ignore the impact of the physical network on DML performance. Meanwhile, optical circuit switches (OCSes) are increasingly deployed in data centers and clusters and can fundamentally improve DML performance. Because the OCS reconfiguration delay is non-negligible, the OCS scheduling algorithm has a great impact on the performance of upper-layer applications. However, existing OCS scheduling solutions are not suitable for DML jobs, because DML jobs are iterative and interleave communication and computation stages. In this paper, we therefore study online multi-job scheduling for DML in OCS networks. First, we propose heaviest-load-first (HLF), a heuristic algorithm for intra-job scheduling, based on the observation that the completion time of flows on the most heavily loaded port has a significant impact on the job completion time. Second, we present the Shortest Weighted Remaining Time First (SWRTF) algorithm for inter-job scheduling. In SWRTF, an available DML job is scheduled when the job being served moves from its communication stage to its computation stage, which significantly improves circuit utilization. Large-scale simulations show that HLF reduces the iteration communication time by up to 64.97% compared to the state-of-the-art circuit scheduler Sunflow. In addition, SWRTF reduces the Weighted Job Completion Time (WJCT) by up to 42.9%, 54.2%, and 27.2% compared to the Shortest-Job-First, Baraat, and Weighted-First inter-job scheduling algorithms, respectively.
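To make the two scheduling ideas concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not give the exact rules, so the sketch makes two labeled assumptions: the "heaviest load port" is taken to be the input or output port with the largest total traffic in a demand matrix, and the SWRTF priority is taken to be remaining time divided by job weight. The names `Job`, `heaviest_port`, and `pick_next_job` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    name: str
    weight: float          # assumed: larger weight means a more important job
    remaining_time: float  # assumed: estimated remaining time, arbitrary units
    ready: bool            # True if the job is waiting for the circuit


def heaviest_port(demand: List[List[float]]) -> Tuple[str, int, float]:
    """Return (kind, index, load) of the most heavily loaded port.

    demand[i][j] is the traffic from input port i to output port j.
    Assumption: a port's load is the sum of its row (input) or column (output).
    """
    n_in, n_out = len(demand), len(demand[0])
    in_loads = [sum(row) for row in demand]
    out_loads = [sum(demand[i][j] for i in range(n_in)) for j in range(n_out)]
    candidates = [("in", i, load) for i, load in enumerate(in_loads)]
    candidates += [("out", j, load) for j, load in enumerate(out_loads)]
    return max(candidates, key=lambda c: c[2])


def pick_next_job(jobs: List[Job]) -> Optional[Job]:
    """Pick the ready job with the smallest remaining_time / weight.

    Assumption: this ratio stands in for the 'weighted remaining time';
    the paper's actual priority rule may differ.
    """
    ready = [j for j in jobs if j.ready]
    if not ready:
        return None
    return min(ready, key=lambda j: j.remaining_time / j.weight)


# Hypothetical 3x3 demand matrix for one communication stage.
demand = [[1, 5, 0],
          [2, 0, 3],
          [0, 4, 1]]
port = heaviest_port(demand)

# Hypothetical job queue; job C is still computing, so it is not ready.
jobs = [Job("A", weight=1, remaining_time=10, ready=True),
        Job("B", weight=4, remaining_time=20, ready=True),
        Job("C", weight=2, remaining_time=6, ready=False)]
next_job = pick_next_job(jobs)
```

In this sketch, `pick_next_job` would be called whenever the job currently holding the circuits enters its computation stage, so that the circuits are handed to another job rather than left idle.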

Keywords: Distributed machine learning (DML), Optical circuit switch (OCS), Online job scheduling, Weighted Job Completion Time (WJCT)

Review history: Received 18 February 2020, Revised 25 April 2020, Accepted 5 May 2020, Available online 12 May 2020, Version of Record 19 May 2020.

Official paper URL: https://doi.org/10.1016/j.knosys.2020.106002