On the No-Z-Cycle Property in Distributed Executions

作者:

Highlights:

摘要

Given a checkpoint and communication pattern of a distributed execution, the No Z-Cycle property (NZC) states that a dependency between a checkpoint and itself does not exist. In other words, a noncausal sequence of messages that starts after a checkpoint and terminates before that checkpoint does not exist. From an operational point of view, this property corresponds to the fact that each checkpoint belongs to at least one consistent global checkpoint. So it could be used, for example, for restarting a distributed application after the occurrence of a failure. In this paper we derive a characterization of the NZC property (previously an open problem). It identifies a subset of Z-cycles, namely core Z-cycles (CZCs), that has to be empty in order that the checkpoint and communication pattern of the execution satisfies the NZC property. Then, we present a communication-induced checkpointing protocol that prevents CZCs on-the-fly. This protocol actually removes the common causal part to any CZC. Finally we propose a taxonomy of communication-induced checkpointing protocols that ensure the NZC property.

论文关键词:checkpointing,causality,theory of distributed computing,global states,distributed algorithms,fault-tolerance,distributed systems

论文评审过程:Received 26 March 1998, Revised 26 May 2000, Available online 25 May 2002.

论文官网地址:https://doi.org/10.1006/jcss.2000.1720