Heartbeat (computing)
In
Heartbeat protocol
A heartbeat protocol is generally used to negotiate and monitor the availability of a resource, such as a floating
Because a heartbeat is intended to indicate the health of a machine, it is important that the heartbeat protocol and the transport it runs on are as reliable as possible. Causing a failover because of a false alarm may, depending on the resource, be highly undesirable. It is also important to react quickly to an actual failure, which further underscores the need for reliable heartbeat messages. For this reason, it is often desirable to run a heartbeat over more than one transport, for instance an Ethernet segment using UDP/IP and a serial link.
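The benefit of a second transport can be illustrated with a minimal sketch. It assumes a monitor that declares a peer dead only when every transport has been silent for longer than a timeout; the class name, method names, and transport labels are hypothetical and not taken from any particular heartbeat implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MultiTransportMonitor:
    """Tracks the last heartbeat seen on each transport for one peer (illustrative only)."""

    timeout: float  # seconds of silence after which a transport counts as silent
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, transport: str, now: float) -> None:
        # Called whenever a heartbeat arrives on the named transport.
        self.last_seen[transport] = now

    def silent_transports(self, now: float) -> list[str]:
        # Transports whose most recent heartbeat is older than the timeout.
        return sorted(t for t, seen in self.last_seen.items() if now - seen > self.timeout)

    def peer_alive(self, now: float) -> bool:
        # The peer is considered dead only when *every* transport is silent,
        # so a single failed link does not by itself trigger a failover.
        return any(now - seen <= self.timeout for seen in self.last_seen.values())
```

In this sketch, a peer whose UDP heartbeats stop while its serial heartbeats continue is still reported as alive, and `silent_transports` identifies the failed link.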
The "cluster membership" of a node is a property of network reachability: if the master can communicate with the node, the node is considered a member of the cluster; otherwise it is considered "dead".
A heartbeat implementation involves the following subsystems:
- Heartbeat Subsystem (HS): The subsystem that monitors the node's presence within the cluster through a series of keepalive or "heartbeat" messages.
- Cluster Manager (CM): The subsystem within the cluster—usually the master server—which keeps track of the "cluster members" and records which resources are on which nodes.
- Cluster Transition (CT): When a node joins or leaves the cluster, this subsystem is responsible for keeping track of such occurrences for the purpose of triggering events that rebalance and reconfigure the master to distribute the load.
Heartbeat messages are sent periodically through techniques such as broadcast or multicast in larger clusters.[6] Since CMs have transactions across the cluster, the most common pattern is to send heartbeat messages to all the nodes and "await" responses in a non-blocking fashion.[8] Because heartbeat or keepalive messages make up the overwhelming majority of non-application cluster control messages, and also go to every member of the cluster, major critical systems also deliver heartbeats over non-IP transports such as serial ports.[9]
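The "send to all nodes and await responses without blocking" pattern can be sketched with Python's `asyncio`. This is an illustration under stated assumptions, not the design of any specific cluster manager: `ping` is a hypothetical stand-in that simulates a node's reply delay with a sleep rather than performing real network I/O, and the function names are invented for the example.

```python
import asyncio


async def ping(node: str, delay: float) -> str:
    # Hypothetical stand-in for "send a heartbeat and wait for the reply";
    # `delay` simulates how long the node takes to answer.
    await asyncio.sleep(delay)
    return node


async def heartbeat_round(delays: dict[str, float], timeout: float) -> set[str]:
    """Probe every node concurrently and return those that answered in time."""

    async def probe(node: str, delay: float):
        try:
            return await asyncio.wait_for(ping(node, delay), timeout)
        except asyncio.TimeoutError:
            return None  # no reply within the timeout

    # All probes run concurrently; one slow node does not delay the others.
    results = await asyncio.gather(*(probe(n, d) for n, d in delays.items()))
    return {r for r in results if r is not None}


def run_round(delays: dict[str, float], timeout: float) -> set[str]:
    # Synchronous convenience wrapper around one heartbeat round.
    return asyncio.run(heartbeat_round(delays, timeout))
```

Calling `run_round` with a mapping of node names to simulated delays returns the set of nodes that replied within the timeout; the round takes roughly as long as the timeout, not the sum of the individual delays.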
Design and implementation
Every CM on the master server maintains a finite-state machine with three states for each node it administers: Down, Init, and Alive.[10] Whenever a new node joins, the CM changes the state of the node from Down to Init and broadcasts a "boot-up message"; on receiving it, the node executes a set of start-up procedures. The node then responds with an acknowledgment message, whereupon the CM includes the node as a member of the cluster and transitions its state from Init to Alive. Every node in the Alive state receives a periodic broadcast heartbeat message from the HS and is expected to return an acknowledgment within a timeout range. If the CM does not receive an acknowledgment in time, the node is considered unavailable, and the CM transitions that node from Alive to Down.[11] The procedures or scripts to run and the actions to take at each state transition are implementation details of the system.
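The three-state machine described above can be sketched as follows. The state names come from the text; the class name, method names, and the use of an explicit `now` timestamp are assumptions made for illustration, and the start-up procedures and scripts are left out because they are implementation-specific.

```python
from enum import Enum


class State(Enum):
    DOWN = "Down"
    INIT = "Init"
    ALIVE = "Alive"


class NodeTracker:
    """Per-node state as kept by a cluster manager (illustrative sketch)."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # maximum allowed gap between acknowledgments
        self.state = State.DOWN
        self.last_ack = None

    def on_join(self) -> None:
        # A joining node moves from Down to Init; the boot-up message
        # would be broadcast at this point.
        if self.state is State.DOWN:
            self.state = State.INIT

    def on_ack(self, now: float) -> None:
        # The first acknowledgment completes start-up (Init -> Alive);
        # later ones refresh the liveness timestamp. Acks in Down are ignored.
        if self.state is State.INIT:
            self.state = State.ALIVE
        if self.state is State.ALIVE:
            self.last_ack = now

    def on_tick(self, now: float) -> None:
        # Periodic check: an Alive node that has missed the timeout goes Down.
        if self.state is State.ALIVE and now - self.last_ack > self.timeout:
            self.state = State.DOWN
```

Passing the current time in explicitly, rather than reading a clock inside the class, keeps the transitions deterministic and easy to test.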
Heartbeat network
A heartbeat network is a private network that is shared only by the nodes in the cluster and is not accessible from outside the cluster. Cluster nodes use it to monitor each other's status and to exchange the messages necessary for maintaining the operation of the cluster. The heartbeat method relies on the FIFO nature of the signals sent across the network. By making sure that all messages have been received, the system ensures that events can be properly ordered.[12]
In this
In general, it is difficult to select a delta that is optimal for all applications. If delta is too small, it incurs too much overhead; if it is too large, performance degrades because everything waits for the next heartbeat signal.[14]
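The trade-off can be made concrete with a rough calculation. The sketch below assumes that delta is the interval between heartbeats in seconds, that each node sends one message per heartbeat, and that work becomes ready at a uniformly random moment within an interval, so the mean wait for the next heartbeat is delta / 2. These are simplifying assumptions for illustration, not a model taken from the cited source, and the function name is hypothetical.

```python
def heartbeat_cost(delta: float, nodes: int) -> tuple[float, float]:
    """Return (messages per second, mean wait in seconds) for a given delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    msgs_per_sec = nodes / delta  # overhead grows as delta shrinks
    mean_wait = delta / 2         # waiting grows as delta increases
    return msgs_per_sec, mean_wait
```

Under these assumptions, a cluster of 10 nodes with delta = 0.5 generates 20 messages per second with a mean wait of 0.25 seconds, while delta = 4.0 cuts the traffic to 2.5 messages per second but raises the mean wait to 2 seconds.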
See also
- Watchdog timer, electronic timer that is used to detect and recover from computer malfunctions
- Heartbleed vulnerability
Notes
- ^ a b Hou & Huang 2003, p. 1.
- ^ "Definition of Heartbeat". pcmag.com Encyclopedia. Retrieved 7 October 2020.
- ^ a b Robertson 2000, p. 1.
- ^ US 4710926, Donald W. Brown, James W. Leth, James E. Vandendorpe, "Fault recovery in a distributed processing system", published 1987-12-01
- ISSN 0302-9743.
- ^ a b Robertson 2000, p. 2.
- ^ Robertson 2000, pp. 1–2.
- ^ Robertson 2000, pp. 2–3.
- ^ Robertson 2000, p. 5.
- ^ Li, Yu & Wu 2009, p. 2.
- ^ Li, Yu & Wu 2009, pp. 2–3.
- ^ Nikoletseas 2011, p. 304.
- ^ Nikoletseas 2011, pp. 304–305.
- ^ Nikoletseas 2011, p. 306.
References
- Nikoletseas, Sotiris; Rolim, José D.P., eds. (2011). "Theoretical Aspects of Distributed Computing in Sensor Networks". Monographs in Theoretical Computer Science. An EATCS Series. Berlin, Heidelberg: Springer Berlin Heidelberg. ISSN 1431-2654.
- Hou, Zonghao; Huang, Yongxiang (29 March 2003). Design and implementation of heartbeat in multi-machine environment. 17th International Conference on Advanced Information Networking and Applications, 2003. AINA 2003. China: ISBN 0-7695-1906-7.
- Robertson, Alan (2000). Linux-HA Heartbeat System Design (PDF). USENIX Annual Technical Conference. SUSE Labs.
- Li, Fei-Fei; Yu, Xiang-Zhan; Wu, Gang (11 July 2009). Design and Implementation of High Availability Distributed System Based on Multi-level Heartbeat Protocol. 2009 IITA International Conference on Control, Automation and Systems Engineering (case 2009). China: ISBN 978-0-7695-3728-3.