Replication (computing)
This article includes a list of general references, but it lacks sufficient corresponding inline citations. (October 2012) |
Replication in
Terminology
Replication in computing can refer to:
- Data replication, where the same data is stored on multiple storage devices
- Computation replication, where the same computing task is executed many times. Computational tasks may be:
- Replicated in space, where tasks are executed on separate devices
- Replicated in time, where tasks are executed repeatedly on a single device
Replication in space or in time is often linked to scheduling algorithms.[1]
Access to a replicated entity is typically uniform with access to a single non-replicated entity. The replication itself should be
Computer scientists further describe replication as being either:
- Active replication, which is performed by processing the same request at every replica
- Passive replication, which involves processing every request on a single replica and transferring the result to the other replicas
When one leader replica is designated via
Load balancing differs from task replication, since it distributes a load of different computations across machines, and allows a single computation to be dropped in case of failure. Load balancing, however, sometimes uses data replication (especially multi-master replication) internally, to distribute its data among machines.
Backup differs from replication in that the saved copy of data remains unchanged for a long period of time.[3] Replicas, on the other hand, undergo frequent updates and quickly lose any historical state. Replication is one of the oldest and most important topics in the overall area of distributed systems.
Data replication and computation replication both require processes to handle incoming events. Processes for data replication are passive and operate only to maintain the stored data, reply to read requests and apply updates. Computation replication is usually performed to provide fault-tolerance, and take over an operation if one component fails. In both cases, the underlying needs are to ensure that the replicas see the same events in equivalent orders, so that they stay in consistent states and any replica can respond to queries.
Replication models in distributed systems
Three widely cited models exist for data replication, each having its own properties and performance:
- Transactional replication: used for replicating one-copy serializability model is employed, which defines valid outcomes of a transaction on replicated data in accordance with the overall ACID(atomicity, consistency, isolation, durability) properties that transactional systems seek to guarantee.
- Virtual synchrony: involves a group of processes which cooperate to replicate in-memory data or to coordinate actions. The model defines a distributed entity called a process group. A process can join a group and is provided with a checkpoint containing the current state of the data replicated by group members. Processes can then send multicasts to the group and will see incoming multicasts in the identical order. Membership changes are handled as a special multicast that delivers a new "membership view" to the processes in the group.[6]
Database replication
In
Database replication becomes more complex when it scales up
When data is replicated between database servers, so that the information remains consistent throughout the database system and users cannot tell or even know which server in the DBMS they are using, the system is said to exhibit replication transparency.
However, replication transparency can not always be achieved. When data is replicated in a database, they will be constrained by CAP theorem or PACELC theorem. In the NoSQL movement, data consistency is usually sacrificed in exchange for other more desired properties, such as availability (A), partition tolerance (P), etc. Various data consistency models have also been developed to serve as Service Level Agreement (SLA) between service providers and the users.
Disk storage replication
Active (real-time) storage replication is usually implemented by distributing updates of a
The most basic method is
The main characteristic of such cross-site replication is how write operations are handled, through either asynchronous or synchronous replication; synchronous replication needs to wait for the destination server's response in any write operation whereas asynchronous replication does not.
In asynchronous replication, the write operation is considered complete as soon as local storage acknowledges it. Remote storage is updated with a small lag. Performance is greatly increased, but in case of a local storage failure, the remote storage is not guaranteed to have the current copy of data (the most recent data may be lost).
Semi-synchronous replication typically considers a write operation complete when acknowledged by local storage and received or logged by the remote server. The actual remote write is performed asynchronously, resulting in better performance but remote storage will lag behind the local storage, so that there is no guarantee of durability (i.e., seamless transparency) in the case of local storage failure.[citation needed]
Point-in-time replication produces periodic snapshots which are replicated instead of primary storage. This is intended to replicate only the changed data instead of the entire volume. As less information is replicated using this method, replication can occur over less-expensive bandwidth links such as iSCSI or T1 instead of fiberoptic lines.
Implementations
Many
Many commercial synchronous replication systems do not freeze when the remote replica fails or loses connection – behaviour which guarantees zero data loss – but proceed to operate locally, losing the desired zero
Techniques of wide-area network (WAN) optimization can be applied to address the limits imposed by latency.
File-based replication
File-based replication conducts data replication at the logical level (i.e., individual data files) rather than at the storage block level. There are many different ways of performing this, which almost exclusively rely on software.
Capture with a kernel driver
A
File-level replication solutions allow for informed decisions about replication based on the location and type of the file. For example, temporary files or parts of a filesystem that hold no business value could be excluded. The data transmitted can also be more granular; if an application writes 100 bytes, only the 100 bytes are transmitted instead of a complete disk block (generally 4,096 bytes). This substantially reduces the amount of data sent from the source machine and the storage burden on the destination machine.
Drawbacks of this software-only solution include the requirement for implementation and maintenance on the operating system level, and an increased burden on the machine's processing power.
File system journal replication
Similarly to database transaction logs, many file systems have the ability to journal their activity. The journal can be sent to another machine, either periodically or in real time by streaming. On the replica side, the journal can be used to play back file system modifications.
One of the notable implementations is Microsoft's System Center Data Protection Manager (DPM), released in 2005, which performs periodic updates but does not offer real-time replication.[citation needed]
Batch replication
This is the process of comparing the source and destination file systems and ensuring that the destination matches the source. The key benefit is that such solutions are generally free or inexpensive. The downside is that the process of synchronizing them is quite system-intensive, and consequently this process generally runs infrequently.
One of the notable implementations is rsync.
Replication within file
In a
In
This section needs expansion. You can help by adding to it. (November 2018) |
Another example of using replication appears in distributed shared memory systems, where many nodes of the system share the same page of memory. This usually means that each node has a separate copy (replica) of this page.
Primary-backup and multi-primary replication
Many classical approaches to replication are based on a primary-backup model where one device or process has unilateral control over one or more other processes or devices. For example, the primary might perform some computation, streaming a log of updates to a backup (standby) process, which can then take over if the primary fails. This approach is common for replicating databases, despite the risk that if a portion of the log is lost during a failure, the backup might not be in a state identical to the primary, and transactions could then be lost.
A weakness of primary-backup schemes is that only one is actually performing operations. Fault-tolerance is gained, but the identical backup system doubles the costs. For this reason, starting c. 1985, the distributed systems research community began to explore alternative methods of replicating data. An outgrowth of this work was the emergence of schemes in which a group of replicas could cooperate, with each process acting as a backup while also handling a share of the workload.
Computer scientist Jim Gray analyzed multi-primary replication schemes under the transactional model and published a widely cited paper skeptical of the approach "The Dangers of Replication and a Solution".[10][11] He argued that unless the data splits in some natural way so that the database can be treated as n n disjoint sub-databases, concurrency control conflicts will result in seriously degraded performance and the group of replicas will probably slow as a function of n. Gray suggested that the most common approaches are likely to result in degradation that scales as O(n³). His solution, which is to partition the data, is only viable in situations where data actually has a natural partitioning key.
In the 1985–1987, the
A number of modern products support similar schemes. For example, the Spread Toolkit supports this same virtual synchrony model and can be used to implement a multi-primary replication scheme; it would also be possible to use C-Ensemble or Quicksilver in this manner. WANdisco permits active replication where every node on a network is an exact copy or replica and hence every node on the network is active at one time; this scheme is optimized for use in a wide area network (WAN).
Modern multi-primary replication protocols optimize for the common failure-free operation. Chain replication[12] is a popular family of such protocols. State-of-the-art protocol variants[13] of chain replication offer high throughput and strong consistency by arranging replicas in a chain for writes. This approach enables local reads on all replica nodes but has high latency for writes that must traverse multiple nodes sequentially.
A more recent multi-primary protocol, Hermes,[14] combines cache-coherent-inspired invalidations and logical timestamps to achieve strong consistency with local reads and high-performance writes from all replicas. During fault-free operation, its broadcast-based writes are non-conflicting and commit after just one multicast round-trip to replica nodes. This design results in high throughput and low latency for both reads and writes.
See also
- Change data capture
- Fault-tolerant computer system
- Log shipping
- Multi-master replication
- Optimistic replication
- State machine replication
- Virtual synchrony
References
- ^ Mansouri, Najme, Gholam, Hosein Dastghaibyfard, and Ehsan Mansouri. "Combination of data replication and scheduling algorithm for improving data availability in Data Grids", Journal of Network and Computer Applications (2013)
- ^ V. Andronikou, K. Mamouras, K. Tserpes, D. Kyriazis, T. Varvarigou, "Dynamic QoS-aware Data Replication in Grid Environments", Elsevier Future Generation Computer Systems - The International Journal of Grid Computing and eScience, 2012
- ^ "Backup and Replication: What is the Difference?". Zerto. February 6, 2012.
- ^ Marton Trencseni, Attila Gazso (2009). "Keyspace: A Consistently Replicated, Highly-Available Key-Value Store". Retrieved 2010-04-18.
- ^ Mike Burrows (2006). "The Chubby Lock Service for Loosely-Coupled Distributed Systems". Archived from the original on 2010-02-09. Retrieved 2010-04-18.
- S2CID 7739589.
- ^ "Replication -- Conflict Resolution". ITTIA DB SQL™ User's Guide. ITTIA L.L.C. Archived from the original on 24 November 2018. Retrieved 21 October 2016.
- ^ Dragan Simic; Srecko Ristic; Slobodan Obradovic (April 2007). "Measurement of the Achieved Performance Levels of the WEB Applications With Distributed Relational Database" (PDF). Electronics and Energetics. Facta Universitatis. p. 31–43. Retrieved 30 January 2014.
- ^ Mokadem Riad; Hameurlain Abdelkader (December 2014). "Data Replication Strategies with Performance Objective in Data Grid Systems: A Survey" (PDF). Internal journal of grid and utility computing. Underscience Publisher. p. 30–46. Retrieved 18 December 2014.
- ^ "The Dangers of Replication and a Solution"
- ^ Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data: SIGMOD '99, Philadelphia, PA, US; June 1–3, 1999, Volume 28; p. 3.
- ^ van Renesse, Robbert; Schneider, Fred B. (2004-12-06). "Chain replication for supporting high throughput and availability". Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6. OSDI'04. USA: USENIX Association: 7.
- ^ Terrace, Jeff; Freedman, Michael J. (2009-06-14). "Object storage on CRAQ: high-throughput chain replication for read-mostly workloads". USENIX Annual Technical Conference. USENIX'09. USA: 11.
- S2CID 210921224.