Shard (database architecture)
A database shard, or simply a shard, is a horizontal partition of data in a database or search engine.
Some data within a database remains present in all shards,[a] but some appears only in a single shard. Each shard (or server) acts as the single source for this subset of data.[1]
Database architecture
Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
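As a minimal sketch of the idea, the following Python snippet splits the rows of a small in-memory table into partitions by one column. The table contents, the `region` column, and the `partition_rows` helper are all hypothetical and chosen only for illustration; a real system would partition inside the database engine.

```python
from collections import defaultdict

# Hypothetical sample table: each dict is one row.
CUSTOMERS = [
    {"id": 1, "name": "Ada", "region": "EU"},
    {"id": 2, "name": "Bo", "region": "US"},
    {"id": 3, "name": "Chen", "region": "APAC"},
    {"id": 4, "name": "Dee", "region": "EU"},
]


def partition_rows(rows, key):
    """Split rows horizontally: each row keeps all of its columns and
    is placed in the partition named by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


partitions = partition_rows(CUSTOMERS, "region")
for region, rows in sorted(partitions.items()):
    print(region, [row["id"] for row in rows])
```

Every partition has the same columns as the original table; only the set of rows differs, which is what distinguishes this from a column-wise (vertical) split.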
There are numerous advantages to the horizontal partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces
In practice, sharding is complex. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately. Consistent hashing is a technique used in sharding to spread large loads across multiple smaller services and servers.[3]
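Consistent hashing can be sketched as a ring of hash positions, where each key belongs to the first shard position at or after its own hash. The class below is an illustrative assumption, not the method of any particular database: the `ConsistentHashRing` name, the use of MD5, the 100 virtual nodes per shard, and the shard names are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a ring position using MD5 (stable across runs)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per shard, to even out load
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> shard name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("shard-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Keys only ever move to the newly added shard; the rest stay put.
assert all(after[k] == "shard-d" for k in moved)
print(f"{len(moved)} of {len(keys)} keys moved")
```

The property shown in the final lines is the reason the technique suits sharding: adding a shard relocates only the keys the new shard takes over, whereas a plain `hash(key) % n` scheme would reassign most keys when `n` changes.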
Where
Compared to horizontal partitioning
Sharding goes beyond this. It partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required multiple instances to be queried, just to retrieve a simple
This is also why sharding is related to a shared-nothing architecture: once sharded, each shard can live in a totally separate logical schema instance, physical database server, data center, or continent. There is no ongoing need for a shard to retain shared access to the unpartitioned tables in other shards.[citation needed]
This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.[citation needed]
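The arrangement described in this section can be sketched with two in-memory SQLite databases standing in for separate servers. Each holds its own slice of a large partitioned table plus a full copy of a small unpartitioned table, so a lookup for one customer is answered by a single instance. The `orders` and `countries` tables, the sample rows, and the modulo routing rule are hypothetical and serve only as an illustration.

```python
import sqlite3

NUM_SHARDS = 2
# Small unpartitioned table, copied in full to every shard.
COUNTRIES = [("DE", "Germany"), ("JP", "Japan")]
# Large partitioned table: (order_id, customer_id, country_code).
ORDERS = [(1, 101, "DE"), (2, 102, "JP"), (3, 103, "DE"), (4, 104, "JP")]


def shard_for(customer_id: int) -> int:
    """Hypothetical routing rule: customer id modulo the shard count."""
    return customer_id % NUM_SHARDS


def build_shards():
    shards = []
    for _ in range(NUM_SHARDS):
        conn = sqlite3.connect(":memory:")  # stand-in for a separate server
        conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
        conn.execute(
            "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, country_code TEXT)"
        )
        conn.executemany("INSERT INTO countries VALUES (?, ?)", COUNTRIES)
        shards.append(conn)
    for order in ORDERS:
        shards[shard_for(order[1])].execute(
            "INSERT INTO orders VALUES (?, ?, ?)", order
        )
    return shards


def orders_for_customer(shards, customer_id):
    """Answer the query from one shard only; the join is local to it."""
    conn = shards[shard_for(customer_id)]
    return conn.execute(
        "SELECT o.order_id, c.name FROM orders AS o "
        "JOIN countries AS c ON c.code = o.country_code "
        "WHERE o.customer_id = ? ORDER BY o.order_id",
        (customer_id,),
    ).fetchall()


shards = build_shards()
print(orders_for_customer(shards, 101))
```

Because `countries` exists on every instance, the join never crosses a shard boundary. The cost is that any change to `countries` has to be propagated to all shards.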
There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically
Implementations
- Altibase provides a combined (client-side and server-side) sharding architecture that is transparent to client applications.
- Apache HBase can shard automatically.[6]
- Azure SQL Database Elastic Database tools shard to scale the data tier of an application out and in.[7]
- ClickHouse, a fast open-source OLAP database management system, shards.
- Couchbase shards automatically and transparently.
- CUBRID shards since version 9.0.
- Db2 Data Partitioning Feature (MPP), a shared-nothing architecture in which database partitions run on separate nodes.
- DRDS (Distributed Relational Database Service) of Alibaba Cloud does database/table sharding,[8] and supports Singles' Day.[9]
- Elasticsearch enterprise search server shards.[10]
- IBM Informix shards since version 12.1 xC1 as part of the MACH11 technology. Informix 12.10 xC2 added full compatibility with MongoDB drivers, allowing the mix of regular relational tables with NoSQL collections, while still allowing sharding, fail-over and ACID properties.[14][15]
- Kdb+ shards since version 2.0.
- MariaDB Spider, a storage engine that supports table federation, table sharding, XA transactions, and ODBC data sources. The MariaDB Spider engine has been bundled with MariaDB Server since version 10.0.4.[16]
- MonetDB, an open-source column-store, does read-only sharding in its July 2015 release.[17]
- MongoDB shards since version 1.6.
- MySQL Cluster automatically and transparently shards across low-cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application.[18]
- MySQL Fabric (part of MySQL utilities) shards.[19]
- Oracle Database shards since 12c Release 2, combining the advantages of sharding with the established capabilities of the enterprise-ready, multi-model Oracle Database.[20]
- Oracle NoSQL Database has automatic sharding and elastic, online expansion of the cluster (adding more shards).
- OrientDB shards since version 1.7
- Solr enterprise search server shards.[21]
- ScyllaDB runs sharded on each core in a server, across all the servers in a cluster
- Spanner, Google's global-scale distributed database, shards across multiple Paxos state machines to scale to "millions of machines across hundreds of data centers and trillions of database rows".[22]
- SQLAlchemy ORM, a data mapper for the Python programming language, shards.[23]
- SQL Server, since SQL Server 2005, shards with the help of third-party tools.[24]
- Teradata markets a massively parallel database management system as a "data warehouse".
- Vault, a cryptocurrency, shards to drastically reduce the data that users need to join the network and verify transactions, which allows the network to scale further.[25]
- Vitess, an open-source database clustering system, shards MySQL. It is a Cloud Native Computing Foundation project.[26]
- ShardingSphere, an Apache Software Foundation (ASF) project.[27]
Disadvantages
Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate.[according to whom?] The complexity that sharding introduces can cause the following problems:[citation needed]
- SQL complexity - More bugs, because developers have to write more complicated SQL to handle sharding logic.
- Additional software - The software that partitions, balances, coordinates, and ensures integrity can fail.
- Single point of failure - Corruption of one shard due to network, hardware, or systems problems causes failure of the entire table.
- Failover server complexity - Failover servers must have copies of the fleets of database shards.
- Backup complexity - Database backups of the individual shards must be coordinated with the backups of the other shards.
- Operational complexity - Adding or removing indexes, adding or deleting columns, and modifying the schema become much more difficult.
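The SQL complexity item can be illustrated with a query that would be a single statement on an unsharded table. In the sketch below, two in-memory SQLite databases stand in for shards, and the `orders` table, its rows, and the `total_and_average` helper are hypothetical. An average over all rows has to be scattered to every shard and merged in application code.

```python
import sqlite3


def make_shard(rows):
    """Create one in-memory database standing in for a shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


shards = [
    make_shard([(1, 50), (3, 20)]),
    make_shard([(2, 70), (4, 10)]),
]


def total_and_average(shards):
    """Scatter the query to every shard, then merge the partial results.

    Per-shard averages cannot simply be averaged again, so each shard
    returns SUM and COUNT and the division happens here.
    """
    total = 0
    count = 0
    for conn in shards:
        part_sum, part_count = conn.execute(
            "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM orders"
        ).fetchone()
        total += part_sum
        count += part_count
    return total, (total / count if count else None)


print(total_and_average(shards))
```

On a single database the same result is one `SELECT SUM(amount), AVG(amount) FROM orders`; the merge step is extra logic that must be written and tested for each kind of cross-shard query.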
Etymology
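In a database context, the term "shard" is most likely derived from one of two sources: Computer Corporation of America's "A System for Highly Available Replicated Data" (SHARD), or the video game Ultima Online.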
Today, the term "shard" refers to the deployment and use of redundant hardware across database systems.[citation needed]
Notes
- dimension tables
References
- ISBN 978-0321826626.
- Rahul Roy (July 28, 2008). "Shard - A Database Design".
- Ries, Eric. "Sharding for Startups".
- S2CID 204749727.
- S2CID 204749727.
- "Apache HBase – Apache HBase™ Home". hbase.apache.org.
- "Introducing Elastic Scale preview for Azure SQL Database". azure.microsoft.com. 2 October 2014.
- "Alibaba Cloud Help Center - Cloud Definition and Explanation of Cloud Based Services - Alibaba Cloud". www.alibabacloud.com.
- "Focuses on Large-Scale Online Databases - Alibaba Cloud". www.alibabacloud.com.
- "Index Shard Allocation | Elasticsearch Guide [7.13] | Elastic". www.elastic.co.
- "IBM Docs".
- "Hibernate Shards". 2007-02-08.
- "Hibernate Shards". Archived from the original on 2008-12-16. Retrieved 2011-03-30.
- "New Grid queries for Informix".
- "NoSQL support in Informix (JSON storage, Mongo DB API)". September 24, 2013.
- "Spider". MariaDB KnowledgeBase. Retrieved 2022-12-20.
- "MonetDB July2015 Released". 31 August 2015.
- "MySQL Cluster Features & Benefits". 2012-11-23.
- "MySQL Fabric sharding quick start guide".
- "Oracle Sharding". Oracle. 2018-05-24. Retrieved 2021-07-10.
- "DistributedSearch - SOLR - Apache Software Foundation". cwiki.apache.org.
- Corbett, James C; Dean, Jeffrey; Epstein, Michael; Fikes, Andrew; Frost, Christopher; Furman, JJ; Ghemawat, Sanjay; Gubarev, Andrey; Heiser, Christopher; Hochschild, Peter; Hsieh, Wilson; Kanthak, Sebastian; Kogan, Eugene; Li, Hongyi; Lloyd, Alexander; Melnik, Sergey; Mwaura, David; Nagle, David; Quinlan, Sean; Rao, Rajesh; Rolig, Lindsay; Saito, Yasushi; Szymaniak, Michal; Taylor, Christopher; Wang, Ruth; Woodford, Dale. "Spanner: Google's Globally-Distributed Database" (PDF). Proceedings of OSDI 2012. Retrieved 24 February 2014.
- "sqlalchemy/sqlalchemy". July 9, 2021 – via GitHub.
- "Partitioning and Sharding Options for SQL Server and SQL Azure". infoq.com.
- "A faster, more efficient cryptocurrency". MIT News. 24 January 2019. Retrieved 2019-01-30.
- "Vitess". vitess.io.
- "ShardingSphere". shardingsphere.apache.org.
- Sarin, DeWitt & Rosenberg, Overview of SHARD: A System for Highly Available Replicated Data, Technical Report CCA-88-01, Computer Corporation of America, May 1988
- Koster, Raph (2009-01-08). "Database 'sharding' came from UO?". Raph Koster's Website. Retrieved 2015-01-17.
- "Ultima Online: The Virtual Ecology | War Stories". Ars Technica Videos.