Column-oriented DBMS
A column-oriented DBMS or columnar DBMS is a
Practical use of a column store versus a row store differs little in the
Description
Background
A relational database management system provides data that represents a two-dimensional table of columns and rows. For example, a database might have this table:
RowId | EmpId | Lastname | Firstname | Salary |
---|---|---|---|---|
001 | 10 | Smith | Joe | 60000 |
002 | 12 | Jones | Mary | 80000 |
003 | 11 | Johnson | Cathy | 94000 |
004 | 22 | Jones | Bob | 55000 |
This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary). This two-dimensional format is an abstraction. In an actual implementation, storage hardware requires the data to be serialized into one form or another.
The most expensive operations involving
A survey by Pinnecke et al.[1] covers techniques for column-/row hybridization as of 2017.
Row-oriented systems
A common method of storing a table is to serialize each row of data, like this:
001:10,Smith,Joe,60000; 002:12,Jones,Mary,80000; 003:11,Johnson,Cathy,94000; 004:22,Jones,Bob,55000;
As data is inserted into the table, it is assigned an internal ID, the rowid
that is used internally in the system to refer to data. In this case the records have sequential rowid
s independent of the user-assigned empid
. In this example, the DBMS uses short integers to store rowid
s. In practice, larger numbers, 64-bit or 128-bit, are normally used.
Row-oriented systems are designed to efficiently return data for an entire row, or record, in as few operations as possible. This matches the common use-case where the system is attempting to retrieve information about a particular object, say the contact information for a user in a rolodex system, or product information for an online shopping system. By storing the record's data in a single block on the disk, along with related records, the system can quickly retrieve records with a minimum of disk operations.
Row-oriented systems are not efficient at performing set-wide operations on the whole table, as opposed to a small number of specific records. For instance, in order to find all records in the example table with salaries between 40,000 and 50,000, the DBMS would have to fully scan through the entire table looking for matching records. While the example table shown above will likely fit in a single disk block, a table with even a few hundred rows would not, and multiple disk operations would be needed to retrieve the data and examine it.
To improve the performance of these sorts of operations (which are very common, and generally the point of using a DBMS), most DBMSs support the use of database indexes, which store all the values from a set of columns along with rowid
pointers back into the original table. An index on the salary column would look something like this:
55000:004; 60000:001; 80000:002; 94000:003;
As they store only single pieces of data, rather than entire rows, indexes are generally much smaller than the main table stores. Scanning this smaller set of data reduces the number of disk operations. If the index is heavily used, it can dramatically reduce the time for common operations. However, maintaining indexes adds overhead to the system, especially when new data is written to the database. Records not only need to be stored in the main table, but any attached indexes have to be updated as well.
The main reason why indexes dramatically improve performance on large datasets is that database indexes on one or more columns are typically sorted by value, which makes range queries operations (like the above "find all records with salaries between 40,000 and 50,000" example) very fast (lower time-complexity).
A number of row-oriented databases are designed to fit entirely in
Column-oriented systems
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:
001:10,002:12,003:11,004:22; 001:Smith,002:Jones,003:Johnson,004:Jones, 001:Joe,002:Mary,003:Cathy,004:Bob; 001:60000,002:80000,003:94000,004:55000;
In this layout, any one of the columns more closely matches the structure of an index in a row-oriented system. This may cause confusion that can lead to the mistaken belief a column-oriented store "is really just" a row-store with an index on every column. However, it is the mapping of the data that differs dramatically. In a row-oriented system, indices map column values to rowids, whereas in a column-oriented system, columns map rowids to column values.[2] This may seem subtle, but the difference can be seen in this common modification to the same store wherein the two "Jones" items, above, are compressed into a single item with two rowids:
…;Smith:001;Jones:002,004;Johnson:003;…
Whether or not a column-oriented system will be more efficient in operation depends heavily on the workload being automated. Operations that retrieve all the data for a given object (the entire row) are slower. A row-oriented system can retrieve the row in a single disk read, whereas numerous disk operations to collect data from multiple columns are required from a columnar database. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, collecting the first and last names from many rows to build a list of contacts is far more common than reading all data for any single address. This is even more true for writing data into the database, especially if the data tends to be "sparse" with many optional columns. For this reason, column stores have demonstrated excellent real-world performance in spite of many theoretical disadvantages.[3]
Benefits
Access time
Comparisons between row-oriented and column-oriented databases are typically concerned with the efficiency of hard-disk access for a given workload, as
In practice, columnar databases are well-suited for
Compression
Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data. For example, many popular modern compression schemes, such as
To improve compression, sorting rows can also help. For example, using
Columnar compression achieves a reduction in disk space at the expense of efficiency of retrieval. The greater adjacent compression achieved, the more difficult random-access may become, as data might need to be uncompressed to be read. Therefore, column-oriented architectures are sometimes enriched by additional mechanisms aimed at minimizing the need for access to compressed data.[13]
History
Column stores or transposed files have been implemented from the early days of DBMS development. TAXIR was the first application of a column-oriented database storage system with focus on information-retrieval in biology[14] in 1969. Clinical data from patient records with many more attributes than could be analyzed were processed in 1975 and after by a time-oriented database system (TODS).[8] Statistics Canada implemented the RAPID system[15] in 1976 and used it for processing and retrieval of the Canadian Census of Population and Housing as well as several other statistical applications. RAPID was shared with other statistical organizations throughout the world and used widely in the 1980s. It continued to be used by Statistics Canada until the 1990s.
Another column-oriented database was SCSS.[16][17][18]
Later column-oriented database packages included:
- 1993: KDB
- 1995: Sybase IQ
Since about 2004 there have been additional open source and commercial implementations. MonetDB was released under an open-source license on September 30, 2004,[19] followed closely by the now defunct C-Store.[20]
C-store was a university project that eventually, with team member Michael Stonebraker staying on, led to Vertica, which he co-founded in 2005.[21][22]
The MonetDB-related X100 project evolved into
Classic Relational DBMS can use column-oriented strategies by mixing row-oriented and column-oriented tables. Despite the DBMS complexity, this approach has proven to be valuable from the years 2010 to present as of 2020. For example in 2014 Citusdata introduced column-oriented tables for PostgreSQL[26] and McObject added support for columnar storage with its release of eXtremeDB Financial Edition in 2012[27] which was then used to establish a new standard of performance for the independently audited STAC-M3 benchmark.[28]
See also
- Data warehouse
- List of column-oriented DBMSes
- AOS and SOA
- RCFile
- BigQuery
References
- .
- ^ Daniel Abadi; Samuel Madden (31 July 2008). "Debunking Another Myth: Column-Stores vs. Vertical Partitioning". The Database Column. Archived from the original on December 4, 2008.
- ^ Stavros Harizopoulos; Daniel Abadi; Peter Boncz. "Column-Oriented Database Systems" (PDF). VLDB 2009 Tutorial. p. 5.
- ^ Masiero, Manuel (January 8, 2013). "Western Digital's 4 TB WD4001FAEX Review: Back In Black". Tom's Hardware.
- ^ Levinthal, David (2009). "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors" (PDF). Intel. p. 22. Retrieved 2017-11-10.
- ^ "Compacting Transactional Data in Hybrid OLTP&OLAP Databases" (PDF). Retrieved August 1, 2017.
- ^ "A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database" (PDF). Retrieved August 1, 2017.
- ^ PMID 1157469.
- ^ D. J. Abadi; S. R. Madden; N. Hachem (2008). Column-stores vs. row-stores: how different are they really?. pp. 967–980.
{{cite book}}
:|work=
ignored (help) - ].
- ^ Daniel Lemire, Owen Kaser, Kamel Aouiche, "Sorting improves word-aligned bitmap indexes", Data & Knowledge Engineering, Volume 69, Issue 1 (2010), pp. 3-28.
- ^ Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011
- ^ Dominik Ślęzak; Jakub Wróblewski; Victoria Eastwood; Piotr Synak (2008). Brighthouse: an analytic data warehouse for ad hoc queries (PDF). Proceedings of the 34th VLDB Conference. Auckland, New Zealand. Archived from the original (PDF) on 2016-05-07. Retrieved 2009-05-04.
- .
- ^ "A DBMS for large statistical databases". acm.org. Vldb '79. 1979. pp. 319–327.
- ^ already on the market by September 1977
- ISBN 978-0070465336.
- ^ "SCSS from SPSS, Inc". ComputerWorld. September 26, 1977. p. 28.
- ^ "A short history about us". monetdb.org.
- ^ "C-Store". mit.edu. Archived from the original on 2012-03-05. Retrieved 2008-01-22.
- ^ "The Vertica Analytic Database: C-Store 7 Years Later" (PDF)" (PDF). VLDB.org. August 28, 2012.
- ^ Charles Babcock (February 21, 2008). "Database Pioneer Rethinks The Best Way To Organize Data". InformationWeek. Retrieved 2018-12-08.
- S2CID 9187072.
- S2CID 6372175.
- ^ "Druid". druid.io.
- ^ "Citusdata". github.com.
- ^ Saujani, Sandeep (19 June 2012). "McObject eXtremeDB Financial Edition In-Memory DBMS Breaks Through Capital Markets' Data Management Bottleneck". bobs guide.
- ^ STAC Benchmark Council, Leadership (3 November 2012). "McObject eXtremeDB 5.0 Financial Edition with Kove XPD L2 Storage System, Dell PowerEdge R910 Server and Mellanox ConnectX-2 and MIS5025Q QDR InfiniBand Switch". STAC.
External links
- Distinguishing Two Major Types Of Column-Stores
- VLDB 2009 Tutorial - overview
- Tour Through Hybrid Column-Row Oriented DBMS
- Weaving Relations for Cache Performance - column-oriented block layout
- The Design and Implementation of Modern Column-Oriented Database Systems Archived 2021-04-12 at the Wayback Machine