RCFile
RCFile (Record Columnar File) is a data structure that determines how to store relational tables on computer clusters. It was originally designed in 2010 for systems using the MapReduce framework. Since then, RCFile has become a standard storage format for both conventional databases and distributed databases on clusters.
The basic data structure of a relational database is a two-dimensional table organized in rows and columns. There are two basic formats for storing such tables in computer systems: row-store and column-store. Row-store lays a table out as a sequence of rows (records) and has two advantages: (1) it is easy to add or update rows, the basic operations that modify, grow, or shrink a table; and (2) all the columns of a row are stored together, which is desirable when an application accesses complete records. A major disadvantage of the row-store format is its I/O inefficiency: all the columns of a row must be read even when a query uses only a few of them, which is especially wasteful for very wide tables. Another disadvantage is that row-based data compression is usually less effective than column-based compression. These two disadvantages are addressed by the column-store format, which lays a table out as a sequence of columns and has two advantages: (1) only the columns required by a query are read from storage; and (2) a high compression ratio can be achieved because the values in a column share the same data type. The column-store format has two disadvantages of its own. First, if the result of a query requires operations across multiple columns that are stored in different tracks of a disk, or worse, on different nodes connected by a network, significant delays can arise from random disk accesses and/or remote network accesses. Second, it does not suit write-intensive workloads where rows are frequently added, deleted, or updated.
As data volumes grow, tables have to be partitioned and placed across many computing nodes in a cluster, and neither row-store nor column-store alone is sufficiently efficient for processing such distributed tables. RCFile is designed to address these concerns by retaining the merits of both row-store and column-store. Its structure includes a hybrid data storage format containing both rows and columns, data compression mechanisms within columns, and several optimization techniques for fast data access. It is able to meet all four requirements of a data storage format: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to dynamic data access patterns.
RCFile is the result of research and collaborative efforts from Facebook, Ohio State University, and the Institute of Computing Technology at the Chinese Academy of Sciences. A research paper on RCFile was published in 2011.[1] The data placement structure and its implementation presented in the paper were widely adopted in the open source software community, big data analytics industries, and data processing application areas.
Summary
Data storage format
For example, suppose a table in a database consists of four columns (c1 to c4):
| c1 | c2 | c3 | c4 |
|----|----|----|----|
| 11 | 12 | 13 | 14 |
| 21 | 22 | 23 | 24 |
| 31 | 32 | 33 | 34 |
| 41 | 42 | 43 | 44 |
| 51 | 52 | 53 | 54 |
To serialize the table, RCFile partitions it first horizontally and then vertically, instead of partitioning it only horizontally as a row-oriented DBMS (row-store) does. The horizontal partitioning first splits the table into multiple row groups based on the row-group size, a user-specified value that determines the size of each row group. For example, the table above can be partitioned into two row groups if the user specifies three rows as the size of each row group:
Row group 1:

| c1 | c2 | c3 | c4 |
|----|----|----|----|
| 11 | 12 | 13 | 14 |
| 21 | 22 | 23 | 24 |
| 31 | 32 | 33 | 34 |

Row group 2:

| c1 | c2 | c3 | c4 |
|----|----|----|----|
| 41 | 42 | 43 | 44 |
| 51 | 52 | 53 | 54 |
Then, within every row group, RCFile partitions the data vertically, as a column-store does. Thus, the table is serialized as:
Row group 1: 11, 21, 31; 12, 22, 32; 13, 23, 33; 14, 24, 34;
Row group 2: 41, 51; 42, 52; 43, 53; 44, 54;
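This layout can be sketched in a few lines of Python. The snippet below is a minimal illustration of the partitioning scheme only, not the actual RCFile on-disk format:

```python
# A minimal sketch of RCFile's horizontal-then-vertical partitioning.
table = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
]

ROW_GROUP_SIZE = 3  # user-specified number of rows per row group

def to_row_groups(rows, group_size):
    """Horizontal partitioning: split the table into row groups."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]

def columnar(row_group):
    """Vertical partitioning: lay out each column of a row group contiguously."""
    return [list(column) for column in zip(*row_group)]

for n, group in enumerate(to_row_groups(table, ROW_GROUP_SIZE), start=1):
    print(f"Row group {n}: {columnar(group)}")
# Row group 1: [[11, 21, 31], [12, 22, 32], [13, 23, 33], [14, 24, 34]]
# Row group 2: [[41, 51], [42, 52], [43, 53], [44, 54]]
```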
Column data compression
Within each row group, columns are compressed to reduce storage space usage. Since the data of a column are stored adjacently, patterns in the column can be detected and a suitable compression algorithm can be selected to achieve a high compression ratio.
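As an illustration, with zlib standing in for the Hadoop compression codecs an RCFile writer would actually use, each column of a row group can be compressed and decompressed independently:

```python
import zlib

# Toy per-column compression within one row group. RCFile actually uses
# Hadoop compression codecs; zlib stands in for them here.
row_group_columns = {
    "c1": [11, 21, 31],
    "c2": [12, 22, 32],
    "c3": [13, 23, 33],
    "c4": [14, 24, 34],
}

compressed = {
    name: zlib.compress(",".join(map(str, values)).encode())
    for name, values in row_group_columns.items()
}

# A query touching only c1 decompresses only that column's buffer.
c1 = zlib.decompress(compressed["c1"]).decode().split(",")
print(c1)  # ['11', '21', '31']
```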
Performance Benefits
Column-store is more efficient when a query requires only a subset of columns, because a column-store reads only the necessary columns from disk, whereas a row-store reads entire rows.
RCFile combines the merits of row-store and column-store via horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row on a single machine and thus eliminates the extra network costs of constructing a row. With vertical partitioning, RCFile reads only the columns a query needs from disk and thus avoids unnecessary local I/O costs. Moreover, within every row group, data can be compressed using the compression algorithms used in column-stores.
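A small sketch of the read path under this layout, continuing the toy representation above: a query that projects a subset of columns touches only those column buffers in each row group, and every row can be reassembled locally because a row group holds all columns of its rows:

```python
# Row groups as produced by the partitioning sketch above.
row_groups = [
    {"c1": [11, 21, 31], "c2": [12, 22, 32], "c3": [13, 23, 33], "c4": [14, 24, 34]},
    {"c1": [41, 51], "c2": [42, 52], "c3": [43, 53], "c4": [44, 54]},
]

def project(groups, wanted):
    """Read only the requested columns; the other columns are never touched."""
    for group in groups:
        # All columns of these rows live in the same row group (same machine),
        # so reassembling a row needs no network traffic.
        yield from zip(*(group[c] for c in wanted))

print(list(project(row_groups, ["c1", "c4"])))
# [(11, 14), (21, 24), (31, 34), (41, 44), (51, 54)]
```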
For example, a database might have this table:
| EmpId | Lastname | Firstname | Salary |
|-------|----------|-----------|--------|
| 10 | Smith | Joe | 40000 |
| 12 | Jones | Mary | 50000 |
| 11 | Johnson | Cathy | 44000 |
| 22 | Jones | Bob | 55000 |
This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname), and a salary (Salary). The two-dimensional format exists only in theory; in practice, storage hardware requires the data to be serialized into one form or another.
In MapReduce-based systems, data is normally stored on a distributed file system, such as the Hadoop Distributed File System (HDFS), where different data blocks might be stored on different machines. Thus, for a column-store on MapReduce, different groups of columns might reside on different machines, which introduces extra network costs when a query projects columns placed on different machines. For MapReduce-based systems, the merit of row-store is that there are no extra network costs to construct a row during query processing, and the merit of column-store is that there are no unnecessary local I/O costs when reading data from disk.
Row-oriented systems
The common solution to the storage problem is to serialize each row of data, like this:
001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;
Row-based systems are designed to efficiently return data for an entire row, or an entire record, in as few operations as possible. This matches use cases where the system needs to retrieve all the information about a particular object, say the full information about one contact in a rolodex system, or the complete information about one product in an online shopping system.
Row-based systems are not efficient at performing operations that apply to the entire data set, as opposed to a specific record. For instance, to find all the records in the example table with salaries between 40,000 and 50,000, the row-based system would have to seek through the entire data set looking for matching records. While the example table shown above may fit in a single disk block, a table with even a few hundred rows would not; therefore, multiple disk operations would be needed to retrieve the data.
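A toy row-store over the example table makes this cost concrete: the salary filter must read and parse every record even though it inspects only one field. The sketch below is illustrative, not any particular system's format:

```python
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

# Serialize row by row, as a row-oriented system would.
serialized = ";".join(",".join(map(str, row)) for row in rows)

# The range query must deserialize every record to inspect Salary.
matches = [
    record
    for record in (r.split(",") for r in serialized.split(";"))
    if 40000 <= int(record[3]) <= 50000
]
print(matches)
# [['10', 'Smith', 'Joe', '40000'], ['12', 'Jones', 'Mary', '50000'],
#  ['11', 'Johnson', 'Cathy', '44000']]
```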
Column-oriented systems
A column-oriented system serializes all of the values of a column together, then the values of the next column. For our example table, the data would be stored in this fashion:
10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
The difference can be more clearly seen in this common modification:
...;Smith:001,Jones:002,004,Johnson:003;...
Two of the records store the same value, "Jones", therefore it is now possible to store this in the column-oriented system only once instead of twice. For many common searches, like "find all the people with the last name Jones", the answer can now be retrieved in a single operation.
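The same table in a toy column-store shows why such a search is cheap: the last-name query scans a single contiguous column and uses the matching record positions to fetch identifiers from another column:

```python
# Each column is one contiguous list; position i across the lists forms record i.
columns = {
    "EmpId":     [10, 12, 11, 22],
    "Lastname":  ["Smith", "Jones", "Johnson", "Jones"],
    "Firstname": ["Joe", "Mary", "Cathy", "Bob"],
    "Salary":    [40000, 50000, 44000, 55000],
}

# "Find all the people with the last name Jones": scan one column only.
positions = [i for i, name in enumerate(columns["Lastname"]) if name == "Jones"]
print([columns["EmpId"][i] for i in positions])  # [12, 22]
```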
Whether or not a column-oriented system will be more efficient in operation depends heavily on the operations being automated. Operations that retrieve data for whole objects would be slower, requiring numerous disk operations to assemble data from different columns into a complete record. However, such whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, collecting the first and last names from many rows to build a list of contacts is far more common than reading all the data for a single home address.
Optimized RCFile (ORC)
RCFile has been optimized to improve its performance in several ways, resulting in the ORC format:

- Row reordering. ORC gives users an opportunity to write rows in a specific order for performance optimization.
- Table partitioning. ORC uses a large default size for the row group (called a "stripe" in ORC), which improves data-reading performance. ORC does not group columns.
- Data packing. Although the default stripe size is large, ORC still stores each stripe contiguously in a single file for I/O efficiency.
- Auxiliary data. ORC provides a set of indexes for fast data search, for example to directly access a specific row by its row number. ORC also records the locations of stripes, so that the starting point of each stripe can be found quickly, and it keeps statistics for each column, such as the minimum and maximum values, so that a range query can be processed quickly by skipping a large amount of unnecessary data.
- Efficient compression. ORC compresses data at two levels: it first automatically applies type-specific encoding methods to columns of different data types, and then an optional codec compresses the encoded data streams.
The ORC work is documented in a 2013 paper.[2]
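As a brief illustration of the reader side, the sketch below uses Apache Arrow's pyarrow.orc module; the file name and schema are hypothetical:

```python
import pyarrow.orc as orc

# Open a hypothetical ORC file and inspect its stripe layout. The file
# footer records stripe locations, so each stripe's start is found quickly.
reader = orc.ORCFile("employees.orc")
print(reader.nstripes)

# Project two columns; ORC's columnar layout means the other columns'
# streams are not read, and per-column min/max statistics let readers
# skip stripes that cannot satisfy a range predicate.
table = reader.read(columns=["EmpId", "Salary"])
print(table)
```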
Adoption
RCFile has been adopted in real-world systems for big data analytics.
- RCFile became the default data placement structure in Facebook's production Hadoop cluster.[3] By 2010 it was the world's largest Hadoop cluster,[4] to which 40 terabytes of compressed data were added every day.[5] In addition, the data sets stored in HDFS before RCFile was introduced have also been transformed to use RCFile.[3]
- RCFile has been adopted in Apache Hive (since v0.4),[6] an open source data warehouse system running on top of Hadoop that is widely used in various companies around the world,[7] including several Internet services such as Facebook, Taobao, and Netflix.[8]
- RCFile has been adopted in Apache Pig (since v0.7),[9] another open source data processing system widely used in many organizations,[10] including several major Web service providers such as Twitter, Yahoo, LinkedIn, AOL, and Salesforce.com.
- RCFile became the de facto standard data storage structure in the Hadoop software environment, supported by the Apache HCatalog project (formerly known as Howl[11]), the table and storage management service for Hadoop.[12] RCFile is also supported by the open source Elephant Bird library used at Twitter for daily data analytics.[13]
Over the following years, other Hadoop data formats also became popular. In February 2013, an Optimized Row Columnar (ORC) file format was announced by Hortonworks.[14] A month later, the Apache Parquet format was announced, developed by Cloudera and Twitter.[15]
References
- ^ "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems". Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE). April 11, 2011. pp. 1199–1208. doi:10.1109/ICDE.2011.5767933. ISBN 978-1-4244-8959-6. Retrieved May 4, 2017.
- ^ Y. Huai, S. Ma, R. Lee, O. O'Malley, X. Zhang, "Understanding insights into the basic structure and essential issues of table placement methods in clusters", Proceedings of the VLDB Endowment, Vol. 6, No. 14, 2013. [1]
- ^ a b "Hive integration: HBase and Rcfile__HadoopSummit2010". 2010-06-30.
- ^ "Facebook has the world's largest Hadoop cluster!". 2010-05-09.
- ^ "Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain". 2011-02-24.
- ^ "Class RCFile". Archived from the original on 2011-11-23. Retrieved 2012-07-21.
- ^ "PoweredBy - Apache Hive - Apache Software Foundation".
- ^ "Hive user group presentation from Netflix (3/18/2010)". 2010-03-19.
- ^ "HiveRCInputFormat (Pig 0.17.0 API)".
- ^ "PoweredBy - Apache Pig - Apache Software Foundation".
- ^ Howl
- ^ "HCatalog". Archived from the original on 2012-07-20. Retrieved 2012-07-21.
- ^ "Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.: Kevinweil/elephant-bird". 2018-12-15.
- ^ Alan Gates (February 20, 2013). "The Stinger Initiative: Making Apache Hive 100 Times Faster". Hortonworks blog. Retrieved May 4, 2017.
- ^ Justin Kestelyn (March 13, 2013). "Introducing Parquet: Efficient Columnar Storage for Apache Hadoop". Cloudera blog. Retrieved May 4, 2017.