Hive format for storing and reading files

Posted by MatrixGL on Fri, 06 Sep 2019 04:28:02 +0200

Hive files are stored in the following formats:

  1. TEXTFILE
  2. SEQUENCEFILE
  3. RCFILE
  4. ORCFILE (since 0.11)

TEXTFILE is the default format and is used when a table is created without an explicit STORED AS clause. When data is loaded, the data files are simply copied to HDFS without any processing.

Tables stored as SEQUENCEFILE, RCFILE, or ORCFILE cannot load data directly from local files. The data must first be loaded into a TEXTFILE table and then copied into the SEQUENCEFILE, RCFILE, or ORCFILE table with an INSERT ... SELECT from that table.

Create a textfile_table table in TEXTFILE format:

create table if not exists textfile_table(
    site string,
    url string,
    pv bigint,
    label string)
row format delimited fields terminated by '\t'
stored as textfile;

load data local inpath '/app/weibo.txt' overwrite into table textfile_table;

1. TEXTFILE

The default format. Data is stored uncompressed, so disk overhead and data parsing overhead are high. It can be combined with Gzip or Bzip2 compression (Hive detects the codec automatically and decompresses the files while executing queries).

With such whole-file compression, however, Hive cannot split the data, so the data cannot be processed in parallel.
Example:

-- enable compressed query output
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
-- rewrite the table so its data files are stored Gzip-compressed
insert overwrite table textfile_table select * from textfile_table;


2. SEQUENCEFILE

SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible.
SequenceFile supports three compression types: NONE, RECORD, and BLOCK. RECORD compresses each record individually and yields a low compression ratio; BLOCK compresses batches of records and is generally recommended.
Example:

-- enable compressed query output
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
-- compress the SequenceFile at BLOCK rather than RECORD granularity
set mapred.output.compression.type=BLOCK;
insert overwrite table seqfile_table select * from textfile_table;
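
The seqfile_table targeted above must already exist. A minimal sketch of its definition, assuming the same schema as textfile_table:

create table if not exists seqfile_table(
    site string,
    url string,
    pv bigint,
    label string)
row format delimited fields terminated by '\t'
stored as sequencefile;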


3. RCFILE

RCFILE combines row and column storage. First, it partitions the data into row groups, which guarantees that all columns of a record sit in the same block, so reading one record never requires reading multiple blocks. Second, within each row group the data is stored column by column, which aids compression and allows fast access to individual columns.
Example:

-- enable compressed query output
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table rcfile_table select * from textfile_table;
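
As with seqfile_table, the rcfile_table above is assumed to exist with the same schema; a minimal sketch:

create table if not exists rcfile_table(
    site string,
    url string,
    pv bigint,
    label string)
row format delimited fields terminated by '\t'
stored as rcfile;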


4. ORCFILE

ORCFILE is an improved version of RCFILE and is more efficient. Data is again divided into row groups, and each group is stored by column; it compresses quickly and allows fast column access.
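
No example is given above for ORCFILE; a minimal sketch along the same lines as the previous formats, assuming an orcfile_table with the same schema (STORED AS ORC requires Hive 0.11 or later):

create table if not exists orcfile_table(
    site string,
    url string,
    pv bigint,
    label string)
row format delimited fields terminated by '\t'
stored as orc;

insert overwrite table orcfile_table select * from textfile_table;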


Summary

Compared with TEXTFILE and SEQUENCEFILE, RCFILE pays a higher cost when loading data because of its columnar layout, but it achieves a better compression ratio and faster query response. A data warehouse workload is typically write-once, read-many, so overall RCFILE has a clear advantage over the other two formats.
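
One way to see the trade-off is to compare the on-disk size of each table from the Hive CLI. The paths below assume the default warehouse location, which may differ on your cluster:

-- show the total size of each table's data files
dfs -du -h /user/hive/warehouse/textfile_table;
dfs -du -h /user/hive/warehouse/seqfile_table;
dfs -du -h /user/hive/warehouse/rcfile_table;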

Topics: Hadoop Apache hive codec