Using the HBase ImportTsv tool

Posted by margarette_a on Wed, 15 Dec 2021 21:19:27 +0100

1. ImportTsv function description

ImportTsv loads text data in TSV format (or CSV; in general, any file where the fields in each row are separated by a delimiter) into an HBase table. It supports two import modes:
1) Load and import with Put operations
2) Bulk load and import via HFiles
Use the following command to view the instructions for the official HBase built-in tool classes:

HADOOP_HOME=/export/servers/hadoop
HBASE_HOME=/export/servers/hbase
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf
${HADOOP_HOME}/bin/yarn jar ${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.14.0.jar

Executing the above command prints the following information:

An example program must be given as the first argument.
Valid program names are:
CellCounter: Count cells in HBase table.
WALPlayer: Replay WAL files.
completebulkload: Complete a bulk data load.
copytable: Export a table from local cluster to peer cluster.
export: Write table data to HDFS.
exportsnapshot: Export the specific snapshot to a given FileSystem.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table.
verifyrep: Compare the data from tables in two different clusters.

importtsv is a tool class for importing data from text files (such as CSV, TSV, and other delimited formats) into HBase tables.
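
Running the tool class with importtsv as the program name and no further arguments prints its usage description, for example:

${HADOOP_HOME}/bin/yarn jar ${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.14.0.jar importtsv

The description is as follows: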

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier.
The special column name HBASE_ROW_KEY is used to designate that this column
should be used as the row key for each imported record.
To instead generate HFiles of data to prepare for a bulk data load, pass
the option:
-Dimporttsv.bulk.output=/path/for/output
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
For performance consider the following options:
 -Dmapreduce.map.speculative=false
 -Dmapreduce.reduce.speculative=false
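
To illustrate the column mapping, suppose a small comma-separated file /datas/users.csv with rows like 1001,zhangsan,25 (the file, table name, and columns here are made-up examples, not part of the data set used below). The first field becomes the row key and the remaining fields go into an info column family:

${HADOOP_HOME}/bin/yarn jar ${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.14.0.jar \
importtsv \
-Dimporttsv.separator=, \
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age \
user_info \
/datas/users.csv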

2. Direct import in Put mode

HADOOP_HOME=/export/servers/hadoop
HBASE_HOME=/export/servers/hbase
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf
${HADOOP_HOME}/bin/yarn jar ${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.14.0.jar \
importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,detail:log_id,detail:remote_ip,detail:site_global_ticket,detail:site_global_session,detail:global_user_id,detail:cookie_text,detail:user_agent,detail:ref_url,detail:loc_url,detail:log_time \
tbl_logs \
/user/hive/warehouse/tags_dat.db/tbl_logs

The above command essentially runs a MapReduce application that converts each line of the text file into a Put object and inserts it into the HBase table.
For review, see: Big data Sqoop imports Mysql data into Hbase with Hive
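
Note that importtsv writes into the target HBase table; if tbl_logs does not exist yet, it can be created from the hbase shell beforehand (depending on the HBase version, the tool may also create it automatically). A minimal sketch for the detail column family used above:

create 'tbl_logs', 'detail'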

3. Convert to HFile files and load them into the table

# 1. Generate HFile files
HADOOP_HOME=/export/servers/hadoop
HBASE_HOME=/export/servers/hbase
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf
${HADOOP_HOME}/bin/yarn jar ${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.14.0.jar \
importtsv \
-Dimporttsv.bulk.output=hdfs://bigdata-cdh01.itcast.cn:8020/datas/output_hfile/tbl_logs \
-Dimporttsv.columns=HBASE_ROW_KEY,detail:log_id,detail:remote_ip,detail:site_global_ticket,detail:site_global_session,detail:global_user_id,detail:cookie_text,detail:user_agent,detail:ref_url,detail:loc_url,detail:log_time \
tbl_logs \
/user/hive/warehouse/tags_dat.db/tbl_logs
# 2. Load the HFile files into the table
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf
${HADOOP_HOME}/bin/yarn jar ${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.14.0.jar \
completebulkload \
hdfs://bigdata-cdh01.itcast.cn:8020/datas/output_hfile/tbl_logs \
tbl_logs
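
After the bulk load finishes, a quick sanity check from the hbase shell might look like this (the LIMIT value is arbitrary):

scan 'tbl_logs', {LIMIT => 3}
count 'tbl_logs'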

Disadvantages:
1) The ROWKEY cannot be a composite primary key; it can only be a single field (a possible workaround is sketched below)
2) When the table has many columns, writing out the -Dimporttsv.columns value is tedious and error prone
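
One workaround for the composite-key limitation is to preprocess the text file, prepending a concatenated key as a new first column, and then map that column to HBASE_ROW_KEY. A minimal awk sketch, assuming a tab-separated file whose first two fields should form the row key (file names and field positions are illustrative):

# Prepend "field1_field2" as a new first column before running importtsv
awk -F'\t' 'BEGIN{OFS="\t"} {print $1"_"$2, $0}' tbl_logs.tsv > tbl_logs_with_key.tsv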

Summary: this tool works well for importing small batches of data, but it is not commonly used in practice.

Topics: Big Data Hadoop HBase