HBase data synchronization tool - HashTable/SyncTable

Posted by vladibo on Wed, 19 Jan 2022 09:15:30 +0100

HashTable/SyncTable is a tool for synchronizing HBase table data. The process has two steps, both of which are MapReduce jobs. Like the CopyTable tool, it can synchronize some or all of a table's data within one cluster or between different clusters. Compared with CopyTable, however, it performs better when synchronizing between different clusters. Instead of copying all table data in a given range, it first runs HashTable on the source cluster to generate hashes over batches of the source table's data. It then runs SyncTable on the target cluster, which computes the same hashes over the target table and compares them with the hashes from the source to find the data that is missing or different. Only that data then needs to be synchronized, which can greatly reduce bandwidth usage and the amount of data transferred.
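The two-step idea can be illustrated with a small Python sketch. This is not HBase's implementation: the in-memory dict tables, the fixed number of rows per batch, the SHA-256 digest and the function names `hash_rows`, `source_hashes` and `differing_ranges` are all assumptions made for illustration.

```python
import hashlib

def hash_rows(table, keys):
    """Digest over the given rows, visiting keys and columns in sorted order."""
    digest = hashlib.sha256()
    for key in keys:
        digest.update(key.encode())
        for column, value in sorted(table[key].items()):
            digest.update(f"{column}={value}".encode())
    return digest.hexdigest()

def source_hashes(source, rows_per_batch):
    """Step 1 (source side): one (start_key, hash) pair per batch of rows."""
    keys = sorted(source)
    batches = []
    # max(..., 1) still emits one (empty) batch when the source has no rows.
    for i in range(0, max(len(keys), 1), rows_per_batch):
        start = "" if i == 0 else keys[i]  # "" stands for the start of the table
        batches.append((start, hash_rows(source, keys[i:i + rows_per_batch])))
    return batches

def differing_ranges(hashes, target):
    """Step 2 (target side): re-hash the same key ranges, report mismatches."""
    mismatched = []
    for index, (start, expected) in enumerate(hashes):
        stop = hashes[index + 1][0] if index + 1 < len(hashes) else None
        keys = [k for k in sorted(target)
                if k >= start and (stop is None or k < stop)]
        if hash_rows(target, keys) != expected:
            mismatched.append((start, stop))
    return mismatched

source = {f"{n:04d}": {"StuInfo:Age": "18"} for n in range(1, 5)}
target = {key: dict(cells) for key, cells in source.items()}
target["0003"]["StuInfo:Age"] = "19"  # one cell out of sync

# Only the key range containing row 0003 needs a cell-by-cell comparison.
print(differing_ranges(source_hashes(source, 2), target))
```

In this toy run only the second key range (starting at row 0003) is reported, so only that range would have to be read from the source and compared cell by cell.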

Step 1: HashTable

First, execute HashTable in the source cluster:

HashTable usage:

Usage: HashTable [options] <tablename> <outputpath>

Options:
 batchsize     the target amount of bytes to hash in each batch
               rows are added to the batch until this size is reached
               (defaults to 8000 bytes)
 numhashfiles  the number of hash files to create
               if set to fewer than number of regions then
               the job will create this number of reducers
               (defaults to 1/100 of regions -- at least 1)
 startrow      the start row
 stoprow       the stop row
 starttime     beginning of the time range (unixtime in millis)
               without endtime means from starttime to forever
 endtime       end of the time range.  Ignored if no starttime specified.
 scanbatch     scanner batch size to support intra row scans
 versions      number of cell versions to include
 families      comma-separated list of families to include

Args:
 tablename     Name of the table to hash
 outputpath    Filesystem path to put the output data

Examples:
 To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
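The starttime and endtime values in the example are Unix times in milliseconds, exactly one hour apart. As a rough sketch, they could be computed and the argument list assembled as follows. The helper names `to_millis` and `hashtable_args` are hypothetical; only the flags listed in the usage text above are used.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_millis(moment):
    """Unix time in milliseconds for a timezone-aware datetime."""
    # Integer division of timedeltas avoids float rounding errors.
    return (moment - EPOCH) // timedelta(milliseconds=1)

def hashtable_args(table, output_path, start, window, **options):
    """Build a HashTable argument list for a time window beginning at start."""
    args = ["bin/hbase", "org.apache.hadoop.hbase.mapreduce.HashTable"]
    args += [f"--{name}={value}" for name, value in options.items()]
    args.append(f"--starttime={to_millis(start)}")
    args.append(f"--endtime={to_millis(start + window)}")
    return args + [table, output_path]

# Same start time as in the usage example, with a 1 hour window.
start = EPOCH + timedelta(milliseconds=1265875194289)
print(" ".join(hashtable_args("TestTable", "/hashes/testTable", start,
                              timedelta(hours=1), batchsize=32000,
                              numhashfiles=50, families="cf2,cf3")))
```

The options come out in a different order than in the usage example, which should not matter because they are all named.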

The batchsize option defines how much cell data from a given region goes into a single hash value. This setting directly affects synchronization efficiency: a larger value means fewer hashes, which may reduce the number of scans performed by the SyncTable mapper tasks (the next step in the process). The rule of thumb is that the fewer cells are out of sync (the lower the probability of finding differences), the larger the batch size can be. In other words, if little data is out of sync, set this value larger, and vice versa.
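The trade-off can be made concrete with a toy model. The sketch below follows the rule from the usage text (rows are added to a batch until the target size is reached) on made-up data: 100 rows of 1000 bytes each. The functions `batch_rows` and `rows_to_rescan` are illustrative assumptions, not HBase code.

```python
def batch_rows(row_sizes, batchsize):
    """Group rows into batches; a batch closes once it reaches batchsize bytes."""
    batches, current, current_bytes = [], [], 0
    for row, size in row_sizes:
        current.append(row)
        current_bytes += size
        if current_bytes >= batchsize:
            batches.append(current)
            current, current_bytes = [], 0
    if current:  # keep the final, partially filled batch
        batches.append(current)
    return batches

def rows_to_rescan(batches, dirty_rows):
    """Rows in every batch that contains at least one out-of-sync row."""
    return sum(len(batch) for batch in batches if dirty_rows & set(batch))

rows = [(f"row-{i:03d}", 1000) for i in range(100)]
for batchsize in (8000, 32000):
    batches = batch_rows(rows, batchsize)
    # Columns: batch size, number of hashes, rows to re-read for one dirty row.
    print(batchsize, len(batches), rows_to_rescan(batches, {"row-050"}))
```

With 8000-byte batches this model produces 13 hashes and a single out-of-sync row costs a rescan of 8 rows. With 32000-byte batches there are only 4 hashes, but the same row costs a rescan of 32 rows. Larger batches pay off when mismatches are rare.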

Step 2: SyncTable

After HashTable has run on the source cluster, SyncTable can be run on the target cluster. As with replication and other synchronization tasks, the target cluster must be able to reach all RegionServers/DataNodes of the source cluster.

SyncTable usage:

Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>

Options:
 sourcezkcluster  ZK cluster key of the source table
                  (defaults to cluster in classpath's config)
 targetzkcluster  ZK cluster key of the target table
                  (defaults to cluster in classpath's config)
 dryrun           if true, output counters but no writes
                  (defaults to false)

Args:
 sourcehashdir    path to HashTable output dir for source table
                  if not specified, then all data will be scanned
 sourcetable      Name of the source table to sync from
 targettable      Name of the target table to sync to

Examples:
 For a dry run SyncTable of tableA from a remote source cluster
 to a local target cluster:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
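The ZK cluster key in the example has the form quorum hosts, client port and parent znode, joined by colons. A small sketch of how such a key and the SyncTable argument list could be put together is shown below. The helpers `zk_cluster_key` and `synctable_args` are hypothetical names, and only flags from the usage text above appear.

```python
def zk_cluster_key(hosts, port, znode_parent):
    """Join quorum hosts, client port and parent znode into a cluster key."""
    return f"{','.join(hosts)}:{port}:{znode_parent}"

def synctable_args(hash_dir, source_table, target_table,
                   source_zk=None, dry_run=False):
    """Build a SyncTable argument list; run it on the target cluster."""
    args = ["bin/hbase", "org.apache.hadoop.hbase.mapreduce.SyncTable",
            f"--dryrun={'true' if dry_run else 'false'}"]
    if source_zk is not None:
        # Without this flag the cluster from the classpath config is used.
        args.append(f"--sourcezkcluster={source_zk}")
    return args + [hash_dir, source_table, target_table]

key = zk_cluster_key(["zk1.example.com", "zk2.example.com", "zk3.example.com"],
                     2181, "/hbase")
print(" ".join(synctable_args("hdfs://nn:9000/hashes/tableA", "tableA", "tableA",
                              source_zk=key, dry_run=True)))
```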

The dryrun option is very useful for read-only operations and table comparison. It reports the number of differences between the two tables without making any changes to either, so it can be used as a substitute for the VerifyReplication tool.

By default, SyncTable makes the target table a replica of the source table.

Setting doDeletes to false changes the default behavior so that data present in the target table but not in the source table is not deleted. Similarly, setting doPuts to false changes the default behavior so that data present in the source table but missing from the target table is not added. Setting both doDeletes and doPuts to false has the same effect as setting dryrun to true.

In two-way replication or other scenarios where both the source and the target cluster receive data from other inputs, it is recommended to set the doDeletes option to false.
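The interplay of dryrun, doPuts and doDeletes described above can be modeled on plain dictionaries. This is a simplified sketch of the described semantics, not SyncTable's implementation: cells are keyed by (row, column) only, timestamps are ignored, and the function `sync`, its counter names and the `StuInfo:Nick` column are hypothetical.

```python
def sync(source, target, do_puts=True, do_deletes=True, dry_run=False):
    """Move target towards source; return counters of the differences found."""
    counters = {"missing_in_target": 0, "different_value": 0,
                "extra_in_target": 0}
    for cell, value in source.items():
        if cell not in target:
            counters["missing_in_target"] += 1
        elif target[cell] != value:
            counters["different_value"] += 1
        else:
            continue  # cell already in sync
        if do_puts and not dry_run:
            target[cell] = value
    # Cells that exist only in the target (snapshot the keys before deleting).
    for cell in [c for c in target if c not in source]:
        counters["extra_in_target"] += 1
        if do_deletes and not dry_run:
            del target[cell]
    return counters

source = {("0001", "StuInfo:Name"): "Tom Green", ("0001", "Grades:Math"): "85"}
target = {("0001", "StuInfo:Name"): "Tom Green", ("0001", "StuInfo:Nick"): "Tommy"}

print(sync(source, dict(target), dry_run=True))  # counts only, no writes
merged = dict(target)
sync(source, merged, do_deletes=False)           # adds Grades:Math, keeps Nick
print(sorted(merged))
```

In this model, a run with do_deletes=False leaves the extra target cell in place, which is the behavior recommended above for clusters that also receive writes from elsewhere.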

Example

Here we use two different tables in the same cluster as an example. Synchronizing between different clusters works similarly.

Source table:

Table name: Student

Table data:

hbase(main):010:0> scan 'Student'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0120 seconds

Table structure:

hbase(main):011:0> describe 'Student'
Table Student is ENABLED
Student

COLUMN FAMILIES DESCRIPTION

{NAME => 'Grades', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

{NAME => 'StuInfo', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

2 row(s) in 0.0400 seconds

Execute HashTable

hbase org.apache.hadoop.hbase.mapreduce.HashTable Student /tmp/hash/Student

After execution, the /tmp/hash/Student directory contains the following:

drwxr-xr-x+  - hadoop supergroup          0 2020-12-17 16:09 /tmp/hash/Student/hashes
-rw-r--r--+  2 hadoop supergroup          0 2020-12-17 16:09 /tmp/hash/Student/hashes/_SUCCESS
drwxr-xr-x+  - hadoop supergroup          0 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000
-rw-r--r--+  2 hadoop supergroup        158 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/data
-rw-r--r--+  2 hadoop supergroup        220 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/index
-rw-r--r--+  2 hadoop supergroup         80 2020-12-17 16:08 /tmp/hash/Student/manifest
-rw-r--r--+  2 hadoop supergroup        153 2020-12-17 16:08 /tmp/hash/Student/partitions

Create a new table named Student_2:

create 'Student_2','StuInfo','Grades'

Insert part of the Student table's data into Student_2:

put 'Student_2', '0001', 'StuInfo:Name', 'Tom Green', 1
put 'Student_2', '0001', 'StuInfo:Age', '18'
put 'Student_2', '0001', 'StuInfo:Sex', 'Male'

The data in the Student_2 and Student tables now looks as follows:

hbase(main):015:0> scan 'Student_2'
ROW                                       COLUMN+CELL
 0001                                     column=StuInfo:Age, timestamp=1608192992466, value=18
 0001                                     column=StuInfo:Name, timestamp=1, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1608192995476, value=Male
1 row(s) in 0.0180 seconds

hbase(main):016:0> scan 'Student'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0070 seconds

Now synchronize Student to Student_2 with SyncTable.

Run:

hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=false --sourcezkcluster=hadoop:2181:/hbase hdfs://hadoop:8020/tmp/hash/Student Student Student_2

After completing the task, you can see that the two tables are synchronized:

hbase(main):001:0> scan 'Student_2'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.2620 seconds

hbase(main):002:0> scan 'Student'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0210 seconds
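Rather than comparing the two listings by eye, the scan output can be parsed and compared as sets of cells. The sketch below assumes the shell output format shown above and row keys without whitespace. The names `CELL` and `parse_scan` are illustrative, and the sample text is shortened to two cells per table.

```python
import re

# One cell per line: "<row>  column=<family:qualifier>, timestamp=<ms>, value=<v>"
CELL = re.compile(r"^\s*(\S+)\s+column=([^,]+), timestamp=(\d+), value=(.*)$")

def parse_scan(text):
    """Return the set of (row, column, timestamp, value) cells in scan output."""
    cells = set()
    for line in text.splitlines():
        match = CELL.match(line)
        if match:  # prompt, header and summary lines do not match
            row, column, timestamp, value = match.groups()
            cells.add((row, column, int(timestamp), value))
    return cells

student = """\
ROW                COLUMN+CELL
 0001              column=Grades:Math, timestamp=1604988339775, value=85
 0001              column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
1 row(s) in 0.0210 seconds
"""
student_2 = """\
ROW                COLUMN+CELL
 0001              column=StuInfo:Name, timestamp=1, value=Tom Green
1 row(s) in 0.0180 seconds
"""

# Cells present in Student but not (identically) in Student_2.
for cell in sorted(parse_scan(student) - parse_scan(student_2)):
    print(cell)
```

An empty difference in both directions means the two scans contain exactly the same cells, timestamps included.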

This is an original article by the blogger "xiaozhch5" of "From Big Data to Artificial Intelligence". It is released under the CC 4.0 BY-SA license. When reprinting, please include the original source link and this statement.

Original link: https://lrting.top/backend/382/