HashTable/SyncTable is a tool for synchronizing HBase table data. The process consists of two steps, each of which is a MapReduce job. Like the CopyTable tool, it can be used to synchronize part or all of a table's data within the same cluster or between different clusters. Compared with CopyTable, however, it performs much better when the two tables are already mostly in sync. Instead of copying all table data in a given range, it first runs HashTable on the source cluster to generate hashes over batches of the source table's data, then runs SyncTable on the target cluster, which computes the same hashes over the target table and compares them against the source hashes to find the ranges that differ. Only the mismatched data then needs to be synchronized, which can greatly reduce bandwidth usage and data transfer.
Step 1: HashTable
First, execute HashTable in the source cluster:
HashTable usage:
Usage: HashTable [options] <tablename> <outputpath>

Options:
 batchsize     the target amount of bytes to hash in each batch
               rows are added to the batch until this size is reached
               (defaults to 8000 bytes)
 numhashfiles  the number of hash files to create
               if set to fewer than number of regions then
               the job will create this number of reducers
               (defaults to 1/100 of regions -- at least 1)
 startrow      the start row
 stoprow       the stop row
 starttime     beginning of the time range (unixtime in millis)
               without endtime means from starttime to forever
 endtime       end of the time range.  Ignored if no starttime specified.
 scanbatch     scanner batch size to support intra row scans
 versions      number of cell versions to include
 families      comma-separated list of families to include

Args:
 tablename     Name of the table to hash
 outputpath    Filesystem path to put the output data

Examples:
 To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
The batchsize property defines how many bytes of cell data within a given region go into a single hash value. This setting directly affects synchronization efficiency, because it determines how many scans the SyncTable mapper tasks (the next step in the process) must perform. The rule of thumb: the fewer cells that are out of sync (i.e., the lower the probability of finding differences), the larger batchsize can be set, and vice versa.
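For instance, if two tables are expected to be almost identical, hashing in larger batches means fewer hashes to store and compare. A minimal sketch (the table name 'orders' and the output path are hypothetical, chosen just for illustration):

# Mostly-in-sync table: hash in ~64 kB batches instead of the 8000-byte default
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=64000 orders /hashes/orders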
Step 2: SyncTable
After HashTable has run on the source cluster, SyncTable can be executed on the target cluster. As with replication and other synchronization jobs, the target cluster must be able to reach all of the source cluster's RegionServers and DataNodes.
SyncTable usage:
Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>

Options:
 sourcezkcluster  ZK cluster key of the source table
                  (defaults to cluster in classpath's config)
 targetzkcluster  ZK cluster key of the target table
                  (defaults to cluster in classpath's config)
 dryrun           if true, output counters but no writes
                  (defaults to false)

Args:
 sourcehashdir    path to HashTable output dir for source table
                  if not specified, then all data will be scanned
 sourcetable      Name of the source table to sync from
 targettable      Name of the target table to sync to

Examples:
 For a dry run SyncTable of tableA from a remote source cluster
 to a local target cluster:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableACopy
The dryrun option is very useful for read-only comparison of two tables: it reports the number of differences between them without making any changes to the target. It can serve as an alternative to the VerifyReplication tool.
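In a dry run, the result shows up only in the job's MapReduce counters. Reusing the illustrative names from the usage example above, a sketch of such a comparison (the counter names in the comment are the ones documented for SyncTable):

# Dry run: nothing is written; differences are reported via job counters
# such as HASHES_NOT_MATCHED, ROWSWITHDIFFS, SOURCEMISSINGCELLS, TARGETMISSINGCELLS
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableACopy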
By default, SyncTable makes the target table a replica of the source table.
Setting doDeletes to false changes this default so that data present in the target table but absent from the source is not deleted. Similarly, setting doPuts to false prevents writing data that exists in the source table but is missing from the target. Setting both doDeletes and doPuts to false has the same effect as setting dryrun to true.
In two-way replication, or other scenarios where both the source and the target cluster receive writes of their own, it is recommended to set the doDeletes option to false, for example:
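A minimal sketch of such a one-directional sync (cluster key, paths, and table names are the illustrative ones from the usage text above):

# Add data missing from the target, but never delete target-only data;
# suitable when the target cluster also receives writes of its own
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false --sourcezkcluster=zk1.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableACopy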
Example
This example synchronizes two different tables within the same cluster; the procedure between different clusters is analogous.
Source table:
Table name: Student

Table data:
hbase(main):010:0> scan 'Student'
ROW            COLUMN+CELL
 0001          column=Grades:BigData, timestamp=1604988333715, value=80
 0001          column=Grades:Computer, timestamp=1604988336890, value=90
 0001          column=Grades:Math, timestamp=1604988339775, value=85
 0001          column=StuInfo:Age, timestamp=1604988324791, value=18
 0001          column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001          column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0120 seconds
Table structure:
hbase(main):011:0> describe 'Student'
Table Student is ENABLED
Student
COLUMN FAMILIES DESCRIPTION
{NAME => 'Grades', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'StuInfo', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0400 seconds
Execute HashTable
hbase org.apache.hadoop.hbase.mapreduce.HashTable Student /tmp/hash/Student
After the job finishes, the output appears under the /tmp/hash/Student directory.
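One way to inspect it is a recursive HDFS listing, which is how the listing below was presumably produced:

hdfs dfs -ls -R /tmp/hash/Student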
drwxr-xr-x+  - hadoop supergroup    0 2020-12-17 16:09 /tmp/hash/Student/hashes
-rw-r--r--+  2 hadoop supergroup    0 2020-12-17 16:09 /tmp/hash/Student/hashes/_SUCCESS
drwxr-xr-x+  - hadoop supergroup    0 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000
-rw-r--r--+  2 hadoop supergroup  158 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/data
-rw-r--r--+  2 hadoop supergroup  220 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/index
-rw-r--r--+  2 hadoop supergroup   80 2020-12-17 16:08 /tmp/hash/Student/manifest
-rw-r--r--+  2 hadoop supergroup  153 2020-12-17 16:08 /tmp/hash/Student/partitions
Create a new table named Student_2:
create 'Student_2','StuInfo','Grades'
Now insert into Student_2 part of the data from the Student table. Note that the first put deliberately writes StuInfo:Name with an explicit timestamp of 1, so that even this matching cell differs from the source by its timestamp:
put 'Student_2', '0001', 'StuInfo:Name', 'Tom Green', 1
put 'Student_2', '0001', 'StuInfo:Age', '18'
put 'Student_2', '0001', 'StuInfo:Sex', 'Male'
The data in the Student and Student_2 tables is now as follows:
hbase(main):015:0> scan 'Student_2'
ROW            COLUMN+CELL
 0001          column=StuInfo:Age, timestamp=1608192992466, value=18
 0001          column=StuInfo:Name, timestamp=1, value=Tom Green
 0001          column=StuInfo:Sex, timestamp=1608192995476, value=Male
1 row(s) in 0.0180 seconds

hbase(main):016:0> scan 'Student'
ROW            COLUMN+CELL
 0001          column=Grades:BigData, timestamp=1604988333715, value=80
 0001          column=Grades:Computer, timestamp=1604988336890, value=90
 0001          column=Grades:Math, timestamp=1604988339775, value=85
 0001          column=StuInfo:Age, timestamp=1604988324791, value=18
 0001          column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001          column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0070 seconds
Now synchronize Student to Student_2 using SyncTable.
Execute:
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=false --sourcezkcluster=hadoop:2181:/hbase hdfs://hadoop:8020/tmp/hash/Student Student Student_2
After the job completes, you can see that the two tables are in sync:
hbase(main):001:0> scan 'Student_2'
ROW            COLUMN+CELL
 0001          column=Grades:BigData, timestamp=1604988333715, value=80
 0001          column=Grades:Computer, timestamp=1604988336890, value=90
 0001          column=Grades:Math, timestamp=1604988339775, value=85
 0001          column=StuInfo:Age, timestamp=1604988324791, value=18
 0001          column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001          column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.2620 seconds

hbase(main):002:0> scan 'Student'
ROW            COLUMN+CELL
 0001          column=Grades:BigData, timestamp=1604988333715, value=80
 0001          column=Grades:Computer, timestamp=1604988336890, value=90
 0001          column=Grades:Math, timestamp=1604988339775, value=85
 0001          column=StuInfo:Age, timestamp=1604988324791, value=18
 0001          column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001          column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0210 seconds
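As a final check, you could re-run SyncTable in dryrun mode, assuming the source table has not changed since HashTable ran; for two identical tables the job should report no mismatches. A minimal sketch, reusing the cluster key and paths from above:

# Re-compare after syncing; for identical tables, counters such as
# HASHES_NOT_MATCHED and ROWSWITHDIFFS should come out as zero
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=hadoop:2181:/hbase hdfs://hadoop:8020/tmp/hash/Student Student Student_2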
This is an original article by the blogger "xiaozhch5" (From Big Data to Artificial Intelligence), licensed under CC 4.0 BY-SA. Please include the original source link and this statement when reposting.
Original link: https://lrting.top/backend/382/