1. Small Table Join Large Table
1.Definition
//Place the table with relatively dispersed keys and a small amount of data on the left side of the join; this effectively reduces the chance of out-of-memory errors.
//Further, you can use map join to load small dimension tables (fewer than 1000 records) into memory in advance and complete the join on the map side.
//In practice, newer versions of Hive have optimized both small-table JOIN big-table and big-table JOIN small-table; there is no longer a difference between putting the small table on the left or on the right.
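Besides the automatic conversion used in the example below, you can also request a map join explicitly with the MAPJOIN hint. A minimal sketch, assuming the smalltable and bigtable tables created in the example that follows (note that newer Hive versions ignore this hint by default, controlled by hive.ignore.mapjoin.hint):
select /*+ MAPJOIN(s) */ b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b
on s.id = b.id;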
2.Example
(1)Requirement
//Test the efficiency of joining a large table with a small table
(2)Create the large table, the small table, and the table that holds the join result
//Create large tables
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
//Create Small Table
create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
//Create the table that holds the join result
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(3)Import data into large and small tables, respectively
load data local inpath '/root/hivedata/bigtable' into table bigtable;
load data local inpath '/root/hivedata/smalltable' into table smalltable;
(4)Disable the mapjoin function (on by default)
set hive.auto.convert.join = false;
(5)Execute the small-table JOIN large-table statement
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable b
on b.id = s.id;
(6)Execute the large-table JOIN small-table statement
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;
2. Large Table Join Large Table
1.Empty key filtering
Sometimes a join times out because certain keys have too much data; rows with the same key are all sent to the same reducer, which then runs out of memory.
In many cases these keys correspond to abnormal data, so we need to filter them out in the SQL statement.
For example, rows whose key field is null, as shown below:
(1)Configure the JobHistory Server (mapred-site.xml)
<property>
<name>mapreduce.jobhistory.address</name>
<value>node01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node01:19888</value>
</property>
(2)Start History Server
sbin/mr-jobhistory-daemon.sh start historyserver
stay"http://"View jobhistory in node01:19888/jobhistory"
(3)Create the original data table, the empty-id table, and the table that holds the join result
//Create original table
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
//Create an empty id table
create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
//Create the table that holds the join result
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(4)Load the raw data and the empty-id data into the corresponding tables
load data local inpath '/root/hivedata/ori' into table ori;
load data local inpath '/root/hivedata/nullid' into table nullidtable;
(5)Test without filtering null ids
insert overwrite table jointable
select n.* from nullidtable n left join ori o on n.id = o.id;
(6)Test with null ids filtered out
insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null) n
left join ori o on n.id = o.id;
2.Empty key conversion
Sometimes, even though a null key corresponds to a lot of data, that data is not abnormal and must be included in the join result.
In this case we can assign a random value to the null key field in table a, so that the rows are distributed randomly and evenly across the reducers.
2.1 Example 1: Do not randomly distribute null values
(1)Set the number of reducers to 5
set mapreduce.job.reduces=5;
(2)JOIN the two tables
insert overwrite table jointable
select n.* from nullidtable n left join ori b on n.id = b.id;
(3)Result
Data skew occurs; some reducers consume far more resources than the others.
2.2 Example 2: Randomly distribute null values
(1)Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(2)JOIN the two tables
insert overwrite table jointable
select n.* from nullidtable n full join ori o on
case when n.id is null then concat('hive', rand()) else n.id end = o.id;
(3)Result
Data skew is eliminated and resource consumption is balanced across the reducers.
3. MapJoin
1.Definition
//If MapJoin is not enabled or the join does not meet the MapJoin criteria, the Hive parser converts the join into a Common Join,
//that is, the join is completed in the Reduce phase, which is prone to data skew.
//With MapJoin, the small table is loaded entirely into memory and the join is performed on the map side, avoiding reducer processing.
2.Parameter settings to enable MapJoin
(1)Enable automatic selection of MapJoin (default is true)
set hive.auto.convert.join = true;
(2)Threshold below which a table is treated as a small table (default 25 MB)
set hive.mapjoin.smalltable.filesize=25000000;
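To check whether a particular query actually meets these criteria, you can inspect its plan with EXPLAIN; a minimal sketch, assuming the bigtable and smalltable tables from section 1 (a converted query should show a Map Join Operator in the plan rather than a common Join Operator):
explain
select b.id from bigtable b join smalltable s on s.id = b.id;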
3.Example
(1)Enable the mapjoin function (default is true)
set hive.auto.convert.join = true;
(2)Execute the small-table JOIN large-table statement
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b
on s.id = b.id;
(3)Execute the large-table JOIN small-table statement
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
4. Group By
1. Definition
By default, rows with the same key are distributed to one reducer in the Map phase; when a single key has too much data, skew occurs.
Not all aggregation needs to be done on the Reduce side; many aggregations can be partially performed on the Map side and finalized on the Reduce side.
2. Parameter settings to enable map-side aggregation (see the usage sketch after the list)
(1) Whether to aggregate on the Map side (default is true)
hive.map.aggr = true
(2) Number of entries to aggregate on the Map side
hive.groupby.mapaggr.checkinterval = 100000
(3) Load balancing when data is skewed (default is false)
hive.groupby.skewindata = true
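A minimal usage sketch applying these settings together, assuming the bigtable table created in section 1 and a simple aggregation on its id column:
set hive.map.aggr = true;
set hive.groupby.mapaggr.checkinterval = 100000;
set hive.groupby.skewindata = true;
//With skewindata enabled, the query runs as two MR jobs: the first distributes keys randomly for partial aggregation, the second completes the final group by
select id, count(*) from bigtable group by id;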
5. Count(Distinct) deduplication counting
1. Definition
This does not matter when the amount of data is small, but with a large amount of data the COUNT DISTINCT operation is completed by a single Reduce task.
That one reducer has to process so much data that the whole job struggles to finish.
COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by COUNT.
2. Examples
(1) Create a large table
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(2) Loading data
load data local inpath '/root/hivedata/bigtable' into table bigtable;
(3) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(4) Execute the distinct id count query
select count(distinct id) from bigtable;
(5) Use GROUP BY to deduplicate ids
select count(id) from (select id from bigtable group by id) a;
(6) Summary
Although this uses one more job, it is definitely worth it when the data volume is large.
6. Cartesian Product
Avoid Cartesian products whenever possible: do not join without an ON condition or with an invalid ON condition.
Hive can only use one reducer to complete the Cartesian product.
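A minimal sketch of the pattern to avoid, assuming the bigtable and smalltable tables from section 1 (strict-mode checks on Cartesian products may reject the first query outright):
//Cartesian product: no ON condition, so every pair of rows is produced by a single reducer
select b.id, s.id from bigtable b join smalltable s;
//Rewritten with a real join key
select b.id, s.id from bigtable b join smalltable s on b.id = s.id;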
7. Row and column filtering
1. Definition
Column handling: In SELECT, take only the columns you need; use partition filtering whenever partitions are available, and avoid SELECT * (see the sketch below).
Row handling: In partition pruning, when using an outer join, if the filter condition on the secondary table is written in the WHERE clause, the whole table is joined first and filtered afterwards.
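A minimal sketch of the column-handling advice, assuming the partitioned table ori_partitioned created in section 8 and one of its p_time partitions:
//Avoid: reads every column of every partition
select * from ori_partitioned;
//Prefer: only the needed columns, restricted to the required partition
select id, url_rank from ori_partitioned where p_time = 20111230000010;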
2. Row processing examples
(1) Test: join the two tables first, then filter with the WHERE condition
select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;
(2) Filter in a subquery first, then join the tables
select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;
8. Dynamic Partition Adjustment
1.Parameter settings to enable dynamic partitioning
(1)Enable dynamic partitioning (default is true, enabled)
hive.exec.dynamic.partition=true
(2)Set non-strict mode (dynamic partitioning mode; the default strict requires at least one partition to be specified as a static partition, while nonstrict allows all partition fields to use dynamic partitioning)
hive.exec.dynamic.partition.mode=nonstrict
(3)Maximum number of dynamic partitions that can be created across all MR nodes.
hive.exec.max.dynamic.partitions=1000
(4)Maximum number of dynamic partitions that can be created on each MR node. This parameter must be set according to the actual data. For example, if the source data contains one year of data, the day field has 365 values, so this parameter needs to be set greater than 365; with the default value of 100, an error will be reported.
hive.exec.max.dynamic.partitions.pernode=100
(5)Maximum number of HDFS files that can be created in the whole MR job.
hive.exec.max.created.files=100000
(6)Whether to throw an exception when an empty partition is generated. Usually does not need to be set.
hive.error.on.empty.partition=false
2.Example
(1)Requirement
//Insert the data from ori_partitioned into the corresponding partitions of the target table ori_partitioned_target according to time (e.g., 20111230000008).
(2)Create partition table
create table ori_partitioned(id bigint, time bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
partitioned by (p_time bigint)
row format delimited fields terminated by '\t';
(3)Loading data into partitioned tables
load data local inpath '/home/atguigu/ds1' into table ori_partitioned partition(p_time='20111230000010') ;
load data local inpath '/home/atguigu/ds2' into table ori_partitioned partition(p_time='20111230000011') ;
(4)Create Target Partition Table
create table ori_partitioned_target(id bigint, time bigint, uid string,
keyword string, url_rank int, click_num int, click_url string)
PARTITIONED BY (p_time STRING) row format delimited fields terminated by '\t';
(5)Set the dynamic partition parameters and perform the insert
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
set hive.exec.max.created.files = 100000;
set hive.error.on.empty.partition = false;
insert overwrite table ori_partitioned_target partition (p_time)
select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
(6)View partitions of the target partition table
show partitions ori_partitioned_target;