20 Enterprise Tuning (2): Table Optimization

Posted by BigMike on Sun, 18 Aug 2019 04:24:46 +0200

1. Small Table Join Large Table

1.Definition
	//Put tables whose keys are relatively dispersed and whose data volume is small on the left side of the join; this effectively reduces the chance of memory overflow errors.
	//Further, you can use a map join to load small dimension tables (fewer than 1000 records) into memory in advance and complete the join on the map side.
	//Actual testing shows that newer versions of Hive have optimized both small-table-JOIN-big-table and big-table-JOIN-small-table; there is no longer any difference between putting the small table on the left or on the right.
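	//For reference, a map join can also be forced explicitly with the older hint syntax (newer Hive versions ignore the hint unless hive.ignore.mapjoin.hint is set to false). A minimal sketch, assuming the smalltable and bigtable tables created in the example below:
	select /*+ MAPJOIN(s) */ b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
	from bigtable b
	join smalltable s
	on s.id = b.id;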

2.Example
	(1)Requirement
		//Test the efficiency of joining large and small tables
		
	(2)Create the large table, the small table, and the post-join result table
			//Create large tables
		create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) 
		row format delimited fields terminated by '\t';
			//Create Small Table
		create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
		row format delimited fields terminated by '\t';
			//Create the post-join result table
		create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
		row format delimited fields terminated by '\t';
		
	(3)Import data into large and small tables, respectively
		load data local inpath '/root/hivedata/bigtable' into table bigtable;
		load data local inpath '/root/hivedata/smalltable' into table smalltable;
		
	(4)Disable the MapJoin function (enabled by default)
		set hive.auto.convert.join = false;
	
	(5)Execute the small-table-JOIN-large-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from smalltable s
		left join bigtable  b
		on b.id = s.id;
		
	(6)Execute the large-table-JOIN-small-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from bigtable  b
		left join smalltable  s
		on s.id = b.id;

2. Large Table Join Large Table

1.Empty key filtering
	Sometimes a join times out because certain keys have too much data: rows with the same key are all sent to the same reducer, which then runs out of memory.
	In many cases these keys correspond to abnormal data, so we need to filter them out in the SQL statement.
	For example, rows whose key field is null, as follows:
		(1)Configure History Server(mapred-site.xml)
			<property>
				<name>mapreduce.jobhistory.address</name>
				<value>node01:10020</value>
			</property>
			<property>
				<name>mapreduce.jobhistory.webapp.address</name>
				<value>node01:19888</value>
			</property>
		
		(2)Start History Server
			sbin/mr-jobhistory-daemon.sh start historyserver
				stay"http://"View jobhistory in node01:19888/jobhistory"
			
		(3)Create the original table, the null-id table, and the post-join table
				//Create original table
			create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) 
			row format delimited fields terminated by '\t';
				//Create the null-id table
			create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) 
			row format delimited fields terminated by '\t';
				//Create the post-join table
			create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) 
			row format delimited fields terminated by '\t';
			
		(4)Load the raw data and the null-id data into the corresponding tables
			load data local inpath '/root/hivedata/ori' into table ori;
			load data local inpath '/root/hivedata/nullid' into table nullidtable;
		
		(5)Test without filtering null ids
			insert overwrite table jointable 
				select n.* from nullidtable n left join ori o on n.id = o.id;
		
		(6)Test with null ids filtered out
			insert overwrite table jointable 
				select n.* from (select * from nullidtable where id is not null) n 
				left join ori o on n.id = o.id;
				
2.Empty key transformation
	Sometimes, although a null key corresponds to a lot of data, those rows are not abnormal data and must be included in the join result.
	In that case we can assign a random value to the null key fields of the table, so that the data is distributed randomly and evenly across the different reducers.
	
	2.1 Example 1: null values not randomly distributed
		(1)Set the number of reducers to 5
			set mapreduce.job.reduces=5;
		(2)JOIN Two tables
			insert overwrite table jointable 
				select n.* from nullidtable n left join ori b on n.id = b.id;
		(3)Result
			Data skew occurs: some reducers consume far more resources than the others.
			
	2.2 Example 2: null values randomly distributed
		(1)Set the number of reducers to 5
			set mapreduce.job.reduces = 5;
		(2)JOIN Two tables
			insert overwrite table jointable
			select n.* from nullidtable n full join ori o on 
			case when n.id is null then concat('hive', rand()) else n.id end = o.id;
		(3)Result
			Data skew is eliminated and resource consumption is balanced across the reducers.
			Because concat('hive', rand()) never matches any o.id, the null-key rows no longer pile up on one reducer but are spread over random ones.

3. MapJoin

1.Definition
	//If MapJoin is not specified or the query does not meet the MapJoin criteria, the Hive parser converts the join into a Common Join,
	//i.e. the join is completed in the Reduce phase, which is prone to data skew.
	//With MapJoin, the small table is loaded entirely into memory and the join is done on the map side, avoiding reducer processing altogether.
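	//To check which join Hive actually chose, inspect the query plan: when a map join applies, a Map Join Operator appears in the EXPLAIN output. A quick check, assuming the tables from section 1:
	explain
	select b.id from bigtable b join smalltable s on s.id = b.id;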
			
2.Parameter settings to enable MapJoin
	(1)Set Automatic Selection Mapjoin
		set hive.auto.convert.join = true;   -- default is true
	
	(2)Threshold that distinguishes small tables from large ones (tables under 25 MB are treated as small by default)
		set hive.mapjoin.smalltable.filesize=25000000;
	
3.Example
	(1)Enable the MapJoin function
		set hive.auto.convert.join = true;   -- default is true

	(2)Execute the small-table-JOIN-large-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from smalltable s
		join bigtable  b
		on s.id = b.id;
	
	(3)Execute the large-table-JOIN-small-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from bigtable  b
		join smalltable  s
		on s.id = b.id;

4. GroupBy

1. Definition
	By default, rows with the same key are distributed to one reducer during the Map phase, so the load is skewed when a single key has too much data.
	Not all aggregation operations need to be done on the Reduce side; many can be partially aggregated on the Map side and finished on the Reduce side. See the sketch after the parameter list below.
	
2. Parameter settings to enable map-side aggregation
	(1) Whether to aggregate on the Map side or not, defaulting to True
		hive.map.aggr = true
		
	(2) Number of entries aggregated on the Map side
		hive.groupby.mapaggr.checkinterval = 100000
		
	(3) Load balancing when data is skewed (default is false)
		hive.groupby.skewindata = true
		When set to true, the query plan generates two MR jobs: in the first, map output is distributed randomly to reducers, which perform partial aggregation; in the second, the partial results are distributed by the group-by key and the final aggregation is completed.
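
3. Example
	A minimal sketch of map-side aggregation, assuming the bigtable table created in section 5:
		set hive.map.aggr = true;
		set hive.groupby.mapaggr.checkinterval = 100000;
		set hive.groupby.skewindata = true;
		select id, count(*) from bigtable group by id;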

5. Count(Distinct) Deduplicated Counting

1. Definition
	It does not matter when the data volume is small, but with a large data volume, a COUNT DISTINCT operation needs a single Reduce Task to complete.
	The amount of data that this one reducer must process is so large that it makes the entire Job difficult to complete.
	COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT.
	
2. Examples
	(1) Create a large table
		create table bigtable(id bigint, time bigint, uid string, keyword string, 
		url_rank int, click_num int, click_url string) 
		row format delimited fields terminated by '\t';
	
	(2) Loading data
		load data local inpath '/root/hivedata/bigtable' into table bigtable;
	
	(3) Set the number of reducers to 5
		set mapreduce.job.reduces = 5;
	
	(4) Execute the distinct id count
		select count(distinct id) from bigtable;

	(5) Use GROUP BY to deduplicate the ids
		select count(id) from (select id from bigtable group by id) a;		
	
	(6) Summary
		Although this uses one extra Job, it is definitely worth it when the data volume is large.

6. Cartesian Product

Avoid Cartesian products whenever possible: do not join without an ON condition or with an invalid ON condition.
Hive can only use one reducer to complete a Cartesian product.
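
For illustration, the following join (a hypothetical sketch using the tables from section 1) has no ON condition and degenerates into a Cartesian product handled by a single reducer; in strict mode (hive.mapred.mode=strict) Hive rejects such queries outright:
	select b.id, s.id from bigtable b join smalltable s;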

7. Row and Column Filtering

1. Definition
	Column handling: in SELECT, take only the columns you need; where partitions exist, use partition filtering as much as possible, and avoid SELECT *. See the column processing sketch below.
	Row handling: with partition pruning, when using outer joins, if the filter condition on the secondary table is written in the WHERE clause, the whole table is joined first and only filtered afterwards.

2. Row Processing Instances
	(1) Test: join the two tables first, then filter with the WHERE condition
		select o.id from bigtable b
		join ori o on o.id = b.id
		where o.id <= 10;

	(2) Filter in a subquery first, then join the tables
		select b.id from bigtable b
		join (select id from ori where id <= 10 ) o on b.id = o.id;
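
3. Column Processing Instance
	A minimal sketch (assuming the partitioned table ori_partitioned created in section 8): select only the needed columns and filter on the partition key instead of using SELECT *.
		select id, click_url from ori_partitioned where p_time = 20111230000010;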

8. Dynamic Partition Adjustment

1.Parameter settings for dynamic partitioning
	(1)Enable dynamic partitioning (default: true, enabled)
		hive.exec.dynamic.partition=true
	
	(2)Set non-strict mode (the dynamic partitioning mode defaults to strict, which requires at least one partition to be specified as a static partition; nonstrict allows all partition fields to use dynamic partitioning.)
		hive.exec.dynamic.partition.mode=nonstrict
	
	(3)The maximum total number of dynamic partitions that can be created across all MR nodes.
		hive.exec.max.dynamic.partitions=1000
	
	(4)The maximum number of dynamic partitions that can be created on each MR node. This needs to be set according to the actual data. For example, if the source data contains one year of data, the day field has 365 distinct values, so this parameter must be set greater than 365; the default of 100 would cause an error.
		hive.exec.max.dynamic.partitions.pernode=100
	
	(5)The maximum number of HDFS files that can be created in the whole MR Job.
		hive.exec.max.created.files=100000
	
	(6)Whether to throw an exception when an empty partition is generated. This usually does not need to be set.
		hive.error.on.empty.partition=false

2.Example
	(1)Requirement:
		//Insert the data from ori_partitioned into the corresponding partitions of the target table ori_partitioned_target according to time (e.g., 20111230000008).
		
	(2)Create partition table
		create table ori_partitioned(id bigint, time bigint, uid string, keyword string,
		url_rank int, click_num int, click_url string) 
		partitioned by (p_time bigint) 
		row format delimited fields terminated by '\t';
	
	(3)Loading data into partitioned tables
		load data local inpath '/home/atguigu/ds1' into table ori_partitioned partition(p_time='20111230000010') ;
		load data local inpath '/home/atguigu/ds2' into table ori_partitioned partition(p_time='20111230000011') ;
	
	(4)Create Target Partition Table
		create table ori_partitioned_target(id bigint, time bigint, uid string,
		keyword string, url_rank int, click_num int, click_url string) 
		PARTITIONED BY (p_time STRING) row format delimited fields terminated by '\t';
	
	(5)Setting up dynamic partitions
		set hive.exec.dynamic.partition = true;
		set hive.exec.dynamic.partition.mode = nonstrict;
		set hive.exec.max.dynamic.partitions = 1000;
		set hive.exec.max.dynamic.partitions.pernode = 100;
		set hive.exec.max.created.files = 100000;
		set hive.error.on.empty.partition = false;
		insert overwrite table ori_partitioned_target partition (p_time) 
		select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
		
	(6)View partitions of the target partition table
		show partitions ori_partitioned_target;
