Hive SQL optimization ideas

Posted by ESCForums.com on Thu, 30 Dec 2021 12:47:56 +0100

Hive optimization falls into three broad categories: configuration optimization, SQL statement optimization, and task optimization. Of these, SQL optimization is the one developers deal with most during day-to-day development.

The core idea of optimization is:

  • Reduce the amount of data processed (e.g. partition pruning, column pruning)
  • Avoid data skew (e.g. tuning parameters, scattering skewed keys)
  • Avoid full table scans (e.g. adding partition filters)
  • Reduce the number of jobs (e.g. combining joins that share the same ON condition into one task)

This article was first published on the official account [five minutes learning big data].

HQL statement optimization

1. Use partition pruning and column pruning

Regarding partition pruning: in an outer join, if the filter condition on the secondary table is written in the WHERE clause, Hive will first join against the full table and only then apply the filter.

select a.*  
from a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'

The above SQL makes two mistakes:

  1. The partition filter on the secondary table (table b above) is written in the WHERE clause, which causes the full table of b to be joined before the partition filter is applied.

Note: Although the WHERE condition on table a is also written after the join, predicate pushdown applies to table a, i.e. its WHERE condition is executed before the join; table b, as the driven side of the left join, does not get predicate pushdown!

  2. The ON condition does not filter null values. If both tables contain a large number of null keys, the data will be skewed.

Correct writing:

select a.*  
from a  
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10') 
where a.ds='2020-08-10'

If the rows with a null uid are also required, either transform the null values in the ON condition or select them out separately:

select a.*  
from a  
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10')  
where a.ds='2020-08-10'  
union all  
select a.* from a where a.uid is null 

Or:

select a.*  
from a  
left join b on   
case when a.uid is null then concat("test",RAND()) else a.uid end = b.uid and b.ds='2020-08-10'  
where a.ds='2020-08-10'

Or (subquery):

select a.*  
from a  
left join   
(select uid from b where ds = '2020-08-10' and uid is not null) b on a.uid = b.uid 
where a.uid is not null  
and a.ds='2020-08-10'

2. Try not to use COUNT DISTINCT

Because a COUNT DISTINCT operation is completed by a single Reduce task, that reducer has to process far too much data, which makes it hard for the whole job to finish. COUNT DISTINCT is therefore generally replaced by GROUP BY followed by COUNT. Although this costs one extra job, it is definitely worth it when the data volume is large.

select count(distinct uid)  
from test  
where ds='2020-08-10' and uid is not null  

Convert to:

select count(a.uid)  
from   
(select uid 
 from test 
 where uid is not null and ds = '2020-08-10' 
 group by uid
) a

3. Use with as

Besides the shuffles generated by joins, subqueries also slow down Hive queries, so they should be minimized in SQL statements. WITH AS extracts a subquery used in the statement in advance (similar to a temporary table), so every part of the whole query can reuse its result. Using WITH AS prevents Hive from computing the same subquery twice in different places.

select a.*  
from  a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'  

Can be converted to:

with test1 as 
(
select uid  
from b  
where ds = '2020-08-10' and uid is not null  
)  
select a.*  
from a  
left join test1 on a.uid = test1.uid  
where a.ds='2020-08-10' and a.uid is not null

4. Joins between large and small tables

When writing a query with a join, one traditional rule of thumb is to place the table or subquery with fewer rows on the left of the join operator. The reason is that in the Reduce phase of a join, the contents of the left-hand table are loaded into memory, so putting the smaller table on the left reduces the probability of OOM errors. However, newer versions of Hive have optimized both small-table-join-large-table and large-table-join-small-table, so there is no longer an obvious difference between putting the small table on the left or on the right. Even so, filtering the small table to reduce its data volume before the join still improves efficiency.
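Small-table joins can also be turned into map joins, which broadcast the small table to every mapper and skip the reduce phase entirely. A minimal sketch, reusing the table names from the earlier examples (the size threshold value shown is Hive's default and is an assumption about your configuration):

```sql
-- Let Hive convert qualifying joins to map joins automatically
set hive.auto.convert.join = true;
-- Tables smaller than this size (in bytes) are considered "small" and broadcast
set hive.mapjoin.smalltable.filesize = 25000000;

-- Or force it explicitly with a hint (useful on older Hive versions)
select /*+ MAPJOIN(b) */ a.*
from a
left join b on a.uid = b.uid
where a.ds = '2020-08-10';
```

Because the small table is held in memory on each mapper, a map join avoids the shuffle altogether, which also sidesteps join-side data skew.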

5. Data skew

We all know how data skew arises: one key, or a handful of keys, accounts for, say, 90% of the data, so the whole task is slowed down by the processing of those keys. At the same time, aggregating so many identical keys on a single reducer may cause an out-of-memory error.

Data skew occurs only during a shuffle. Common operators that may trigger a shuffle include: distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, etc. (several of these are Spark operators; in Hive the equivalents are DISTINCT, GROUP BY, and JOIN). When data skew occurs, it is likely because the code uses one of these operators.

Hive's general approaches to handling data skew:

The common practice is parameter tuning:

set hive.map.aggr=true;  
set hive.groupby.skewindata = true;

When hive.groupby.skewindata is set to true, the generated query plan contains two MapReduce jobs.

In the first MapReduce, the output result set of the map will be randomly distributed to the reduce. Each reduce performs partial aggregation operations and outputs results.

The result of this processing is that the same Group By Key may be distributed to different reduce to achieve the purpose of load balancing;

The second MapReduce task distributes the Group By Key to reduce according to the preprocessed data results (this process can ensure that the same Group By Key is distributed to the same reduce), and finally completes the final aggregation operation.
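For example, an aggregation over a heavily skewed key benefits directly from these settings; a sketch, reusing the test table from the earlier examples:

```sql
set hive.map.aggr = true;
set hive.groupby.skewindata = true;

-- With skewindata on, the first job spreads identical uids across random
-- reducers to compute partial counts; the second job then combines the
-- partial counts per uid into the final result.
select uid, count(*) as cnt
from test
where ds = '2020-08-10'
group by uid;
```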

However, this scheme is a black box to us and cannot be controlled any further.

The more general approach is to scatter (salt) the skewed key values ourselves.

For example:

select a.*  
from a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'  

If 90% of the keys are null, data skew is inevitable.

select a.uid  
from test1 as a  
join(  
   select case when uid is null then cast(rand(1000000) as int)  
   else uid end as uid  
   from test2 where ds='2020-08-10') b   
on a.uid = b.uid  
where a.ds='2020-08-10'  

Of course, this is only a theoretical solution.

In practice the normal scheme is simply to filter out the nulls; such a special key is rarely needed in day-to-day work.

How to deal with data skew when the skewed keys are genuinely needed:

  1. Sample the data to identify the set of skewed keys;
  2. Append a random number, following some rule, to the keys in that set;
  3. Perform the join; because the keys are scattered, data skew is avoided;
  4. In the result, strip the previously appended random number to restore the original keys.
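The steps above can be sketched in HiveQL. The table names (`big`, skewed on uid, joining a small `dim` table) and the `val` column are hypothetical, and the salt range of 0-9 is an assumption to be tuned per workload:

```sql
-- Steps 2/3: salt the skewed side with a random suffix 0..9, replicate the
-- small side ten times with matching suffixes, then join on the salted key.
-- (Step 1, sampling to find the skewed keys, is done beforehand.)
select
  split(t1.salted_uid, '_')[0] as uid,   -- step 4: strip the salt afterwards
  t2.val
from (
  select concat(uid, '_', cast(floor(rand() * 10) as string)) as salted_uid
  from big
) t1
join (
  select concat(uid, '_', cast(i as string)) as salted_uid, val
  from dim
  lateral view explode(array(0,1,2,3,4,5,6,7,8,9)) salt as i
) t2
on t1.salted_uid = t2.salted_uid;
```

Each skewed uid is now spread across ten reducers instead of one, at the cost of replicating the small table tenfold; this sketch also assumes uid itself contains no underscore.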

Of course, the optimizations above target the SQL itself; others are made through parameter settings, which are not covered in detail here.

But the core ideas of optimization are similar:

  1. Reduce data volume
  2. Avoid data skew
  3. Reduce the number of jobs
  4. Core point: optimize the overall implementation according to the business logic;
  5. Workaround: use dedicated query engines such as Presto or Impala, or replace MR/Tez with the Spark execution engine


Topics: hive SQL