Differences between in and not in, exists and not exists in SQL and performance analysis

Posted by The Little Guy on Mon, 28 Feb 2022 01:45:33 +0100

1. in and exists

In is a hash connection between the outer table and the inner table, while exists is a loop loop loop for the outer table. Each loop loop loop queries the inner table. It has always been inaccurate to say that exists is more efficient than in.

If the two tables of the query are of the same size, there is little difference between in and exists; If one of the two tables is smaller and the other is larger, use exists for the large sub query table and in for the small sub query table;

For example: table a (small table), table B (large table)

select * from A where cc in(select cc from B)  -->Low efficiency, used A On the table cc Index of the column;

select * from A where exists(select cc from B where cc=A.cc)  -->High efficiency, used B On the table cc Index of the column.

contrary:

select * from B where cc in(select cc from A)  -->High efficiency, used B On the table cc Index of column

select * from B where exists(select cc from A where cc=B.cc)  -->Low efficiency, used A On the table cc Index of the column.

2. not in and not exists

Not in is not logically equivalent to not exists. If you misuse not in, be careful that your program has fatal bugs. See the following example:

create table #t1(c1 int,c2 int);

create table #t2(c1 int,c2 int);

insert into #t1 values(1,2);

insert into #t1 values(1,3);

insert into #t2 values(1,2);

insert into #t2 values(1,null);

 

select * from #t1 where c2 not in(select c2 from #t2); -- > execution result: None

select * from #t1 where not exists(select 1 from #t2 where #t2.c2=#t1.c2) -- > execution result: 1 3

As you can see, not in has an unexpected result set and there is a logical error. If you look at the execution plans of the above two select statements, it will also be different. The latter uses hash_aj, so try not to use not in (it will call the subquery), but try to use not exists (it will call the associated subquery).

If any record returned in the subquery contains a null value, the query will not return any records. If the subquery field has non empty restrictions, you can use not in at this time, and you can prompt it to use hasg_aj or merge_aj connection.

If not in is used in the query statement, full table scanning is performed for both internal and external tables without index; The sub query of not exists can still use the index on the table. Therefore, no matter which table is large, using not exists is faster than not in.

3. Difference between in and =

select name from student where name in('zhang','wang','zhao');

And

select name from student where name='zhang' or name='wang' or name='zhao'

The results are the same.

Other analysis:

1. Implementation process of exists

select * from t1 where exists ( select null from t2 where y = x ) 

It can be understood as:

for x in ( select * from t1 ) loop 

if ( exists ( select null from t2 where y = x.x ) then 
OUTPUT THE RECORD 
end if 
end loop 

Performance difference between in and exists:

If the result set records obtained from the sub query are small, and the table in the main query is large and has an index, use in. On the contrary, if the main query records in the outer layer are small, and the table in the sub query is large and has an index, use exists.

In fact, the difference between in and exists is mainly caused by the change of the driving order (which is the key to the performance change). If it is exists, the outer table will be used as the driving table and will be accessed first. If it is in, the sub query will be executed first. Therefore, we will take the rapid return of the driving table as the goal, and the relationship between the index and the result set will be considered

IN addition, NULL is not processed when IN

For example: select 1 from dual where null in (0,1,2,null) is null

2.NOT IN and NOT EXISTS:

Execution process of NOT EXISTS

select ..... from rollup R  where not exists ( select 'Found' from title T where R.source_id = T.Title_ID); 

It can be understood as:

for x in ( select * from rollup ) loop 
if ( not exists ( that query ) ) then 
OUTPUT 
end if; 
end loop; 

Note: NOT EXISTS and NOT IN cannot completely replace each other, depending on the specific requirements. If the selected column can be empty, it cannot be replaced.

In the official account, Xiao Le replied to "Java", and got the Java interview question and the surprise surprise package.

For example, see the following sentences to see their differences:

select x,y from t; 

Query x and y data as follows:

x y 
------ ------ 
1 3 
3 1 
1 2 
1 1 
3 1 
5 

The query results using not in and not exists are as follows:

select * from t where x not in (select y from t t2 ) ;

Query no results: no rows

select * from t where not exists (select null from t t2 where t2.y=t.x ) ;

The query result is:

x y 
------ ------ 
5 NULL 

Therefore, it should be determined by specific needs

Performance differences between not in and not exists:

Not in is used only when the field after the select keyword has a not null constraint or such hint in the sub query. In addition, if the table in the main query is large and the table in the sub query is small but there are many records, you should use not in and anti hash join

If there are few records in the main query table and many records in the sub query table, and there are indexes, you can use not exists. In addition, you can also use / * + hash for not in_ AJ * / or external connection + is null

NOT IN is better in cost based applications

For example:

select ..... 
from rollup R 
where not exists ( select 'Found' from title T 
where R.source_id = T.Title_ID); 

Change to (better)

select ...... 
from title T, rollup R 
where R.source_id = T.Title_id(+) 
and T.Title_id is null; 

Or (better)

sql> select /*+ HASH_AJ */ ... 
from rollup R 
where ource_id NOT IN ( select ource_id 
from title T 
where ource_id IS NOT NULL ) 

Discuss IN and EXISTS.

select * from t1 where x in ( select y from t2 ) 

In fact, it can be understood as:

select * 
from t1, ( select distinct y from t2 ) t2 
where t1.x = t2.y; 

——If you have some experience in SQL optimization, it is natural to think from this sentence that T2 must not be a large table, because T2 needs to be "uniquely sorted" for the whole table. If T2 is large, the sorting performance is unbearable. But t1 can be very big. Why? The most popular understanding is because t1 x=t2. Y can go to the index.

But this is not a good explanation. Imagine if t1 X and t2 Y has an index. We know that the index is an orderly structure, so the best solution between t1 and t2 is merge join. In addition, if t2 There is an index on y, and the sorting performance of t2 is also greatly improved.

select * from t1 where exists ( select null from t2 where y = x ) 

It can be understood as:

for x in ( select * from t1 ) 
loop 
if ( exists ( select null from t2 where y = x.x ) 
then 
OUTPUT THE RECORD! 
end if 
end loop 

——This is easier to understand. t1 is always a table scan! Therefore, t1 must not be a large table, while t2 can be large, because y=x.x can go to t2 Index of Y.

Based on the above discussion of IN/EXISTS, we can draw a basic general conclusion: IN is suitable for the situation of large appearance and small inner appearance; EXISTS is suitable for small appearance and large inner appearance.

We should make corresponding optimization according to the actual situation. We can't absolutely say who has high efficiency and who has low efficiency. Everything is relative

Topics: Database SQL Interview