[original] how to find intersection in database?

Posted by RussellReal on Thu, 13 Jan 2022 18:02:43 +0100

scene

What are the files under Tag1, Tag2 and Tag3 in the many to many relationship table of a file and tag?

id tag_id file_id
1 TAG_1 F1
2 TAG_2 F1
3 TAG_3 F1
4 TAG_1 F2
5 TAG_2 F2
6 TAG_3 F3
7 TAG_1 F3

For example, find the compliance TAG_1 and TAG_2 all documents below. Through observation, it can be found that F1 and F3 meet the requirements. Because F3 contains tags, TAG_1, TAG_2, TAG_3. F1 contains TAG_1, TAG_2, TAG_3. F2 contains TAG_2. Therefore, F1 and F3 meet the requirements.

How to find intersection?

There are many ways to get the intersection. The way to get the intersection mentioned in the textbook is to use intersect and then get the intersection. However, many DBMS do not support this method of finding intersection. What should I do? Let's analyze a wave in detail.

Use intersect directly to find the intersection

Intuitively, find TAG_1 file: F1, F3. Get TAG_2 files: F1, F2,F3. Finally, the intersection of the obtained results is obtained to obtain the file ID results F1 and F3.

select file_id
from tag_label
where tag_id = 'TAG_1'
intersect
select file_id
from tag_label
where tag_id = 'TAG_1';

Use join query instead of intersect to find intersection

Because the keyword intersection operation is not owned by all DBMS, connection query can be used in MySQL to obtain the intersection.

  1. TAG_1 and TAG_2 connection operation

    SELECT TAG_1.`file_id` T1_F, TAG_2.`file_id` T2_F
    FROM (SELECT * FROM `tag_file` WHERE `tag_id` = 'TAG_1') TAG_1
    LEFT JOIN (SELECT * FROM `tag_file` WHERE `tag_id` = 'TAG_2') TAG_2 
    	ON TAG_1.`file_id` = TAG_2.`file_id`;
    
    T1_F T2_F
    F1 F1
    F2 F2
    F3 NULL
  2. TAG_1 and TAG_2 connect query and add filtering operation

    SELECT TAG_1.`file_id`
    FROM (SELECT * FROM `tag_file` WHERE `tag_id` = 'TAG_1') TAG_1
    LEFT JOIN (SELECT * FROM `tag_file` WHERE `tag_id` = 'TAG_2') TAG_2 
    	ON TAG_1.`file_id` = TAG_2.`file_id`
    WHERE TAG_1.`file_id` IS NOT NULL AND  TAG_2.`file_id` IS NOT NULL;
    
    file_id
    F1
    F2

Using join query is a costly query operation. Although it can be queried in this way, it is not the best. Second, if multiple tag ID S need to be built, a long SQL query statement needs to be constructed, which is an elegant query operation. To sum up, a new query operation is needed to solve this problem.

Use grouping to find intersections

Because this method of finding intersection is not supported by many DBMS, this method is not feasible. What should I do? It's better to change the idea and group according to the file name (file_id). Then judge whether the file belongs to these tags at the same time according to the length of the tag, so as to obtain the file ID list. It's useless to say more. It's better to do a few experiments.

  1. See what files contain a list of tags?

    SELECT *
    FROM tag_file
    WHERE `tag_id` IN ('TAG_1', 'TAG_2');
    
    id tag_id file_id
    1 TAG_1 F1
    2 TAG_2 F1
    4 TAG_1 F2
    5 TAG_2 F2
    7 TAG_1 F3

    By analyzing these results, it is found that the result in the file id column is consistent with TAG_1 or tag compliant_ 2, the result of union is obtained. But if you want to ask for intersection, how should you proceed?

  2. Group by file ID and check the length of tag

    SELECT `file_id`, COUNT(`tag_id`) tag_count
    FROM tag_file
    WHERE `tag_id` IN ('TAG_1', 'TAG_2')
    GROUP BY `file_id`;
    
    file_id tag_count
    F1 2
    F2 2
    F3 1

    According to the nature of GROUP BY, you can know that GROUP BY is grouped according to the queried result set, so you can avoid the performance loss of connection query. Secondly, a rule can be found, because grouping is based on search results, so tag_ The length of count cannot exceed the length of the tag list to be queried. According to the scenario, that is, tag_ The result of count must be less than or equal to 2, and the result with length equal to 2 must meet all labels. Therefore, the intersection can be obtained in this way.

  3. Find the list of qualified files

    SELECT `file_id`
    FROM tag_file
    WHERE `tag_id` IN ('TAG_1', 'TAG_2')
    GROUP BY `file_id`
    HAVING COUNT(`tag_id`) = 2;
    
    file_id
    F1
    F2

    Finally, the result is obtained in this way.

Construct a dynamic SQL using MyBatis:

<select id="getFileIdListByTagIds" resultType="String" parameterType="java.lang.List">
    <bind name="tagListSize" value="tagList.size()" />
    SELECT `file_id`
    FROM tag_file
    WHERE `tag_id` IN
        <foreach collection="tagList" item="tagId" open="(" sparator="," close=")">
            #{tagId}
        </foreach>
    GROUP BY `file_id`
    HAVING COUNT(`tag_id`) = #{tagListSize};
</select>

Method evaluation

This method is mainly used for intersection query within the same table, but it is not suitable for a wide range of intersection operations.

Topics: MySQL