The impact of large compressed files on the query performance of Impala

Posted by shane0714 on Sat, 21 Mar 2020 10:54:37 +0100

Hadoop/HDFS/MapReduce/Impala is designed to store and process very large amounts of data, on the order of terabytes or petabytes. A large number of small files hurts query performance, because the NameNode has to keep metadata for every HDFS file. When a query touches many partitions or files at once, Impala has to fetch the file list and read the file information one by one, which not only slows the query down significantly but can also exceed the operating system's limit on open file descriptors and cause the query to fail.
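
A quick way to see how many files actually back a table or an HDFS directory is sketched below; the path and table name are placeholders for whatever you want to inspect:

    # Count directories, files and total size under a table's HDFS location
    hdfs dfs -count -h /user/hive/warehouse/test.db/some_table

    # Or ask Impala directly which files (and sizes) make up a table
    impala-shell -q "SHOW FILES IN test.some_table"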

So, does that mean we should keep files as large as possible? Of course not. Very large files hurt performance as well: in most cases Hadoop users compress the data stored in HDFS to save disk space, and if a compressed file is very large, the time spent decompressing it will also slow down the query.
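
To get a rough feel for how expensive that decompression is on its own, you can time bzip2 locally on a copy of the compressed file (the file name here is illustrative):

    # -d decompresses, -k keeps the original .bz2 file
    time bzip2 -dk flights_big.csv.bz2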

To demonstrate this, I ran the following test in a CDH environment (a sketch of the corresponding commands follows the list):

1. I prepared a 565MB file in plain text format and the same data compressed with bzip2 at 135MB. The data can be downloaded from Kaggle's Flight Delay Dataset

2. I created a table named bzip2_smallfiles_4 on top of four copies of the compressed file, and another table named bzip2_smallfiles_8 on top of eight copies

3. Then I concatenated the text file four times into one larger text file, compressed it with bzip2 (bringing it to about 510MB), and created a table named bzip2_bigfile_4 on top of it

4. Same as step 3, but I concatenated the file eight times, producing an even larger file that compressed to about 1.1GB, and created a table named bzip2_bigfile_8 on top of it

5. Finally, I ran a "SELECT COUNT(*) FROM" query against each of the four tables and compared the results
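
For reference, the setup above roughly corresponds to the commands below. The paths, file names, and the abbreviated column list are illustrative (the real flights.csv has many more columns), so treat this as a sketch rather than the exact commands I ran:

    # Step 1: compress the ~565MB text file (-k keeps the original)
    bzip2 -k flights.csv                       # produces flights.csv.bz2 (~135MB)

    # Step 2: four / eight copies of the small compressed file
    hdfs dfs -mkdir -p /tmp/bzip2_smallfiles_4 /tmp/bzip2_smallfiles_8
    for i in 1 2 3 4; do
      hdfs dfs -put flights.csv.bz2 /tmp/bzip2_smallfiles_4/flights_$i.csv.bz2
    done
    for i in 1 2 3 4 5 6 7 8; do
      hdfs dfs -put flights.csv.bz2 /tmp/bzip2_smallfiles_8/flights_$i.csv.bz2
    done

    # Step 3: one big file made of four copies, compressed (~510MB)
    cat flights.csv flights.csv flights.csv flights.csv > flights_x4.csv
    bzip2 flights_x4.csv
    hdfs dfs -mkdir -p /tmp/bzip2_bigfile_4
    hdfs dfs -put flights_x4.csv.bz2 /tmp/bzip2_bigfile_4/

    # Step 4: one even bigger file made of eight copies, compressed (~1.1GB)
    cat flights.csv flights.csv flights.csv flights.csv \
        flights.csv flights.csv flights.csv flights.csv > flights_x8.csv
    bzip2 flights_x8.csv
    hdfs dfs -mkdir -p /tmp/bzip2_bigfile_8
    hdfs dfs -put flights_x8.csv.bz2 /tmp/bzip2_bigfile_8/

    # An external text table on top of each directory (column list abbreviated)
    impala-shell -q "
      CREATE EXTERNAL TABLE test.bzip2_smallfiles_4 (
        flight_year INT, flight_month INT, flight_day INT,
        airline STRING, flight_number INT
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/tmp/bzip2_smallfiles_4';"
    # ... repeat for the other three tables, changing only the name and LOCATION

    # Step 5: the query used for the comparison
    impala-shell -q "SELECT COUNT(*) FROM test.bzip2_smallfiles_4"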

Unsurprisingly, the query against table bzip2_bigfile_8 was the slowest. Here is the test data for the four tables:

bzip2_smallfiles_4:

  • 4 hosts running the query
  • Query run time: about 53 seconds
  • Maximum scan time: 52 seconds
  • Maximum decompression time: 49 seconds
Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  26s464ms  52s687ms  23.28M          -1  40.32 MB      160.00 MB  test.bzip2_smallfiles_4

    Query Timeline
...
      Rows available: 53.86s (53861836202)
      First row fetched: 53.87s (53869836178)
      Unregister query: 53.87s (53874836163)

    Fragment F00
      Instance fc48dc3e014eb7a5:7d7a2dc100000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2 
            - DecompressionTime: 49.45s (49449847498)

bzip2_smallfiles_8:

  • 4 hosts running the query
  • Query run time: about 54.69 seconds
  • Maximum scan time: 54.196 seconds
  • Maximum decompression time: 51.18 seconds
Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  52s514ms  54s196ms  46.55M          -1  40.32 MB      160.00 MB  test.bzip2_smallfiles_8 

    Query Timeline
...
      Rows available: 54.36s (54359822792)
      First row fetched: 54.68s (54683821736)
      Unregister query: 54.69s (54688821720)

    Fragment F00
      Instance 5642f67b9a975652:c19438dc00000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2 
            - DecompressionTime: 51.18s (51183849937)

bzip2_bigfile_4:

  • 4 hosts running the query
  • Query run time: about 1 minute 50 seconds
  • Maximum scan time: 1 minute 49 seconds
  • Maximum decompression time: about 1 minute 44 seconds (1.7m in the profile)
Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  27s394ms      1m49s  23.28M          -1  40.15 MB      176.00 MB  test.bzip2_bigfile_4 

    Query Timeline
...
      Rows available: 1.8m (109781665214)
      First row fetched: 1.8m (110408663300)
      Unregister query: 1.8m (110413663284)

    Fragment F00
      Instance 4545c110dbca4c9c:6cd1db1100000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2 
            - DecompressionTime: 1.7m (104339662922)

bzip2_bigfile_8:

  • 4 hosts running the query
  • Query run time: about 3 minutes 36 seconds
  • Maximum scan time: 3 minutes 35 seconds
  • Maximum decompression time: about 3 minutes 24 seconds (3.4m in the profile)
Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  53s902ms      3m35s  46.55M          -1  40.32 MB      176.00 MB  test.bzip2_bigfile_8 

    Query Timeline
...
      Rows available: 3.6m (215992297509)
      First row fetched: 3.6m (216480295920)
      Unregister query: 3.6m (216484295907)

    Fragment F00
      Instance 8f42a3b6ca6cf1cf:72fd65e100000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2
             - DecompressionTime: 3.4m (203596406406)

The reason I chose the bzip2 compression format is that bzip2 is splittable: all of my test queries ran on four hosts, even for the two large bzip2 files.
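
If you want to verify this on your own runs, impala-shell can print the runtime profile after a query; the profile contains both the per-operator summary (the Hosts column of the SCAN HDFS operator shows across how many nodes the file was split) and the per-instance DecompressionTime counters quoted above:

    # -p / --show_profiles prints the full query profile after the results
    impala-shell -p -q "SELECT COUNT(*) FROM test.bzip2_bigfile_8"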

As we can see, for Impala to read the largest, 1.1GB bzip2 file, nearly three and a half minutes went into decompressing it alone. For table bzip2_smallfiles_8, although there are more files to decompress, they can be decompressed in parallel on multiple hosts, so the impact on performance is small.

To sum up, lots of tiny files (KB-sized, or just a few MB) should be avoided in Hadoop. However, having only a few files whose compressed size is very large is not good either. Ideally, we should keep file sizes as close to the block size as possible (256MB by default in CDH) for the best performance.
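
If you already have data in one huge compressed text file, one simple way to follow this advice is to split the uncompressed text into roughly block-sized chunks and compress each chunk separately; the sizes, file names, and target directory below are illustrative:

    # Split the big text file into ~256MB pieces without breaking lines
    # (-C keeps whole lines per chunk, -d gives numeric suffixes)
    split -C 256M -d flights_x8.csv flights_part_

    # Compress each chunk individually; any single .bz2 then stays cheap to
    # decompress, and the files can still be scanned in parallel
    for f in flights_part_*; do bzip2 "$f"; done
    hdfs dfs -mkdir -p /tmp/bzip2_blocksized
    hdfs dfs -put flights_part_*.bz2 /tmp/bzip2_blocksized/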

Translated from: BIG COMPRESSED FILE WILL AFFECT QUERY PERFORMANCE FOR IMPALA

Topics: Big Data Fragment Hadoop