Tutorial: How to Convert Data File Formats with DLA

Posted by growler123 on Thu, 09 May 2019 17:50:03 +0200

Preface

Data Lake Analytics (hereinafter referred to as DLA) lets users directly query and analyze data stored on OSS and TableStore with standard SQL statements.

For the same data, the storage format affects not only the storage space but also the execution time of DLA queries. Generally speaking, data of the same size performs better when stored as ORC or Parquet than as CSV. However, most users' data on OSS is stored in CSV format. To get better query efficiency, they previously had to convert the file format with a third-party tool, import the converted data files into OSS, and only then query them with DLA, which is rather cumbersome.

This article introduces how to convert between file formats directly in DLA.

Conversion method

Simply put, you create two tables in DLA, one matching the format of the original data file and one matching the target format, and then write the data to OSS in the format specified by the target table with an INSERT INTO target_table SELECT * FROM source_table statement.

Detailed example

The following uses the TPC-H orders.tbl file as an example to show how to convert a plain text file into an ORC file.

  1. Create table orders_txt in DLA and point LOCATION to the path of file orders.tbl in OSS.

    CREATE EXTERNAL TABLE orders_txt (
        O_ORDERKEY INT, 
        O_CUSTKEY INT, 
        O_ORDERSTATUS STRING, 
        O_TOTALPRICE DOUBLE, 
        O_ORDERDATE DATE, 
        O_ORDERPRIORITY STRING, 
        O_CLERK STRING, 
        O_SHIPPRIORITY INT, 
        O_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE LOCATION 'oss://mybucket/datasets/jinluo/test/convert/orders.tbl';
  2. Create a table orders_orc in DLA and point LOCATION to the target location in OSS. Note that the path must be an existing directory, that is, it must end with a /.

    CREATE EXTERNAL TABLE orders_orc (
        O_ORDERKEY INT,
        O_CUSTKEY INT,
        O_ORDERSTATUS STRING,
        O_TOTALPRICE DOUBLE,
        O_ORDERDATE DATE,
        O_ORDERPRIORITY STRING,
        O_CLERK STRING,
        O_SHIPPRIORITY INT,
        O_COMMENT STRING
    )
    STORED AS ORC LOCATION 'oss://mybucket/datasets/jinluo/test/convert/orders_orc/';
  3. Execute the INSERT...SELECT statement to insert the required data from the orders_txt table into orders_orc.

    INSERT INTO orders_orc SELECT * FROM orders_txt;
  4. When the INSERT statement has executed successfully, you will see the generated data files under the directory pointed to by the target table orders_orc on OSS.

    2018-11-22 10:27:15  0.00B Standard oss://mybucket/datasets/jinluo/test/convert/orders_orc/
    2018-11-22 10:59:26 1005.62MB Standard oss://mybucket/datasets/jinluo/test/convert/orders_orc/20181122_025537_6_558tf_0eec9b17-dbc3-4ba0-a9df-4024aa6c7d97
    2018-11-22 10:59:26 1005.74MB Standard oss://mybucket/datasets/jinluo/test/convert/orders_orc/20181122_025537_6_558tf_74016d12-a570-409d-b444-a216b69a3411
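Because the conversion runs as an ordinary INSERT...SELECT, it can also reshape the data on the way in. As an illustration (the filter below is hypothetical and not part of the original example), a WHERE clause converts only a subset of the rows:

```sql
-- Hypothetical variation: convert only open orders placed in 1995 or later
INSERT INTO orders_orc
SELECT * FROM orders_txt
WHERE O_ORDERSTATUS = 'O'
  AND O_ORDERDATE >= DATE '1995-01-01';
```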

Notes

  1. Each execution of the INSERT statement appends new files to the table directory; it never overwrites existing data files.
  2. Since DLA does not have delete permission on the user's bucket, if an INSERT statement fails, the user must manually delete any data files already generated under the LOCATION path.
  3. The number of files generated in the target directory depends on the actual cluster environment and is not fixed.
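For the second note, the cleanup itself has to be done with an OSS tool on the user's side. The sketch below is a hypothetical Python helper (not part of DLA or any OSS SDK) that parses an ossutil ls-style listing, like the one shown in step 4, and picks out the leftover data-file URLs under the table's LOCATION so they can be deleted:

```python
def leftover_files(ls_output, location):
    """Return object URLs under `location` from an `ossutil ls`-style
    listing, skipping the zero-byte directory placeholder (ends with '/')."""
    urls = []
    for line in ls_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        url = parts[-1]  # the object URL is the last whitespace-separated field
        if url.startswith(location) and not url.endswith("/"):
            urls.append(url)
    return urls

# Example using the listing from step 4 above
listing = """\
2018-11-22 10:27:15  0.00B Standard oss://mybucket/datasets/jinluo/test/convert/orders_orc/
2018-11-22 10:59:26 1005.62MB Standard oss://mybucket/datasets/jinluo/test/convert/orders_orc/20181122_025537_6_558tf_0eec9b17-dbc3-4ba0-a9df-4024aa6c7d97
2018-11-22 10:59:26 1005.74MB Standard oss://mybucket/datasets/jinluo/test/convert/orders_orc/20181122_025537_6_558tf_74016d12-a570-409d-b444-a216b69a3411
"""
for url in leftover_files(listing, "oss://mybucket/datasets/jinluo/test/convert/orders_orc/"):
    print(url)
```

This only identifies the files; the actual deletion is then performed with whatever OSS tool the user prefers.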

More articles

Using Data Lake Analytics to Clean Data from OSS to AnalyticDB
Data Lake Practice Based on Data Lake Analysis
How to Use Data Lake Analytics to Analyse Table Store Data in Ali Cloud
Geospatial Analysis Function of Data Lake Analytics

Topics: Database SQL