Hive tutorial (06) - Hive SerDe serialization and deserialization

Posted by Jackanape on Tue, 22 Feb 2022 04:24:36 +0100

01 Introduction

In the previous tutorial we took a first look at Hive's data model, data types, and operation commands; interested readers can refer back to it.

Now that we know where Hive stores data (the data model), what kinds of data it stores, and how to operate on it (commands), a natural question arises: the data ultimately ends up in files, yet it originates as "objects". How does that conversion happen? That is what this article explains.

02 SerDe

2.1 Concept

What is SerDe? It is actually a contraction of two terms:

  • Serializer (serialization): the process of converting an object into a sequence of bytes
  • Deserializer (deserialization): the process of converting a sequence of bytes back into an object

The process of serialization and deserialization is as follows:

  • Serialization: Row object -> Serializer -> OutputFileFormat -> HDFS file
  • Deserialization: HDFS file -> InputFileFormat -> Deserializer -> Row object

SerDe allows Hive to read data from a table and write it back to HDFS in any custom format; a custom SerDe implementation can be developed for a specific data format.

When creating a table, Hive lets you specify how the data is serialized and deserialized. The general template is as follows:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
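
For example, here is a minimal sketch (table and column names are illustrative) of a table that stores comma-separated text; since no SerDe class is named, Hive falls back to its default, LazySimpleSerDe:

CREATE TABLE IF NOT EXISTS person (
          id BIGINT,
          name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;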

2.2 Classification

Hive SerDes fall into two categories, built-in and custom, which are described below.

2.2.1 Built-in SerDe types

Hive uses the following FileFormat types to read and write HDFS files:

  • TextInputFormat / HiveIgnoreKeyTextOutputFormat: reads and writes data in plain text file format.
  • SequenceFileInputFormat / SequenceFileOutputFormat: reads and writes data in Hadoop SequenceFile format.
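
As a quick illustration of these formats (table names are illustrative), the file format is chosen with STORED AS when the table is created:

CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;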

The built-in SerDe types include:

  • Avro
  • ORC
  • RegEx
  • Thrift
  • Parquet
  • CSV
  • JsonSerDe
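
Below is a hedged sketch of two common ways to pick one of these built-in SerDes when creating a table; the table names are illustrative, and OpenCSVSerde is assumed to be available (it ships with Hive 0.14 and later):

-- The file format implies its SerDe, e.g. ORC uses the ORC SerDe:
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE) STORED AS ORC;

-- Or name the SerDe class explicitly, e.g. the CSV SerDe:
CREATE TABLE sales_csv (id STRING, amount STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;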

2.2.1.1 MetadataTypedColumnsetSerDe

This SerDe type is used to read and write records separated by a delimiter, for example comma-separated (CSV) or tab-separated records.
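
A hedged sketch of naming it explicitly is shown below; the table name is illustrative, and the 'serialization.format' property used to set the separator is an assumption based on older Hive versions:

CREATE TABLE csv_demo (col1 STRING, col2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',')
STORED AS TEXTFILE;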

2.2.1.2 LazySimpleSerDe

This is the default SerDe type. It reads the same data formats as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol.

Because it creates objects in a lazy fashion, it offers better performance.

Since Hive 0.14.0, it also supports specifying a character encoding when reading and writing data. For example:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK')

If the configuration property hive.lazysimple.extended_boolean_literal is set to true (supported in Hive 0.14.0 and later), LazySimpleSerDe treats 'T', 't', 'F', 'f', '1', and '0' as valid Boolean literals. The property defaults to false, in which case only 'true' and 'false' are accepted as Boolean literals.
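
For example, the behavior can be enabled for the current session:

SET hive.lazysimple.extended_boolean_literal=true;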

2.2.1.3 Thrift SerDe

This SerDe type can be used to read and write Thrift-serialized objects. Note that the class file of the Thrift object must be loaded first.
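
The sketch below shows the general pattern; the generated Thrift class com.example.MyThriftRecord is a hypothetical placeholder that must already be on Hive's classpath, the exact protocol class name depends on your Thrift version, and the column list is omitted because the SerDe derives the schema from the Thrift class:

CREATE TABLE thrift_demo
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES (
  'serialization.class' = 'com.example.MyThriftRecord',
  'serialization.format' = 'org.apache.thrift.protocol.TBinaryProtocol')
STORED AS SEQUENCEFILE;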

2.2.1.4 DynamicSerDe

This SerDe type can also be used to read and write Thrift-serialized objects.

It understands Thrift DDL, so the schema of an object can be provided at run time.

In addition, it supports several different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol.
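
A hedged sketch of pairing DynamicSerDe with TCTLSeparatedProtocol for delimited text (table name is illustrative):

CREATE TABLE dyn_demo (col1 STRING, col2 INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = 'org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol')
STORED AS TEXTFILE;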

2.2.2 Custom SerDe types

2.2.2.1 Step 1: write the custom SerDe

First, define a class that extends the abstract class AbstractSerDe and implements the initialize and deserialize methods.

Note: in the following code, Hive uses an ObjectInspector to analyze the internal structure of a row object and the structure of each column. Specifically, ObjectInspector provides a unified way to access complex objects, which may be stored in memory in several formats:

  • An instance of a Java class (Thrift or native Java)
  • Standard Java objects (for example, java.util.List is used to represent Struct and Array, and java.util.Map to represent Map)
  • Lazily initialized objects

A complex object can therefore be represented by the pair (ObjectInspector, Java object), which gives a way to access the object's internal fields without depending on how the object is stored. For serialization purposes, Hive recommends creating a custom ObjectInspector to go with a custom SerDe, and the SerDe should provide two constructors: a no-argument constructor and a regular constructor.

Example code:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MySerDe extends AbstractSerDe {
    // Column metadata read from the table properties
    private List<String> columnNames = null;
    private List<TypeInfo> columnTypes = null;
    private ObjectInspector objectInspector = null;
    // Separators and the token that represents NULL
    private String nullString = null;
    private String lineSep = null;
    private String kvSep = null;
    @Override
    public void initialize(Configuration conf, Properties tbl)
            throws SerDeException {
        // Separators are hard-coded here: one key=value pair per line
        lineSep = "\n";
        kvSep = "=";
        nullString = tbl.getProperty(Constants.SERIALIZATION_NULL_FORMAT, "");
        // Read Column Names
        String columnNameProp = tbl.getProperty(Constants.LIST_COLUMNS);
        if (columnNameProp != null && columnNameProp.length() > 0) {
            columnNames = Arrays.asList(columnNameProp.split(","));
        } else {
            columnNames = new ArrayList<String>();
        }
        // Read Column Types
        String columnTypeProp = tbl.getProperty(Constants.LIST_COLUMN_TYPES);
        // default all string
        if (columnTypeProp == null) {
            String[] types = new String[columnNames.size()];
            Arrays.fill(types, 0, types.length, Constants.STRING_TYPE_NAME);
            columnTypeProp = StringUtils.join(types, ":");
        }
        columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProp);
        // Check column and types equals
        if (columnTypes.size() != columnNames.size()) {
            throw new SerDeException("len(columnNames) != len(columntTypes)");
        }
        // Create ObjectInspectors from the type information for each column
        List<ObjectInspector> columnOIs = new ArrayList<ObjectInspector>();
        ObjectInspector oi;
        for (int c = 0; c < columnNames.size(); c++) {
            oi = TypeInfoUtils
                    .getStandardJavaObjectInspectorFromTypeInfo(columnTypes
                            .get(c));
            columnOIs.add(oi);
        }
        objectInspector = ObjectInspectorFactory
                .getStandardStructObjectInspector(columnNames, columnOIs);
    }
    @Override
    public Object deserialize(Writable wr) throws SerDeException {
        // Split to kv pair
        if (wr == null)
            return null;
        Map<String, String> kvMap = new HashMap<String, String>();
        Text text = (Text) wr;
        for (String kv : text.toString().split(lineSep)) {
            String[] pair = kv.split(kvSep);
            if (pair.length == 2) {
                kvMap.put(pair[0], pair[1]);
            }
        }
        // Set according to col_names and col_types
        ArrayList<Object> row = new ArrayList<Object>();
        String colName = null;
        TypeInfo type_info = null;
        Object obj = null;
        for (int i = 0; i < columnNames.size(); i++) {
            colName = columnNames.get(i);
            type_info = columnTypes.get(i);
            obj = null;
            if (type_info.getCategory() == ObjectInspector.Category.PRIMITIVE) {
                PrimitiveTypeInfo p_type_info = (PrimitiveTypeInfo) type_info;
                switch (p_type_info.getPrimitiveCategory()) {
                case STRING:
                    obj = StringUtils.defaultString(kvMap.get(colName), "");
                    break;
                case LONG:
                    try {
                        obj = Long.parseLong(kvMap.get(colName));
                    } catch (NumberFormatException e) {
                        // leave the column as NULL for malformed numbers
                    }
                    break;
                case INT:
                    // INT columns must be Integer objects to match the
                    // standard Java ObjectInspector created in initialize()
                    try {
                        obj = Integer.parseInt(kvMap.get(colName));
                    } catch (NumberFormatException e) {
                        // leave the column as NULL for malformed numbers
                    }
                    break;
                default:
                    break;
                }
            }
            row.add(obj);
        }
        return row;
    }
    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return objectInspector;
    }
    @Override
    public SerDeStats getSerDeStats() {
        // Statistics are not collected in this example
        return null;
    }
    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }
    @Override
    public Writable serialize(Object arg0, ObjectInspector arg1)
            throws SerDeException {
        // This example SerDe is read-only, so serialization is not implemented
        return null;
    }
}

2.2.2.2 Step 2: add the SerDe jar to Hive

Load the jar that contains the custom SerDe in the Hive CLI:

hive> ADD JAR MySerDe.jar;
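
The loaded jars can then be checked with:

hive> LIST JARS;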

2.2.2.3 Step 3: use the SerDe

When creating the table, specify the custom SerDe class with ROW FORMAT SERDE:

CREATE EXTERNAL TABLE IF NOT EXISTS teacher ( 
          id BIGINT, 
          name STRING,
          age INT)
ROW FORMAT SERDE 'com.coder4.hive.MySerDe'
STORED AS TEXTFILE
LOCATION '/usr/hive/text/'
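
Once the jar is added and the table is created, it can be queried like any other table, provided the files under the LOCATION follow the key=value layout that MySerDe expects. A minimal check:

SELECT id, name, age FROM teacher LIMIT 10;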

03 Conclusion

This article mainly draws on the following references:

  • https://www.cnblogs.com/rrttp/p/9024153.html
  • https://www.hadoopdoc.com/hive/hive-serde

End of this article!

Topics: Hadoop hive hdfs