01 introduction
In the previous tutorials, we gained a preliminary understanding of Hive's data model, data types, and operation commands. Interested readers can refer to:
- Hive tutorial (01) - getting to know hive
- Hive tutorial (02) - hive installation
- Hive tutorial (03) - hive data model
- Hive tutorial (04) - hive data types
- Hive tutorial (05) - hive command summary
Now that we know where Hive stores data (the data model), what types of data it stores, and how to operate on it (commands), a question arises: the data is ultimately stored in files, but the source of the data is an "object". How does that conversion happen? This article explains.
02 SerDe
2.1 concept
What is SerDe? It is actually a combination of two abbreviations:
- Serializer: the process of converting an object into a sequence of bytes
- Deserializer: the process of converting a byte sequence back into an object
The two processes look like this:
- Serialization: Row object -> Serializer -> OutputFileFormat -> HDFS file
- Deserialization: HDFS file -> InputFileFormat -> Deserializer -> Row object
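Before looking at Hive's own classes, the two directions can be illustrated with plain Java object serialization (a generic sketch of the concept, not Hive's mechanism; the class and method names here are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerDeDemo {

    // Round-trip a "row object" through a byte sequence:
    // serialization (object -> bytes) then deserialization (bytes -> object).
    static String roundTrip(String row) throws Exception {
        // Serialization: object -> byte sequence
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(row);
        }
        byte[] bytes = buf.toByteArray(); // this is what would land in a file

        // Deserialization: byte sequence -> object
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes))) {
            return (String) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("1,Tom")); // prints "1,Tom"
    }
}
```

Hive's SerDes do the same kind of round trip, but between row objects and the bytes of an HDFS file, via the table's InputFormat/OutputFormat.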
SerDe allows Hive to read data from a table and write it back to HDFS in any custom format; you can develop a custom SerDe implementation for a specific data format.
When creating a table in Hive, you specify how its data is serialized and deserialized. The template is as follows:
```sql
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
```
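Filling in the template, a concrete definition might look like the following (the table, columns, and path here are hypothetical examples):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS person (
  id BIGINT COMMENT 'person id',
  name STRING COMMENT 'person name'
)
COMMENT 'demo table'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/person/';
```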
2.2 classification
Hive SerDes fall into two groups: built-in and custom. Let's look at each below.
2.2.1 built-in SerDe types
Hive uses the following FileFormat types to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: used to read and write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: used to read and write Hadoop's SequenceFile format.
The built-in SerDe types are:
- Avro
- ORC
- RegEx
- Thrift
- Parquet
- CSV
- JsonSerDe
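As an illustration, a table backed by the built-in JsonSerDe could be declared as follows (a sketch: in many Hive versions the JsonSerDe class ships in the hive-hcatalog-core jar, which may need to be added first; the jar path and table name are hypothetical):

```sql
ADD JAR /path/to/hive-hcatalog-core.jar;

CREATE TABLE person_json (
  id BIGINT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```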
2.2.1.1 MetadataTypedColumnsetSerDe
This SerDe type is used to read and write records separated by a delimiter, for example comma-separated (CSV) or tab-separated records.
2.2.1.2 LazySimpleSerDe
This is the default SerDe type. It reads the same data formats as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol, but it creates objects lazily, so it has better performance.
Since Hive 0.14.0, it supports specifying a character encoding when reading and writing data. For example:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');
If the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later), LazySimpleSerDe can treat 't', 'T', 'f', 'F', '1', and '0' as valid boolean literals. This property defaults to false, so only 'true' and 'false' are treated as valid boolean literals.
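For example (a sketch; the table and its boolean column are hypothetical):

```sql
-- allow 't', 'f', '1', '0', etc. to be parsed as booleans (Hive 0.14.0+)
SET hive.lazysimple.extended_boolean_literal=true;

SELECT name FROM person WHERE is_active = true;
```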
2.2.1.3 Thrift SerDe
This type of Hive SerDe can be used to read and write Thrift serialized objects. Note that for Thrift objects, the corresponding class file must be loaded first.
2.2.1.4 dynamic SerDe
We can also use this SerDe type to read and write Thrift serialized objects.
It can parse Thrift DDL statements, so the schema of the objects can be provided at run time.
In addition, it supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol.
2.2.2 custom SerDe type
2.2.2.1 step 1: customize SerDe
First, define a class that extends the abstract class AbstractSerDe and implements the initialize and deserialize methods.
Note: in the code below, Hive uses an ObjectInspector to analyze the internal structure of a row object and the structure of its columns. Specifically, ObjectInspector provides a unified way to access complex objects, which can be stored in memory in a variety of formats:
- An instance of a Java class (Thrift or plain Java)
- Standard Java objects: we use java.util.List to represent Struct and Array, and java.util.Map to represent Map
- Lazily initialized objects
In addition, a complex object can be represented by the pair (ObjectInspector, Java object). This gives us a way to access the internal fields of an object without knowing anything about its structure. For serialization purposes, Hive suggests creating a custom ObjectInspector to go with a custom SerDe. A SerDe has two constructors: a no-argument constructor and a regular constructor.
Example code:
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MySerDe extends AbstractSerDe {

    // column metadata
    private List<String> columnNames = null;
    private List<TypeInfo> columnTypes = null;
    private ObjectInspector objectInspector = null;

    // separators
    private String nullString = null;
    private String lineSep = null;
    private String kvSep = null;

    @Override
    public void initialize(Configuration conf, Properties tbl)
            throws SerDeException {
        // read separators
        lineSep = "\n";
        kvSep = "=";
        nullString = tbl.getProperty(Constants.SERIALIZATION_NULL_FORMAT, "");

        // read column names
        String columnNameProp = tbl.getProperty(Constants.LIST_COLUMNS);
        if (columnNameProp != null && columnNameProp.length() > 0) {
            columnNames = Arrays.asList(columnNameProp.split(","));
        } else {
            columnNames = new ArrayList<String>();
        }

        // read column types; default all columns to string
        String columnTypeProp = tbl.getProperty(Constants.LIST_COLUMN_TYPES);
        if (columnTypeProp == null) {
            String[] types = new String[columnNames.size()];
            Arrays.fill(types, 0, types.length, Constants.STRING_TYPE_NAME);
            columnTypeProp = StringUtils.join(types, ":");
        }
        columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProp);

        // names and types must match
        if (columnTypes.size() != columnNames.size()) {
            throw new SerDeException("len(columnNames) != len(columnTypes)");
        }

        // create ObjectInspectors from the type information for each column
        List<ObjectInspector> columnOIs = new ArrayList<ObjectInspector>();
        ObjectInspector oi;
        for (int c = 0; c < columnNames.size(); c++) {
            oi = TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(
                    columnTypes.get(c));
            columnOIs.add(oi);
        }
        objectInspector = ObjectInspectorFactory
                .getStandardStructObjectInspector(columnNames, columnOIs);
    }

    @Override
    public Object deserialize(Writable wr) throws SerDeException {
        // split the record into key-value pairs
        if (wr == null) {
            return null;
        }
        Map<String, String> kvMap = new HashMap<String, String>();
        Text text = (Text) wr;
        for (String kv : text.toString().split(lineSep)) {
            String[] pair = kv.split(kvSep);
            if (pair.length == 2) {
                kvMap.put(pair[0], pair[1]);
            }
        }

        // build the row according to the column names and types
        ArrayList<Object> row = new ArrayList<Object>();
        String colName = null;
        TypeInfo type_info = null;
        Object obj = null;
        for (int i = 0; i < columnNames.size(); i++) {
            colName = columnNames.get(i);
            type_info = columnTypes.get(i);
            obj = null;
            if (type_info.getCategory() == ObjectInspector.Category.PRIMITIVE) {
                PrimitiveTypeInfo p_type_info = (PrimitiveTypeInfo) type_info;
                switch (p_type_info.getPrimitiveCategory()) {
                case STRING:
                    obj = StringUtils.defaultString(kvMap.get(colName), "");
                    break;
                case LONG:
                case INT:
                    try {
                        obj = Long.parseLong(kvMap.get(colName));
                    } catch (Exception e) {
                        // leave the column as null on parse failure
                    }
                }
            }
            row.add(obj);
        }
        return row;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return objectInspector;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector oi)
            throws SerDeException {
        // this example only supports deserialization
        return null;
    }
}
```
2.2.2.2 step 2: add the SerDe jar in Hive
Add the jar containing the custom SerDe:

hive> ADD JAR MySerDe.jar;
2.2.2.3 step 3: use the SerDe
When creating the table, use ROW FORMAT SERDE to specify the custom SerDe class:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS teacher (
  id BIGINT,
  name STRING,
  age INT
)
ROW FORMAT SERDE 'com.coder4.hive.MySerDe'
STORED AS TEXTFILE
LOCATION '/usr/hive/text/';
```
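Putting the three steps together, a typical session might look like this (a sketch; the jar path is hypothetical, and the data files under the table's LOCATION are assumed to already be in the key=value format that MySerDe expects):

```sql
ADD JAR /path/to/MySerDe.jar;

-- after creating the table as above, query it like any other Hive table;
-- MySerDe's deserialize() turns each record into a row object
SELECT id, name, age FROM teacher LIMIT 10;
```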
03 end
This article mainly references the following material:
- https://www.cnblogs.com/rrttp/p/9024153.html
- https://www.hadoopdoc.com/hive/hive-serde
End of this article!