Senior Big Data Development Engineer - Hive learning notes

Hive advanced chapter: Using Hive. Hive's bucketed tables. 1. Principle of bucketed tables. Bucketing divides data at a finer granularity than partitioning. A Hive table, or a partition of a table, can be further divided into buckets. To bucket the data, take the hash value of a chosen column for each row and determine which bucket th ...

Posted by luisluis on Wed, 08 Dec 2021 08:35:11 +0100
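
The bucket assignment described in the excerpt above can be sketched as a hash taken modulo the number of buckets. This is a minimal illustration, not Hive's implementation: the function names are hypothetical, and it assumes an integer bucketing column whose hash is the value itself (Hive's real hash depends on the column type).

```python
def bucket_for(hash_code: int, num_buckets: int) -> int:
    """Return the bucket index for a row, given the hash of its bucketing column."""
    # Clear the sign bit so a negative hash still yields a non-negative index.
    return (hash_code & 0x7FFFFFFF) % num_buckets


def assign_buckets(rows, column, num_buckets):
    """Group rows (dicts) into buckets by an integer column."""
    buckets = {i: [] for i in range(num_buckets)}
    for row in rows:
        # Assumption: for an integer column the hash is the value itself.
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


rows = [{"user_id": uid} for uid in (1, 2, 3, 4, 5, 6)]
buckets = assign_buckets(rows, "user_id", 4)
```

Rows with the same column value always land in the same bucket, which is what makes bucketed joins and sampling possible.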

HBase index (Phoenix secondary index)

1. Introduction to Phoenix. HBase is suitable for storing large volumes of NoSQL data with few requirements for relational operations. Because of limitations in HBase's design, the native API cannot directly perform operations such as conditional filtering and aggregation that are commonly used in relational databa ...

Posted by natronp on Fri, 03 Dec 2021 22:04:23 +0100

Big data offline processing project: data cleaning (ETL) with a MapReduce program

Introduction: Function: clean the collected log data, filtering out invalid data and static resources. Method: write a MapReduce program to do the processing. Classes involved: 1) Entity class (Bean): describes the fields of the log data, such as client IP, request URL, and request status. 2) Tool class: used to process beans: set the validity or invalidity of log ...

Posted by KRAK_JOE on Fri, 03 Dec 2021 17:01:43 +0100
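
The bean-plus-tool-class design in the excerpt above can be sketched as follows. This is an illustrative sketch only: the three-field log format, the suffix list, the status rule, and all names are assumptions, and the original post implements the idea as a MapReduce job rather than in Python.

```python
from dataclasses import dataclass

# Hypothetical list of static-resource suffixes to filter out.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


@dataclass
class LogBean:
    """Entity class: one log record plus a validity flag."""
    client_ip: str
    url: str
    status: int
    valid: bool = True


def parse_line(line: str) -> LogBean:
    """Parse 'ip url status' (an assumed, simplified log format) into a bean."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return LogBean("", "", 0, valid=False)
    return LogBean(parts[0], parts[1], int(parts[2]))


def mark_validity(bean: LogBean) -> LogBean:
    """Tool function: flag error responses and static resources as invalid."""
    if bean.status >= 400 or bean.url.lower().endswith(STATIC_SUFFIXES):
        bean.valid = False
    return bean


def clean(lines):
    """Map-only cleaning step: keep only the valid records."""
    beans = (mark_validity(parse_line(line)) for line in lines)
    return [b for b in beans if b.valid]
```

Cleaning needs no aggregation, so a job of this kind is typically map-only: each line is parsed, flagged, and either emitted or dropped.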

Getting started with Hive UDFs (user-defined functions)

1. Introduction. Hive has three types of UDFs: the (normal) UDF, the user-defined aggregate function (UDAF), and the user-defined table-generating function (UDTF). UDF: operates on a single data row and produces a single data row as output. Most functions, such as mathematical and string functions, fall into this category. UDAF: accepts multiple input d ...

Posted by ublapach on Fri, 03 Dec 2021 05:59:54 +0100
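
The difference between the three function types named in the excerpt above comes down to row cardinality. The sketch below shows only those shapes with plain Python functions; the names are hypothetical and this is not the Hive UDF API, which is written against Hive's Java classes.

```python
def upper_udf(value: str) -> str:
    """UDF-style: one input row in, one output row out."""
    return value.upper()


def sum_udaf(values) -> int:
    """UDAF-style: many input rows in, one aggregated row out."""
    total = 0
    for v in values:
        total += v
    return total


def explode_udtf(csv: str):
    """UDTF-style: one input row in, zero or more output rows out."""
    for item in csv.split(","):
        if item:
            yield item
```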

Flink CDC + Hudi + Hive: a hands-on introduction to real-time data lake ingestion

Contents: The new integrated lakehouse architecture. 1. Version description. 2. Compile and package Hudi 0.10.0: 1. Use git to clone the latest master from GitHub; 2. Compile and package. 3. Create a Flink project: 1. Main contents of the POM file; 2. Checkpoint; 3. Flink CDC code; 4. Hudi code (refer to the official ...

Posted by WebbDawg on Fri, 03 Dec 2021 03:42:39 +0100

Big data offline processing project: collecting website log files, splitting the logs, and loading them into HDFS with preprocessing

Introduction: This article covers the first stage of the big data offline processing project: data collection. Main contents: 1) Use Flume to collect website log file data into access.log. 2) Write a shell script that splits the collected log file (otherwise access.log grows too large) and renames it to access_ Mm / DD / yyyy.log. ...

Posted by erth on Tue, 30 Nov 2021 12:59:03 +0100

Scala: lists, tuples, sets, and related topics

1. Array. 1.1 Overview. An array is a container used to store multiple elements of the same type. Each element has a number (also called a subscript or index), and numbering starts from 0. Scala has two kinds of arrays: fixed-length arrays and variable-length arrays. 1.2 Fixed-length array. 1.2.1 Feature ...

Posted by soupy127 on Mon, 29 Nov 2021 23:32:31 +0100

MapReduce core design -- Hadoop RPC framework

Hadoop RPC is divided into four parts. Serialization layer: converts structured objects into byte streams for transmission over the network or for writing to persistent storage. In the RPC framework, it is mainly used to convert the parameters of user requests, or the responses, into byte streams for transmission between machines. Function call layer: locate the fu ...

Posted by DuNuNuBatman on Mon, 29 Nov 2021 22:37:21 +0100

MapReduce programming model

Summary: The application scenarios of the MR distributed computing framework share a common feature: the task can be decomposed into independent subproblems. The MR programming model therefore follows a distributed programming method with 5 steps. Iteration: traverse the input data and parse it into kv pairs. Mapping: map the input kv pairs to other kv pairs. Grouping ...

Posted by CorkyMcDoogle on Mon, 29 Nov 2021 12:23:08 +0100
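
The steps named in the excerpt above can be sketched in memory with a word count. This is a single-process illustration under stated assumptions, not Hadoop code: the function names are hypothetical, and the reduction step is an assumption added to complete the example, since the excerpt is cut off after "Grouping".

```python
from collections import defaultdict


def iterate(lines):
    """Iteration: traverse the input and parse it into (key, value) pairs."""
    for offset, line in enumerate(lines):
        yield offset, line


def map_pairs(pairs):
    """Mapping: turn each input pair into zero or more intermediate pairs."""
    for _, line in pairs:
        for word in line.split():
            yield word.lower(), 1


def group(pairs):
    """Grouping: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_groups(grouped):
    """Reduction (assumed next step): fold each key's values into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the steps: iterate, map, group, then reduce."""
    return reduce_groups(group(map_pairs(iterate(lines))))
```

Because each line is mapped independently and each key is reduced independently, both stages can be spread across machines, which is the decomposition property the excerpt describes.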

III-3. Interaction between HBase and Hive

3.1 Comparison between HBase and Hive. [Hive] Outline / Details. 1. Data warehouse: in essence, Hive records in MySQL a one-to-one mapping (bijection) to the files already stored in HDFS, so that they can be conveniently managed and queried with HQL. 2. Used for data analysis and cleaning: Hive is s ...

Posted by abitshort on Sun, 28 Nov 2021 11:10:16 +0100