Senior Big Data Development Engineer - Hive learning notes
Hive advanced chapter
Use of Hive
Hive's bucket table
1. Principle of bucket tables
Bucketing is a finer-grained division of data than partitioning. A Hive table or a partitioned table can be further divided into buckets. Bucketing takes the hash value of one column across the whole data set and uses it to determine which bucket th ...
Posted by luisluis on Wed, 08 Dec 2021 08:35:11 +0100
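The bucketing rule described in the Hive bucket table entry above (hash one column, then use the hash to pick a bucket) can be illustrated with a minimal Python sketch. This is not Hive's implementation: the real hash function depends on the column type and the Hive version, so `zlib.crc32` is used here only as a deterministic stand-in, and the names `bucket_for` and `split_into_buckets` are hypothetical.

```python
import zlib


def bucket_for(value, num_buckets):
    """Return the bucket index for one column value.

    Assumption: crc32 stands in for Hive's own hash function.
    """
    digest = zlib.crc32(str(value).encode("utf-8"))  # unsigned 32-bit int
    return digest % num_buckets


def split_into_buckets(rows, column, num_buckets):
    """Distribute rows (dicts) into buckets by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


rows = [
    {"user_id": 1, "city": "Paris"},
    {"user_id": 2, "city": "Oslo"},
    {"user_id": 1, "city": "Rome"},
]
buckets = split_into_buckets(rows, "user_id", 4)
```

Rows that share the same value in the bucketing column always land in the same bucket, which is the property that bucket-based sampling and joins rely on.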
HBase index (Phoenix secondary index)
1. Introduction to Phoenix
HBase is suitable for storing large volumes of NoSQL data that need few relational operations. Because of limitations in HBase's design, its native API cannot directly perform operations such as conditional filtering and aggregation that are common in relational databa ...
Posted by natronp on Fri, 03 Dec 2021 22:04:23 +0100
Big data offline processing project: data cleaning (ETL) with a MapReduce program
Introduction:
Function: clean the collected log data, filtering out invalid records and static resources
Method: write a MapReduce program to do the processing
Classes involved:
1) Entity class Bean
Describes the fields of the log data, such as client IP, request URL, and request status
2) Tool class
Used to process beans: set the validity or invalidity of log ...
Posted by KRAK_JOE on Fri, 03 Dec 2021 17:01:43 +0100
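The cleaning flow outlined in the entry above (a bean describing the log fields, a tool that marks records valid or invalid, and a filter for static resources) might look roughly like the following sketch. The original project writes a MapReduce program; this is plain Python for illustration only. The whitespace-separated line format, the field names, the status threshold, and the suffix list are all assumptions, not details taken from the original post.

```python
from dataclasses import dataclass

# Assumption: requests ending in these suffixes count as static resources.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


@dataclass
class LogRecord:
    """Bean-like holder for one log line (hypothetical field set)."""
    ip: str
    url: str
    status: int
    valid: bool = True


def parse_line(line):
    """Parse 'ip url status'; return None when the line is malformed."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return LogRecord(ip=parts[0], url=parts[1], status=int(parts[2]))


def mark_validity(record):
    """Flag error responses and static resources as invalid."""
    if record.status >= 400 or record.url.lower().endswith(STATIC_SUFFIXES):
        record.valid = False
    return record


def clean(lines):
    """Keep only well-formed, valid, non-static records."""
    cleaned = []
    for line in lines:
        record = parse_line(line)
        if record is not None and mark_validity(record).valid:
            cleaned.append(record)
    return cleaned


sample = [
    "1.2.3.4 /index.html 200",
    "1.2.3.4 /style.css 200",
    "corrupted line",
    "5.6.7.8 /missing 404",
]
kept = clean(sample)
```

In a MapReduce job the body of `clean` would typically sit in the map function, with each input line either emitted or dropped.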
Hive UDF (user-defined functions): getting started
1, Introduction
Hive has three types of UDFs: (normal) UDF, user-defined aggregate function (UDAF), and user-defined table generating function (UDTF).
UDF: operates on a single data row and produces a single data row as output. Most functions, such as mathematical and string functions, fall into this category.
UDAF: accepts multiple input d ...
Posted by ublapach on Fri, 03 Dec 2021 05:59:54 +0100
FlinkCDC + Hudi + Hive: hands-on basics of real-time data lake ingestion
Contents
The new architecture and the integrated lakehouse
1, Version Description
2, Compile and package Hudi version 0.10.0
1. Use git to clone the latest master on github
2. Compilation and packaging
3, Create a Flink project
1. Main contents of the POM file
2. Checkpoint
3. FlinkCDC code
4. Hudi code (refer to the official ...
Posted by WebbDawg on Fri, 03 Dec 2021 03:42:39 +0100
Big data offline processing project: collecting website log files, splitting logs, loading data to HDFS, and preprocessing
Introduction:
This article covers the first stage of the big data offline processing project: data collection
Main contents:
1) Use Flume to collect website log file data into access.log
2) Write a shell script that splits the collected log file (otherwise access.log grows too large) and renames it to access_ Mm / DD / yyyy.log. ...
Posted by erth on Tue, 30 Nov 2021 12:59:03 +0100
Scala: lists, tuples, sets, and related knowledge
1. Array
1.1 Overview
An array is a container that stores multiple elements of the same type. Each element has a number (also called a subscript or index), and numbering starts from 0. Scala has two kinds of arrays: fixed-length arrays and variable-length arrays.
1.2 Fixed-length array
1.2.1 feature ...
Posted by soupy127 on Mon, 29 Nov 2021 23:32:31 +0100
MapReduce core design -- Hadoop RPC framework
Hadoop RPC is divided into four parts
Serialization layer: converts structured objects into byte streams for transmission over the network or for writing to persistent storage. In the RPC framework, it is mainly used to convert the parameters of user requests, or the responses, into byte streams for transmission between machines.
Function call layer: locate the fu ...
Posted by DuNuNuBatman on Mon, 29 Nov 2021 22:37:21 +0100
MapReduce programming model
Overview
The application scenarios suited to the MR distributed computing framework share a common feature: tasks can be decomposed into independent subproblems. The MR programming model therefore follows a distributed programming method with 5 steps:
Iteration: traverse the input data and parse it into kv pairs
Mapping: map the input kv pairs to other kv pairs
Grouping ...
Posted by CorkyMcDoogle on Mon, 29 Nov 2021 12:23:08 +0100
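Only the first three of the five steps are visible in the MapReduce programming model excerpt above (iteration, mapping, grouping), so the sketch below covers just those, using word counting as an assumed example task. It is a single-process Python illustration, not Hadoop code, and the function names are hypothetical.

```python
from collections import defaultdict


def iterate(lines):
    """Iteration: traverse the input and parse it into (offset, line) kv pairs."""
    for offset, line in enumerate(lines):
        yield offset, line


def map_pairs(pairs):
    """Mapping: turn each input kv pair into zero or more (word, 1) pairs."""
    for _offset, line in pairs:
        for word in line.split():
            yield word, 1


def group(pairs):
    """Grouping: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


grouped = group(map_pairs(iterate(["a b", "a"])))
```

Each line can be mapped without looking at any other line, which is the independence property that lets the work be spread across machines.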
III-3. Interaction between HBase and Hive
3.1 comparison between HBase and Hive
[Hive]
Outline: Elaboration
1. Data warehouse: In essence, Hive keeps in MySQL a one-to-one mapping to the files already stored in HDFS, which makes it convenient to manage and query them with HQL.
2. Used for data analysis and cleaning: Hive is s ...
Posted by abitshort on Sun, 28 Nov 2021 11:10:16 +0100