Azkaban for big data task scheduling
Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It is mainly used to run a group of jobs and processes in a specific order within a workflow. It is configured through simple <key, value> pairs in job configuration files, and Azkaban uses these job profiles to establish dependencies between ...
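These <key, value> pairs live in *.job files; a minimal sketch of a two-job flow (the job names foo and bar are illustrative):

# foo.job -- first job in the flow
type=command
command=echo "run foo"

# bar.job -- runs only after foo succeeds
type=command
dependencies=foo
command=echo "run bar"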
Posted by suzuki on Mon, 10 Jan 2022 23:38:51 +0100
Java Collection interface: List interface & Set interface
1. List interface
The elements in a List collection are ordered and duplicates are allowed, and each element in the collection has a corresponding seq ...
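A minimal Java sketch of these properties (insertion order preserved, duplicates allowed, access by index):

import java.util.ArrayList;
import java.util.List;

public class ListDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        list.add("a");                     // duplicates are allowed
        System.out.println(list.get(1));   // access by index: prints "b"
        System.out.println(list);          // insertion order kept: [a, b, a]
    }
}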
Posted by fredroines on Mon, 10 Jan 2022 03:40:09 +0100
Introduction to Spark development
What is Spark
The Hadoop ecosystem is divided into the distributed file system HDFS, the computing framework MapReduce, and the resource scheduling framework Yarn. However, as the field developed, MapReduce's heavy disk I/O, frequent network communication, and rigid, hard-coded execution model came to seriously slow down the operation speed of the who ...
Posted by switchdoc on Sun, 09 Jan 2022 10:15:36 +0100
MapReduce learning 1: overview and simple case preparation
1. Overview
1.1 MapReduce definition
MapReduce is a programming framework for distributed computing programs, and the core framework on which users develop Hadoop-based data analysis applications.
The core function of MapReduce is to integrate the business logic code written by the user with its own default components into a complete ...
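As a sketch of where the user's business logic plugs in, assuming the classic word-count shape (class and field names here are illustrative): the framework supplies input splitting, shuffle, sort, and output handling around a user-written Mapper such as:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// User-written business logic; everything around it is a default component.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit <word, 1> for the shuffle phase
        }
    }
}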
Posted by pckidcomplainer on Sat, 08 Jan 2022 13:35:01 +0100
Resource scheduling in Yarn
Three Scheduling Strategies
Yarn offers three scheduling policies: the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler. These three policies are introduced below.
FIFO Scheduler: a first-in, first-out scheduling strategy. Tasks run one after another, and resources are released only after earlier tasks finish executing. This is unreasona ...
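Which policy is active is chosen in yarn-site.xml; a sketch selecting the Capacity Scheduler (the property and class names below are the standard Hadoop ones, but check the defaults of your Hadoop version):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>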
Posted by ConnorSBB on Sat, 08 Jan 2022 04:07:34 +0100
elasticsearch 7.X data type
References
binary
Binary values are encoded as Base64 strings.
PUT /es_field_type?pretty=true
{
  "mappings": {
    "properties": {
      "binary": {
        "type": "binary"
      }
    }
  }
}

POST es_field_type/_doc
{
  "binary": "U29tZSBiaW5hcnkgYmxvYg=="
}
boolean
Boolean values: true and false.
keyword
A whole-word string, indexed as a single term rather than analyzed. The string cannot ...
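Following the same mapping pattern as the binary example above, a sketch for these two types (the index name and field names are illustrative):

PUT /es_field_type2?pretty=true
{
  "mappings": {
    "properties": {
      "active": { "type": "boolean" },
      "tag":    { "type": "keyword" }
    }
  }
}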
Posted by ginoitalo on Thu, 06 Jan 2022 03:19:37 +0100
[Spark] user-defined functions UDF and UDAF
All the data we use in this article is user JSON, as shown below
{"username": "zhangsan","age": 20} {"username": "lisi","age": 21} {"username": "wangwu","age": 19}
Custom UDF
Introduction to UDF
UDF: takes one line in and returns one result, a one-to-one relationship; you pass a value into the function and it will return ...
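A minimal sketch of such a one-in, one-out UDF with Spark's Java API, keyed to the user JSON above (the file path user.json and the function name addName are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("udf").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.read().json("user.json");
        df.createOrReplaceTempView("user");

        // One value in, one value out: prefix each username.
        spark.udf().register("addName",
                (UDF1<String, String>) name -> "Name: " + name,
                DataTypes.StringType);

        spark.sql("SELECT addName(username) AS name, age FROM user").show();
        spark.stop();
    }
}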
Posted by shaunie123 on Wed, 05 Jan 2022 21:44:19 +0100
Big data and Hadoop & distributed file systems & distributed Hadoop clusters | Cloud computing
1. Deploy Hadoop
1.1 Problem
This case requires the installation of stand-alone Hadoop:
Hot-word analysis:
- Minimum configuration: 2 CPUs, 2 GB memory, 10 GB hard disk
- Virtual machine IP: 192.168.1.50 (hadoop1)
- Install and deploy Hadoop
- Run a data analysis to find the most frequently occurring words
1.2 Steps
To implement this case, you need to ...
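As a sketch of the hot-word step on a stand-alone install, using the word-count example bundled with Hadoop (the tarball version and the paths are assumptions):

# unpack Hadoop and run the bundled wordcount job on a local input directory
tar -xf hadoop-2.7.7.tar.gz -C /usr/local/
export HADOOP_HOME=/usr/local/hadoop-2.7.7
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar \
    wordcount input/ output/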
Posted by ball420 on Wed, 05 Jan 2022 18:23:47 +0100
MapReduce notes - a serialization case
Serialization
When hosts transmit data to each other, they cannot directly send an object to another host; the object's content must be encapsulated into a packet in some form and then sent over the network. The most important approach is to encode the object as a string, and the writing rules of this string are ...
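In Hadoop those agreed-upon encoding rules are the Writable interface; a minimal sketch of a custom serializable bean (the FlowBean name and its fields are illustrative, following the common phone-traffic example):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom type that MapReduce can ship between hosts.
public class FlowBean implements Writable {
    private long upFlow;    // bytes uploaded
    private long downFlow;  // bytes downloaded

    public FlowBean() {}    // empty constructor required for reflection

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);     // serialize fields in a fixed order
        out.writeLong(downFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();    // deserialize in exactly the same order
        downFlow = in.readLong();
    }
}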
Posted by lnt on Tue, 04 Jan 2022 19:29:57 +0100
Spark + parse text + recursion + pattern matching + broadcast filtering
Contents
Requirement: given a number of table names, find how many of them are used in the code
Restated: given keywords, count how many of them hit in the log files
Approach: Spark parses the keyword list into rdd1; Spark parses the file directory data into rdd2; rdd1 join rdd2 (broadcast)
The code:
Step 1: create SparkContext
Step 2: read the file ...
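A minimal sketch of this broadcast-filter shape in Spark's Java API (the paths keywords.txt and logs/ are assumptions; the small keyword side is broadcast and matched against the log lines):

import java.util.HashSet;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastFilter {
    public static void main(String[] args) {
        // Step 1: create the SparkContext
        SparkConf conf = new SparkConf().setAppName("broadcast-filter").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Step 2: read the files; keywords.txt is the small side, logs/ the large side
        Set<String> keywords = new HashSet<>(sc.textFile("keywords.txt").collect());
        Broadcast<Set<String>> bc = sc.broadcast(keywords);

        // Broadcast "join": keep only log lines that mention any keyword
        JavaRDD<String> hits = sc.textFile("logs/")
                .filter(line -> bc.value().stream().anyMatch(line::contains));

        System.out.println("hit count: " + hits.count());
        sc.stop();
    }
}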
Posted by shan111 on Tue, 04 Jan 2022 15:35:44 +0100