Big data task scheduling with Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It is mainly used to run a group of jobs and processes in a specific order within a workflow. Dependencies are configured through simple <key, value> pairs in job configuration files. Azkaban uses these job profiles to establish dependencies between ...
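
As a rough sketch of that key-value style, two hypothetical Azkaban .job files might look like this (job names and commands are illustrative, not from the article):

```properties
# foo.job - runs first
type=command
command=echo "running foo"

# bar.job - a separate file; the dependencies key wires the workflow
type=command
dependencies=foo
command=echo "running bar after foo"
```

Azkaban reads one such file per job and derives the execution order from the dependencies keys.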

Posted by suzuki on Mon, 10 Jan 2022 23:38:51 +0100

Java Collection interface: List interface & Set interface

1. List interface: the elements in a List collection are ordered and repeatable, and each element in the collection has a corresponding seq ...
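
As a quick sketch of those List properties, with Set shown for contrast (class name and values are illustrative):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListVsSet {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("a");                    // duplicates allowed
        list.add("b");
        System.out.println(list.get(0));  // ordered: elements are index-addressable
        System.out.println(list);         // [a, a, b]

        Set<String> set = new HashSet<>();
        set.add("a");
        set.add("a");                     // duplicate silently ignored
        set.add("b");
        System.out.println(set.size());   // 2 - no duplicates, no positional access
    }
}
```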

Posted by fredroines on Mon, 10 Jan 2022 03:40:09 +0100

Introduction to Spark development

What is Spark? The Hadoop ecosystem is divided into the distributed file system HDFS, the computing framework MapReduce, and the resource scheduling framework YARN. However, as the field developed, MapReduce's heavy disk I/O, frequent network communication, and rigid design seriously slowed down the operation of the who ...
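
To make the contrast concrete, here is a minimal Java word-count sketch in which intermediate results stay in memory rather than being written to disk between stages as in MapReduce; the app name, master setting, and input path are assumptions:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkIntro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("intro").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical input file
            lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
                 .mapToPair(w -> new Tuple2<>(w, 1))   // <word, 1> pairs
                 .reduceByKey(Integer::sum)            // sum counts per word
                 .collect()
                 .forEach(t -> System.out.println(t._1 + "\t" + t._2));
        }
    }
}
```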

Posted by switchdoc on Sun, 09 Jan 2022 10:15:36 +0100

MapReduce learning 1: overview and simple case preparation

1. Overview. 1.1 MapReduce definition: MapReduce is a programming framework for distributed computing programs and the core framework for users to develop Hadoop-based data analysis applications. The core function of MapReduce is to integrate the business logic code written by the user with its own default components into a complete ...
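
A minimal sketch of the user-written "business logic" half, a word-count Mapper; the framework supplies the default components (input splitting, shuffle, output) around it. The class and type choices are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit <word, 1> for the reduce phase
        }
    }
}
```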

Posted by pckidcomplainer on Sat, 08 Jan 2022 13:35:01 +0100

Resource scheduling in Yarn

Three scheduling strategies: FIFO Scheduler, Capacity Scheduler, and Fair Scheduler, listed from left to right; the three policies are introduced below. FIFO Scheduler: a first-in, first-out scheduling strategy. Tasks are carried out in turn, and resources are released only after earlier tasks finish executing. This is unreasona ...
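
For reference, the scheduler policy is selected in yarn-site.xml; the property and class names below are standard Hadoop ones, but treat the snippet as an illustrative sketch:

```xml
<!-- yarn-site.xml: choose the ResourceManager's scheduler implementation -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```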

Posted by ConnorSBB on Sat, 08 Jan 2022 04:07:34 +0100

Elasticsearch 7.X data types

Reference resources. binary: binary values are encoded as Base64 strings.

```json
PUT /es_field_type?pretty=true
{
  "mappings": {
    "properties": {
      "binary": { "type": "binary" }
    }
  }
}

POST es_field_type/_doc
{
  "binary": "U29tZSBiaW5hcnkgYmxvYg=="
}
```

boolean: Boolean true and false. keyword: whole words; the string cannot ...
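
Following the same pattern, a hedged sketch of boolean and keyword mappings (index and field names are illustrative, not from the article):

```json
PUT /es_field_type_bk?pretty=true
{
  "mappings": {
    "properties": {
      "is_active": { "type": "boolean" },
      "tag":       { "type": "keyword" }
    }
  }
}
```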

Posted by ginoitalo on Thu, 06 Jan 2022 03:19:37 +0100

[Spark] User-defined functions: UDF and UDAF

All the data used in this article is the user JSON shown below:
{"username": "zhangsan","age": 20}
{"username": "lisi","age": 21}
{"username": "wangwu","age": 19}
Custom UDF. Introduction: a UDF takes one line and returns one result, a one-to-one relationship; pass a value into the function and it will return ...
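
A minimal Java sketch of that one-in/one-out idea against the user JSON above; the file name user.json and the UDF name addPrefix are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("udf").master("local[*]").getOrCreate();
        Dataset<Row> users = spark.read().json("user.json"); // one JSON object per line

        // one row in, one value out: prefix each username
        spark.udf().register("addPrefix",
                (UDF1<String, String>) name -> "Name:" + name,
                DataTypes.StringType);

        users.createOrReplaceTempView("users");
        spark.sql("SELECT addPrefix(username) AS username, age FROM users").show();
        spark.stop();
    }
}
```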

Posted by shaunie123 on Wed, 05 Jan 2022 21:44:19 +0100

Big data and Hadoop: distributed file systems and distributed Hadoop clusters | Cloud computing

1. Deploy Hadoop. 1.1 Problem: this case requires installing stand-alone Hadoop for hot-word analysis. Minimum configuration: 2 CPUs, 2 GB memory, 10 GB hard disk; virtual machine IP: 192.168.1.50 (hadoop1). Install and deploy Hadoop, then run the data analysis to find the most frequently occurring words. 1.2 Steps: to implement this case, you need to ...
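
For the analysis step, a sketch of the word-frequency run using the example jar that ships with Hadoop; the input file name and jar version wildcard are assumptions:

```bash
# Put sample text into HDFS (file name is illustrative)
hdfs dfs -mkdir -p /input
hdfs dfs -put words.txt /input

# Run the bundled wordcount example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /input /output

# Show the most frequent words first
hdfs dfs -cat /output/part-r-00000 | sort -k2 -nr | head
```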

Posted by ball420 on Wed, 05 Jan 2022 18:23:47 +0100

MapReduce notes - a serialization case

Serialization: when hosts transmit data to each other, they cannot send an object directly to another host. They need to encapsulate the object's content into a packet in some form and then send it over the network. The most common method is to encode the object as a string of bytes, and the writing rules of this encoding are agreed ...
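
Hadoop's own take on this is the Writable interface; a minimal sketch with illustrative fields, where write and readFields must use the same field order so both sides agree on the wire format:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {
    private long upFlow;
    private long downFlow;

    public FlowBean() { }  // Hadoop needs a no-arg constructor for reflection

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);     // serialization: field order defines the format
        out.writeLong(downFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();    // deserialization must read in the same order
        downFlow = in.readLong();
    }
}
```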

Posted by lnt on Tue, 04 Jan 2022 19:29:57 +0100

Spark + text parsing + recursion + pattern matching + broadcast filtering

Contents. Requirement: given a number of tables, find which of them are used in the code. Reframed: given keywords, count the hits for each in the log files. Approach: Spark parses the keyword list into rdd1 and the file directory data into rdd2; then rdd1 join rdd2 (broadcast). The code: Step 1: create a SparkContext. Step 2: read the file ...
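
A minimal Java sketch of the broadcast-filtering step: the small keyword set is broadcast to every executor and the large log RDD is filtered locally, avoiding a shuffle-based join. Keywords and paths are assumptions:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastFilter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("broadcast-filter").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            Set<String> keywords = new HashSet<>(Arrays.asList("orders", "users"));
            Broadcast<Set<String>> bc = sc.broadcast(keywords); // small side shipped once per executor

            JavaRDD<String> logs = sc.textFile("logs/*.txt");   // large side stays distributed
            JavaRDD<String> hits = logs.filter(line ->
                    bc.value().stream().anyMatch(line::contains));
            System.out.println("hits: " + hits.count());
        }
    }
}
```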

Posted by shan111 on Tue, 04 Jan 2022 15:35:44 +0100