MapReduce learning 1: overview and simple case preparation
1. Overview
1.1 MapReduce definition
MapReduce is a programming framework for writing distributed computing programs, and the core framework for developing Hadoop-based data analysis applications.
The core function of MapReduce is to combine the business logic code written by the user with the framework's own default components into a complete ...
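For concreteness, here is a minimal sketch of the classic WordCount job built from those two user-written pieces, a Mapper and a Reducer, using the standard org.apache.hadoop.mapreduce API (class and variable names are illustrative); the framework supplies everything else: input splitting, shuffle, sorting, and output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// User business logic, part 1: emit (word, 1) for every token in a line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // (word, 1)
            }
        }
    }
}

// User business logic, part 2: sum the 1s that the shuffle groups by word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```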
Posted by pckidcomplainer on Sat, 08 Jan 2022 13:35:01 +0100
MapReduce notes - serialization case
Serialization
When hosts transmit data to each other, they cannot send an object directly to another host: the object's content must be encapsulated into a packet in some form and then sent over the network. The most common approach is to encode the object as a string, where the formatting rules of that string are agreed ...
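In Hadoop specifically, objects cross the wire by implementing the Writable interface, whose write/readFields pair spells out exactly those formatting rules. A minimal sketch, with illustrative field names:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A bean Hadoop can ship between hosts; the flow fields are illustrative.
public class FlowBean implements Writable {
    private long upFlow;
    private long downFlow;

    public FlowBean() { }          // Hadoop instantiates it by reflection

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);     // the "writing rules": field order is fixed
        out.writeLong(downFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();    // must read in exactly the order written
        downFlow = in.readLong();
    }
}
```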
Posted by lnt on Tue, 04 Jan 2022 19:29:57 +0100
MapReduce framework principle - InputFormat data input
Contents
1. Introduction to InputFormat
2. Parallelism of slicing and MapTask tasks
3. Job submission process source code
4. InputFormat implementation subclasses
5. Slicing mechanism of FileInputFormat
(1) Slicing mechanism:
(2) Slice source code analysis
(3) Slicing steps
(4) FileInputFormat default slice size parameter configuration ...
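The heart of that slicing logic is one method in FileInputFormat (paraphrased from the Hadoop source). minSize and maxSize come from mapreduce.input.fileinputformat.split.minsize and .maxsize, so by default the split size equals the HDFS block size:

```java
// From org.apache.hadoop.mapreduce.lib.input.FileInputFormat (paraphrased):
// the split size is the block size, clamped between minSize and maxSize.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```

Raising minSize above the block size makes splits larger; lowering maxSize below it makes them smaller, which in turn controls MapTask parallelism.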
Posted by cihan on Tue, 04 Jan 2022 16:10:16 +0100
Use of MapReduce Framework-Join
Contents
1. Introduction
2. Use of Join in the relational database MySQL
Cartesian product: CROSS JOIN
Inner join: INNER JOIN
Left join: LEFT JOIN
Right join: RIGHT JOIN
Outer join: OUTER JOIN
3. Reduce Join
1. Introduction to Reduce Join
2. Cases
2.1 Requirements:
2.2 Implementation idea: reduce-end table ...
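As a hedged sketch of that reduce-side join idea (the file and field layouts are illustrative assumptions): the mapper keys every record by the join field and tags it with its source table, and the reducer matches the two sides.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: key every record by the join field (productId) and tag its source.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) {
        // Which input file does this split belong to?
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t");
        if (fileName.startsWith("order")) {     // orderId \t productId \t amount
            context.write(new Text(f[1]), new Text("O\t" + f[0] + "\t" + f[2]));
        } else {                                // productId \t productName
            context.write(new Text(f[0]), new Text("P\t" + f[1]));
        }
    }
}

// Reducer: for each productId, combine the product row with its order rows.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String productName = "";
        List<String> orders = new ArrayList<>();
        for (Text v : values) {
            String[] p = v.toString().split("\t", 2);
            if (p[0].equals("P")) productName = p[1];
            else orders.add(p[1]);
        }
        for (String order : orders) {
            context.write(key, new Text(order + "\t" + productName));
        }
    }
}
```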
Posted by Seol on Tue, 04 Jan 2022 11:58:06 +0100
Call MapReduce to count the number of occurrences of each word in a file
Note: installation and configuration steps are covered in the reference materials at the end
1. Upload the file to be analyzed (no fewer than 100,000 English words) to HDFS
demo.txt is the file to be analyzed
Start Hadoop
Upload the file to the input folder on HDFS
Verify that the upload succeeded
2. Call MapReduce to count the n ...
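One way to script the upload step is the Java FileSystem API; a minimal sketch, assuming a NameNode at hdfs://localhost:9000 and an HDFS user named hadoop (adjust both for your cluster):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Upload demo.txt into HDFS /input so the MapReduce job can read it.
public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://localhost:9000"), conf, "hadoop");
        fs.mkdirs(new Path("/input"));
        fs.copyFromLocalFile(new Path("demo.txt"), new Path("/input/demo.txt"));
        fs.close();
    }
}
```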
Posted by crazytoon on Fri, 31 Dec 2021 05:40:31 +0100
The simplest way to optimize service response time, bar none
Preface, from Wan Junfeng (Kevin)
The average latency of our services is around 30 ms. A major prerequisite is that we make extensive use of MapReduce-style concurrency, so that even when a service calls many other services, its latency often depends only on the duration of the slowest request.
For your existing services, you do not need to optimi ...
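The post is about go-zero's Go "mr" package, but the idea is language-neutral. Here is a hedged Java sketch of the same fan-out pattern using CompletableFuture, where callServiceA/callServiceB are hypothetical stand-ins for real downstream calls:

```java
import java.util.concurrent.CompletableFuture;

public class FanOut {
    public static void main(String[] args) {
        // Fan out two independent downstream calls in parallel
        CompletableFuture<String> a = CompletableFuture.supplyAsync(FanOut::callServiceA);
        CompletableFuture<String> b = CompletableFuture.supplyAsync(FanOut::callServiceB);
        // Wait for both: elapsed time tracks the slowest call, not the sum
        CompletableFuture.allOf(a, b).join();
        System.out.println(a.join() + " " + b.join());
    }

    static String callServiceA() { return "A"; }   // hypothetical RPC stand-in
    static String callServiceB() { return "B"; }   // hypothetical RPC stand-in
}
```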
Posted by twistisking on Thu, 30 Dec 2021 20:13:38 +0100
Java HDFS API programming II
Design patterns in Java: the template pattern
Define the skeleton of an algorithm (its general flow, abstracted out) and hand the concrete steps over to subclasses. In other words, the template only defines the overall process; the template method does not concern itself with how each step is implemented. The specific ...
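A minimal sketch of that pattern (class and method names are illustrative): the final template method fixes the flow, and subclasses supply only the varying step.

```java
// The abstract class owns the process; subclasses fill in execute().
abstract class HdfsTask {
    // Template method: defines the skeleton and is never overridden.
    public final void run() {
        connect();
        execute();
        close();
    }
    protected void connect() { System.out.println("open FileSystem"); }
    protected void close()   { System.out.println("close FileSystem"); }
    protected abstract void execute();  // the step subclasses must supply
}

class ListFilesTask extends HdfsTask {
    @Override
    protected void execute() { System.out.println("list files under /input"); }
}
```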
Posted by alecapone on Thu, 23 Dec 2021 23:39:28 +0100
Three level practical subject report
Project Name: website log analysis system for the Canva online graphic design software
Major name: Data Science and big data technology
Class: = = 201
Student No.:============
Student Name: CS daydream
Instructor:===
December 2021
Abstract
With the development of the Intern ...
Posted by rebelo on Sun, 19 Dec 2021 20:30:54 +0100
Big data offline processing project: writing a MapReduce program for ETL data cleaning
Introduction:
Function: clean the collected log data, filtering out invalid records and static-resource requests
Method: write MapReduce for processing
Classes involved:
1) Entity class Bean
Describes the fields of a log record, such as client IP, request URL, request status, etc.
2) Tool class
Used to process beans: set the validity or invalidity of log ...
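A hedged sketch of the map-only cleaning step; the parsing and validity rules below are inlined, illustrative assumptions standing in for the bean and tool classes described above:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cleaning: no reducer needed, so set job.setNumReduceTasks(0).
public class EtlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(" ");
        // Illustrative validity rules: drop static resources and short records
        boolean isStatic = line.contains("/static/") || line.endsWith(".css")
                || line.endsWith(".js") || line.endsWith(".png");
        if (fields.length >= 9 && !isStatic) {
            context.write(value, NullWritable.get());   // keep valid record
        }
        // invalid lines are simply not emitted
    }
}
```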
Posted by KRAK_JOE on Fri, 03 Dec 2021 17:01:43 +0100
MapReduce core design -- Hadoop RPC framework
Hadoop RPC is divided into four layers:
Serialization layer: converts structured objects into byte streams for transmission over the network or for writing to persistent storage. In the RPC framework it is mainly used to turn the parameters or responses of user requests into byte streams for cross-machine transfer.
Function call layer: locates the fu ...
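To make the layering concrete, here is a hedged sketch of Hadoop RPC in use (the Hadoop 2.x org.apache.hadoop.ipc.RPC API; the protocol name, port, and method are illustrative assumptions). The serialization and function-call layers do their work invisibly behind the proxy:

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

// A protocol is a plain interface plus a version number.
interface GreetProtocol {
    long versionID = 1L;                 // Hadoop RPC requires a version field
    String greet(String name);
}

public class RpcDemo implements GreetProtocol {
    public String greet(String name) { return "hello " + name; }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Server side: expose the protocol implementation
        RPC.Server server = new RPC.Builder(conf)
                .setProtocol(GreetProtocol.class)
                .setInstance(new RpcDemo())
                .setBindAddress("localhost")
                .setPort(8888)
                .build();
        server.start();

        // Client side: obtain a proxy and call it like a local object
        GreetProtocol proxy = RPC.getProxy(GreetProtocol.class,
                GreetProtocol.versionID,
                new InetSocketAddress("localhost", 8888), conf);
        System.out.println(proxy.greet("hadoop"));
        RPC.stopProxy(proxy);
        server.stop();
    }
}
```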
Posted by DuNuNuBatman on Mon, 29 Nov 2021 22:37:21 +0100