MapReduce learning 1: overview and simple case preparation

1. Overview. 1.1 MapReduce definition: MapReduce is a programming framework for distributed computing programs, and the core framework for developing Hadoop-based data analysis applications. Its core function is to integrate the business logic code written by the user with the framework's own default components into a complete ...

Posted by pckidcomplainer on Sat, 08 Jan 2022 13:35:01 +0100

MapReduce notes - serialization cases

Serialization: when hosts transmit data to each other, they cannot send an object directly to another host; the object's contents must be encapsulated into a packet in some form and then sent over the network. The most common approach is to encode the object as a string, and the rules for writing this string are ...
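In Hadoop this contract is standardized by the Writable interface. A minimal sketch, assuming illustrative field names (FlowBean is the classic tutorial example, not necessarily this article's):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A Hadoop-serializable object: Writable defines how the fields are
// turned into a byte stream (write) and rebuilt from one (readFields).
public class FlowBean implements Writable {
    private long upFlow;    // illustrative field
    private long downFlow;  // illustrative field

    public FlowBean() { }  // no-arg constructor required for reflection

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        upFlow = in.readLong();
        downFlow = in.readLong();
    }
}
```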

Posted by lnt on Tue, 04 Jan 2022 19:29:57 +0100

MapReduce framework principle - InputFormat data input

Catalogue: 1. Introduction to InputFormat; 2. Parallelism of slicing and MapTask tasks; 3. Job submission process source code; 4. InputFormat implementation subclasses; 5. Slicing mechanism of FileInputFormat: (1) slicing mechanism, (2) slice source code analysis, (3) slicing steps, (4) FileInputFormat default slice size parameter configuration ...
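For reference, the default slice size in item (4) comes down to a max/min over three values. A sketch mirroring FileInputFormat's well-known computeSplitSize rule:

```java
// Sketch of FileInputFormat's default split-size rule:
//   splitSize = max(minSize, min(maxSize, blockSize))
// minSize comes from mapreduce.input.fileinputformat.split.minsize
// (default 1) and maxSize from mapreduce.input.fileinputformat.split.maxsize
// (default Long.MAX_VALUE), so by default the split size equals the
// HDFS block size (128 MB).
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // 134217728
    }
}
```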

Posted by cihan on Tue, 04 Jan 2022 16:10:16 +0100

Use of the MapReduce Framework - Join

Catalogue: 1. Introduction; 2. Use of Join in the relational database MySQL: Cartesian product (CROSS JOIN), inner join (INNER JOIN), left join (LEFT JOIN), right join (RIGHT JOIN), outer join (OUTER JOIN); 3. Reduce Join: 1. Introduction to Reduce Join; 2. Cases: 2.1 Requirements, 2.2 Implementation idea: reduce-side table ...
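A minimal sketch of the reduce-side join idea: the mapper tags each record with its source table, and the reducer joins records that share a key. File names and column layout here are assumptions, not the article's actual case:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map side: tag each record with its source table so the reduce
// side can tell the two tables apart after the shuffle.
class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        String[] fields = value.toString().split("\t");
        // Assumes the join key is the first column of both files,
        // and hypothetical file names "order*" and "product*".
        String tag = file.startsWith("order") ? "O#" : "P#";
        ctx.write(new Text(fields[0]), new Text(tag + value.toString()));
    }
}

// Reduce side: buffer each side's records for a key, then emit the
// cross product, which is the actual join output.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String> orders = new ArrayList<>();
        List<String> products = new ArrayList<>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("O#")) orders.add(s.substring(2));
            else products.add(s.substring(2));
        }
        for (String o : orders)
            for (String p : products)
                ctx.write(key, new Text(o + "\t" + p));
    }
}
```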

Posted by Seol on Tue, 04 Jan 2022 11:58:06 +0100

Calling MapReduce to count the occurrences of each word in a file

Note: installation and configuration steps are given in the reference materials at the end. 1. Upload the file to be analyzed (no fewer than 100,000 English words) to HDFS: demo.txt is the file to be analyzed; start Hadoop; upload the file to the input folder on HDFS; confirm the upload succeeded. 2. Call MapReduce to count the n ...
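A minimal sketch of the counting step, using the canonical Hadoop WordCount (input and output paths are passed as arguments; demo.txt would live under the HDFS input folder):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));  // total per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```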

Posted by crazytoon on Fri, 31 Dec 2021 05:40:31 +0100

The simplest service response time optimization method, bar none

Preface, from Wan Junfeng (Kevin): the average latency of our services is roughly 30 ms. One very big prerequisite is that we make extensive use of the MapReduce technique, so that even when a service calls many other services, its latency often depends only on the duration of the slowest request. For your existing services, you do not need to optimi ...
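The article's code uses go-zero's Go mr package; as a language-neutral sketch of the same idea in Java, issue the downstream calls concurrently so total latency tracks the slowest call rather than the sum (the service calls here are hypothetical stand-ins):

```java
import java.util.concurrent.CompletableFuture;

// Three independent downstream calls issued concurrently:
// total latency ~= max(call latencies) instead of their sum.
public class ParallelFanOut {
    public static void main(String[] args) {
        CompletableFuture<String> user  = CompletableFuture.supplyAsync(() -> fetchUser());
        CompletableFuture<String> order = CompletableFuture.supplyAsync(() -> fetchOrder());
        CompletableFuture<String> stock = CompletableFuture.supplyAsync(() -> fetchStock());
        // Completes when the slowest of the three finishes.
        CompletableFuture.allOf(user, order, stock).join();
        System.out.println(user.join() + " " + order.join() + " " + stock.join());
    }

    // Hypothetical downstream calls standing in for real RPCs.
    static String fetchUser()  { return "user"; }
    static String fetchOrder() { return "order"; }
    static String fetchStock() { return "stock"; }
}
```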

Posted by twistisking on Thu, 30 Dec 2021 20:13:38 +0100

Java HDFS API programming II

Design patterns in Java: the template pattern. Define the skeleton (the general algorithm, abstracted) and hand the specific implementation over to subclasses. In other words, the template only defines the process and the order of its steps; the template method pays no attention to how each step is implemented. The specific ...
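A minimal Java sketch of the template pattern as described (class and method names are illustrative):

```java
// The abstract class fixes the algorithm's skeleton; subclasses
// fill in the variable step. (Names are illustrative.)
abstract class HdfsTask {
    // The template method: defines the overall process and is final,
    // so the flow cannot change; it does not care how doWork() is done.
    public final void execute() {
        connect();
        doWork();      // the step delegated to subclasses
        disconnect();
    }

    private void connect()    { System.out.println("connect to HDFS"); }
    private void disconnect() { System.out.println("close connection"); }

    protected abstract void doWork();
}

class ListFilesTask extends HdfsTask {
    @Override
    protected void doWork() { System.out.println("list files"); }
}

public class TemplateDemo {
    public static void main(String[] args) {
        new ListFilesTask().execute();  // connect -> list files -> close
    }
}
```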

Posted by alecapone on Thu, 23 Dec 2021 23:39:28 +0100

Level-3 practical project report

Level-3 practical project report. Project name: website log analysis system for the Canva paintable online graphic design software. Major: Data Science and Big Data Technology. Class: ==201. Student No.: ============. Student name: CS daydream. Instructor: ===. December 2021. Abstract: With the development of the Intern ...

Posted by rebelo on Sun, 19 Dec 2021 20:30:54 +0100

Big data offline processing project - data cleaning (ETL): writing a MapReduce program to implement data cleaning

Introduction. Function: clean the collected log data, filtering out invalid records and static resources. Method: write a MapReduce program to do the processing. Classes involved: 1) an entity class (Bean) that describes the fields of a log record, such as client IP, request URL, and request status; 2) a tool class used to process beans: it sets the validity or invalidity of log ...
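A hedged sketch of the two classes described, with illustrative field and class names (the filter rule shown is a common example, not necessarily the article's):

```java
// Entity class: holds the log fields plus a validity flag.
public class WebLogBean {
    private String remoteAddr;   // client ip
    private String requestUrl;   // request url
    private String status;       // request status
    private boolean valid = true;

    public String getRequestUrl() { return requestUrl; }
    public void setRequestUrl(String url) { this.requestUrl = url; }
    public void setValid(boolean valid) { this.valid = valid; }
    public boolean isValid() { return valid; }
    // getters/setters for the remaining fields omitted for brevity
}

// Tool class: marks static-resource or malformed records invalid,
// so the mapper can drop beans where isValid() is false.
class WebLogParser {
    static void filter(WebLogBean bean) {
        String url = bean.getRequestUrl();
        if (url == null || url.endsWith(".css")
                || url.endsWith(".js") || url.endsWith(".png")) {
            bean.setValid(false);
        }
    }
}
```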

Posted by KRAK_JOE on Fri, 03 Dec 2021 17:01:43 +0100

MapReduce core design -- Hadoop RPC framework

Hadoop RPC is divided into four layers. Serialization layer: converts structured objects into byte streams for transmission over the network or for writing to persistent storage; in the RPC framework it is mainly used to convert the parameters or responses of user requests into byte streams for cross-machine transmission. Function call layer: locates the fu ...
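A hedged sketch of what sits on top of these layers, in the style of common Hadoop 2.x tutorials: define a protocol interface and expose it via org.apache.hadoop.ipc.RPC. The interface, method, and port here are assumptions; the framework's serialization layer handles turning the parameters and return value into byte streams:

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

// Hypothetical protocol: one remote method plus the version field
// the framework checks at connection time.
interface EchoProtocol {
    long versionID = 1L;
    String echo(String message);
}

// Server side: bind an implementation of the protocol to a port.
class EchoServer implements EchoProtocol {
    public String echo(String message) { return "echo: " + message; }

    public static void main(String[] args) throws Exception {
        RPC.Server server = new RPC.Builder(new Configuration())
                .setProtocol(EchoProtocol.class)
                .setInstance(new EchoServer())
                .setBindAddress("0.0.0.0")
                .setPort(8888)          // hypothetical port
                .build();
        server.start();
    }
}

// Client side: obtain a proxy and call the remote method as if local.
class EchoClient {
    public static void main(String[] args) throws Exception {
        EchoProtocol proxy = RPC.getProxy(EchoProtocol.class,
                EchoProtocol.versionID,
                new InetSocketAddress("localhost", 8888),
                new Configuration());
        System.out.println(proxy.echo("hello"));
    }
}
```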

Posted by DuNuNuBatman on Mon, 29 Nov 2021 22:37:21 +0100