Big data offline processing project: data cleaning (ETL) by writing a MapReduce program

Introduction: Functions: clean the collected log data and filter out invalid records and static resources. Method: write a MapReduce job to do the processing. Classes involved: 1) an entity class (Bean) that describes the fields of a log record, such as client IP, request URL, and request status; 2) a tool class used to process the beans: it sets the validity or invalidity of a log ...
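A minimal sketch of what such a cleaning Mapper might look like, assuming a hypothetical space-delimited access-log layout and filter rules (the article's own Bean and tool classes are not shown in the excerpt):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogCleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Static-resource prefixes to drop (assumed; adjust to the real log format).
    private static final String[] STATIC_PREFIXES = {"/css/", "/js/", "/img/", "/fonts/"};

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(" ");  // assumed space-delimited access log
        if (fields.length < 9) {
            return;                                     // malformed record: drop it
        }
        String clientIp   = fields[0];
        String requestUrl = fields[6];
        String status     = fields[8];

        // Filter static resources.
        for (String prefix : STATIC_PREFIXES) {
            if (requestUrl.startsWith(prefix)) {
                return;
            }
        }
        // Filter invalid responses (4xx/5xx status codes).
        if (status.length() != 3 || status.charAt(0) >= '4') {
            return;
        }
        // Emit the cleaned record.
        context.write(new Text(clientIp + "\t" + requestUrl + "\t" + status), NullWritable.get());
    }
}
```

In the real project the Mapper would emit a serialized bean rather than a plain Text line, and the validity checks would match the project's actual log format.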

Posted by KRAK_JOE on Fri, 03 Dec 2021 17:01:43 +0100

MapReduce core design -- Hadoop RPC framework

Hadoop RPC is divided into four parts. Serialization layer: converts structured objects into byte streams for transmission over the network or for writing to persistent storage; in the RPC framework it is mainly used to turn the parameters or responses of user requests into byte streams for cross-machine transmission. Function call layer: locates the function ...
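To illustrate the serialization layer, here is a minimal sketch of a custom Hadoop Writable that the RPC framework could marshal into a byte stream; the HeartbeatRequest type and its fields are hypothetical, not taken from the article:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class HeartbeatRequest implements Writable {
    private String nodeId;
    private long timestamp;

    public HeartbeatRequest() { }                    // no-arg constructor required for deserialization

    public HeartbeatRequest(String nodeId, long timestamp) {
        this.nodeId = nodeId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(nodeId);                        // structured object -> byte stream
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        nodeId = in.readUTF();                       // byte stream -> structured object
        timestamp = in.readLong();
    }
}
```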

Posted by DuNuNuBatman on Mon, 29 Nov 2021 22:37:21 +0100

MapReduce programming -- merging and deduplication of files

Contents: 1. Problem description 2. Specific code 3. Specific operation. 1. Problem description: merge multiple input files, eliminate duplicate content, and output the deduplicated content to a single file. Main idea: owing to the processing characteristics of reduce, the input value sets will be automatically ...
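A minimal sketch of the usual approach, assuming each input line is treated as a whole record (the class and method names are illustrative, not the article's own code):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Dedup {
    // Mapper: emit each input line as the key; duplicate lines collapse onto the same key.
    public static class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    // Reducer: each distinct key arrives exactly once, so writing it produces deduplicated output.
    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
```

Because the shuffle groups all values by key before reduce is called, every duplicate line collapses into a single key, and the reducer only has to write each key once.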

Posted by PHP-Nut on Tue, 16 Nov 2021 11:53:04 +0100