Installation and configuration of Sqoop

Recently, I needed to export MySQL data to HDFS, so I looked into Sqoop2. Compared with Sqoop1, Sqoop2 has the advantage that a program can connect directly to Sqoop on the cluster and operate it remotely. The process starts by creating a link, which can be understood as an object to be operated on. For example, one link is HDFS and the other lin ...

Posted by canadian_angel on Sun, 16 Jan 2022 17:47:04 +0100

Introduction to big data -- Hive data query

Syntax: SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ORDER BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number]. WHERE: similar to SQL; as an extension, RLIKE supports regular expressions. Sorting: ORDER BY performs global sorting, only one ...

Posted by kingconnections on Sun, 16 Jan 2022 10:29:28 +0100
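
As a rough illustration of the RLIKE extension mentioned in the Hive entry above: RLIKE is true when the regular expression matches any substring of the value, and it yields NULL when either side is NULL. The Python sketch below mimics that behaviour with `re.search`. The helper name `rlike` and the sample values are hypothetical, and Hive itself uses Java regular expressions, whose syntax differs from Python's in places.

```python
import re
from typing import Optional


def rlike(value: Optional[str], pattern: Optional[str]) -> Optional[bool]:
    """Approximate Hive's `value RLIKE pattern` (hypothetical helper, not a Hive API)."""
    # SQL-style NULL propagation: a NULL operand gives a NULL result.
    if value is None or pattern is None:
        return None
    # re.search succeeds if the pattern matches anywhere in the value.
    return re.search(pattern, value) is not None


# Hypothetical column values; None stands in for SQL NULL.
names = ["hive", "hadoop", "spark", None]

# Roughly: SELECT name FROM t WHERE name RLIKE '^h.*p$'
matches = [n for n in names if rlike(n, "^h.*p$")]
print(matches)
```

Anchoring the pattern with `^` and `$` forces a whole-string match; without the anchors, a pattern such as `do+` would match `hadoop` through a substring.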

Flink usage and execution: an optimization scheme for multi-stream aggregation scenarios, worth bookmarking!

Series catalogue. Flink user's guide: learn to develop a custom Flink SQL connector, which makes SQL-based warehousing more convenient! Flink user's guide: set global variables in Flink and read them in functions to make your code more elegant! Flink user's guide: the Checkpoint mechanism, fully explained. You are the ...

Posted by kristolklp on Sun, 16 Jan 2022 03:28:42 +0100

all shards failed exception caused by ElasticSearch sorting

Preface. Note: the ElasticSearch version is 5.4. Our log system requires some system indexes, which are added to ElasticSearch during application initialization. When there is no index data, these system indexes have only an index name and some configuration information, but no mapping information. When ...

Posted by Gubbins on Sat, 15 Jan 2022 21:20:55 +0100

Python POST crawler crawls Nuggets user information

1. Overview. The Python third-party library requests provides the get() function, based on the GET method, and the post() function, based on the POST method, for accessing HTTP web pages. get() is the most commonly used crawling method. It can obtain static HTML pages and most dynamically loade ...

Posted by lj11 on Sat, 15 Jan 2022 17:41:55 +0100
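
To make the GET/POST distinction in the crawler entry above concrete, here is a minimal sketch that builds one request of each kind without sending anything. It uses the standard library's `urllib` rather than requests, so it runs without extra installs. The endpoint `BASE_URL` and the parameter names are hypothetical and are not taken from the original article.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://example.com/api/users"  # hypothetical endpoint


def build_get(params: dict) -> Request:
    # GET: parameters travel in the URL's query string; there is no body.
    return Request(f"{BASE_URL}?{urlencode(params)}")


def build_post(payload: dict) -> Request:
    # POST: parameters travel in the request body, here encoded as JSON.
    body = json.dumps(payload).encode("utf-8")
    return Request(BASE_URL, data=body,
                   headers={"Content-Type": "application/json"})


get_req = build_get({"page": 1})
post_req = build_post({"user_id": "42"})

# Request objects are only built here; nothing goes over the network.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it. With requests installed, the rough equivalents are `requests.get(url, params=...)` and `requests.post(url, json=...)`.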

Spark core programming - Introduction to RDD (resilient distributed dataset), RDD core attributes, RDD execution principle

Preface. To process data with high concurrency and high throughput, the Spark computing framework encapsulates three data structures for different application scenarios. The three data structures are: RDD: resilient distributed dataset; Accumulator: distributed shared write-only variable; Broadcast variable: distributed shared read-only variable. C ...

Posted by deth4uall on Fri, 14 Jan 2022 21:41:16 +0100

Thread Basics

Overview. 1. Problems solved: blocking operations (keyboard input); time-consuming operations (loops); time-slice rotation strategy. 2. I/O: BIO (blocking); NIO; AIO (asynchronous operation). Processes and threads. 1. Introduction. System performance bottlenecks: long waiting times, unresponsive ("frozen") programs, blocking. The user inputs an instruction and the computer performs an ope ...

Posted by magic003 on Fri, 14 Jan 2022 17:48:28 +0100
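
The two problems named in the thread entry above, a blocking operation and a time-consuming loop, can be sketched with Python's `threading` module. This is an illustrative sketch only: the function names are hypothetical, `time.sleep` stands in for a blocking call such as keyboard input, and the original article may well use a different language.

```python
import threading
import time

results = []  # list.append is safe to call from several threads in CPython


def slow_sum(n: int) -> None:
    # Time-consuming operation: a long loop.
    total = 0
    for i in range(n):
        total += i
    results.append(total)


def blocking_wait(seconds: float) -> None:
    # Blocking operation: sleep stands in for waiting on keyboard input.
    time.sleep(seconds)
    results.append("input received")


workers = [
    threading.Thread(target=slow_sum, args=(100_000,)),
    threading.Thread(target=blocking_wait, args=(0.05,)),
]
for worker in workers:
    worker.start()

# The main thread is not stuck behind either task and could do other work here.

for worker in workers:
    worker.join()  # wait for both workers before reading the results

print(results)
```

The order of the two entries in `results` depends on scheduling, so code that consumes them should not rely on it.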

Application of Form in data stack: Verification

1. Introduction. This article covers the application of Form in the data stack. Through examples already applied in the data stack and tips collected by the author, it aims to help you better understand and practice Form verification and linkage verification. This paper focuses on the verification o ...

Posted by imawake on Wed, 12 Jan 2022 10:25:34 +0100

From theory to engineering practice -- an introduction to user portrait

User portraits are the most important part of top-level big data applications, so it is particularly important to build a set of user portraits suited to the company's systems. However, material on user portraits is mostly theoretical, with few practical or engineering case studies. This document combines the common user ...

Posted by $0.05$ on Wed, 12 Jan 2022 03:29:21 +0100

Flink difficulty analysis: unveiling the mystery of Watermark

Apache Flink has been called the ultimate streaming framework. It provides real-time computing with high throughput, low latency, and exactly-once semantics, and it also processes batch data on the same streaming engine, truly unifying batch and stream processing. It is undoubtedly a rising star after Sp ...

Posted by barnbuster on Mon, 10 Jan 2022 23:42:20 +0100