Installation and configuration of sqoop
Recently I needed to export MySQL data to HDFS, so I turned to sqoop2. Compared with sqoop1, sqoop2 has the advantage that a program can connect directly to sqoop on the cluster and operate it remotely. The process starts by creating a link, which can be understood as the object to be operated on. For example, one link is HDFS and the other lin ...
Posted by canadian_angel on Sun, 16 Jan 2022 17:47:04 +0100
Introduction to big data -- Hive data query
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
WHERE
Similar to SQL. The extended RLIKE operator supports regular expressions.
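As a rough illustration of how RLIKE differs from LIKE, the sketch below mimics both in Python. Hive's RLIKE uses Java regular expressions and is true when any substring of the value matches, so Python's `re.search` is only an approximate stand-in. The helper names `rlike` and `like` and the sample values are hypothetical.

```python
import re

def rlike(value, pattern):
    """Approximate Hive's `value RLIKE pattern`: true if any substring matches."""
    if value is None or pattern is None:
        return None  # Hive yields NULL when either side is NULL
    return re.search(pattern, value) is not None

def like(value, pattern):
    """Approximate SQL LIKE: `%` matches any run of characters, `_` exactly one."""
    if value is None or pattern is None:
        return None
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

# Hypothetical sample values standing in for a table column.
names = ["foobar", "barfoo", "baz"]
matches_rlike = [n for n in names if rlike(n, "^f.*r$")]
matches_like = [n for n in names if like(n, "%foo")]
print(matches_rlike)
print(matches_like)
```

The point of the comparison is that LIKE only offers the `%` and `_` wildcards, while RLIKE accepts a full regular expression such as anchors and character classes.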
Sorting
ORDER BY
Global sorting, only one ...
Posted by kingconnections on Sun, 16 Jan 2022 10:29:28 +0100
Flink usage and execution: an optimization scheme for multi-stream aggregation scenarios, worth bookmarking!
Articles in this series
Flink user's guide: how to develop a custom Flink SQL connector, making it more convenient to load data into the warehouse with SQL!
Flink user's guide: set global variables in Flink and read them in functions to make your code more elegant!
Flink user's guide: the Checkpoint mechanism, fully explained. You are the ...
Posted by kristolklp on Sun, 16 Jan 2022 03:28:42 +0100
"all shards failed" exception caused by ElasticSearch sorting
Preface
Note: the ElasticSearch version is 5.4.
Our log system requires some system indexes, which are added to ElasticSearch during application initialization. While they contain no data, these system indexes have only an index name and some configuration information in ElasticSearch, but no mapping information. When ...
Posted by Gubbins on Sat, 15 Jan 2022 21:20:55 +0100
Python POST crawler crawls Nuggets user information
1. Overview
The third-party Python library requests provides two functions for fetching web pages over HTTP: get(), which uses the GET method, and post(), which uses the POST method.
get() is the most commonly used function for crawling. It can fetch static HTML pages and most dynamically loade ...
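The difference between the two request methods can be sketched without touching the network. The example below uses the standard library's `urllib` rather than requests, so it runs without installing anything; with requests, the corresponding calls would be `requests.get(url, params=...)` and `requests.post(url, data=...)`. The URL, parameter names, and helper functions are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, used only for illustration.
URL = "https://example.com/api/user"
PARAMS = {"user_id": "123", "page": "1"}

def build_get(url, params):
    """GET carries the parameters in the URL's query string."""
    return Request(f"{url}?{urlencode(params)}", method="GET")

def build_post(url, params):
    """POST carries the parameters in the request body."""
    body = urlencode(params).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")

get_req = build_get(URL, PARAMS)
post_req = build_post(URL, PARAMS)

# Nothing is sent here; the requests are only constructed and inspected.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

For a crawler, this means a POST endpoint cannot be reached by pasting a URL alone: the body has to be reproduced as well.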
Posted by lj11 on Sat, 15 Jan 2022 17:41:55 +0100
Spark core programming - Introduction to RDD (resilient distributed dataset), RDD core attributes, RDD execution principle
Preface
To process data with high concurrency and high throughput, the Spark computing framework encapsulates three data structures, each handling a different application scenario. The three data structures are:
RDD: resilient distributed dataset
Accumulator: distributed shared write-only variable
Broadcast variable: distributed shared read-only variable
C ...
Posted by deth4uall on Fri, 14 Jan 2022 21:41:16 +0100
Thread Basics
Overview
1. Problems solved
Blocking operation: keyboard input
Time-consuming operation: loop
Time-slice rotation strategy
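The first two problems can be sketched with a worker thread: a time-consuming loop runs in its own thread so the main thread is not held up while it finishes. Python's `threading` module is used here only for illustration, and the function and variable names are hypothetical.

```python
import threading

def slow_sum(n, results):
    """A time-consuming loop; the result is stored in a shared dict."""
    total = 0
    for i in range(n):
        total += i
    results["sum"] = total

results = {}
worker = threading.Thread(target=slow_sum, args=(1_000_000, results))
worker.start()  # the loop now runs in a separate thread

# The main thread is free to do other work while the loop runs.
main_thread_work = [x * x for x in range(5)]

worker.join()  # wait for the worker before reading its result
print(main_thread_work)
print(results["sum"])
```

The call to `join()` matters: reading `results` before the worker finishes would race with the loop.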
2. I/O
BIO: blocking
NIO
AIO: asynchronous operation
Processes and threads
1. Introduction
System performance bottlenecks: long waits, apparent freezes, blocking
The user inputs an instruction and the computer performs an ope ...
Posted by magic003 on Fri, 14 Jan 2022 17:48:28 +0100
Application of Form in data stack: Validation
1. Introduction
This article covers the application of Form in the data stack. It aims to help you better understand and practice Form validation and linked validation, through examples already in use in the data stack and small tips collected by the author.
This article focuses on the validation o ...
Posted by imawake on Wed, 12 Jan 2022 10:25:34 +0100
From theory to engineering practice -- an introduction to user portrait
User portraits are the most important link in the upper-layer applications of big data, and it is particularly important to build a user portrait system suited to the company's own systems. However, material on user portraits is mostly theoretical, with little practical guidance and few engineering case studies.
This document combines the common user ...
Posted by $0.05$ on Wed, 12 Jan 2022 03:29:21 +0100
Flink difficulty analysis: unveiling the mystery of Watermark
Apache Flink has been called the ultimate streaming framework. It not only provides real-time computing with high throughput, low latency, and exactly-once semantics, but also processes batch data on top of its streaming engine, truly unifying batch and stream processing. It is undoubtedly a rising star after Sp ...
Posted by barnbuster on Mon, 10 Jan 2022 23:42:20 +0100