Create and run an EMR on EKS cluster

Creating an EMR on EKS cluster is entirely command-line driven; at present there is no UI for these operations. This article demonstrates how to create and run an EMR on EKS cluster from the command line. The process of creating EMR on EKS can be divided into two stages: the first stage is to create an E ...
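
As a rough sketch of what the command-line flow looks like, the snippet below registers an EMR virtual cluster with boto3; the region, EKS cluster name, and namespace are placeholders, and the underlying EKS cluster (stage one) is assumed to already exist and be registered for EMR.

```python
import boto3

# Sketch only: assumes AWS credentials are configured and that the EKS
# cluster "my-eks-cluster" and namespace "spark-ns" (placeholders) exist.
emr = boto3.client("emr-containers", region_name="us-east-1")

response = emr.create_virtual_cluster(
    name="my-virtual-cluster",
    containerProvider={
        "id": "my-eks-cluster",  # EKS cluster name (placeholder)
        "type": "EKS",
        "info": {"eksInfo": {"namespace": "spark-ns"}},
    },
)
print(response["id"])  # virtual cluster id, used later when submitting jobs
```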

Posted by algy on Tue, 01 Feb 2022 07:56:44 +0100

HiveSQL & SparkSQL -- using LEFT SEMI JOIN to optimize IN- and EXISTS-type subqueries

Introduction to LEFT SEMI JOIN: the main use case for SEMI JOIN (equivalent to LEFT SEMI JOIN) is rewriting EXISTS/IN subqueries. LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery. Although its name contains LEFT, its effect is equivalent to an INNER JOIN, except that the join result only keeps the columns from the orig ...
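
A minimal, self-contained PySpark sketch of the rewrite (table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semi-join-demo").getOrCreate()

# Hypothetical data: orders and a user blacklist.
spark.createDataFrame([(1, "u1"), (2, "u2"), (3, "u1")],
                      ["order_id", "user_id"]).createOrReplaceTempView("orders")
spark.createDataFrame([("u1",)], ["user_id"]).createOrReplaceTempView("blacklist")

# The IN subquery ...
spark.sql("""
    SELECT order_id, user_id FROM orders
    WHERE user_id IN (SELECT user_id FROM blacklist)
""").show()

# ... rewritten as a LEFT SEMI JOIN: only the left table's columns are
# returned, and each left row appears at most once even with multiple matches.
spark.sql("""
    SELECT o.order_id, o.user_id
    FROM orders o LEFT SEMI JOIN blacklist b ON o.user_id = b.user_id
""").show()
```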

Posted by Waldir on Mon, 31 Jan 2022 15:40:32 +0100

DataSkew -- a summary of data skew analysis and solutions in practice

Note that we should distinguish data skew from simply having too much data. Data skew means that a few tasks are assigned most of the data, so those few tasks run slowly; too much data means that every task is assigned a very large, roughly equal amount of data, so all tasks run slowly. What is data skew? In short, data s ...
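
One common mitigation for skewed joins is key salting; here is a hedged PySpark sketch with made-up data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

# Hypothetical skewed join: almost every row of `big` shares the key "hot".
big = spark.createDataFrame([("hot", i) for i in range(1000)] + [("cold", 0)],
                            ["k", "v"])
small = spark.createDataFrame([("hot", "H"), ("cold", "C")], ["k", "label"])

N = 10  # number of salt buckets
salted_big = big.withColumn("salt", (F.rand() * N).cast("int"))

# Replicate the small side once per salt value so every salted key still matches.
salts = spark.range(N).select(F.col("id").cast("int").alias("salt"))
salted_small = small.crossJoin(salts)

# The "hot" key is now spread across N partitions instead of one.
result = salted_big.join(salted_small, ["k", "salt"]).drop("salt")
result.groupBy("k").count().show()
```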

Posted by n00b Saibot on Mon, 31 Jan 2022 15:19:02 +0100

Spark BigData Program: real-time stream processing of big data logs

Spark BigData Program: real-time stream processing of big data logs. 1. Project content: write Python scripts that continuously generate user behavior logs for a learning website; start Flume to collect the generated logs; start Kafka to receive the logs forwarded by Flume; use Spark Streaming to consume the user logs from Kafka. Spark Streaming cleans the data ...
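
A sketch of the Kafka consumption step, using Structured Streaming rather than the classic Spark Streaming (DStream) API named in the post; the broker address and topic name are placeholders, and the spark-sql-kafka package must be on the classpath:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-stream").getOrCreate()

# Subscribe to the topic that Flume feeds (names are assumptions).
logs = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "user_logs")
        .load()
        .selectExpr("CAST(value AS STRING) AS line"))

# A minimal "cleaning" step: drop blank lines, then write to the console.
cleaned = logs.filter(F.length(F.trim(F.col("line"))) > 0)

query = cleaned.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```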

Posted by iriedodge on Mon, 31 Jan 2022 13:43:09 +0100

Mining user profile tags with algorithmic models

RFM user value model. 1. Requirements: suppose I am a marketer; before running a campaign I might think about the following questions. Who are my most valuable customers? Who has the potential to become a valuable customer? Who is churning? Who can be retained? Who will care about this campaign? In fact, all of these questions revolve around one theme: value. RFM is one of the most ...
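
A hedged PySpark sketch of computing the raw R/F/M values (schema and data are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rfm-demo").getOrCreate()

# Hypothetical order history: user_id, order_date, amount.
orders = spark.createDataFrame(
    [("u1", "2022-01-05", 30.0), ("u1", "2022-01-20", 50.0),
     ("u2", "2021-11-02", 200.0)],
    ["user_id", "order_date", "amount"],
).withColumn("order_date", F.to_date("order_date"))

# Raw R/F/M: days since last order, order count, total spend per user.
rfm = orders.groupBy("user_id").agg(
    F.datediff(F.current_date(), F.max("order_date")).alias("recency"),
    F.count("*").alias("frequency"),
    F.sum("amount").alias("monetary"),
)
rfm.show()
```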

Posted by 90Nz0 on Sun, 30 Jan 2022 20:19:59 +0100

The underlying implementation principles of Spark SQL

1. Spark SQL architecture design. Spark SQL allows big data development to be done directly in SQL, and it supports both the DSL and the SQL syntax styles. In Spark's overall architecture, all of the higher-level Spark modules, such as SQL, SparkML, SparkGraphX and Structured Streaming, run on top of the Catalyst optimization & Tungsten execution module. The fo ...
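
You can watch Catalyst at work from PySpark: explain(True) prints the parsed, analyzed, and optimized logical plans followed by the physical plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(100).filter("id % 2 = 0").selectExpr("id * 10 AS x")

# Prints each stage the Catalyst optimizer walks through before the
# Tungsten engine executes the final physical plan.
df.explain(True)
```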

Posted by Rochtus on Sat, 29 Jan 2022 23:46:11 +0100

Python big data processing library PySpark: hands-on summary III

Shared variables: broadcast variables. Broadcast variables let a program cache a read-only variable on each machine in the cluster instead of shipping a copy with every task. With broadcast variables you can share data, such as a global configuration file, in a more efficient way. from pyspark.sql import SparkSession spark = SparkSe ...
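
A self-contained sketch of the same idea (the lookup table and values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# A small read-only lookup table, shipped once per executor rather than
# once per task.
config = sc.broadcast({"lang": "en", "min_score": 60})

scores = sc.parallelize([("alice", 72), ("bob", 41)])
passed = scores.filter(lambda kv: kv[1] >= config.value["min_score"])
print(passed.collect())  # [('alice', 72)]
```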

Posted by cunoodle2 on Sat, 29 Jan 2022 14:37:23 +0100

65. A comprehensive Spark case study (Sogou search log analysis)

Sogou Labs: the search engine query log database is a collection of web query log data covering roughly one month (June 2008) of web query requests and user clicks on the Sogou search engine. It provides a benchmark research corpus for researchers studying the behavior of Chinese search engine users. Contents: Original da ...
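
A hedged PySpark sketch of loading the log, assuming the commonly described tab-delimited layout of the SogouQ file (time, user id, query, result rank, click order, URL); the path is a placeholder:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sogou-demo").getOrCreate()

# Column names reflect the assumed record layout; adjust to the actual file.
cols = ["time", "user_id", "query", "rank", "click_order", "url"]
logs = (spark.read.option("sep", "\t").csv("/data/SogouQ.sample")
        .toDF(*cols))

# Example analysis: top 10 most frequent queries.
logs.groupBy("query").count().orderBy(F.desc("count")).show(10)
```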

Posted by Ace_Online on Thu, 27 Jan 2022 16:49:35 +0100

Spark, the way of God (15) -- broadcast variables

Brief introduction. Broadcast variables allow us to keep a read-only variable on each machine instead of making a copy for each task. For example, they can be used to give every compute node a copy of a large input dataset in an efficient way. Spark also tries to use efficient broadcast algorithms to distribute broadcast variables to ...
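
At the DataFrame level the same idea surfaces as the broadcast join hint; a minimal sketch with made-up tables:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

facts = spark.createDataFrame([(1, 10), (2, 20)], ["dim_id", "value"])
dims = spark.createDataFrame([(1, "a"), (2, "b")], ["dim_id", "name"])

# The broadcast() hint asks Spark to ship the small table whole to every
# executor, avoiding a shuffle of the large side.
facts.join(F.broadcast(dims), "dim_id").show()
```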

Posted by sheckel on Thu, 27 Jan 2022 02:48:34 +0100

Parsing arrays with Gson in Scala

The complete code is at the end; you can jump to it through the table of contents. 1. Background: suppose the return value of an HTTP interface is as follows; how can we parse the result using Gson in Scala? { "code":0, "message":"OK", "result":[ { "originalityId":7, "conversionType":10011, ...
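
The post's Gson/Scala solution is truncated here; purely as an illustration of the same parsing task, here is a sketch using Python's standard json module (a different technique from the post's), restricted to the fields visible in the excerpt:

```python
import json

# Payload abbreviated to the fields shown in the excerpt above.
payload = """
{
  "code": 0,
  "message": "OK",
  "result": [
    {"originalityId": 7, "conversionType": 10011}
  ]
}
"""

data = json.loads(payload)
for item in data["result"]:  # "result" is a JSON array of objects
    print(item["originalityId"], item["conversionType"])
```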

Posted by david_s0 on Sun, 23 Jan 2022 07:46:35 +0100