1, Spark overview and quick start

Posted by brokeDUstudent on Sun, 27 Feb 2022 03:14:52 +0100

1, Spark overview

1.1 what is Spark

  • Spark is a fast, universal and scalable big data analysis and calculation engine based on memory;

1.2 Spark && Hadoop

|Spark|Hadoop|
|---|---|
|1. Developed in Scala; a fast, universal and extensible big data analysis engine|1. Developed in Java; an open source framework for storing massive data on distributed server clusters and running distributed analysis applications|
|2. Spark Core provides the most basic and core functionality of Spark|2. HDFS, based on GFS theory, stores data in a distributed way|
|3. Spark SQL is the component Spark uses to operate on structured data; through Spark SQL, users can query data with SQL or HQL|3. MapReduce, based on Google's MapReduce, performs distributed computing|
|4. Spark Streaming is the component of the Spark platform that performs streaming computation on real-time data and provides rich APIs for processing data streams|4. HBase, based on Bigtable, is a distributed database that is good at real-time random reads and writes of large-scale data sets|
  • One-time data computation: when processing data, the framework reads the data from the storage device, performs the logical operation, and then writes the result back to the storage medium. Spark, by contrast, keeps intermediate data in memory, which suits iterative computation (see the sketch below).
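A rough, hedged sketch of why the in-memory model matters for repeated computation over the same data: with cache(), the file is read once and both subsequent actions reuse the in-memory copy. The object name and input path here are just placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CacheSketch")
    val sc = new SparkContext(conf)

    // Placeholder input path; the cached RDD is read from disk only once
    val data = sc.textFile("spark_demo_data/input").cache()

    // Both actions reuse the in-memory copy instead of re-reading the file
    println(data.count())
    println(data.filter(_.nonEmpty).count())

    sc.stop()
  }
}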

1.3 Spark Or Hadoop

1.4 Spark core module

|Module|Function|
|---|---|
|Spark Core|Spark Core provides the most basic and core functionality of Spark. The other functional modules of Spark are built as extensions on top of Spark Core|
|Spark SQL|Spark SQL is the component Spark uses to operate on structured data; it lets users query data with SQL or HQL|
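As a rough illustration of the Spark SQL row, a minimal sketch of querying structured data with SQL might look like the following. This assumes the extra spark-sql_2.12 dependency (not part of the pom in section 2.1) and a made-up JSON file; it is not part of the WordCount examples below.

import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkSqlSketch")
      .getOrCreate()

    // Hypothetical structured data; any JSON file with a "name" column works
    val df = spark.read.json("spark_demo_data/people.json")
    df.createOrReplaceTempView("people")

    // Query structured data with plain SQL, as described in 1.2
    spark.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name").show()

    spark.stop()
  }
}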

2, Spark quick start

2.1 create Maven project

  1. New Project -> select a suitable project JDK (JDK 1.8) -> Next -> fill in the appropriate GAV (groupId, artifactId, version)

  2. Install the Scala plug-in under Settings -> Plugins (refer to this article)

  3. Add the Scala SDK under Project Structure -> Global Libraries

  4. Create a new module in the Maven project to organize the code

  5. Add the Spark 3.0 dependency to the pom file:

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
    </dependencies>
  6. Configure log4j so you can better track program execution logs: create a log4j.properties file in the resources directory of the Maven project and add the log configuration below (a quick smoke test of the setup follows it):
# Control the log level: only ERROR is displayed
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
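With the dependency and logging in place, a quick smoke test can confirm that the Scala SDK and the spark-core dependency are wired up correctly. This is only a hedged sketch; the object name is made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object EnvCheck {
  def main(args: Array[String]): Unit = {
    // If this prints the Spark version without errors, the project setup works
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("EnvCheck"))
    println(s"Spark version: ${sc.version}")
    sc.stop()
  }
}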

2.2 Spark's WordCount

  1. Let's first write the WordCount flow in the familiar Scala-collections style (groupBy on the words)

If you are not familiar with Scala collection functions, you can read the author's article: V-2. A Full Summary of Common Functions on Scala Collections
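As a quick warm-up before the Spark version, here is the same word count written with plain Scala collections only (no Spark); the sample lines are made up for illustration, and the flatMap / groupBy / map pattern mirrors what we do on RDDs below.

object ScalaCollectionWordCount {
  def main(args: Array[String]): Unit = {
    val lines = List("hello spark", "hello scala", "hello world")

    val counts: Map[String, Int] = lines
      .flatMap(_.split(" "))                      // split every line into words
      .groupBy(word => word)                      // Map(word -> List(word, word, ...))
      .map { case (word, ws) => (word, ws.size) } // count each group

    println(counts)
  }
}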

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCountDemo {
  def main(args: Array[String]): Unit = {
    // Spark API operations

    //1. Create the configuration object and set the necessary parameters
    val conf = new SparkConf()
    conf.setMaster("local[*]") // Local mode
    conf.setAppName("WordCountEasy")

    //2. Create the Spark context
    val sc: SparkContext = new SparkContext(conf)

    //3. Operations
    //3.1 Read the file data
    val line: RDD[String] = sc.textFile("spark_demo_data/input")

    //3.2 Split each line on spaces, so that each word is stored separately
    val words: RDD[String] = line.flatMap(x => x.split(" "))

    //3.3 Group the words: the word itself is the key, the elements of each group are the values
    val groupedWords: RDD[(String, Iterable[String])] = words.groupBy(word => word)

    //3.4 Convert each group: (word, iterator of words) => (word, words.size)
    val resRDD: RDD[(String, Int)] = groupedWords.map(x => (x._1, x._2.size))

    //3.5 Collect the results to the driver
    val res: Array[(String, Int)] = resRDD.collect()

    println("=====")
    println(res.mkString("Array(", ",", ")"))

    //4. Close the connection
    sc.stop()
  }
}
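As an aside, Spark also provides countByValue, which performs the grouping and counting above in a single action and returns a local Map. A minimal sketch, assuming the same input directory as above (the object name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountByValue {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCountByValue")
    val sc = new SparkContext(conf)

    // countByValue groups and counts in one action, returning a local Map
    val wordCounts: scala.collection.Map[String, Long] =
      sc.textFile("spark_demo_data/input")
        .flatMap(_.split(" "))
        .countByValue()

    println(wordCounts)

    sc.stop()
  }
}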
  2. Now that we are learning Spark, we should try Spark's own aggregation operator 'reduceByKey'
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {

    //1. Create the Spark configuration object and set the necessary configuration
    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("StandardWordCount")

    //2. Create the Spark context, passing in the configuration object
    val sc = new SparkContext(conf)

    //3. Read each line of the file
    val line: RDD[String] = sc.textFile("spark_demo_data/input")

    //4. Split each line into words
    val words: RDD[String] = line.flatMap(_.split(" "))

    //5. Mark each word with 1, i.e. (word, 1)
    val wordsWithOne: RDD[(String, Int)] = words.map(x => (x, 1))

    //6. Reduce by key: sum the 1s for each word
    val resRDD: RDD[(String, Int)] = wordsWithOne.reduceByKey(_ + _)

    //7. Collect the result to the driver and print it
    val resArray: Array[(String, Int)] = resRDD.collect()
    println(resArray.mkString("Array(", ",", ")") + "\n")

    //8. Close the resource connection
    sc.stop()
  }
}
  3. Following the idea above, we can also implement WordCount with the higher-order functions of Scala collections
package cn.cyy.spark.core.wordcountdemo.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {

    //1. Create the Spark configuration object and set the necessary configuration
    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("StandardWordCount")

    //2. Create the Spark context, passing in the configuration object
    val sc = new SparkContext(conf)

    //3. Read each line of the file
    val line: RDD[String] = sc.textFile("spark_demo_data/input")

    //4. Split each line into words
    val words: RDD[String] = line.flatMap(_.split(" "))

    //5. Mark each word with 1, i.e. (word, 1)
    val wordsWithOne: RDD[(String, Int)] = words.map(x => (x, 1))

    //5.1 Group the (word, 1) tuples by the word
    val groupedWordWithOne: RDD[(String, Iterable[(String, Int)])] = wordsWithOne.groupBy(tuple => tuple._1)

    //5.2 Reduce each group with the Scala collection's reduce, summing the counts
    val resRDD: RDD[(String, Int)] = groupedWordWithOne.map {
      case (word, list) =>
        list.reduce(
          (x, y) => (x._1, x._2 + y._2)
        )
    }

    //6. Alternative: the same result with Spark's reduceByKey
    //val resRDD: RDD[(String, Int)] = wordsWithOne.reduceByKey(_ + _)

    //7. Collect the result to the driver and print it
    val resArray: Array[(String, Int)] = resRDD.collect()
    println(resArray.mkString("Array(", ",", ")"))

    //8. Close the resource connection
    sc.stop()
  }
}
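Finally, once you are comfortable with reduceByKey, the whole pipeline can be chained into a few lines, sorted, and written out. This is only a hedged sketch: the object name and output directory are placeholders (saveAsTextFile fails if the directory already exists).

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSorted {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCountSorted")
    val sc = new SparkContext(conf)

    sc.textFile("spark_demo_data/input")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)          // most frequent words first
      .saveAsTextFile("spark_demo_data/output") // placeholder output directory

    sc.stop()
  }
}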

Topics: Scala Hadoop Spark