65. Spark comprehensive case (Sogou search log analysis)

Posted by Ace_Online on Thu, 27 Jan 2022 16:49:35 +0100

Sogou Labs: the search engine query log dataset is a collection of web query log data from the Sogou search engine, covering about one month (June 2008) of users' web queries and clicks. It provides a benchmark research corpus for researchers studying the behavior of Chinese search engine users.

catalogue

Original data display

Business requirements

Business logic

Word segmentation tool

Maven dependency

Code implementation

Effect display

Sogou search log official website: http://www.sogou.com/labs/resource/q.php

Mini log download link: http://download.labs.sogou.com/dl/sogoulabdown/SogouQ/SogouQ.mini.zip

Note: since this is only for testing, the mini version of the data is sufficient.

Original data display

Note: there are 10,000 records of raw data. The fields are: access time \t user ID \t [query term] \t rank of the URL in the returned results \t sequence number of the user's click \t URL clicked by the user
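As a quick illustration, here is a minimal sketch (the sample line is the same record used in the HanLP case below) that splits one record into its six fields:

// Minimal sketch: parse one sample record of the format described above
object ParseSampleLine {
  def main(args: Array[String]): Unit = {
    val line = "00:00:00\t2982199073774412\t[360安全卫士]\t8 3\tdownload.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html"
    val fields: Array[String] = line.split("\\s+") // "\\s+" also splits the single space between rank and click order
    println(fields.length) // 6
    fields.foreach(println) // access time, user ID, [query term], rank, click order, URL
  }
}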

Business requirements

Requirement description: segment the SougouSearchLog data and count the following indicators:

  1. Popular search terms
  2. User popular search terms (with user id)
  3. Search heat in each time period

 

Business logic

Business logic: to analyze the different fields of the SougouQ query log data, use SparkContext to read the log data and encapsulate it into an RDD dataset, then invoke Transformation and Action functions to carry out the various statistics and analyses.
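A minimal sketch of that flow (assuming local mode and the sample file path used later; the full version is in the Code implementation section). Transformations such as map and reduceByKey are lazy; only an Action such as take triggers the job:

import org.apache.spark.{SparkConf, SparkContext}

object SparkFlowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
    val lines = sc.textFile("data/input/SogouQ.sample") // read the log data into an RDD
    val userCounts = lines
      .map(line => (line.split("\\s+")(1), 1)) // Transformation: (userId, 1), lazy
      .reduceByKey(_ + _)                      // Transformation: count queries per user, lazy
    userCounts.take(5).foreach(println)        // Action: triggers the actual computation
    sc.stop()
  }
}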

Word segmentation tool

HanLP official website: https://github.com/hankcs/HanLP

Main features of HanLP: built on the latest HanLP technology, it is trained on a 100-million-scale general corpus and can be called directly through its API, which is simple and efficient.

Maven dependency

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.7</version>
</dependency>
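If the project is built with sbt rather than Maven, the equivalent dependency (the same portable artifact) would be:

// build.sbt equivalent of the Maven dependency above
libraryDependencies += "com.hankcs" % "hanlp" % "portable-1.7.7"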

HanLP introduction case

package org.example.spark

import java.util

import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.seg.common.Term


/**
 * Author tuomasi
 * Desc HanLP Introductory case
 */
object HanLPTest {
  def main(args: Array[String]): Unit = {
    val words = "[HanLP Introductory case]"
    val terms: util.List[Term] = HanLP.segment(words) // word segmentation
    println(terms) // Print the Java list directly: [[/w, HanLP/nx, introduction/vn, case/n, ]/w]
    import scala.collection.JavaConverters._
    println(terms.asScala.map(_.word)) // Convert to a Scala collection: ArrayBuffer([, HanLP, introduction, case, ])

    val cleanWords1: String = words.replaceAll("\\[|\\]", "") // Replace "[" or "]" with the empty string
    println(cleanWords1) // HanLP Introductory case
    println(HanLP.segment(cleanWords1).asScala.map(_.word)) // ArrayBuffer(HanLP, introduction, case)

    val log = """00:00:00 2982199073774412    [360安全卫士]   8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html"""
    val cleanWords2 = log.split("\\s+")(2) // [360安全卫士] ("360 security guard"); the query term contains no whitespace, so split("\\s+") keeps it intact
      .replaceAll("\\[|\\]", "") // 360安全卫士
    println(HanLP.segment(cleanWords2).asScala.map(_.word)) // ArrayBuffer(360, 安全卫士)
  }
}

Console printing effect

(Screenshot of the console output, showing the segmentation results printed above.)

Code implementation

package org.example.spark

import com.hankcs.hanlp.HanLP
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

/**
 * Author tuomasi
 * Desc Requirement: segment SougouSearchLog and count the following indicators:
 * 1.Popular search terms
 * 2.User popular search terms (with user id)
 * 3.Search heat in each time period
 */
object SougouSearchLogAnalysis {
  def main(args: Array[String]): Unit = {
    //TODO 0. Prepare the environment
    val conf: SparkConf = new SparkConf().setAppName("spark").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")
    //TODO 1. Load data
    val lines: RDD[String] = sc.textFile("data/input/SogouQ.sample")

    //TODO 2. Processing data
    //Encapsulate data
    val sogouRecordRDD: RDD[SogouRecord] = lines.map(line => { // map: one line in, one record out
      val arr: Array[String] = line.split("\\s+")
      SogouRecord(
        arr(0),
        arr(1),
        arr(2),
        arr(3).toInt,
        arr(4).toInt,
        arr(5)
      )
    })

    //Cutting data
    /* val wordsRDD0: RDD[mutable.Buffer[String]] = sogouRecordRDD.map(record => {
         val wordsStr: String = record.queryWords.replaceAll("\\[|\\]", "") // "360 security guard"
         import scala.collection.JavaConverters._ // convert Java collections to Scala collections
         HanLP.segment(wordsStr).asScala.map(_.word) // ArrayBuffer(360, security guard)
       })*/

    val wordsRDD: RDD[String] = sogouRecordRDD.flatMap(record => { // flatMap: one record becomes many words, e.g. "[360 security guard]" ==> [360, security guard]
      val wordsStr: String = record.queryWords.replaceAll("\\[|\\]", "") // "360 security guard"
      import scala.collection.JavaConverters._ // convert Java collections to Scala collections
      HanLP.segment(wordsStr).asScala.map(_.word) // ArrayBuffer(360, security guard)
    })

    //TODO 3. statistical indicators
    //--1. Popular search terms
    val result1: Array[(String, Int)] = wordsRDD
      .filter(word => !word.equals(".") && !word.equals("+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, false)
      .take(10)

    //--2. User popular search terms (with user id)
    val userIdAndWordRDD: RDD[(String, String)] = sogouRecordRDD.flatMap(record => { // flatMap: one record ==> multiple (userId, word) pairs
      val wordsStr: String = record.queryWords.replaceAll("\\[|\\]", "") // "360 security guard"
      import scala.collection.JavaConverters._ // convert Java collections to Scala collections
      val words: mutable.Buffer[String] = HanLP.segment(wordsStr).asScala.map(_.word) // ArrayBuffer(360, security guard)
      val userId: String = record.userId
      words.map(word => (userId, word))
    })
    val result2: Array[((String, String), Int)] = userIdAndWordRDD
      .filter(t => !t._2.equals(".") && !t._2.equals("+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, false)
      .take(10)

    //--3. Search heat in each time period
    val result3: Array[(String, Int)] = sogouRecordRDD.map(record => {
      val timeStr: String = record.queryTime
      val hourAndMinuteStr: String = timeStr.substring(0, 5) // "HH:mm" prefix of "HH:mm:ss"
      (hourAndMinuteStr, 1)
    }).reduceByKey(_ + _)
      .sortBy(_._2, false)
      .take(10)

    //TODO 4. Output results
    result1.foreach(println)
    result2.foreach(println)
    result3.foreach(println)

    //TODO 5. Release resources
    sc.stop()
  }
  // Case class to encapsulate one record of the data

  /**
   * User search click web Record
   *
   * @param queryTime  Access time, format: HH:mm:ss
   * @param userId     User ID
   * @param queryWords Query words
   * @param resultRank The ranking of the URL in the returned results
   * @param clickRank  Sequence number clicked by the user
   * @param clickUrl   URL clicked by the user
   */
  case class SogouRecord(
                          queryTime: String,
                          userId: String,
                          queryWords: String,
                          resultRank: Int,
                          clickRank: Int,
                          clickUrl: String
                        )
}

Effect display

Note: the SougouSearchLog is segmented and the following indicators are counted: popular search terms, popular search terms per user (with user ID), and search heat in each time period. The output is consistent with the expected results.

Topics: Operation & Maintenance Big Data Hadoop Spark