The fifth practice of data acquisition and fusion technology

Posted by konetch on Wed, 24 Nov 2021 22:36:25 +0100

The fifth practice of data mining

Assignment 1

Jingdong information crawling experiment

Job content

  1. Requirements: become familiar with using Selenium to locate HTML elements, crawl Ajax-loaded page data, and wait for HTML elements. Use the Selenium framework to crawl the information and pictures of a chosen category of products on Jingdong Mall.
  2. Candidate sites:
  3. Keywords: chosen freely by each student

Practice process

Replace the SQLite database used in the teacher's sample code with SQL Server:

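    import pyodbc
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def startUp(self, url, key):
        # Initialize the Chrome browser
        chrome_options = Options()
        # `options=` replaces the deprecated `chrome_options=` keyword
        self.driver = webdriver.Chrome(options=chrome_options)

        # Initialize variables
        self.threads = []
        self.No = 0
        self.imgNo = 0
        # Initialize the database
        try:
            # Raw string so the backslash in the UID is kept literally
            self.con = pyodbc.connect(
                r'DRIVER={SQL Server};SERVER=(local);DATABASE=test;UID=DESKTOP-FG7JKFI\Nimble;PWD=29986378;Trusted_Connection=yes')
            self.cursor = self.con.cursor()
            try:
                # Drop the table if it already exists
                self.cursor.execute("drop table phones")
            except pyodbc.Error:
                pass  # the table did not exist yet

            # Create a new table
            sql = "create  table  phones  (mNo  char(32) primary key, mMark char(256),mPrice char(32),mNote char(1024),mFile char(256))"
            self.cursor.execute(sql)
            self.con.commit()
        except Exception as err:
            print(err)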

    def showDB(self):
        try:
            # Raw string so the backslash in the UID is kept literally
            con = pyodbc.connect(
                r'DRIVER={SQL Server};SERVER=(local);DATABASE=test;UID=DESKTOP-FG7JKFI\Nimble;PWD=29986378;Trusted_Connection=yes')
            cursor = con.cursor()
            print("%-8s%-16s%-8s%-16s%s" % ("No", "Mark", "Price", "Image", "Note"))
            cursor.execute("select mNo,mMark,mPrice,mFile,mNote from phones  order by mNo")

            # Print every stored product, one row per line
            rows = cursor.fetchall()
            for row in rows:
                print("%-8s %-16s %-8s %-16s %s" % (row[0], row[1], row[2], row[3], row[4]))

            con.close()
        except Exception as err:
            print(err)
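
The excerpts above create and read the phones table, but the step that writes a crawled product into it is not shown. A minimal sketch of what that step might look like follows. The function name `insert_phone`, its argument names, and the sample values are hypothetical; only the column names come from the post. The demo runs against an in-memory SQLite database purely so it can be executed without a SQL Server instance; the same `?` placeholders are accepted by pyodbc.

```python
import sqlite3  # stand-in so the sketch runs without SQL Server


def insert_phone(cursor, m_no, m_mark, m_price, m_note, m_file):
    """Insert one product row into the phones table.

    `cursor` is any DB-API cursor that accepts `?` placeholders
    (pyodbc and sqlite3 both do). The function and argument names
    are hypothetical; only the column names come from the post.
    """
    sql = ("insert into phones (mNo, mMark, mPrice, mNote, mFile) "
           "values (?, ?, ?, ?, ?)")
    # Parameters are passed separately, so quotes in the product
    # text cannot break the SQL statement.
    cursor.execute(sql, (m_no, m_mark, m_price, m_note, m_file))


# Demo with made-up values against a table that has the same columns.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table phones (mNo char(32) primary key, mMark char(256), "
            "mPrice char(32), mNote char(1024), mFile char(256))")
insert_phone(cur, "000001", "ExampleBrand", "1999.00", "placeholder note", "000001.jpg")
con.commit()
rows = cur.execute(
    "select mNo, mMark, mPrice, mFile, mNote from phones order by mNo").fetchall()
print(rows)
```

With pyodbc, the same function would be called with `self.cursor`, followed by `self.con.commit()`, since pyodbc connections do not autocommit by default.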

Job results

Job 1 source code

Assignment 2

MOOC crawling test

Job content

  1. Requirements: become familiar with using Selenium to locate HTML elements, simulate user login, crawl Ajax-loaded page data, and wait for HTML elements. Use the Selenium framework + MySQL to crawl course resource information from the China MOOC website (course number, course name, teaching progress, course status, and course picture address), and save the pictures in the imgs folder under the root directory of the local project. Each picture is named after its course.

  2. Candidate website: China MOOC website:
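
The requirement to save each cover picture in the imgs folder under the course name can be sketched roughly as below. This is an illustration, not the code used in the assignment: the helper names `image_path` and `download_image`, the character replacement rule, and the `.jpg` fallback extension are all assumptions.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse


def image_path(course_name, img_url, folder="imgs"):
    """Build a local path such as imgs/<course name>.<ext> (hypothetical helper)."""
    # Replace characters that are not allowed in Windows file names.
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", course_name).strip()
    # Take the extension from the URL path, ignoring any query string;
    # fall back to .jpg (an assumption) when the URL has none.
    ext = os.path.splitext(urlparse(img_url).path)[1] or ".jpg"
    return os.path.join(folder, safe_name + ext)


def download_image(course_name, img_url, folder="imgs"):
    """Download img_url into `folder` and return the path of the saved file."""
    os.makedirs(folder, exist_ok=True)
    path = image_path(course_name, img_url, folder)
    with urllib.request.urlopen(img_url, timeout=30) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

Sanitizing the name first matters because course names may contain characters such as `:` or `?`, which cannot appear in file names on Windows.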

Practice process

  1. Implement the login function

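    import time
    from fake_useragent import UserAgent
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Login page link
    url = ""
    ua = UserAgent(path="D:\\program\\python\\CrawlLearning\\fake_useragent_0.1.11.json").random
    chrome_options = Options()
    chrome_options.add_argument("user-agent=" + ua)
    # `options=` replaces the deprecated `chrome_options=` keyword
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)  # open the login page
    time.sleep(10)  # allow time to scan the QR code and log in

  2. Locate the page information

    Taking the course name and the other course fields as an example: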

    def getInfo(driver, infos):
        """Get course information.

        :param driver: the created webdriver
        :param infos: list used to store the information
        :return: course information list infos[[name, school, process, date, imgurl], ...]
        """
        # Requires: from selenium.webdriver.common.by import By
        # (the find_element_by_* helpers were removed from recent Selenium 4 releases)
        trs = driver.find_elements(By.XPATH, '//div[@class="box"]')
        for tr in trs:
            name = tr.find_element(By.XPATH, "./a/div/div/div/div/span[@class='text']").text     # Course name
            school = tr.find_element(By.XPATH, "./a/div/div/div/a").text     # School
            process = tr.find_element(By.XPATH, './a/div/div/div[@class="course-progress"]/div/div/a/span[@class="course-progress-text-span"]').text     # Learning progress
            date = tr.find_element(By.XPATH, './a/div/div/div[@class="course-status"]').text     # End date
            imgurl = tr.find_element(By.XPATH, './a/div[@class="img"]/img').get_attribute("src")     # Cover address
            # Collect the fields of this course as one record
            infos.append([name, school, process, date, imgurl])
        return infos

  3. Save the data to the database

    def savetoDB(infos):
        # Store to the database (raw string keeps the backslash in the UID)
        conn = pyodbc.connect(
            r'DRIVER={SQL Server};SERVER=(local);DATABASE=test;UID=DESKTOP-FG7JKFI\Nimble;PWD=29986378;Trusted_Connection=yes')
        cur = conn.cursor()
        # Drop the table if it already exists
        try:
            cur.execute("DROP TABLE classInfo")
        except pyodbc.Error:
            pass  # the table did not exist yet
        cur.execute('CREATE TABLE classInfo (Cname char(200),school char(200), Cschedule char(100),Cdate char(100),Cimg char(500))')
        # Write the data
        for s in infos:
            sql = 'insert into classInfo([Cname],[school],[Cschedule],[Cdate],[Cimg]) values(?,?,?,?,?)'
            cur.execute(sql, (s[0], s[1], s[2], s[3], s[4]))
            print(s[1], "saving complete")
        conn.commit()  # pyodbc does not autocommit by default
        conn.close()

Job results

Job 2 source code


This experiment implements the login function, because some information is displayed only after logging in. Pay attention to the sleep time during login: the page also needs some time to redirect after login, and if the wait is too short the crawler may not be able to scrape the page.
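
One way to avoid guessing a fixed sleep time is to poll until the browser has left the login page. The sketch below assumes that a successful login changes the browser URL, which the post does not confirm; `wait_for_login` and its parameters are hypothetical names. It only relies on the `current_url` attribute that Selenium WebDriver objects provide.

```python
import time


def wait_for_login(driver, login_url, timeout=60, poll=0.5):
    """Poll until the browser leaves login_url or the timeout expires.

    `driver` only needs a `current_url` attribute, which Selenium
    WebDriver objects provide. Returns True once the URL has changed,
    False if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.current_url != login_url:
            return True
        time.sleep(poll)  # short pause between checks
    return False
```

A call such as `wait_for_login(driver, url)` could then stand in for `time.sleep(10)`. Selenium's own `WebDriverWait` class offers the same kind of polling against arbitrary conditions, for example the presence of an element that only appears after login.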

Assignment 3

Flume experiment

Job content

  1. Requirements: understand Flume's architecture and key features, and master the use of Flume for log collection tasks. Complete the Flume log collection experiment, which includes the following steps:

    Task 1: open MapReduce service

    Task 2: generate test data from Python script

    Task 3: configure Kafka

    Task 4: install Flume client

    Task 5: configure Flume to collect data

Practice process

Task 1: open MapReduce service

Task 2: generate test data from Python script
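
The script used in this task is not reproduced in the post. As a rough illustration only, a test data generator could look like the sketch below. The record layout (timestamp, user id, action), the function names, and the output path are all assumptions made for this example and are not taken from the experiment.

```python
import random
import time


def make_record(rng, now=None):
    """Return one comma-separated test record: timestamp, user id, action.

    The three fields are made up for illustration.
    """
    ts = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    user_id = rng.randint(1, 1000)
    action = rng.choice(["view", "click", "buy"])
    return f"{ts},{user_id},{action}"


def write_test_data(path, count, seed=None):
    """Write `count` records to `path`, one per line, and return the count."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(count):
            f.write(make_record(rng) + "\n")
    return count
```

Calling `write_test_data("test.txt", 100)` would produce a 100-line file that a Flume source could then pick up; the file name here is a placeholder.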

Task 3: configure Kafka

Task 4: install Flume client

Task 5: configure Flume to collect data

Capture successful