python crawler xpath case

Posted by EvilPrimate on Mon, 20 Dec 2021 21:34:09 +0100

xpath review

Import the third-party library, send a request to the web page to get the HTML, and load the HTML into an element object (an element tree). You can then call its xpath method: by passing in a path expression as a string, you can locate elements along that path.
Requirements:
Extract the text and class values from the first three li tags and save them to a csv file

from lxml import etree
import csv
# etree contains some features of xpath
html = """
<html>
    <head>
        <title>test</title> 
    </head>
    <body>
         <li class="item-0">first item</li>
         <li class="item-1">second item</li>
         <li class="item-inactive">third item</li>
         <li class="item-1">fourth item</li>
         <div>
            <li class="item-0">fifth item</li>
         </div>
         <span>
            <li class="item-0">sixth item</li>
            <div>
                <li class="item-0">eighth item</li>
            </div>
         </span>
     </body>
</html>
"""
tree = etree.HTML(html)
# Requirements: Get content from the first three li Tags
# / selects direct children (one level down), // selects descendants at any depth
# text = tree.xpath('//li/text()')  # ['first item', 'second item', 'third item', 'fourth item', 'fifth item', 'sixth item', 'eighth item']
# xpath() returns all the li text as a list; slice it by index (left-closed, right-open); to get attributes, use @attr instead of text()
text = tree.xpath('//li/text()')[:3]  # ['first item', 'second item', 'third item']
cs = tree.xpath('//li/@class')[:3]  # ['item-0', 'item-1', 'item-inactive'], get the value of the class attribute
# Stored as {'class':'item-0','text':'first item'},{'class':'item-1','text':'second item'}
# print(text)
# print(cs)
lis_data = []  # Used to store all data
for c in cs:
    d = {}  # Define a dictionary to hold a piece of data
    # print(c)  # item-0 item-1 item-inactive
    # # print(cs.index(c))  # 0 1 2 Print content and index value one-to-one correspondence
    # print(text[cs.index(c)])
    # print('*'*50)  # Loop through pairs to extract text and attribute values.
    d['class'] = c
    d['text'] = text[cs.index(c)]  # Loop to add key and value values to an empty dictionary
    # print(d)
    lis_data.append(d)  # Add the obtained dictionary to the empty list
# print(lis_data)
# Save data
header = ('class', 'text')
with open('lis_data.csv', 'w', encoding='utf-8', newline="") as f:
    write = csv.DictWriter(f, header)
    write.writeheader()
    write.writerows(lis_data)
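
To verify the result, the csv file can be read back with csv.DictReader; a minimal sketch that assumes the lis_data.csv written above:

import csv

# Read each saved row back as a dictionary and print it
with open('lis_data.csv', 'r', encoding='utf-8', newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['class'], row['text'])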

Case Study: Crawling Douban Top250 Movies

Goal

Learn how to parse data with xpath and become familiar with the xpath workflow.
Understand the crawling requirements by observing the target site; the data is ultimately saved to csv.

Requirements

What to crawl: title, rating, number of ratings, quote, detail page url

Page Analysis

First, copy part of the page content and check whether it appears in the page source (right-click and choose "View page source"). All the content we need is in the source, so we only need to send a request to the target site (https://movie.douban.com/top250) and extract the data from the source code. The page is loaded statically, i.e. it is server-side rendered. To get all 250 entries, crawl the first page first and then change the start value dynamically to fetch the other pages.
Target url: https://movie.douban.com/top250
We can try changing the url of the first page to start=0 and still access the first page
Page flipping:
https://movie.douban.com/top250?start=0   Page 1  (0 * 25)
https://movie.douban.com/top250?start=25  Page 2  (1 * 25)
https://movie.douban.com/top250?start=50  Page 3  (2 * 25)
https://movie.douban.com/top250?start=75  Page 4  (3 * 25)
start = (page-1) * 25
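
As a check on this rule, a tiny sketch that only generates the ten page urls (no request is sent yet):

# Generate the 10 page urls from start = (page - 1) * 25
base = 'https://movie.douban.com/top250?start={}'
for page in range(1, 11):
    print(base.format((page - 1) * 25))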


Hover the mouse over the name of the first movie on the page, right-click and choose Inspect. When the cursor is placed on "The Shawshank Redemption", the corresponding title is highlighted, and the whole movie's information is highlighted at <div class="info">, indicating that the information we need is under this tag. Place the cursor at <div class="pic"> and the poster is highlighted, so the picture information can be taken from there. When the cursor is placed on the <li> tag, all the information of the first movie is highlighted, indicating that each <li> tag contains one whole movie; looking further, there are 25 such <li> tags in total, and hovering over each one highlights one of the 25 movies on this page. The Elements panel shows the page after it has finished rendering, so we still go back to the page source to confirm it is consistent: for example, searching for class="info" in the source turns up the same 25 tags.
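
As a quick sanity check, the number of class="info" blocks in the raw source can also be counted in code; a sketch, assuming the same kind of User-Agent header used later in the article:

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}  # shortened UA; the full browser string below also works
res = requests.get('https://movie.douban.com/top250', headers=headers)
tree = etree.HTML(res.content.decode('utf-8'))
# each movie on the page has one div with class="info", so this should print 25
print(len(tree.xpath('//div[@class="info"]')))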

Open the xpath helper extension with the shortcut Ctrl+Shift+X to work out the xpath location of the movie name.

Steps

  • First function: send the request and get the page source.
  • Second function: parse the data from the source. Each movie's data sits in a div tag with class="info" inside its li tag, so parsing is done within that div.
  • Third function: save the data, shaped like [{'title':'xxx','score':'xxx','com_num':'xxx','quote':'xxx','link_url':'xxxx'}, {xxx}, {xxx}].
  • Tools needed: requests, lxml (xpath), csv

Code implementation

import requests
from lxml import etree
import csv

# Send a request to the given url and return the page source
def get_url(url):

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62'
    }
    res = requests.get(url, headers=headers)
    html = res.content.decode("utf-8")
    # print(html)
    return html

# Parse data
def s_sourse(h):
    # Loading Web page source code into an element object
    tree = etree.HTML(h)
    divs = tree.xpath('//div[@class="info"]')
    # print(len(divs))  # 25
    last_data = []  # List holding all data
    for div in divs:  # Use xpath to get info information for all movies and extract it from each movie object
        d = {}  # A dictionary used to store one piece of data
        # print(div)
        # Use the relative path of the object to get the information for each movie. The subscript of the list starts at 0, and strip() removes the blank characters at the beginning and end of the string.
        # If you remove the white space in the middle of a string, you can use replace()
        title = div.xpath('./div[@class="hd"]/a/span[@class="title"][1]/text()')[0].strip()
        score = div.xpath('./div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0].strip()
        com_mun = div.xpath('./div[@class="bd"]/div[@class="star"]/span[4]/text()')[0].strip()
        quote = div.xpath('./div[@class="bd"]/p/span/text()')
        link_url = div.xpath('./div[@class="hd"]/a/@href')[0].strip()
        # print(title, score, com_mun, quote, link_url)
        # [0].strip() takes out the first element of the list for easy storage
        # Add to empty dictionary as key, value value
        d['title'] = title
        d['score'] = score
        d['com_mun'] = com_mun
        if quote:
            d['quote'] = quote[0].strip()
        else:
            d['quote'] = ""
        d['link_url'] = link_url
        # print(d)
        # Insert a dictionary containing one piece of data into the list
        last_data.append(d)
        # The data for each movie is stored in a dictionary and for all movies in a large list
    # print(last_data)
    return last_data

# Save data
def save_page(last_data, header):
    # w starts from scratch and a is an append
    with open('movie_data2.csv', 'a', encoding='utf-8', newline="") as f:
        writ = csv.DictWriter(f, header)
        # writeheader() takes no arguments; it writes the header row
        writ.writeheader()
        # writerows writes multiple rows of data at a time
        writ.writerows(last_data)

# Main function calls function
def main():

    for m in range(10):
        print(f'Crawling #{m}page')
        url = f'https://movie.douban.com/top250?start={m*25}'
        # url = f'https://movie.douban.com/top250'
        h = get_url(url)
        # Headers can be passed in as dictionaries, lists, tuples
        header = ('title', 'score', 'com_mun', 'quote', 'link_url')
        lis = s_sourse(h)
        save_page(lis, header)


if __name__ == '__main__':
    main()

So far, page crawling works. There is one pit to note: some movies have an empty quote, which makes the program raise "IndexError: list index out of range". When taking the quote with quote = div.xpath('./div[@class="bd"]/p/span/text()'), you cannot chain [0].strip() directly, because when the result list is empty there is no element to index or strip. Instead, check the quote when adding it to the dictionary: if it is not empty, store quote[0].strip(), otherwise store an empty string, which avoids the error.
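
The same guard can be wrapped in a small helper; a sketch of the pattern (first_or_default is just a name made up here, not part of the original script):

def first_or_default(items, default=""):
    # xpath() returns a list; take the first element if present, otherwise the default
    return items[0].strip() if items else default

# inside the parsing loop it would be used like:
# d['quote'] = first_or_default(div.xpath('./div[@class="bd"]/p/span/text()'))
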
Once you check the results, you will find that the header row is written repeatedly, once per page. Here are a few ways to remove the duplicate headers.

Method 1: Change only the main function
import time  # time.sleep() is used below; add this together with the other imports at the top of the file

def main():
    data_box = []  # Define a box to hold all the data
    # Headers can be passed in as dictionaries, lists, tuples
    header = ('title', 'score', 'com_mun', 'quote', 'link_url')
    for m in range(10):
        time.sleep(5)
        print(f'Crawling #{m}page')
        url = f'https://movie.douban.com/top250?start={m * 25}'
        # url = f'https://movie.douban.com/top250'
        h = get_url(url)
        data_box += s_sourse(h)  # Put all the return values of the function in a box
    # print(data_box)
    save_page(data_box, header)

Method 1: Define a box (data_box) in the main function to hold all the data, append every page's parsed results to it, and then pass data_box as a parameter to the save_page() function, so when the rows are written the header is written only once. During testing, my IP was blocked; using the proxy IP settings described in crawler article (3) and adding the cookie copied from a logged-in browser session to the request headers made the page accessible again. To avoid being blocked again, a delay (time.sleep(5)) was added to each loop iteration, which requires import time at the top of the file.
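
A hedged sketch of what that request setup can look like with requests (the proxy address and cookie string below are placeholders, not working values):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',  # same style of UA as above
    # paste the Cookie value copied from your own logged-in browser session
    'Cookie': 'your_cookie_string_here',
}
# placeholder proxy; replace with a real proxy address if you use one
proxies = {'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'}

res = requests.get('https://movie.douban.com/top250', headers=headers, proxies=proxies)
print(res.status_code)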

Method 2: Write the header in the main function ahead of time; inside the loop, save_page() only writes the data rows
def save_page(last_data, header):
    # w starts from scratch and a is an append
    with open('movie_data3.csv', 'a', encoding='utf-8', newline="") as f:
        writ = csv.DictWriter(f, header)

        # writerows writes multiple rows of data at a time
        writ.writerows(last_data)

# Main function calls function
def main():
    header = ('title', 'score', 'com_mun', 'quote', 'link_url')
    with open('movie_data3.csv', 'a', encoding='utf-8', newline="") as f:
        writ = csv.DictWriter(f, header)
        writ.writeheader()
    for m in range(10):
        time.sleep(5)
        print(f'Crawling #{m}page')
        url = f'https://movie.douban.com/top250?start={m * 25}'
        # url = f'https://movie.douban.com/top250'
        h = get_url(url)
        lis = s_sourse(h)  # Parse one page of data
        # print(lis)
        save_page(lis, header)

Method 2: In the main function, write the file's header once outside the loop; then, when save_page() writes the data, only the crawled rows are written.

Method 3: Open the file and write the header in the main function, skip the save function entirely, and write the rows directly with writerows().
def main():
    header = ('title', 'score', 'com_mun', 'quote', 'link_url')
    f = open('movie_data4.csv', 'a', encoding='utf-8', newline="")
    writ = csv.DictWriter(f, header)
    # writeheader() takes no arguments; it writes the header row
    writ.writeheader()
    for m in range(10):
        time.sleep(5)
        print(f'Crawling #{m}page')
        url = f'https://movie.douban.com/top250?start={m*25}'
        # url = f'https://movie.douban.com/top250'
        h = get_url(url)
        # Headers can be passed in as dictionaries, lists, tuples

        lis = s_sourse(h)
        # save_page(lis, header)
        writ.writerows(lis)
    f.close()  # close the file once all pages have been written

Method 3: Write the header directly in the main function; the rows returned by the parsing function s_sourse() are no longer passed to save_page() but are written straight to the file.
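
Since Method 3 keeps the file handle open for the whole loop, the same idea can be written with a with block so the file is closed automatically; a sketch of that variant, reusing get_url() and s_sourse() from above:

def main():
    header = ('title', 'score', 'com_mun', 'quote', 'link_url')
    with open('movie_data4.csv', 'a', encoding='utf-8', newline="") as f:
        writ = csv.DictWriter(f, header)
        writ.writeheader()
        for m in range(10):
            time.sleep(5)
            url = f'https://movie.douban.com/top250?start={m * 25}'
            lis = s_sourse(get_url(url))
            writ.writerows(lis)  # the file is closed automatically when the with block ends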

Method 4: Declare the loop variable m in the main function as a global variable; in the save function, check m and write the header only on the first write, otherwise skip it
# Save data
def save_page(last_data, header):
    # w starts from scratch and a is an append
    with open('movie_data5.csv', 'a', encoding='utf-8', newline="") as f:
        writ = csv.DictWriter(f, header)
        if m == 0:
            writ.writeheader()
        # writerows writes multiple rows of data at a time
        writ.writerows(last_data)

# Main function calls function
def main():
    global m
    for m in range(10):
        time.sleep(5)
        print(f'Crawling #{m}page')
        url = f'https://movie.douban.com/top250?start={m * 25}'
        # url = f'https://movie.douban.com/top250'
        header = ('title', 'score', 'com_mun', 'quote', 'link_url')
        h = get_url(url)
        lis = s_sourse(h)  # Parse one page of data
        # print(lis)
        save_page(lis, header)

Method 4: The clever point is that the loop variable m is declared global in main(), which makes m visible in other functions; save_page() checks m and writes the header only when m == 0 (the first page), otherwise it skips the header.
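
Relying on a global variable works here but is easy to break; an alternative sketch passes a flag in as a normal parameter instead (write_header is a name introduced here, not from the original script):

def save_page(last_data, header, write_header=False):
    # mode 'a' appends; the header is written only when the caller asks for it
    with open('movie_data5.csv', 'a', encoding='utf-8', newline="") as f:
        writ = csv.DictWriter(f, header)
        if write_header:
            writ.writeheader()
        writ.writerows(last_data)

# in main(): save_page(lis, header, write_header=(m == 0))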

Topics: Python crawler