Application and practice of software engineering in Shandong University -- ECommerceCrawlers code analysis

Posted by estero2002 on Mon, 20 Dec 2021 21:51:38 +0100

2021SC@SDUSC

catalogue

1, Abstract

2, get_parks_companies_threads.py code analysis

1. Part 1

2. Part 2

3. Part 3

4. Part 4

3, Summary

1, Abstract

This blog is the third and last blog for the third project, "QiChaCha". In this chapter I analyze the code of the rest of the project. Since the main contents of the remaining files "get_parks_companies.py" and "get_parks_companies_threads.py" are similar, and the only difference is that the latter runs with multiple threads, I will analyze the code of "get_parks_companies_threads.py".

2, get_parks_companies_threads.py code analysis

1. Part 1

    def __init__(self, cookie, proxies, companies_name):
        self.cookie = cookie
        self.proxies = proxies
        self.companies_name = companies_name
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Cookie': self.cookie,
            'DNT': '1',
            'Host': 'www.qichacha.com',
            'Referer': 'https://www.qichacha.com/more_zonecompany.html?id=000c85b2a120712454f4c5b74e4fdfae&p=2',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
        }
        self.path = './csv/'
        self.file_name = self.path+self.companies_name+'.csv'
        self.ListTask = []
        self.csv_data = pd.read_csv('./csv/National Industrial Park Information.csv')
        self.length = len(self.csv_data)
        self.work()

self.cookie = cookie
self.proxies = proxies
self.companies_name = companies_name

The __init__ function is the initialization function used to set up some parameters. These three lines initialize the cookie, proxies and companies_name parameters.
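As a minimal usage sketch (the class name QichachaSpider and the argument values below are my own assumptions, since the surrounding class definition is not shown in this excerpt), instantiating the class is enough to start the crawl, because __init__ ends by calling self.work():

# Usage sketch only; class name and values are assumptions, not project code.
cookie = 'QCCSESSID=your-login-session-cookie'    # session cookie copied from a logged-in browser
proxies = {'http': 'http://127.0.0.1:8888'}       # optional proxy entry
companies_name = 'park_companies'                 # output becomes ./csv/park_companies.csv

spider = QichachaSpider(cookie, proxies, companies_name)  # __init__ ends by calling self.work()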

self.headers = { }

Here the custom headers are defined; they are used to get around some simple anti-crawling measures.

        self.path = './csv/'
        self.file_name = self.path+self.companies_name+'.csv'
        self.ListTask = []
        self.csv_data = pd.read_csv('./csv/National Industrial Park Information.csv')
        self.length = len(self.csv_data)
        self.work()

Here the path of the output CSV file is built (the file itself is created on the first write), and the previously prepared './csv/National Industrial Park Information.csv' file is read to obtain the number of data rows. The output file is used to store the enterprise information collected below.
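A standalone sketch of this step (only the paths from the code above are used; the output file name is a placeholder):

import pandas as pd

# Read the park list produced earlier; its row count drives the crawl loop.
csv_data = pd.read_csv('./csv/National Industrial Park Information.csv')
length = len(csv_data)
print('industrial parks to process:', length)

# Output path for the enterprise records; the file itself appears on the first append.
file_name = './csv/' + 'my_companies' + '.csv'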

2. Part 2

    def get_companies(self, id, page_no):
        url = 'https://www.qichacha.com/more_zonecompany.html?id={}&p={}'.format(id, page_no)
        while True:
            try:
                with requests.get(url, headers=self.headers) as response:
                    html = response.text
                    parseHtml = etree.HTML(html)
                    return parseHtml
            except Exception as e:
                log('Connection failure, repeat task!')
                pass

 url = 'https://www.qichacha.com/more_zonecompany.html?id={}&p={}'.format(id, page_no)

The parameters of the get_companies function are self, id and page_no: id is the ID of an industrial park, and page_no is the page number within that park's enterprise list (once a park is selected, each page of the site shows a batch of its enterprises).

This url points to the enterprise list of a given industrial park. (The current enterprise query URL has since been updated and differs from the URL used when this code was written.) Take Suzhou Industrial Park as an example:

 "https://www.qcc.com/more_zonecompany.html?id=b19cba4b1694f59019fc3c7c95bac24d&p=2 ”The id in represents the Industrial Park id, and p = {} represents the current page number.

        while True:
            try:
                with requests.get(url, headers=self.headers) as response:
                    html = response.text
                    parseHtml = etree.HTML(html)
                    return parseHtml
            except Exception as e:
                log('Connection failure, repeat task!')
                pass

Inside the while loop, requests.get fetches the page with the given url and headers; the returned HTML text is converted into an etree object, stored in parseHtml, and returned. If an error occurs while fetching the page, the message 'Connection failure, repeat task!' is logged and the loop retries the request.
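Note that the original loop retries forever until a request succeeds. As a hedged alternative sketch (max_retries, the timeout and the back-off are my own additions, not part of the project code), the same idea with an upper bound on retries looks like this:

import time
import requests
from lxml import etree

def get_page(url, headers, max_retries=5):
    """Fetch a list page and return it as an lxml tree, with a bounded number of retries."""
    for attempt in range(max_retries):
        try:
            with requests.get(url, headers=headers, timeout=10) as response:
                return etree.HTML(response.text)
        except Exception:
            print('Connection failure, retrying ({}/{})'.format(attempt + 1, max_retries))
            time.sleep(2)   # brief back-off before the next attempt
    return None             # give up; the caller must handle a missing page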

3. Part 3

    def get_companies_all(self, name_thread, id, province, city, county, park, area, numcop):
        num_page = numcop // 10 + 1

        for i in range(1, num_page+1):
            num_writer = 0  # Count whether any information was written (anti-crawling check)
            # for i in range(1, 2):
            parseHtml = self.get_companies(id, i)
            # '/firm_2468290f38f4601299b29acdf6eccce9.html'
            rUrls = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/a/@href')
            # 'Linhai interworking Automobile Sales Co., Ltd.'
            rTitle = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/a/text()')
            # 'Huang Jianyong'
            rPerson = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[1]/a/text()')
            # 'Registered capital: RMB 10 million '
            rCapital = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[1]/span[1]/text()')
            # 'date of establishment: September 8, 2017 '
            rSetTime = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[1]/span[2]/text()')
            # '\nEmail: 3093847569@QQ.COM\n               '
            rEmail = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[2]/text()')
            # "Tel: 0576-85323665"
            rPhone = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[2]/span/text()')
            # '\nAddress: Hengda home building materials City, Jiangnan street, Linhai City, Taizhou City, Zhejiang Province (No. 112, Jingjiang South Road)\n'
            rAddress = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[3]/text()')
            # 'In operation'
            rState = parseHtml.xpath(
                '//div[@class="e_zone-company"]/section/table/tbody/tr/td[3]/span/text()')

The get_companies_all function obtains all enterprises belonging to one industrial park.

The for loop fetches the park's enterprise list one page at a time. Each parseHtml object obtained above is parsed with XPath, and the relevant information of each enterprise, such as its detail-page URL, name, legal representative, registered capital and so on, is saved into the corresponding variables.
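To make the XPath step concrete, here is a self-contained sketch that runs the same expressions against a made-up miniature of the list-page markup (the HTML fragment is only detailed enough for these three expressions to match; the sample values reuse the ones from the comments above):

from lxml import etree

html = '''
<div class="e_zone-company"><section><table><tbody>
  <tr>
    <td>1</td>
    <td><a href="/firm_2468290f38f4601299b29acdf6eccce9.html">Example Auto Sales Co., Ltd.</a>
        <p><a>Huang Jianyong</a><span>Registered capital: RMB 10 million</span></p></td>
    <td><span>In operation</span></td>
  </tr>
</tbody></table></section></div>
'''

parseHtml = etree.HTML(html)
rUrls = parseHtml.xpath('//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/a/@href')
rTitle = parseHtml.xpath('//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/a/text()')
rPerson = parseHtml.xpath('//div[@class="e_zone-company"]/section/table/tbody/tr/td[2]/p[1]/a/text()')
print(rUrls)    # ['/firm_2468290f38f4601299b29acdf6eccce9.html']
print(rTitle)   # ['Example Auto Sales Co., Ltd.']
print(rPerson)  # ['Huang Jianyong']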

4. Part 4

            num_current = len(rUrls)
            for num in range(num_current):
                try:
                    url = 'https://www.qichacha.com'+rUrls[num]
                    company = rTitle[num]
                    person = rPerson[num]
                    capital = rCapital[num].replace('Registered capital:', '')
                    settime = rSetTime[num].replace('Date of establishment:', '')
                    email = rEmail[num].replace(
                        '\n', '').replace('Email:', '').strip()
                    phone = rPhone[num].replace('Telephone:', '')
                    address = rAddress[num].replace(
                        '\n', '').replace('Address:', '').strip()
                    state = rState[num]
                    L = [province, city, county, park, area, numcop, company,
                         person, capital, settime, email, phone, address, state, url]
                    with open(self.file_name, 'a', newline='', encoding='utf-8') as f:
                        writer = csv.writer(f)
                        writer.writerow(L)
                        num_writer += 1
                except Exception as e:
                    self.err_log(id, i)
                    log(
                        '{} report errors ID: {} , Page number: {} / {}'.format(name_thread, id, i, num_page))
            if num_writer == 0:
                log('{} No message write ID: {} , Page number: {} / {} Please check the anti-crawling mechanism'.format(name_thread, id, i, num_page))
                self.err_log(id, i)
            else:
                log('{} Complete crawling ID: {} , Page number: {} / {}'.format(name_thread, id, i, num_page))

Get the number of URLs obtained above, and use that count to loop over the enterprises on the page.

url = 'https://www.qichacha.com'+rUrls[num]
company = rTitle[num]
person = rPerson[num]
capital = rCapital[num].replace('Registered capital:', '')
settime = rSetTime[num].replace('Date of establishment:', '')
email = rEmail[num].replace('\n', '').replace('Email:', '').strip()
phone = rPhone[num].replace('Telephone:', '')
address = rAddress[num].replace('\n', '').replace('Address:', '').strip()
state = rState[num]
L = [province, city, county, park, area, numcop, company, person, capital, settime, email, phone, address, state, url]

The site prefix is concatenated with each relative URL obtained earlier to form a complete detail-page url; the label text and surrounding whitespace are stripped from each extracted field (company name, legal representative, registered capital, establishment date, etc.) and the results are assigned to local variables; finally these variables are combined into one complete enterprise record L.
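Since the same strip-the-label pattern repeats for every field, it could be captured in a small helper; this is a refactoring sketch of my own, not code from the project:

def clean_field(raw, label):
    """Remove a label prefix such as 'Registered capital:' plus any surrounding whitespace."""
    return raw.replace('\n', '').replace(label, '').strip()

capital = clean_field('Registered capital: RMB 10 million ', 'Registered capital:')
email = clean_field('\nEmail: 3093847569@QQ.COM\n      ', 'Email:')
print(capital)   # 'RMB 10 million'
print(email)     # '3093847569@QQ.COM'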

with open(self.file_name, 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(L)
    num_writer += 1

Write the complete enterprise record L to the CSV file and increment the written-row counter num_writer by one.
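On its own, the append step looks like the sketch below (the file path and the shortened record are placeholders of mine). newline='' keeps the csv module from inserting blank lines between rows on Windows, and mode 'a' lets rows from successive pages accumulate in the same file:

import csv
import os

os.makedirs('./csv', exist_ok=True)                                     # make sure the output directory exists
record = ['Jiangsu', 'Suzhou', 'Example Co., Ltd.', 'RMB 10 million']   # shortened placeholder record
with open('./csv/my_companies.csv', 'a', newline='', encoding='utf-8') as f:
    csv.writer(f).writerow(record)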

except Exception as e:
    self.err_log(id, i)
    log('{} report errors ID: {} , Page number: {} / {}'.format(name_thread, id, i, num_page))
if num_writer == 0:
    log('{} No message write ID: {} , Page number: {} / {} Please check the anti-crawling mechanism'.format(name_thread, id, i, num_page))
    self.err_log(id, i)
else:
    log('{} Complete crawling ID: {} , Page number: {} / {}'.format(name_thread, id, i, num_page))

If an error occurs while assigning the fields of L or writing the row, an error log containing the thread name, the park ID and the page position is output, and the failed position is recorded with err_log. If no information at all is written for a page, the crawler has probably been blocked, so the log prompts 'Please check the anti-crawling mechanism' and the position is recorded as well. If the page completes normally, a 'Complete crawling' log entry is output.
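log() and err_log() are helper routines defined elsewhere in the project and not shown in this excerpt, so the following is only a guess at their shape (names kept, bodies and the error-file path assumed):

import time

def log(msg):
    # Print a timestamped progress message to the console.
    print('[{}] {}'.format(time.strftime('%Y-%m-%d %H:%M:%S'), msg))

def err_log(park_id, page_no):
    # Record a failed (park id, page) pair so it can be re-crawled later.
    with open('err_log.txt', 'a', encoding='utf-8') as f:
        f.write('{},{}\n'.format(park_id, page_no))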

3, Summary

This completes the analysis of the core code of the "get_parks_companies_threads.py" file, and with it the analysis of the project's core code.

Through the code analysis of the QiChaCha project, I have gained many crawler skills and a deeper understanding of how crawlers are applied. I have learned some techniques for crawling information that requires a login or member access, and I benefited a lot from reading crawler input from, and saving results to, CSV files.

In my next blog post, I will analyze the core code of the last of the four projects.
