Hello everyone. The first two crawler cases on this blog walked through the basic crawling workflow. This post drops a bigger bombshell, and the difficulty goes up a notch (difficulty point 1: it involves crawling secondary pages; difficulty point 2: it crawls 16 fields in total). The main content of this article: taking Shijiazhuang as an example, crawl the relevant fields from the detail pages of second-hand housing communities on anjuke.com. Crawling the community list page itself is not covered here, because the steps are basically the same as in the previous post (Python crawls houses for sale on 58.com); interested readers can take a look at that one. All right, no more chatter, let's get started~
First, open the official Anjuke site and set two filter conditions: Shijiazhuang and second-hand housing communities (pick whatever city interests you). We can see there are 11,688 matching communities, 25 per page, so roughly 468 pages of data. Crawling all of them would take quite a while, and this article focuses on explaining the process, so here we only crawl the detail-page fields of the first 500 communities. Now let's see which fields can be crawled from a community's detail page.
Taking Evergrande Yujing Peninsula, the first community on the list page, as an example, open its detail page as shown in the figure below. As you can see, there is a lot of field information. Our task this time is to crawl these fields, mainly: community name, location/address, average price, number of second-hand listings, number of rental listings, property type, property fee, total construction area, total number of households, completion year, parking spaces, plot ratio, greening rate, developer, property management company, and business district, 16 fields in total.
As mentioned at the beginning, this case is a step up in difficulty compared with the previous two, concentrated in two areas: crawling secondary pages, and the large number of fields to crawl. But don't panic; it really isn't hard. The general flow: first crawl the URL of each community's detail page from the community list page, then loop over those detail-page URLs and, inside the loop, crawl the relevant fields from each detail page. Essentially it is one loop nested inside another. If that is still unclear, it will click once you see the code; a sketch of the structure is shown right below.
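To make the two-level structure concrete, here is a minimal sketch of that control flow. The two parse functions are hypothetical stubs standing in for the real functions defined later in this post.

def parse_list_page(list_url):
    # Hypothetical stub: the real function fetches the list page and returns detail-page URLs.
    return []

def parse_detail_page(detail_url):
    # Hypothetical stub: the real function fetches one detail page and returns a dict of 16 fields.
    return {}

results = []
for page in range(1, 21):                          # outer loop: list pages p1..p20
    list_url = 'https://sjz.anjuke.com/community/p{}'.format(page)
    for detail_url in parse_list_page(list_url):   # inner loop: each community's detail page
        results.append(parse_detail_page(detail_url))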
1. Get the URLs of the Shijiazhuang second-hand housing community pages on anjuke.com
As for how the URLs are obtained, I won't go into detail here and will just give the result. If you are new to crawling, take a look at my first two basic crawler posts.
# Home page URL
url = 'https://sjz.anjuke.com/community/p1'
# Multi-page crawling: for convenience, take the first 500 communities as an example,
# 25 per page, 20 pages in total (p1 to p20)
for i in range(1, 21):
    url = 'https://sjz.anjuke.com/community/p{}'.format(i)
2. Inspect the page HTML to find where each field lives
Two pages' HTML is involved here: the community list page, and each community's detail page. Let's look at them in turn:
(1) Community list page HTML:
On the community list page we only need two things: the URL of each community's detail page, and each community's average price;
(2) Community detail page HTML:
3. Use XPath to parse the pages and extract the corresponding field values
(1) Community list page:
# URL of each community's detail page:
link = html.xpath('.//div[@class="list-cell"]/a/@href')
# Average price of each community:
price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
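Before writing the full crawler, the two selectors can be sanity-checked on a single list page. This is only a minimal sketch: it uses a bare User-Agent header and assumes the list-page markup still matches the selectors; the real script later adds the cookie, referer and proxy settings.

from lxml import etree
import requests

# Minimal sanity check of the two list-page selectors (assumes the markup matches).
url = 'https://sjz.anjuke.com/community/p1'
headers = {'user-agent': 'Mozilla/5.0'}  # simplified headers; the full script also sets cookie and referer
text = requests.get(url, headers=headers).text
html = etree.HTML(text)
links = html.xpath('.//div[@class="list-cell"]/a/@href')
prices = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
print(len(links), len(prices))  # both should be 25 if the page loaded normally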
(2) Community detail page:
dict_result = {'Community name': '-', 'Price': '-', 'Community address': '-', 'Property type': '-',
               'Property fee': '-', 'Total construction area': '-', 'Total households': '-',
               'Construction age': '-', 'Parking space': '-', 'Plot ratio': '-', 'Greening rate': '-',
               'Developers': '-', 'Property company': '-', 'Business district': '-',
               'Number of second-hand houses': '-', 'Number of rental houses': '-'}
dict_result['Community name'] = html.xpath('.//div[@class="comm-title"]/h1/text()')
dict_result['Community address'] = html.xpath('.//div[@class="comm-title"]/h1/span/text()')
dict_result['Property type'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[1]/text()')
dict_result['Property fee'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[2]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[2]/text()')
dict_result['Total construction area'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[3]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[3]/text()')
dict_result['Total households'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[4]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[4]/text()')
dict_result['Construction age'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[5]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[5]/text()')
dict_result['Parking space'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[6]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[6]/text()')
dict_result['Plot ratio'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[7]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[7]/text()')
dict_result['Greening rate'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[8]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[8]/text()')
dict_result['Developers'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[9]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[9]/text()')
dict_result['Property company'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[10]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[10]/text()')
dict_result['Business district'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[11]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[11]/text()')
dict_result['Number of second-hand houses'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/div[3]/a[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/div[3]/a[1]/text()')
dict_result['Number of rental houses'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/div[3]/a[2]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/div[3]/a[2]/text()')
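Each detail-page selector above is written twice and joined with |, because the info block's class is either "comm-basic-mod " or "comm-basic-mod has-pano-box ", depending on whether the community has a panorama view. As an optional, more compact alternative (not what the selectors above use), XPath's contains() predicate matches both variants with a single expression; a tiny self-contained demo:

from lxml import etree

# Optional alternative to the "|" union above: contains() matches both class variants.
sample = ('<div class="comm-basic-mod has-pano-box ">'
          '<div></div><div><dl><dd>Ordinary residence</dd></dl></div></div>')
html = etree.HTML(sample)
print(html.xpath('.//div[contains(@class, "comm-basic-mod")]/div[2]/dl/dd[1]/text()'))
# ['Ordinary residence']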
4. Crawling the first page - detail data for its 25 communities
In general I first crawl only the first page, and only once every field crawls without errors do I add a loop to crawl multiple pages. If you already have some experience, you can skip this part and go straight to the complete multi-page code at the end (5. Multi-page crawling - complete code);
(1) Import packages and create file objects
## Import related packages
from lxml import etree
import requests
from fake_useragent import UserAgent
import random
import time
import csv
import re

## Create file object
f = open('Anjuke.com Shijiazhuang second-hand housing information.csv', 'w', encoding='utf-8-sig', newline="")
csv_write = csv.DictWriter(f, fieldnames=['Community name', 'Price', 'Community address', 'Property type',
                                          'Property fee', 'Total construction area', 'Total households',
                                          'Construction age', 'Parking space', 'Plot ratio', 'Greening rate',
                                          'Developers', 'Property company', 'Business district',
                                          'Number of second-hand houses', 'Number of rental houses'])
csv_write.writeheader()  # Write the header row
(2) Set up anti-crawling measures
## Set request header parameters: user agent, cookie, referer
ua = UserAgent()
headers = {
    # Randomly generated user agent
    "user-agent": ua.random,
    # Cookies differ between users and sessions; see another post of mine for how to grab yours from your own browser
    "cookie": "sessid=C7103713-BE7D-9BEF-CFB5-6048A637E2DF; aQQ_ajkguid=263AC301-A02C-088D-AE4E-59D4B4D4726A; ctid=28; twe=2; id58=e87rkGCpsF6WHADop0A3Ag==; wmda_uuid=1231c40ad548840be4be3d965bc424de; wmda_new_uuid=1; wmda_session_id_6289197098934=1621733471115-664b82b6-8742-1591; wmda_visited_projects=%3B6289197098934; obtain_by=2; 58tj_uuid=8b1e1b8f-3890-47f7-ba3a-7fc4469ca8c1; new_session=1; init_refer=http%253A%252F%252Flocalhost%253A8888%252F; new_uv=1; _ga=GA1.2.1526033348.1621734712; _gid=GA1.2.876089249.1621734712; als=0; xxzl_cid=7be33aacf08c4431a744d39ca848819a; xzuid=717fc82c-ccb6-4394-9505-36f7da91c8c6",
    # Referer: the page the request claims to come from
    "referer": "https://sjz.anjuke.com/community/p1/",
}

## Randomly obtain an IP from the proxy IP pool (a local proxy-pool service, e.g. the ProxyPool project, must be running)
def get_proxy():
    try:
        PROXY_POOL_URL = 'http://localhost:5555/random'
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
    except requests.exceptions.ConnectionError:
        return None
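One thing to watch: if the proxy-pool service is not running, get_proxy() returns None and the later requests would be sent through the literal proxy address "http://None". A small wrapper like the one below (my own addition, not part of the original script) falls back to a direct connection in that case:

# Hypothetical helper (not in the original script): build the proxies argument for requests,
# falling back to no explicit proxy when the proxy pool returns nothing.
def build_proxies():
    proxy = get_proxy()
    if proxy:
        return {"http": "http://{}".format(proxy)}
    return None  # requests then uses its default behaviour (no explicit proxy)

# Usage: requests.get(url, headers=headers, proxies=build_proxies())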
(3) Parse the first-level page (the community list page):
This function crawls the detail-page URL and the average price of each community on the list page;
## Parse the first-level page
def get_link(url):
    text = requests.get(url=url, headers=headers,
                        proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    link = html.xpath('.//div[@class="list-cell"]/a/@href')
    price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
    # print(link)
    # print(price)
    return zip(link, price)
(4) Parse the secondary page, i.e. the community detail page
## Parse the secondary page (the community detail page)
def parse_message(url, price):
    dict_result = {'Community name': '-', 'Price': '-', 'Community address': '-', 'Property type': '-',
                   'Property fee': '-', 'Total construction area': '-', 'Total households': '-',
                   'Construction age': '-', 'Parking space': '-', 'Plot ratio': '-', 'Greening rate': '-',
                   'Developers': '-', 'Property company': '-', 'Business district': '-',
                   'Number of second-hand houses': '-', 'Number of rental houses': '-'}
    text = requests.get(url=url, headers=headers,
                        proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    dict_result['Community name'] = html.xpath('.//div[@class="comm-title"]/h1/text()')
    dict_result['Community address'] = html.xpath('.//div[@class="comm-title"]/h1/span/text()')
    dict_result['Property type'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[1]/text()')
    dict_result['Property fee'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[2]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[2]/text()')
    dict_result['Total construction area'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[3]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[3]/text()')
    dict_result['Total households'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[4]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[4]/text()')
    dict_result['Construction age'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[5]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[5]/text()')
    dict_result['Parking space'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[6]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[6]/text()')
    dict_result['Plot ratio'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[7]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[7]/text()')
    dict_result['Greening rate'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[8]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[8]/text()')
    dict_result['Developers'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[9]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[9]/text()')
    dict_result['Property company'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[10]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[10]/text()')
    dict_result['Business district'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[11]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[11]/text()')
    dict_result['Number of second-hand houses'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/div[3]/a[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/div[3]/a[1]/text()')
    dict_result['Number of rental houses'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/div[3]/a[2]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/div[3]/a[2]/text()')
    # Simple preprocessing of the crawled data
    for key, value in dict_result.items():
        value = list(map(lambda item: re.sub(r'\s+', '', item), value))  # Remove whitespace, line breaks and tabs
        dict_result[key] = list(filter(None, value))  # Remove the empty strings produced by the previous step
        if len(dict_result[key]) == 0:
            dict_result[key] = ''
        else:
            dict_result[key] = dict_result[key][0]
    dict_result['Price'] = price
    return dict_result
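To see what the preprocessing loop does, here is a small isolated demo with a hypothetical xpath result: whitespace is stripped, empty strings are dropped, and the list is reduced to a single string (or '' when nothing is left). Note that \s+ removes all whitespace, which is fine for the Chinese field values on this site.

import re

# Isolated demo of the cleaning logic in parse_message (hypothetical input values).
value = ['\n                ', ' 0.38元/平米/月 \t']
value = list(map(lambda item: re.sub(r'\s+', '', item), value))  # ['', '0.38元/平米/月']
value = list(filter(None, value))                                # drop empty strings
cleaned = value[0] if value else ''
print(cleaned)  # 0.38元/平米/月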
(5) Save data to file: the save_csv() function
## Write the data to the csv file
def save_csv(result):
    for row in result:  # each community's data is stored in a dictionary
        csv_write.writerow(row)
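Since result is just a list of dictionaries whose keys match the fieldnames, csv.DictWriter.writerows could write them in one call; the explicit loop above is kept for readability. An equivalent version, assuming the same csv_write object, would be:

## Equivalent version using DictWriter.writerows (same csv_write object as above)
def save_csv(result):
    csv_write.writerows(result)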
(6) Main code for crawling only the first page
# Main code
C = 1
k = 1  # Counter for the number of communities crawled
print("************************Page 1 start crawling************************")
# First page URL
url = 'https://sjz.anjuke.com/community/p1'
# Parse the first-level page; it returns (detail-page URL, average price) pairs
link = get_link(url)
list_result = []  # Store each community's dictionary in this list
for j in link:
    try:
        # Parse the secondary page, passing two parameters: the detail-page URL and the average price
        result = parse_message(j[0], j[1])
        list_result.append(result)
        print("Crawled {} records".format(k))
        k = k + 1  # Count the number of communities crawled
        time.sleep(round(random.randint(5, 10), C))  # Random sleep between detail-page requests
    except Exception as err:
        print("-----------------------------")
        print(err)
# Save data to file
save_csv(list_result)
print("************************Page 1 crawled successfully************************")
5. Multi-page crawling - complete code
Since the code is fairly long, please read it patiently. If you are just starting to learn crawling, look at Part 4 above first; once you can crawl the first page, crawling multiple pages is much easier;
## Import related packages
from lxml import etree
import requests
from fake_useragent import UserAgent
import random
import time
import csv
import re

## Create file object
f = open('Anjuke.com Shijiazhuang second-hand housing information.csv', 'w', encoding='utf-8-sig', newline="")
csv_write = csv.DictWriter(f, fieldnames=['Community name', 'Price', 'Community address', 'Property type',
                                          'Property fee', 'Total construction area', 'Total households',
                                          'Construction age', 'Parking space', 'Plot ratio', 'Greening rate',
                                          'Developers', 'Property company', 'Business district',
                                          'Number of second-hand houses', 'Number of rental houses'])
csv_write.writeheader()  # Write the header row

## Set request header parameters: user agent, cookie, referer
ua = UserAgent()
headers = {
    # Randomly generated user agent
    "user-agent": ua.random,
    # Cookies differ between users and sessions; see another post of mine for how to grab yours from your own browser
    "cookie": "sessid=C7103713-BE7D-9BEF-CFB5-6048A637E2DF; aQQ_ajkguid=263AC301-A02C-088D-AE4E-59D4B4D4726A; ctid=28; twe=2; id58=e87rkGCpsF6WHADop0A3Ag==; wmda_uuid=1231c40ad548840be4be3d965bc424de; wmda_new_uuid=1; wmda_session_id_6289197098934=1621733471115-664b82b6-8742-1591; wmda_visited_projects=%3B6289197098934; obtain_by=2; 58tj_uuid=8b1e1b8f-3890-47f7-ba3a-7fc4469ca8c1; new_session=1; init_refer=http%253A%252F%252Flocalhost%253A8888%252F; new_uv=1; _ga=GA1.2.1526033348.1621734712; _gid=GA1.2.876089249.1621734712; als=0; xxzl_cid=7be33aacf08c4431a744d39ca848819a; xzuid=717fc82c-ccb6-4394-9505-36f7da91c8c6",
    # Referer: the page the request claims to come from
    "referer": "https://sjz.anjuke.com/community/p1/",
}

## Randomly obtain an IP from the proxy IP pool (a local proxy-pool service, e.g. the ProxyPool project, must be running)
def get_proxy():
    try:
        PROXY_POOL_URL = 'http://localhost:5555/random'
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
    except requests.exceptions.ConnectionError:
        return None

## Parse the first-level page
def get_link(url):
    text = requests.get(url=url, headers=headers,
                        proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    link = html.xpath('.//div[@class="list-cell"]/a/@href')
    price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
    return zip(link, price)

## Parse the secondary page (the community detail page)
def parse_message(url, price):
    dict_result = {'Community name': '-', 'Price': '-', 'Community address': '-', 'Property type': '-',
                   'Property fee': '-', 'Total construction area': '-', 'Total households': '-',
                   'Construction age': '-', 'Parking space': '-', 'Plot ratio': '-', 'Greening rate': '-',
                   'Developers': '-', 'Property company': '-', 'Business district': '-',
                   'Number of second-hand houses': '-', 'Number of rental houses': '-'}
    text = requests.get(url=url, headers=headers,
                        proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    dict_result['Community name'] = html.xpath('.//div[@class="comm-title"]/h1/text()')
    dict_result['Community address'] = html.xpath('.//div[@class="comm-title"]/h1/span/text()')
    dict_result['Property type'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[1]/text()')
    dict_result['Property fee'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[2]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[2]/text()')
    dict_result['Total construction area'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[3]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[3]/text()')
    dict_result['Total households'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[4]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[4]/text()')
    dict_result['Construction age'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[5]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[5]/text()')
    dict_result['Parking space'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[6]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[6]/text()')
    dict_result['Plot ratio'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[7]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[7]/text()')
    dict_result['Greening rate'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[8]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[8]/text()')
    dict_result['Developers'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[9]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[9]/text()')
    dict_result['Property company'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[10]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[10]/text()')
    dict_result['Business district'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/dl/dd[11]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[11]/text()')
    dict_result['Number of second-hand houses'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/div[3]/a[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/div[3]/a[1]/text()')
    dict_result['Number of rental houses'] = html.xpath('.//div[@class="comm-basic-mod "]/div[2]/div[3]/a[2]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/div[3]/a[2]/text()')
    # Simple preprocessing of the crawled data
    for key, value in dict_result.items():
        value = list(map(lambda item: re.sub(r'\s+', '', item), value))  # Remove whitespace, line breaks and tabs
        dict_result[key] = list(filter(None, value))  # Remove the empty strings produced by the previous step
        if len(dict_result[key]) == 0:
            dict_result[key] = ''
        else:
            dict_result[key] = dict_result[key][0]
    dict_result['Price'] = price
    return dict_result

## Write the data to the csv file
def save_csv(result):
    for row in result:  # each community's data is stored in a dictionary
        csv_write.writerow(row)

## Main code
C = 1
k = 1  # Counter for the number of communities crawled
# Multi-page crawling. Due to time constraints, only the detail data of the first 500 communities are crawled;
# interested readers can crawl the rest themselves
for i in range(1, 21):  # 25 communities per page, so the first 500 communities span 20 pages
    print("************************Page %s start crawling************************" % i)
    url = 'https://sjz.anjuke.com/community/p{}'.format(i)
    # Parse the first-level page; it returns (detail-page URL, average price) pairs
    link = get_link(url)
    list_result = []  # Store each community's dictionary in this list
    for j in link:
        try:
            # Parse the secondary page, passing two parameters: the detail-page URL and the average price
            result = parse_message(j[0], j[1])
            list_result.append(result)  # Store the dictionary in the list
            print("Crawled {} records".format(k))
            k = k + 1  # Count the number of communities crawled
            time.sleep(round(random.randint(1, 3), C))  # Random sleep between detail-page requests
        except Exception as err:
            print("-----------------------------")
            print(err)
    # Save data to file
    save_csv(list_result)
    time.sleep(random.randint(1, 3))  # Random sleep between list-page requests
    print("************************Page %s crawled successfully************************" % i)
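A small addition of my own (not in the original script): the file object f is never closed, so the last buffered rows may only be flushed to disk when the Python process exits. Closing the file at the very end is a safe habit; alternatively the open() call could be wrapped in a with block.

# My own addition (not in the original script): close the CSV file once crawling finishes,
# so that all buffered rows are flushed to disk.
f.close()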
6. The final crawled data
Well, that wraps up the third crawler case. This post used XPath to crawl the relevant fields from the detail pages of second-hand housing communities in Shijiazhuang. Compared with the first two cases the difficulty went up a level, mainly in two respects: one is crawling secondary pages, where you need to obtain the secondary-page URLs from the primary page; the other is the large number of fields, which means repeatedly checking whether each field is extracted correctly. Overall, although the difficulty increased, as long as you keep at it I believe you will get quite a lot out of it! When I was a beginner, my first reaction was "so this is what crawlers can do"; it was a lot of fun! As for follow-up posts, during my earlier learning I crawled Baidu Maps POI data, Dianping reviews and so on, and that is probably what I will summarize next. If you are interested, feel free to follow the blog. Hehe!
If anything here is not explained thoroughly enough, feel free to leave a message in the comment section and I will keep improving it!
You've read all the way to the end. Are you sure you don't want to leave something behind (a like or a comment)? Hee hee~