Where is it cheap and good to rent a house? python visual crawler tells you!

Posted by jodyanne on Tue, 14 Dec 2021 20:04:19 +0100

After graduation, I want to develop in the local city. Renting is the top priority I face. Where are there many houses? Where is the cheapest house price? I want to face south, I want high-rise....

No problem, the reptile is done!

First of all, when we open the home page of chain house rental, we can see that there are 61363 houses under rent. So many houses must have my favorite one.

The information we want to obtain includes the name of the house, the location of the house, the orientation of the house, the area of the house, the layout of the house, and, most importantly, the price of the house!

1. First, let's get the web link:

https://xa.lianjia.com/zufang/pg1/
https://xa.lianjia.com/zufang/pg5/
https://xa.lianjia.com/zufang/pg8/

It can be found that the page turning is determined by the page parameter.

2. Add the following three parameters to simulate browser access.

url = f'https://xa.lianjia.com/zufang/pg{page}/'
        #Simulation browser
        headers = {
            'Referer':f'https://xa.lianjia.com/zufang/pg{page}/',
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4484.7 Safari/537.36',
            'Cookie':'select_city=610100; lianjia_uuid=0026697d-0b4b-404a-9700-6ffdee75e3be; _smt_uid=60928372.2a8cb08f; UM_distinctid=1793c51792a47d-0751bdbf625319-71153641-1fa400-1793c51792ba23; _jzqckmp=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221793c517f2b1-0931efe9a56ec3-71153641-2073600-1793c517f2c463%22%2C%22%24device_id%22%3A%221793c517f2b1-0931efe9a56ec3-71153641-2073600-1793c517f2c463%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E4%BB%98%E8%B4%B9%E5%B9%BF%E5%91%8A%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Fother.php%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E9%93%BE%E5%AE%B6%22%2C%22%24latest_utm_source%22%3A%22baidu%22%2C%22%24latest_utm_medium%22%3A%22pinzhuan%22%2C%22%24latest_utm_campaign%22%3A%22wyxian%22%2C%22%24latest_utm_content%22%3A%22biaotimiaoshu%22%2C%22%24latest_utm_term%22%3A%22biaoti%22%7D%7D; _ga=GA1.2.1228540593.1620214646; _gid=GA1.2.1956937792.1620214646; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1620214650; digv_extends=%7B%22utmTrackId%22%3A%2221583074%22%7D; lianjia_ssid=5b2db515-b320-4c75-b676-e28e8b35d4ec; CNZZDATA1255849580=598881228-1620209949-https%253A%252F%252Fwww.baidu.com%252F%7C1620296334; CNZZDATA1254525948=1027467720-1620211125-https%253A%252F%252Fwww.baidu.com%252F%7C1620296930; CNZZDATA1255633284=1302951133-1620212328-https%253A%252F%252Fwww.baidu.com%252F%7C1620297661; CNZZDATA1255604082=1771052689-1620212893-https%253A%252F%252Fwww.baidu.com%252F%7C1620294612; _jzqa=1.3096002108450772000.1620214643.1620214643.1620299994.2; _jzqc=1; _jzqy=1.1620214643.1620299994.1.jzqsr=baidu|jzqct=%E9%93%BE%E5%AE%B6.-; _qzja=1.2112865692.1620214644413.1620214644413.1620299994647.1620214650759.1620299994647.0.0.0.3.2; _qzjc=1; _qzjto=1.1.0; _jzqb=1.1.10.1620299994.1; _qzjb=1.1620299994647.1.0.0.0; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiNGE0Y2E5MmE5MTFiMjhjODMyNTIxMDdkM2VjMDU2NGQzY2I0ZGQzMDlhZDY2Y2MwY2FjZWJjOWVmNmY1NWRlYzIwNDYwZDY3ZmZiZTZmODM5Yzc3MWFmMzNkMmI0NDI5ZWFkMTFhYzY4YWU0Y2Q3ZTlhYWZhMzk4MTc0MjcwNTZiZTg3MTc1MWI5YWQzMDUxNzk4Y2MwZDJmMTA5YzgzOWMxZjUyZTVhMmI0NDVmYzNkN2NjNjI2NjZiYTRiYzI2YTZmNGNlMzViMTZkMTc3NTA4ZGE0YzkxMDAzZDkwYmIwNTM0NGU1OTY3OThlZWNiN2YyNjIxZjQ5YzRiOGY5NWY3Zjc5NWM2YzIzNmNkOGI0NjQ2ZWNmNjczNzk0MzBhZWRjODdjYjg4MWNkOWFjYjFhMWVkYTllOTc2NzYxMTgyZmU1MjljYjdlNWNiZmZjZDY5ZDA5YTEwZGM5OTcxNlwiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCJkMWZhZDhjYlwifSIsInIiOiJodHRwczovL3hhLmxpYW5qaWEuY29tL3p1ZmFuZy9wZzEvIiwib3MiOiJ3ZWIiLCJ2IjoiMC4xIn0='
        }
        #Send network request
        try:
            resp = requests.get(url, headers = headers)
        except:
            print('something wrong!')

3. By analyzing the web page, we can see that all the information of each house is in this div. there are 30 sets of house information on each page, that is, there are such 30 div.

So our idea is very simple. Use css to get all the information of the 30 div s, and then use the loop to get the specific information in each house.

4. As follows, we can observe the page structure and find that all listing div information is in a class named content__ Div inside list.

So let's get all the div s first

 divs = seletor.css('.content__list--item')  #All div Tags

5. Then we traverse all div s to get their internal specific information.

for d in divs:
   title = d.css('.content__list--item--title a::text').get().strip()  #House title
   address = d.css('.content__list--item--des a::text').getall()
   address = '-'.join(address).strip()                                #House community name
   area = d.css('.content__list--item--des::text').getall()[4].strip()  #House area
   toward = d.css('.content__list--item--des::text').getall()[5].strip()  #House orientation
   structure = d.css('.content__list--item--des::text').getall()[6].strip()  #House structure
   addinfo = d.css('.content__list--item--bottom i::text').getall()
   addinfo = '|'.join(addinfo)                                             #House other information
   maint = d.css('.content__list--item--time::text').get().strip()  #House maintenance details
   price = d.css('.content__list--item-price em::text').get().strip() + 'element/month' #House monthly rent

6. Then we save the obtained information into Excel to facilitate subsequent pandas operations.

ws = op.Workbook()
wb = ws.create_sheet(index=0)

wb.cell(row=1, column=1, value='title')
wb.cell(row=1, column=2, value='Cell name')
wb.cell(row=1, column=3, value='the measure of area')
wb.cell(row=1, column=4, value='orientation')
wb.cell(row=1, column=5, value='structure')
wb.cell(row=1, column=6, value='Additional information')
wb.cell(row=1, column=7, value='Maintenance')
wb.cell(row=1, column=8, value='Monthly rent')
ws.save('Xi'an rental information.xlsx')

7. The Excel information obtained is as follows:

8. Then we set a small goal to obtain 100 pages of data first. That is, 3000 listings.

Next, let's simply visualize the availability of houses in various regions of Xi'an to see which region has more houses, and when there are more houses, there will be more choices. The price is natural You know!

1. We first use pandas to read excel, and then get the location of all houses:

pd_data = pd.read_excel(r'./Xi'an rental information.xlsx')    #Read excel
address = pd_data['Cell name']                      #Get all addresses

Because we only need the area where the first two houses are located, and we need to know the total number of all houses in each area. No other information is required, so it needs to be further optimized.

#Get the first two digits of the address as the area name
address_01 = []
for i in address:
    address_01.append(i[0:2])
#Statistics
data = []
address_01_set = set(address_01)
for item in address_01_set:
    data.append((item, address_01.count(item)))

#Rename column
df = DataFrame(data)
df.columns = ('area', 'number')

Next, we use bar chart, pie chart, line chart, funnel chart and dashboard to show them respectively.

#1. Histogram
def barPage() -> Bar:
    bar = (
        Bar()
        .add_xaxis(x_data)
        .add_yaxis("Rental distribution in Xi'an",y_data)
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Bar-Xi'an rental distribution map"),
            legend_opts=opts.LegendOpts(is_show=False),)
    )
    return bar

Encapsulate all the visual parts into functions for display.

page = (
    Page(layout=Page.DraggablePageLayout)
    .add(
        barPage(),
        piePage(),
        linePage(),
        funnelPage(),
        gaugePage())
)

page.render("page_demo.html")

Let's see the effect~~~

How, it's much better than finding one by one on the web page! It's more intuitive than looking at Excel

Topics: Python crawler