Crawling real estate broker information with Python

Posted by herschen on Wed, 11 Dec 2019 00:08:36 +0100

Python 2.7.15
Today we'll crawl real estate broker information from Anjuke. This time we won't use regular expressions; instead we'll use Beautiful Soup. If you're not familiar with it, skim this documentation first: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
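As a quick, self-contained illustration of the two BeautifulSoup calls used below (the HTML fragment here is made up, not taken from Anjuke):

from bs4 import BeautifulSoup

# Made-up HTML fragment just to show the API; the real pages are fetched later.
html = '<div><h3>Broker A</h3><p class="jjr-desc">Store A</p></div>'
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('h3'):                       # find_all returns a list of matching tags
    print tag.get_text(strip=True)                    # -> Broker A
print soup.find_all('p', attrs={'class': 'jjr-desc'})[0].get_text()   # -> Store A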

I. Fetching the page source

for page in range(1,8):
    url ="https://beijing.anjuke.com/tycoon/p" + str(page)+"/"
    response = urllib2.urlopen(url)
    content = response.read()

The usual urllib2 routine (Python 2).
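Note that urllib2 only exists in Python 2. If the site rejects bare requests, a browser-like User-Agent can be sent via urllib2.Request; a minimal sketch (the header value is just an example):

import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}               # example value; any browser UA string works
request = urllib2.Request(url, headers=headers)       # url as built in the loop above
content = urllib2.urlopen(request).read()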

II. Using bs4

First, inspect the page source to find the tags that hold the broker information, then parse the page with BeautifulSoup, where html.parser is the parser to use:

    soup = BeautifulSoup(content,'html.parser')
    a = soup.find_all('h3')
    b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
    c = soup.find_all("p", attrs={"class": "jjr-desc"})
    d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
    e = soup.find_all(class_=re.compile("broker-tags clearfix"))

a, b, c, d and e correspond to the broker's name, evaluation, store, familiarity and business respectively. Each is a list, so we loop over them in parallel by index. Note that the class jjr-desc also matches the jjr-desc xq_tag paragraphs, so c holds two entries per broker, which is why it is indexed as c[2*n] below:

    n = 0
    for jjr in a:
        o = jjr.get_text(strip=True).encode('utf-8')
        p = b[n].get_text(strip=True).encode('utf-8')
        q = c[2*n].get_text(strip=True).encode('utf-8')
        r = d[n].get_text(strip=True).encode('utf-8')
        s = e[n].get_text(strip=True).encode('utf-8')
        n+=1

Pay attention to encoding here. BeautifulSoup returns the parsed text as Unicode; printing it directly may come out garbled, and it cannot be written to a file or the database in that form, so call encode('utf-8') afterwards to re-encode it.
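A small sketch of what happens in Python 2 (the value is illustrative):

u = u'\u5b89\u5c45\u5ba2'      # a unicode object, like what get_text() returns
print type(u)                   # <type 'unicode'>
b = u.encode('utf-8')           # re-encode to UTF-8 bytes
print type(b)                   # <type 'str'> - what gets written to the file/database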

III. Writing to the database

        insert_agent = ("INSERT INTO AGENT(name,evaluation,store,familiarity,business)"
                        " VALUES(%s,%s,%s,%s,%s)")
        data_agent = (o, p, q, r, s)
        cursor.execute(insert_agent, data_agent)

Remember to establish the database connection and create the target table first.
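The shape of that setup is roughly the following (credentials are placeholders; the full, working version is in section IV). Closing the cursor and connection when you are done is good practice, even though the script below omits it:

import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="***", db="pz", charset='utf8')
cursor = conn.cursor()
try:
    cursor.execute("CREATE TABLE IF NOT EXISTS AGENT(name char(4), evaluation char(50), "
                   "store char(50), familiarity char(50), business char(50))")
    # ... run the INSERTs from the crawl loop here ...
    conn.commit()             # nothing is persisted until commit
finally:
    cursor.close()
    conn.close()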

IV. Complete code

# coding=utf-8
from bs4 import BeautifulSoup
import urllib2
import re
import MySQLdb

# connect to the local MySQL instance and (re)create the AGENT table
conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="199855pz", db="pz", charset='utf8')
print 'Successful connection'
cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS AGENT")
sql = '''CREATE TABLE AGENT(name char(4), evaluation char(50), store char(50), familiarity char(50), business char(50))'''
cursor.execute(sql)

# crawl pages 1-7 of the broker list
for page in range(1,8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, 'html.parser')
    a = soup.find_all('h3')                                              # names
    b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))  # evaluations
    c = soup.find_all("p", attrs={"class": "jjr-desc"})                  # stores (two matches per broker)
    d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})           # familiar areas
    e = soup.find_all(class_=re.compile("broker-tags clearfix"))         # business tags

    n = 0
    for jjr in a:
        o = jjr.get_text(strip=True).encode('utf-8')       # name
        p = b[n].get_text(strip=True).encode('utf-8')      # evaluation
        q = c[2*n].get_text(strip=True).encode('utf-8')    # store (skip the xq_tag entries)
        r = d[n].get_text(strip=True).encode('utf-8')      # familiarity
        s = e[n].get_text(strip=True).encode('utf-8')      # business tags
        n += 1
        insert_agent = ("INSERT INTO AGENT(name,evaluation,store,familiarity,business)"
                        " VALUES(%s,%s,%s,%s,%s)")
        data_agent = (o, p, q, r, s)
        cursor.execute(insert_agent, data_agent)
conn.commit()
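Optionally, after the commit you can check how many rows landed in the table and close the connection:

cursor.execute("SELECT COUNT(*) FROM AGENT")
print 'Rows inserted:', cursor.fetchone()[0]
cursor.close()
conn.close()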

PS: Anjuke has since updated its pages, so the HTML source differs a little, but the same approach to crawling the information still works.

Topics: Big Data Database SQL Python