Crawling real-estate broker information with Python
Python 2.7.15
Today we're going to crawl the listings of real-estate brokers. This time we won't use regular expressions; we'll use Beautiful Soup instead. If you're not familiar with it, read this documentation first to follow along: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
I. Fetching the page source
for page in range(1, 8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"
    response = urllib2.urlopen(url)
    content = response.read()
The usual urllib2 routine.
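For readers on Python 3, the same fetch loop can be sketched with the stdlib urllib.request in place of urllib2. This is a minimal sketch, not the article's code; the User-Agent header is my addition, since many sites reject the default one:

```python
import urllib.request

BASE = "https://beijing.anjuke.com/tycoon/p{}/"

def page_urls(first=1, last=7):
    # Same pages as range(1, 8) in the article
    return [BASE.format(n) for n in range(first, last + 1)]

def fetch(url, timeout=10):
    # Browser-like User-Agent header: an assumption, not from the article
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

urls = page_urls()
```

Each element of urls can then be passed to fetch() to get the raw HTML for that page.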
II. Parsing with bs4
First, inspect the page source to find the tags that hold the broker information, then parse with BeautifulSoup; html.parser is the parser we pass in.
soup = BeautifulSoup(content, 'html.parser')
a = soup.find_all('h3')
b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
c = soup.find_all("p", attrs={"class": "jjr-desc"})
d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
e = soup.find_all(class_=re.compile("broker-tags clearfix"))
Here a, b, c, d, and e hold the broker's name, rating, store, areas of familiarity, and lines of business, respectively.
Each find_all call returns a list, so loop over them to pull out the text:
n = 0
for jjr in a:
    o = jjr.get_text(strip=True).encode('utf-8')
    p = b[n].get_text(strip=True).encode('utf-8')
    q = c[2*n].get_text(strip=True).encode('utf-8')
    r = d[n].get_text(strip=True).encode('utf-8')
    s = e[n].get_text(strip=True).encode('utf-8')
    n += 1
Pay attention to encoding here. BeautifulSoup parses the document into Unicode; printing that directly from a Python 2 console can come out garbled, and the values can't be written to a file or database in that form, so we call encode('utf-8') afterwards to re-encode them.
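To make the selectors concrete, here is a self-contained run against a stub of the listing markup. The HTML below is invented for the demo; only the class names come from the article. It also shows why the article indexes c[2*n]:

```python
import re
from bs4 import BeautifulSoup

# Stub of one broker card; classes mirror the article, content is made up
html = """
<h3>Zhang San</h3>
<div class="brokercard-sd-cont clearfix">5-star service</div>
<p class="jjr-desc">Chaoyang No.1 store</p>
<p class="jjr-desc xq_tag">Familiar with Wangjing</p>
<div class="broker-tags clearfix">Rentals</div>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find_all("h3")
b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
c = soup.find_all("p", attrs={"class": "jjr-desc"})
d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
e = soup.find_all(class_=re.compile("broker-tags clearfix"))

# The "jjr-desc" filter also matches the "jjr-desc xq_tag" paragraph,
# so c collects two items per broker -- hence c[2*n] in the loop above.
name = a[0].get_text(strip=True)
```

Note that bs4 matches a class filter against each class of a tag individually (and against the full joined attribute value), which is what makes both the plain-string and regex lookups above work.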
III. Writing to the database
insert_agent = ("INSERT INTO AGENT(name, evaluation, store, familiarity, business)"
                " VALUES(%s, %s, %s, %s, %s)")
data_agent = (o, p, q, r, s)
cursor.execute(insert_agent, data_agent)
Remember to open the database connection first and create the table you are writing to.
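The article writes to MySQL through MySQLdb, which needs a running server. As a self-contained sketch of the same parameterized-insert pattern, here is the equivalent with the stdlib sqlite3 module (sqlite3 uses ? placeholders where MySQLdb uses %s; the table and column names mirror the article, the row data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
cursor = conn.cursor()
cursor.execute("""CREATE TABLE AGENT(
    name TEXT, evaluation TEXT, store TEXT, familiarity TEXT, business TEXT)""")

insert_agent = ("INSERT INTO AGENT(name, evaluation, store, familiarity, business)"
                " VALUES(?, ?, ?, ?, ?)")  # sqlite3 placeholders; MySQLdb uses %s
data_agent = ("Zhang San", "5-star service", "Chaoyang No.1",
              "Wangjing", "Rentals")
cursor.execute(insert_agent, data_agent)
conn.commit()

rows = cursor.execute("SELECT name, store FROM AGENT").fetchall()
```

Passing the values as a tuple rather than interpolating them into the SQL string lets the driver handle quoting and protects against SQL injection, in both sqlite3 and MySQLdb.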
IV. Complete code
# coding=utf-8
from bs4 import BeautifulSoup
import urllib2
import re
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="199855pz",
                       db="pz", charset='utf8')
print 'Successful connection'
cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS AGENT")
sql = '''CREATE TABLE AGENT(
            name char(4),
            evaluation char(50),
            store char(50),
            familiarity char(50),
            business char(50))'''
cursor.execute(sql)

for page in range(1, 8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, 'html.parser')
    a = soup.find_all('h3')
    b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
    c = soup.find_all("p", attrs={"class": "jjr-desc"})
    d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
    e = soup.find_all(class_=re.compile("broker-tags clearfix"))
    n = 0
    for jjr in a:
        o = jjr.get_text(strip=True).encode('utf-8')
        p = b[n].get_text(strip=True).encode('utf-8')
        q = c[2*n].get_text(strip=True).encode('utf-8')
        r = d[n].get_text(strip=True).encode('utf-8')
        s = e[n].get_text(strip=True).encode('utf-8')
        n += 1
        insert_agent = ("INSERT INTO AGENT(name, evaluation, store, familiarity, business)"
                        " VALUES(%s, %s, %s, %s, %s)")
        data_agent = (o, p, q, r, s)
        cursor.execute(insert_agent, data_agent)

conn.commit()
PS: Anjuke has been updated since this was written, so the page source has changed a little, but the same crawling approach still works.