Python crawler example: crawling news counts
Preface
Some time ago, a task required crawling the number of news articles about the constituent stocks of the SSE 50 Index on given dates.
The first idea was to crawl Baidu News advanced search, but one day it suddenly stopped working: whatever you search for, it redirects to Baidu's home page. As of this writing (June 11, 2020) it has not recovered, and I don't know whether Baidu has discontinued the service.
So a substitute was needed, and my eyes landed on the advanced search of ChinaSo News, known as the "national team" of the search industry.
Analysis
Web page parsing
To crawl a site, the first step is to work out the structure of its URLs.
Take the full-text search keyword "ICBC" as an example, with both the start date and the end date set to 2020-01-08; that is, search for the volume of news containing the keyword "ICBC" on January 8, 2020.
On the results page, the total news count is the figure we want to crawl, while the search keyword and the time range are the variables we change during crawling.
Next, analyze its URL: http://news.chinaso.com/newssearch.htm?q=%E5%B7%A5%E5%95%86%E9%93%B6%E8%A1%8C&startTime=20200108&endTime=20200108
The prefix http://news.chinaso.com/newssearch.htm? points at ChinaSo's news search page and stays fixed throughout the crawl; the rest consists of three parameters: q, startTime and endTime. The parameter q is the search keyword; here "%E5%B7%A5%E5%95%86%E9%93%B6%E8%A1%8C" is the URL-encoded form of the keyword "Industrial and Commercial Bank of China". The parameters startTime and endTime are the start and end dates of the search window.
In practice, writing out the URL encoding by hand is tedious, and the encoded text can simply be replaced by the Chinese keyword itself. For example, to search the news volume of the keyword "Ping An of China" (中国平安) on January 8, 2020, the URL can be constructed as: http://news.chinaso.com/newssearch.htm?q=中国平安&startTime=20200108&endTime=20200108
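As a quick illustration, here is a minimal sketch of building such a URL in Python. Passing the keyword through the params argument lets requests handle the percent-encoding; that the site accepts standard UTF-8 percent-encoding is consistent with the encoded URL shown above.

import requests

# Minimal sketch: build the ChinaSo search URL for one keyword and one day.
# requests percent-encodes the Chinese keyword automatically via `params`.
base_url = "http://news.chinaso.com/newssearch.htm"
params = {"q": "工商银行", "startTime": "20200108", "endTime": "20200108"}
r = requests.get(base_url, params=params)
print(r.url)  # e.g. ...newssearch.htm?q=%E5%B7%A5%E5%95%86%E9%93%B6%E8%A1%8C&startTime=20200108&endTime=20200108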
The total news count is the data we want to crawl; first locate it with the browser's F12 developer tools:
After finding the node, the count can be extracted in many ways, for example with BeautifulSoup or a regular expression; here I use a regular expression:
num = re.match('<div class="toolTab_xgxwts">Find news for you(.*)piece</div>', str(retext))
(On the live page this prompt text is in Chinese; the English wording above is a translation, so the literal pattern must be adapted to match the page's actual text.)
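Putting the pieces together, a minimal sketch of fetching one results page and extracting the count might look like the following. The class name toolTab_xgxwts is taken from the page as located above; since the live prompt text is Chinese, this sketch uses a looser digit pattern instead of the full literal match, and falls back to 0 when nothing is found.

import re
import requests
from bs4 import BeautifulSoup

# Minimal sketch: fetch one results page and pull out the news count.
url = "http://news.chinaso.com/newssearch.htm?q=工商银行&startTime=20200108&endTime=20200108"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
node = soup.find(class_='toolTab_xgxwts')   # the <div> holding the count
m = re.search(r'(\d+)', str(node))          # grab the first number inside it
news_count = int(m.group(1)) if m else 0    # 0 when the day had no news
print(news_count)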
At this point, the most basic part of the crawler is done; what remains is to take care of the surrounding details.
Stock data source
As stated at the beginning, the goal is to crawl the 50 constituents of the SSE 50 Index, so first we need to know what those constituents are. My understanding is that the constituents of an index are not fixed, and since the period to crawl is long (more than a year), it is safer to fetch the constituent list for each day being crawled. Here I use the jqdatasdk package provided by the JoinQuant platform. Before first use, register an account on the official website and run pip install jqdatasdk on the command line.
During use, you need to log in first:
jqdatasdk.auth('xxxxxx', 'xxxxxx')  # log in to JoinQuant with your account and password
Then historical market data can be obtained:
raw_data_everyday = jqdatasdk.get_index_weights('000016.XSHG', date=temporaryTime1_raw)
From the returned data, the day's constituent list can be parsed and then looped over for crawling.
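For reference, here is one way to pull the constituent list out of the returned data. The full code below instead converts the DataFrame to a NumPy array and indexes columns positionally; the column names used here are an assumption about the jqdatasdk schema, so check them against your version.

# Minimal sketch, assuming get_index_weights returns a pandas DataFrame
# indexed by stock code; the column names 'display_name' and 'weight'
# are an assumption -- verify them in your jqdatasdk version.
codes = raw_data_everyday.index.tolist()             # constituent codes, e.g. '600000.XSHG'
names = raw_data_everyday['display_name'].tolist()   # Chinese names, used as search keywords
weights = raw_data_everyday['weight'].tolist()       # index weights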
Proxy IP
In actual operation, ChinaSo turned out to be quite strict with crawlers (living up to the "national team" name), so proxy IPs are needed. I won't name the specific IP provider here, to avoid this reading as an advertisement, and will only describe how the proxies are used.
First, build the IP pool by copying the available IPs and their ports from the provider (they could also be fetched through the provider's API):
proxies_list=['58.218.200.227:8601', '58.218.200.223:3841', '58.218.200.226:3173', '58.218.200.228:8895', '58.218.200.226:8780', '58.218.200.227:6646', '58.218.200.228:7469', '58.218.200.228:5760', '58.218.200.223:8830', '58.218.200.228:5418', '58.218.200.223:6918', '58.218.200.225:5211', '58.218.200.227:8141', '58.218.200.228:7779', '58.218.200.226:3999', '58.218.200.226:3345', '58.218.200.228:2433', '58.218.200.226:6042', '58.218.200.225:4760', '58.218.200.228:2547', '58.218.200.225:3886', '58.218.200.226:7384', '58.218.200.228:8604', '58.218.200.227:6996', '58.218.200.223:3986', '58.218.200.226:6305', '58.218.200.225:6208', '58.218.200.223:4006', '58.218.200.225:8079', '58.218.200.228:7042', '58.218.200.225:7086', '58.218.200.227:8913', '58.218.200.227:3220', '58.218.200.226:2286', '58.218.200.228:7337', '58.218.200.227:2010', '58.218.200.227:9062', '58.218.200.225:8799', '58.218.200.223:3568', '58.218.200.228:3184', '58.218.200.223:5874', '58.218.200.225:3963', '58.218.200.228:3696', '58.218.200.227:7113', '58.218.200.226:4501', '58.218.200.223:7636', '58.218.200.225:9108', '58.218.200.228:6940', '58.218.200.223:5310', '58.218.200.225:2864', '58.218.200.226:5225', '58.218.200.228:6468', '58.218.200.223:8127', '58.218.200.225:8575', '58.218.200.223:7269', '58.218.200.228:7039', '58.218.200.226:6674', '58.218.200.226:5945', '58.218.200.225:3108', '58.218.200.226:3990', '58.218.200.223:8356', '58.218.200.227:5274', '58.218.200.227:6535', '58.218.200.225:3934', '58.218.200.223:6866', '58.218.200.227:3088', '58.218.200.227:7253', '58.218.200.223:2215', '58.218.200.228:2715', '58.218.200.226:4071', '58.218.200.228:7232', '58.218.200.225:5561', '58.218.200.226:7476', '58.218.200.223:3917', '58.218.200.227:2931', '58.218.200.223:5612', '58.218.200.226:6409', '58.218.200.223:7785', '58.218.200.228:7906', '58.218.200.227:8476', '58.218.200.227:3012', '58.218.200.226:6388', '58.218.200.225:8819', '58.218.200.225:2093', '58.218.200.227:4408', '58.218.200.225:7457', '58.218.200.223:3593', '58.218.200.225:2028', '58.218.200.227:2119', '58.218.200.223:3094', '58.218.200.226:3232', '58.218.200.227:6769', '58.218.200.223:4013', '58.218.200.227:9064', '58.218.200.223:6034', '58.218.200.227:4292', '58.218.200.228:5228', '58.218.200.228:2397', '58.218.200.226:2491', '58.218.200.226:3948', '58.218.200.227:2630', '58.218.200.228:4857', '58.218.200.228:2541', '58.218.200.225:5653', '58.218.200.226:7068', '58.218.200.223:2129', '58.218.200.227:4093', '58.218.200.226:2466', '58.218.200.226:4089', '58.218.200.225:4932', '58.218.200.228:8511', '58.218.200.227:6660', '58.218.200.227:2536', '58.218.200.226:5777', '58.218.200.228:4755', '58.218.200.227:4138', '58.218.200.223:5297', '58.218.200.226:2367', '58.218.200.225:7920', '58.218.200.225:6752', '58.218.200.228:4508', '58.218.200.223:3120', '58.218.200.227:3329', '58.218.200.226:6911', '58.218.200.228:7032', '58.218.200.223:8029', '58.218.200.228:2009', '58.218.200.223:3487', '58.218.200.228:9078', '58.218.200.225:3985', '58.218.200.227:6955', '58.218.200.228:8847', '58.218.200.228:4376', '58.218.200.225:3942', '58.218.200.228:4983', '58.218.200.225:9082', '58.218.200.225:7907', '58.218.200.226:6141', '58.218.200.226:5268', '58.218.200.226:4986', '58.218.200.223:8374', '58.218.200.226:4850', '58.218.200.225:5397', '58.218.200.226:2983', '58.218.200.225:3156', '58.218.200.226:6176', '58.218.200.225:4273', '58.218.200.226:8625', '58.218.200.226:8424', '58.218.200.226:5714', '58.218.200.223:8166', '58.218.200.226:4194', '58.218.200.223:6850', 
'58.218.200.228:6994', '58.218.200.223:3825', '58.218.200.226:7129', '58.218.200.223:3941', '58.218.200.227:8775', '58.218.200.228:4195', '58.218.200.227:4570', '58.218.200.223:3255', '58.218.200.225:6626', '58.218.200.226:8286', '58.218.200.225:4605', '58.218.200.223:3667', '58.218.200.223:7281', '58.218.200.225:6862', '58.218.200.228:2340', '58.218.200.227:7144', '58.218.200.223:3691', '58.218.200.228:3849', '58.218.200.228:7871', '58.218.200.225:6678', '58.218.200.225:6435', '58.218.200.223:3726', '58.218.200.226:8436', '58.218.200.223:7461', '58.218.200.223:4113', '58.218.200.223:3912', '58.218.200.225:4666', '58.218.200.227:7176', '58.218.200.225:5462', '58.218.200.225:8643', '58.218.200.227:7591', '58.218.200.227:2134', '58.218.200.227:5480', '58.218.200.228:9013', '58.218.200.227:5178', '58.218.200.223:8970', '58.218.200.223:5423', '58.218.200.227:2832', '58.218.200.225:5636', '58.218.200.223:2347', '58.218.200.227:4171', '58.218.200.227:5288', '58.218.200.227:4254', '58.218.200.227:3254', '58.218.200.228:6789', '58.218.200.223:4956', '58.218.200.226:6146']
For each request, an IP is randomly chosen from the pool as camouflage:
def randomip_scrapy(proxies_list, url, headers):
    proxy_ip = random.choice(proxies_list)  # draw a random proxy from the pool
    proxies = {'https': 'https://' + proxy_ip, 'http': 'http://' + proxy_ip}
    r = requests.get(url, headers=headers, proxies=proxies)
    return r
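In the full script this function is wrapped with @retry(stop_max_attempt_number=100) from the retrying package, so a request through a dead proxy simply re-enters the function and draws a new IP. A minimal usage sketch, with the keyword and dates as placeholders:

# Usage sketch: call the function above (decorated with @retry in the full code below).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
url = 'http://news.chinaso.com/newssearch.htm?q=工商银行&startTime=20200108&endTime=20200108'
r = randomip_scrapy(proxies_list, url, headers)  # retries with fresh proxies until one works
print(r.status_code)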
Code implementation
With all the technical difficulties analyzed, here is the complete code:
# Traverse stocks and dates: output each stock's daily data as a 1*n row,
# then stack the rows of m stocks into an m*n matrix
import requests
from bs4 import BeautifulSoup
import re
import datetime
import random
from retrying import retry
import jqdatasdk
import pandas as pd
import numpy as np

jqdatasdk.auth('xxxxxx', 'xxxxxx')  # log in to JoinQuant with your account

@retry(stop_max_attempt_number=100)  # retry with a different proxy on failure
def randomip_scrapy(proxies_list, url, headers):
    proxy_ip = random.choice(proxies_list)
    proxies = {'https': 'https://' + proxy_ip, 'http': 'http://' + proxy_ip}
    r = requests.get(url, headers=headers, proxies=proxies)
    return r

# Traverse the data of every stock on every day
def newsSpyder(start_year, start_month, start_day, proxies_list, time_step, time_count):
    startTime_raw = datetime.date(year=start_year, month=start_month, day=start_day)
    temporaryTime1_raw = startTime_raw  # temporary date
    temporaryTime2_raw = startTime_raw + datetime.timedelta(days=time_step)
    j = 0
    # Walk along the time axis
    while j < time_count * time_step:
        # Storage lists for the day's data
        name_list_everyday = []
        weight_list_everyday = []
        newsnum_list_everyday = []
        date_list = []
        temporaryTime2_raw = temporaryTime1_raw + datetime.timedelta(days=time_step)
        temporaryTime1 = temporaryTime1_raw.strftime('%Y%m%d')  # formatted temporary date 1
        temporaryTime2 = temporaryTime2_raw.strftime('%Y%m%d')  # formatted temporary date 2
        # Fetch the day's constituents from jqdata and parse them
        raw_data_everyday = jqdatasdk.get_index_weights('000016.XSHG', date=temporaryTime1_raw)  # to crawl another index, change the code here
        raw_data_everyday_array = np.array(raw_data_everyday)
        raw_data_everyday_list = raw_data_everyday_array.tolist()
        for row in raw_data_everyday_list:
            name_list_everyday.append(row[1])
            weight_list_everyday.append(row[0])
            date_list.append(row[2])
        j = j + 1
        count = 0
        for name in name_list_everyday:
            url = ("http://news.chinaso.com/newssearch.htm?q=" + str(name)
                   + "&type=news&page=0&startTime=" + str(temporaryTime1)
                   + "&endTime=" + str(temporaryTime2))
            user_agent_list = [
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
                "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
                "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
            ]
            agent = random.choice(user_agent_list)  # random User-Agent
            headers = {'User-Agent': agent}
            # If an unavailable IP is picked, the retry decorator brings us back to this step
            r = randomip_scrapy(proxies_list, url, headers)
            content = r.content
            soup = BeautifulSoup(content, 'lxml')
            retext = soup.find(class_='toolTab_xgxwts')
            num = re.match('<div class="toolTab_xgxwts">Find news for you(.*)piece</div>', str(retext))
            count = count + 1
            print('No.', count)
            if num is None:  # no match: the stock had no news that day, record 0
                newsnum_list_everyday.append(0)
            else:
                newsnum_list_everyday.append(int(num.group(1)))
        # Append the day's four lists to the csv file
        a = [num for num in newsnum_list_everyday]
        b = [name for name in name_list_everyday]
        c = [weight for weight in weight_list_everyday]
        d = [date for date in date_list]
        dataframe = pd.DataFrame({'num': a, 'name': b, 'weight': c, 'date': d})
        dataframe.to_csv(r"20181022-20181231.csv", mode='a', sep=',', encoding="gb2312")
        temporaryTime1_raw = temporaryTime1_raw + datetime.timedelta(days=time_step)  # date + step

if __name__ == "__main__":
    proxies_list = [  # the same proxy IP pool listed in the "Proxy IP" section above
        '58.218.200.227:8601', '58.218.200.223:3841', '58.218.200.226:3173',
        # ... (remaining entries identical to the pool above, omitted here to avoid repetition)
    ]
    newsSpyder(2018, 12, 17, proxies_list, 1, 15)  # year, month, day, proxy pool, time step, number of time steps
    # 22.46
Summary
The scale of this project is not small, but the technical difficulties are all basic crawler problems, so the overall difficulty is not high. This was the first time I applied the full range of crawler skills end to end, including data acquisition, web page parsing, IP camouflage and data cleaning; it was good practice as well.
Original link: https://blog.csdn.net/chandler_scut/article/details/106685617