Python crawler example: crawling total news counts

Posted by HavokDelta6 on Thu, 11 Jun 2020 07:10:42 +0200


Preface

Some time ago, a task required crawling the number of news articles about each SSE 50 index constituent stock on certain dates.

The initial idea was to crawl Baidu News advanced search, but one day it suddenly stopped working: whatever you searched, it redirected to Baidu's home page. As of this writing (June 11, 2020) it has not recovered; I do not know whether Baidu has discontinued the service.

So a substitute was needed. The blogger settled on ChinaSo News advanced search, known as the "national team" of the search industry.

Analysis

Web page parsing

To write the crawler, the first step is to analyze the URL structure of the search page.

First, take a full-text search for the keyword "ICBC" as an example, with the start date 2020-01-08 and the end date 2020-01-08, i.e. searching the news volume for the keyword "ICBC" on January 8, 2020.
The result page is as follows:

The total news count is the data we want to crawl, while the search keyword and the search time range are the variables that change during crawling.

Then analyze its url: http://news.chinaso.com/newssearch.htm?q=%E5%B7%A5%E5%95%86%E9%93%B6%E8%A1%8C&startTime=20200108&endTime=20200108

In this URL, the prefix http://news.chinaso.com/newssearch.htm? points to the ChinaSo news search page and does not change during crawling; the rest consists of three parameters: q, startTime and endTime. The parameter q is the search keyword; here "%E5%B7%A5%E5%95%86%E9%93%B6%E8%A1%8C" is the URL-encoded form of the keyword "Industrial and Commercial Bank of China". The parameters startTime and endTime are the start and end of the date range.

In practice, hand-building the percent-encoded form is tedious; the Chinese keyword can be used directly in the query string. For example, to search the news volume for the keyword "Ping An of China" (中国平安) on January 8, 2020, the URL can be constructed as: http://news.chinaso.com/newssearch.htm?q=中国平安&startTime=20200108&endTime=20200108.
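
As a minimal sketch (the helper name build_search_url exists only for this example), the same URL can also be assembled with urllib.parse.quote, which produces the percent-encoded form shown above:

from urllib.parse import quote

def build_search_url(keyword, start, end):
    # Base URL and parameter names follow the structure described above;
    # quote() turns the Chinese keyword into its percent-encoded form.
    base = "http://news.chinaso.com/newssearch.htm"
    return base + "?q=" + quote(keyword) + "&startTime=" + start + "&endTime=" + end

print(build_search_url("中国平安", "20200108", "20200108"))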

To locate the total news count we want to crawl, first find its node with the F12 developer tools:

After finding the node, the news count can be extracted in several ways (BeautifulSoup, regular expressions, etc.); here the blogger uses a regular expression:

num=re.match('<div class="toolTab_xgxwts">Find news for you(.*)piece</div>',str(retext))
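
(The text inside the pattern is an English rendering of the site's result-count banner; on the live page the banner is in Chinese, so the literal has to match the site's actual wording.) A self-contained sketch of this extraction step on a hypothetical sample snippet:

import re
from bs4 import BeautifulSoup

# Hypothetical sample of the result-count node as it would appear in the page source
sample_html = '<div class="toolTab_xgxwts">Find news for you 1234 piece</div>'

soup = BeautifulSoup(sample_html, 'lxml')
retext = soup.find(class_='toolTab_xgxwts')
num = re.match('<div class="toolTab_xgxwts">Find news for you(.*)piece</div>', str(retext))
print(int(num.group(1)) if num else 0)  # prints 1234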

With this, the core of the crawler is done; the remaining work is handling the input (stock lists and dates) and the output.

Stock data source

As mentioned at the beginning, we need to crawl the SSE 50 constituent stocks, so first we need to know which stocks those are. The blogger's understanding is that an index's constituents are not fixed; since the crawling period is long (more than a year), it is safer to fetch the constituent list for each day being crawled. Here the blogger uses the jqdatasdk package provided by the JoinQuant platform. Before first use, register an account on the official website and install the package with pip install jqdatasdk on the command line.

During use, you need to log in first:

jqdatasdk.auth('xxxxxx','xxxxxx')#Log in to joinquant with your account password

Then historical market data can be obtained:

raw_data_everyday = jqdatasdk.get_index_weights('000016.XSHG', date=temporaryTime1_raw)

From the returned data, the constituent list for that day can be parsed and used in the crawling loop.
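
A hedged sketch of turning one day's result into plain lists, assuming (as the full code below relies on) that get_index_weights returns a DataFrame with stock codes as the index and columns including display_name, weight and date:

import datetime
import jqdatasdk

jqdatasdk.auth('xxxxxx', 'xxxxxx')  # your JoinQuant account and password

day = datetime.date(2020, 1, 8)
weights = jqdatasdk.get_index_weights('000016.XSHG', date=day)

codes = list(weights.index)            # e.g. '601318.XSHG'
names = list(weights['display_name'])  # stock names, used as search keywords
weight_values = list(weights['weight'])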

Proxy IP

In practice, ChinaSo turned out to be fairly strict with crawlers (befitting the "national team"), so proxy IPs are needed. The specific IP provider is not named here, to avoid this reading as an advertisement; only the method of using proxy IPs is described.

First, build the IP pool by copying the available IPs and ports from the provider (their API interface could also be used):

    proxies_list=['58.218.200.227:8601', '58.218.200.223:3841', '58.218.200.226:3173', '58.218.200.228:8895', '58.218.200.226:8780', '58.218.200.227:6646', '58.218.200.228:7469', '58.218.200.228:5760', '58.218.200.223:8830', '58.218.200.228:5418', '58.218.200.223:6918', '58.218.200.225:5211', '58.218.200.227:8141', '58.218.200.228:7779', '58.218.200.226:3999', '58.218.200.226:3345', '58.218.200.228:2433', '58.218.200.226:6042', '58.218.200.225:4760', '58.218.200.228:2547', '58.218.200.225:3886', '58.218.200.226:7384', '58.218.200.228:8604', '58.218.200.227:6996', '58.218.200.223:3986', '58.218.200.226:6305', '58.218.200.225:6208', '58.218.200.223:4006', '58.218.200.225:8079', '58.218.200.228:7042', '58.218.200.225:7086', '58.218.200.227:8913', '58.218.200.227:3220', '58.218.200.226:2286', '58.218.200.228:7337', '58.218.200.227:2010', '58.218.200.227:9062', '58.218.200.225:8799', '58.218.200.223:3568', '58.218.200.228:3184', '58.218.200.223:5874', '58.218.200.225:3963', '58.218.200.228:3696', '58.218.200.227:7113', '58.218.200.226:4501', '58.218.200.223:7636', '58.218.200.225:9108', '58.218.200.228:6940', '58.218.200.223:5310', '58.218.200.225:2864', '58.218.200.226:5225', '58.218.200.228:6468', '58.218.200.223:8127', '58.218.200.225:8575', '58.218.200.223:7269', '58.218.200.228:7039', '58.218.200.226:6674', '58.218.200.226:5945', '58.218.200.225:3108', '58.218.200.226:3990', '58.218.200.223:8356', '58.218.200.227:5274', '58.218.200.227:6535', '58.218.200.225:3934', '58.218.200.223:6866', '58.218.200.227:3088', '58.218.200.227:7253', '58.218.200.223:2215', '58.218.200.228:2715', '58.218.200.226:4071', '58.218.200.228:7232', '58.218.200.225:5561', '58.218.200.226:7476', '58.218.200.223:3917', '58.218.200.227:2931', '58.218.200.223:5612', '58.218.200.226:6409', '58.218.200.223:7785', '58.218.200.228:7906', '58.218.200.227:8476', '58.218.200.227:3012', '58.218.200.226:6388', '58.218.200.225:8819', '58.218.200.225:2093', '58.218.200.227:4408', '58.218.200.225:7457', '58.218.200.223:3593', '58.218.200.225:2028', '58.218.200.227:2119', '58.218.200.223:3094', '58.218.200.226:3232', '58.218.200.227:6769', '58.218.200.223:4013', '58.218.200.227:9064', '58.218.200.223:6034', '58.218.200.227:4292', '58.218.200.228:5228', '58.218.200.228:2397', '58.218.200.226:2491', '58.218.200.226:3948', '58.218.200.227:2630', '58.218.200.228:4857', '58.218.200.228:2541', '58.218.200.225:5653', '58.218.200.226:7068', '58.218.200.223:2129', '58.218.200.227:4093', '58.218.200.226:2466', '58.218.200.226:4089', '58.218.200.225:4932', '58.218.200.228:8511', '58.218.200.227:6660', '58.218.200.227:2536', '58.218.200.226:5777', '58.218.200.228:4755', '58.218.200.227:4138', '58.218.200.223:5297', '58.218.200.226:2367', '58.218.200.225:7920', '58.218.200.225:6752', '58.218.200.228:4508', '58.218.200.223:3120', '58.218.200.227:3329', '58.218.200.226:6911', '58.218.200.228:7032', '58.218.200.223:8029', '58.218.200.228:2009', '58.218.200.223:3487', '58.218.200.228:9078', '58.218.200.225:3985', '58.218.200.227:6955', '58.218.200.228:8847', '58.218.200.228:4376', '58.218.200.225:3942', '58.218.200.228:4983', '58.218.200.225:9082', '58.218.200.225:7907', '58.218.200.226:6141', '58.218.200.226:5268', '58.218.200.226:4986', '58.218.200.223:8374', '58.218.200.226:4850', '58.218.200.225:5397', '58.218.200.226:2983', '58.218.200.225:3156', '58.218.200.226:6176', '58.218.200.225:4273', '58.218.200.226:8625', '58.218.200.226:8424', '58.218.200.226:5714', '58.218.200.223:8166', '58.218.200.226:4194', '58.218.200.223:6850', 
'58.218.200.228:6994', '58.218.200.223:3825', '58.218.200.226:7129', '58.218.200.223:3941', '58.218.200.227:8775', '58.218.200.228:4195', '58.218.200.227:4570', '58.218.200.223:3255', '58.218.200.225:6626', '58.218.200.226:8286', '58.218.200.225:4605', '58.218.200.223:3667', '58.218.200.223:7281', '58.218.200.225:6862', '58.218.200.228:2340', '58.218.200.227:7144', '58.218.200.223:3691', '58.218.200.228:3849', '58.218.200.228:7871', '58.218.200.225:6678', '58.218.200.225:6435', '58.218.200.223:3726', '58.218.200.226:8436', '58.218.200.223:7461', '58.218.200.223:4113', '58.218.200.223:3912', '58.218.200.225:4666', '58.218.200.227:7176', '58.218.200.225:5462', '58.218.200.225:8643', '58.218.200.227:7591', '58.218.200.227:2134', '58.218.200.227:5480', '58.218.200.228:9013', '58.218.200.227:5178', '58.218.200.223:8970', '58.218.200.223:5423', '58.218.200.227:2832', '58.218.200.225:5636', '58.218.200.223:2347', '58.218.200.227:4171', '58.218.200.227:5288', '58.218.200.227:4254', '58.218.200.227:3254', '58.218.200.228:6789', '58.218.200.223:4956', '58.218.200.226:6146']

For each request, an IP is randomly chosen from the pool to disguise the crawler:

def randomip_scrapy(proxies_list,url,headers):
    proxy_ip = random.choice(proxies_list)  # pick a random proxy from the pool
    proxies = {'https': 'https://'+proxy_ip, 'http': 'http://'+proxy_ip}
    R = requests.get(url, headers=headers, proxies=proxies)
    return R
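
In the full code below, this function is simply decorated with the retrying package's @retry, so a request through a dead proxy raises an exception and another random proxy is picked on the next attempt:

from retrying import retry
import random
import requests

@retry(stop_max_attempt_number=100)  # a failed proxy just triggers another random pick
def randomip_scrapy(proxies_list, url, headers):
    proxy_ip = random.choice(proxies_list)
    proxies = {'https': 'https://' + proxy_ip, 'http': 'http://' + proxy_ip}
    return requests.get(url, headers=headers, proxies=proxies)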

Code implementation

With the technical difficulties analyzed, here is the complete code:

#Traverse stocks and dates: output each stock's daily data as a 1*n row and combine the data of m stocks into an m*n matrix
import requests
from bs4 import BeautifulSoup
import re
import datetime 
import random
from retrying import retry
import jqdatasdk
import pandas as pd
import numpy as np

jqdatasdk.auth('xxxxxx','xxxxxx')#Log in to joinquant with your account

@retry(stop_max_attempt_number=100)#retry 
def randomip_scrapy(proxies_list,url,headers):
     proxy_ip = random.choice(proxies_list)
     proxies = {'https': 'https://'+proxy_ip,'http':'http://'+proxy_ip}
     R=requests.get(url,headers=headers,proxies=proxies)
     return R

#We need to traverse the data of each stock and day
def newsSpyder(start_year,start_month,start_day,proxies_list,time_step,time_count):
    startTime_raw = datetime.date(year=start_year,month=start_month,day=start_day) 
    temporaryTime1_raw = startTime_raw#Define temporary date
    temporaryTime2_raw = startTime_raw+datetime.timedelta(days=time_step)
    #Structured date
    #startTime = startTime_raw.strftime('%Y%m%d')
    
    j = 0
    #Traversal in time series
    while j < time_count*time_step:
        #For each day, create four lists to record the data
        name_list_everyday = []
        weight_list_everyday = []
        newsnum_list_everyday = []
        date_list = []
        temporaryTime2_raw = temporaryTime1_raw+datetime.timedelta(days=time_step)
        temporaryTime1 = temporaryTime1_raw.strftime('%Y%m%d')#Structured temporary date1
        temporaryTime2 = temporaryTime2_raw.strftime('%Y%m%d')#Structured temporary date2
        #Get data from jqdata and analyze
        raw_data_everyday = jqdatasdk.get_index_weights('000016.XSHG', date=temporaryTime1_raw)#If you want to change the index, change it here
        raw_data_everyday_array = np.array(raw_data_everyday)
        raw_data_everyday_list = raw_data_everyday_array.tolist()
        for row in raw_data_everyday_list:
            name_list_everyday.append(row[1])
            weight_list_everyday.append(row[0])
            date_list.append(row[2])
        j = j + 1
        count = 0
        for name in name_list_everyday:
            
            url="http://news.chinaso.com/newssearch.htm?q="+str(name)+"&type=news&page=0&startTime="+str(temporaryTime1)+"&endTime="+str(temporaryTime2)
            user_agent_list = ["Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
                        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
                        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
                        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
                        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
                        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
                        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
                        ]
            
            #cookies="uid=CgqAiV1r244oky3nD6KkAg==; wdcid=5daa44151f0a6fc9; cookie_name=222.16.63.188.1567349674276674; Hm_lvt_91fa1aefc990a9fc21c08506e5983ddf=1567349649,1567349703; Hm_lpvt_91fa1aefc990a9fc21c08506e5983ddf=1567349703; wdlast=1567351002"
            agent = random.choice(user_agent_list)#Random agent
            headers = {
            'User-Agent':agent
            }
            #If you select unavailable ip, go back to this step
            r = randomip_scrapy(proxies_list,url,headers)
            #proxy_ip = random.choice(proxies_list)
            #proxies = {'https': 'https://'+proxy_ip,'http':'http://'+proxy_ip}
            #r=requests.get(url,headers=headers,proxies=proxies)
            content = r.content
            soup=BeautifulSoup(content,'lxml')
            retext=soup.find(class_='toolTab_xgxwts')
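            # The pattern below mirrors the page's result-count banner (rendered in English here);
            # on the live site the banner text is Chinese, so the literal must match that exact wording.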
            num=re.match('<div class="toolTab_xgxwts">Find news for you(.*)piece</div>',str(retext))
            count = count + 1
            print('Processed', count, 'stocks')
            if num is None:#No match means no news for this stock that day, so record 0
                newsnum_list_everyday.append(0)
            else:
                newsnum_list_everyday.append(int(num.group(1)))
        #Finally, append the four lists for this day to the CSV file
        a = [num for num in newsnum_list_everyday]
        b = [name for name in name_list_everyday]
        c = [weight for weight in weight_list_everyday]
        d = [date for date in date_list]
        dataframe = pd.DataFrame({'num':a,'name':b,'weight':c,'date':d})
        dataframe.to_csv(r"20181022-20181231.csv",mode = 'a',sep=',',encoding="gb2312")
        temporaryTime1_raw = temporaryTime1_raw+datetime.timedelta(days=time_step)#Date + step
        
            



if __name__ == "__main__":
    proxies_list=['58.218.200.227:8601', '58.218.200.223:3841', '58.218.200.226:3173', '58.218.200.228:8895', '58.218.200.226:8780', '58.218.200.227:6646', '58.218.200.228:7469', '58.218.200.228:5760', '58.218.200.223:8830', '58.218.200.228:5418', '58.218.200.223:6918', '58.218.200.225:5211', '58.218.200.227:8141', '58.218.200.228:7779', '58.218.200.226:3999', '58.218.200.226:3345', '58.218.200.228:2433', '58.218.200.226:6042', '58.218.200.225:4760', '58.218.200.228:2547', '58.218.200.225:3886', '58.218.200.226:7384', '58.218.200.228:8604', '58.218.200.227:6996', '58.218.200.223:3986', '58.218.200.226:6305', '58.218.200.225:6208', '58.218.200.223:4006', '58.218.200.225:8079', '58.218.200.228:7042', '58.218.200.225:7086', '58.218.200.227:8913', '58.218.200.227:3220', '58.218.200.226:2286', '58.218.200.228:7337', '58.218.200.227:2010', '58.218.200.227:9062', '58.218.200.225:8799', '58.218.200.223:3568', '58.218.200.228:3184', '58.218.200.223:5874', '58.218.200.225:3963', '58.218.200.228:3696', '58.218.200.227:7113', '58.218.200.226:4501', '58.218.200.223:7636', '58.218.200.225:9108', '58.218.200.228:6940', '58.218.200.223:5310', '58.218.200.225:2864', '58.218.200.226:5225', '58.218.200.228:6468', '58.218.200.223:8127', '58.218.200.225:8575', '58.218.200.223:7269', '58.218.200.228:7039', '58.218.200.226:6674', '58.218.200.226:5945', '58.218.200.225:3108', '58.218.200.226:3990', '58.218.200.223:8356', '58.218.200.227:5274', '58.218.200.227:6535', '58.218.200.225:3934', '58.218.200.223:6866', '58.218.200.227:3088', '58.218.200.227:7253', '58.218.200.223:2215', '58.218.200.228:2715', '58.218.200.226:4071', '58.218.200.228:7232', '58.218.200.225:5561', '58.218.200.226:7476', '58.218.200.223:3917', '58.218.200.227:2931', '58.218.200.223:5612', '58.218.200.226:6409', '58.218.200.223:7785', '58.218.200.228:7906', '58.218.200.227:8476', '58.218.200.227:3012', '58.218.200.226:6388', '58.218.200.225:8819', '58.218.200.225:2093', '58.218.200.227:4408', '58.218.200.225:7457', '58.218.200.223:3593', '58.218.200.225:2028', '58.218.200.227:2119', '58.218.200.223:3094', '58.218.200.226:3232', '58.218.200.227:6769', '58.218.200.223:4013', '58.218.200.227:9064', '58.218.200.223:6034', '58.218.200.227:4292', '58.218.200.228:5228', '58.218.200.228:2397', '58.218.200.226:2491', '58.218.200.226:3948', '58.218.200.227:2630', '58.218.200.228:4857', '58.218.200.228:2541', '58.218.200.225:5653', '58.218.200.226:7068', '58.218.200.223:2129', '58.218.200.227:4093', '58.218.200.226:2466', '58.218.200.226:4089', '58.218.200.225:4932', '58.218.200.228:8511', '58.218.200.227:6660', '58.218.200.227:2536', '58.218.200.226:5777', '58.218.200.228:4755', '58.218.200.227:4138', '58.218.200.223:5297', '58.218.200.226:2367', '58.218.200.225:7920', '58.218.200.225:6752', '58.218.200.228:4508', '58.218.200.223:3120', '58.218.200.227:3329', '58.218.200.226:6911', '58.218.200.228:7032', '58.218.200.223:8029', '58.218.200.228:2009', '58.218.200.223:3487', '58.218.200.228:9078', '58.218.200.225:3985', '58.218.200.227:6955', '58.218.200.228:8847', '58.218.200.228:4376', '58.218.200.225:3942', '58.218.200.228:4983', '58.218.200.225:9082', '58.218.200.225:7907', '58.218.200.226:6141', '58.218.200.226:5268', '58.218.200.226:4986', '58.218.200.223:8374', '58.218.200.226:4850', '58.218.200.225:5397', '58.218.200.226:2983', '58.218.200.225:3156', '58.218.200.226:6176', '58.218.200.225:4273', '58.218.200.226:8625', '58.218.200.226:8424', '58.218.200.226:5714', '58.218.200.223:8166', '58.218.200.226:4194', '58.218.200.223:6850', 
'58.218.200.228:6994', '58.218.200.223:3825', '58.218.200.226:7129', '58.218.200.223:3941', '58.218.200.227:8775', '58.218.200.228:4195', '58.218.200.227:4570', '58.218.200.223:3255', '58.218.200.225:6626', '58.218.200.226:8286', '58.218.200.225:4605', '58.218.200.223:3667', '58.218.200.223:7281', '58.218.200.225:6862', '58.218.200.228:2340', '58.218.200.227:7144', '58.218.200.223:3691', '58.218.200.228:3849', '58.218.200.228:7871', '58.218.200.225:6678', '58.218.200.225:6435', '58.218.200.223:3726', '58.218.200.226:8436', '58.218.200.223:7461', '58.218.200.223:4113', '58.218.200.223:3912', '58.218.200.225:4666', '58.218.200.227:7176', '58.218.200.225:5462', '58.218.200.225:8643', '58.218.200.227:7591', '58.218.200.227:2134', '58.218.200.227:5480', '58.218.200.228:9013', '58.218.200.227:5178', '58.218.200.223:8970', '58.218.200.223:5423', '58.218.200.227:2832', '58.218.200.225:5636', '58.218.200.223:2347', '58.218.200.227:4171', '58.218.200.227:5288', '58.218.200.227:4254', '58.218.200.227:3254', '58.218.200.228:6789', '58.218.200.223:4956', '58.218.200.226:6146']
    
    
    newsSpyder(2018,12,17,proxies_list,1,15)#Start year, month, day, proxy pool, time step (days), number of steps
    #22.46

Summary

The scale of this project is not small, but the technical difficulties are all basic crawler problems, so the overall difficulty is not high. This was the first project in which the blogger applied crawler knowledge end to end, including data acquisition, page parsing, IP disguising via proxies, and data cleaning. It was also good practice.

Original link: https://blog.csdn.net/chandler_scut/article/details/106685617
