How to quickly crawl Sina News and save it locally

Posted by sandrine2411 on Wed, 17 Jun 2020 05:31:34 +0200

This article shows you how to quickly crawl Sina News and save it locally. I hope it helps! If there is anything you would like to learn about or suggest, please leave the author a message~


1, Crawling scenario

1. Web loading mode

  • Dynamic web page


Dynamic web pages differ from traditional static pages; crawling them in the traditional way will fail.

  • Static web page


The screenshot above shows a traditional static web page.

2. Web page structure

  • List page → detail page
  • API traversal

2, Crawling the news by API traversal

1. Finding the API

  • Use the Network panel of the developer tools to find the API
  • API: https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r=0.7210316507361691&callback=jQuery1112011987950839178185_1585738993071&_=1585738993083

2. Analyze key API parameters

  • pageid: appears to be the identifier Sina assigns to this news list in the API
  • num: the number of news items returned per page; this parameter can be modified
  • page: the page number within the news list; modify this parameter to traverse the list (see the sketch below)
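As a small illustration (not part of the original code), the same request can be built with an explicit parameter dictionary so that num and page are easy to vary programmatically:

import requests

#Minimal sketch: pass the key parameters explicitly instead of hard-coding them in the URL
api = "https://feed.mix.sina.com.cn/api/roll/get"
params = {"pageid": 153, "lid": 2509, "k": "", "num": 50, "page": 1}
resp = requests.get(api, params=params)
print(resp.status_code) #200 means the request succeeded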

3. Design crawler logic according to API parameters

  • First, use the API to traverse the news URLs
  • Then request each detail page by its URL and extract the data we want (a skeleton of this two-step logic follows)
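A minimal skeleton of that two-step logic, assuming the same API parameters as above (the extraction step is left as a placeholder and is implemented later in this article):

import requests

#Sketch: traverse the API for article URLs, then fetch each detail page
api = "https://feed.mix.sina.com.cn/api/roll/get"
for page in range(1, 3): #traverse the first two list pages as an example
    data = requests.get(api, params={"pageid": 153, "lid": 2509, "k": "", "num": 50, "page": page}).json()
    for item in data["result"]["data"]:
        detail_html = requests.get(item["url"]).content #fetch the detail page
        #...extract the title, body, time and source here (see the extraction module below)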

4. Test run

3, Code implementation

1. Try to fetch the dynamic page

import requests
print(requests.get("https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r=0.7210316507361691&callback=jQuery1112011987950839178185_1585738993071&_=1585738993083").content.decode("utf-8"))


First of all, we can see that some of the URL's parameters are redundant with respect to the results that come back, so we can try to simplify the URL (for example, dropping the callback parameter). At the same time, the body decoded as "utf-8" still shows escaped characters, so decoding it as "unicode_escape" instead makes the Chinese content readable.

In this way we can get the content we want.

We can use json.loads() to parse the returned string into a Python object:

import requests
import json
print(json.loads(requests.get("https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r=0.7210316507361691").content))
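If the callback parameter is kept in the URL instead, the body comes back wrapped in a JSONP callback; a hedged sketch of stripping that wrapper before parsing (the exact wrapper text depends on the callback value) could look like this:

import re
import json
import requests

#Sketch only: strip a JSONP wrapper such as jQuery...({...}); before calling json.loads
url = ("https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1"
       "&callback=jQuery1112011987950839178185_1585738993071")
raw = requests.get(url).content.decode("utf-8")
match = re.search(r"\(\s*(\{.*\})\s*\)", raw, re.S) #take the JSON object inside the outermost parentheses
if match:
    data_json = json.loads(match.group(1))
    print(len(data_json["result"]["data"])) #number of news items returned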

2. Examine the URL

https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r=0.7210316507361691

At this point we can use an online timestamp tool, for example:
Address: https://tool.lu/timestamp/

Testing the parameters shows that:
  • num controls how many news items one page returns
  • page selects which page of the list is returned

Next, we look at the value of ctime in the response and convert it from a Unix timestamp into a readable time:
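The conversion can also be done directly in Python; a minimal sketch (1585738993 is just an illustrative timestamp, not taken from a real response):

import time

#ctime in the API response is a Unix timestamp in seconds
ctime = 1585738993 #illustrative value
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(ctime))) #prints a human-readable local time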

The code in the following sections is the key part.

3. Procedure

3.1. Import required packages

#Import required packages
import codecs #Used to store the crawled information
from pybloom_live import ScalableBloomFilter #For URL deduplication
import requests #Used to send requests and obtain web page content
import json #For processing data in JSON format
from bs4 import BeautifulSoup as bs #For data extraction
from lxml import etree #For data extraction
import re #Regular expression library

3.2. Write the extraction module

1. Use BeautifulSoup: write a function that takes a URL parameter and extracts the news title, body, source, time and other information from the detail page.

#Define a function. The function is to obtain the news title, content, source, time and other information in the detail page through the incoming URL parameter.
#Function name: getdetailpagebybs; required parameter: URL
def getdetailpagebybs(url):
    detail = {} #Create a dictionary to store URL, title, newstime and other information
    detail["url"] = url #Save the URL time into the corresponding key value in the detail dictionary
    page=requests.get(url).content #Use requests.get to obtain the page source; bs4 detects the page encoding itself, so no decode is needed here
    html=bs(page, "lxml") #Using the lxml parser
    title=html.find(class_="main-title") #Get the title of the news page. There is only one element with class="main-title" on the page, so find is enough
    print(title.text) #Show news headlines
    detail["title"] = title.text #Save the news title as text in the corresponding key value in the detail dictionary
    artibody=html.find(class_="article") #Using find method to get article information in news web page
    #print(artibody.text)
    detail["artibody"]=artibody.text#. . . . . . . 
    date_source = html.find(class_="date-source") #Use find method to get date source information in news page
    #Because different tag elements are used between different news detail pages, direct extraction may result in errors, so judgment statements are used to distinguish crawling
    if date_source.a: #Determine whether the date source node contains the 'a' element
        #print(date_source.span.text)
        detail["newstime"]=date_source.span.text #Extract time information from 'span' tag
        #print(date_source.a.text)
        detail["newsfrom"] =date_source.a.text #Extract news source information from 'a' tag
    else:
        #print(date_source("span")[0].text)
        detail["newstime"] =date_source("span")[0].text #Extract the time information contained in the 'span' label
        #print(date_source("span")[1].text)
        detail["newsfrom"] =date_source("span")[1].text#Extract news source information contained in the 'span' tag
    #You can also use regular expressions to extract information
    #print(date_source.prettify())
    r=re.compile(r"(\d{4}年\d{2}月\d{2}日 \d{2}:\d{2})") #Regular expression for the time information (Sina dates look like "2020年06月17日 05:31")
    re_newstime=r.findall(date_source.text) #Use findall to extract the time information contained in the date_source node
    detail["re_newstime"] =re_newstime[0] if re_newstime else "" #findall returns a list, so take the first match (if any) and save it in the detail dictionary
    
    return detail #The return value of the function is the dictionary where the extracted information is stored
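A hedged usage example (the URL below is only a placeholder; in the real crawler the links come from the API's "url" field):

#Placeholder URL for illustration; substitute a real article link returned by the roll API
detail = getdetailpagebybs("https://news.sina.com.cn/c/xxxx-xx-xx/doc-xxxxxxxx.shtml")
print(detail["title"], detail["newstime"], detail["newsfrom"])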

2. Use lxml: write a function that uses lxml and the xpath method to extract the news title, body, source, time and other information from the detail page.

#Define a function. The function is to obtain the news title, content, source, time and other information in the detail page through the incoming URL parameter.
#Function name: getdetailpagebylxml; required parameter: URL
def getdetailpagebylxml(url):
    detail={} #Create a dictionary to store URL, title, newstime and other information
    detail["url"]=url #Store the URL in the corresponding key value in the dictionary
    page = requests.get(url).content.decode("utf-8") #Get the source code of the web page and use utf-8 encoding
    
    #Because the page structure may change as the site is updated, the xpath copied from the browser cannot always be used directly
    #For example, the xpath of the date-source element copied from the page here is '//*[@id="top_bar"]/div/div[2]', and with that xpath the element cannot be retrieved
    #After manually adjusting it to '//div[@class="date-source"]', the element information can be obtained normally
    html = etree.HTML(page)
    title = html.xpath("/html/head/title/text()")[0] #Extracting title information using xpath method
    detail["title"]=title #Store the title in the corresponding key value in the dictionary
    print(title)
    artibody = html.xpath("//div[@class=\"article\"]/p//text()") #Extract article information using xpath method
    detail["artibody"]=''.join(artibody) # The article body is an original list object, which is connected by the join method
    date_source = html.xpath("//div[@class=\"date-source\"]//text()")
    detail["newstime"]=date_source[1]
    detail["newsfrom"]=date_source[3]
    #Regular expressions can also be used to extract the time, in the same way as in the BeautifulSoup version above
    r=re.compile(r"(\d{4}年\d{2}月\d{2}日 \d{2}:\d{2})")
    re_newstime=r.findall("".join(date_source)) #date_source is a list of strings here, so join it before matching
    detail["re_newstime"] =re_newstime[0] if re_newstime else "" #findall returns a list; take the first match if any
    
    return detail 
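To illustrate the xpath adjustment mentioned in the comments above, here is a small hedged sketch (the URL is a placeholder, and the exact node structure can differ between Sina article templates):

import requests
from lxml import etree

#Placeholder URL for illustration; substitute a real article link from the API response
page = requests.get("https://news.sina.com.cn/c/xxxx-xx-xx/doc-xxxxxxxx.shtml").content.decode("utf-8")
html = etree.HTML(page)
copied = html.xpath('//*[@id="top_bar"]/div/div[2]//text()') #the xpath copied from DevTools, often returns nothing
adjusted = html.xpath('//div[@class="date-source"]//text()') #the hand-adjusted xpath, returns the date and source strings
print(copied, adjusted)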

3.3. Write the storage module

Write a function that uses the codecs package to save the extracted information to a file at a specified location.

#Function name: savenews; required parameters: data, new
def savenews(data,new):
    fp = codecs.open('.\\sinaNews\\'+new+'.txt', 'a+', 'utf-8')
    fp.write(json.dumps(data, ensure_ascii=False))
    fp.close()
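One practical note: codecs.open will fail if the sinaNews folder does not exist yet, so it is safer to create it first; a minimal sketch:

import os

#Create the output directory before the crawler starts writing files
os.makedirs('sinaNews', exist_ok=True)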

3.4. Write the crawler main body

Use the two modules above to run the crawler.

1. Set global variables

#Initialize global variables
#Use the ScalableBloomFilter module to deduplicate the obtained URLs
urlbloomfilter=ScalableBloomFilter(initial_capacity=100, error_rate=0.001, mode=ScalableBloomFilter.LARGE_SET_GROWTH)
page = 1 #Set page number for initial crawling
error_url=set() #Create a collection for error URL links
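A quick illustration of how the deduplication behaves (the URLs are made-up examples):

from pybloom_live import ScalableBloomFilter

#ScalableBloomFilter supports add() and "in" checks; URLs already seen are skipped by the crawler
demo_filter = ScalableBloomFilter(initial_capacity=100, error_rate=0.001)
demo_filter.add("https://news.sina.com.cn/example.shtml") #made-up URL
print("https://news.sina.com.cn/example.shtml" in demo_filter) #True
print("https://news.sina.com.cn/other.shtml" in demo_filter) #False (with high probability)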

2. Get URL

  • Because the Sina News page in this case is a dynamic web page, it cannot be crawled directly. You need to look at the page's Network traffic in the developer tools, find the API URL, and use it as the starting URL for crawling; the number of pages to crawl is controlled by splicing in the 'page' parameter.
  • Use a loop to control the crawler, and call the extraction module and storage module written above to run it.
1. Use the BeautifulSoup extraction module and save to file
#Use the BeautifulSoup extraction module and save to file
#Set an upper limit on the pages to crawl. Since this is only a demonstration, only the first page of news is crawled here
while page <= 1: 
    #Start to get url list with API as index
    data = requests.get("https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page="+str(page)) #Splicing URL and obtaining index page information
    if data.status_code == 200:  #Get the web page data when the request page returns 200 (correct)
        #json the acquired data
        data_json = json.loads(data.content)
        news=data_json.get("result").get("data") #Get the data in the data node under the result node, which is the information of the news details page
        #From the news detail page information list news, use for loop to traverse the information of each news detail page
        for new in news:
            # Duplicate check, extract the URL from new, and use scalable bloom filter to duplicate check
            if new["url"] not in urlbloomfilter:
                urlbloomfilter.add(new["url"]) #Put the crawled URL into urlbloom filter
                try:
                    #print(new)
                    #bs4 is used in extraction module
                    detail = getdetailpagebybs(new["url"])
                    #Save the article to txt (commented out for this demonstration)
                    #savenews(detail,new['docid'][-7:])
                except Exception as e:
                    error_url.add(new["url"]) #Save the URL that failed to crawl normally to the collection error_url
        page+=1 #Page number plus 1
2. Use the lxml extraction module and save to file
while page <= 1:
    #Start to get url list with API as index
    data = requests.get("https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page="+str(page)) #Splicing URL and obtaining index page information
    if data.status_code == 200:  #Get the web page data when the request page returns 200 (correct)
        #json the acquired data
        data_json = json.loads(data.content)
        news=data_json.get("result").get("data") #Get the data in the data node under the result node, which is the information of the news details page
        #From the news detail page information list news, use for loop to traverse the information of each news detail page
        for new in news:
            # Duplicate check, extract the URL from new, and use scalable bloom filter to duplicate check
            if new["url"] not in urlbloomfilter:
                urlbloomfilter.add(new["url"]) #Put the crawled URL into urlbloom filter
                try:
                    #print(new)
                    #Using lxml for extraction module
                    detail=getdetailpagebylxml(new["url"])
                    #Save the article to txt (commented out for this demonstration)
                    #savenews(detail,new['docid'][-7:])
                except Exception as e:
                    error_url.add(new["url"]) #Save the URL that failed to crawl normally to the collection error_url
        page+=1 #Page number plus 1

4, Full code

import codecs #Used to store the crawled information
from pybloom_live import ScalableBloomFilter #For URL deduplication
import requests #Used to send requests and obtain web page content
import json #For processing data in JSON format
from bs4 import BeautifulSoup as bs #For data extraction
import re #Regular expression library


def getdetailpagebybs(url):
    detail = {}  # Create a dictionary to store URL, title, newstime and other information
    detail["url"] = url  # Save the URL time into the corresponding key value in the detail dictionary
    page = requests.get(url).content  # Use requests.get to obtain the page source; bs4 detects the page encoding itself, so no decode is needed here
    html = bs(page, "lxml")  # Using the lxml parser
    title = html.find(class_="main-title")  # Get the title of the news page. There is only one element with class="main-title", so find is enough
    print(title.text)  # Show news headlines
    detail["title"] = title.text  # Save the news title as text in the corresponding key value in the detail dictionary
    artibody = html.find(class_="article")  # Using find method to get article information in news web page
    print(artibody.text)
    detail["artibody"] = artibody.text  # . . . . . . . 
    date_source = html.find(class_="date-source")  # Use find method to get date source information in news page
    # Because different tag elements are used between different news detail pages, direct extraction may result in errors, so judgment statements are used to distinguish crawling
    if date_source.a:  # Determine whether the date source node contains the 'a' element
        print(date_source.span.text)
        detail["newstime"] = date_source.span.text  # Extract time information from 'span' tag
        print(date_source.a.text)
        detail["newsfrom"] = date_source.a.text  # Extract news source information from 'a' tag
    else:
        print(date_source("span")[0].text)
        detail["newstime"] = date_source("span")[0].text  # Extract the time information contained in the 'span' label
        print(date_source("span")[1].text)
        detail["newsfrom"] = date_source("span")[1].text  # Extract news source information contained in the 'span' tag
    return detail  # The return value of the function is the dictionary where the extracted information is stored


def savenews(data,new):
    fp = codecs.open('D:/sinaNews/'+new+'.txt', 'a+', 'utf-8')
    fp.write(json.dumps(data, ensure_ascii=False))
    fp.close()

#Use the ScalableBloomFilter module to deduplicate the obtained URLs
urlbloomfilter=ScalableBloomFilter(initial_capacity=100, error_rate=0.001, mode=ScalableBloomFilter.LARGE_SET_GROWTH)
page = 1 #Set page number for initial crawling
error_url=set() #Create a collection for error URL links
#Use the BeautifulSoup extraction module and save to file
#Set an upper limit on the pages to crawl
while page <= 10:
    #Start to get url list with API as index
    data = requests.get("https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page="+str(page)) #Splicing URL and obtaining index page information
    if data.status_code == 200:  #Get the web page data when the request page returns 200 (correct)
        #json the acquired data
        data_json = json.loads(data.content)
        news=data_json.get("result").get("data") #Get the data in the data node under the result node, which is the information of the news details page
        #From the news detail page information list news, use for loop to traverse the information of each news detail page
        for new in news:
            # Duplicate check, extract the URL from new, and use scalable bloom filter to duplicate check
            if new["url"] not in urlbloomfilter:
                urlbloomfilter.add(new["url"]) #Put the crawled URL into urlbloom filter
                try:
                    print(new)
                    #bs4 is used in extraction module
                    detail = getdetailpagebybs(new["url"])
                    #Save the article to txt
                    savenews(detail,new['docid'][-7:]) #The file is named after the last 7 characters of the docid
                except Exception as e:
                    error_url.add(new["url"]) #Save the URL that failed to crawl normally to the collection error_url
        page+=1 #Page number plus 1
        

The results are as follows:


Friends passing by, if you feel you learned something, please leave a like before you go. Experienced readers are welcome to comment and correct any mistakes, and readers with questions are welcome to comment or send a private message. Your attention is my motivation to keep updating this blog!!!

Topics: JSON network encoding