[crawler notes] R language simple dynamic web crawler (rvest package) example

Posted by michealholding on Tue, 08 Feb 2022 16:32:55 +0100

1. Crawler goal

Python crawlers are generally considered powerful, but Python can struggle with dynamically loaded pages or sites that require login. For ordinary crawling tasks like this one, R can be more convenient.

Taking https://www.thepaper.cn/ as an example, the rvest package is used to crawl the news items (title, content and time) on the home page. (The httr package is another common choice for dynamic web pages.)

While first learning this, I referred to the Bilibili video "A 20-minute introduction to web crawling with R". Its code threw an error when run, however, so I modified the original code.

2. Code analysis and error correction

library(rvest)
library(stringr)

url <- "https://www.thepaper.cn/"
web <- read_html(url)    #Read the html of the home page

news <- web %>% html_nodes('h2 a')   #Each headline is an a node under h2 (see Note1)
title <- news %>% html_text()  #Get the text part
link <- news %>% html_attrs()  #Get the attributes (including the URL) of each title

#Keep the first attribute (the href) of each node
link1 <- character(length(link))
for(i in seq_along(link))
{
  link1[i] <- link[[i]][1]
}
link2 <- paste("https://www.thepaper.cn/", link1, sep="")   #Relative href -> absolute URL

##Get the full text of each news item
news_content <- character(length(link2))
for(i in seq_along(link2))
{
  web <- read_html(link2[i])
  if(length(html_nodes(web,'div.news_txt'))==1)   #Ordinary news page (see Note2)
    news_content[i] <- html_text(html_nodes(web,'div.news_txt'))
  else                                            #Video page
    news_content[i] <- trimws(str_replace_all(html_text(html_nodes(web,'div.video_txt_l p')), "[\r\n]", ""))
}


##Get the time of each news item
news_date <- character(length(link2))
for(i in seq_along(link2))
{
  web <- read_html(link2[i])
  if(length(html_nodes(web,'div.news_txt'))==1)   #Ordinary news page: the time is in the second "div p" node (see Note3)
    news_date[i] <- trimws(str_replace_all(html_text((html_nodes(web,"div p"))[2]), "[\r\n]", ""))
  else                                            #Video page
    news_date[i] <- trimws(str_replace_all(html_text(html_nodes(web,'div.video_txt_l span')), "[\r\n]", ""))
}

##Split "YYYY-MM-DD HH:MM" into a date part and a time part
date <- character(length(link2))
time <- character(length(link2))
for(i in seq_along(link2))
{
  date[i] <- substring(news_date[i],1,10)
  time[i] <- substring(news_date[i],12,16)   # is.character(news_date[i])
}

news_01 <- data.frame(title,date,time,url=link2,news_content)

save(news_01, file="news_information.Rdata")
write.csv(news_01, file="news_information.csv")
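The substring() calls above assume the scraped timestamp has the fixed-width form "YYYY-MM-DD HH:MM". A minimal sketch of the split, using a hypothetical sample string rather than one taken from the live site:

```r
# Hypothetical timestamp in the fixed-width form the loop above assumes
news_date_sample <- "2022-02-08 16:32"

date_part <- substring(news_date_sample, 1, 10)   # characters 1..10  -> "2022-02-08"
time_part <- substring(news_date_sample, 12, 16)  # characters 12..16 -> "16:32"
```

If the site ever changes the timestamp format, these fixed positions break; splitting on the space would be the more robust choice.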

Note1:

Press F12 to open the developer tools and click a headline: each title corresponds to an a node under h2, as shown in the figure below:

Note2:

The original video used the following code to obtain the text content of each news item, but it throws an error ("the length of the replacement parameter is zero"). As follows:

> for(i in 1:length(link2))
+ {
+     news_content[i] <- read_html(link2[i]) %>% html_nodes('div.news_txt') %>% html_text()
+ }
Error in news_content[i] <- read_html(link2[i]) %>% html_nodes("div.news_txt") %>%  : 
  The length of the replacement parameter is zero

The error occurs because some pages contain only video: on those pages the "div.news_txt" node matches nothing, so the value obtained from it has length 0. The code in this post therefore checks the page type first and extracts the text from ordinary news pages and video pages separately.
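The failure can be reproduced without any network access. A minimal sketch of the zero-length assignment and of the length() guard used in the loop above:

```r
# Stand-in for html_text(html_nodes(...)) when the selector matches nothing
matched <- character(0)
out <- character(1)

# Assigning a zero-length value into one element raises the error shown above
res <- tryCatch({ out[1] <- matched; "ok" }, error = function(e) "error")

# The guard: only assign when exactly one node matched, otherwise fall back
if (length(matched) == 1) out[1] <- matched else out[1] <- NA_character_
```

Falling back to NA here is one option; the article's loop instead falls back to the video-page selector.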

Note3:

The original video's code for extracting the news time also no longer works (this beginner suspects the site may have been updated, possibly as an anti-scraping measure!!!).

So this beginner located the nodes again, still using the developer tools. On ordinary news pages the time sits in the "div p" nodes, and the extracted string contains newline characters and spaces; on video pages the time sits in "div.video_txt_l span", and the extracted string carries a small tail (the news source).
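One way to drop such a trailing source name is to strip the last whitespace-separated token. The sample string here is hypothetical, so the pattern may need adjusting to the real page:

```r
# Hypothetical video-page time string with a trailing source name ("tail")
raw_time <- "2022-02-08 16:32 ThePaper"

# Remove the final whitespace-separated token
clean_time <- sub("\\s+\\S+$", "", raw_time)   # -> "2022-02-08 16:32"
```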

Note4:

Most importantly: although there is a Warning, it doesn't matter, because I can't solve it anyway. I was very happy when the result came out. My first ever crawler, a success!

Finally, many thanks to the Bilibili uploader. I hope this doesn't infringe on anyone's rights and interests. That's all!

 

3. Strange knowledge has increased

pacman::p_load(XML, rvest, jiebaR, dplyr, stringr): loads multiple packages at once, installing any that are missing (it combines install.packages() and library());

read_html(): reads an html document (web page);

html_nodes(): selects and extracts the elements/nodes in the document that match a selector;

html_text(): extracts the text inside a tag;

html_attrs(): extracts the attribute names and their values;

trimws(): removes the spaces before and after a string;

str_replace_all(): replaces all matches of a pattern; in this article it replaces the newline characters [\r\n] with nothing;

is.character(): tests whether an object is a character string.
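A small base-R demo of the cleanup combination used throughout the article; gsub() is base R's counterpart of stringr::str_replace_all():

```r
s <- "  first line\r\nsecond line  "

no_breaks <- gsub("[\r\n]", "", s)   # drop carriage returns and newlines
cleaned   <- trimws(no_breaks)       # drop leading and trailing spaces

cleaned                 # "first linesecond line"
is.character(cleaned)   # TRUE
```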

 

 

Topics: R Language github crawler Data Mining