Recently, while writing a crawler in Node.js, we ran into a strange problem. The crawler ran normally, but it always stalled partway through: it stopped making progress, yet it reported no error and did not exit. Worse, the moment at which it stalled was random, which made the problem a headache to track down.
Our first guess was a stack overflow. On checking the code, though, Node.js is built on an event loop and the crawler is almost entirely callbacks, so the stack is unlikely to overflow. Was the process stuck because it used too much memory? Running gc periodically made no difference. In the end, the cause turned out to be that no timeout was set on the request: when a request never received a response, its callback never ran, and the crawler could not move on to the next step.
However, we did not find a timeout setting in the Node.js http module that covered this case, so we simulated one with a timer. The code uses the http module, the cheerio module (which turns the crawled page into a document tree that can be queried like jQuery), and iconv-lite (which converts the character encoding of the page data). There is a pitfall with encoding: data fetched with http.get is treated as utf-8 by default, so a gbk page comes out garbled unless it is decoded explicitly. Here is the code:
const http = require("http");
const fs = require("fs");
const cheerio = require("cheerio");
const iconv = require("iconv-lite");

// `url` is the address of the page to crawl; it is assumed to be defined elsewhere.
var req = null;

// The timeout is simulated with a timer: if there is no response within
// 20 seconds, abort the request so that the error callback runs.
// The key call is req.abort().
var request_timer = setTimeout(function () {
    req.abort();
    console.log("Request Timeout.");
}, 20000);

req = http.get(url, res => {
    // A response has arrived, so clear the timer whether the request succeeded or failed
    clearTimeout(request_timer);
    console.log("Prepare to get data");
    var data = [];
    // The body arrives in chunks, so listen for "data" and collect each chunk
    res.on("data", chunk => {
        console.log("Data acquisition in progress");
        data.push(chunk);
    });
    res.on("end", () => {
        // Decode with iconv-lite, using the charset declared in the page's meta tag
        // (gb2312 here); if the page declares none, the default is utf-8
        var html = iconv.decode(Buffer.concat(data), "gb2312");
        // Parse the page into a DOM-like tree
        var $ = cheerio.load(html);
        console.log("Finished processing");
        // Select the desired data with jQuery-like syntax
        var text = $('').eq(1).text();
    });
});

// Remember to listen for the "error" event here; otherwise a request error will end the process
req.on("error", function () {
    console.log("Error halt");
});
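One limitation of the timer approach above is that the timer is cleared as soon as the response starts, so a body that stalls halfway through could still leave the crawler waiting. A minimal sketch of a variant that keeps the timer running until the whole body has been read might look like the following. The helper name getWithTimeout and its callback signature are assumptions made for illustration, and req.destroy() is used in place of req.abort(), which newer Node.js versions mark as deprecated. Decoding and parsing are left out so the sketch depends only on the http module.

```javascript
const http = require("http");

// Hypothetical helper (not part of the original code): fetch `url` and call
// `callback(err, body)` exactly once. `timeoutMs` bounds the whole request,
// including the time spent reading the body.
function getWithTimeout(url, timeoutMs, callback) {
    let finished = false;

    // Make sure the callback runs only once, whichever event arrives first.
    function done(err, body) {
        if (finished) return;
        finished = true;
        clearTimeout(timer);
        callback(err, body);
    }

    const req = http.get(url, res => {
        const chunks = [];
        // Collect the raw chunks; they are joined only once the body is complete.
        res.on("data", chunk => chunks.push(chunk));
        res.on("end", () => done(null, Buffer.concat(chunks)));
        res.on("error", err => done(err));
    });

    // Network errors, and the error caused by destroying the request, land here.
    req.on("error", err => done(err));

    const timer = setTimeout(() => {
        // Report the timeout first, then tear down the socket.
        done(new Error("Request timed out after " + timeoutMs + " ms"));
        req.destroy();
    }, timeoutMs);
}
```

A call such as getWithTimeout(url, 20000, (err, body) => { ... }) would then receive either an error or a Buffer holding the raw page, which can be passed to iconv.decode and cheerio.load as before. Because the callback always fires, the crawler can move on to the next page even when a server goes silent.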