The first time I wrote a blog, my intention was to share my knowledge with you, and also to record my own journey, with both its frustrations and its joys. If you want to see the process, read through it; if not, you can jump straight to the bottom and take the code.
The cause: my girlfriend and another friend who i ...
Posted by Venkatesh on Tue, 08 Mar 2022 20:00:11 +0100
When writing a crawler, we often need to parse a website's list pages. For example, consider the following snippet:
<title>Test relative path</title>
<h1>List of books</h1> ...
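The excerpt is cut off above, but the page title hints at the core problem: links on a list page are often relative paths that must be resolved against the page's own URL. A minimal sketch of that idea, assuming lxml for parsing and the standard library's urljoin (the page URL and HTML body here are hypothetical stand-ins):

from urllib.parse import urljoin

from lxml import html

# Hypothetical page URL and body standing in for the truncated example above.
page_url = 'https://example.com/books/index.html'
page_body = '''
<title>Test relative path</title>
<h1>List of books</h1>
<a href="../detail/1.html">Book 1</a>
<a href="../detail/2.html">Book 2</a>
'''

tree = html.fromstring(page_body)
# Relative hrefs like ../detail/1.html must be resolved against the page URL.
for href in tree.xpath('//a/@href'):
    print(urljoin(page_url, href))
# https://example.com/detail/1.html
# https://example.com/detail/2.html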
Posted by adamh91 on Tue, 08 Mar 2022 17:57:47 +0100
When writing a crawler, we may need to generate a new URL from the current one. For example, the following pseudocode:
import re

current_url = 'https://www.kingname.info/archives/page/2/'
current_page = re.search(r'/(\d+)', current_url).group(1)  # first digit run after a slash: the page number
next_page = int(current_page) + 1
next_url = re.sub(r'\d+', str(next_page), current_url)  # safe here only because the page number is the sole digit run
Posted by simanta on Tue, 08 Mar 2022 17:43:46 +0100
The json module encodes Python objects into JSON strings and decodes JSON strings back into Python objects.
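A minimal sketch of the round trip with the standard library's json module (the sample object is made up for illustration):

import json

book = {'title': 'Test relative path', 'pages': 120, 'in_stock': True}

encoded = json.dumps(book, ensure_ascii=False)   # Python object -> JSON string
decoded = json.loads(encoded)                    # JSON string -> Python object

print(encoded)          # {"title": "Test relative path", "pages": 120, "in_stock": true}
print(decoded == book)  # True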
Posted by bam2550 on Tue, 08 Mar 2022 13:50:15 +0100
With economic development and technological progress, the car has become a necessary means of transportation for many families. On top of that, a car and a house are often prerequisites for marriage, which quietly adds to the pressure on men. At times like this, we urgently need a car. The second-hand car marke ...
When writing crawlers, we often run into situations like these:
The website is complex, and we end up sending many duplicate requests. Sometimes the crawler is interrupted unexpectedly, but we haven't saved the crawling state, so running it again means crawling everything from scratch.
So how do we solve these duplicate-crawling problems? You probably think of ...
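The post's own answer is cut off above; one common approach, sketched here as an assumption rather than the author's method, is to persist a set of fingerprints of already-crawled URLs, so duplicates are skipped and an interrupted run can resume (the state file and helper names are hypothetical):

import hashlib
import os

SEEN_FILE = 'seen_urls.txt'  # hypothetical state file

def load_seen():
    if not os.path.exists(SEEN_FILE):
        return set()
    with open(SEEN_FILE) as f:
        return set(line.strip() for line in f)

def fingerprint(url):
    return hashlib.md5(url.encode('utf-8')).hexdigest()

seen = load_seen()

def crawl(url):
    fp = fingerprint(url)
    if fp in seen:
        return  # already crawled: skip the duplicate request
    # ... fetch and parse the page here ...
    seen.add(fp)
    with open(SEEN_FILE, 'a') as f:
        f.write(fp + '\n')  # persist state so a rerun resumes instead of restarting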
Posted by gavinandresen on Mon, 07 Mar 2022 09:42:37 +0100
This article is a self-study note and is for reference only. Learning course: Feixue City IT, on Bilibili.
Crawler: using a program to obtain resources from the Internet.
The robots.txt protocol: specifies which of a website's data must not be crawled. It is only a convention, however, and does not prevent malicious crawling.
General steps of a crawler:
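The steps themselves are cut off in this excerpt; assuming the usual pipeline of specify URL, send request, get response, parse, persist, here is a minimal sketch using the common third-party requests library (the URL and filename are hypothetical):

import requests

# 1. Specify the target URL (hypothetical).
url = 'https://example.com/books/'
# 2. Send the request; a User-Agent header makes the crawler look like a browser.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
# 3. Get the response data.
page_text = response.text
# 4. Parse / extract what we need (left as a placeholder here).
# 5. Persist the result.
with open('books.html', 'w', encoding='utf-8') as f:
    f.write(page_text)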
Crawlers are a good thing. I've been wanting to use one recently, so I'll publish the little things I did before and write a few blog posts~
First of all, to be clear: a coroutine is not multithreading; in essence a coroutine still runs on a single thread. The characteristic of a coroutine, however, is that when the current coroutine e ...
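The excerpt breaks off mid-sentence; presumably it continues with the usual point that when the current coroutine hits a blocking wait, the event loop switches to another one. A minimal sketch of that behavior with the standard library's asyncio (the task names and delays are made up):

import asyncio

async def fetch(name, delay):
    print(f'{name}: start')
    # asyncio.sleep stands in for blocking I/O such as a network request;
    # while this coroutine waits, the single thread runs the other one.
    await asyncio.sleep(delay)
    print(f'{name}: done after {delay}s')

async def main():
    await asyncio.gather(fetch('task-1', 2), fetch('task-2', 1))

asyncio.run(main())  # task-2 finishes first even though everything runs on one thread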
Posted by adamh91 on Sat, 05 Mar 2022 01:05:18 +0100
Using Selenium in crawlers
I. Selenium overview
Selenium is a web automated testing tool, originally developed for automating website tests. Selenium can drive the browser directly. It supports all mainstream browsers (including PhantomJS, which has no graphical interface). It can receive instructions and let the browser ...
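The excerpt stops short; as a minimal sketch of driving a browser from a crawler, assuming Selenium 4 with Chrome (the target URL and selector are hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a local Chrome install
try:
    driver.get('https://example.com/books/')               # hypothetical URL
    titles = driver.find_elements(By.CSS_SELECTOR, 'h1')   # hypothetical selector
    for t in titles:
        print(t.text)
    html = driver.page_source  # the fully rendered page, including JS-generated content
finally:
    driver.quit()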
Posted by Eskimo887 on Thu, 03 Mar 2022 09:21:58 +0100