Python crawls JD.com (Jingdong) on a large scale
Main Tools
scrapy
BeautifulSoup
requests
Analysis steps
Open the JD.com home page and search for "pants"; the page jumps to the search results. This results page is the starting point of our analysis.
We can see that this page is not complete: as we scroll down, more pictures keep loading, which is Ajax, and by the time we reach the bottom the page has loaded 60 pairs of pants in total. Open the Chrome developer tools and inspect the page elements, and you can see that the information for each pair of pants sits inside a <li class='gl-item'></li> tag, as shown below:
However, when we open the page source, we find that it only contains the first 30 items; the last 30 are nowhere to be found. That points to Ajax, an asynchronous loading method, so the next step is to capture the request. Open Chrome, press F12, click Network and then XHR, which makes the request easier to find. Start capturing, as shown below:
From the capture we can find the requested URL, which carries a long string of parameters. Let's try removing some of them and see whether it still opens. The simplified url = https://search.jd.com/s_new.p...{0}&s=26&scrolling=y&pos=30&show_items={1}
Here show_items is the id of the pants and page is the page number. As you can see, we only need to change these two places to open different pages. The page number is easy to find, and you will notice an interesting thing: the page numbers of the main page are odd, while the page numbers of the asynchronously loaded requests are even, so just fill in an even number here (odd numbers are also accessible). As for show_items, it is the id, which we can find in the page source: the id sits in the data-pid attribute of the li tag. See the following figure for details.
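To make the odd/even relationship concrete, here is a tiny sketch (my own illustration, not code from the post) that maps a main-page number to the page number used by the asynchronous request:

# Main pages use odd page numbers (1, 3, 5, ...); the asynchronously loaded request
# for the remaining 30 items simply uses that number plus one (2, 4, 6, ...).
def async_page(main_page):
    return main_page + 1

for main_page in (1, 3, 5):
    print(main_page, '->', async_page(main_page))   # 1 -> 2, 3 -> 4, 5 -> 6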
Now that we know how to find the parameters, we can start coding.
Code Explanation
First we need to get the source of the web page. Here I use the requests library, which can be installed with pip install requests. The code is as follows:
def get_html(self):
    res = requests.get(self.url, headers=self.headers)
    html = res.text
    return html  # return the page source
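For readers who want to run this step on its own, here is a self-contained sketch. self.url and self.headers belong to the spider class, so the concrete values below (the search url and the User-Agent header in particular) are assumptions for illustration only.

import requests

# Assumed stand-ins for self.url / self.headers from the spider class
url = 'https://search.jd.com/Search?keyword=%E8%A3%A4%E5%AD%90&page=1'  # an assumed search url
headers = {'User-Agent': 'Mozilla/5.0'}  # a minimal User-Agent; the real spider may send more headers

def get_html(url, headers):
    res = requests.get(url, headers=headers)
    return res.text  # return the page source

if __name__ == '__main__':
    print(len(get_html(url, headers)))  # length of the returned source, just to confirm the request worked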
From the analysis above, the second step is to get the show_items parameter for the asynchronously loaded url, which is the data-pid attribute of the li tag. The code is as follows:
def get_pids(self):
    html = self.get_html()
    soup = BeautifulSoup(html, 'lxml')           # create the BeautifulSoup object
    lis = soup.find_all("li", class_='gl-item')  # find the li tags
    for li in lis:
        data_pid = li.get("data-pid")            # get the data-pid attribute of the li tag
        if data_pid:
            self.pids.add(data_pid)              # self.pids is a set, used to filter duplicates
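As a self-contained illustration of the data-pid lookup, the snippet below runs against a tiny fragment of made-up markup (the markup is fabricated for demonstration; only the li.gl-item / data-pid structure comes from the page analysis above):

from bs4 import BeautifulSoup

# Made-up fragment mimicking the structure described above
html = '''
<ul>
  <li class="gl-item" data-pid="1111111"><div class="p-img"><img src="a.jpg"></div></li>
  <li class="gl-item" data-pid="2222222"><div class="p-img"><img data-lazy-img="b.jpg"></div></li>
</ul>
'''

soup = BeautifulSoup(html, 'lxml')
pids = set()
for li in soup.find_all('li', class_='gl-item'):
    data_pid = li.get('data-pid')
    if data_pid:
        pids.add(data_pid)

print(pids)  # e.g. {'1111111', '2222222'} (set order may vary)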
Next we get the urls of the first 30 pictures, i.e. the pictures on the main page. One problem is that the img tags do not all use the same attribute: in the source code, pictures that have already been loaded use the src attribute, while pictures that have not been loaded yet use data-lazy-img, so we have to handle both cases when parsing the page. The code is as follows:
def get_src_imgs_data(self):
    html = self.get_html()
    soup = BeautifulSoup(html, 'lxml')
    divs = soup.find_all("div", class_='p-img')  # image containers
    # divs_prices = soup.find_all("div", class_='p-price')  # prices
    for div in divs:
        img_1 = div.find("img").get('data-lazy-img')  # url of an image that has not been loaded yet
        img_2 = div.find("img").get("src")            # url of an image that has already been loaded
        if img_1:
            print(img_1)
            self.sql.save_img(img_1)
            self.img_urls.add(img_1)
        if img_2:
            print(img_2)
            self.sql.save_img(img_2)
            self.img_urls.add(img_2)
The first 30 pictures have been found; now we go after the last 30. For that we need to request the asynchronously loaded url, and we have already worked out the required parameters, so here is the code:
def get_extend_imgs_data(self):
    # self.search_urls = self.search_urls + ','.join(self.pids)
    # Fill in the url template: the ids in show_items are joined with ',', and the page
    # number is even, i.e. the (odd) main page number plus one.
    self.search_urls = self.search_urls.format(str(self.search_page), ','.join(self.pids))
    print(self.search_urls)
    html = requests.get(self.search_urls, headers=self.headers).text  # request the asynchronously loaded page
    soup = BeautifulSoup(html, 'lxml')
    div_search = soup.find_all("div", class_='p-img')  # parse
    for div in div_search:
        img_3 = div.find("img").get('data-lazy-img')  # again look up both img attributes separately
        img_4 = div.find("img").get("src")
        if img_3:  # the data-lazy-img attribute
            print(img_3)
            self.sql.save_img(img_3)  # store in the database
            self.img_urls.add(img_3)  # deduplicate with a set
        if img_4:  # the src attribute
            print(img_4)
            self.sql.save_img(img_4)
            self.img_urls.add(img_4)
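The self.sql.save_img calls in the two methods above refer to a small storage helper that is not shown in the post; the results go into a MySQL database via MySQLdb, as mentioned in the next paragraph. Below is a minimal sketch of what such a helper could look like; the connection parameters and the table layout are all assumptions.

import MySQLdb

class SqlHelper(object):
    """Minimal sketch of the storage helper used as self.sql above (all details assumed)."""

    def __init__(self):
        # Assumed connection parameters and database name
        self.conn = MySQLdb.connect(host='localhost', user='root',
                                    passwd='password', db='jd', charset='utf8')
        self.cursor = self.conn.cursor()

    def save_img(self, img_url):
        # One image url per row; the imgs(url) table is an assumed layout
        self.cursor.execute("INSERT INTO imgs (url) VALUES (%s)", (img_url,))
        self.conn.commit()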
The above is enough to crawl, but speed still has to be considered. Here I use multi-threading and start one thread per page, and the speed is acceptable: the whole job finishes in a few minutes, crawling 100 pages in total. The results are stored in a MySQL database using the MySQLdb library; you can search Baidu for the details. Of course you could also use MongoDB, but I haven't learned it yet. Friends who want the source code can find it on GitHub: GitHub Source
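As a rough sketch of the one-thread-per-page idea with the standard threading module (the body of crawl_page is a placeholder; in the real spider it would call the three methods shown above for that page):

import threading

def crawl_page(page):
    # Placeholder for the per-page work: in the real spider this would build the urls for
    # this (odd) main page and call get_pids(), get_src_imgs_data() and get_extend_imgs_data().
    print('crawling main page', page)

threads = []
for page in range(1, 200, 2):  # 100 main pages; main-page numbers are odd, as analysed above
    t = threading.Thread(target=crawl_page, args=(page,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()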
Extension
If you look at the web address of the first search page, you can see that keyword and wq are the words you typed in. If you want to crawl other products, just change these two parameters to the words you want to search for; you can write the Chinese characters directly and they will be url-encoded automatically when the request is made. I have tried this myself, and you can check the source code if you want. If you want to keep crawling continuously, you can write the words you want to search for into a file and read them from there. The above is just a plain crawler without any framework; next I will write about crawling with the Scrapy framework, so please keep following my blog!!!
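As a small illustration of the point about Chinese keywords being encoded automatically, requests url-encodes query parameters for you. The search endpoint below is an assumption based on the keyword / wq parameters mentioned above:

import requests

# Assumed search endpoint; keyword and wq carry the search term, as described above
params = {'keyword': u'裤子', 'wq': u'裤子', 'page': 1}
res = requests.get('https://search.jd.com/Search', params=params,
                   headers={'User-Agent': 'Mozilla/5.0'})
print(res.url)  # the Chinese characters are percent-encoded in the final url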