Python crawls JD.com (Jingdong) at scale

Posted by Vince889 on Sat, 06 Jul 2019 18:30:29 +0200


Main Tools

  • scrapy

  • BeautifulSoup

  • requests

Analysis steps

  • Open the JD.com home page and search for "pants"; you will be taken to the search results page. That page is the starting point of our analysis.

  • We can see that the page does not load completely at first: as we scroll down, more images keep loading via Ajax, and by the time we reach the bottom the page has loaded 60 pairs of pants. If we open the Chrome developer tools and inspect the page elements, we can see that each item's information sits inside a <li class='gl-item'></li> tag, as shown below:

  • However, when we view the page source, we find that it contains only the first 30 items; the last 30 are nowhere to be found. That points to Ajax (asynchronous loading), so we need to capture the requests. Open Chrome, press F12, click the Network tab, and then click XHR, which makes the request easier to find. Then start capturing, as shown below:

  • The requested URL can be found from the capture, and it carries a long query string. Let's try removing parts of it and see whether the page still opens. The simplified url= https://search.jd.com/s_new.p...{0}&s=26&scrolling=y&pos=30&show_items={1}
    Here show_items is the item ID and page is the page number, so we only need to change those two values to open different pages. The page number is easy to find, and there is an interesting detail: the main page uses odd page numbers, while the asynchronously loaded request uses even page numbers, so we just fill in an even number here (odd numbers are also accessible). The show_items value is the item ID, which we can find in the page source: it is stored in the data-pid attribute of each li tag. See the figure below for details.

  • Now that we know how to find the parameters, we can start coding; a small sketch of how they fit together follows this list.
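
To make the two parameters concrete, here is a minimal sketch of how the odd/even page numbers and show_items relate. The page number and the data-pid values are made up for illustration, and the XHR path is left truncated exactly as in the capture above.

    # Illustration of the odd/even page numbers and the show_items parameter.
    # The data-pid values are hypothetical; the XHR path is truncated as in the post.
    main_page = 1                      # the main search page uses odd page numbers
    async_page = main_page + 1         # the asynchronous request uses the even number
    xhr_template = 'https://search.jd.com/s_new.p...{0}&s=26&scrolling=y&pos=30&show_items={1}'

    pids = ['4426168', '5089253']      # made-up data-pid values scraped from the li tags
    xhr_url = xhr_template.format(async_page, ','.join(pids))
    print(xhr_url)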

Code Explanation

  • First we need to get the source of the page. Here I use the requests library (installed with pip install requests); the code is as follows:

    def get_html(self):
        res = requests.get(self.url, headers=self.headers)
        html = res.text     
        return html    #Return Source Code
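
The method above refers to self.url and self.headers, which are set up in the class constructor (not shown in the post). Here is a minimal sketch of what that skeleton might look like, with attribute names taken from the methods in this post, illustrative values, and a FakeSql stub standing in for the author's MySQL helper:

    import requests
    from bs4 import BeautifulSoup


    class FakeSql(object):
        """Stand-in for the author's MySQL helper, which is not shown in the post."""
        def save_img(self, url):
            pass   # the real version would INSERT the url into a table


    class JdSpider(object):
        """Hypothetical skeleton holding the methods shown in this post."""

        def __init__(self, url, page):
            self.url = url                   # main (odd-numbered) search results page
            self.search_page = page + 1      # the asynchronous request uses the even number
            # Simplified XHR template from the capture; the path is truncated in the post
            self.search_urls = 'https://search.jd.com/s_new.p...{0}&s=26&scrolling=y&pos=30&show_items={1}'
            self.headers = {'User-Agent': 'Mozilla/5.0'}   # pretend to be a browser
            self.pids = set()                # data-pid values, deduplicated
            self.img_urls = set()            # collected image URLs, deduplicated
            self.sql = FakeSql()             # placeholder for the MySQL storage layer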

  • From the analysis above, the second step is to get the show_items parameter for the asynchronously loaded URL, which is the data-pid of each li tag. The code is as follows:

    def get_pids(self):
        html = self.get_html()
        soup = BeautifulSoup(html, 'lxml')    #Create BeautifulSoup object
        lis = soup.find_all("li", class_='gl-item')   #Find the li tag
        for li in lis:
            data_pid = li.get("data-pid")      #Get the data-pid under the li tag
            if data_pid:
                self.pids.add(data_pid)    # self.pids is a set used to filter out duplicates

  • Next we get the URLs of the first 30 images, i.e. the images on the main page. One catch is that the img tags do not all use the same attribute: in the source code, images that have already loaded use the src attribute, while images that have not loaded yet use data-lazy-img, so we have to handle both cases when parsing the page. The code is as follows:

    def get_src_imgs_data(self):
        html = self.get_html()
        soup = BeautifulSoup(html, 'lxml')
        divs = soup.find_all("div", class_='p-img')  # picture
        # divs_prices = soup.find_all("div", class_='p-price')   #Price
        for div in divs:
            img_1 = div.find("img").get('data-lazy-img')  # Get URLs that are not loaded
            img_2 = div.find("img").get("src")  # Get the url that's already loaded
            if img_1:
                print(img_1)
                self.sql.save_img(img_1)     # store in the database
                self.img_urls.add(img_1)     # deduplicate with a set
            if img_2:
                print(img_2)
                self.sql.save_img(img_2)
                self.img_urls.add(img_2)

The first 30 images have been found; now we look for the last 30. For that we need to request the asynchronously loaded URL, and we have already found the required parameters, so here is the code:

    def get_extend_imgs_data(self):
        # self.search_urls=self.search_urls+','.join(self.pids)
        self.search_urls = self.search_urls.format(str(self.search_page), ','.join(self.pids))  # build the URL: join the ids for show_items with ',' and use the even page number (the main page's number plus one)
        print(self.search_urls)
        html = requests.get(self.search_urls, headers=self.headers).text   #request
        soup = BeautifulSoup(html, 'lxml')   
        div_search = soup.find_all("div", class_='p-img')   #analysis
        for div in div_search:  
            img_3 = div.find("img").get('data-lazy-img')    #Here you can see the separate lookup of img properties
            img_4 = div.find("img").get("src")

            if img_3:    # the data-lazy-img case
                print(img_3)
                self.sql.save_img(img_3)    # store in the database
                self.img_urls.add(img_3)    # deduplicate with a set
            if img_4:    # the src case
                print(img_4)
                self.sql.save_img(img_4)
                self.img_urls.add(img_4)

  • The code above is enough to crawl everything, but speed still matters. Here I use multithreading, starting one thread per page, and the speed is acceptable: the whole job finished in a few minutes, crawling 100 pages in total. For storage I use a MySQL database through the MySQLdb library; you can Baidu the details. Of course you could also use MongoDB, but I haven't learned it yet. Friends who want the source code can check the GitHub Source. A rough sketch of the one-thread-per-page idea follows this item.
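
As a minimal sketch of the one-thread-per-page idea, assuming the hypothetical JdSpider skeleton sketched earlier; the shape of the main search URL is an assumption based on the keyword and wq parameters discussed in the next section.

    # One thread per main page; JdSpider is the hypothetical skeleton above, and
    # the search-URL shape is an assumption (keyword/wq are discussed below).
    import threading

    def crawl_page(page):
        url = 'https://search.jd.com/Search?keyword=裤子&enc=utf-8&wq=裤子&page={}'.format(page)
        spider = JdSpider(url, page)
        spider.get_pids()               # collect the data-pid values first
        spider.get_src_imgs_data()      # first 30 images, from the main page
        spider.get_extend_imgs_data()   # last 30 images, from the XHR request

    threads = []
    for page in range(1, 200, 2):       # 100 main pages, odd page numbers 1..199
        t = threading.Thread(target=crawl_page, args=(page,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()                        # wait until every page has finished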

Extensions

If you look at the URL of the first search page, you can see that keyword and wq are the words you typed in. If you want to crawl other products, just change those two parameters to the words you want to search for; you can write the Chinese characters directly, and they will be URL-encoded automatically when the request is made (I have tried it), so you can crawl whatever you want. If you want to keep crawling continuously, you can write the words you want to search for into a file and read them from there, as sketched below. The above is just a plain crawler without any framework; next I will write about crawling with the Scrapy framework, so please keep following my blog!
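
For example, here is a small sketch of that idea, assuming one keyword per line in a file called keywords.txt; the file name and the search-URL shape are illustrative.

    # Drive the crawl from a keyword file, one keyword per line. The file name and
    # the URL shape are assumptions based on the keyword/wq parameters above.
    from urllib.parse import quote

    with open('keywords.txt', encoding='utf-8') as f:
        keywords = [line.strip() for line in f if line.strip()]

    for word in keywords:
        encoded = quote(word)            # requests would also percent-encode this for you
        url = 'https://search.jd.com/Search?keyword={0}&enc=utf-8&wq={0}&page=1'.format(encoded)
        print(url)                       # feed this url to the crawler class sketched above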

My Blog

Topics: Python SQL Database network