Tuchong - Preface
After a flurry of writing, this blog has finally reached its 10th article, and more crawler modules will gradually appear from here on. Someone asked when scrapy would show up; I expect that to be around article 30. The pace ahead will still be slow, so don't be anxious; the full 100 articles should take 4-5 months. Once those are done I'll also cover common anti-crawling measures, as well as how to deal with logins.
Tuchong - Crawling tuchong.com
Why crawl this site? I honestly can't say how I found it, but the picture quality is excellent, nothing like the flashy, cheap-looking alternatives, so I started crawling it. Searching around, other people have crawled it too, but mostly with Python 2; nobody seems to have written a Python 3 version yet, so I decided to write one by hand.
Start Page
https://tuchong.com/explore/
This page lists a lot of tags, and each tag holds a large number of photos. To keep things harmonious (read: safe for work), I picked a very well-behaved tag: flowers. You can pick any other tag instead, or even crawl all of them.
https://tuchong.com/tags/%E8%8A%B1%E5%8D%89/  # the tag 花卉 (flowers) is URL-encoded as %E8%8A%B1%E5%8D%89; don't let that throw you
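If you'd rather not copy the encoded form by hand, the standard library can percent-encode any tag for you; a quick sketch:

```python
from urllib.parse import quote

tag = quote('花卉')  # str input is UTF-8 percent-encoded by default in Python 3
print(tag)  # %E8%8A%B1%E5%8D%89
print('https://tuchong.com/tags/{}/'.format(tag))
```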
This time we'll also get some practice with Python's queue module, which we haven't used in earlier articles.
Below are some notes on queue that I gathered from other people's write-ups; plenty of basic crawlers were built on it in the early days. A short runnable demo follows the list.
1. Initialization: `queue.Queue(maxsize)` creates a FIFO (first in, first out) queue.
2. Common methods in the package:
   - `queue.qsize()` returns the size of the queue
   - `queue.empty()` returns True if the queue is empty, False otherwise
   - `queue.full()` returns True if the queue is full, False otherwise
   - `queue.full()` corresponds to the `maxsize` limit
   - `queue.get([block[, timeout]])` takes an item from the queue; `timeout` is how long to wait
3. Create a Queue object: `import queue` then `myqueue = queue.Queue(maxsize=10)`
4. Put a value into the queue: `myqueue.put(10)`
5. Take a value out of the queue: `myqueue.get()`
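Tying those methods together, a minimal runnable demonstration:

```python
from queue import Queue

myqueue = Queue(maxsize=3)  # FIFO queue holding at most 3 items
for i in range(3):
    myqueue.put(i)

print(myqueue.full())   # True: maxsize reached
print(myqueue.qsize())  # 3
print(myqueue.get())    # 0 comes out first: first in, first out
print(myqueue.empty())  # False: two items remain
```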
Start coding
First, let's lay out the skeleton of the main method; the key points are in the comments.

```python
# Imports for the whole script (threading, time, and requests are used in later blocks)
import threading
import time
import requests
from queue import Queue


def main():
    # Declare a queue and store 100 page numbers in it with a loop
    page_queue = Queue(100)
    for i in range(1, 101):
        page_queue.put(i)

    # Result queue (image URLs waiting to be downloaded)
    data_queue = Queue()

    # List that records the crawl threads
    thread_crawl = []
    # Start 4 threads at once
    craw_list = ['Crawl thread 1', 'Crawl thread 2', 'Crawl thread 3', 'Crawl thread 4']
    for thread_name in craw_list:
        c_thread = ThreadCrawl(thread_name, page_queue, data_queue)
        c_thread.start()
        thread_crawl.append(c_thread)

    # Wait for page_queue to empty, i.e. for the crawl step to finish
    while not page_queue.empty():
        pass


if __name__ == '__main__':
    main()
```
After this code runs, the four threads start successfully and the program waits for them to finish. Note that you still need to write the ThreadCrawl class.

```python
class ThreadCrawl(threading.Thread):

    def __init__(self, thread_name, page_queue, data_queue):
        # threading.Thread.__init__(self)  # Call the parent class initializer
        super(ThreadCrawl, self).__init__()
        self.threadName = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue

    def run(self):
        print(self.threadName + ' started ************')
```
Run results
The threads start up fine. Now we just add the actual crawling code to the run method. Here I introduce a global variable to signal whether crawling should stop.
```python
CRAWL_EXIT = False
```
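Before wiring this flag into the crawler, here is the pattern in isolation: worker threads poll a module-level boolean, and the main thread flips it to ask them to stop. The names below are illustrative, not from the article.

```python
import threading
import time

STOP = False  # Module-level flag that the worker thread polls


def worker():
    while not STOP:
        time.sleep(0.1)  # Stand-in for real work
    print('worker saw the flag and exited')


t = threading.Thread(target=worker)
t.start()
time.sleep(0.5)
STOP = True  # Flipping the flag in the main thread signals the worker
t.join()
```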
First declare the flag at module level and flesh out the run method of ThreadCrawl; the lines to add to the main method come at the end of the block below.

```python
CRAWL_EXIT = False  # The flag is declared here, at module level


class ThreadCrawl(threading.Thread):

    def __init__(self, thread_name, page_queue, data_queue):
        # threading.Thread.__init__(self)  # Call the parent class initializer
        super(ThreadCrawl, self).__init__()
        self.threadName = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue

    def run(self):
        print(self.threadName + ' started ************')
        while not CRAWL_EXIT:
            try:
                global tag, url_format, headers, img_format  # Bring in the global values
                # An empty queue raises an exception, which ends this thread's loop
                page = self.page_queue.get(block=False)  # Take a page number from the queue
                spider_url = url_format.format(tag, page, 100)  # Build the URL to crawl
                print(spider_url)
            except Exception:
                break

            timeout = 4  # Retry a few times; if every attempt fails, give up on this page
            while timeout > 0:
                timeout -= 1
                try:
                    with requests.Session() as s:
                        response = s.get(spider_url, headers=headers, timeout=3)
                    json_data = response.json()
                    if json_data is not None:
                        posts = json_data["postList"]
                        for post in posts:
                            for img in post["images"]:
                                img_url = img_format.format(img["user_id"], img["img_id"])
                                # Store the captured image link in a new queue for the next step
                                self.data_queue.put(img_url)
                    break
                except Exception as e:
                    print(e)
            if timeout <= 0:
                print('time out!')


def main():
    # ... the code above ...

    # Wait for page_queue to empty, i.e. for the crawl step to finish
    while not page_queue.empty():
        pass

    # Once page_queue is empty, tell the crawl threads to exit their loop
    global CRAWL_EXIT
    CRAWL_EXIT = True

    # Check whether the queue received any values
    print(data_queue)
```
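One caveat: run references four module-level values (tag, url_format, headers, img_format) that this post never defines. A hypothetical sketch of what they might look like follows; the endpoint and CDN URL patterns are my guesses inferred from how the values are used, not something the article confirms.

```python
# Hypothetical definitions -- the article never shows these, so every URL
# pattern here is an assumption inferred from how the names are used above.
tag = '%E8%8A%B1%E5%8D%89'  # URL-encoded "花卉" (flowers)
url_format = 'https://tuchong.com/rest/tags/{}/posts?page={}&count={}'  # guessed JSON endpoint
img_format = 'https://photo.tuchong.com/{}/f/{}.jpg'  # guessed image URL pattern
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # any common browser UA
}
```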
After a test run, data_queue does contain data! Ha-ha. What follows uses the same pattern again, this time to download the pictures.
Improving the main method

```python
DOWN_EXIT = False  # Download-exit flag at module level (the article omits this declaration)


def main():
    # ... the code above ...

    for thread in thread_crawl:
        thread.join()
    print("Crawl threads terminated")

    thread_image = []
    image_list = ['Download thread 1', 'Download thread 2', 'Download thread 3', 'Download thread 4']
    for thread_name in image_list:
        Ithread = ThreadDown(thread_name, data_queue)
        Ithread.start()
        thread_image.append(Ithread)

    while not data_queue.empty():
        pass

    global DOWN_EXIT
    DOWN_EXIT = True

    for thread in thread_image:
        thread.join()
    print("Download threads finished")
```
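One design note: because main joins every crawl thread before the download threads even start, the two stages run strictly one after the other; data_queue is already fully populated when the downloaders begin, which keeps the exit logic simple at the cost of no overlap between crawling and downloading.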
Then add the ThreadDown class, which handles downloading the pictures.

```python
class ThreadDown(threading.Thread):

    def __init__(self, thread_name, data_queue):
        super(ThreadDown, self).__init__()
        self.thread_name = thread_name
        self.data_queue = data_queue

    def run(self):
        print(self.thread_name + ' started ************')
        while not DOWN_EXIT:
            try:
                img_link = self.data_queue.get(block=False)
                self.write_image(img_link)
            except Exception:
                pass

    def write_image(self, url):
        with requests.Session() as s:
            response = s.get(url, timeout=3)
            img = response.content  # Grab the binary stream

        # Note: the image/ directory must already exist
        file_name = 'image/' + str(time.time()) + '.jpg'
        try:
            with open(file_name, 'wb') as file:
                file.write(img)
            print(file_name + ' downloaded')
        except Exception as e:
            print(e)
```
After it runs, all that's left is to wait for the pictures to come down~
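As an aside, the busy-wait loops (while not queue.empty(): pass) can be replaced by the queue's own bookkeeping. This is not what the article does, just a sketch of the alternative using task_done/join with a sentinel value:

```python
# Alternative pattern (not the article's approach): let the queue itself
# track completion instead of busy-waiting on empty().
import threading
from queue import Queue

q = Queue()


def consumer():
    while True:
        item = q.get()
        if item is None:      # Sentinel: no more work is coming
            q.task_done()
            break
        # ... process item here ...
        q.task_done()         # Mark this item as fully handled


t = threading.Thread(target=consumer)
t.start()
for i in range(10):
    q.put(i)
q.put(None)   # One sentinel per consumer thread
q.join()      # Blocks until every put item has been marked done
t.join()
```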
Key comments have been added throughout the code. Since this one is fairly simple, I'll upload the code to GitHub later.
And if you swap the flowers tag above for something like 'xx', the results are simply out of this world.