Crawling the images from a WeChat official account push with Python and saving them as a PPT

Posted by ozPATT on Mon, 31 Jan 2022 09:21:35 +0100

1, Foreword

A couple of days ago I saw a push from Dong Fuguo's WeChat official account, "Python cabin", titled "Learning Python+numpy array operation and matrix operation, just look at this 254 page PPT". In this push, Mr. Dong shared all 254 pages of the PPT from his courseware as images.

I have been learning machine learning recently and am trying to re-implement in Python the Matlab code from Mr. Wu Enda's (Andrew Ng's) Deep Learning course. I want to master matrix operations in Python as soon as possible, so this PPT would be very useful to me. However, there is a requirement for obtaining it:

But 36 likes is a bit difficult for me, since I have few friends on WeChat... Yet I really want this PPT. What to do? It seems the only option is to write a script to crawl these images and import them into a PPT automatically. My apologies to Mr. Dong here; I couldn't help freeloading, and I hope he can forgive me~

Final results:

2, Start

(1) Get the push URL

To crawl a web page, you naturally need its URL first. Since opening a push directly in WeChat generates a randomized link, I instead obtained the URL of the target push from the WeChat official account admin backend; see my other article, Crawling the official account with python, for details. The target push link is:

https://mp.weixin.qq.com/s?__biz=MzI4MzM2MDgyMQ==&mid=2247493780&idx=1&sn=efde5d4ec9a5c04383b709ace1222518&chksm=eb8943cedcfecad84ffb4794d19cce4c7ae3344ffa0c361f646d10689717af33dab5da93332f#rd

(2) Crawl the web page and save the extracted pictures

1. Define the PictureSpider class used to crawl the pictures in the push

This time, I chose to implement the crawler in an object-oriented way.

First, define a class dedicated to crawling the pictures in an official account push, named PictureSpider. On initialization it takes the push link to crawl, start_url, and sets the crawler's headers (here I keep the Cookie and User-Agent in a "wechat.yaml" file, which is read automatically during initialization; a sample of that file is sketched after the code):

class PictureSpider(object):
    def __init__(self, start_url):
        # Read the Cookie and User-Agent from the local config file
        with open("wechat.yaml", "r") as file:
            file_data = file.read()
        self.config = yaml.safe_load(file_data)

        self.headers = {
            "Cookie": self.config['cookie'],
            "User-Agent": self.config['user_agent']
        }
        self.start_url = start_url
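For reference, a minimal wechat.yaml might look like the sketch below. The two keys are the ones the code actually reads; the values are placeholders that you must replace with the Cookie and User-Agent taken from your own logged-in browser session:

# wechat.yaml -- placeholder values, replace with your own session data
cookie: "<your WeChat admin backend cookie string>"
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."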

2. Define the get_url_text() method

The get_url_text() method requests the push and returns its HTML content:

def get_url_text(self):
    """
    Return the HTML content of the web page.

    Returns:
    r.text -> str: web page content
    """
    r = requests.get(self.start_url, headers=self.headers)
    if r.status_code != 200:
        print("Page load failed")
    else:
        return r.text

3. Define the search_pictures() method

The search_pictures() method extracts the images from the push and returns a list of the URLs of all the images it finds (a quick sanity check of the regex follows the code):

def search_pictures(self, text):
    """
    Search the web page for image URLs.

    Parameters:
    text -> str: web page content

    Returns:
    pictures_url -> list: list of image URLs
    """
    pictures_url = []

    # Capture the src attribute of every <img> tag
    re_image = re.compile('<img.*?src="(.*?)".*?>', re.S)
    pictures = re_image.finditer(text)
    for img in pictures:
        if img.group(1):
            pictures_url.append(img.group(1))
    return pictures_url
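As a quick sanity check, here is the same regex run on a toy HTML snippet (a made-up example, not the real push, whose markup is messier):

import re

sample = ('<img class="a" src="http://example.com/1.jpg">'
          '<img src="http://example.com/2.png">')
re_image = re.compile('<img.*?src="(.*?)".*?>', re.S)
print([m.group(1) for m in re_image.finditer(sample)])
# Output: ['http://example.com/1.jpg', 'http://example.com/2.png']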

4. Define the download_pictures() method

The download_pictures() method iterates over each URL in the pictures_url list, downloads the image content, and calls the save_picture() method to save it into the specified folder:

def download_pictures(self, pictures_url, save_address):
    """
    Download the pictures.

    Parameters:
    pictures_url -> list: list of image URLs
    save_address -> str: folder to save the pictures in
    """
    i = 1
    n = len(pictures_url)
    for url in pictures_url:
        r = requests.get(url, headers=self.headers)
        if r.status_code != 200:
            print(f"Image download failed, {i}/{n}")
            i += 1
            continue
        else:
            self.save_picture(r, str(i), save_address)
            print(f"Success, {i}/{n}")
            i += 1
        # Pause briefly every few images to avoid hammering the server
        if i % 5 == 0:
            time.sleep(1)

5. Define the save_picture() method

The save_picture() method writes the image to a file in binary mode; I also create the target folder if it does not exist yet (a caveat about the hard-coded .jpg extension follows the code):

def save_picture(self, r, name, save_address):
    """
    Save a picture.

    Parameters:
    r -> requests.Response: response returned by the image URL request
    name -> str: file name to save the picture under
    save_address -> str: folder to save the picture in
    """
    os.makedirs(save_address, exist_ok=True)  # create the folder if needed
    with open(os.path.join(save_address, name + '.jpg'), "wb") as f:
        f.write(r.content)
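One caveat: everything is saved with a .jpg extension, while WeChat image URLs typically carry the real format in a wx_fmt query parameter (an observation about mmbiz image links, not something the code above relies on). If you want the correct extension, a small helper along these lines could replace the hard-coded '.jpg':

from urllib.parse import urlparse, parse_qs

def guess_extension(url):
    # WeChat mmbiz image URLs usually put the format in "wx_fmt";
    # fall back to jpg when the parameter is absent
    fmt = parse_qs(urlparse(url).query).get('wx_fmt', ['jpg'])[0]
    return '.' + fmt

print(guess_extension("https://mmbiz.qpic.cn/mmbiz_png/abc?wx_fmt=png"))  # .png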

6. Start crawling and results

Finally, instantiate PictureSpider and start crawling:

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/s?__biz=MzI4MzM2MDgyMQ==&mid=2247493780&idx=1&sn=efde5d4ec9a5c04383b709ace1222518&chksm=eb8943cedcfecad84ffb4794d19cce4c7ae3344ffa0c361f646d10689717af33dab5da93332f#rd"
    spider = PictureSpider(url)
    text = spider.get_url_text()
    pictures_address = spider.search_pictures(text)
    spider.download_pictures(pictures_address, "Numpy_PPT")

After the crawl finishes, you can see that all the pictures in the target push have been saved to the "Numpy_PPT" folder, named in order:

7. Complete code of the crawler

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    :   picture_spider.py
@Time    :   2021/06/06 19:35:55
@Author  :   YuFanWenShu 
@Contact :   1365240381@qq.com
'''

# here put the import lib

import requests
import yaml
import re
import os
import time

class PictureSpider(object):
    def __init__(self, start_url):
        # Read the Cookie and User-Agent from the local config file
        with open("wechat.yaml", "r") as file:
            file_data = file.read()
        self.config = yaml.safe_load(file_data)

        self.headers = {
            "Cookie": self.config['cookie'],
            "User-Agent": self.config['user_agent']
        }
        self.start_url = start_url

    def get_url_text(self):
        """
        Return the HTML content of the web page.

        Returns:
        r.text -> str: web page content
        """
        r = requests.get(self.start_url, headers=self.headers)
        if r.status_code != 200:
            print("Page load failed")
        else:
            return r.text

    def search_pictures(self, text):
        """
        Search the web page for image URLs.

        Parameters:
        text -> str: web page content

        Returns:
        pictures_url -> list: list of image URLs
        """
        pictures_url = []

        # Capture the src attribute of every <img> tag
        re_image = re.compile('<img.*?src="(.*?)".*?>', re.S)
        pictures = re_image.finditer(text)
        for img in pictures:
            if img.group(1):
                pictures_url.append(img.group(1))
        return pictures_url

    def download_pictures(self, pictures_url, save_address):
        """
        Download the pictures.

        Parameters:
        pictures_url -> list: list of image URLs
        save_address -> str: folder to save the pictures in
        """
        i = 1
        n = len(pictures_url)
        for url in pictures_url:
            r = requests.get(url, headers=self.headers)
            if r.status_code != 200:
                print(f"Image download failed, {i}/{n}")
                i += 1
                continue
            else:
                self.save_picture(r, str(i), save_address)
                print(f"Success, {i}/{n}")
                i += 1
            # Pause briefly every few images to avoid hammering the server
            if i % 5 == 0:
                time.sleep(1)

    def save_picture(self, r, name, save_address):
        """
        Save a picture.

        Parameters:
        r -> requests.Response: response returned by the image URL request
        name -> str: file name to save the picture under
        save_address -> str: folder to save the picture in
        """
        os.makedirs(save_address, exist_ok=True)  # create the folder if needed
        with open(os.path.join(save_address, name + '.jpg'), "wb") as f:
            f.write(r.content)


if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/s?__biz=MzI4MzM2MDgyMQ==&mid=2247493780&idx=1&sn=efde5d4ec9a5c04383b709ace1222518&chksm=eb8943cedcfecad84ffb4794d19cce4c7ae3344ffa0c361f646d10689717af33dab5da93332f#rd"
    spider = PictureSpider(url)
    text = spider.get_url_text()
    pictures_address = spider.search_pictures(text)
    spider.download_pictures(pictures_address, "Numpy_PPT")

(3) Batch import pictures into PPT

1. Code

This part uses the python-pptx library. First set the path of the folder where the pictures are stored and enter a name for the new PPT; the script then writes the JPG files in that folder into the PPT file in numbered order and saves the result (a note on the hard-coded slide size follows the code):

import os
import pptx
from pptx.util import Inches

path = os.getcwd()
os.chdir(os.path.join(path, 'Numpy_PPT'))
ppt_filename = input('Enter the target ppt file name (no suffix needed): ')
full_ppt_filename = '{}.{}'.format(ppt_filename, 'pptx')
ppt_file = pptx.Presentation()

pic_files = [fn for fn in os.listdir() if fn.endswith('.jpg')]

# Import the pictures in numbered order
for fn in sorted(pic_files, key=lambda item: int(item[:item.rindex('.')])):
    slide = ppt_file.slides.add_slide(ppt_file.slide_layouts[1])

    # Set the text of the first placeholder on the slide; optional here
    slide.shapes.placeholders[0].text = fn[:fn.rindex('.')]

    # Add the picture to the slide; the position and size can be adjusted
    slide.shapes.add_picture(fn, Inches(0), Inches(0), Inches(10), Inches(7.5))

ppt_file.save(full_ppt_filename)
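A note on the design: Inches(10) by Inches(7.5) matches python-pptx's default 4:3 slide. If you switch to a different slide size, reading the dimensions from the presentation itself avoids the hard-coded numbers; a minimal variant of the add_picture call:

# Variant: size each picture to fill the slide, whatever its dimensions
slide.shapes.add_picture(fn, 0, 0,
                         width=ppt_file.slide_width,
                         height=ppt_file.slide_height)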

2. Results

After running the script, a PPT file is generated in the folder:

You can see that all 254 pictures have been imported in order; the result is exactly what I wanted.

3, Conclusion

With the experience of crawling the "post office" official account earlier, writing this code went very smoothly, and it took just over an hour to finish all the modules. The result is very satisfying. What more can I say? Python, yyds ("forever the GOAT")!

In addition, I would like to thank Mr. Dong Fuguo again. I have read several books written by Mr. Dong, and last year I completed all the exercises in his "cabin" problem-solving system. It can be said that Mr. Dong and Mr. Song Tian of Beihang are my guides on the Python road.

Topics: Python crawler ppt