How to download a novel with a Python web crawler (with source code)

Posted by metalblend on Thu, 30 Dec 2021 08:01:25 +0100

Hello, this is Python Advancer.

Preface

A few days ago, [Pan Xi bird] shared a script in the group for scraping novels. It looked quite good, so I'm sharing it here for everyone to learn from.

1, Novel download

To download a novel from the site, open the novel's page and look at its link, as shown in the figure below.

All you need is the number in the URL; in this example it is 951. That number is the book ID, and it is what the code below asks for.
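If you would rather pull that number out of the link in code, here is a minimal sketch; the extract_book_id helper and the example URL are only illustrations, not part of the original script:

# coding: utf-8
import re


def extract_book_id(url):
    '''Pull the book number out of a Biqu catalogue URL such as http://www.biquw.com/book/951/.'''
    match = re.search(r'/book/(\d+)', url)
    return match.group(1) if match else None


print(extract_book_id('http://www.biquw.com/book/951/'))  # prints: 951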

2, Concrete implementation

Here is the code, pasted directly below:

# coding: utf-8
'''
Biqu novel download
 Research code only
 Do not use for commercial purposes
 Please delete within 24 hours
'''
import requests
import os
from bs4 import BeautifulSoup
import time


def book_page_list(book_id):
    '''
    Fetch the table of contents (all chapter links) for the given book number.
    :param book_id: book number, e.g. 951
    :return: list of <a> tags containing chapter titles and chapter addresses
    '''
    url = 'http://www.biquw.com/book/{}/'.format(book_id)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    response = BeautifulSoup(response.text, 'lxml')
    booklist = response.find('div', class_='book_list').find_all('a')
    return booklist


def book_page_text(bookid, booklist):
    '''
    Fetch each chapter's content via the book number and chapter list, and save it to disk.
    :param bookid: str, book number
    :param booklist: list of chapter <a> tags returned by book_page_list()
    :return: None
    '''
    try:
        for book_page in booklist:
            page_name = book_page.text.replace('*', '')
            page_id = book_page['href']
            time.sleep(3)
            url = 'http://www.biquw.com/book/{}/{}'.format(bookid, page_id)
            headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
            response_book = requests.get(url, headers=headers)
            response_book.encoding = response_book.apparent_encoding
            response_book = BeautifulSoup(response_book.text, 'lxml')
            book_content = response_book.find('div', id="htmlContent")
            with open("./{}/{}.txt".format(bookid, page_name), 'a', encoding='utf-8') as f:
                f.write(book_content.text.replace('\xa0', ''))
                print("Downloaded chapter: {}".format(page_name))
    except Exception as e:
        print(e)
        print("Chapter content acquisition failed. Please ensure that the book number is correct and the book has normal content.")


if __name__ == '__main__':
    bookid = input("Please enter the book number (digits only): ")
    # If the directory corresponding to the book number does not exist, a new directory is created to store the chapter contents
    if not os.path.isdir('./{}'.format(bookid)):
        os.mkdir('./{}'.format(bookid))
    try:
        booklist = book_page_list(bookid)
        print("Get directory successfully!")
        time.sleep(5)
        book_page_text(bookid, booklist)
    except Exception as e:
        print(e)
        print("Failed to get the directory. Please make sure the book number is entered correctly!")

After the program runs, enter the book number on the console to start crawling.
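A typical session looks something like this (book 951 is just an example; the chapter names depend on the book you choose, so they are elided here):

Please enter the book number (digits only): 951
Table of contents fetched successfully!
Downloaded chapter: ...
Downloaded chapter: ...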

A folder named by book number will be automatically created locally, and the chapters of the novel will be stored in this folder, as shown in the figure below.

3, Frequently asked questions

Readers will often run into the following problem while running the script, as shown in the figure below.

This happens when you request pages too quickly and the site's anti-crawling protection blocks you. It can be solved by rotating a random User-Agent or by routing requests through a proxy.
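Here is a minimal sketch of both ideas; the User-Agent strings are ordinary browser strings, and the proxy address is a placeholder you would replace with a proxy you actually control:

# coding: utf-8
import random

import requests

# A small pool of browser User-Agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
]

# Optional proxy; keep None to connect directly, or fill in a proxy you have access to.
PROXIES = None
# PROXIES = {'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'}


def polite_get(url):
    '''Fetch a page with a randomly chosen User-Agent (and a proxy, if one is configured).'''
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=10)


response = polite_get('http://www.biquw.com/book/951/')
print(response.status_code)

Swapping a helper like polite_get() in for the plain requests.get() calls above is usually enough to get past this kind of blocking, as long as the request interval (the time.sleep calls) stays generous.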

4, Summary

I'm Python Advancer. This article introduced a way to fetch novel content with a web crawler, implemented with the requests library and bs4 selectors, and walked through the most common problem you are likely to hit.

This article is for learning and sharing code only. Do not abuse web crawlers: crawl at night when possible, add plenty of sleep() calls, stop when you can, and never put excessive pressure on the other side's servers. Remember! Remember! Remember!