Notes on analyzing Baidu Wenku (Baidu Library) documents

Posted by anupam_28 on Fri, 21 Jan 2022 02:45:01 +0100

Motivation

A while ago I wanted to download a document and tried the usual Baidu Wenku downloaders, but none of them worked.

That includes standalone programs, browser extensions, and Tampermonkey userscripts: all of them are dead.

For the moment I could only get at the content piecemeal by copying it (select the text and click "translate").

When I had some free time later, I decided to see whether I could analyze the page and download documents myself.

Process

First I searched the web. There are open-source projects and write-ups on this, but the ones I tried no longer worked, so I did not dig into them further.

So I dug in directly and found that if you visit the article link, the page source already contains everything needed, including the text and the image links.
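The extraction step can be sketched with the same regex the downloader below uses. The `sample` string here is a made-up stand-in for the real page source, just to show the double-escaping involved:

```python
import re
import json


def extract_urls(html: str) -> dict:
    """Pull the htmlUrls JSON blob out of the page source.

    The page embeds document metadata as JavaScript string literals;
    the blob is escaped, so unescape it before json.loads.
    """
    raw = re.search(r"htmlUrls: '([\s\S]*?)',", html).group(1)
    raw = raw.replace('\\\\', '')                    # drop doubled backslashes
    raw = raw.encode('latin1').decode('unicode_escape')  # \" -> "
    return json.loads(raw)


# Minimal stand-in for the real page source (not real data):
sample = r"""var pageData = { htmlUrls: '{\"png\": [], \"json\": []}', };"""
print(extract_urls(sample))  # -> {'png': [], 'json': []}
```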

I could not work out how to parse the image links. It looks like the server merges several images into one and returns that, and the front end slices it apart again.
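I never got this working, but if the combined image really is pages stacked vertically, the slicing would just be crop-box arithmetic like the sketch below. The per-page height is a hypothetical parameter; the real layout the server returns is unknown. A library like Pillow could then apply each box with `img.crop(box)`:

```python
def crop_boxes(total_height: int, page_height: int, width: int):
    """Compute (left, top, right, bottom) crop boxes for a sprite
    image that stacks pages vertically. Purely hypothetical: the
    actual layout returned by the server is not known.
    """
    boxes = []
    top = 0
    while top < total_height:
        bottom = min(top + page_height, total_height)
        boxes.append((0, top, width, bottom))
        top = bottom
    return boxes


# Two full pages plus a 500px remainder:
print(crop_boxes(2500, 1000, 800))
# -> [(0, 0, 800, 1000), (0, 1000, 800, 2000), (0, 2000, 800, 2500)]
```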

The text content comes back as JSON. Searching around, I found references to the legendary "bdjson", a JSON format customized by Baidu; it is the format of xreader, a reading application on the PSP. A quick look turned up no way to convert xreader to doc, so I gave up on direct conversion and decided to pull the text content directly, download all the images, and fix up the formatting by hand.
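The relevant part of the bdjson structure is a `body` list of positioned items, which the downloader below walks. A minimal version of that walk, on a toy `body` list with made-up coordinates:

```python
def body_to_text(body):
    """Concatenate 'word' items from a bdjson page body.

    A lone space whose y coordinate is above the next item's marks
    the end of a line, so it becomes a newline (the same heuristic
    the full downloader uses).
    """
    out = []
    for i, item in enumerate(body):
        if item['t'] != 'word':
            continue  # 'pic' items are downloaded separately
        text = item['c']
        if (text == ' ' and i < len(body) - 1
                and item['p']['y'] < body[i + 1]['p']['y']):
            text = '\n'
        out.append(text)
    return ''.join(out)


# Toy body list in the shape the interface returns (made-up values):
body = [
    {'t': 'word', 'c': 'Hello', 'p': {'y': 100}},
    {'t': 'word', 'c': ' ', 'p': {'y': 100}},
    {'t': 'word', 'c': 'world', 'p': {'y': 140}},
]
print(body_to_text(body))  # -> 'Hello\nworld'
```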

Often you only want to copy a small amount of content, but find you cannot preview the full text. Testing showed that if you first clear the site's cookies and then jump to the document from a Baidu search results page, the full-text preview is no longer restricted.
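In requests terms, that browser trick would look roughly like this. Clearing the session's cookie jar and sending a Baidu-search Referer is my interpretation of the manual steps; the search URL is a hypothetical example:

```python
import requests


def fresh_session() -> requests.Session:
    """Build a session that mimics arriving from Baidu search with
    no prior wenku cookies (my reading of the browser trick above)."""
    s = requests.Session()
    s.cookies.clear()  # drop any stale wenku cookies
    s.headers.update({
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.baidu.com/s?wd=wenku',  # hypothetical search URL
    })
    return s


s = fresh_session()
print(len(s.cookies), s.headers['Referer'])
```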

Later I found another interface that lets you preview and copy the full text directly, although you cannot print from it (only 6 pages render; the rest are blank):

https://wenku.baidu.com/share/<document ID>?share_api=1
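Extracting the document ID from a view URL and building that share URL is just string work, the same as the `__init__` in the code below:

```python
import re


def share_url(view_url: str) -> str:
    """Turn a .../view/<doc_id>.html URL into the share interface URL."""
    doc_id = re.search(r'view/([0-9a-f]+)\.html', view_url).group(1)
    return f'https://wenku.baidu.com/share/{doc_id}?share_api=1'


print(share_url('https://wenku.baidu.com/view/c145899e6294dd88d1d26b0e.html'))
# -> https://wenku.baidu.com/share/c145899e6294dd88d1d26b0e?share_api=1
```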

Code to download the text and images

import requests
import re, time, os, json


class Doc(object):
    """
    Download Baidu Library as TXT
    """
    def __init__(self, url):
        """
        Takes the document's view URL.
        """
        # The document ID is the hex string in .../view/<doc_id>.html
        self.doc_id = re.findall(r'view/(.*)\.html', url)[0]
        self.s = requests.session()
        self.get_info()
        self.get_token()
        if not os.path.exists('download'):
            os.mkdir('download')

    @staticmethod
    def get_timestamp(length: int = 13):
        """
        Return the current timestamp as a string, 13 digits
        (millisecond precision) by default.
        """
        return str(time.time_ns())[:length]

    def get_info(self):
        """
        Fetch the document's basic information: title, page count,
        type, and the per-page content URLs.
        """
        url = f'https://wenku.baidu.com/share/{self.doc_id}?share_api=1'
        try:
            html = self.s.get(url).content.decode('GBK')
            self.title = re.search(r"title'   : '([\s\S]*?)',", html).group(1)
            pages = re.search(r"totalPageNum' : '([\s\S]*?)',", html).group(1)
            self.pages = int(pages)
            self.type = re.search(r"docType' : '(.*?)',", html).group(1)
            # htmlUrls is an escaped JSON blob embedded in the page;
            # strip the extra backslashes and unescape before parsing.
            htmlURLs = re.search(r"htmlUrls: '([\s\S]*?)',", html).group(1).replace('\\\\', '')
            htmlURLs = htmlURLs.encode('latin1').decode('unicode_escape')
            self.urls = json.loads(htmlURLs)
        except Exception:
            raise SystemExit('Failed to get the basic information of the document; exiting.')

    def get_token(self):
        """
        Fetch an API token; it is only valid for a short time
        (on the order of hours).
        """
        url = 'https://wenku.baidu.com/api/interface/gettoken?host=wenku.baidu.com&secretKey=6f614cb00c6b6821e3cdc85ab1f8f907'
        try:
            res = self.s.get(url).json()
            self.token = res['data']['token']
        except Exception:
            raise SystemExit('Failed to obtain a token; exiting.')

    def download_pic(self):
        """
        Download every image of the document, one file per page index.
        """
        pics = self.urls['png']
        for pic in pics:
            content = self.s.get(pic['pageLoadUrl']).content
            with open(f'download/{self.title}_{pic["pageIndex"]}.png', 'wb') as f:
                f.write(content)

    def download(self):
        """
        Download the text page by page via the getcontent interface.
        """
        self.download_pic()
        print(self.token, self.title, self.pages, self.type)
        result = ''
        url = 'https://wenku.baidu.com/api/interface/getcontent'
        for page in range(1, self.pages + 1):
            params = {
                "doc_id": self.doc_id,
                "pn": page,
                "t": "json",
                "v": "6",
                "host": "wenku.baidu.com",
                "token": self.token,
                "type": "xreader"
            }
            res = self.s.get(url, params=params).json()
            lst = res['data']['1']['body']
            for index, item in enumerate(lst):
                if item['t'] == 'word':
                    text = item['c']
                    # A lone space whose y coordinate is above the next
                    # item's marks the end of a line; emit a newline.
                    if (text == ' ' and index < len(lst) - 1
                            and item['p']['y'] < lst[index + 1]['p']['y']):
                        text = '\n'
                    result += text
                # 'pic' items are skipped; images come from download_pic()
        with open(f'download/{self.title}.txt', 'w', encoding='utf-8') as f:
            f.write(result)
        print('complete')

    def download2(self):
        """
        Download the text via the per-page JSON URLs from htmlUrls.
        """
        self.download_pic()
        print(self.token, self.title, self.pages, self.type)
        result = ''
        pages = self.urls['json']
        for page in pages:
            res = self.s.get(page['pageLoadUrl']).text
            # The response is JSONP (callback name plus parentheses);
            # strip the wrapper before parsing.
            res = json.loads(res[8:-1])
            lst = res['body']
            for index, item in enumerate(lst):
                if item['t'] == 'word':
                    text = item['c']
                    # Same line-break heuristic as download()
                    if (text == ' ' and index < len(lst) - 1
                            and item['p']['y'] < lst[index + 1]['p']['y']):
                        text = '\n'
                    result += text
                # 'pic' items are skipped; images come from download_pic()
        with open(f'download/{self.title}.txt', 'w', encoding='utf-8') as f:
            f.write(result)
        print('complete')


if __name__ == "__main__":

    # url = 'https://wenku.baidu.com/view/81bbd69a541810a6f524ccbff121dd36a22dc477.html'
    url = 'https://wenku.baidu.com/view/c145899e6294dd88d1d26b0e.html'
    doc = Doc(url)
    doc.download()
    # doc.download2()

Topics: crawler