Use Python to grab photos of girls from Baidu Tieba's drifting bottles

Posted by monkey72 on Thu, 11 Jul 2019 19:53:11 +0200

I accidentally discovered that Tieba has also launched a drifting bottle feature. Flipping through it, I found that it actually contains a lot of pictures of girls. Having some spare time, I decided to write a crawler to grab all of the pictures.

Here is the address of the Tieba drifting bottle page:
http://tieba.baidu.com/bottle...

1. Analysis

First, open Fiddler, then open the first page of the drifting bottles and load a few more pages as a test. After filtering out the image requests and the noise with non-200 HTTP status codes in Fiddler, we find that each page is fetched by a request that follows a regular pattern, which makes it easy to crawl. The URL for fetching one page is as follows:

http://tieba.baidu.com/bottle...

It is easy to see that page_number is the current page number and page_size is the number of bottles contained in the current page.
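As a small illustration of these two parameters, the page URL can be assembled from them. This is only a sketch: the base URL is the one that appears in the code later in this post, and the helper name build_page_url is made up for the example.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Base URL as it appears in the code later in this post.
BASE_URL = "http://tieba.baidu.com/bottle/bottles"


def build_page_url(page_number, page_size=30):
    """Hypothetical helper: build the URL for one page of bottles."""
    # A list of pairs keeps the parameter order stable.
    query = urlencode([("page_number", page_number), ("page_size", page_size)])
    return "%s?%s" % (BASE_URL, query)


# prints http://tieba.baidu.com/bottle/bottles?page_number=1&page_size=30
print(build_page_url(1))
```

Changing page_number walks through the pages, while page_size controls how many bottles each response carries.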

The data returned is in JSON format and is roughly structured as follows:

{
    "error_code": 0,
    "error_msg": "success",
    "data": {
        "has_more": 1,
        "bottles": [
            {
                "thread_id": "5057974188",
                "title": "Extremely beautiful",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            },
            {
                "thread_id": "5057974188",
                "title": "Extremely beautiful",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            },
            ...
        ]
    }
}

The content is straightforward enough: the data in bottles is what we want (thread_id is the id of the specific bottle, title is the text the girl posted, and img_url is the real address of the photo). Traversing bottles gives all the drifting bottles on the current page. (Actually, what this returns is only the cover image; opening a specific bottle reveals more. I'm too lazy to write that part up in detail, but I did analyze the data inside a bottle as well. The URL is: http://tieba.baidu.com/bottle...<bottle thread_id>)
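The traversal described above can be sketched on a trimmed-down copy of the sample response. The helper name extract_bottles is made up for this example, and the standard json module is used here only because this sample happens to be valid JSON.

```python
import json

# A trimmed-down sample in the shape shown above.
sample = '''
{
    "error_code": 0,
    "error_msg": "success",
    "data": {
        "has_more": 1,
        "bottles": [
            {
                "thread_id": "5057974188",
                "title": "Extremely beautiful",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            }
        ]
    }
}
'''


def extract_bottles(text):
    """Hypothetical helper: return (has_more, [(thread_id, title, img_url), ...])."""
    doc = json.loads(text)
    if doc["error_code"] != 0:
        raise ValueError("server reported an error: %s" % doc["error_msg"])
    data = doc["data"]
    rows = [(b["thread_id"], b["title"], b["img_url"]) for b in data["bottles"]]
    return data["has_more"], rows


has_more, rows = extract_bottles(sample)
for thread_id, title, img_url in rows:
    print("%s %s %s" % (thread_id, title, img_url))
```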

The other parameter, has_more, presumably indicates whether a next page exists.
With that, the collection strategy is settled: starting from the first page, keep requesting the next page until has_more is no longer 1.
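The loop can be tried out without touching the network. In this sketch, fetch_page is a hypothetical callable that stands in for the real request, and the three fake pages are invented purely to exercise the stop condition.

```python
def iter_bottles(fetch_page):
    """Yield bottles page by page until has_more is no longer 1.

    fetch_page is a hypothetical callable: page_number -> decoded "data" dict.
    """
    page_number = 1
    while True:
        data = fetch_page(page_number)
        for bottle in data["bottles"]:
            yield bottle
        if data["has_more"] != 1:
            break
        page_number += 1


# Invented stand-in for the server: three pages, the last one has has_more = 0.
FAKE_PAGES = {
    1: {"has_more": 1, "bottles": ["a", "b"]},
    2: {"has_more": 1, "bottles": ["c"]},
    3: {"has_more": 0, "bottles": ["d"]},
}


def fake_fetch(page_number):
    return FAKE_PAGES[page_number]


# prints ['a', 'b', 'c', 'd']
print(list(iter_bottles(fake_fetch)))
```

Note that the bottles on the last page are still yielded before the loop stops, because has_more is checked only after the page has been processed.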

2. Coding

This is done using Python 2.7 + urllib2 + demjson. urllib2 is part of the Python 2.7 standard library; demjson has to be installed separately. (Normally the json library bundled with Python is enough for parsing JSON, but many websites now serve JSON that is not standard, and the built-in json library cannot cope with it.)
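To make the point about non-standard JSON concrete, here is a small sketch showing the built-in json module rejecting such input. Both strings are made-up examples, not responses captured from Tieba.

```python
import json

# Made-up examples: the second one uses an unquoted key and single quotes.
standard = '{"error_code": 0}'
non_standard = "{error_code: 0, 'error_msg': 'success'}"


def can_parse(text):
    """Return True if the standard json module accepts the text."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


print(can_parse(standard))      # prints True
print(can_parse(non_standard))  # prints False
```

Input of the second kind is the reason a more tolerant parser such as demjson is used in this post.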

Installing demjson (Windows does not require sudo):

sudo pip install demjson

or

sudo easy_install demjson

2.1 Get a page

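import time
import urllib2
import demjson

def bottlegen():
    """Yield (thread_id, title, img_url) for every bottle, page by page."""
    page_number = 1
    while True:
        try:
            # Request one page of bottles (30 per page).
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            json = demjson.decode(data)
            if json["error_code"] == 0:
                data = json["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                # Stop once the server reports that there are no more pages.
                if has_more != 1:
                    break
                page_number += 1
        except Exception:
            # On failure, report it, wait a moment and retry the same page.
            print("bottlegen exception")
            time.sleep(5)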

This uses a Python generator to yield the parsed results one after another.
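One practical consequence of using a generator is that results are produced lazily, so a trial run can stop after a handful of items. A minimal sketch, with a made-up stand-in generator in place of bottlegen() so that it runs offline:

```python
from itertools import islice


def numbers():
    """Stand-in generator; each step could be an expensive page request."""
    n = 0
    while True:
        n += 1
        yield n


# Nothing runs until items are requested; islice stops after three of them.
first_three = list(islice(numbers(), 3))
print(first_three)  # prints [1, 2, 3]
```

The same idea should apply to the crawler: wrapping bottlegen() in islice would limit a test run to the first few bottles.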

2.2 Save picture data based on url

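# Assumes the tieba/bottles directory already exists (the full code below creates it).
for thread_id, title, img_url in bottlegen():
    filename = os.path.basename(img_url)
    pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
    print filename
    # The with block closes the file automatically.
    with open(pathname, "wb") as f:
        f.write(urllib2.urlopen(img_url).read())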
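The file naming used above can be checked on its own, without downloading anything. In this sketch the helper name local_path is made up, and the URL is the one from the sample response earlier in the post.

```python
import os


def local_path(thread_id, img_url, out_dir="tieba/bottles"):
    """Hypothetical helper: build the local file path for one picture."""
    # basename keeps only the last path component of the URL.
    filename = os.path.basename(img_url)
    return "%s/%s_%s" % (out_dir, thread_id, filename)


url = "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
# prints tieba/bottles/5057974188_a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg
print(local_path("5057974188", url))
```

Prefixing the thread_id keeps pictures from different bottles apart. If an image URL ever carried a query string, it would end up in the file name, so this naming only suits plain URLs like the one shown.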

2.3 The full code is as follows

# -*- encoding: utf-8 -*-
import urllib2
import demjson
import time
import re
import os

def bottlegen():
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            json = demjson.decode(data)
            if json["error_code"] == 0:
                data = json["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                if has_more != 1:
                    break
                page_number += 1
        except Exception:
            # On failure, report it, wait a moment and retry the same page.
            print("bottlegen exception")
            time.sleep(5)

def imggen(thread_id):
    try:
        data = urllib2.urlopen(
            "http://tieba.baidu.com/bottle/photopbPage?thread_id=%s" % thread_id).read()
        match = re.search(r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);", data)
        data = match.group(1)
        json = demjson.decode(data)
        json = demjson.decode(json[1].replace("\r\n", ""))
        for i in json:
            thread_id = i["thread_id"]
            text = i["text"]
            img_url = i["img_url"]
            yield (thread_id, text, img_url)
    except Exception:
        # On failure (request error or unexpected page layout), skip this bottle.
        print("imggen exception")

# Create the output directory; ignore the error if it already exists.
try:
    os.makedirs("tieba/bottles")
except OSError:
    pass

# For every bottle, download every picture inside it.
for thread_id, _, _ in bottlegen():
    for _, title, img_url in imggen(thread_id):
        filename = os.path.basename(img_url)
        pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
        print filename
        # The with block closes the file automatically.
        with open(pathname, "wb") as f:
            f.write(urllib2.urlopen(img_url).read())

When run, the script first fetches all the bottles on each page, then fetches all the pictures inside each bottle, and writes them to tieba/bottles/xxxx.jpg. (Forgive me ^^ for skipping proper error handling out of laziness.)
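The least obvious step in the full code is imggen(), which pulls a JSON argument out of a script call embedded in the bottle page and then decodes it twice. The sketch below replays that idea offline. The page fragment is fabricated to match what the full code expects (the real page may look different), the helper name extract_images is made up, and the standard json module stands in for demjson.

```python
import json
import re

# Fabricated page fragment: the second argument is a JSON string that itself
# contains JSON, mirroring the double decode in imggen().
inner = json.dumps(
    [{"thread_id": "1", "text": "hi", "img_url": "http://example.com/a.jpg"}])
args = json.dumps(["bottle", inner])
page = ("<script>_.Module.use('encourage/widget/bottle',%s,function(){});"
        "</script>" % args)

PATTERN = re.compile(
    r"_\.Module\.use\('encourage/widget/bottle',(.*?),function\(\)\{\}\);")


def extract_images(html):
    """Hypothetical helper: return [(thread_id, text, img_url), ...]."""
    match = PATTERN.search(html)
    if match is None:
        return []  # page layout differs from what is assumed here
    outer = json.loads(match.group(1))
    # The inner JSON is stored as a string; strip line breaks, then decode it.
    items = json.loads(outer[1].replace("\r\n", ""))
    return [(i["thread_id"], i["text"], i["img_url"]) for i in items]


# prints [('1', 'hi', 'http://example.com/a.jpg')]
print(extract_images(page))
```

Returning an empty list when the pattern is not found avoids the AttributeError that calling match.group(1) on None would raise.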

Conclusion

The conclusion: it's all rather deceptive, although the first few pages do have some nice ones.

Finally, here are the results of the collection.
