I stumbled on the fact that Baidu Tieba has also launched a drifting bottle feature. Browsing through it, I found that it actually contains a lot of photos of girls. Having some free time, I decided to write a crawler to grab all the pictures.
Here is the address of the Tieba drifting bottle:
http://tieba.baidu.com/bottle...
1. Analysis
First, open Fiddler, then open the first page of the drifting bottle and load a few more pages. After filtering out the image requests and the noise from responses with a non-200 HTTP status code in Fiddler, we find that the data for every page is fetched in a regular pattern, which makes it easy to capture. The URL for fetching one page is as follows:
http://tieba.baidu.com/bottle...
It is easy to see that page_number is the current page number and page_size is the number of bottles contained in the current page.
The response is in JSON format and is roughly structured as follows:
    {
        "error_code": 0,
        "error_msg": "success",
        "data": {
            "has_more": 1,
            "bottles": [
                {
                    "thread_id": "5057974188",
                    "title": "Extremely beautiful",
                    "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
                },
                {
                    "thread_id": "5057974188",
                    "title": "Extremely beautiful",
                    "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
                },
                ...
            ]
        }
    }
The content is straightforward: the data in bottles is what we want (thread_id is the bottle's ID, title is the text the poster wrote, img_url is the real address of the photo). Traversing bottles yields all the drifting bottles on the current page. (Actually, this only gets the cover image of each bottle; there are more pictures waiting when you open a specific bottle. I was too lazy to write that part up, but I analyzed the inner data as well. The URL is: http://tieba.baidu.com/bottle... <bottle thread_id>)
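As a small illustration of the traversal just described, the sketch below parses a trimmed copy of the sample response and walks the bottles list. It uses Python's standard json module, which is enough here because the sample is well-formed. The helper name iter_bottles is hypothetical, and the code is written to run on Python 3 as well as 2.7.

```python
import json

# Trimmed copy of the sample response shown above (one bottle only).
SAMPLE = """
{
  "error_code": 0,
  "error_msg": "success",
  "data": {
    "has_more": 1,
    "bottles": [
      {
        "thread_id": "5057974188",
        "title": "Extremely beautiful",
        "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
      }
    ]
  }
}
"""


def iter_bottles(response_text):
    """Yield (thread_id, title, img_url) for every bottle in one page."""
    result = json.loads(response_text)
    if result["error_code"] != 0:
        return  # the server reported an error; nothing to yield
    for bottle in result["data"]["bottles"]:
        yield bottle["thread_id"], bottle["title"], bottle["img_url"]


for thread_id, title, img_url in iter_bottles(SAMPLE):
    print(thread_id, title, img_url)
```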
Another parameter, has_more, presumably indicates whether a next page exists.
That settles the collection approach: start from the first page and keep requesting the next page until has_more is no longer 1.
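The loop can be sketched without touching the network. In the block below, fetch_page is a hypothetical callable that returns the "data" part of the response for a page number, and FAKE_PAGES is made-up data standing in for the real interface.

```python
def crawl_all_pages(fetch_page):
    """Collect bottles from page 1 onward until has_more is no longer 1.

    fetch_page is a hypothetical callable: page_number -> the "data" dict.
    """
    bottles = []
    page_number = 1
    while True:
        data = fetch_page(page_number)
        bottles.extend(data["bottles"])
        if data["has_more"] != 1:
            break
        page_number += 1
    return bottles


# Stand-in for the real interface: three fake pages, the last with has_more = 0.
FAKE_PAGES = {
    1: {"has_more": 1, "bottles": ["a", "b"]},
    2: {"has_more": 1, "bottles": ["c"]},
    3: {"has_more": 0, "bottles": ["d"]},
}

print(crawl_all_pages(FAKE_PAGES.__getitem__))
```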
2. Coding
This is done with Python 2.7 + urllib2 + demjson. urllib2 is a library that ships with Python 2.7; demjson has to be installed separately. (In general, the json library that comes with Python is enough for parsing JSON, but many websites now serve JSON that is not standard, and the built-in json library cannot handle it.)
Install demjson (sudo is not needed on Windows):
sudo pip install demjson
or
sudo easy_install demjson
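To illustrate why the built-in json library falls short on non-standard JSON, here is a minimal sketch. The payload is made up (unquoted keys are one common deviation), and the commented demjson call reflects my understanding that its default non-strict mode tolerates such input; treat that as an assumption to check against the version you install.

```python
import json

# Hypothetical non-standard payload: the keys are not quoted.
payload = '{error_code: 0, error_msg: "success"}'

try:
    json.loads(payload)
    strict_ok = True
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    strict_ok = False
    print("json rejected the payload: %s" % exc)

# A tolerant parser is the way around this (assumed behavior, see above):
#     import demjson
#     result = demjson.decode(payload)
```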
2.1 Get a page
    def bottlegen():
        page_number = 1
        while True:
            try:
                data = urllib2.urlopen(
                    "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30"
                    % page_number).read()
                result = demjson.decode(data)
                if result["error_code"] == 0:
                    data = result["data"]
                    has_more = data["has_more"]
                    bottles = data["bottles"]
                    for bottle in bottles:
                        thread_id = bottle["thread_id"]
                        title = bottle["title"]
                        img_url = bottle["img_url"]
                        yield (thread_id, title, img_url)
                    # Stop once the server says there is no next page.
                    if has_more != 1:
                        break
                    page_number += 1
                else:
                    # The server reported an error; wait before retrying this page.
                    time.sleep(5)
            except Exception:
                # Network or parse error: wait, then retry the same page.
                print("bottlegen exception")
                time.sleep(5)
This uses a Python generator to yield the parsed results one at a time.
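One practical benefit of a generator is that the caller decides how much to consume. The sketch below uses a stand-in generator (numbered_bottles is hypothetical and yields fake data forever) together with itertools.islice to take only the first three items, which is handy for a quick trial run before crawling everything.

```python
import itertools


def numbered_bottles():
    """Stand-in generator: yields fake (thread_id, title, img_url) tuples forever."""
    n = 0
    while True:
        n += 1
        yield ("id%d" % n, "title %d" % n, "http://example.invalid/%d.jpg" % n)


# Only the first three items are ever produced; the generator is paused after that.
first_three = list(itertools.islice(numbered_bottles(), 3))
for thread_id, title, img_url in first_three:
    print(thread_id, img_url)
```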
2.2 Save picture data based on url
    for thread_id, title, img_url in bottlegen():
        filename = os.path.basename(img_url)
        pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
        print(filename)
        # The with block closes the file automatically.
        with open(pathname, "wb") as f:
            f.write(urllib2.urlopen(img_url).read())
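The file name logic can be isolated into a small helper and checked offline. This is only a sketch: local_path is a hypothetical name, and dropping a query string via urlsplit is my own precaution, not something the original code does.

```python
import posixpath

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7


def local_path(thread_id, img_url, root="tieba/bottles"):
    """Return the path an image would be saved to, ignoring any query string."""
    filename = posixpath.basename(urlsplit(img_url).path)
    return posixpath.join(root, "%s_%s" % (thread_id, filename))


print(local_path(
    "5057974188",
    "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"))
```

Note that the target directory has to exist before open() is called on the returned path.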
2.3 The full code is as follows
    # -*- encoding: utf-8 -*-
    import urllib2
    import demjson
    import time
    import re
    import os


    def bottlegen():
        """Yield (thread_id, title, img_url) for every bottle on every page."""
        page_number = 1
        while True:
            try:
                data = urllib2.urlopen(
                    "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30"
                    % page_number).read()
                result = demjson.decode(data)
                if result["error_code"] == 0:
                    data = result["data"]
                    has_more = data["has_more"]
                    bottles = data["bottles"]
                    for bottle in bottles:
                        thread_id = bottle["thread_id"]
                        title = bottle["title"]
                        img_url = bottle["img_url"]
                        yield (thread_id, title, img_url)
                    # Stop once the server says there is no next page.
                    if has_more != 1:
                        break
                    page_number += 1
                else:
                    # The server reported an error; wait before retrying this page.
                    time.sleep(5)
            except Exception:
                # Network or parse error: wait, then retry the same page.
                print("bottlegen exception")
                time.sleep(5)


    def imggen(thread_id):
        """Yield (thread_id, text, img_url) for every picture inside one bottle."""
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/photopbPage?thread_id=%s"
                % thread_id).read()
            # The picture list is embedded in the page as an argument of a script call.
            match = re.search(
                r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);",
                data)
            data = match.group(1)
            args = demjson.decode(data)
            # The second argument is itself a JSON string, so decode it again.
            photos = demjson.decode(args[1].replace("\r\n", ""))
            for i in photos:
                thread_id = i["thread_id"]
                text = i["text"]
                img_url = i["img_url"]
                yield (thread_id, text, img_url)
        except Exception:
            # Skip this bottle if the page cannot be fetched or parsed.
            print("imggen exception")


    try:
        os.makedirs("tieba/bottles")
    except OSError:
        # The directory already exists.
        pass

    for thread_id, _, _ in bottlegen():
        for _, title, img_url in imggen(thread_id):
            filename = os.path.basename(img_url)
            pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
            print(filename)
            with open(pathname, "wb") as f:
                f.write(urllib2.urlopen(img_url).read())
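The two-step decode in imggen() is the least obvious part, so here is a sketch of it on a synthetic page fragment. The fragment is fabricated for illustration: its shape (a list whose second element is a JSON string) is inferred from the code above, not taken from a real response, and the standard json module stands in for demjson because the synthetic data is well-formed.

```python
import json
import re

# Synthetic page fragment shaped the way the regex in imggen() expects.
# The argument layout is inferred from the crawler code, not from a real page.
inner = json.dumps(
    [{"thread_id": "1", "text": "hello", "img_url": "http://example.invalid/1.jpg"}])
page = ("<script>_.Module.use('encourage/widget/bottle',%s,function(){});</script>"
        % json.dumps(["x", inner]))

PATTERN = re.compile(
    r"_\.Module\.use\('encourage/widget/bottle',(.*?),function\(\)\{\}\);")

match = PATTERN.search(page)
outer = json.loads(match.group(1))  # first decode: the widget arguments
photos = json.loads(outer[1])       # second decode: the embedded JSON string

for photo in photos:
    print(photo["thread_id"], photo["text"], photo["img_url"])
```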
When run, the script fetches all the bottles on each page, then all the pictures inside each bottle, and saves them to tieba/bottles/xxxx.jpg. (Forgive me ^^ for being lazy and skipping proper error handling.)
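If you want a bit more robustness than the script has, one option is a bounded retry wrapper around each network call. This is a sketch of a general technique, not part of the original crawler; with_retries and flaky are hypothetical names, and flaky only simulates a failing request.

```python
import time


def with_retries(func, attempts=3, delay=5, sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping `delay` seconds between failures."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            print("attempt %d failed: %s" % (attempt, exc))
            if attempt == attempts:
                raise  # give up and let the caller see the last error
            sleep(delay)


# Simulated request that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated network error")
    return "ok"


result = with_retries(flaky, attempts=3, delay=0)
print(result)
```

In the crawler, func could be a lambda that wraps urllib2.urlopen(url).read(), so that a persistent failure ends the run instead of looping forever.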
Conclusion
The conclusion: it's all deceptive. Only the first page has a few nice pictures -,-