Python's most common use cases are rapid web development, crawlers, and automated operations: writing simple websites, auto-posting scripts, email scripts, and simple verification-code recognition scripts.
Crawler development also involves a lot of reusable code. Here is a summary of those pieces, so we can save ourselves some work in the future.
1. Basic crawling of web pages
GET method:

import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

POST method:

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2. Using proxy IP
During crawler development, your IP often gets blocked, so proxy IPs are needed.
The urllib2 package has a ProxyHandler class, through which you can set a proxy for accessing web pages, as follows:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
3. Cookie processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify users and track sessions. Python provides the cookielib module for handling cookies. The main job of cookielib is to provide objects that store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage-collected; none of these steps needs to be handled manually.
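If you need cookies to survive between runs, cookielib also provides file-backed cookie jars. A minimal sketch, assuming the site at http://XXXX sets cookies; the filename cookies.txt is just an example:

import urllib2, cookielib

# MozillaCookieJar can save/load cookies in Netscape cookie-file format
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
urllib2.urlopen('http://XXXX').read()
cookie_jar.save(ignore_discard=True)   # persist cookies to disk
# later: cookie_jar.load(ignore_discard=True) to restore them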
Manually adding cookies:

# assumes request is an existing urllib2.Request object
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)
4. Disguise as a browser
Some websites dislike crawler visits, so they refuse all requests from crawlers. As a result, accessing such websites directly with urllib2 often produces HTTP Error 403: Forbidden.
Pay special attention to certain headers; the server checks them:
User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed.
Both can be handled by modifying the headers of the HTTP request. The code snippet is as follows:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
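For the Content-Type case mentioned above, here is a minimal sketch of posting JSON to a REST interface; the endpoint URL and payload are made up purely for illustration:

import json
import urllib2

# hypothetical REST endpoint and payload, for illustration only
data = json.dumps({'name': 'abc'})
request = urllib2.Request('http://abcde.com/api', data)
request.add_header('Content-Type', 'application/json')
print urllib2.urlopen(request).read()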
5. Page parsing
For page parsing, the most powerful tool is of course the regular expression. Regexes differ from site to site and from user to user, so there is no need to explain too much here. Two good resources are attached:
Introduction to Regular Expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Regular expression online testing: http://tool.oschina.net/regex/
Next come the parsing libraries. Two commonly used ones are lxml and BeautifulSoup:
lxml: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup: http://cuiqingcai.com/1319.html
My evaluation of these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is less efficient, but it has practical features, such as getting the source code of an HTML node via a search; lxml is written in C, is efficient, and supports XPath.
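As a quick illustration of the two styles, here is a minimal sketch that extracts link texts from a page, assuming bs4 and lxml are installed; the URL is just an example:

import urllib2
from bs4 import BeautifulSoup
from lxml import etree

html = urllib2.urlopen('http://www.baidu.com').read()

# BeautifulSoup: search-style API
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a.get_text()

# lxml: XPath-style API
tree = etree.HTML(html)
for text in tree.xpath('//a/text()'):
    print text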
6. Processing of Verification Code
Some simple verification codes can be recognized with simple methods, and I have only done basic recognition myself. Some verification codes, however, are hostile even to humans, such as the ones on 12306; these can be handled manually through a captcha-solving platform, which of course charges a fee.
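The article does not name a specific tool, but for very simple, clean verification codes one common approach is OCR, for example with pytesseract. A rough sketch, where captcha.png is a hypothetical image file:

from PIL import Image
import pytesseract

# only works for simple, low-noise verification codes
image = Image.open('captcha.png')
print pytesseract.image_to_string(image)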
7. gzip compression
Have you ever encountered web pages that stay garbled no matter how you transcode them? Ha, that means you don't know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true of XML web services, because XML data compresses extremely well.
But the server will not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key: create the Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.
Then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
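In practice it is safer to check whether the response was actually gzip-compressed before decompressing. A small sketch reusing the f from the snippets above:

import StringIO
import gzip

data = f.read()
# only decompress if the server actually sent gzip
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data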
8. Multi-threaded concurrent crawling
If a single thread is too slow, you need multi-threading. Here is a simple thread-pool template. This program simply prints the numbers 0 through 9, but you can see that it runs concurrently.
Although Python's multi-threading is rather underwhelming, it can still improve efficiency to some extent for network-intensive work like crawling.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# specific processing function that handles a single task
def do_something_using(arguments):
    print arguments

# worker: keeps pulling tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put JOBS tasks onto the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()