This post starts from a case study: crawling Taobao search results and saving them as a csv file. The video versions failed review when I uploaded them to Bilibili (B站) and another site, so I can only post the text version here, with links to the related videos below.
Video on writing a basic crawler: https://www.bilibili.com/video/BV1CW411C7ZM?spm_id_from=333.999.0.0
Video on writing a more advanced, object-oriented crawler: https://www.bilibili.com/video/BV1VW411y7Cd?from=search&seid=2911900904516132152&spm_id_from=333.337.0.0
1. An IDE (integrated development environment) is a tool for writing code that speeds up development. PyCharm is used here.
Creating a project in PyCharm creates a project folder. The .idea folder and the venv virtual environment folder are generated automatically when the project is created and should not be deleted.
2. Write code with a clear idea of the business process: work out the steps first, then code them one by one.
3. How a crawler works
(1) Everything shown in the browser is downloaded from a server (via HTTP requests)
(2) A page is composed of multiple network resources
(3) Network resources include pages, images, videos, audio, files, and interfaces (APIs)
To access a network resource you need its address, the URL, whose full name is Uniform Resource Locator
(4) One HTTP request downloads exactly one network resource
(5) The browser sends HTTP requests one by one, in order (see the sketch below)
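A quick sketch of these points using the requests module (introduced in point 8 below): downloading a page and one of its images takes two separate requests, one per resource. The two URLs here are only placeholders, not real pages.

import requests

# Request 1: download the page itself (one resource, the html text)
page_url = 'https://example.com/index.html'     # placeholder page url
page_response = requests.get(page_url)
print(page_response.status_code)                # 200 means the resource was downloaded
html = page_response.text                       # the html source as text

# Request 2: download one image referenced by the page (another resource, binary data)
image_url = 'https://example.com/logo.png'      # placeholder image url
image_response = requests.get(image_url)
with open('logo.png', 'wb') as f:
    f.write(image_response.content)             # .content holds the raw bytes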
4. A crawler pretends to be a browser in order to download data from the server: it imitates how the browser downloads network resources, that is, how the browser sends requests.
5. Tools: Google Chrome. Press F12 to open the developer tools, click the Network tab, then press F5 to refresh the page and watch the requests go out.
6. To send a request you must have an address to send it to, namely the URL.
7. The first request for a web page downloads its HTML source code. Generally this is only a bare shell, like an unfurnished room: the remaining resources (images, scripts, data) are fetched afterwards to "decorate" it.
8. requests is a third-party module, a Python library written by experts. It is like buying a home appliance: install it following the instructions, then learn how to use it. In Python you just download and install it for free: pip install requests
The browser downloads resources by sending HTTP requests; the crawler simulates this by using requests to send a GET request.
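A minimal sketch of such a GET request, assuming requests is already installed with pip install requests. Adding a User-Agent header makes the request look more like a real browser; the header value below is only an example, and for Taobao you may also need to copy cookies from a logged-in browser session.

import requests

url = 'https://s.taobao.com/search?q=python'
headers = {
    # example browser identity; copy the real one from Chrome's developer tools if needed
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
print(response.status_code)   # 200 means the download succeeded
html = response.text          # the html source code as text
print(html[:200])             # preview the first 200 characters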
9. See the code below for the crawler program. One line, data = json.loads(data), has a problem on my machine that I have not solved yet. Interested readers are welcome to help fix it or complete the program.
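One possible cause, which is only my guess and not a confirmed diagnosis: the non-greedy pattern (.*?); stops at the very first semicolon, and semicolons can appear inside the JSON itself (for example inside URLs), so the captured text is cut short and json.loads fails. Another possibility is that Taobao returns a login page instead of the search results, so g_page_config is missing entirely. A small sketch of the truncation problem and a stricter pattern that only stops at a semicolon at the end of a line:

import re
import json

# Simulated page source: the JSON value itself contains a semicolon
html = 'g_page_config = {"url": "https://a.com/x;y", "num": 1};\n other js...'

# Original pattern: stops at the first ';', which sits inside the JSON here,
# so the captured text is not valid JSON and json.loads would fail
broken = re.findall(r'g_page_config = (.*?);', html, re.S)[0]
print(broken)        # {"url": "https://a.com/x   <- truncated

# Stricter pattern (my assumption): only stop at a ';' that is followed by a line break
fixed = re.findall(r'g_page_config = (.*?);\s*\n', html, re.S)[0]
data = json.loads(fixed.strip(' ;\n'))
print(data['num'])   # 1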
10. CSV files: create a new text document and change its extension from .txt to .csv; it then opens in Excel by default (or in Notepad). Write the header line first, separating the column names with English (half-width) commas; when the file is opened in Excel it appears as a table.
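A minimal sketch of writing such a file by hand, in the same style as the full program below. demo.csv is just an example file name; using utf-8-sig instead of utf-8 usually keeps Chinese characters readable when the file is opened in Excel.

# write a header line and one data row, columns separated by English commas
fb = open('demo.csv', 'w', encoding='utf-8-sig')
fb.write('title,price\n')            # header line
fb.write('python book,59.9\n')       # one data row in the same column order
fb.close()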
import requests
import re       # built-in (standard) library, no installation needed
import json     # built-in library for converting JSON text into dictionaries

# Crawl Taobao search results and save them as a csv file

# 1. Download the html source code of the search page
# url of the search page
index_url='https://s.taobao.com/search?q=python&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306'
# Send the http request to download the page
response=requests.get(index_url)
# .text gives the downloaded source code as text
html=response.text

# 2. Use regular expressions (the built-in re library) to extract information from the html; xpath could also be used
# Basic pattern for extracting information: re.findall(r'where it starts (.*?) where it ends', html, re.S)[0]
# . matches any character, * repeats it any number of times, ? makes the match non-greedy, () marks the group to capture
data=re.findall(r'g_page_config = (.*?);',html,re.S)[0]
# The extracted text is JSON data, but not yet standard: it carries extra semicolons and spaces that must be removed
# Strip illegal characters from the beginning and end, such as spaces, semicolons, and line breaks
data=data.strip(' ;\n')
# Convert the JSON text into a dictionary (key-value pairs) with the json library
data=json.loads(data)   # this line still raises an error for me; see point 9 above
print(data)
# Take out the item list
data=data['mods']['itemlist']['data']['auctions']

# 3. Save the data (data persistence)
# One dictionary per commodity
goods_info=[]
for item in data:
    temp={
        'title':item['title']
    }
    goods_info.append(temp)   # collect each item (the program is not finished beyond this point)

fb=open('csv file name','w',encoding='utf-8')
# Write the header line
fb.write('title,price,number of buyers,free shipping,is tmall,region,store name,url\n')
print(data)
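A hedged sketch of how the saving step might be completed, picking up where the program above stops. Every dictionary key except 'title' is an assumption about the structure of Taobao's JSON, so print data[0].keys() first and adjust the names to what the real data contains.

# continue from the 'data' list of auctions produced above
goods_info = []
for item in data:
    temp = {
        'title': item['title'],
        'price': item.get('view_price', ''),      # assumed key for the price
        'sales': item.get('view_sales', ''),      # assumed key for the number of buyers
        'region': item.get('item_loc', ''),       # assumed key for the region
        'store': item.get('nick', ''),            # assumed key for the store name
        'url': item.get('detail_url', ''),        # assumed key for the item link
    }
    goods_info.append(temp)

fb = open('taobao_python.csv', 'w', encoding='utf-8-sig')   # example file name
fb.write('title,price,sales,region,store,url\n')            # header line
for g in goods_info:
    # replace commas inside the fields so they do not break the columns
    row = [str(g[k]).replace(',', ' ') for k in ('title', 'price', 'sales', 'region', 'store', 'url')]
    fb.write(','.join(row) + '\n')
fb.close()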