How to store crawled data in txt, Excel, and MySQL with Python
1. Page analysis
The page I crawled is Tencent Sports; the link is as follows:
https://nba.stats.qq.com/player/list.htm
**Observe the page:** the left side lists the 30 NBA teams, and the right side shows the details of the selected team's players.
At this point the idea is clear: every time we click a team, that team's player information appears on the right.
The whole crawling process boils down to three steps:
- ① Get the url of each team's player page;
- ② Use Python to extract the data from each page;
- ③ Store the extracted data in the different storage targets.
So what we need now is the url of each team's player page and the pattern that connects them.
Each time we click a team, we can copy its url. Here are the urls of three teams:
```
# 76ers
https://nba.stats.qq.com/player/list.htm#teamId=20
# Rockets
https://nba.stats.qq.com/player/list.htm#teamId=10
# Miami Heat
https://nba.stats.qq.com/player/list.htm#teamId=14
```
**Comparing the three urls above,** we can see they are almost identical; only the number after teamId differs. It is a fair guess that this number identifies the team, 30 numbers for the 30 teams.
In my experience, almost anything carrying the "Tencent" name is a dynamic web page; I have run into this many times, and the basic static-scraping approach gets no data at all. If you don't believe it, check the page source yourself: right click -> View page source.
Click "Copy + paste" in the page you want to search, and then click "Copy + paste" in the page you want to search. If there are 0 records like the above figure, you can basically judge that the web page belongs to a dynamic web page * *. If you directly obtain the source code, you will not find the data you want.
Therefore, if you want the data on this page, driving a real browser with Selenium is one way to get it.
2. Data crawling
Go straight to the code!
```python
from selenium import webdriver

# Create a browser object; this automatically opens a Chrome window
browser = webdriver.Chrome()
# Send the request; this navigates Chrome to the 76ers' player-list page
browser.get("https://nba.stats.qq.com/player/list.htm#teamId=20")
# Maximize the window
browser.maximize_window()

# Get player Chinese names
chinese_names = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[2]/a')
chinese_names_list = [i.text for i in chinese_names]
# Get player English names (stored in the "title" attribute)
english_names = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[3]/a')
english_names_list = [i.get_attribute('title') for i in english_names]
# Get player jersey numbers
numbers = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[4]')
numbers_list = [i.text for i in numbers]
# Get player positions
locations = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[5]')
locations_list = [i.text for i in locations]
# Get player heights
heights = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[6]')
heights_list = [i.text for i in heights]
# Get player weights
weights = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[7]')
weights_list = [i.text for i in weights]
# Get player ages
ages = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[8]')
ages_list = [i.text for i in ages]
# Get player years in the league ("qiu ling")
qiu_lings = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[9]')
qiu_lings_list = [i.text for i in qiu_lings]
```
Only one team is crawled here; scraping the players of the remaining 29 teams is left to you. The code is essentially identical, so follow the same pattern. [Just add a loop, it's not hard!] A sketch follows below.
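Here is a minimal sketch of that loop, with two caveats: the post only shows three teamId values (20, 10, 14), so the list below is a placeholder you would have to complete yourself; and because teamId sits behind a `#` fragment, I add a `refresh()` so the roster actually reloads between teams. Only two columns are shown; the other six work exactly as in the block above.

```python
from selenium import webdriver

# Placeholder: only three ids appear in this post; collect all 30 yourself
team_ids = [20, 10, 14]

browser = webdriver.Chrome()
all_rows = []
for team_id in team_ids:
    browser.get(f"https://nba.stats.qq.com/player/list.htm#teamId={team_id}")
    # teamId lives in the "#" fragment, so force a reload to refresh the roster
    browser.refresh()
    names = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[2]/a')
    numbers = browser.find_elements_by_xpath('//div[@class="players"]//tr[@class="show"]/td[4]')
    # ...repeat for the remaining six columns exactly as in section 2...
    for name, number in zip(names, numbers):
        all_rows.append((name.text, number.text))
browser.quit()
```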
3. Store to txt
Saving data to a txt file is very simple, and txt is compatible with almost every platform. The only drawback is that it is inconvenient to query. If you don't have strong requirements for querying or data structure and just want convenience, txt storage is a fine choice.
**Note:** what gets written to a txt file must be a `str` string.
**The rules for writing data to a txt file are as follows:** writing starts at the beginning of the file and proceeds from left to right, and nothing wraps on its own. So if you want regularly structured data, insert tabs ("\t") between fields and newlines ("\n") between records yourself.
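A tiny illustration of that rule, using made-up sample rows:

```python
# Tabs ("\t") separate the fields of a record; "\n" starts the next record
with open("demo.txt", "w", encoding="utf-8") as f:
    f.write("name\theight\tweight\n")       # header row
    f.write("Player A\t208cm\t109kg\n")     # sample data, not crawled values
    f.write("Player B\t196cm\t88kg\n")
```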
Using this article's data as an example, here is how to store it in a txt file:
```python
for i in zip(chinese_names_list, english_names_list, numbers_list, locations_list,
             heights_list, weights_list, ages_list, qiu_lings_list):
    with open("NBA.txt", "a+", encoding="utf-8") as f:
        # zip() yields tuples; convert to a string and strip the parentheses
        f.write(str(i)[1:-1])
        # Newline, so the next record is written on the next line
        f.write("\n")
```
4. Store in excel
Excel files come in two formats: csv and xlsx. For saving data to either, the pandas library is the most convenient choice, as shown below.
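Here is a minimal sketch with pandas, assuming the eight lists from section 2 are already in memory. The column names are labels I chose for illustration; `to_csv()` covers the csv format and `to_excel()` the xlsx format (the latter needs the openpyxl package installed).

```python
import pandas as pd

# Assemble the crawled lists into a DataFrame (column names are my own labels)
df = pd.DataFrame({
    "chinese_name": chinese_names_list,
    "english_name": english_names_list,
    "number": numbers_list,
    "location": locations_list,
    "height": heights_list,
    "weight": weights_list,
    "age": ages_list,
    "qiu_ling": qiu_lings_list,
})

# csv format; utf-8-sig keeps Chinese characters readable when opened in Excel
df.to_csv("NBA.csv", index=False, encoding="utf-8-sig")
# xlsx format (requires the openpyxl package)
df.to_excel("NBA.xlsx", index=False)
```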
5. Store to mysql
MySQL is a relational database: it stores data in two-dimensional tables much like an Excel sheet, made up of rows and columns, where each row is a record and each column is a field.
To make the process easier to follow, I'll go through it in two steps:
① Create a table nba
To insert data into the database, we first need to create a table named nba.
```python
import pymysql

# 1. Connect to the database
db = pymysql.connect(host='localhost', user='root', password='123456',
                     port=3306, db='demo', charset='utf8')

# 2. Create the table
# Create a cursor object
cursor = db.cursor()
# Table-creation statement (named "nba" to match the insert code below)
sql = """
create table nba(
    chinese_names_list varchar(20),
    english_names_list varchar(20),
    numbers_list varchar(20),
    locations_list varchar(20),
    heights_list varchar(20),
    weights_list varchar(20),
    ages_list varchar(20),
    qiu_lings_list varchar(20)
) charset=utf8
"""
# Execute the SQL statement
cursor.execute(sql)
# Close the connection
db.close()
```
② Insert data into table nba
```python
import pymysql

# 1. Organize the data: zip the eight lists into row tuples
data_list = []
for i in zip(chinese_names_list, english_names_list, numbers_list, locations_list,
             heights_list, weights_list, ages_list, qiu_lings_list):
    data_list.append(i)

# 2. Connect to the database
db = pymysql.connect(host='localhost', user='root', password='123456',
                     port=3306, db='demo', charset='utf8')
# Create a cursor object
cursor = db.cursor()

# 3. Insert the data
sql = ('insert into nba(chinese_names_list, english_names_list, numbers_list, '
       'locations_list, heights_list, weights_list, ages_list, qiu_lings_list) '
       'values(%s,%s,%s,%s,%s,%s,%s,%s)')
try:
    cursor.executemany(sql, data_list)
    db.commit()
    print("Insert successful")
except Exception:
    print("Insert failed")
    db.rollback()
db.close()
```
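To confirm the insert worked, you can read the data back; here is a quick sketch using the same connection settings as above:

```python
import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456',
                     port=3306, db='demo', charset='utf8')
cursor = db.cursor()
cursor.execute("select count(*) from nba")
print(cursor.fetchone())  # a 1-tuple holding the number of inserted rows
db.close()
```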