The historical draw data of Lotto (dlt) and two-color ball (ssq) is public, so there are no legal obstacles to collecting it.
We take the data from a public website and save it as CSV, Excel or other data files, which makes later sorting and analysis convenient.
The whole process has three steps: fetching the web page (finding a working URL on the site), parsing the web page (separating out the useful data) and saving the data (writing it to a data file).
1, Preparation: import modules
import requests  # HTTP library used to fetch web pages
import xlwt  # library for writing Excel (.xls) files
import time  # time lookup and conversion
from bs4 import BeautifulSoup  # HTML parsing library
These modules are commonly used for scraping data from the web.
2, Get web page
def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:  # status code 200 means the page was fetched successfully
        print('Read web page successfully!')
        return response.text  # return the page content
    else:
        print('Failed to read web page,No data!')
        return None
The request sends only a simple User-Agent header. The site's anti-scraping measures are not hard to get past, so this is enough. The header value can be copied from the developer tools of the Google Chrome browser.
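One thing to watch when returning `response.text` is the text encoding: if the bytes are decoded with the wrong codec, Chinese text in the page turns into accented Latin characters. The sketch below uses only the standard library and assumes (this is not confirmed by the site) that the page is served as GB2312. The string 注数 is a column header of the result table, and ×¢Êý is what it looks like after a wrong decode.

```python
# Assumption: the server sends GB2312-encoded bytes.
raw = '注数'.encode('gb2312')      # the bytes as they would arrive over the network

garbled = raw.decode('latin-1')    # wrong codec: each byte becomes one Latin-1 character
repaired = raw.decode('gb2312')    # right codec: the original text comes back

print(garbled)   # ×¢Êý
print(repaired)  # 注数
```

With requests, the matching fix would be to set `response.encoding = 'gb2312'` before reading `response.text`. The digits and dates that this article extracts are plain ASCII, so they survive either way.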
3, Parse data
def parse_html(html, ccc):
    # Column names and the position of the date differ between the two games
    if ccc == 'dlt':
        ddd1 = 'b1'
        ddd2 = 'b2'
        mm = 14
    else:
        ddd1 = 'r6'
        ddd2 = 'b1'
        mm = 15
    soup = BeautifulSoup(html, 'lxml')  # create the page parser object
    # print(soup)
    # Read from the fourth tr to the second-to-last tr: analysis of the page
    # shows that the first three tr and the last tr hold no useful data
    for item in soup.select('tr')[3:-1]:  # item is each selected tr in turn; the function ends when the list is exhausted
        tds = item.select('td')
        try:
            # Some rows hold only &nbsp; (a blank in the page) or too few cells
            # and would raise an error here; such rows are skipped
            yield {
                'issue': tds[0].text,  # td 0 is the lottery issue number
                'r1': tds[1].text,  # td 1 onward are the winning numbers
                'r2': tds[2].text,
                'r3': tds[3].text,
                'r4': tds[4].text,
                'r5': tds[5].text,
                ddd1: tds[6].text,
                ddd2: tds[7].text,
                'time': tds[mm].text  # draw date
            }
        except IndexError:
            pass
The first part adapts the function to the chosen game, Lotto or two-color ball. Lotto (dlt) draws five red balls (front area) and two blue balls (back area); two-color ball (ssq) draws six red balls and one blue ball.
The second part parses the data.
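The difference between the two games can also be written as a lookup table instead of an if/else. This is a minimal sketch and not the article's code: `LAYOUTS` and `row_to_record` are hypothetical names, and the indexes 14 and 15 are the `mm` values used in `parse_html`.

```python
# Column layout per game; indexes refer to the <td> cells of one result row.
# 'dlt': 5 red + 2 blue balls, draw date in cell 14.
# 'ssq': 6 red + 1 blue ball, draw date in cell 15.
LAYOUTS = {
    'dlt': {'balls': ['r1', 'r2', 'r3', 'r4', 'r5', 'b1', 'b2'], 'date_index': 14},
    'ssq': {'balls': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'b1'], 'date_index': 15},
}


def row_to_record(cells, game):
    """Turn the list of cell texts of one row into a dict (hypothetical helper)."""
    layout = LAYOUTS[game]
    record = {'issue': cells[0]}
    # Cells 1..7 hold the seven drawn numbers in both games.
    for offset, key in enumerate(layout['balls'], start=1):
        record[key] = cells[offset]
    record['time'] = cells[layout['date_index']]
    return record


# One Lotto row; the cell texts are those of issue 21100 on the page.
sample = ['21100', '07', '11', '23', '26', '28', '02', '07', '1,059,494,718',
          '3', '10,000,000', '96', '110,807', '280,471,793', '2021-08-30']
record = row_to_record(sample, 'dlt')
print(record)
```

Adding a third game would then mean adding one entry to the table rather than another branch.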
The original web page contains many tr elements. Inspect the page source to find where the data you need starts. From each row I select only some of the cells:
<tr class="t_tr1"> <td class="t_tr1">1</td> <td class="t_tr1">2</td> <td class="t_tr1">3</td> <td class="t_tr1">4</td> <td class="t_tr1">5</td> <td class="t_tr1">注数</td> <td class="t_tr1">奖金(元)</td> <td class="t_tr1">注数</td> <td class="t_tr1">奖金(元)</td> </tr>
<tbody id="tdata">
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21100</td><td class="cfont2">07</td><td class="cfont2">11</td><td class="cfont2">23</td><td class="cfont2">26</td><td class="cfont2">28</td><td class="cfont4">02</td><td class="cfont4">07</td><td class="t_tr1">1,059,494,718</td><td class="t_tr1">3</td><td class="t_tr1">10,000,000</td><td class="t_tr1">96</td><td class="t_tr1">110,807</td><td class="t_tr1">280,471,793</td><td class="t_tr1">2021-08-30</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21099</td><td class="cfont2">15</td><td class="cfont2">19</td><td class="cfont2">27</td><td class="cfont2">28</td><td class="cfont2">30</td><td class="cfont4">03</td><td class="cfont4">04</td><td class="t_tr1">1,062,893,865</td><td class="t_tr1">1</td><td class="t_tr1">10,000,000</td><td class="t_tr1">130</td><td class="t_tr1">128,865</td><td class="t_tr1">303,058,009</td><td class="t_tr1">2021-08-28</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21098</td><td class="cfont2">02</td><td class="cfont2">07</td><td class="cfont2">21</td><td class="cfont2">27</td><td class="cfont2">33</td><td class="cfont4">07</td><td class="cfont4">09</td><td class="t_tr1">1,007,650,453</td><td class="t_tr1">2</td><td class="t_tr1">10,000,000</td><td class="t_tr1">62</td><td class="t_tr1">202,870</td><td class="t_tr1">276,951,852</td><td class="t_tr1">2021-08-25</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21097</td><td class="cfont2">11</td><td class="cfont2">26</td><td class="cfont2">30</td><td class="cfont2">31</td><td class="cfont2">33</td><td class="cfont4">03</td><td class="cfont4">10</td><td class="t_tr1">975,519,760</td><td class="t_tr1">1</td><td class="t_tr1">10,000,000</td><td class="t_tr1">45</td><td class="t_tr1">293,323</td><td class="t_tr1">272,897,259</td><td class="t_tr1">2021-08-23</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21096</td><td class="cfont2">07</td><td class="cfont2">08</td><td class="cfont2">10</td><td class="cfont2">20</td><td class="cfont2">21</td><td class="cfont4">01</td><td class="cfont4">05</td><td class="t_tr1">927,073,812</td><td class="t_tr1">10</td><td class="t_tr1">8,465,507</td><td class="t_tr1">204</td><td class="t_tr1">69,310</td><td class="t_tr1">297,348,639</td><td class="t_tr1">2021-08-21</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21095</td><td class="cfont2">10</td><td class="cfont2">15</td><td class="cfont2">19</td><td class="cfont2">20</td><td class="cfont2">30</td><td class="cfont4">11</td><td class="cfont4">12</td><td class="t_tr1">962,561,876</td><td class="t_tr1">2</td><td class="t_tr1">10,000,000</td><td class="t_tr1">116</td><td class="t_tr1">114,636</td><td class="t_tr1">279,549,610</td><td class="t_tr1">2021-08-18</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21094</td><td class="cfont2">05</td><td class="cfont2">06</td><td class="cfont2">24</td><td class="cfont2">27</td><td class="cfont2">33</td><td class="cfont4">05</td><td class="cfont4">12</td><td class="t_tr1">931,384,310</td><td class="t_tr1">6</td><td class="t_tr1">9,900,472</td><td class="t_tr1">77</td><td class="t_tr1">173,225</td><td class="t_tr1">275,141,551</td><td class="t_tr1">2021-08-16</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21093</td><td class="cfont2">20</td><td class="cfont2">25</td><td class="cfont2">27</td><td class="cfont2">28</td><td class="cfont2">31</td><td class="cfont4">08</td><td class="cfont4">11</td><td class="t_tr1">948,527,900</td><td class="t_tr1">2</td><td class="t_tr1">10,000,000</td><td class="t_tr1">66</td><td class="t_tr1">239,598</td><td class="t_tr1">302,609,762</td><td class="t_tr1">2021-08-14</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21092</td><td class="cfont2">12</td><td class="cfont2">14</td><td class="cfont2">27</td><td class="cfont2">28</td><td class="cfont2">34</td><td class="cfont4">03</td><td class="cfont4">07</td><td class="t_tr1">907,549,564</td><td class="t_tr1">2</td><td class="t_tr1">10,000,000</td><td class="t_tr1">108</td><td class="t_tr1">111,939</td><td class="t_tr1">281,386,147</td><td class="t_tr1">2021-08-11</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21091</td><td class="cfont2">01</td><td class="cfont2">04</td><td class="cfont2">10</td><td class="cfont2">21</td><td class="cfont2">29</td><td class="cfont4">03</td><td class="cfont4">05</td><td class="t_tr1">877,764,456</td><td class="t_tr1">5</td><td class="t_tr1">10,000,000</td><td class="t_tr1">155</td><td class="t_tr1">61,811</td><td class="t_tr1">272,675,404</td><td class="t_tr1">2021-08-09</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21090</td><td class="cfont2">06</td><td class="cfont2">07</td><td class="cfont2">19</td><td class="cfont2">26</td><td class="cfont2">32</td><td class="cfont4">06</td><td class="cfont4">12</td><td class="t_tr1">880,472,278</td><td class="t_tr1">2</td><td class="t_tr1">10,000,000</td><td class="t_tr1">236</td><td class="t_tr1">46,324</td><td class="t_tr1">289,448,960</td><td class="t_tr1">2021-08-07</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21089</td><td class="cfont2">01</td><td class="cfont2">17</td><td class="cfont2">24</td><td class="cfont2">28</td><td class="cfont2">35</td><td class="cfont4">10</td><td class="cfont4">12</td><td class="t_tr1">851,200,235</td><td class="t_tr1">3</td><td class="t_tr1">10,000,000</td><td class="t_tr1">48</td><td class="t_tr1">309,133</td><td class="t_tr1">269,906,690</td><td class="t_tr1">2021-08-04</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21088</td><td class="cfont2">02</td><td class="cfont2">14</td><td class="cfont2">31</td><td class="cfont2">34</td><td class="cfont2">35</td><td class="cfont4">01</td><td class="cfont4">07</td><td class="t_tr1">822,562,305</td><td class="t_tr1">2</td><td class="t_tr1">9,148,546</td><td class="t_tr1">63</td><td class="t_tr1">197,930</td><td class="t_tr1">264,720,425</td><td class="t_tr1">2021-08-02</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21087</td><td class="cfont2">10</td><td class="cfont2">14</td><td class="cfont2">15</td><td class="cfont2">25</td><td class="cfont2">28</td><td class="cfont4">03</td><td class="cfont4">10</td><td class="t_tr1">797,251,485</td><td class="t_tr1">6</td><td class="t_tr1">9,400,653</td><td class="t_tr1">177</td><td class="t_tr1">68,131</td><td class="t_tr1">290,446,990</td><td class="t_tr1">2021-07-31</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21086</td><td class="cfont2">02</td><td class="cfont2">03</td><td class="cfont2">07</td><td class="cfont2">16</td><td class="cfont2">17</td><td class="cfont4">06</td><td class="cfont4">10</td><td class="t_tr1">822,607,828</td><td class="t_tr1">1</td><td class="t_tr1">10,000,000</td><td class="t_tr1">71</td><td class="t_tr1">176,586</td><td class="t_tr1">272,160,233</td><td class="t_tr1">2021-07-28</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21085</td><td class="cfont2">01</td><td class="cfont2">12</td><td class="cfont2">15</td><td class="cfont2">26</td><td class="cfont2">35</td><td class="cfont4">10</td><td class="cfont4">11</td><td class="t_tr1">775,634,441</td><td class="t_tr1">6</td><td class="t_tr1">7,421,868</td><td class="t_tr1">99</td><td class="t_tr1">137,086</td><td class="t_tr1">270,848,072</td><td class="t_tr1">2021-07-26</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21084</td><td class="cfont2">02</td><td class="cfont2">10</td><td class="cfont2">14</td><td class="cfont2">30</td><td class="cfont2">33</td><td class="cfont4">06</td><td class="cfont4">09</td><td class="t_tr1">763,493,932</td><td class="t_tr1">6</td><td class="t_tr1">6,943,989</td><td class="t_tr1">130</td><td class="t_tr1">92,500</td><td class="t_tr1">290,591,822</td><td class="t_tr1">2021-07-24</td></tr>
<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21083</td><td class="cfont2">07</td><td class="cfont2">09</td><td class="cfont2">11</td><td class="cfont2">26</td><td class="cfont2">35</td><td class="cfont4">01</td><td class="cfont4">08</td><td class="t_tr1">759,158,453</td><td class="t_tr1">6</td><td class="t_tr1">7,329,020</td><td class="t_tr1">8
Only the issue number, the drawn balls and the draw date are extracted; the prize amounts are not selected. Choose the columns according to your own needs. The HTML above is part of the page.
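How the cells of one row map to list indexes can be checked on a single tr. The sketch below deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; `RowCollector` is a hypothetical name, and the row is issue 21100 from the page.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell texts
        self._row = None      # cells of the <tr> currently being read
        self._in_td = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_td = True
            self._row.append('')

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = ('<tr class="t_tr1"><!--<td>2</td>--><td class="t_tr1">21100</td>'
        '<td class="cfont2">07</td><td class="cfont2">11</td>'
        '<td class="cfont2">23</td><td class="cfont2">26</td>'
        '<td class="cfont2">28</td><td class="cfont4">02</td>'
        '<td class="cfont4">07</td><td class="t_tr1">1,059,494,718</td>'
        '<td class="t_tr1">3</td><td class="t_tr1">10,000,000</td>'
        '<td class="t_tr1">96</td><td class="t_tr1">110,807</td>'
        '<td class="t_tr1">280,471,793</td><td class="t_tr1">2021-08-30</td></tr>')

collector = RowCollector()
collector.feed(html)
cells = collector.rows[0]
print(len(cells))            # 15
print(cells[0], cells[14])   # 21100 2021-08-30
```

The commented-out `<td>2</td>` is not counted as a cell, so the issue number is cell 0 and the draw date is cell 14, which matches `mm=14` for dlt in `parse_html`.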
4, Data recording
Data can be written to data files in different formats; CSV and Excel are the most commonly used.
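The article only shows the Excel route. For the CSV route, a minimal sketch with the standard `csv` module could look like this; `write_to_csv` and `FIELDS` are hypothetical names, and the records have the same keys as the dicts yielded by `parse_html` for dlt.

```python
import csv
import io

FIELDS = ['issue', 'r1', 'r2', 'r3', 'r4', 'r5', 'b1', 'b2', 'time']


def write_to_csv(records, stream):
    """Write an iterable of dicts to an open text stream as CSV (hypothetical helper)."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


records = [
    {'issue': '21100', 'r1': '07', 'r2': '11', 'r3': '23', 'r4': '26',
     'r5': '28', 'b1': '02', 'b2': '07', 'time': '2021-08-30'},
    {'issue': '21099', 'r1': '15', 'r2': '19', 'r3': '27', 'r4': '28',
     'r5': '30', 'b1': '03', 'b2': '04', 'time': '2021-08-28'},
]
buffer = io.StringIO()   # stands in for open('dlt.csv', 'w', newline='')
written = write_to_csv(records, buffer)
print(buffer.getvalue())
```

Because the values are written as text, leading zeros such as `07` stay in the file; a spreadsheet program may still strip them when it opens the CSV.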
# Write data to excel
def write_to_excel(url, ccc):
    if ccc == 'dlt':
        ddd1 = 'b1'
        ddd2 = 'b2'
    else:
        ddd1 = 'r6'
        ddd2 = 'b1'
    f = xlwt.Workbook()  # create the Excel workbook object
    sheet1 = f.add_sheet(ccc, cell_overwrite_ok=True)  # create a sheet named after the game
    row0 = ['period', 'r1', 'r2', 'r3', 'r4', 'r5', ddd1, ddd2, 'time']  # list of all column names
    # Write the first row
    for j in range(0, len(row0)):  # write the column names in order
        sheet1.write(0, j, row0[j])  # row 0, column j
    # Crawl the web page and write the results into the workbook
    i = 0
    html = get_html(url)  # call the user-defined function to fetch the page content
    now1 = time.localtime()
    bbb = str(now1[0]) + str(now1[1]) + str(now1[2]) + str(now1[3]) + str(now1[4]) + str(now1[5])
    filename = 'd:\\yy' + ccc + bbb + '.xls'
    print(filename)
    print('Extracting saved data......')
    if html is not None:  # continue only if the page was read successfully
        for item in parse_html(html, ccc):
            sheet1.write(i + 1, 0, item['issue'])  # row i+1, column 0: the issue number
            sheet1.write(i + 1, 1, item['r1'])
            sheet1.write(i + 1, 2, item['r2'])
            sheet1.write(i + 1, 3, item['r3'])
            sheet1.write(i + 1, 4, item['r4'])
            sheet1.write(i + 1, 5, item['r5'])
            sheet1.write(i + 1, 6, item[ddd1])
            sheet1.write(i + 1, 7, item[ddd2])
            sheet1.write(i + 1, 8, item['time'])
            i += 1  # move on to the next row for the next cycle
    try:
        f.save(filename)
        print('write in EXCEL surface', filename, 'success!')
    except Exception:
        print('write in EXCEL Table failed')
The first part again adapts the function to Lotto or two-color ball.
The second part creates the Excel workbook.
The third part writes the data into the workbook. A new file with a timestamp in its name is saved on every run; the code can be changed to use a fixed file name instead.
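The timestamp in the file name is built by joining the unpadded fields of `time.localtime()`, so the same digits can stand for different moments (`2021` `9` `1` `11` reads the same as `2021` `9` `11` `1`). A fixed-width stamp from `time.strftime` avoids this. The sketch below is an alternative, not the article's code: `make_filename` is a hypothetical helper, and the `d:\` folder prefix is left out.

```python
import time


def make_filename(game, when=None):
    """Return a name such as 'yydlt20210901115147.xls' (hypothetical helper)."""
    if when is None:
        when = time.localtime()
    stamp = time.strftime('%Y%m%d%H%M%S', when)   # always 14 digits, zero padded
    return 'yy' + game + stamp + '.xls'


# A fixed moment so the result is reproducible: 2021-09-01 11:51:47.
moment = time.struct_time((2021, 9, 1, 11, 51, 47, 2, 244, 0))
print(make_filename('dlt', moment))   # yydlt20210901115147.xls
```

Names built this way also sort in chronological order when listed alphabetically.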
5, Main function
# The main function calls the other functions to write the data to Excel
def main():
    ccc = 'dlt'
    url = 'http://datachart.500.com/{0}/history/inc/history.php?limit=12000'.format(ccc)
    print(ccc)
    write_to_excel(url, ccc)  # custom function that writes the data to Excel
    ccc = 'ssq'
    url = 'https://datachart.500.com/{0}/history/newinc/history.php?start=03001&end=21500'.format(ccc)
    print(ccc)
    write_to_excel(url, ccc)  # custom function that writes the data to Excel
The main function chooses Lotto or two-color ball and the URL to fetch. By analysing the structure of the URL, the whole history can be requested in one go, without paging through the results.
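The two-color ball URL carries the requested range in its query string. A minimal sketch of building it from parts is shown below; `build_history_url` is a hypothetical helper, and reading `start` and `end` as issue numbers (for example `03001` as the first issue of 2003) is an assumption based on the values used in `main()`. The dlt URL uses a different path and a `limit` parameter and is not covered here.

```python
from urllib.parse import urlencode

# Path taken from the ssq URL in main(); {game} is filled in per lottery.
BASE = 'https://datachart.500.com/{game}/history/newinc/history.php'


def build_history_url(game, start, end):
    """Build a history URL for a range of issues (hypothetical helper)."""
    query = urlencode({'start': start, 'end': end})
    return BASE.format(game=game) + '?' + query


url = build_history_url('ssq', '03001', '21500')
print(url)
```

Passing the range as strings keeps the leading zero of `03001`.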
6, Program entry point
if __name__ == '__main__':
    main()
The entry point needs no explanation.
7, Operation
dlt
Read web page successfully!
d:\yydlt202191115147.xls
Extracting saved data......
write in EXCEL surface d:\yydlt202191115147.xls success!
ssq
Read web page successfully!
d:\yyssq202191115157.xls
Extracting saved data......
write in EXCEL surface d:\yyssq202191115157.xls success!
>>>
Operation results
Lotto (dlt)
period | r1 | r2 | r3 | r4 | r5 | b1 | b2 | time |
21100 | 07 | 11 | 23 | 26 | 28 | 02 | 07 | 2021-08-30 |
21099 | 15 | 19 | 27 | 28 | 30 | 03 | 04 | 2021-08-28 |
21098 | 02 | 07 | 21 | 27 | 33 | 07 | 09 | 2021-08-25 |
21097 | 11 | 26 | 30 | 31 | 33 | 03 | 10 | 2021-08-23 |
21096 | 07 | 08 | 10 | 20 | 21 | 01 | 05 | 2021-08-21 |
21095 | 10 | 15 | 19 | 20 | 30 | 11 | 12 | 2021-08-18 |
21094 | 05 | 06 | 24 | 27 | 33 | 05 | 12 | 2021-08-16 |
21093 | 20 | 25 | 27 | 28 | 31 | 08 | 11 | 2021-08-14 |
21092 | 12 | 14 | 27 | 28 | 34 | 03 | 07 | 2021-08-11 |
21091 | 01 | 04 | 10 | 21 | 29 | 03 | 05 | 2021-08-09 |
21090 | 06 | 07 | 19 | 26 | 32 | 06 | 12 | 2021-08-07 |
21089 | 01 | 17 | 24 | 28 | 35 | 10 | 12 | 2021-08-04 |
21088 | 02 | 14 | 31 | 34 | 35 | 01 | 07 | 2021-08-02 |
Two-color ball (ssq)
period | r1 | r2 | r3 | r4 | r5 | r6 | b1 | time |
21099 | 09 | 11 | 17 | 18 | 20 | 27 | 15 | 2021-08-31 |
21098 | 01 | 10 | 13 | 18 | 26 | 32 | 05 | 2021-08-29 |
21097 | 03 | 11 | 12 | 13 | 25 | 28 | 12 | 2021-08-26 |
21096 | 01 | 07 | 11 | 14 | 15 | 26 | 11 | 2021-08-24 |
21095 | 08 | 12 | 17 | 24 | 27 | 28 | 13 | 2021-08-22 |
21094 | 09 | 11 | 24 | 25 | 28 | 33 | 15 | 2021-08-19 |
21093 | 05 | 11 | 15 | 23 | 28 | 33 | 03 | 2021-08-17 |
21092 | 02 | 07 | 08 | 10 | 12 | 31 | 03 | 2021-08-15 |