Note: the following content is adapted from Professor Li Hongyi's course.
Homework 1: Linear Regression
Goal: predict the PM2.5 value of the 10th hour from the 18 features (including PM2.5) of the preceding 9 hours.
Load 'train.csv'
train.csv contains 12 months of data, with 20 days per month and 24 hours per day (18 features per hour).
import sys
import pandas as pd
import numpy as np
from google.colab import drive
!gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip
!unzip data.zip
# data = pd.read_csv('gdrive/My Drive/hw1-regression/train.csv', header = None, encoding = 'big5')
data = pd.read_csv('./train.csv', encoding = 'big5')
Preprocessing
Keep only the value part of the table and set the 'NR' entries (the RAINFALL fields) to 0.
In addition, if you want to re-run this cell in Colab, run everything again from the top (re-run all the cells above) to avoid unintended results. In a standalone script this is not a problem, but in a notebook the variable data persists between runs, so each repeated execution of this cell strips another three columns: the first run keeps the data after the third column of the original table, the second run drops three more columns from what is left, and so on.
data = data.iloc[:, 3:]
data[data == 'NR'] = 0
raw_data = data.to_numpy()
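If you would rather have a cell that is safe to re-run (a minimal sketch, not part of the original homework code), slice the freshly read DataFrame into a new variable instead of reassigning data in place:

# Re-run-safe variant (sketch): the raw DataFrame is never overwritten,
# so executing this cell repeatedly always drops the same three columns.
raw_df = pd.read_csv('./train.csv', encoding = 'big5')
values = raw_df.iloc[:, 3:].copy()
values[values == 'NR'] = 0
raw_data = values.to_numpy()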
Extract Features (1)
The original 4320 × 24 raw data (rows: 12 months × 20 days × 18 features; columns: 24 hours) are regrouped by month into 12 arrays of 18 (features) × 480 (hours).
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample
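A quick sanity check on the result (a sketch; the expected shapes follow directly from the regrouping above):

# month_data should hold 12 arrays, one per month, each 18 features x 480 hours
print(len(month_data))       # 12
print(month_data[0].shape)   # (18, 480)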
Extract Features (2)
Each month has 480 hours. Taking every 9 consecutive hours as one sample yields 480 − 9 = 471 samples per month, so there are 471 × 12 samples in total, each with 9 × 18 = 162 features (18 features per hour × 9 hours).
There are 471 × 12 corresponding targets (the PM2.5 value of the 10th hour).
x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour : day * 24 + hour + 9].reshape(1, -1)  # vector dim: 18 * 9 (18 features, 9 hours each)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]  # target value: PM2.5 of the 10th hour
print(x)
print(y)
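As a quick sanity check (a sketch; the expected shapes follow from 12 × 471 samples and 18 × 9 features):

# x: one row per sliding window, one column per (feature, hour) pair
print(x.shape)  # (5652, 162) = (12 * 471, 18 * 9)
print(y.shape)  # (5652, 1)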
Normalize (1)
mean_x = np.mean(x, axis = 0)  # 18 * 9
std_x = np.std(x, axis = 0)    # 18 * 9
for i in range(len(x)):            # 12 * 471
    for j in range(len(x[0])):     # 18 * 9
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x
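The same normalization can be written in vectorized form (a sketch; it matches the loop above exactly, including leaving zero-variance columns untouched):

# Normalize only the columns with non-zero standard deviation
mask = std_x != 0
x[:, mask] = (x[:, mask] - mean_x[mask]) / std_x[mask]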
Split Training Data Into "train_set" and "validation_set"
This part is a simple demonstration for questions 2 and 3 of the report: it generates the train_set used for training and the validation_set that is not put into training and is used only for validation.
import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8) :, :]
y_validation = y[math.floor(len(y) * 0.8) :, :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))
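Note that this is a chronological split: the first 80% of samples train, the rest validate. If you want a shuffled split instead (an alternative sketch, not required by the assignment):

# Shuffle indices before splitting so train/validation mix all months
idx = np.random.permutation(len(x))
split = math.floor(len(x) * 0.8)
x_train_set, y_train_set = x[idx[:split]], y[idx[:split]]
x_validation, y_validation = x[idx[split:]], y[idx[split:]]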
Training
(Unlike the figure above, the following code uses Root Mean Square Error.)
Because of the constant (bias) term, the dimension (dim) needs one extra column; the eps term is a tiny value added so that the Adagrad denominator is never 0.
Each dimension (dim) has its own gradient and weight (w), learned iteration by iteration (iter_time).
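Concretely, with N = 12 × 471 training samples, the quantities in the code below are (note the printed loss is the RMSE, while the gradient is taken with respect to the un-averaged sum of squared errors):

$$L(w) = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\bigl(x^{n}\cdot w - y^{n}\bigr)^{2}}, \qquad g_{t} = 2X^{\top}(Xw_{t} - y), \qquad w_{t+1} = w_{t} - \frac{\eta\,g_{t}}{\sqrt{\sum_{i=0}^{t} g_{i}^{2} + \varepsilon}}$$

where the squared-gradient sum, square root, and division are element-wise, so each dimension gets its own effective learning rate.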
dim = 18 * 9 + 1
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)
learning_rate = 100
iter_time = 1000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2)) / 471 / 12)  # rmse
    if t % 100 == 0:
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y)  # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
w
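To answer the report questions with the split from above, you can evaluate the trained w on the validation set. A sketch (note that x_validation was sliced before the bias column was added to x, so a column of ones must be prepended; also, the cell above trains on all of x, so for a clean comparison you would train on x_train_set instead):

# Validation RMSE (sketch): add the bias column, then compute the error
x_val = np.concatenate((np.ones([len(x_validation), 1]), x_validation), axis = 1).astype(float)
val_loss = np.sqrt(np.sum(np.power(np.dot(x_val, w) - y_validation, 2)) / len(x_validation))
print(val_loss)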
Testing
# testdata = pd.read_csv('gdrive/My Drive/hw1-regression/test.csv', header = None, encoding = 'big5')
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18 * 9], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i : 18 * (i + 1), :].reshape(1, -1)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x
Prediction
The illustration is the same as above. With the learned weights and the test data, the target can be predicted.
w = np.load('weight.npy')
ans_y = np.dot(test_x, w)
ans_y
Save Prediction to CSV File
import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)
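Since pandas is already imported, the same file can also be written more compactly (an equivalent sketch):

# Equivalent CSV output via pandas
df = pd.DataFrame({'id': ['id_' + str(i) for i in range(240)], 'value': ans_y[:, 0]})
df.to_csv('submit.csv', index = False)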
For relevant references, see:
Adagrad :
https://youtu.be/yKKNr-QKz2Q?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&t=705
RMSprop :
https://www.youtube.com/watch?v=5Yt-obwvMHI
Adam :
https://www.youtube.com/watch?v=JXQT_vxqwIs
In addition, on your own Linux system, you can replace the hard-coded file paths with sys.argv (so the file names and locations can be entered in the terminal).
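A minimal sketch of such a command-line version (the argument order here is an assumption, not part of the assignment spec):

# usage: python hw1.py ./train.csv ./test.csv ./submit.csv
import sys
train_path = sys.argv[1]   # replaces the hard-coded './train.csv'
test_path = sys.argv[2]    # replaces the hard-coded './test.csv'
output_path = sys.argv[3]  # replaces the hard-coded 'submit.csv'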
Finally, you can surpass the baseline by tuning the learning rate, iter_time (the number of iterations), and the number of features (how many hours to use and which feature fields to take), or even by trying different models.
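For example, to use fewer hours or fewer feature fields, adjust the slicing on the x produced in the Extract Features step (before the bias column is added). A hedged sketch; the column layout follows the reshape above, where feature f at hour-offset h maps to column f * 9 + h, and feature index 9 is PM2.5:

# Keep only the PM2.5 readings (feature index 9) of the last 5 hours
pm25_cols = [9 * 9 + h for h in range(4, 9)]  # columns 85..89
x_small = x[:, pm25_cols]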
For the problem template of Report, please refer to: https://docs.google.com/document/d/1s84RXs2AEgZr54WCK9IgZrfTF-6B1td-AlKR9oqYa4g/edit