Li Hongyi's Deep Learning: Homework 01

Posted by tweek on Tue, 05 Oct 2021 23:20:24 +0200

Note: the following content comes from Professor Li Hongyi's course.

Homework 1: Linear Regression

Goal: predict the PM2.5 value of the 10th hour from the 18 features (including PM2.5) of the first 9 hours.

Load 'train.csv'

train.csv contains data for 12 months, with 20 days per month and 24 hours per day (18 features per hour).

import sys
import pandas as pd
import numpy as np
from google.colab import drive 
!gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip
!unzip data.zip
# data = pd.read_csv('gdrive/My Drive/hw1-regression/train.csv', header = None, encoding = 'big5')
data = pd.read_csv('./train.csv', encoding = 'big5')

Preprocessing

Keep only the numeric value columns and fill all 'NR' entries (the RAINFALL field) with 0.
In addition, if you re-run this cell in Colab, run everything from the top again ("Run all"); otherwise the repeated slicing gives results you don't want. In a standalone script you would not hit this, but in Colab the variable data keeps its value between runs, so the first run drops the first three columns of the original data, the second run drops three more columns from what is left, and so on.

data = data.iloc[:, 3:]      # drop the date / station / item columns, keep the 24 hourly columns
data[data == 'NR'] = 0       # 'NR' (no rainfall) becomes 0
raw_data = data.to_numpy()
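
As a quick sanity check (illustrative only, assuming the standard train.csv layout of this assignment), you can print the shape of raw_data; a column count smaller than 24 usually means this cell was run more than once, as described above.

print(raw_data.shape)   # expected (4320, 24): 12 months * 20 days * 18 feature rows, 24 hourly columns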

Extract Features (1)

The original data (4320 rows * 24 hourly columns) are reorganized by month into 12 blocks of 18 (features) * 480 (hours).

month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample
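
A quick check (a sketch): each monthly block should hold 18 features by 20 * 24 = 480 hours.

print(len(month_data), month_data[0].shape)   # 12 blocks, each (18, 480)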

Extract Features (2)

Each month has 480 hours; every 9 consecutive hours form one sample, which gives 480 - 9 = 471 samples per month, so there are 471 * 12 samples in total, and each sample has 9 * 18 features (18 features per hour * 9 hours).

There are 471 * 12 corresponding targets (the PM2.5 value of the 10th hour).

x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:,day * 24 + hour : day * 24 + hour + 9].reshape(1, -1) # vector dim: 18*9, flattened feature by feature (18 groups of 9 hours)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9] # target: PM2.5 (row 9) at the 10th hour
print(x)
print(y)
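
As described above there should be 12 * 471 = 5652 samples with 18 * 9 = 162 features each; a quick check (illustrative):

print(x.shape)   # (5652, 162)
print(y.shape)   # (5652, 1)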

Normalize (1)

mean_x = np.mean(x, axis = 0) #18 * 9 
std_x = np.std(x, axis = 0) #18 * 9 
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9 
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x

Split Training Data into "train_set" and "validation_set"
This part is a simple demonstration for Questions 2 and 3 of the report in the assignment: it generates the train_set used for training and the validation_set that is not put into training and is used only for validation.

import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))

Training

(Note: the code below uses Root Mean Square Error as the printed loss.)

Because of the bias (constant) term, the dimension (dim) needs one extra column; the eps term is a tiny value added to keep the Adagrad denominator from being 0.

Each dimension (dim) has its own gradient and weight (w), which are updated iteration by iteration (iter_time).
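
Concretely, the update performed by the code below is Adagrad applied to the gradient of the sum of squared errors (the printed loss is the RMSE, but the gradient is taken of the unscaled squared error):

g_t = 2 X^T (X w_t - y),    w_{t+1} = w_t - \frac{\eta}{\sqrt{\sum_{i=0}^{t} g_i^2 + \epsilon}} \, g_t

where the square, square root, and division are element-wise and \eta is the learning rate.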

dim = 18 * 9 + 1
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)
learning_rate = 100
iter_time = 1000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse
    if(t%100==0):
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) # gradient of the sum of squared errors, dim*1
    adagrad += gradient ** 2                                # accumulate squared gradients for Adagrad
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
w
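
If you want to use the validation split from above, here is a minimal sketch of computing the RMSE on it with the trained weights (note that in this walkthrough w was trained on all of x, so this only demonstrates the mechanics, not a true held-out score):

x_valid = np.concatenate((np.ones([x_validation.shape[0], 1]), x_validation), axis = 1).astype(float)  # add the bias column
print(np.sqrt(np.mean(np.power(np.dot(x_valid, w) - y_validation, 2))))  # validation RMSE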

Testing

# testdata = pd.read_csv('gdrive/My Drive/hw1-regression/test.csv', header = None, encoding = 'big5')
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]          # keep only the 9 hourly value columns
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18*9], dtype = float)
for i in range(240):                      # 240 test samples, each 18 features x 9 hours
    test_x[i, :] = test_data[18 * i : 18 * (i + 1), :].reshape(1, -1)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x

Prediction

With the trained weights and the test data, the target can be predicted.

w = np.load('weight.npy')
ans_y = np.dot(test_x, w)
ans_y

Save Prediction to CSV File

import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)

For relevant references, see:

Adagrad :
https://youtu.be/yKKNr-QKz2Q?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&t=705

RMSprop :
https://www.youtube.com/watch?v=5Yt-obwvMHI

Adam :
https://www.youtube.com/watch?v=JXQT_vxqwIs

In addition, on your own Linux system you can replace the hard-coded file paths with sys.argv, so that the file names and locations can be passed in from the terminal.
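
For example, a minimal sketch of reading the paths from the command line (the argument order and script name here are assumptions; adapt them to your own setup):

import sys
import pandas as pd
# e.g.  python hw1.py train.csv test.csv submit.csv
train_path, test_path, output_path = sys.argv[1], sys.argv[2], sys.argv[3]
data = pd.read_csv(train_path, encoding = 'big5')
testdata = pd.read_csv(test_path, header = None, encoding = 'big5')
# ... the rest of the script stays the same, writing the predictions to output_path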

Finally, you can beat the baseline by adjusting the learning rate, iter_time (the number of iterations), the number of features (how many hours to use, which feature fields to take), or even by trying different models.
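
For instance, a sketch of using fewer hours per sample (keeping only the last 5 hours is just an example, not part of the original homework): rebuild x from month_data with a shorter window, then redo the normalization and training with dim = 18 * 5 + 1.

x5 = np.empty([12 * 471, 18 * 5], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            # keep only the last 5 hours of each 9-hour window; the target y is unchanged
            x5[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour + 4 : day * 24 + hour + 9].reshape(1, -1)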

For the report question template, see: https://docs.google.com/document/d/1s84RXs2AEgZr54WCK9IgZrfTF-6B1td-AlKR9oqYa4g/edit

Topics: Deep Learning