# Li Hongyi's Deep Learning: Homework 01

Posted by tweek on Tue, 05 Oct 2021 23:20:24 +0200

Note: the following material comes from teacher Li Hongyi's course.

# Homework 1: Linear Regression

Goal: predict the PM2.5 value of the 10th hour from the 18 features (including PM2.5) of the first 9 hours.

train.csv contains data for 12 months, 20 days per month, 24 hours per day (18 features per hour).
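As a sanity check (my own sketch, not part of the original homework), this layout means the numeric block of train.csv has one row per (day, feature) pair and one column per hour; the PM2.5 row index 9 is taken from the feature-extraction code further below:

```python
# One row per (day, feature), one column per hour of that day.
months, days_per_month, features, hours = 12, 20, 18, 24
total_rows = months * days_per_month * features
print(total_rows)  # 4320 rows x 24 hourly columns

# Row index of feature f on day d of month m (0-based):
def row_index(m, d, f):
    return 18 * (20 * m + d) + f

print(row_index(0, 0, 9))     # 9: PM2.5 (feature row 9) on the first day
print(row_index(11, 19, 17))  # 4319: the last row of the data
```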

```python
import sys
import pandas as pd
import numpy as np

# Download and unpack the course data (Colab shell commands)
!gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip
!unzip data.zip
data = pd.read_csv('./train.csv', encoding='big5')
```

# Preprocessing

Keep only the columns with the needed values, and replace every "NR" entry in the RAINFALL field with 0.

Also, if you want to re-run this cell in Colab, restart and run everything from the top. Otherwise each re-run slices the already-sliced DataFrame again: the first run drops the first three columns of the original data, the second run drops three more columns, and so on, leaving data you don't want. (A standalone script does not have this problem, but a Colab session keeps the mutated `data` between runs.)
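The re-run pitfall can be reproduced with a toy DataFrame (a sketch for illustration; the real train.csv is not used here):

```python
import pandas as pd

# Each execution of `data = data.iloc[:, 3:]` slices the
# already-sliced frame again, silently eating three more columns.
data = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=list('abcdef'))
data = data.iloc[:, 3:]    # first run: columns d, e, f remain
print(list(data.columns))  # ['d', 'e', 'f']
data = data.iloc[:, 3:]    # accidental second run: nothing is left
print(list(data.columns))  # []
```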

```python
data = data.iloc[:, 3:]    # drop the date/station/item columns
data[data == 'NR'] = 0     # 'NR' (no rainfall) becomes 0
raw_data = data.to_numpy()
```

# Extract Features (1)

The original 4320 * 24 data (4320 rows = 12 months * 20 days * 18 features; 24 columns = hours) are regrouped by month into 12 blocks of 18 (features) * 480 (hours).

```python
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        # each day contributes 18 feature rows and 24 hourly columns
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample
```

# Extract Features (2)

Each month has 480 hours. Taking every 9 consecutive hours as one sample gives 480 - 9 = 471 samples per month, so there are 471 * 12 samples in total, and each sample has 9 * 18 features (18 features per hour * 9 hours).

There are correspondingly 471 * 12 targets (the PM2.5 value of the 10th hour).
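The window count above can be checked in a couple of lines (a quick sketch, not part of the original code):

```python
# With 480 consecutive hours per month, a 9-hour input window plus its
# 10th-hour target can start at hours 0 .. 470, giving 471 samples.
hours_per_month = 480
window = 9
n_windows = hours_per_month - window  # the target needs hour start + 9 to exist
print(n_windows)       # 471 samples per month
print(n_windows * 12)  # 5652 samples in total
```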

```python
x = np.empty([12 * 471, 18 * 9], dtype=float)
y = np.empty([12 * 471, 1], dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue  # the last window of the month starts at hour 470
            x[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour : day * 24 + hour + 9].reshape(1, -1)  # vector dim: 18 * 9
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]  # row 9 is PM2.5
print(x)
print(y)
```

# Normalize (1)

```python
mean_x = np.mean(x, axis=0)  # 18 * 9
std_x = np.std(x, axis=0)    # 18 * 9
for i in range(len(x)):         # 12 * 471 samples
    for j in range(len(x[0])):  # 18 * 9 features (the inner loop must run over features, not samples)
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x
```

# Split Training Data Into "train_set" and "validation_set"

This part is a simple demonstration for questions 2 and 3 of the report: it produces the train_set used for training, and the validation_set that is held out from training and used only for validation.

```python
import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8):, :]
y_validation = y[math.floor(len(y) * 0.8):, :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))
```

# Training

(Unlike the figure above, the code below uses Root Mean Square Error.)

Because of the constant (bias) term, the dimension (dim) needs one extra column; eps is a tiny value added to keep Adagrad's denominator from becoming 0.

Each dimension (dim) has its own gradient and weight (w), which are learned iteration by iteration (iter_time).
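Before the full training loop, here is a minimal sketch of the Adagrad update on a toy 1-D quadratic loss L(w) = (w - 3)^2 (the problem is made up; only the update rule matches the homework): each weight divides its step by the square root of its accumulated squared gradients, so the effective learning rate shrinks over time.

```python
import numpy as np

w, target = 0.0, 3.0
learning_rate, eps = 1.0, 1e-10
adagrad = 0.0
for _ in range(200):
    gradient = 2 * (w - target)  # dL/dw for L(w) = (w - target)^2
    adagrad += gradient ** 2     # accumulate squared gradients
    w -= learning_rate * gradient / np.sqrt(adagrad + eps)
print(round(w, 3))  # 3.0 (converges to the minimum)
```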

```python
dim = 18 * 9 + 1
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis=1).astype(float)  # prepend bias column
learning_rate = 100
iter_time = 1000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2)) / 471 / 12)  # rmse
    if t % 100 == 0:
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y)  # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)  # Adagrad update
np.save('weight.npy', w)
w
```

# Testing

```python
# testdata = pd.read_csv('gdrive/My Drive/hw1-regression/test.csv', header = None, encoding = 'big5')
testdata = pd.read_csv('./test.csv', header=None, encoding='big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18 * 9], dtype=float)
for i in range(240):
    test_x[i, :] = test_data[18 * i : 18 * (i + 1), :].reshape(1, -1)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):  # iterate over features, not samples
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis=1).astype(float)  # prepend bias column
test_x
```

# Prediction

(The illustration is the same as above.)

With the trained weights and the test data, the target can be predicted.

```python
w = np.load('weight.npy')
ans_y = np.dot(test_x, w)
ans_y
```

# Save Prediction to CSV File

```python
import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]  # ans_y[i] is a 1-element array; write the scalar
        csv_writer.writerow(row)
        print(row)
```

For relevant references, please refer to:

https://youtu.be/yKKNr-QKz2Q?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&t=705

RMSprop: