Machine learning notes 1-0: Regression

Posted by ram4nd on Mon, 17 Jan 2022 17:37:05 +0100

*Note: this blog is based on Mr. Li Hongyi's 2020 machine learning course (video link)

Regression Model

A regression model learns a mapping between input data and output data. In the simplest case,
$$y = wx + b \tag{1}$$
where $w$ represents the weight of the network and $b$ the bias; $x$ is the network input and $y$ is the network output.

Loss Function

The loss function is used to evaluate how well the model fits the data: the smaller the loss value, the more accurate the model's predictions.
Obviously, a loss function should satisfy the following basic requirements:

  • It is a function of the model's parameters, i.e. w and b.
  • The smaller its value, the more accurate the model's predictions.
  • It is differentiable, and its derivatives with respect to w and b can be computed.

Common loss functions include:

  1. Cross entropy
  2. Mean squared error
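
For reference, here is a minimal sketch (not part of the original code) of these two losses for a single sample; mse and cross_entropy are hypothetical helper names:

import math

def mse(y0, y1):
    # mean squared error for one sample: (prediction - target)^2
    return (y1 - y0) ** 2

def cross_entropy(p0, p1, eps=1e-12):
    # binary cross entropy: p0 is the target probability, p1 the predicted probability
    return -(p0 * math.log(p1 + eps) + (1 - p0) * math.log(1 - p1 + eps))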

Gradient Descent

Suppose that for input x the network should output y0 in the zero-error case, but it currently outputs y1; the network then needs to update w and b so that its output gets closer to y0.
By the definition of the loss function, as long as the updated w and b make the loss smaller, the output will move closer to y0.
Therefore it suffices to take the partial derivatives of the loss function with respect to w and b and update:


$$w^{t+1} = w^t - \frac{\partial l}{\partial w}$$
$$b^{t+1} = b^t - \frac{\partial l}{\partial b}$$

where $\frac{\partial l}{\partial w}$ is the partial derivative of the loss function with respect to $w$, and $\frac{\partial l}{\partial b}$ the partial derivative with respect to $b$.
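
The update rule above omits the step size; in the code later in this post each partial derivative is scaled by a learning rate lr before being subtracted. Below is a minimal sketch of one such step for the model y = wx + b with squared loss (gradient_step is a hypothetical helper, not from the original post):

def gradient_step(w, b, x, y0, lr=0.01):
    # one gradient-descent step for y1 = w*x + b with loss = (y1 - y0)**2
    y1 = w * x + b                  # model prediction
    dw = 2 * (y1 - y0) * x          # partial derivative of the loss w.r.t. w
    db = 2 * (y1 - y0)              # partial derivative of the loss w.r.t. b
    return w - lr * dw, b - lr * db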

Example

Suppose we have a function $y = -26x + 40$:

def true_function(x):
    return -26*x+40

Now design a model to fit this function. Suppose the model is set to
$$y = w_1 x^2 + w_2 x + b$$
At most, such a model can fit a univariate quadratic function nearly perfectly. (For simplicity, the code below only implements the linear form $y = wx + b$.)


The model is defined as follows:

class Model:
    def __init__(self):
        # initialize the weight and bias to 1
        self.w = 1
        self.b = 1

    def cal(self, x):
        # forward pass: y = w*x + b
        return self.w * x + self.b

model = Model()

Select the mean squared error as the loss function, i.e.
$$loss = (y_1 - y_0)^2$$
where $y_0$ is the true value and $y_1$ is the value fitted by the model.


According to the chain rule, let y1_w denote the partial derivative of the model output with respect to w. The partial derivative of the loss function with respect to w is then
$$\frac{\partial loss}{\partial w} = 2(y_1 - y_0)\,\frac{\partial y_1}{\partial w}$$
and similarly for b. In code:

def loss(y0, y1):
    # squared error between the true value y0 and the prediction y1
    v = (y1 - y0) ** 2
    return v

def loss_w(y0, y1, y1_w1):
    # partial derivative of the loss w.r.t. w, where y1_w1 = dy1/dw (= x here)
    v = 2 * (y1 - y0) * y1_w1
    return v

def loss_b(y0, y1, y1_b):
    # partial derivative of the loss w.r.t. b, where y1_b = dy1/db (= 1 here)
    v = 2 * (y1 - y0) * y1_b
    return v
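
As a quick sanity check (not in the original post), the analytic gradient returned by loss_w can be compared against a finite-difference estimate; the snippet assumes the model and functions defined above:

# Perturb w slightly and compare the change in the loss with the analytic
# partial derivative; the two printed values should be close.
x = 5
y0 = true_function(x)
y1 = model.cal(x)
analytic = loss_w(y0, y1, x)
eps = 1e-6
model.w += eps
numeric = (loss(y0, model.cal(x)) - loss(y0, y1)) / eps
model.w -= eps  # restore the weight so training below still starts from w = 1
print(analytic, numeric)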

Assuming the input is x, the process of repeatedly updating the parameters from this input can be written as follows:

from matplotlib import pyplot as plt

ls = []   # loss after each update
dws = []  # gradient w.r.t. w at each update
dbs = []  # gradient w.r.t. b at each update
def update(x, lr=0.0001, epoch=1000):
    for i in range(epoch):
        y1 = model.cal(x)       # model prediction
        y0 = true_function(x)   # true value
        l = loss(y0, y1)
        ls.append(l)
        dw = loss_w(y0, y1, x)  # dy1/dw = x
        db = loss_b(y0, y1, 1)  # dy1/db = 1
        dws.append(dw)
        dbs.append(db)
        model.w -= dw * lr      # gradient-descent step
        model.b -= db * lr
    print("loss:{} w1:{:.2f} b:{:.2f}".format(l, model.w, model.b))

Now take the single input x = 5, perform 1000 parameter updates, and plot the loss (and the gradients) against the number of updates. As the figure shows, the loss does keep decreasing as the parameters are updated, but the final result is still far from the correct function.

update(5)
plt.subplot(221)
plt.plot(ls)
plt.subplot(222)
plt.plot(dws)
plt.subplot(223)
plt.plot(dbs)
plt.show()
loss:0.27581895497589537 w1:-17.36 b:-2.67

Problems:

  • The reason for this result is that there is too little data. With only one data point, it is easy to land, by chance, on a different function that happens to produce the same output as the correct function at that point. Therefore more data is needed to train the model, and the update function must also be modified to handle multiple input points.

  • It was also noticed in the experiment that once the learning rate lr is set too large, the parameters easily explode: a very large parameter value makes the subsequently computed gradients even larger, eventually causing numerical overflow so that training cannot continue.

Solutions:

  • Adjust the learning rate several times during training
  • Modify the parameter-update logic so that a single update cannot change the parameters too much

The first approach is the job of an optimizer and will be described later.
For simplicity, the second approach is used here: add a limit to each parameter update, so that the magnitude of a single update never exceeds bound_value.
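
The per-update bound can be expressed as a small helper (clip is a hypothetical function; update2 below inlines the same min/max logic):

def clip(value, bound):
    # restrict value to the interval [-bound, bound]
    return min(bound, max(-bound, value))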

The modified update function is as follows:

def update2(xs, ys, lr=1e-7, epoch=1000):
    bound_value = 1e7  # bound on the accumulated gradient per update
    ws = []
    bs = []
    ls = []
    for i in range(epoch):
        l = d_w = d_b = 0
        # accumulate the loss and the gradients over all data points
        for x, y0 in zip(xs, ys):
            y1 = model.cal(x)
            l += loss(y0, y1)
            d_w += loss_w(y0, y1, x)  # dy1/dw = x
            d_b += loss_b(y0, y1, 1)  # dy1/db = 1
        # clip the gradients to [-bound_value, bound_value], then scale by lr
        d_w = min(bound_value, max(-bound_value, d_w)) * lr
        d_b = min(bound_value, max(-bound_value, d_b)) * lr
        ls.append(l)
        ws.append(d_w)
        bs.append(d_b)
        model.w -= d_w
        model.b -= d_b
    print("loss:{} w:{:.2f} b:{:.2f}".format(l, model.w, model.b))
    plt.subplot(121)
    plt.plot(ls)
    plt.subplot(122)
    plt.scatter(xs, ys, color="b", label="true")
    plt.plot(xs, [model.cal(x) for x in xs], color="r", label="ours")
    plt.legend()
    plt.show()

update = update2
datas_x = list(range(-10, 30))
datas_y = [true_function(x) for x in datas_x]
update(datas_x, datas_y, epoch=10000, lr=1e-7)
loss:28946.69946353684 w:-24.50 b:5.16

Comparing the plot of the fitted function with the correct one, the fit is now visibly better. Judging from the parameter values alone, however, it is still not the result we want.


There are two solutions:

  1. Starting from the data: add more training data.

  2. Starting from the number of training iterations: since the two curves do not yet coincide completely, simply train for more iterations (see the sketch after this list).

    loss:3309.083946276638 w:-26.00 b:27.09
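
A minimal sketch of option 2, continuing to train the same model with more iterations (the exact epoch count here is an assumption; the post only reports the resulting loss and parameters):

# Keep training the already partially fitted model; the epoch count is an
# assumption, not taken from the original post.
update(datas_x, datas_y, epoch=100000, lr=1e-7)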

Another Example

Next, the function y = 36*x + 1000 is used for the experiment, with the results shown below. From the experimental results, b changes very slowly during fitting.


From the parameter-update process it can be seen that the partial derivative with respect to b is much smaller in magnitude than the one with respect to w, so it is unreasonable to use the same learning rate for both; a possible remedy is sketched after the output below.

loss:4050028.206281854 w:35.32 b:548.32
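
One way to act on this observation (an illustrative sketch, not the post's code; update3, lr_w and lr_b are hypothetical names) is to give w and b separate learning rates:

def update3(xs, ys, lr_w=1e-7, lr_b=1e-3, epoch=1000):
    # Same training loop as update2, but with separate learning rates for w
    # and b, since the gradient with respect to b is much smaller in
    # magnitude. The specific learning-rate values here are assumptions.
    bound_value = 1e7
    for i in range(epoch):
        d_w = d_b = 0
        for x, y0 in zip(xs, ys):
            y1 = model.cal(x)
            d_w += loss_w(y0, y1, x)
            d_b += loss_b(y0, y1, 1)
        model.w -= min(bound_value, max(-bound_value, d_w)) * lr_w
        model.b -= min(bound_value, max(-bound_value, d_b)) * lr_b
    print("w:{:.2f} b:{:.2f}".format(model.w, model.b))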

Topics: Python Machine Learning AI