Chapter 2 Exercises of Dive Into Deep Learning

Posted by harsha on Thu, 27 Jan 2022 03:21:33 +0100

(I've just started studying deep learning and am trying to record my solutions to all the exercises in the book. I'm still a beginner, so if there are any mistakes or misunderstandings, please point them out.)

2. Preliminaries

2.1 Data Manipulation

Question 1

Run the code in this section. Then change the conditional statement X == Y to X < Y or X > Y, and see what kind of tensor you get.

import torch

X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
X < Y, X > Y

Output

(tensor([[ True, False,  True, False],
         [False, False, False, False],
         [False, False, False, False]]),
 tensor([[False, False, False, False],
         [ True,  True,  True,  True],
         [ True,  True,  True,  True]]))

Question 2

Replace the two tensors that are operated on elementwise in the broadcasting-mechanism example with tensors of other shapes, e.g. three-dimensional tensors. Is the result the same as you expected?

If the shapes are (2×1×3) and (1×3×2), an error is raised:

a = torch.tensor([[[1,2,3]],[[5,3,5]]])  # 2x1x3
b = torch.tensor([[[1,2],[3,5],[6,7]]])  # 1x3x2
a + b
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [103], in <module>
      1 a = torch.tensor([[[1,2,3]],[[5,3,5]]])
      2 b = torch.tensor([[[1,2],[3,5],[6,7]]])
----> 3 a + b

RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 2

If the shapes are (1×2×3) and (2×1×3):

a = torch.tensor([[[1, 2, 3], [4, 5, 6]]])  # 1x2x3
b = torch.tensor([[[7, 8, 9]], [[4, 5, 6]]])  # 2x1x3
a + b

The result has shape (2×2×3):

tensor([[[ 8, 10, 12],
         [11, 13, 15]],

        [[ 5,  7,  9],
         [ 8, 10, 12]]])

Regarding the broadcasting mechanism, PyTorch's official documentation gives two rules:
1. Each tensor has at least one dimension.
2. Iterating over the dimension sizes starting from the trailing dimension, each pair of sizes must be equal, or one of them must be 1, or one of them must not exist.

An article by "Dog King" found on this site explains rule 2 in great detail: compare the dimensions of the two tensors x and y in order from the trailing end; each pair of corresponding dimensions must match. They match if any of the following holds (see the sketch after this list):
- the two dimensions are equal in size;
- one tensor has the dimension and the other does not;
- both tensors have the dimension, but one of them has size 1.
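
A minimal sketch of the "one of which does not exist" case (shapes chosen purely for illustration): a (3×1×4) tensor and a (4,) tensor are aligned from the trailing dimension, the missing leading dimensions of the second operand are treated as size 1, and both are expanded to (3×1×4).

import torch

a = torch.arange(12).reshape(3, 1, 4)  # shape (3, 1, 4)
b = torch.arange(4)                    # shape (4,)
# Trailing dims 4 and 4 match; b's missing leading dims act as size 1,
# so b broadcasts to (3, 1, 4)
(a + b).shape                          # torch.Size([3, 1, 4])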

2.2 Data Preprocessing

os.makedirs(dir_name2, exist_ok=True) creates folders recursively. When the exist_ok parameter is True, no error is raised if the folder already exists.

os.path.join() joins file path components.

Question 1

Create a raw dataset with more rows and columns. (1) Delete the column with the most missing values. (2) Convert the preprocessed dataset to tensor format.

import os
import pandas as pd
import torch

os.makedirs(os.path.join('..', '2.1', 'data2'), exist_ok=True)
data_file = os.path.join('..', '2.1', 'data2', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Floor,Price\n')
    f.write('NA,Pave,2,127500\n')
    f.write('2,NA,1,106000\n')
    f.write('4,NA,NA,178100\n')
    f.write('NA,NA,2,140000\n')
    f.write('NA,NA,2,152000\n')
data = pd.read_csv(data_file)
inputs, outputs = data.iloc[:, 0:3], data.iloc[:, 3]

num = inputs.isnull().sum()  # Number of missing values in each column
Max_NaN = num.idxmax()  # Label of the column with the most missing values
inputs = inputs.drop(Max_NaN, axis=1)  # Drop the column with the most missing values
inputs = inputs.fillna(inputs.mean())  # Fill remaining NaNs with each column's mean
inputs = pd.get_dummies(inputs, dummy_na=True)

x, y = torch.tensor(inputs.values), torch.tensor(outputs.values)  # Convert to tensors

Output

(tensor([[3.0000, 2.0000],
         [2.0000, 1.0000],
         [4.0000, 1.7500],
         [3.0000, 2.0000],
         [3.0000, 2.0000]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000, 152000]))

df.isnull().sum() counts the null values in each column and returns the per-column counts.
.idxmax() returns the index label corresponding to the maximum value of a pandas Series.
drop() deletes rows by default; add axis=1 to delete columns.
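
A small sketch of these three calls on a toy DataFrame (the column names here are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [np.nan, 2, 3]})
df.isnull().sum()                   # per-column NaN counts: a -> 2, b -> 1
worst = df.isnull().sum().idxmax()  # 'a', the column with the most NaNs
df.drop(worst, axis=1)              # axis=1 drops the column; the default axis=0 drops rows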

2.3 Linear Algebra

Question 1

Prove that the transpose of the transpose of a matrix $\mathbf{A}$ is $\mathbf{A}$, i.e. $(\mathbf{A}^T)^T = \mathbf{A}$.

A = torch.randn(4, 3)
A == A.T.T

Output

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

Question 2

Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that the sum of their transposes equals the transpose of their sum, i.e. $\mathbf{A}^T + \mathbf{B}^T = (\mathbf{A} + \mathbf{B})^T$.

A = torch.arange(12).reshape(3,4)
B = torch.randn(3,4)
A.T + B.T == (A + B).T

Output

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

Question 3

Given an arbitrary square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^T$ always symmetric? Why?

A = torch.randn(4, 4)
(A + A.T).T == (A + A.T)

Output

tensor([[True, True, True, True],
        [True, True, True, True],
        [True, True, True, True],
        [True, True, True, True]])
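
Beyond the numerical check, the "why" is a one-line derivation using $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$ from Question 2 and $(\mathbf{A}^T)^T = \mathbf{A}$ from Question 1:

$(\mathbf{A} + \mathbf{A}^T)^T = \mathbf{A}^T + (\mathbf{A}^T)^T = \mathbf{A}^T + \mathbf{A} = \mathbf{A} + \mathbf{A}^T$

So $\mathbf{A} + \mathbf{A}^T$ equals its own transpose, i.e. it is always symmetric.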

Question 4

In this section, we define the tensor X of shape (2,3,4). What is the output of len(X)?

X = torch.arange(24).reshape(2, 3, 4)
len(X)

Output

2

Question 5

For a tensor X of arbitrary shape, does len(X) always correspond to the length of a specific axis of X? Which axis is it?

X = torch.arange(24).reshape(2, 3, 4)  # 2x3x4 tensor
Y = torch.arange(24).reshape(4, 6)  # 4X6 tensor
Z = torch.ones(1)   # 1-D tensor of length 1
len(X), len(Y), len(Z)

Output

(2, 4, 1)

len() returns the length of axis 0 (the first axis) of the tensor.

Question 6

Run A / A.sum(axis=1) to see what happens. Can you analyze the reason?

1. If A is a square matrix:
A = torch.arange(16).reshape(4, 4)  # 4X4
A / A.sum(axis=1)

Output

tensor([[0.0000, 0.0455, 0.0526, 0.0556],
        [0.6667, 0.2273, 0.1579, 0.1296],
        [1.3333, 0.4091, 0.2632, 0.2037],
        [2.0000, 0.5909, 0.3684, 0.2778]])
2. If A is not a square matrix:
A = torch.arange(12).reshape(3, 4)  # 3X4
B = torch.arange(12).reshape(4, 3)  # 4x3
A / A.sum(axis=1)  # Or B / B.sum(axis=1)

Output

RuntimeError                              Traceback (most recent call last)
Input In [78], in <module>
----> 1 A / A.sum(axis=1)

RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 1

If A is square, A.sum(axis=1) reduces along axis 1 and yields a vector whose length equals the number of columns of A, so the elementwise division broadcasts. Note, however, that broadcasting aligns this vector with the last axis, so entry (i, j) is divided by the sum of row j rather than row i; for true row normalization see the sketch below.
If A is not square, the reduced vector's length does not match the number of columns of A, so the elementwise operation fails.
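
If the intent was to normalize each row by its own sum, a sketch of the usual fix (mirroring the keepdims example in Section 2.3 of the book) is to keep the reduced axis, so that shape (n, 1) broadcasts against (n, m):

A = torch.arange(12, dtype=torch.float32).reshape(3, 4)
A / A.sum(axis=1, keepdims=True)  # the sum keeps shape (3, 1), which broadcasts row-wise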

Question 7

Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs along axes 0, 1, and 2?

A = torch.arange(24).reshape(2,3,4)  # 2x3x4
A.sum(axis=0).shape, A.sum(axis=1).shape, A.sum(axis=2).shape

Output

torch.Size([3, 4])
torch.Size([2, 4])
torch.Size([2, 3])


Question 8

Provide tensors with three or more axes to the linalg.norm function and observe its output. What does this function compute for tensors of arbitrary shape?

A, B = torch.randn(2,3,4), torch.randn(3, 4)
outputs1 = torch.linalg.norm(A)  # 2x3x4 tensor
outputs2 = torch.linalg.norm(B)  # 3x4 tensor
A, B, outputs1, outputs2

Output

(tensor([[[ 2.1417, -1.2939, -0.0506,  0.0582],
          [ 0.9437,  0.3785, -0.0736, -0.1000],
          [-0.2323,  1.3399,  0.6603,  0.8154]],
 
         [[-0.1303, -0.4355, -0.2770,  1.8112],
          [ 0.7443, -0.1177,  0.8033,  0.0264],
          [ 0.5158, -0.1448, -0.7694, -0.5072]]]),
 tensor([[-1.1264,  0.0546,  0.4413, -0.1869],
         [-1.7601, -0.4381, -0.2288, -1.7541],
         [-0.1453,  1.0307, -0.8918,  0.7459]]),
 tensor(4.0225),
 tensor(3.2180))

This is equivalent to computing the L2 (Frobenius) norm: with default arguments, torch.linalg.norm flattens a tensor of any shape and returns the square root of the sum of squares of all its elements, which can be read as the "size" of the tensor.
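
A quick sketch confirming this:

A = torch.randn(2, 3, 4)
torch.linalg.norm(A), torch.sqrt((A ** 2).sum())  # the two values agree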

2.4 Calculus

Question 1

Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and its tangent line at $x = 1$.

import numpy as np
from d2l import torch as d2l  # plot() is defined in Section 2.4 of the book; d2l.plot is equivalent

x = np.arange(0.1, 3, 0.1)  # avoid x = 0, where 1/x is undefined
# f(1) = 0 and f'(x) = 3x^2 + 1/x^2, so f'(1) = 4 and the tangent line is y = 4x - 4
d2l.plot(x, [x ** 3 - 1 / x, 4 * x - 4], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])

Output: a plot of $f(x)$ with the tangent line $y = 4x - 4$ touching it at $(1, 0)$.

Question 2

Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.

$\nabla f(\mathbf{x}) = [6x_1,\; 5e^{x_2}]^T$: differentiate the first term with respect to $x_1$ and the second with respect to $x_2$; the gradient is written as a vector.

Question 3

What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$?

$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial \|\mathbf{x}\|_2}{\partial \mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$

The L2 norm can be written as $\sqrt{\mathbf{x}^T\mathbf{x}}$, so the gradient of $f(\mathbf{x})$ follows from differentiating $\sqrt{\mathbf{x}^T\mathbf{x}}$ with respect to $\mathbf{x}$: $f'(\mathbf{x}) = \frac{2\mathbf{x}}{2\sqrt{\mathbf{x}^T\mathbf{x}}} = \frac{\mathbf{x}}{\sqrt{\mathbf{x}^T\mathbf{x}}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$
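
A minimal autograd check of this result (the input values are arbitrary):

import torch

x = torch.randn(4, requires_grad=True)
torch.linalg.norm(x).backward()
torch.allclose(x.grad, x.detach() / torch.linalg.norm(x.detach()))  # True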

Question 4

Can you write out the chain rule for the function $u = f(x, y, z)$, where $x = x(a, b)$, $y = y(a, b)$, and $z = z(a, b)$?

$\frac{\partial u}{\partial a} = \frac{\partial u}{\partial x}\frac{\partial x}{\partial a} + \frac{\partial u}{\partial y}\frac{\partial y}{\partial a} + \frac{\partial u}{\partial z}\frac{\partial z}{\partial a}$
$\frac{\partial u}{\partial b} = \frac{\partial u}{\partial x}\frac{\partial x}{\partial b} + \frac{\partial u}{\partial y}\frac{\partial y}{\partial b} + \frac{\partial u}{\partial z}\frac{\partial z}{\partial b}$
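
A sketch verifying the chain rule with autograd, using the arbitrary (hypothetical) choices $x = ab$, $y = a + b$, $z = a - b$, and $u = xyz$:

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
x, y, z = a * b, a + b, a - b  # x(a,b), y(a,b), z(a,b)
u = x * y * z                  # u = f(x, y, z)
u.backward()
# Chain rule by hand: du/da = (yz) * dx/da + (xz) * dy/da + (xy) * dz/da
manual = (y * z * b + x * z * 1 + x * y * 1).detach()
a.grad, manual                 # both give tensor(9.)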

2.5 Automatic Differentiation

requires_grad is an attribute of Tensor, PyTorch's general data structure; it indicates whether a quantity's gradient information should be retained during computation.

Summing a vector x is equivalent to taking its dot product with an all-ones vector, so the gradient of the sum with respect to each element of x is 1 (see the sketch below).
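
A one-line sketch of this fact:

import torch

x = torch.arange(4.0, requires_grad=True)
x.sum().backward()
x.grad  # tensor([1., 1., 1., 1.]): the gradient of a sum is all ones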

Question 1

Why is it more expensive to calculate the second derivative than the first derivative?

Because computing the second derivative requires first computing the first derivative, and then differentiating again through the graph of that computation.

Question 2

Immediately after running the back propagation function, run it again to see what happens.

import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()
y.backward()  # Perform another backpropagation now
x.grad

Output

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

(By the second backward pass, the intermediate results saved during the first have already been freed.)

Specifying retain_graph=True in the first call to backward fixes this:

import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward(retain_graph=True)  # Retain the computation graph so it is not freed
y.backward()
x.grad

Output

tensor([ 0.,  8., 16., 24.])

By default, PyTorch frees the computation graph after backward, so backpropagation cannot be run twice in a row. Also note that gradients accumulate: the result [0., 8., 16., 24.] is twice the true gradient 4x = [0., 4., 8., 12.], because the second backward added to x.grad instead of replacing it.
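
A sketch of clearing the accumulated gradient between passes (continuing the example above):

x.grad.zero_()           # clear the gradient accumulated by the two passes above
y = 2 * torch.dot(x, x)
y.backward()
x.grad                   # tensor([ 0.,  4.,  8., 12.]), the true gradient 4x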

Question 3

In the control-flow example, we calculated the derivative of d with respect to a. What happens if we change the variable a to a random vector or matrix?

import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

a = torch.randn(size=(2,2), requires_grad=True)  # a is 2x2
d = f(a)
d.backward()

Output

RuntimeError: grad can be implicitly created only for scalar outputs  # backward() cannot be called directly on a vector or matrix output

For a non-scalar output (a vector or matrix), backward() needs an explicit gradient argument of the same shape as the output; alternatively, reduce the output to a scalar first, e.g. with sum(). Both workarounds are sketched below.
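
A sketch reusing f and the 2×2 a from above:

a = torch.randn(size=(2, 2), requires_grad=True)
d = f(a)
d.sum().backward()  # reduce to a scalar first...
# ...or equivalently pass an explicit gradient of the output's shape:
# d.backward(gradient=torch.ones_like(d))
a.grad == d / a     # elementwise True, as in the scalar case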

Question 4

Redesign an example of finding the gradient of a control flow, then run it and analyze the result.

import torch

def f(a):
    b = a / 2
    while b > 1:
        b = b / 2  # halve until b <= 1; the original pow(a, 2) here would loop forever for a > 2
    if b < 3:
        c = b * 2
    else:
        c = b * 3
    return c

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
a.grad == d / a

Output

tensor(True)

This follows the same idea as the example in Section 2.5.4: f is piecewise linear in its input, so d = f(a) = k·a for some constant k, and the gradient is d / a = k.

Question 5

Let $f(x) = \sin(x)$. Plot $f(x)$ and $\frac{df(x)}{dx}$, where the latter is computed by autograd rather than by using $f'(x) = \cos(x)$.

Buggy code:

import torch
import matplotlib.pyplot as plt
import numpy as np
x = torch.arange(-3 * np.pi, 3 * np.pi, 0.1, requires_grad=True)
y = torch.sin(x)

y.sum().backward()

plt.plot(x, y, label='y=sin(x)') 
plt.plot(x, x.grad, label='dsin(x)=cos(x)') 
plt.legend(loc='upper center')
plt.show()

Error:

RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

You cannot call numpy() on a tensor that requires grad; use tensor.detach().numpy() instead. (matplotlib converts its arguments to NumPy arrays, which is what triggers this.)

Another buggy attempt:

y.backward()

Error:

grad can be implicitly created only for scalar outputs

backward() can be called directly only on a scalar. Here each element of y depends only on the corresponding element of x, so summing y gives a scalar whose gradient with respect to x[i] is exactly dy[i]/dx[i] = cos(x[i]); hence y.sum().backward() yields the desired derivative.

import torch
import matplotlib.pyplot as plt
import numpy as np
x = torch.arange(-3 * np.pi, 3 * np.pi, 0.1, requires_grad=True)
y = torch.sin(x)

y.sum().backward()

plt.plot(x.detach(), y.detach(), label='y=sin(x)') 
plt.plot(x.detach(), x.grad, label='dsin(x)=cos(x)') 
plt.legend(loc='upper center')
plt.show()

Output: a plot of $y = \sin(x)$ together with its autograd-computed derivative, which matches $\cos(x)$.

2.6 Probability

%matplotlib inline is a magic function: IPython ships a set of predefined "magic functions" invoked with command-line-style syntax, where "%" is followed by the magic function's name and arguments. This particular one enables inline plotting, so plt.show() can be omitted.

torch.cumsum(input, dim, *, dtype=None, out=None) returns the cumulative sum of the elements of input along dimension dim.
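
A quick sketch:

import torch

t = torch.tensor([1, 2, 3, 4])
torch.cumsum(t, dim=0)  # tensor([ 1,  3,  6, 10])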

Question 1

m = 500 groups of experiments are performed, with n = 10 samples drawn in each group. Change m and n, then observe and analyze the results.

Increasing m or n increases the total sample size, which makes the estimated frequencies converge toward the true probabilities (the law of large numbers).
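
A sketch of the experiment, mirroring the sampling code in the book's Section 2.6 (m and n as in the question):

import torch
from torch.distributions import multinomial

fair_probs = torch.ones(6) / 6  # a fair six-sided die
m, n = 500, 10                  # m groups, n rolls per group
counts = multinomial.Multinomial(n, fair_probs).sample((m,))   # shape (m, 6)
cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)  # running frequency estimates
estimates[-1]                   # approaches 1/6 for every face as m * n grows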

Question 2

Given two events with probabilities $P(A)$ and $P(B)$, compute the upper and lower bounds of $P(A \cup B)$ and $P(A \cap B)$.

$\max(P(A), P(B)) \le P(A \cup B) \le \min(1,\; P(A) + P(B))$
$\max(0,\; P(A) + P(B) - 1) \le P(A \cap B) \le \min(P(A), P(B))$

Question 3

Suppose we have a series of random variables, say A, B, and C, where B depends only on A and C depends only on B. Can you simplify the joint probability P(A, B, C)?

A Markov chain is a set of discrete random variables with the Markov property. Specifically, for a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a set of random variables $X = \{X_n : n > 0\}$ indexed by a one-dimensional countable set and taking values in a countable state space $s$ (i.e. $X = s_i,\; s_i \in s$) is a Markov chain if its conditional probabilities satisfy $P(X_{t+1} \mid X_t, \ldots, X_1) = P(X_{t+1} \mid X_t)$.

Applying this here: $P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid B, A) = P(A)\,P(B \mid A)\,P(C \mid B)$


See p. 7, §4 of the notes for Zhang Yu's Probability Theory and Mathematical Statistics in 9 Lectures, 2022 edition.

Question 4

In Section 2.6.2.6, the first test is more accurate. Why not simply run the first test twice, rather than running both the first and second tests?

Because each test has its own error characteristics. Running the same test twice repeats the same systematic sources of error, so the two results are not independent and the errors compound; running two different tests provides conditionally independent evidence, which offsets test-specific errors and reduces the overall uncertainty far more.
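
A hedged numeric sketch of why a second, different test helps (the probabilities below are illustrative assumptions, not necessarily the book's exact numbers): with conditionally independent tests, each positive result multiplies in its own likelihood ratio.

# Illustrative numbers (assumptions for this sketch)
p_h = 0.0015                 # prior probability of disease
s1, fp1 = 1.00, 0.01         # test 1: sensitivity, false-positive rate
s2, fp2 = 0.98, 0.03         # test 2: sensitivity, false-positive rate

# Posterior after a single positive test 1 (Bayes' rule)
post1 = s1 * p_h / (s1 * p_h + fp1 * (1 - p_h))

# Posterior after both tests positive, assuming conditional independence given H
post12 = s1 * s2 * p_h / (s1 * s2 * p_h + fp1 * fp2 * (1 - p_h))

post1, post12  # roughly 0.13 vs 0.83: two different tests are far more informative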

Topics: Python, PyTorch, Deep Learning