RNN process details

Posted by sebnewyork on Fri, 12 Nov 2021 22:02:07 +0100

RNN and its code flow

This article focuses on the overall flow of an RNN rather than the derivation of backpropagation (BP)

What is RNN

  • Recurrent Neural Network

  • Also translated as "cyclic neural network"; both names refer to the same architecture

Why do we need RNNs?

An ordinary (feedforward) neural network processes each input independently: earlier inputs have no influence on later ones. However, some tasks require handling sequence information, i.e. the previous input is related to the subsequent input.

**For example, to understand the meaning of a sentence, it is not enough to understand each word in isolation; we need to process the whole sequence of words.** Likewise, when processing video, we cannot just analyze each frame separately; we must analyze the whole sequence formed by those frames.

For example, take simple part-of-speech tagging:

  • Input: "I eat apples"
  • Output: "I/n eat/v apples/n"
  • Clearly, the word after "eat" is more likely to be a noun than a verb
  • Conclusion: the information at one position is affected by the information that came before it
  • An RNN is a neural network that can retain information from previous time steps

Basic structure of an RNN (recurrent neural network)

Note that in the figure below, the self-looping model on the left is the actual RNN structure; the model on the right is just the RNN unrolled over time steps to make it easier to understand.

  • The hidden layer retains information from the previous step and passes it on to the next time step
  • Look at the right (unrolled) half of the figure above
    • $x_t$ is the input at time step $t$; $s_t$ is the hidden information retained at time step $t$, i.e. the information from the previous time steps, which is passed on to time step $t+1$
    • $O_t$ is the output at time step $t$
    • $W$, $U$, $V$ are weight matrices
    • $O_t = g(V \cdot S_t)$
    • $S_t = f(U \cdot X_t + W \cdot S_{t-1})$
  • Now go back to the left (folded) half
    • A time step is simply the $t$-th time the model runs
    • Each run receives an input $x$ plus the information left over from the previous run, i.e. the hidden state $s$. The model updates the hidden state with $S = f(U \cdot X + W \cdot S)$, i.e. it merges the current input with the accumulated past information, and then passes it on to the next step
    • The hidden state of the current step therefore combines past information with the current input, so the output can be computed from it alone: $O = g(V \cdot S)$. A minimal code sketch of one such step follows this list
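
To make the two formulas concrete, here is a minimal numpy sketch of a single time step. This is only a sketch: the sizes, the random weights, and the choice of $f$ = tanh and $g$ = softmax are assumptions (they match the character-level code later in this article).

    import numpy as np

    input_size, hidden_size, output_size = 4, 8, 4

    # Weight matrices U, W, V from the formulas above (small random values)
    U = np.random.randn(hidden_size, input_size) * 0.01   # input  -> hidden
    W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
    V = np.random.randn(output_size, hidden_size) * 0.01  # hidden -> output

    def rnn_step(x_t, s_prev):
        # S_t = f(U . X_t + W . S_{t-1}), with f = tanh
        s_t = np.tanh(np.dot(U, x_t) + np.dot(W, s_prev))
        # O_t = g(V . S_t), with g = softmax
        scores = np.dot(V, s_t)
        o_t = np.exp(scores) / np.sum(np.exp(scores))
        return s_t, o_t

    # Run a short toy sequence: the hidden state s carries information across steps
    s = np.zeros((hidden_size, 1))
    for i in (0, 2, 1):                 # a toy sequence of one-hot inputs
        x = np.zeros((input_size, 1))
        x[i] = 1
        s, o = rnn_step(x, s)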

Testing the model (a sampling function that uses the model to generate a sequence)

  • sample function

    def sample(h, seed_ix, n): # generate a sequence of character indices
      # h is the hidden state, i.e. the information left by the previous time step
      # vocab_size is the number of distinct characters; characters are one-hot encoded:
      # each character corresponds to one position in x (1 = this character, 0 = not)
      x = np.zeros((vocab_size, 1))
      x[seed_ix] = 1 # seed_ix is the index of the starting character
      print("seed_ix:%s" % seed_ix)
      ixes = []
      for t in range(n): # generate n indices in total
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)      # update the hidden state
        y = np.dot(Why, h) + by                                # unnormalized scores for the next character
        p = np.exp(y) / np.sum(np.exp(y))                      # softmax: turn scores into probabilities
        ix = np.random.choice(range(vocab_size), p=p.ravel())  # sample the next character index
        x = np.zeros((vocab_size, 1))
        x[ix] = 1                                              # the sampled character becomes the next input
        ixes.append(ix)
      return ixes
    
    • Purpose: given the index of a starting character, use the current RNN model to generate a whole sequence of characters, and return the list of character indices in that sequence

    • Input

      • h: the hidden state, i.e. the information left by the previous time step
      • seed_ix: the index of the character to start generating from
      • n: the length of the sequence to generate (the number of character indices)
    • Analysis 1

        x = np.zeros((vocab_size, 1))
        x[seed_ix] = 1 # build the one-hot vector for the given character index
        print("seed_ix:%s" % seed_ix)
        ixes = []
      
      • vocab_size is the number of distinct characters. Characters are vectorized: each character corresponds to one position in x, and a 1 at that position means the character is present while a 0 means it is not
      • This is one-hot encoding
    • Analysis 2

       for t in range(n): # generate n indices in total
          h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)      # update the hidden state
          y = np.dot(Why, h) + by                                # unnormalized scores for the next character
          p = np.exp(y) / np.sum(np.exp(y))                      # softmax: turn scores into probabilities
          ix = np.random.choice(range(vocab_size), p=p.ravel())  # sample the next character index
          x = np.zeros((vocab_size, 1))
          x[ix] = 1                                              # the sampled character becomes the next input
          ixes.append(ix)                                        # store it in the ixes list
      
      • h is the hidden state

      • Note: h is different at each time step t; it is updated here on every iteration

      • y: the score vector; each entry is the score of the character at that index

      • p: softmax turns the scores into probabilities

      • ix: an index sampled according to the probabilities in p

      • x is then reset to the one-hot encoding of ix, so the sampled character becomes the input of the next time step

      • Together, these steps implement the generation process described above; a usage sketch follows this list
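
To see how sample behaves end to end, here is a hedged usage sketch, assuming the sample function above has been defined, with a toy 5-character vocabulary and random, untrained weights. The weight names and shapes mirror the globals the function expects, but in the real script they are created by the setup and training code (see the sketch at the end of this article); before training, the output is essentially random.

    import numpy as np

    # Toy globals that sample() relies on (assumption: in the real script these come from the setup/training code)
    vocab_size, hidden_size = 5, 16
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
    bh = np.zeros((hidden_size, 1))                         # hidden bias
    by = np.zeros((vocab_size, 1))                          # output bias

    h0 = np.zeros((hidden_size, 1))   # start with an empty hidden state
    ixes = sample(h0, 0, 10)          # generate 10 indices, starting from character index 0
    print(ixes)                       # meaningless before training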

lossFun (loss function)

  • lossFun

    • The forward process is explained in detail here; the gradient calculation in the BP pass is not derived
    def lossFun(inputs, targets, hprev):
      """
      inputs,targets are both list of integers.
      hprev is Hx1 array of initial hidden state
      returns the loss, gradients on model parameters, and last hidden state
      """
      xs, hs, ys, ps = {}, {}, {}, {}
      hs[-1] = np.copy(hprev)
      loss = 0
      # forward pass
      for t in range(len(inputs)): # inputs is a list of character indices (ints)
        # Encode elements as vectors
        xs[t] = np.zeros((vocab_size,1)) # Initialize to 0 vector
        # print("input: %s" % inputs[t])
        # print("type: %s" % type(inputs[t]))
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
           
        
      # BP process, calculating gradient
      dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
      dbh, dby = np.zeros_like(bh), np.zeros_like(by)
      dhnext = np.zeros_like(hs[0])
      for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(Whh.T, dhraw)
      for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
      return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
    
    • Trains the model on a string of characters and returns the loss, the gradients of all parameters, and the last hidden state

    • Input:

      • inputs: the indices of the training input characters, i.e. [a, b, c, d] in the figure above
      • targets: the correct outputs the model should produce for inputs, i.e. [a', b', c', d'] in the figure above
      • hprev: an H-by-1 vector representing the initial hidden state for the current chunk (H is the number of hidden units)
    • Analysis 1

        xs, hs, ys, ps = {}, {}, {}, {}
        hs[-1] = np.copy(hprev)
        loss = 0
      
      • Each input character trains the model at a different time step (the 1st, 2nd, ... step; in the code the index t starts at 0), e.g.:
        • inputs = [a , b , c , d]
        • a trains the model at t = 1
        • b trains the model at t = 2
        • c trains the model at t = 3
        • d trains the model at t = 4
      • This snippet only does initialization: hs[-1] stores the initial hidden state hprev, and loss starts at 0
    • Analysis 2

      for t in range(len(inputs)): # inputs is a list of character indices (ints)
          # Encode elements as vectors
          xs[t] = np.zeros((vocab_size,1)) # Initialize to 0 vector
          xs[t][inputs[t]] = 1
          hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
          ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
          ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
          loss += -np.log(ps[t][targets[t],0]) # cross-entropy: the higher the probability of the correct character, the smaller the loss
      
      • Note: while iterating over inputs, the loop index is used as the time step; hs[t-1] always holds the hidden state of the previous time step
      • Iterate over each input character
      • Build the one-hot encoding of the character
      • Compute the hidden state of the current time step
      • Note: the parameter matrices are shared across all time steps; essentially a single model is trained, only the input character and the hidden state differ
      • Compute the output (score vector) of each time step, i.e. the output corresponding to each character
      • Turn the score vector into a probability vector with softmax
      • Accumulate the loss produced by each character
  • main function

    • Within one training pass, the parameter matrices are the same at every time step, because in essence a single model is trained over time
    • The model is trained on seq_length characters of data at a time
    • The detailed process is explained in the comments below
    n, p = 0, 0
    mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
    smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
    
    while True:
      if p + seq_length + 1 >= len(data) or n == 0: # at the first iteration, or when we run out of data
        hprev = np.zeros((hidden_size,1)) # reset the hidden state
        p = 0 # go back to the start of the input data

      # targets are the inputs shifted one character to the right
      inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]] # a list of character indices
      targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]] # a list of character indices

      # sample from the model every 100 iterations just to check its progress
      if n % 100 == 0:
        sample_ix = sample(hprev, inputs[0], 200) # index sequence of the sampled characters
        txt = ''.join(ix_to_char[ix] for ix in sample_ix) # convert the indices back to characters
        print ('----\n %s \n----' % (txt, ))

      # compute the loss and the gradients once on this seq_length chunk of data
      loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev) # get loss and gradients
      smooth_loss = smooth_loss * 0.999 + loss * 0.001
      if n % 100 == 0: print ('iter %d, loss: %f' % (n, smooth_loss)) # print the loss every 100 iterations
      
      # parameter update (Adagrad)
      for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                    [dWxh, dWhh, dWhy, dbh, dby], 
                                    [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam # accumulate the squared gradients elementwise (Adagrad memory)
        
        # the accumulated memory shrinks the effective step size over time
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

      p += seq_length # move p to the start of the next seq_length chunk
      n += 1 # iteration counter 
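
For reference, here is a minimal sketch of the kind of setup all the snippets above rely on: the globals data, vocab_size, char_to_ix, ix_to_char, hidden_size, seq_length, learning_rate and the weight matrices. The hyperparameter values and the 'input.txt' filename are assumptions in the spirit of a character-level RNN script; adjust them to your own data.

    import numpy as np

    # data I/O: read a plain-text training corpus ('input.txt' is an assumed filename)
    data = open('input.txt', 'r').read()
    chars = list(set(data))                              # the distinct characters in the data
    data_size, vocab_size = len(data), len(chars)
    char_to_ix = {ch: i for i, ch in enumerate(chars)}   # character -> index
    ix_to_char = {i: ch for i, ch in enumerate(chars)}   # index -> character

    # hyperparameters (assumed values)
    hidden_size = 100     # number of hidden units
    seq_length = 25       # number of characters per training chunk
    learning_rate = 1e-1

    # model parameters, shared by sample(), lossFun() and the training loop
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
    bh = np.zeros((hidden_size, 1))                         # hidden bias
    by = np.zeros((vocab_size, 1))                          # output bias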
    

Topics: AI neural networks rnn