Implementing statistical word segmentation with a hidden Markov model (NLP)

1, HMM word segmentation idea

An HMM treats word segmentation as a sequence labeling task over the characters of a string.
The basic idea is that every character occupies a definite position within the word it helps to form. Each character can take one of only four positions: B (beginning of a word), M (middle of a word), E (end of a word), or S (a single-character word).

2, HMM model construction

1. Model state set

$Q = \{B, M, E, S\}$, $N = 4$

2. Observation state set

$V = \{\text{I}, \text{love}, \ldots\}$, the set of all characters (including punctuation) that appear in the corpus sentences.

3. Observation sequence and state sequence

Observation sequence: 小明是中国人 ("Xiao Ming is Chinese", six characters)
State sequence: B, E, S, B, M, E
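
The mapping from a segmented sentence to its state sequence can be sketched as follows. The helper name `words_to_states` is made up for this illustration, and the sketch assumes the sentence has already been split into words.

```python
def words_to_states(words):
    """Return one B/M/E/S tag per character for a list of segmented words."""
    states = []
    for word in words:
        if len(word) == 1:
            states.append('S')  # single-character word
        else:
            # first character, any middle characters, last character
            states.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return states


# "小明 是 中国人" ("Xiao Ming is Chinese"), already segmented
print(words_to_states(['小明', '是', '中国人']))
# ['B', 'E', 'S', 'B', 'M', 'E']
```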

4. State transition probability distribution matrix

In Chinese word segmentation, this matrix gives the probability of moving from one state in $Q = \{B, M, E, S\}$ to the next. It is estimated from the corpus during training.

5. Observation probability matrix (emission probability)

In Chinese word segmentation, the emission probability is the probability that a given state in $Q = \{B, M, E, S\}$ emits a particular character. It is obtained by counting, over the training set, how often each character appears in each state.

6. Initial probability

In Chinese word segmentation, the initial state probability is the probability of each state for the first character of a sentence:
{B: xxx, M: xxx, E: xxx, S: xxx}
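
As a rough illustration of how the three parameter sets come out of counting, here is a minimal sketch on two toy sentences. The function name `estimate`, the placeholder characters, and the choice to normalize each row by its own total are assumptions of this example, not the article's code.

```python
from collections import Counter, defaultdict


def estimate(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (char, state) pairs."""
    pi = Counter()                # how often each state starts a sentence
    trans = defaultdict(Counter)  # trans[prev][cur]: transition counts
    emit = defaultdict(Counter)   # emit[state][char]: emission counts
    for sent in tagged_sentences:
        prev = None
        for char, state in sent:
            emit[state][char] += 1
            if prev is None:
                pi[state] += 1
            else:
                trans[prev][state] += 1
            prev = state

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(pi),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})


# Toy data: "ab c" and "d ef", with placeholder characters
data = [
    [('a', 'B'), ('b', 'E'), ('c', 'S')],
    [('d', 'S'), ('e', 'B'), ('f', 'E')],
]
start_p, trans_p, emit_p = estimate(data)
print(start_p)       # {'B': 0.5, 'S': 0.5}
print(trans_p['E'])  # {'S': 1.0}
print(emit_p['B'])   # {'a': 0.5, 'e': 0.5}
```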

7. Objective

$\max P(i_1, i_2, \ldots, i_T \mid o_1, o_2, \ldots, o_T)$
where $T$ is the length of the sentence, $o_t$ is the $t$-th character of the sentence, and $i_t$ is the state (tag) of that character.
By Bayes' formula:

$P(i \mid o) = P(o \mid i) P(i) / P(o)$
Since $P(o)$ is fixed for a given sentence, it is enough to maximize $P(o \mid i) P(i)$.
Under the homogeneous Markov assumption of the HMM:
$P(i) = p(i_1) p(i_2 \mid i_1) \cdots p(i_T \mid i_{T-1})$, the product of the initial probability and the state transition probabilities.
Under the observation independence assumption:
$P(o \mid i) = p(o_1 \mid i_1) \cdots p(o_T \mid i_T)$, the product of the emission probabilities.

The goal is therefore the state sequence that maximizes $P(i) P(o \mid i)$.
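
To make the objective concrete, the sketch below scores a candidate state sequence as the product of its initial, transition, and emission terms, then compares two candidate segmentations. The function name `joint_probability` and all numbers are invented for illustration.

```python
def joint_probability(chars, states, start_p, trans_p, emit_p):
    """Probability of emitting `chars` along the state path `states`."""
    prob = start_p[states[0]] * emit_p[states[0]].get(chars[0], 0.0)
    for t in range(1, len(chars)):
        prob *= trans_p[states[t - 1]].get(states[t], 0.0)  # state transition
        prob *= emit_p[states[t]].get(chars[t], 0.0)        # emission
    return prob


# Toy parameters (invented for illustration)
start_p = {'B': 0.6, 'M': 0.0, 'E': 0.0, 'S': 0.4}
trans_p = {
    'B': {'M': 0.2, 'E': 0.8},
    'M': {'M': 0.3, 'E': 0.7},
    'E': {'B': 0.5, 'S': 0.5},
    'S': {'B': 0.5, 'S': 0.5},
}
emit_p = {
    'B': {'x': 0.5, 'y': 0.1},
    'M': {'x': 0.1, 'y': 0.1},
    'E': {'x': 0.1, 'y': 0.6},
    'S': {'x': 0.2, 'y': 0.2},
}

chars = 'xy'
as_one_word = joint_probability(chars, 'BE', start_p, trans_p, emit_p)
as_two_words = joint_probability(chars, 'SS', start_p, trans_p, emit_p)
print(as_one_word > as_two_words)  # True
```

With these numbers the path B, E (one two-character word) scores about 0.144 and the path S, S scores about 0.008, so the first segmentation would be preferred.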

3, Corpus

The People's Daily corpus is used. Each line is one sentence, and the words in it are separated by spaces.

4, Python code implementation

1. Initialization class

import pickle


class HMM(object):

    def __init__(self):
        # File that stores the trained parameters, so the model does not have to be retrained on every run
        self.model_file = './data/hmm_model.pkl'
        # Set of state values
        self.state_list = ['B', 'M', 'E', 'S']
        # Whether the parameters have already been loaded from model_file
        self.load_para = False
        # Number of occurrences of each state, used as the denominator when normalizing counts
        self.Count_dic = {}
        # Number of lines (sentences) in the training corpus
        self.line_num = 0

2. Decide whether to retrain

# Method of the HMM class
def try_load_model(self, trained):
    """
    Load previously computed parameters, or reset them when the model has to be (re)trained.
    :param trained: True if a trained model has been saved to self.model_file
    """
    if trained:
        with open(self.model_file, 'rb') as f:
            self.A_dic = pickle.load(f)
            self.B_dic = pickle.load(f)
            self.Pi_dic = pickle.load(f)
            self.load_para = True
    else:
        # State transition probability (state -> state)
        self.A_dic = {}
        # Emission probability (state -> character)
        self.B_dic = {}
        # Initial state probability
        self.Pi_dic = {}
        self.load_para = False
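
The loader relies on the fact that several objects written to one file with successive `pickle.dump` calls come back in the same order from successive `pickle.load` calls. A small round-trip sketch, using a throwaway temporary file and toy dictionaries that are assumptions of this example:

```python
import os
import pickle
import tempfile

A = {'B': {'E': 1.0}}
B = {'B': {'x': 0.5}}
Pi = {'B': 0.6, 'S': 0.4}

path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')  # throwaway file

with open(path, 'wb') as f:
    for obj in (A, B, Pi):  # three objects, one after another
        pickle.dump(obj, f)

with open(path, 'rb') as f:
    loaded = [pickle.load(f) for _ in range(3)]  # same order as written

print(loaded == [A, B, Pi])  # True
```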

3. Initialization parameters

# Method of the HMM class
def init_parameters(self):
    """Initialize all counts to zero before training."""
    for state in self.state_list:
        self.A_dic[state] = {s: 0.0 for s in self.state_list}
        self.Pi_dic[state] = 0.0
        self.B_dic[state] = {}
        self.Count_dic[state] = 0

4. Tagging the input sentences

def make_label(text):
    """
    Tag the characters of one word with B, M, E, S.
    B: beginning of a word
    M: middle of a word
    E: end of a word
    S: single-character word
    :param text: one word
    """
    out_text = []
    if len(text) == 1:
        out_text.append('S')
    else:
        out_text += ['B'] + ['M'] * (len(text) - 2) + ['E']

    return out_text

5. Training

def train(hmm, path):
    # Observation set: every character (including punctuation) seen in the corpus
    words = set()
    line_num = 0
    with open(path, encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            line_num += 1

            # Characters of the line with whitespace removed; update the observation set
            word_list = [i for i in line if not i.isspace()]
            words |= set(word_list)

            # Split the line on whitespace to recover the segmented words
            line_list = line.split()
            line_state = []
            for w in line_list:
                line_state.extend(make_label(w))  # B/M/E/S tags of each word

            assert len(word_list) == len(line_state)

            # e.g. line_state = ['B', 'M', 'M', 'M', 'E', 'S']
            for k, v in enumerate(line_state):
                hmm.Count_dic[v] += 1  # number of occurrences of each state
                if k == 0:
                    # State of the first character of the sentence, for the initial state probability
                    hmm.Pi_dic[v] += 1
                else:
                    # Transition count: previous state 'B', current state 'M' -> A_dic['B']['M'] += 1
                    hmm.A_dic[line_state[k - 1]][v] += 1
                # Emission count for every character, including the first one:
                # e.g. B_dic['M']['9'] += 1 for the second character of '1986'
                hmm.B_dic[v][word_list[k]] = hmm.B_dic[v].get(word_list[k], 0) + 1.0

    hmm.line_num = line_num
    # A_dic
    # {'B': {'B': 0.0,      'M': 162066.0, 'E': 1226466.0, 'S': 0.0},
    #  'M': {'B': 0.0,      'M': 62332.0,  'E': 162066.0,  'S': 0.0},
    #  'E': {'B': 651128.0, 'M': 0.0,      'E': 0.0,       'S': 737404.0},
    #  'S': {'B': 563988.0, 'M': 0.0,      'E': 0.0,       'S': 747969.0}
    #  }

    # B_dic
    # {'B': {'medium': 12812.0, 'son': 464.0, 'step': 62.0},
    #  'M': {'medium': 12812.0, 'son': 464.0, 'step': 62.0},
    #  'E': {'medium': 12812.0, 'son': 464.0, 'step': 62.0},
    #  'S': {'medium': 12812.0, 'son': 464.0, 'step': 62.0},
    # }

    # Count_dic: {'B': 1388532, 'M': 224398, 'E': 1388532, 'S': 1609916}

# Convert the counts into probabilities and save the model
def calculate_probability(hmm):

    # Initial state probability
    hmm.Pi_dic = {k: v * 1.0 / hmm.line_num for k, v in hmm.Pi_dic.items()}
    # State transition probability
    hmm.A_dic = {k: {k1: v1 / hmm.Count_dic[k] for k1, v1 in v.items()} for k, v in hmm.A_dic.items()}
    # Emission probability, with 1 added to every observed count as a simple form of smoothing
    hmm.B_dic = {k: {k1: (v1 + 1) / hmm.Count_dic[k] for k1, v1 in v.items()} for k, v in hmm.B_dic.items()}

    with open(hmm.model_file, 'wb') as f:
        pickle.dump(hmm.A_dic, f)
        pickle.dump(hmm.B_dic, f)
        pickle.dump(hmm.Pi_dic, f)
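
The smoothing step above adds 1 to each observed count but leaves the denominator unchanged and assigns nothing to characters never seen in a state. For comparison only, here is a minimal sketch of textbook add-one (Laplace) smoothing, which also enlarges the denominator by the vocabulary size so that each row still sums to 1. The function name and toy counts are assumptions of this example, not part of the article's code.

```python
def add_one_smoothing(counts, vocabulary):
    """Laplace-smoothed emission distribution for one state.

    counts: {char: count} observed for this state
    vocabulary: every character the model should know about
    """
    total = sum(counts.values()) + len(vocabulary)
    return {ch: (counts.get(ch, 0) + 1) / total for ch in vocabulary}


counts_for_B = {'x': 3, 'y': 1}  # toy counts for state B
vocab = ['x', 'y', 'z']          # 'z' was never seen in state B
probs = add_one_smoothing(counts_for_B, vocab)
# probs is x: 4/7, y: 2/7, z: 1/7
print(abs(sum(probs.values()) - 1.0) < 1e-12)  # True
```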

6. Viterbi tagging, then segmentation from the tags

# Method of the HMM class
def viterbi(self, text, states, start_p, trans_p, emit_p):
    V = [{}]   # V[t][y]: probability of the best path that ends in state y at position t
    path = {}  # path[y]: the best path that ends in state y
    for y in states:
        V[0][y] = start_p[y] * emit_p[y].get(text[0], 0)
        path[y] = [y]
    for t in range(1, len(text)):
        V.append({})
        newpath = {}

        # Check whether this character appears anywhere in the trained emission probabilities
        neverSeen = text[t] not in emit_p['S'] and \
                    text[t] not in emit_p['M'] and \
                    text[t] not in emit_p['E'] and \
                    text[t] not in emit_p['B']
        for y in states:
            # An unseen character gets emission probability 1.0 so that it does not zero out every path
            emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0
            (prob, state) = max(
                [(V[t - 1][y0] * trans_p[y0].get(y, 0) *
                  emitP, y0)
                 for y0 in states if V[t - 1][y0] > 0])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath

    # Choose the final state
    if emit_p['M'].get(text[-1], 0) > emit_p['S'].get(text[-1], 0):
        (prob, state) = max([(V[len(text) - 1][y], y) for y in ('E', 'M')])
    else:
        (prob, state) = max([(V[len(text) - 1][y], y) for y in states])

    return prob, path[state]
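
Multiplying many probabilities can underflow on long sentences. A common alternative is to run the same dynamic program on log probabilities and add instead of multiply. The sketch below is a generic log-space Viterbi with back-pointers, not the article's method: the function name `viterbi_log` and the toy parameters are assumptions, and it omits the unseen-character handling shown above.

```python
import math


def viterbi_log(chars, states, start_p, trans_p, emit_p):
    """Most likely state path, computed with log probabilities."""
    def log(p):
        return math.log(p) if p > 0 else float('-inf')

    # best[t][s]: best log score of any path ending in state s at position t
    best = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(chars[0], 0))
             for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(chars)):
        best.append({})
        back.append({})
        for s in states:
            score, prev = max(
                (best[t - 1][p] + log(trans_p[p].get(s, 0)), p)
                for p in states)
            best[t][s] = score + log(emit_p[s].get(chars[t], 0))
            back[t][s] = prev
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(chars) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return best[-1][last], path[::-1]


# Toy parameters (invented for illustration)
states = ['B', 'M', 'E', 'S']
start_p = {'B': 0.6, 'M': 0.0, 'E': 0.0, 'S': 0.4}
trans_p = {
    'B': {'M': 0.2, 'E': 0.8},
    'M': {'M': 0.3, 'E': 0.7},
    'E': {'B': 0.5, 'S': 0.5},
    'S': {'B': 0.5, 'S': 0.5},
}
emit_p = {
    'B': {'x': 0.5, 'y': 0.1},
    'M': {'x': 0.1, 'y': 0.1},
    'E': {'x': 0.1, 'y': 0.6},
    'S': {'x': 0.2, 'y': 0.2},
}

score, path = viterbi_log('xy', states, start_p, trans_p, emit_p)
print(path)  # ['B', 'E']
```

Storing one back-pointer per state and position, rather than a full path per state, keeps memory proportional to the sentence length times the number of states.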

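# Method of the HMM class
def cut(self, text):
    import os
    if not self.load_para:
        # Load the saved parameters if a model file exists
        self.try_load_model(os.path.exists(self.model_file))
    prob, pos_list = self.viterbi(text, self.state_list, self.Pi_dic, self.A_dic, self.B_dic)
    begin, nxt = 0, 0
    for i, char in enumerate(text):
        pos = pos_list[i]
        if pos == 'B':
            begin = i
        elif pos == 'E':
            yield text[begin: i + 1]
            nxt = i + 1
        elif pos == 'S':
            yield char
            nxt = i + 1
    # Emit whatever is left if the tag sequence ends inside a word
    if nxt < len(text):
        yield text[nxt:]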
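
The step from tags back to words can be tried in isolation. The sketch below is a simplified stand-alone variant that closes a word at every E or S tag. The name `states_to_words` is hypothetical, and the tags are written by hand rather than predicted by a model.

```python
def states_to_words(text, states):
    """Split `text` into words according to one B/M/E/S tag per character."""
    words, start = [], 0
    for i, state in enumerate(states):
        if state in ('E', 'S'):  # a word ends at this character
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):        # tag sequence ended inside a word
        words.append(text[start:])
    return words


print(states_to_words('小明是中国人', ['B', 'E', 'S', 'B', 'M', 'E']))
# ['小明', '是', '中国人']
```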

7. Test

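if __name__ == '__main__':
    hmm = HMM()

    # Reset the parameters and initialize the count tables
    hmm.try_load_model(False)
    hmm.init_parameters()

    # print(hmm.A_dic)
    # print(hmm.B_dic)
    # print(hmm.Pi_dic)

    train(hmm, './data/trainCorpus.txt_utf8')
    calculate_probability(hmm)  # normalize the counts and save the model

    # Chinese for "This is a great plan!"; the model segments Chinese text
    text = '这是一个非常棒的方案！'
    res = hmm.cut(text)
    print(list(res))  # cut() is a generator, so collect the words before printing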

8. Results

