2021SC@SDUSC
Continuing the analysis of the part-of-speech tagging code in posseg.
The viterbi function is implemented in jieba/posseg/viterbi.py. It accepts the following parameters:
(1) obs is the observation sequence, i.e., the sentence to be tagged.
(2) states is the set of possible states for each character, defined in jieba/posseg/char_state_tab.py in the format shown in the code snippet below. It means the possible states of the character '\u4e00' include ('B', 'm'), where 'B' indicates that the character begins a word and 'm' indicates that its part of speech is a numeral, and so on.
P={'\u4e00': (('B', 'm'), ('S', 'm'), ('B', 'd'), ('B', 'a'), ('M', 'm'), ('B', 'n'), ······ }
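As a small sketch of how this table is used (the dict below is a toy stand-in for the real char_state_tab.py data, and all_states here is a made-up fallback set), a known character is restricted to its listed states, while an unknown character falls back to every state in the model:

```python
# Toy stand-in for the P dict in jieba/posseg/char_state_tab.py (assumption, not the real table)
char_state_tab_P = {
    '\u4e00': (('B', 'm'), ('S', 'm'), ('B', 'd'), ('B', 'a'), ('M', 'm'), ('B', 'n')),
}

# Hypothetical fallback: the set of all states known to the model
all_states = [('B', 'a'), ('E', 'a'), ('M', 'a'), ('S', 'a')]

# Known character: candidate states come from the table
print(char_state_tab_P.get('\u4e00', all_states)[:2])   # (('B', 'm'), ('S', 'm'))

# Unknown character: dict.get falls back to all_states
print(char_state_tab_P.get('\u4eba', all_states) == all_states)   # True
```

This is exactly the states.get(obs[t], all_states) pattern used inside the viterbi function.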
(3) start_p is the initial state probability, defined in jieba/posseg/prob_start.py in the format shown in the code snippet below. For example, ('B', 'a') means the character begins a word whose part of speech is an adjective, with a log-probability of -4.762305214596967, and so on.
P={('B', 'a'): -4.762305214596967, ('B', 'ad'): -6.680066036784177, ('B', 'ag'): -3.14e+100, ('B', 'an'): -8.697083223018778, ('B', 'b'): -5.018374362109218, ······ }
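A brief note on why the values are logarithms: multiplying many small probabilities underflows, so the tables store natural logs and the algorithm adds them instead. A minimal sketch, using only the two values quoted in the snippet above:

```python
import math

# Log-probabilities taken from the prob_start.py snippet above
log_p_Ba = -4.762305214596967   # initial state ('B', 'a')
impossible = -3.14e+100         # ('B', 'ag'): effectively an impossible start state

# Recover the raw probability from its natural log
print(math.exp(log_p_Ba))       # a small probability, roughly 0.0086

# Adding logs is the same as multiplying probabilities
a, b = 0.5, 0.25
print(math.isclose(math.log(a) + math.log(b), math.log(a * b)))   # True

# The huge negative sentinel underflows to probability 0,
# so such paths can never win a max() comparison
print(math.exp(impossible))     # 0.0
```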
(4) trans_p is the state transition probability, defined in jieba/posseg/prob_trans.py in the format shown in the code snippet below. For example, if the state at the previous moment is ('B', 'a') (the character begins a word whose part of speech is an adjective) and the state at the current moment is ('E', 'a') (the character ends a word whose part of speech is an adjective), the log transition probability is -0.0050648453069648755, and so on.
P={('B', 'a'): {('E', 'a'): -0.0050648453069648755, ('M', 'a'): -5.287963037107507}, ('B', 'ad'): {('E', 'ad'): -0.0007479013978476627, ('M', 'ad'): -7.198613337130562}, ······ }
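Note that the inner dict only lists reachable transitions: from 'B' (word begin) a path can only continue to 'M' or 'E'. Absent entries are treated as impossible by looking them up with a floor default, the same trans_p[y0].get(y, MIN_INF) idiom that appears in the function. A sketch using the excerpt above (MIN_INF here is an assumption matching its role in the code):

```python
MIN_INF = float('-inf')   # floor for transitions that are not listed

# Excerpt of the P dict from the prob_trans.py snippet above
trans_p = {
    ('B', 'a'): {('E', 'a'): -0.0050648453069648755,
                 ('M', 'a'): -5.287963037107507},
}

# A listed transition returns its log-probability
print(trans_p[('B', 'a')].get(('E', 'a'), MIN_INF))   # -0.0050648453069648755

# 'B' -> 'S' is not listed, so it falls through to the MIN_INF floor
# and can never be chosen by max()
print(trans_p[('B', 'a')].get(('S', 'a'), MIN_INF))   # -inf
```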
(5) emit_p is the state emission probability, defined in jieba/posseg/prob_emit.py in the format shown in the code snippet below. For example, given the current state ('B', 'a') (the character begins a word whose part of speech is an adjective), the log-probability of emitting the character '\u4e00' is -3.618715666782108, and so on.
P={('B', 'a'): {'\u4e00': -3.618715666782108, '\u4e07': -10.500566885381515, '\u4e0a': -8.541143017159477, '\u4e0b': -8.445222895280738, ······ }
The viterbi function first computes the log-probability of each possible initial state, then proceeds recursively over the remaining characters. At each moment it collects the set of states reached at the previous moment, pre-computes the states reachable from them via the transition matrix, takes the set of states allowed by the current observation, and intersects the two. The log-probability of each candidate state at the current moment is the sum of the log-probability of the best previous state, the log transition probability from that state to the current one, and the log emission probability of the current character in the current state. Finally, the optimal path is recovered by backtracking along the maximum-probability path, yielding the required state at each moment.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]  # tabular
    mem_path = [{}]
    # Get all possible states from the state transition matrix
    all_states = trans_p.keys()
    # Moment t = 0: initial states
    for y in states.get(obs[0], all_states):  # init
        V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
        mem_path[0][y] = ''
    # Moments t = 1, ..., len(obs) - 1
    for t in xrange(1, len(obs)):
        V.append({})
        mem_path.append({})
        #prev_states = get_top_states(V[t-1])
        # Get the set of states from the previous moment
        prev_states = [
            x for x in mem_path[t - 1].keys() if len(trans_p[x]) > 0]
        # Pre-compute the candidate states of the current moment from the
        # previous states and the state transition matrix
        prev_states_expect_next = set(
            (y for x in prev_states for y in trans_p[x].keys()))
        # States allowed by the current observation, intersected with the
        # candidate set computed above
        obs_states = set(
            states.get(obs[t], all_states)) & prev_states_expect_next
        # If the intersection is empty
        if not obs_states:
            # Fall back to the pre-computed candidate set if it is non-empty,
            # otherwise to the set of all possible states
            obs_states = prev_states_expect_next if prev_states_expect_next else all_states
        # For each possible state at the current moment
        for y in obs_states:
            # Sum the log-probability of the previous state, the log transition
            # probability from it to the current state, and the log emission
            # probability of the current character; keep the best previous state
            prob, state = max((V[t - 1][y0] + trans_p[y0].get(y, MIN_INF) +
                               emit_p[y].get(obs[t], MIN_FLOAT), y0)
                              for y0 in prev_states)
            V[t][y] = prob
            mem_path[t][y] = state
    # Last moment
    last = [(V[-1][y], y) for y in mem_path[-1].keys()]
    # if len(last)==0:
    #     print obs
    prob, state = max(last)
    # From t = len(obs) - 1 down to 0, record the state on the
    # maximum-probability path at each moment
    route = [None] * len(obs)
    i = len(obs) - 1
    while i >= 0:
        route[i] = state
        state = mem_path[i][state]
        i -= 1
    # Return the maximum probability and the state at each moment
    return (prob, route)
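To see the function run end to end, here is a self-contained Python 3 sketch: the function above ported from Python 2 (xrange becomes range), fed with made-up two-state tables for a toy two-character word. All table names and numbers here are assumptions for illustration, not jieba's real models:

```python
import math

MIN_FLOAT = -3.14e100    # floor for unseen emissions (same role as in jieba)
MIN_INF = float('-inf')  # floor for unlisted transitions

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Python 3 port of the function above for demonstration."""
    V = [{}]
    mem_path = [{}]
    all_states = trans_p.keys()
    for y in states.get(obs[0], all_states):
        V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
        mem_path[0][y] = ''
    for t in range(1, len(obs)):
        V.append({})
        mem_path.append({})
        prev_states = [x for x in mem_path[t - 1] if len(trans_p[x]) > 0]
        prev_states_expect_next = set(
            y for x in prev_states for y in trans_p[x])
        obs_states = set(
            states.get(obs[t], all_states)) & prev_states_expect_next
        if not obs_states:
            obs_states = prev_states_expect_next or all_states
        for y in obs_states:
            prob, state = max((V[t - 1][y0] + trans_p[y0].get(y, MIN_INF) +
                               emit_p[y].get(obs[t], MIN_FLOAT), y0)
                              for y0 in prev_states)
            V[t][y] = prob
            mem_path[t][y] = state
    prob, state = max((V[-1][y], y) for y in mem_path[-1])
    route = [None] * len(obs)
    for i in range(len(obs) - 1, -1, -1):
        route[i] = state
        state = mem_path[i][state]
    return prob, route

# Toy two-character "sentence": one two-character noun (hypothetical tables)
B, E = ('B', 'n'), ('E', 'n')   # begin / end of a two-character noun
states = {'中': (B,), '国': (E,)}
start_p = {B: math.log(0.9), E: math.log(0.1)}
trans_p = {B: {E: math.log(1.0)}, E: {B: math.log(1.0)}}
emit_p = {B: {'中': math.log(0.8)}, E: {'国': math.log(0.7)}}

prob, route = viterbi('中国', states, start_p, trans_p, emit_p)
print(route)   # [('B', 'n'), ('E', 'n')]
```

The returned route assigns 'B' to the first character and 'E' to the second, both tagged as a noun, and prob is the log-probability of that single best path.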