Visualization and graph analysis of dependency parsing in natural language processing (NLP)

Posted by Quicksilver_0 on Thu, 10 Feb 2022 16:27:47 +0100

Visualization and graph analysis of dependency parsing in natural language processing (NLP)

Although the effect of dependency parsing is not as good as that of word segmentation and NER, it also has its use value. In our daily work, we have to deal with it. How to analyze the results of dependency parsing, an important aspect is its visualization and its graph analysis.

The NLP tools we use are jieba and LTP. jieba is used for word segmentation and LTP is used for part of speech tagging and syntactic analysis. You need to download pos.model and parser Model file.

The example sentences used in this article are:

On July 26, 2018, Ren Zhengfei, founder of Huawei, held an award ceremony to Professor Eldar, the father of 5G Polar code, in recognition of his contribution to the communication field.

First, let's take a look at the parsing results before there is no visualization effect. Python code is as follows:

import os
import jieba
from pyltp import Postagger, Parser

sent = '2018 On July 26, 2006, Ren Zhengfei, founder of Huawei, told 5 G Polarization code( Polar Professor Eldar, the father of code), held an award ceremony to recognize his contribution to the field of communication.'

jieba.add_word('Polar code')
jieba.add_word('5G Polarization code')
jieba.add_word('Eldar')
jieba.add_word('Father of')
words = list(jieba.cut(sent))

print(words)

# Part of speech tagging
pos_model_path = os.path.join(os.path.dirname(__file__), 'pos.model')
postagger = Postagger()
postagger.load(pos_model_path)
postags = postagger.postag(words)

# dependency parsing 
par_model_path = os.path.join(os.path.dirname(__file__), 'parser.model')
parser = Parser()
parser.load(par_model_path)
arcs = parser.parse(words, postags)

rely_id = [arc.head for arc in arcs]  # Extract dependent parent node id
relation = [arc.relation for arc in arcs]  # Extract dependencies
heads = ['Root' if id == 0 else words[id - 1] for id in rely_id]  # Match dependent parent node words

for i in range(len(words)):
    print(relation[i] + '(' + words[i] + ', ' + heads[i] + ')')

Operation results:

['2018', 'year', '7', 'month', '26', 'day', ',', 'Huawei', 'founder', 'Ren Zhengfei', 'towards', '5G Polarization code', '(', 'Polar code', ')', 'Father of', 'Eldar', 'professor', 'hold', 'awarding ceremony', ',', 'Commend', 'his', 'about', 'signal communication', 'field', 'do', 'of', 'contribution', '. ']
ATT(2018, year)
ATT(year, day)
ATT(7, month)
ATT(month, day)
ATT(26, day)
ADV(day, hold)
WP(,, day)
ATT(Huawei, founder)
ATT(founder, Ren Zhengfei)
SBV(Ren Zhengfei, hold)
ADV(towards, hold)
ATT(5G Polarization code, Father of)
WP((, Polar code)
COO(Polar code, 5G Polarization code)
WP(), Polar code)
ATT(Father of, Eldar)
ATT(Eldar, professor)
POB(professor, towards)
HED(hold, Root)
VOB(awarding ceremony, hold)
WP(,, hold)
COO(Commend, hold)
ATT(his, contribution)
ADV(about, do)
ATT(signal communication, field)
POB(field, about)
ATT(do, contribution)
RAD(of, do)
VOB(contribution, Commend)
WP(. , hold)

We get the result of dependency parsing of the sentence, but its visualization effect is not good.
We use the above syntax analysis tool (graph8195z) to get the following results:

from graphviz import Digraph

g = Digraph('Test picture')

g.node(name='Root')
for word in words:
    g.node(name=word)

for i in range(len(words)):
    if relation[i] not in ['HED']:
        g.edge(words[i], heads[i], label=relation[i])
    else:
        if heads[i] == 'Root':
            g.edge(words[i], 'Root', label=relation[i])
        else:
            g.edge(heads[i], 'Root', label=relation[i])

g.view()

The visual syntactic analysis of dependency is as follows:

At this time, there is a problem of Chinese garbled code, mostly because fontname is not set to support Chinese display. Just put fontname="Microsoft YaHei" in node or edge to display normally

from graphviz import Digraph

g = Digraph('Test picture')

g.node(name='Root')
for word in words:
    g.node(name=word, fontname="Microsoft YaHei")

for i in range(len(words)):
    if relation[i] not in ['HED']:
        g.edge(words[i], heads[i], label=relation[i])
    else:
        if heads[i] == 'Root':
            g.edge(words[i], 'Root', label=relation[i])
        else:
            g.edge(heads[i], 'Root', label=relation[i])

g.view()

In this picture, we have an intuitive feeling of the dependency parsing results, and the effect is also very good. Unfortunately, we can't analyze the Graph formed by the above visualization results, because Graphviz is just a visualization tool. So, what kind of tool should we use for Graph analysis?

The answer is NetworkX. The following is a demonstration of the visualization and graph analysis of NetworkX applied to dependency parsing, in which graph analysis shows the shortest path between two nodes. The Python code of the example is as follows:

# Drawing parsing results using networkx
import networkx as nx
import matplotlib.pyplot as plt
from pylab import mpl

mpl.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # Specifies the default font

G = nx.Graph()  # Establish undirected graph G

# Add node
for word in words:
    G.add_node(word)

G.add_node('Root')

# Add edge
for i in range(len(words)):
    G.add_edge(words[i], heads[i])

source = '5G Polarization code'
target1 = 'Ren Zhengfei'
distance1 = nx.shortest_path_length(G, source=source, target=target1)
print("'%s'And'%s'The shortest distance in the dependency parsing graph is:  %s" % (source, target1, distance1))

target2 = 'Eldar'
distance2 = nx.shortest_path_length(G, source=source, target=target2)
print("'%s'And'%s'The shortest distance in the dependency parsing graph is:  %s" % (source, target2, distance2))

nx.draw(G, with_labels=True)
plt.savefig("undirected_graph.png")

Operation results:

'5G Polarization code'And'Ren Zhengfei'The shortest distance in the dependency parsing graph is:  6
'5G Polarization code'And'Eldar'The shortest distance in the dependency parsing graph is:  2

The obtained visual images are as follows:

graphviz Chinese garbled code, color and other attribute settings:

from graphviz import Digraph

digraph = Digraph("Chinese pictures")

digraph.node(name="a", label="wood", color="#00CD66", style="filled", fontcolor="white", fontname="Microsoft YaHei")
digraph.node(name="b", label="fire", color="#FF4500", style="filled", fontcolor="white", fontname="Microsoft YaHei")
digraph.node(name="c", label="Soil", color="#CD950C", style="filled", fontcolor="white", fontname="Microsoft YaHei")
digraph.node(name="d", label="gold", color="#FAFAD2", style="filled", fontcolor="#999999", fontname="Microsoft YaHei")
digraph.node(name="e", label="water", color="#00BFFF", style="filled", fontcolor="white", fontname="Microsoft YaHei")

digraph.edge("a", "b", label="Wood makes fire", color="#FF6666", fontcolor="#FF6666", fontname="Microsoft YaHei")
digraph.edge("b", "c", label="Pyrogenic soil", color="#FF6666", fontcolor="#FF6666", fontname="Microsoft YaHei")
digraph.edge("c", "d", label="Native gold", color="#FF6666", fontcolor="#FF6666", fontname="Microsoft YaHei")
digraph.edge("d", "e", label="Jin Shengshui", color="#FF6666", fontcolor="#FF6666", fontname="Microsoft YaHei")
digraph.edge("e", "a", label="Aquatic wood", color="#FF6666", fontcolor="#FF6666", fontname="Microsoft YaHei")

digraph.edge("a", "c", label="Muketu", color="#333333", fontcolor="#333333", fontname="Microsoft YaHei")
digraph.edge("c", "e", label="Tuke water", color="#333333", fontcolor="#333333", fontname="Microsoft YaHei")
digraph.edge("e", "b", label="Water conquers fire", color="#333333", fontcolor="#333333", fontname="Microsoft YaHei")
digraph.edge("b", "d", label="Huokejin", color="#333333", fontcolor="#333333", fontname="Microsoft YaHei")
digraph.edge("d", "a", label="Jin Kemu", color="#333333", fontcolor="#333333", fontname="Microsoft YaHei")

digraph.view()

Operation results:

 

Topics: Python Machine Learning Deep Learning NLP data visualization