Train a NER tagger from scratch

Posted by dniry on Thu, 13 Jan 2022 22:17:12 +0100


NER stands for named entity recognition. Like part-of-speech tagging, it is one of the technical foundations of natural language processing. NER refers to identifying the names of real-world objects in text, such as France, Donald Trump, or WeChat. Among these words, France is a country, labeled GPE (geopolitical entity); Donald Trump is a person, labeled PER; and WeChat is a company, so it is labeled ORG (organization).

Common entity types in the spaCy module include:

PERSON - people, including fictional ones
NORP - nationalities, religious or political groups
ORG - companies, agencies, institutions
GPE - countries, cities, states
LOC - non-GPE locations, such as mountain ranges and bodies of water
DATE - absolute or relative dates or periods
MONEY - monetary values, including the unit

What NER tagging is good for:
1) Most obviously, a model can recognize the entities you need from raw text.
2) Relationships between entities can be inferred. For example, in "Rome is the capital of Italy", entity recognition lets us judge that Rome is an Italian city rather than, say, an R&B artist. This task is called named entity disambiguation (NED).

Use cases for NED include resolving word ambiguity, identifying genes and gene products in medical research, and analyzing writing style.
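Before training a tagger of our own, it helps to see a pretrained pipeline in action. Here is a minimal sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:

import spacy

nlp = spacy.load('en_core_web_sm')   # pretrained English pipeline (assumed installed)
doc = nlp('Rome is the capital of Italy.')
for ent in doc.ents:
    # spacy.explain() turns a label into a human-readable description
    print(ent.text, ent.label_, spacy.explain(ent.label_))

With the small English model, both Rome and Italy should come out tagged as GPE.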

Next, let's train a NER tagger with spaCy.

Note: the code in this article is written for spaCy 3.0.

1. Custom model

1. Import the required packages and modules

from __future__ import unicode_literals, print_function
import plac                           # plac handles command-line arguments in spaCy's example scripts
import random
from pathlib import Path
import spacy
from spacy.training import Example    # spaCy 3.x moved Example to spacy.training
from spacy.tokens import Doc

2. Prepare the training samples

Entity annotations use character offsets with Python slice semantics: the start index counts from 0, and the end index is the index of the last character plus 1. In 'Who is Shaka Khan?', the span (7, 17) therefore covers 'Shaka Khan'.

# training data
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    }),
]

Although this example uses only two training samples, they are representative of the format.
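Because the offsets follow Python slice semantics, an annotation can be sanity-checked by slicing the raw string:

text = 'Who is Shaka Khan?'
start, end, label = (7, 17, 'PERSON')
assert text[start:end] == 'Shaka Khan'   # end is the index of the last character + 1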

2. Train the model

1. Load an existing model or create a blank one

if model is not None:
    nlp = spacy.load(model)   # load an existing model to fine-tune
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')   # create a blank English model
    print("Created blank 'en' model")

2. Create the built-in pipeline component

Use the add_pipe method to add spaCy's built-in ner component to the pipeline:

if 'ner' not in nlp.pipe_names:
    # in spaCy 3.x, add_pipe takes the component name and returns the component
    ner = nlp.add_pipe('ner', last=True)
else:
    ner = nlp.get_pipe('ner')
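You can confirm the component was registered by inspecting the pipeline; for a blank model this should print just the one component:

print(nlp.pipe_names)   # e.g. ['ner'] for a blank model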

3. Add the labels from the training data

for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])   # ent is (start, end, label); register the label with the NER component
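After the loop, the labels are registered on the component and can be checked via ner.labels:

print(ner.labels)   # e.g. ('LOC', 'PERSON')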

4. Train the model

The training loop itself is simple: the nlp.update() method abstracts everything away for us, and spaCy handles the actual machine learning under the hood.

# Disable all other components in the pipeline so that only the NER component is trained/updated
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # otherwise every component would be updated
    optimizer = nlp.begin_training()   # in spaCy 3.x this is an alias for nlp.initialize()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)     # shuffle the training data on every iteration
        losses = {}                    # dict that accumulates the losses
        for text, annotations in TRAIN_DATA:
            # wrap each sample in an Example object, as spaCy 3.x requires
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update(
                [example],          # batch of examples (here, one at a time)
                sgd=optimizer,      # the optimizer that updates the weights
                losses=losses,
            )
        print('Losses', losses)
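One aside: spaCy's own example scripts usually batch the updates with spacy.util.minibatch and a compounding batch size rather than calling nlp.update() once per sample. Here is a sketch of that variant of the inner loop (the batch-size values are the ones spaCy's examples typically use):

from spacy.util import minibatch, compounding

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch size grows from 4 to 32, compounding by a factor of 1.001 per batch
        for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
            examples = [Example.from_dict(nlp.make_doc(text), anns)
                        for text, anns in batch]
            nlp.update(examples, drop=0.5, sgd=optimizer, losses=losses)
        print('Losses', losses)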

5. Save the model

if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)   # serialize the trained pipeline to disk
    print("Saved model to", output_dir)

3. Test the model

def load_model_test(path, text):
    nlp = spacy.load(path)           # load the saved pipeline from disk
    print("Loading from", path)
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)  # print each recognized entity and its label

if __name__ == "__main__":
    path = "./model/"
    text = "Who is Shaka Khan"
    load_model_test(path, text)

The model's output is as follows:

Loading from ./model/
Shaka Khan PERSON

The model correctly picks out Shaka Khan and labels it PERSON, i.e., a person's name.
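To inspect the results visually, spaCy also ships the displacy visualizer, which highlights recognized entities in the text. A minimal sketch (displacy.serve() starts a local web server; in a Jupyter notebook use displacy.render() instead):

from spacy import displacy

nlp = spacy.load('./model/')
doc = nlp('Who is Shaka Khan?')
displacy.serve(doc, style='ent')   # open http://localhost:5000 to view the highlighted entities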

References

Bhargav Srinivasa-Desikan, Natural Language Processing and Computational Linguistics, People's Posts and Telecommunications Press