For Elasticsearch installation, refer to: Installing Elasticsearch
Install the ik plug-in online (slow)
```shell
# Enter the container
docker exec -it elasticsearch /bin/bash
# Download and install the plug-in online
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
# Exit the container
exit
# Restart the container
docker restart elasticsearch
```
Install the ik plug-in offline (recommended)
View data volume directory
To install the plug-in, you need to know where Elasticsearch's plugins directory is. It is mounted as a data volume, so look it up with the following command:

```shell
docker inspect es
```
Display results:
[Image: output of `docker inspect es` (/images/es/02.png)]
Note that the plugins directory is mounted at `/root/docker/ES/es-plugins`.
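If the full `docker inspect` output is hard to scan, the mount can also be pulled out programmatically. Below is a minimal Python sketch; the JSON is hand-written stand-in data imitating the shape of the `Mounts` section, and the host path is this tutorial's assumed volume location, not real output:

```python
import json

# Stand-in sample of the Mounts section that `docker inspect es` prints;
# the Source path is an assumption taken from this tutorial.
inspect_output = json.loads("""
[{"Mounts": [
    {"Type": "bind",
     "Source": "/root/docker/ES/es-plugins",
     "Destination": "/usr/share/elasticsearch/plugins"}
]}]
""")

# Find the host directory mounted at the container's plugins path
plugins_mount = next(
    m["Source"]
    for m in inspect_output[0]["Mounts"]
    if m["Destination"].endswith("/plugins")
)
print(plugins_mount)  # /root/docker/ES/es-plugins
```

In practice you would feed this script the real output of `docker inspect es` instead of the embedded sample.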
Unzip the tokenizer installation package
Unzip the ik tokenizer archive and rename the directory to ik (download address: Click download directly)
Upload it to the plug-in data volume of the es container, i.e. `/root/docker/ES/es-plugins`
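The unzip-and-rename step can be sketched with Python's zipfile module. This is only an illustration: a dummy archive stands in for the real plugin download, and everything happens in a temporary directory rather than the actual data volume:

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# Dummy archive standing in for elasticsearch-analysis-ik-7.4.2.zip
zip_path = os.path.join(workdir, "elasticsearch-analysis-ik-7.4.2.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("plugin-descriptor.properties", "name=analysis-ik\n")

# Extract the archive into a directory named ik; in the real setup this
# directory is then placed in the mounted volume (/root/docker/ES/es-plugins)
ik_dir = os.path.join(workdir, "ik")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(ik_dir)

print(sorted(os.listdir(ik_dir)))  # ['plugin-descriptor.properties']
```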
Restart container
```shell
# Restart the container
docker restart es
# View the es log
docker logs -f es
```
Test:

```json
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是一个新时代农名工"
}
```
result:
```json
{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "一个", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "一", "start_offset" : 2, "end_offset" : 3, "type" : "TYPE_CNUM", "position" : 3 },
    { "token" : "个", "start_offset" : 3, "end_offset" : 4, "type" : "COUNT", "position" : 4 },
    { "token" : "新时代", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "时代", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 6 },
    { "token" : "农", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 7 },
    { "token" : "名", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "工", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 9 }
  ]
}
```
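Responses from `_analyze` are plain JSON, so the token list is easy to post-process in a script. A small sketch that pulls out just the token strings; the embedded sample is an illustrative subset of a response, not live output:

```python
import json

# Illustrative subset of an _analyze response
response = json.loads("""
{"tokens": [
  {"token": "我",    "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
  {"token": "一个",  "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2},
  {"token": "新时代", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 5}
]}
""")

# Collect just the token strings, in order
tokens = [t["token"] for t in response["tokens"]]
print(tokens)  # ['我', '一个', '新时代']
```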
Summary:

- ik_smart: coarsest segmentation (fewest tokens)
- ik_max_word: finest-grained segmentation (most tokens)
Extended word dictionary
With the development of the Internet, new words are coined more and more frequently. Many of them, such as "aoligai" and "jujuezi", do not exist in the default vocabulary, so the vocabulary needs to be updated continually. The IK tokenizer provides a way to extend its vocabulary.
1) Open the config directory of the IK tokenizer
2) In the IKAnalyzer.cfg.xml configuration file, add:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer Extended configuration</comment>
	<!-- Users can configure their own extended dictionary here -->
	<entry key="ext_dict">ext.dic</entry>
	<!-- Users can configure their own extended stop word dictionary here -->
	<entry key="ext_stopwords">stopword.dic</entry>
</properties>
```
3) Create a new ext.dic file. You can copy an existing file in the config directory and modify it. Add one word per line, for example:

```
绝绝子
奥力给
```
4) Restart elasticsearch
```shell
# Restart the container
docker restart es
# View the log
docker logs -f es
```
The log should show that the ext.dic configuration file was loaded successfully.
5) Test the effect:

```json
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "It's a unique son to eliminate the underworld and evil"
}
```
Note: the file must be saved in UTF-8 encoding. Do not edit it with Windows Notepad.
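One way to sidestep the encoding pitfall is to write the dictionary from a script with an explicit encoding. A minimal sketch; the entries are the example words from this section, and the file is written to a temporary directory rather than the real config directory:

```python
import os
import tempfile

# Example dictionary entries, one word per line
entries = ["绝绝子", "奥力给"]

# Write with an explicit UTF-8 encoding so the file never ends up in a
# Notepad-style ANSI encoding that IK cannot read.
path = os.path.join(tempfile.mkdtemp(), "ext.dic")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")

# Read it back to confirm the encoding round-trips
with open(path, encoding="utf-8") as f:
    print(f.read().splitlines())  # ['绝绝子', '奥力给']
```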
Stop word dictionary
In Internet projects, content spreads very quickly, and certain words, such as sensitive religious or political terms, are not allowed to be transmitted online. Such words should also be ignored when searching.

The IK tokenizer provides a powerful stop word feature that lets us ignore the words in the stop word list when building the index.
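Conceptually, stop word filtering just drops any token found in the stop list before it reaches the index. A toy sketch of the idea in Python (this mimics the concept only, not IK's actual implementation; the words are the examples used below):

```python
# Toy illustration: tokens present in the stop list are discarded
# before indexing. English tokens are used for readability.
stop_words = {"heroin", "narcotics"}

tokens = ["prohibition", "of", "heroin", "and", "narcotics"]
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['prohibition', 'of', 'and']
```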
1) Add to the IKAnalyzer.cfg.xml configuration file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer Extended configuration</comment>
	<!-- Users can configure their own extended dictionary here -->
	<entry key="ext_dict">ext.dic</entry>
	<!-- Users can configure their own extended stop word dictionary here -->
	<entry key="ext_stopwords">stopword.dic</entry>
</properties>
```
3) Add stop words in stopword.dic, one per line:

```
heroin
narcotics
```
4) Restart elasticsearch

```shell
# Restart the containers
docker restart es
docker restart kibana
# View the log
docker logs -f es
```
The log should show that the stopword.dic configuration file was loaded successfully.
5) Test the effect:

```json
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "Prohibition of drugs"
}
```
Note: this file must also be saved in UTF-8 encoding. Do not edit it with Windows Notepad.