Elasticsearch from Introduction to Mastery - Elasticsearch IK Automatic Hot Update: Principle and Implementation

Posted by byenary on Wed, 19 Jan 2022 07:30:00 +0100

1, Hot update principle

Once loading of a remote (external) dictionary is enabled, Elasticsearch (through the IK plug-in) re-checks that dictionary every 60 seconds. The following server-side handler, which serves the dictionary, illustrates the principle:

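// Needs: javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse,
// java.io.IOException, java.io.OutputStream, java.nio.charset.StandardCharsets,
// java.util.ArrayList, java.util.List, org.apache.commons.lang3.StringUtils
public void loadDic(HttpServletRequest req, HttpServletResponse response) {
    // If-None-Match is a request header, not a request parameter.
    String eTag = req.getHeader("If-None-Match");
    try {
        OutputStream out = response.getOutputStream();
        List<String> list = new ArrayList<String>();
        list.add("The People's Republic of China");
        list.add("I love you love me");
        // Demo only: the number of words stands in for a version tag.
        String oldEtag = list.size() + "";
        StringBuilder sb = new StringBuilder();
        // Compare strings with equals(); != only compares object references.
        if (!oldEtag.equals(eTag)) {
            for (String str : list) {
                if (StringUtils.isNotBlank(str)) {
                    sb.append(str).append("\n"); // one word per line
                }
            }
        }
        String data = sb.toString();

        response.setHeader("Last-Modified", String.valueOf(list.size()));
        response.setHeader("ETag", String.valueOf(list.size()));
        // Declare the same charset that is used to encode the body.
        response.setContentType("text/plain; charset=UTF-8");
        out.write(data.getBytes(StandardCharsets.UTF_8));
        out.flush();
    } catch (IOException e) {
        e.printStackTrace();
    }
}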
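The handler above uses the number of words as its version tag, so replacing one word with another would leave the tag unchanged. A minimal sketch of a content-based tag, assuming Java 11 or later; the class name DictionaryETag and the choice of MD5 are illustrative assumptions, not something the IK plug-in requires:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical helper: derives an ETag from the dictionary content.
class DictionaryETag {

    /** Returns a quoted hex MD5 of the words joined by '\n'. */
    static String of(List<String> words) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    String.join("\n", words).getBytes(StandardCharsets.UTF_8));
            StringBuilder tag = new StringBuilder("\"");
            for (byte b : digest) {
                tag.append(String.format("%02x", b));
            }
            return tag.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> before = List.of("Beijing", "haze");
        List<String> after = List.of("Beijing", "smog"); // same size, new content
        System.out.println(of(before).equals(of(after))); // prints "false"
    }
}
```

With a tag like this, any edit to the word list changes the ETag, which is what the plug-in looks for when it decides whether to reload.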

2, Configuration description

The analyzer (word segmenter) most commonly used by our company and its users is the IK analyzer. Its core configuration file is named IKAnalyzer.cfg.xml, with the following content:

<?xml version="1.0" encoding="UTF-8"?>  
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer Extended configuration</comment>  
    <!--Users can configure their own extended dictionary here -->  
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>  
     <!--Users can configure their own extended stop word dictionary here-->  
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>  
    <!--Users can configure the remote extension dictionary here -->  
    <entry key="remote_ext_dict">location</entry>  
    <!--Users can configure the remote extended stop word dictionary here-->  
    <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>  
</properties>

Using hot update with IK: the plug-in currently supports hot updating of the IK dictionaries through the following entries from the configuration file above:

<!--Users can configure the remote extension dictionary here -->  
<entry key="remote_ext_dict">location</entry>  
<!--Users can configure the remote extended stop word dictionary here-->  
<entry key="remote_ext_stopwords">location</entry> 

Here location is a URL, such as http://yoursite.com/getCustomDict. The request only needs to meet the following two requirements for the hot update to work.

  1. The HTTP response must return two headers, Last-Modified and ETag. Both are strings; as soon as either one changes, the plug-in fetches the new words and updates the dictionary.

  2. The body returned by the HTTP request contains one word per line, with \n as the line separator.

When these two requirements are met, the dictionary is hot-updated without restarting the ES instance. Note also that the dictionary URL must support HEAD requests.
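The change check described above can be sketched as follows. This is a minimal illustration rather than the plug-in's actual source; the class name RemoteDictionaryMonitor and the method hasChanged are hypothetical. It remembers the last Last-Modified and ETag values it saw and reports a change when either one differs:

```java
// Hypothetical sketch of the "has the remote dictionary changed?" decision.
class RemoteDictionaryMonitor {
    private String lastModified; // value seen on the previous poll, or null
    private String eTag;         // value seen on the previous poll, or null

    /**
     * Returns true if either header differs from the previously seen value,
     * and remembers the new values in that case.
     */
    boolean hasChanged(String newLastModified, String newETag) {
        boolean changed =
                (newLastModified != null && !newLastModified.equalsIgnoreCase(lastModified))
                || (newETag != null && !newETag.equalsIgnoreCase(eTag));
        if (changed) {
            lastModified = newLastModified;
            eTag = newETag;
        }
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictionaryMonitor monitor = new RemoteDictionaryMonitor();
        String date = "Wed, 19 Jan 2022 06:30:00 GMT";
        System.out.println(monitor.hasChanged(date, "\"v1\"")); // true: first poll
        System.out.println(monitor.hasChanged(date, "\"v1\"")); // false: nothing changed
        System.out.println(monitor.hasChanged(date, "\"v2\"")); // true: ETag changed
    }
}
```

In a real poller, the two arguments would come from the response to a HEAD request sent on a fixed schedule, and a true result would be followed by a GET that downloads the word list.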

The following is a Web API service endpoint that serves the remote extension dictionary. (You can skip this step and go straight to section 3; it is included only to show that this approach also works.)

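// ASP.NET Web API action. ToMD5() is assumed to be a custom string extension method defined elsewhere.
// The method does not await anything, so it is declared synchronous.
public HttpResponseMessage GetDictionary(string path)  {
    var response = this.Request.CreateResponse(HttpStatusCode.OK);
    var content = File.ReadAllText(path);
    response.Content = new StringContent(content, Encoding.UTF8);
    response.Headers.Age = TimeSpan.FromHours(1);
    // The ETag changes whenever the file content changes, which triggers a reload.
    response.Headers.ETag = EntityTagHeaderValue.Parse($"\"{content.ToMD5()}\"");
    return response;
}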
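The same idea can be sketched on the JVM with the JDK's built-in com.sun.net.httpserver package. This is only an illustration under assumptions: the class name DictionaryServer, the /dict path, and the ETag scheme (a hash of the file bytes) are hypothetical and come neither from the article nor from the IK plug-in. It assumes Java 11 or later and a UTF-8 dictionary file.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;

// Hypothetical sketch: serves one dictionary file with ETag and Last-Modified.
class DictionaryServer {

    /** Starts the server on the loopback address; port 0 picks a free port. */
    static HttpServer start(Path dictionaryFile, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/dict", exchange -> {
            try {
                byte[] body = Files.readAllBytes(dictionaryFile);
                String modified = DateTimeFormatter.RFC_1123_DATE_TIME.format(
                        Files.getLastModifiedTime(dictionaryFile).toInstant().atZone(ZoneOffset.UTC));
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
                exchange.getResponseHeaders().set("Last-Modified", modified);
                // Demo-only tag: a simple hash of the bytes (collisions are possible).
                exchange.getResponseHeaders().set("ETag",
                        "\"" + Integer.toHexString(Arrays.hashCode(body)) + "\"");
                if ("HEAD".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(200, -1); // headers only, no body
                } else {
                    exchange.sendResponseHeaders(200, body.length);
                    try (OutputStream out = exchange.getResponseBody()) {
                        out.write(body);
                    }
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of(args.length > 0 ? args[0] : "hot_ext.dic");
        HttpServer server = start(file, 8080);
        System.out.println("Serving " + file + " at http://127.0.0.1:"
                + server.getAddress().getPort() + "/dict");
    }
}
```

The sketch binds to the loopback address only; an ES node on another machine would need the server bound to a reachable address instead.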

3, Tomcat server automatic update

1. Deploy http service

Here Tomcat 8 is used as the web container. First download Tomcat 8.5.16 and upload it to a server (192.168.80.100).

Then run the following commands:

tar -zxvf apache-tomcat-8.5.16.tar.gz

cd apache-tomcat-8.5.16/webapps/ROOT

vim hot_ext.dic          # add one line: intelligent mobile robot
vim hot_stopwords.dic    # add one line: item

Then verify that the file can be accessed normally: http://192.168.80.100:8080/hot_ext.dic
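Because the plug-in relies on HEAD requests and on the Last-Modified and ETag headers, it can be useful to check those as well as plain accessibility. The following is a rough sketch, assuming Java 11 or later; the class name DictionaryUrlCheck and its messages are hypothetical, and the default URL is simply the dictionary address used in this article.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch: checks that a dictionary URL answers HEAD with the expected headers.
class DictionaryUrlCheck {

    /** Turns the outcome of a HEAD request into a short verdict. */
    static String evaluate(int status, String lastModified, String eTag) {
        if (status != 200) {
            return "not usable: HTTP " + status;
        }
        if (lastModified == null && eTag == null) {
            return "reachable, but neither Last-Modified nor ETag is set";
        }
        return "ok";
    }

    /** Sends a HEAD request to the given URL and evaluates the response. */
    static String check(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return evaluate(conn.getResponseCode(),
                    conn.getHeaderField("Last-Modified"),
                    conn.getHeaderField("ETag"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "http://192.168.80.100:8080/hot_ext.dic";
        System.out.println(url + " -> " + check(url));
    }
}
```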

2. Modify the configuration file of the IK plug-in

Modify the remote dictionary entries in the IK core configuration file IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>  
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer Extended configuration</comment>  
    <!--Users can configure their own extended dictionary here -->  
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>  
     <!--Users can configure their own extended stop word dictionary here-->  
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>  
    <!--Users can configure the remote extension dictionary here -->  
    <entry key="remote_ext_dict">http://192.168.80.100:8080/hot_ext.dic</entry>  
    <!--Users can configure the remote extended stop word dictionary here-->  
    <entry key="remote_ext_stopwords">http://192.168.80.100:8080/hot_stopwords.dic</entry>  
</properties>
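The file uses the standard Java XML properties format (note the properties.dtd DOCTYPE), so it can be read with java.util.Properties.loadFromXML. The sketch below shows one way to list the configured dictionary locations. The class name IkConfigReader is hypothetical, and splitting values on ';' follows the ext_dict entry shown above; treating the remote keys the same way is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: reads entries from an IKAnalyzer.cfg.xml-style document.
class IkConfigReader {

    /** Returns the non-empty, ';'-separated values stored under the given key. */
    static List<String> values(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> result = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(";")) {
            if (!part.trim().isEmpty()) {
                result.add(part.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <entry key=\"ext_dict\">custom/mydict.dic;custom/single_word_low_freq.dic</entry>\n"
                + "  <entry key=\"remote_ext_dict\">http://192.168.80.100:8080/hot_ext.dic</entry>\n"
                + "</properties>";
        System.out.println(values(xml, "ext_dict"));
        System.out.println(values(xml, "remote_ext_dict"));
    }
}
```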
 

3. Verification

After restarting ES, check the log for messages indicating that the remote dictionary has been loaded successfully.

  

Open the following URL in the browser: http://localhost:9200/patent/_analyze?analyzer=ik_smart&text=intelligent mobile robot

Normally this phrase would be split into several tokens. This time it comes back as a single, unsplit word, which shows that the remote dictionary configured above has taken effect. Now try another phrase: Beijing haze. Normally it would be split into two words, Beijing and haze. Edit the hot_ext.dic dictionary file under Tomcat directly and add the keyword Beijing haze. After saving the file, check the ES log, which should show that the remote dictionary has been reloaded.

  

Now open the following URL to check the segmentation result: http://localhost:9200/patent/_analyze?analyzer=ik_smart&text=Beijing haze

The result is a single word; the phrase is no longer split.
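When the request is sent from code rather than typed into a browser, the text parameter has to be URL-encoded, since spaces and Chinese characters cannot appear raw in a URL. A small sketch, assuming Java 10 or later; the class name AnalyzeUrl is hypothetical, and the query-parameter form of _analyze is the one used in this article (newer Elasticsearch releases may expect a JSON request body instead).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the _analyze URL used in this article.
class AnalyzeUrl {

    /** Builds baseUrl/index/_analyze with URL-encoded analyzer and text parameters. */
    static String build(String baseUrl, String index, String analyzer, String text) {
        return baseUrl + "/" + index + "/_analyze"
                + "?analyzer=" + URLEncoder.encode(analyzer, StandardCharsets.UTF_8)
                + "&text=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints http://localhost:9200/patent/_analyze?analyzer=ik_smart&text=Beijing+haze
        System.out.println(build("http://localhost:9200", "patent", "ik_smart", "Beijing haze"));
    }
}
```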

At this point, we can add custom words dynamically and hot-update the dictionary without restarting ES.
