Documentation of the word segmentation component

Posted by eco on Tue, 25 Jan 2022 04:17:34 +0100

The GitHub page of the word segmentation component cannot be opened, so its content is reposted here for easy viewing.

How to use the word segmentation component:

1. Quick experience

Run the demo-word.bat script in the project root directory to quickly try out the word segmentation effect
 usage: command [text] [input] [output]
command: the available values are demo, text, file
demo
text Yang Shangchuan is APDPlat Author of application level product development platform
file d:/text.txt d:/word.txt
exit

2. Word segmentation of text

Remove stop words: List<Word> words = WordSegmenter.seg("Yang Shangchuan is APDPlat Author of application level product development platform");
Keep stop words: List<Word> words = WordSegmenter.segWithStopWords("Yang Shangchuan is APDPlat Author of application level product development platform");
System.out.println(words);

Output:
Remove stop words: [Yang Shangchuan, apdplat, Application level, product, Development platform, author]
Keep stop words: [Yang Shangchuan, yes, apdplat, Application level, product, Development platform, of, author]
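For reference, here are the two calls above assembled into a minimal, self-contained program. This is a sketch: the import paths org.apdplat.word.WordSegmenter and org.apdplat.word.segmentation.Word follow the package layout used elsewhere in this document and may differ in other versions.

import org.apdplat.word.WordSegmenter;
import org.apdplat.word.segmentation.Word;

import java.util.List;

public class SegDemo {
    public static void main(String[] args) {
        // Segment and remove stop words
        List<Word> words = WordSegmenter.seg("Yang Shangchuan is APDPlat Author of application level product development platform");
        System.out.println(words);
        // Segment and keep stop words
        List<Word> wordsWithStopWords = WordSegmenter.segWithStopWords("Yang Shangchuan is APDPlat Author of application level product development platform");
        System.out.println(wordsWithStopWords);
    }
}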

3. Word segmentation of documents

String input = "d:/text.txt";
String output = "d:/word.txt";
Remove stop words: WordSegmenter.seg(new File(input), new File(output));
Keep stop words: WordSegmenter.segWithStopWords(new File(input), new File(output));

4. Custom configuration file

The default configuration file is word.conf under the classpath, packaged inside word-x.x.jar
 The custom configuration file is word.local.conf under the classpath; it must be provided by the user
 If a custom configuration item has the same name as a default one, the custom value overrides the default
 Configuration files are encoded in UTF-8
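For illustration, a minimal word.local.conf could override just the user dictionary path. This is a sketch: dic.path is the configuration item introduced in section 5; every other item falls back to the defaults in word.conf.

# word.local.conf (UTF-8)
# Override only the user dictionary path; all other items keep their defaults from word.conf
dic.path=classpath:dic.txt,d:/custom_dic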

5. Custom user thesaurus

A user-defined thesaurus consists of one or more folders or files; absolute or relative paths can be used
 The user thesaurus is composed of multiple dictionary files encoded in UTF-8
 A dictionary file is a plain text file, with one word per line
 Paths can be specified via system properties or configuration files; multiple paths are separated by commas
 Dictionary files under the classpath need the prefix classpath: before the relative path
	
There are three ways to specify the paths:
	Method 1: programmatic (highest priority):
		WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
		DictionaryFactory.reload();//Reload the dictionary after changing the dictionary path
	Method 2: Java virtual machine startup parameter (medium priority):
		java -Ddic.path=classpath:dic.txt,d:/custom_dic
	Method 3: configuration file (lowest priority):
		Use the word.local.conf file under the classpath to specify the configuration information
		dic.path=classpath:dic.txt,d:/custom_dic

If not specified, the dic.txt dictionary file under the classpath is used by default

In addition, you can also maintain the thesaurus in code; the method is as follows:

// Single operation
// Add a custom word
DictionaryFactory.getDictionary().add("Yang Shangchuan");
// Delete a custom word
DictionaryFactory.getDictionary().remove("Liu Shishi");
// Batch operation
List<String> words = new ArrayList<>();
words.add("Lau Andy");
words.add("Jing Tian");
words.add("Zhao Liying");
// Add a batch of custom words
DictionaryFactory.getDictionary().addAll(words);
// Delete a batch of custom words
DictionaryFactory.getDictionary().removeAll(words);

6. Custom stop word thesaurus

The usage is similar to the custom user thesaurus. The configuration item is:
stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic

7. Automatically detect thesaurus changes

Changes to the user-defined thesaurus and the user-defined stop-word thesaurus can be detected automatically
 This covers files and folders under the classpath, as well as absolute and relative paths outside the classpath
 For example:
classpath:dic.txt,classpath:custom_dic_dir,
d:/dic_more.txt,d:/DIC_DIR,D:/DIC2_DIR,my_dic_dir,my_dic_file.txt

classpath:stopwords.txt,classpath:custom_stopwords_dic_dir,
d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt

8. Explicitly specify word segmentation algorithm

When segmenting text, you can explicitly specify a specific word segmentation algorithm, such as:
WordSegmenter.seg("APDPlat Application level product development platform", SegmentationAlgorithm.BidirectionalMaximumMatching);

The available values of SegmentationAlgorithm are:
Forward maximum matching algorithm: MaximumMatching
 Reverse maximum matching algorithm: ReverseMaximumMatching
 Forward minimum matching algorithm: MinimumMatching
 Reverse minimum matching algorithm: ReverseMinimumMatching
 Bidirectional maximum matching algorithm: BidirectionalMaximumMatching
 Bidirectional minimum matching algorithm: BidirectionalMinimumMatching
 Bidirectional maximum-minimum matching algorithm: BidirectionalMaximumMinimumMatching
 Full segmentation algorithm: FullSegmentation
 Least-words algorithm: MinimalWordCount
 Maximum Ngram score algorithm: MaxNgramScore
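To compare the algorithms side by side, a small sketch can iterate over the enumeration and segment the same text with each of them (WordSegmenter and SegmentationAlgorithm as used above; the import paths are assumptions based on the project layout):

import org.apdplat.word.WordSegmenter;
import org.apdplat.word.segmentation.SegmentationAlgorithm;

public class AlgorithmComparison {
    public static void main(String[] args) {
        String text = "APDPlat Application level product development platform";
        // Segment the same text with every available algorithm
        for (SegmentationAlgorithm algorithm : SegmentationAlgorithm.values()) {
            System.out.println(algorithm.name() + ": " + WordSegmenter.seg(text, algorithm));
        }
    }
}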

9. Word segmentation effect evaluation

Run the evaluation.bat script in the project root directory to evaluate the word segmentation effect
 The test text used in the evaluation has 2,533,709 lines and 28,374,490 characters in total
 The evaluation results are located in the target/evaluation directory:
corpus-text.txt is the manually annotated, pre-segmented text, with words separated by spaces
test-text.txt is the test text, produced by splitting corpus-text.txt into multiple lines at punctuation marks
standard-text.txt is the manually annotated text corresponding to the test text, serving as the standard for judging whether segmentation is correct
result-text-***.txt, where *** stands for the various segmentation algorithm names, are the segmentation results of word
perfect-result-***.txt, where *** stands for the various segmentation algorithm names, are the texts whose segmentation results fully match the manual annotation standard
wrong-result-***.txt, where *** stands for the various segmentation algorithm names, are the texts whose segmentation results differ from the manual annotation standard

10. Distributed Chinese word segmentation

1,In the custom configuration file word.conf or word.local.conf, set all *.path configuration items to HTTP resources and specify the redis.* configuration items:

    #Dictionary
    dic.path=http://localhost:8080/word_web/resources/dic.txt
    #Part of speech tagging data
    part.of.speech.dic.path=http://localhost:8080/word_web/resources/part_of_speech_dic.txt
    #Part of speech description data
    part.of.speech.des.path=http://localhost:8080/word_web/resources/part_of_speech_des.txt
    #Binary model
    bigram.path=http://localhost:8080/word_web/resources/bigram.txt
    #Ternary model
    trigram.path=http://localhost:8080/word_web/resources/trigram.txt
    #Stop word dictionary
    stopwords.path=http://localhost:8080/word_web/resources/stopwords.txt
    #Punctuation marks used to segment words
    punctuation.path=http://localhost:8080/word_web/resources/punctuation.txt
    #Hundred family names
    surname.path=http://localhost:8080/word_web/resources/surname.txt
    #Quantifier
    quantifier.path=http://localhost:8080/word_web/resources/quantifier.txt

    #Whether to use the publish subscribe service of redis to detect HTTP resource changes in real time
    redis.enable=false
    #redis service is used to detect HTTP resource changes in real time
    #redis host
    redis.host=localhost
    #redis port
    redis.port=6379

2,Configure and start the redis server

    All word segmenters subscribe to the redis server. When the redis server receives a user's instruction to add or delete a resource, it notifies all word segmenters to perform the corresponding operation

3,Configure and start the web server that provides the HTTP resources, i.e. deploy the project https://github.com/ysc/word_web to port 8080 of Tomcat

    // Notify all word segmenters to add the word "Yang Shangchuan"
    http://localhost:8080/word_web/admin/dic.jsp?action=add&dic=Yang Shangchuan
    // Notify all word segmenters to delete the word "notebook"
    http://localhost:8080/word_web/admin/dic.jsp?action=remove&dic=notebook

    After dic.jsp receives the user's request, it delivers the message to the redis server, and the redis server publishes the message to all subscribed word segmenters

11. Part of speech tagging

Taking the segmentation result as the input parameter, call the process method of the PartOfSpeechTagging class; the part of speech is saved in the partOfSpeech field of the Word class
 As follows:
List<Word> words = WordSegmenter.segWithStopWords("I love China");
System.out.println("Unmarked part of speech:"+words);
//Part of speech tagging
PartOfSpeechTagging.process(words);
System.out.println("Part of speech:"+words);
Output content:
Unmarked part of speech:[I, love, China]
Part of speech:[I/r, love/v, China/ns]

12. Refine

Let's look at a segmentation example:
List<Word> words = WordSegmenter.segWithStopWords("China's working class and the broad working masses should unite more closely around the Party Central Committee");
System.out.println(words);
The results are as follows:
[our country, working class, and, vast, working masses, want, More, close together, land, unite, stay, the central committee of the communist of China(CPC), around]
If the segmentation result we want is:
[our country, worker, class, and, vast, labour, Masses, want, More, close together, land, unite, stay, the central committee of the communist of China(CPC), around]
That is, "working class" should be subdivided into "worker" and "class", and "working masses" into "labour" and "masses". What should we do?
We can add the following content to the file specified by the word.refine.path configuration item, which defaults to classpath:word_refine.txt (on the left is the word to split, on the right the space-separated parts):
working class=worker class
 working masses=labour masses
 Then we refine the segmentation result:
words = WordRefiner.refine(words);
System.out.println(words);
In this way, we can achieve the desired effect:
[our country, worker, class, and, vast, labour, Masses, want, More, close together, land, unite, stay, the central committee of the communist of China(CPC), around]

Let's look at another segmentation example:
List<Word> words = WordSegmenter.segWithStopWords("New achievements on the great journey of realizing the goal of \"two centenaries\"");
System.out.println(words);
The results are as follows:
[stay, realization, Two, A hundred years, Goal, of, great, journey, upper, Recreate, new, achievement]
If the segmentation result we want is:
[stay, realization, two centenary goals, Goal, of, great journey , upper, Recreate, new, achievement]
That is, "Two" and "A hundred years" should be merged into "two centenary goals", and "great" and "journey" into "great journey". What should we do?
We can add the following content to the file specified by the word.refine.path configuration item, which defaults to classpath:word_refine.txt (on the left are the space-separated words, on the right the merged word):
Two A hundred years=two centenary goals
 great journey=great journey
Then we refine the segmentation result:
words = WordRefiner.refine(words);
System.out.println(words);
In this way, we can achieve the desired effect:
[stay, realization, two centenary goals, Goal, of, great journey , upper, Recreate, new, achievement]

13. Synonym tagging

List<Word> words = WordSegmenter.segWithStopWords("Chu limo tried every means to retrieve the memory for ruthlessness");
System.out.println(words);
The results are as follows:
[Chu limo, make every attempt, by, ruthless, Retrieve, memory]
Perform synonym tagging:
SynonymTagging.process(words);
System.out.println(words);
The results are as follows:
[Chu limo, make every attempt[Long intentional, Make every effort, try various devices to, tax one 's ingenuity], by, ruthless, Retrieve, memory[Image]]
If indirect synonyms are enabled:
SynonymTagging.process(words, false);
System.out.println(words);
The results are as follows:
[Chu limo, make every attempt[Long intentional, Make every effort, try various devices to, tax one 's ingenuity], by, ruthless, Retrieve, memory[image, Image]]

List<Word> words = WordSegmenter.segWithStopWords("Older people with strong hands tend to live longer");
System.out.println(words);
The results are as follows:
[Hand strength, large, of, the elderly, often, more, longevity]
Perform synonym tagging:
SynonymTagging.process(words);
System.out.println(words);
The results are as follows:
[Hand strength, large, of, the elderly[Old man], often[often, Often, often], more, longevity[Long life, longevity]]
If indirect synonyms are enabled:
SynonymTagging.process(words, false);
System.out.println(words);
The results are as follows:
[Hand strength, large, of, the elderly[Old man], often[As usual, commonly, Generally, ordinary, often, Chang ri, ordinary, usually, usual, Weekdays, peacetime, Usual, daily, Everyday ordinary, often, ordinary, Often, General, Su ri, often, popular, usually], more, longevity[Long life, longevity]]

Take the word "make every attempt" as an example:
Its synonyms can be obtained through the getSynonym() method of Word, for example:
System.out.println(word.getSynonym());
The results are as follows:
[Long intentional, Make every effort, try various devices to, tax one 's ingenuity]
Note: if there is no synonym, getSynonym() returns an empty collection: Collections.emptyList()
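Putting the above together, one can segment a sentence, tag synonyms, and then read each word's synonyms via getSynonym(). A sketch, assuming the tagging class lives under org.apdplat.word.tagging and that Word exposes its text via getText():

import org.apdplat.word.WordSegmenter;
import org.apdplat.word.segmentation.Word;
import org.apdplat.word.tagging.SynonymTagging;

import java.util.List;

public class SynonymDemo {
    public static void main(String[] args) {
        List<Word> words = WordSegmenter.segWithStopWords("Chu limo tried every means to retrieve the memory for ruthlessness");
        SynonymTagging.process(words);
        for (Word word : words) {
            // getSynonym() returns Collections.emptyList() when there is no synonym
            if (!word.getSynonym().isEmpty()) {
                System.out.println(word.getText() + " -> " + word.getSynonym());
            }
        }
    }
}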

The difference between indirect synonyms and direct synonyms is as follows:
Assume:
A and B are synonyms, A and C are synonyms, B and D are synonyms, C and E are synonyms
 Then:
For A, the direct synonyms are A, B, C
 For B, the direct synonyms are A, B, D
 For C, the direct synonyms are A, C, E
 For A, B and C, the indirect synonyms are A, B, C, D, E

14. Antonym tagging

List<Word> words = WordSegmenter.segWithStopWords("5 What movies are worth watching at the beginning of this month");
System.out.println(words);
The results are as follows:
[5, At the beginning of the month, have, Which?, film, worth, watch]
Perform antonym tagging:
AntonymTagging.process(words);
System.out.println(words);
The results are as follows:
[5, At the beginning of the month[end of month, end of the month, The end of the month], have, Which?, film, worth, watch]

List<Word> words = WordSegmenter.segWithStopWords("Because the work is not in place and the service is not perfect, unpleasant things happen to customers during dinner,The restaurant should make a sincere apology to the customers,Not perfunctory.");
System.out.println(words);
The results are as follows:
[because, work, Not in place, service, imperfect, cause, customer, stay, have meals, Time, happen, Unpleasant, of, thing, restaurant, aspect, should, towards, customer, make, sincere, of, apologize, instead of, do things carelessly]
Perform antonym tagging:
AntonymTagging.process(words);
System.out.println(words);
The results are as follows:
[because, work, Not in place, service, imperfect, cause, customer, stay, have meals, Time, happen, Unpleasant, of, thing, restaurant, aspect, should, towards, customer, make, sincere[Fool, FALSE, false, sinister and crafty], of, apologize, instead of, do things carelessly[be strict in one 's demands, be conscientious and do one's best, strain every nerve, strain every nerve, refine on, a matter of conscience]]

Take the word "At the beginning of the month" as an example:
Its antonyms can be obtained through the getAntonym() method of Word, for example:
System.out.println(word.getAntonym());
The results are as follows:
[end of month, end of the month, The end of the month]
Note: if there is no antonym, getAntonym() returns an empty collection: Collections.emptyList()
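The same pattern works for antonyms. A sketch, with the same assumptions about package layout and getText() as in the synonym example:

import org.apdplat.word.WordSegmenter;
import org.apdplat.word.segmentation.Word;
import org.apdplat.word.tagging.AntonymTagging;

import java.util.List;

public class AntonymDemo {
    public static void main(String[] args) {
        List<Word> words = WordSegmenter.segWithStopWords("5 What movies are worth watching at the beginning of this month");
        AntonymTagging.process(words);
        // Print only the words that actually have antonyms
        words.stream()
             .filter(word -> !word.getAntonym().isEmpty())
             .forEach(word -> System.out.println(word.getText() + " -> " + word.getAntonym()));
    }
}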

15. Pinyin tagging

List<Word> words = WordSegmenter.segWithStopWords("Since its release on April 12, the box office of Speed and Passion 7 in mainland China has exceeded RMB 2 billion in just two weeks");
System.out.println(words);
The results are as follows:
[speed, And, passion, 7, of, China, inland, box office, since, 4 month, 12 day, Show, since, stay, Short, Two weeks, within, Breach, 20 Hundred million, RMB]
Perform pinyin tagging:
PinyinTagging.process(words);
System.out.println(words);
The results are as follows:
[speed sd sudu, And y yu, passion jq jiqing, 7, of d de, China zg zhongguo, inland nd neidi, box office pf piaofang, since z zi, 4 month, 12 day, Show sy shangying, since yl yilai, stay z zai, Short dd duanduan, Two weeks lz liangzhou, within n nei, Breach tp tupo, 20 Hundred million, RMB rmb renminbi]

Take the word "speed" as an example:
The full pinyin can be obtained through the getFullPinYin() method of Word, e.g.: sudu
 The acronym pinyin can be obtained through the getAcronymPinYin() method of Word, e.g.: sd
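A short sketch that prints both pinyin forms for every word, under the same assumptions as the previous examples:

import org.apdplat.word.WordSegmenter;
import org.apdplat.word.segmentation.Word;
import org.apdplat.word.tagging.PinyinTagging;

import java.util.List;

public class PinyinDemo {
    public static void main(String[] args) {
        List<Word> words = WordSegmenter.segWithStopWords("speed and passion");
        PinyinTagging.process(words);
        for (Word word : words) {
            // Acronym pinyin (e.g. sd) and full pinyin (e.g. sudu)
            System.out.println(word.getText() + " " + word.getAcronymPinYin() + " " + word.getFullPinYin());
        }
    }
}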

16. Lucene plug-in:

1,Construct a word analyzer ChineseWordAnalyzer
Analyzer analyzer = new ChineseWordAnalyzer();
If you need to use a specific word segmentation algorithm, you can specify it through the constructor:
Analyzer analyzer = new ChineseWordAnalyzer(SegmentationAlgorithm.FullSegmentation);
If not specified, the bidirectional maximum matching algorithm SegmentationAlgorithm.BidirectionalMaximumMatching is used by default
 See the enumeration class SegmentationAlgorithm for the available segmentation algorithms

2,Use the word analyzer to segment text
TokenStream tokenStream = analyzer.tokenStream("text", "Yang Shangchuan is APDPlat Author of application level product development platform");
//Prepare for consumption
tokenStream.reset();
//Start consumption
while(tokenStream.incrementToken()){
	//Words
	CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
	//The start and end offsets of the word in the text
	OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
	//The position increment of the word
	PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
	
	LOGGER.info(charTermAttribute.toString()+" ("+offsetAttribute.startOffset()+" - "+offsetAttribute.endOffset()+") "+positionIncrementAttribute.getPositionIncrement());
}
//Consumption completed
tokenStream.close();

3,Use the word analyzer to build a Lucene index
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

4,Use the word analyzer to query the Lucene index
QueryParser queryParser = new QueryParser("text", analyzer);
Query query = queryParser.parse("text:Yang Shangchuan");
TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
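Tying steps 3 and 4 together, a minimal end-to-end sketch indexes one document and searches it. It uses the Lucene 5.x-era APIs that the snippets above assume (including the indexSearcher construction the query snippet leaves out); exact signatures vary slightly across Lucene versions.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apdplat.word.lucene.ChineseWordAnalyzer;

public class LuceneDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new ChineseWordAnalyzer();
        Directory directory = new RAMDirectory();
        // Index one document with the word analyzer
        IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("text", "Yang Shangchuan is APDPlat Author of application level product development platform", Field.Store.YES));
        indexWriter.addDocument(doc);
        indexWriter.close();
        // Query the index with the same analyzer
        IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
        Query query = new QueryParser("text", analyzer).parse("text:Yang Shangchuan");
        TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
        System.out.println("hits: " + docs.totalHits);
    }
}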

17. Solr plug-in:

1,download word-1.3.jar
 Download address: http://search.maven.org/remotecontent?filepath=org/apdplat/word/1.3/word-1.3.jar

2,Create the directory solr-5.2.0/example/solr/lib and copy word-1.3.jar into the lib directory

3,Configure schema to specify the tokenizer
 In the file solr-5.2.0/example/solr/collection1/conf/schema.xml, replace all occurrences of
<tokenizer class="solr.WhitespaceTokenizerFactory"/> and
<tokenizer class="solr.StandardTokenizerFactory"/> with
<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
and remove all filter tags (alternatively, declare a dedicated field type, as sketched below)
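Alternatively, instead of replacing the tokenizers globally, a dedicated field type can be declared in schema.xml and applied only to the fields that need Chinese segmentation. A minimal sketch; the field type name text_word is illustrative:

<fieldType name="text_word" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
  </analyzer>
</fieldType>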

4,If you need to use a specific word segmentation algorithm:
<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"/>
The available values of segAlgorithm are:
Forward maximum matching algorithm: MaximumMatching
 Reverse maximum matching algorithm: ReverseMaximumMatching
 Forward minimum matching algorithm: MinimumMatching
 Reverse minimum matching algorithm: ReverseMinimumMatching
 Bidirectional maximum matching algorithm: BidirectionalMaximumMatching
 Bidirectional minimum matching algorithm: BidirectionalMinimumMatching
 Bidirectional maximum-minimum matching algorithm: BidirectionalMaximumMinimumMatching
 Full segmentation algorithm: FullSegmentation
 Least-words algorithm: MinimalWordCount
 Maximum Ngram score algorithm: MaxNgramScore
 If not specified, the bidirectional maximum matching algorithm BidirectionalMaximumMatching is used by default

5,If you need to specify a specific configuration file:
<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"
		conf="solr-5.2.0/example/solr/nutch/conf/word.local.conf"/>
For the configurable contents of the word.local.conf file, see the word.conf file inside word-1.3.jar
 If not specified, the default configuration file word.conf inside word-1.3.jar is used

18. ElasticSearch plug-in:

1,Open the command line and switch to the root directory of elasticsearch
cd elasticsearch-5.4.3

2,Install the word segmentation plug-in:
wget http://apdplat.org/word/archive/v1.4.1.zip
mkdir plugins/word
unzip -d plugins/word v1.4.1.zip
 Note: if the elasticsearch version is greater than 5.4.3, for example 5.6.4, change the configuration in the file plugins/word/plugin-descriptor.properties to: elasticsearch.version=5.6.4
	
3,Start ElasticSearch
bin/elasticsearch

4,Test the effect by accessing the following URL in the Chrome browser:
http://localhost:9200/_analyze?analyzer=word&text=Yang Shangchuan is the author of APDPlat application level product development platform

19. Luke plugin:

1,Download http://luke.googlecode.com/files/lukeall-4.0.0-ALPHA.jar (not accessible in mainland China)

2,Download and unzip Java Chinese word segmentation component word-1.0-bin.zip: http://pan.baidu.com/s/1dDziDFz

3,Extract the 4 jar packages inside the unzipped word-1.0-bin/word-1.0 folder into the current folder.
 Then open lukeall-4.0.0-ALPHA.jar with a compression tool such as WinRAR, and drag everything in the current folder
 except the META-INF folder and the .jar, .bat, .html and word.local.conf files into lukeall-4.0.0-ALPHA.jar

4,Execute the command java -jar lukeall-4.0.0-ALPHA.jar to start luke; in the Analysis panel of the Search tab
 you can choose the org.apdplat.word.lucene.ChineseWordAnalyzer tokenizer

5,In the Available analyzers found on the current classpath list of the Plugins tab, you can also choose the
org.apdplat.word.lucene.ChineseWordAnalyzer tokenizer

Note: if you want to integrate another version of the word segmenter yourself, run mvn install in the project root directory to compile the project, then run the command
mvn dependency:copy-dependencies to copy the dependent jar packages; afterwards the target/dependency/ directory will contain all the
 dependencies. Among them, target/dependency/slf4j-api-1.6.4.jar is the logging framework used by the word segmenter,
target/dependency/logback-classic-0.9.28.jar and
target/dependency/logback-core-0.9.28.jar are the logging implementation recommended by the word segmenter, and the configuration file of the logging
 implementation is located at target/classes/logback.xml. target/word-1.3.jar is the main jar package of the word segmenter; if you need
 a custom dictionary, modify the segmenter configuration file target/classes/word.conf

Download the integrated Luke plug-in (applicable to Lucene 4.0.0): lukeall-4.0.0-ALPHA-with-word-1.0.jar

Download the integrated Luke plug-in (applicable to Lucene 4.10.3): lukeall-4.10.3-with-word-1.2.jar

20. Obtaining related words by computing word contexts:

How can we get related words by calculating the context of words?

The context is defined as follows: in a text, the context of any word consists of the N words before it and the N words after it.
Related words are defined as follows: the more similar the contexts of two words are, the more similar, and hence related, the two words are.

The algorithm consists of two steps:

1,Compute the context of every word from a large corpus, and represent each context as a word vector.
2,Transform the problem of computing the similarity of two words into the problem of computing the similarity of the contexts of the two words.
By computing the similarity of the contexts, we obtain the similarity of the words: the more similar two words are, the more related they are.
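As an illustration of step 2, if each context is represented as a sparse word-to-weight map, the cosine similarity of two contexts can be computed as follows. This is a standalone sketch of the idea, not the library's internal implementation:

import java.util.HashMap;
import java.util.Map;

public class ContextSimilarity {
    // Cosine similarity between two sparse context vectors (word -> weight)
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy contexts: the more similar the contexts, the more related the words
        Map<String, Double> lanzhou = new HashMap<>();
        lanzhou.put("Military region", 1.0); lanzhou.put("University", 0.42);
        Map<String, Double> beijing = new HashMap<>();
        beijing.put("Military region", 0.07); beijing.put("University", 0.3);
        System.out.println(cosine(lanzhou, beijing));
    }
}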

The method of use is as follows:

1,Using the built-in corpus of the word segmenter: run the script demo-word-vector-corpus.bat or demo-word-vector-corpus.sh in the project root directory
2,Using your own text content: run the script demo-word-vector-file.bat or demo-word-vector-file.sh in the project root directory

Since the corpus is large, startup takes a long time; please wait patiently. An example follows:
Suppose we want to analyze the related words of Lanzhou; we run the script demo-word-vector-corpus.sh, and after a successful start the command line prompts:

Start initializing model
 Model initialization complete
 You can enter the command sa=cos to specify the similarity algorithm. The available algorithms are:
   1, sa=cos, cosine similarity
   2, sa=edi, edit distance
   3, sa=euc, Euclidean distance
   4, sa=sim, simple common words
   5, sa=jac, Jaccard similarity coefficient
   6, sa=man, Manhattan distance
   7, sa=shh, SimHash + Hamming distance
   8, sa=ja, Jaro distance
   9, sa=jaw, Jaro–Winkler distance
   10, sa=sd, Sørensen–Dice coefficient
 You can enter the command limit=15 to specify the number of results to display
 You can enter the command exit to quit the program
 Enter the word or command to query:

Enter Lanzhou and press enter. The result shows:

Related words of Lanzhou (EditDistanceTextSimilarity):
----------------------------------------------------------
	1,Lanzhou 1.0
	2,Beijing 0.21
	3,Fuzhou 0.2
	4,Taiyuan 0.19
	5,Chengdu 0.17
	6,Xi'an 0.17
	7,Harbin 0.17
	8,Nanning 0.17
	9,Guiyang 0.16
	10,Qingyang 0.15
	11,Shenyang 0.14
	12,Hefei 0.14
	13,Datong 0.14
	14,Lhasa 0.13
	15,Xining 0.13
----------------------------------------------------------
The results shown here are the related words of Lanzhou, each followed by its relevance score.
Lanzhou and Lanzhou are the same word, so the relevance is 100%, which naturally gives a score of 1.

From this result, can we work out why these words are related? Where are the clues?

First, the parts of speech of these words are all nouns;
Second, these words are all place names, names of big cities;
From this we can also see an interesting phenomenon: words with the same part of speech, such as place names, tend to be used in consistent ways.

Related words are derived from the context. The number following each word in a context is its weight, the accumulated value of 1/N.
 Let's look at the contexts of these words:

Lanzhou : [Military region 1.0, Gansu 0.78205127, New area 0.7692308, University 0.42307693, Lanzhou, Gansu 0.41025642, Truck 0.3846154, Xi'an 0.32051283, Our newspaper 0.2948718, Xinhua News Agency 0.2820513, Lanzhou New Area 0.26923078, Hold 0.23076923, Send to 0.21794872, China 0.20512821, Lanzhou 0.20512821, Railway station 0.20512821, Railway 0.17948718, Attend 0.15384616, Xining 0.15384616, Direction 0.15384616, Chengdu 0.14102565, Police 0.14102565, Construction 0.12820514, Municipal Party committee 0.12820514, Come to 0.12820514, One 0.12820514, Center 0.115384616, Refinery 0.102564104, Enter 0.102564104, From 0.102564104, Hold 0.102564104]	
Beijing : [Xinhua News Agency 1.0, Our newspaper 0.7119143, Hold 0.19384204, Shanghai 0.17831326, Time 0.16385542, Railway Bureau 0.1394913, West Station 0.13226238, Youth Daily 0.12717536, Morning news 0.11700134, Municipal Party committee 0.1145917, Region 0.11218206, Hold 0.10200803, City 0.08299866, Current 0.07951807, Come to 0.06961178, Military region 0.06827309, International 0.066398926, Center 0.063453816, Beijing time 0.06184739, People 0.059973225, Work 0.05863454, Metro 0.057563588, Beijing Railway Bureau 0.056492638, Hospital 0.055421688, Fly to 0.05381526, Capital 0.053547524, China 0.053547524, Where 0.05274431, Today 0.052208837, Satellite TV 0.05167336]
Fuzhou : [Railway station 1.0, New area 0.46666667, Fuzhou railway station 0.45555556, Evening news 0.2962963, Reporter 0.2777778, Work 0.27407408, Come to 0.24814814, Citizen 0.23333333, Our newspaper 0.22222222, University 0.21851853, Urban 0.2074074, Municipal Party committee 0.19259259, Hold 0.19259259, Gulou District 0.18518518, Netizen 0.18148148, Reach 0.17037037, To 0.16296296, Current 0.14074074, Branch 0.14074074, One 0.12962963, City 0.12962963, East Street 0.12222222, Fuzhou evening news 0.12222222, Xinhua News Agency 0.11851852, Railway 0.11851852, Hold 0.11481482, Go to 0.11481482, Development 0.11481482, Push 0.11111111, Fuzhou 0.11111111]	 
Taiyuan : [Shanxi 1.0, Taiyuan, Shanxi 0.6136364, Our newspaper 0.39772728, Xinhua News Agency 0.3409091, Railway station 0.26136363, Jinan 0.25, Railway 0.23863636, Beijing 0.22727273, Launch 0.1590909, International 0.1590909, Return 0.14772727, Corundum 0.13636364, From 0.13636364, Publish 0.13636364, Work 0.125, Center 0.125, Municipal Party committee 0.11363637, Bank 0.11363637, Railway Bureau 0.10227273, Xi'an 0.09090909, Group 0.09090909, Public security 0.09090909, To 0.09090909, For example, 0.07954545, Finance 0.07954545, Train ticket 0.07954545, Datong 0.06818182, Shanxi Province 0.06818182, Military division 0.06818182, Leave 0.06818182]
Chengdu : [Business daily 1.0, Chengdu Business Daily 0.4117647, Military region 0.1875, Railway Bureau 0.17830883, Beijing 0.17463236, Our newspaper 0.17095588, Chongqing 0.15441176, Tell 0.15441176, Traffic police 0.14338236, Direction 0.1360294, Reporter 0.13419117, Plain 0.121323526, Sichuan 0.1194853, Changsha 0.11764706, Polytechnic University 0.0992647, From 0.09375, Xinhua News Agency 0.09191176, To 0.090073526, Chengdu Railway Bureau 0.08455882, Railway 0.080882356, Hold 0.07904412, Citizen 0.075367644, Municipal Party committee 0.073529415, Company 0.07169118, Guangzhou 0.07169118, Xi'an 0.0680147, Pixian 0.060661763, Work 0.060661763, Urban 0.05882353, Evening news 0.05882353]
Xi'an : [Railway station 1.0, Incident 0.75, Traffic 0.7058824, Construction 0.5882353, Metro 0.5882353, Xianyang 0.5588235, Come to 0.5294118, Citizen 0.50735295, University 0.5, Railway 0.5, Delegation 0.5, Railway Bureau 0.49264705, Company 0.4852941, Wuhan 0.4632353, Qujiang 0.44117647, Power supply 0.42647058, Xinhua News Agency 0.4117647, Xi'an railway station 0.4117647, Beijing 0.3602941, Jiaotong University 0.3602941, Our newspaper 0.34558824, Xi'an Incident 0.3382353, City 0.31617647, City 0.31617647, Settled in 0.30882353, Municipal Party committee 0.29411766, International 0.2867647, Chengdong 0.2867647, Chengdu 0.2720588, Hold 0.25]
Harbin : [Polytechnic University 1.0, Railway station 0.41584158, Harbin University of Technology 0.36138615, Industry 0.25742576, Direction 0.23762377, Xinhua News Agency 0.20792079, To 0.18811882, Harbin railway station 0.18316832, At 0.17821783, University 0.17326732, Railway Bureau 0.15841584, From 0.15346535, Minimum 0.14356436, Beijing 0.12871288, Our newspaper 0.12376238, Heilongjiang Province 0.12376238, Publish 0.11386139, China 0.10891089, Fly to 0.0990099, Heilongjiang 0.08415841, Shenyang 0.07920792, Project 0.07920792, Near 0.074257426, Municipal Party committee 0.06930693, Aircraft 0.06930693, Shanghai 0.06930693, Candidate 0.06930693, Enter 0.06930693, Stop 0.06930693, Economy 0.06435644]
Nanning : [Guangxi 1.0, Railway Bureau 0.8, Nanning, Guangxi 0.62222224, Our newspaper 0.54444444, Xinhua News Agency 0.36666667, Nanning Railway Bureau 0.31111112, Municipal Party committee 0.26666668, Liuzhou 0.18888889, Guilin 0.17777778, Railway 0.15555556, Xingning District 0.14444445, Come to 0.11111111, To 0.11111111, Go to 0.11111111, Public security 0.11111111, Work 0.11111111, To 0.11111111, City 0.08888889, Beautiful 0.08888889, Hold 0.08888889, Engaged in 0.08888889, Guantang 0.08888889, Property market 0.08888889, Branch office 0.07777778, Nanning Municipal Party committee 0.07777778, Motor car 0.07777778, Occurrence 0.07777778, Hold 0.07777778, Xixiang 0.06666667, Mayor 0.06666667]
Guiyang : [Newspaper 1.0, Chongqing 0.73333335, Xinhua News Agency 0.46666667, Direction 0.43333334, Go to 0.4, Brothers 0.4, City 0.4, Home 0.33333334, Xi'an 0.26666668, Chengdu 0.26666668, Street 0.26666668, Evening news 0.26666668, Irrelevant 0.26666668, Hangzhou 0.23333333, Involving 0.2, And 0.2, City 0.2, Netizen 0.2, Zhengzhou 0.16666667, Nanning 0.16666667, Changsha 0.16666667, Wuhan 0.16666667, Stall 0.16666667, Municipal Party committee 0.13333334, Kunming 0.13333334, Anshun 0.13333334, Come to 0.13333334, Hegemony 0.13333334, Top four 0.13333334, Railway 0.13333334]
Qingyang : [Gansu 1.0, Qingyang, Gansu 0.8, Gansu Province 0.4, Region 0.4, Old area 0.3, Forest 0.2, Pingliang 0.2, Zhenyuan County 0.1, Revolution 0.1, Han Fengting 0.1, Traffic 0.1, Lanzhou forest brigade 0.1, Brigade 0.1, Lanzhou 0.1, Xifeng 0.1, Send 0.1, One 0.1, License plate 0.1, From 0.1]
Shenyang : [Military region 1.0, Evening news 0.5123967, Direction 0.3181818, Our newspaper 0.27272728, Shenyang Evening News 0.23553719, Xinhua News Agency 0.20661157, Shenyang Military Region 0.18595041, Military region team 0.15289256, Sea lions 0.14876033, Automation office 0.14049587, This time 0.14049587, Economic Zone 0.1322314, China 0.12809917, Dalian 0.12809917, Uncle 0.12809917, Municipal Party committee 0.12396694, One family 0.11570248, High speed 0.11570248, International 0.11157025, Train ticket 0.11157025, Faku 0.10743801, University 0.10330579, Changchun 0.10330579, Direct to 0.09917355, Shenzhen 0.09090909, Shanghai 0.08677686, Reporter 0.08677686, Sea lion 0.08264463, Aunt 0.08264463, Two digit 0.08264463]
Hefei : [Railway station 1.0, Citizen 0.8181818, Urban 0.53333336, Property market 0.4848485, Hefei railway station 0.4121212, Railway 0.38787878, Anhui 0.36969697, Reach 0.36363637, Market 0.34545454, Last week 0.3030303, Wuhu 0.2969697, Hold 0.28484848, Reporter 0.27272728, Become 0.27272728, Come to 0.26666668, Hefei, Anhui 0.24242425, City 0.24242425, Economic circle 0.24242425, Bus 0.24242425, Current 0.23636363, Our newspaper 0.21818182, This year 0.21818182, Takeoff 0.21818182, Car 0.21212122, Substance 0.2060606, Hefei property market 0.2060606, Airport 0.2060606, Industry 0.19393939, Title 0.18181819, Wild 0.16969697]
Datong : [University 1.0, Railway 0.52380955, Shanxi 0.5, Securities 0.33333334, Datong University 0.33333334, Shanxi Province 0.23809524, This time 0.23809524, Datong, Shanxi 0.1904762, World 0.1904762, World Datong 0.1904762, Street 0.16666667, Taiyuan 0.14285715, Municipal Party committee 0.14285715, Shanghai 0.14285715, Police station 0.14285715, Public security department 0.14285715, Japan 0.14285715, Forward 0.14285715, Yuncheng 0.11904762, Military division 0.0952381, Mining Bureau 0.0952381, Primary school 0.0952381, Attend 0.0952381, Item 0.0952381, Secondary school 0.0952381, Water plant 0.0952381, Depot 0.0952381, To 0.0952381, Datong securities 0.0952381, Campaign 0.071428575]
Lhasa : [Railway station 1.0, Xinhua News Agency 0.91935486, Tibet 0.7580645, Urban 0.61290324, Our newspaper 0.58064514, Hold 0.5645161, Customs 0.5483871, City 0.48387095, Lhasa railway station 0.4032258, Municipal Party committee 0.38709676, Chengdu 0.37096775, Gongga 0.3548387, Opening 0.32258064, Publish 0.30645162, Lhasa, Tibet 0.2580645, Meeting 0.2580645, Airport 0.22580644, Closing 0.22580644, Grand 0.22580644, Nyingchi 0.20967741, Hold 0.19354838, Open 0.19354838, Business department 0.19354838, Citizen 0.17741935, Market 0.17741935, Economy 0.17741935, Center 0.17741935, Air 0.17741935, Become 0.17741935, People 0.16129032]
Xining : [Xinhua News Agency 1.0, Shanghai 0.8235294, Lanzhou 0.3529412, Rolling 0.3529412, Our newspaper 0.29411766, Qinghai 0.29411766, Investigation 0.23529412, Dangjie 0.23529412, Special steel 0.1764706, Direction 0.1764706, Branch 0.1764706, Asking for bribes 0.1764706, Beijing 0.14705883, But 0.14705883, Lhasa 0.11764706, We 0.11764706, Title 0.11764706, Traffic police 0.11764706, Delegation 0.11764706, Process 0.0882353, Yinchuan 0.0882353, Ticket 0.0882353, Preparation 0.0882353, Transfer 0.0882353, Attend 0.0882353, January 0.05882353, Test Bureau 0.05882353, February 0.05882353, Region 0.05882353, Serious 0.05882353]	

Finally, let's look at the related words of Lanzhou computed with the ten similarity algorithms:

----------------------------------------------------------
Related words of Lanzhou (CosineTextSimilarity):
	1,Lanzhou 1.0
	2,Shenyang 0.5
	3,Beijing Military Region 0.47
	4,Logistics Department 0.46
	5,Shenyang Military Region 0.46
	6,General hospital 0.46
	7,Xinjiang Military Region 0.46
	8,Commander 0.42
	9,Lanzhou, Gansu 0.42
	10,Lanzhou New Area 0.42
	11,A division 0.39
	12,Zhengpu port 0.38
	13,Xixian 0.38
	14,Tianshui 0.37
	15,Zheng Dong 0.37
 Elapsed time: 25 seconds, 572 milliseconds
----------------------------------------------------------
Related words of Lanzhou (EditDistanceTextSimilarity):
	1,Lanzhou 1.0
	2,Beijing 0.21
	3,Fuzhou 0.2
	4,Taiyuan 0.19
	5,Chengdu 0.17
	6,Nanning 0.17
	7,Xi'an 0.17
	8,Harbin 0.17
	9,Guiyang 0.16
	10,Qingyang 0.15
	11,Hefei 0.14
	12,Datong 0.14
	13,Shenyang 0.14
	14,Perth 0.13
	15,Lhasa 0.13
 Elapsed time: 44 seconds, 253 milliseconds
----------------------------------------------------------
Related words of Lanzhou (EuclideanDistanceTextSimilarity):
	1,Lanzhou 1.0
	2,Logistics Department 0.37
	3,Beijing Military Region 0.37
	4,Xinjiang Military Region 0.37
	5,Shenyang 0.37
	6,Shenyang Military Region 0.37
	7,General hospital 0.37
	8,Shanghai Pudong New Area 0.36
	9,Zhengpu port 0.36
	10,Pudong New Area 0.36
	11,Lanzhou, Gansu 0.36
	12,Xixian 0.36
	13,Xixian new area 0.36
	14,Zhengding new area 0.36
	15,Commander 0.36
 Elapsed time: 24 seconds, 710 milliseconds
----------------------------------------------------------
Related words of Lanzhou (SimpleTextSimilarity):
	1,Lanzhou 1.0
	2,Fuzhou 0.36
	3,Xi'an 0.33
	4,Li Hongqi 0.33
	5,China Financial Information Center 0.33
	6,Nantes 0.32
	7,Cartagena 0.32
	8,Harbin 0.3
	9,Wuhan 0.3
	10,Dacry 0.3
	11,Chuxiong 0.29
	12,Zhu Mengkui 0.29
	13,Yue Feifei 0.29
	14,Changsha 0.28
	15,LV Guoqing 0.28
 Elapsed time: 21 seconds, 918 milliseconds
----------------------------------------------------------
Related words of Lanzhou (JaccardTextSimilarity):
	1,Lanzhou 1.0
	2,Fuzhou 0.22
	3,Xi'an 0.2
	4,Harbin 0.18
	5,Beijing 0.18
	6,Wuhan 0.18
	7,Chengdu 0.18
	8,Changsha 0.15
	9,Taiyuan 0.15
	10,Guiyang 0.15
	11,Shenyang 0.15
	12,Guangzhou 0.15
	13,Lhasa 0.15
	14,Nanchang 0.15
	15,Changchun 0.13
 Elapsed time: 19 seconds, 717 milliseconds
----------------------------------------------------------
Related words of Lanzhou (ManhattanDistanceTextSimilarity):
	1,Lanzhou 1.0
	2,Shanghai Pudong New Area 0.11
	3,Shaanxi Xixian new area 0.11
	4,Lanzhou, Gansu 0.11
	5,Beijing Military Region 0.11
	6,Xinjiang Military Region 0.11
	7,Xixian 0.11
	8,Zhengding new area 0.11
	9,Tianfu new area 0.11
	10,Shenyang Military Region 0.11
	11,National new area 0.11
	12,Lanzhou New Area 0.11
	13,Xiake 0.1
	14,Threat theory 0.1
	15,One or two months 0.1
 Elapsed time: 23 seconds, 857 milliseconds
----------------------------------------------------------
Related words of Lanzhou (SimHashPlusHammingDistanceTextSimilarity):
	1,Lanzhou 1.0
	2,Fish water 0.96
	3,Feng dao 0.95
	4,Press release 0.95
	5,Science 0.95
	6,Property company 0.95
	7,Active serviceman 0.95
	8,Who 0.95
	9,Zhang fu 0.94
	10,Announcement 0.94
	11,Information publishing 0.94
	12,Initiative 0.94
	13,Liquid medicine 0.94
	14,Archaeological excavation 0.94
	15,Public release 0.94
 Elapsed time: 5 minutes, 57 seconds, 339 milliseconds
----------------------------------------------------------
Related words of Lanzhou (JaroDistanceTextSimilarity):
	1,Lanzhou 1.0
	2,Changsha 0.49
	3,Harbin 0.49
	4,Fuzhou 0.48
	5,Taiyuan 0.47
	6,Qingyang 0.46
	7,Jinan 0.46
	8,Beijing 0.45
	9,Chengdu 0.45
	10,Zhangjiaming 0.45
	11,Xi'an 0.45
	12,Sun Yong 0.45
	13,Chuxiong 0.44
	14,Fuzhou railway station 0.44
	15,Nanning 0.44
 Elapsed time: 12 seconds, 718 milliseconds
----------------------------------------------------------
Related words of Lanzhou (JaroWinklerDistanceTextSimilarity):
	1,Lanzhou 1.0
	2,Lhasa 0.56
	3,Nanning 0.55
	4,Imperial court 0.55
	5,Public judgment 0.54
	6,Samond 0.53
	7,World class 0.53
	8,Lakeside 0.53
	9,Large and small 0.52
	10,General election 0.52
	11,Seventh session 0.52
	12,Bake 0.51
	13,Wuping County 0.51
	14,Moscow 0.51
	15,Retraining 0.51
 Elapsed time: 16 seconds, 723 milliseconds
----------------------------------------------------------
Related words of Lanzhou (SørensenDiceCoefficientTextSimilarity):
	1,Lanzhou 1.0
	2,Fuzhou 0.37
	3,Xi'an 0.33
	4,Harbin 0.3
	5,Beijing 0.3
	6,Wuhan 0.3
	7,Chengdu 0.3
	8,Changsha 0.27
	9,Taiyuan 0.27
	10,Guiyang 0.27
	11,Shenyang 0.27
	12,Guangzhou 0.27
	13,Lhasa 0.27
	14,Nanchang 0.27
	15,Changchun 0.23
 Elapsed time: 19 seconds, 852 milliseconds
----------------------------------------------------------

21. Word frequency statistics:

The org.apdplat.word.WordFrequencyStatistics class provides the word frequency statistics function

The command-line script is invoked as follows:

Write the text requiring word frequency statistics to a file: text.txt
chmod +x wfs.sh && ./wfs.sh -textFile=text.txt -statisticsResultFile=statistics-result.txt
 After the program finishes, open the file statistics-result.txt to view the word frequency statistics

The calling methods in the program are as follows:

//Word frequency statistics settings
WordFrequencyStatistics wordFrequencyStatistics = new WordFrequencyStatistics();
wordFrequencyStatistics.setRemoveStopWord(false);
wordFrequencyStatistics.setResultPath("word-frequency-statistics.txt");
wordFrequencyStatistics.setSegmentationAlgorithm(SegmentationAlgorithm.MaxNgramScore);
//Segment the text
wordFrequencyStatistics.seg("It's raining tomorrow to combine molecules. Tomorrow there's a course on molecules and atoms. If it rains, I have to go to class");
//Output word frequency statistics
wordFrequencyStatistics.dump();
//Prepare a file to segment
Files.write(Paths.get("text-to-seg.txt"), Arrays.asList("word segmentation is a Java distributed Chinese word segmentation component, which provides a variety of dictionary-based segmentation algorithms and uses an ngram model to eliminate ambiguity."));
//Clear previous statistics
wordFrequencyStatistics.reset();
//Segment the file
wordFrequencyStatistics.seg(new File("text-to-seg.txt"), new File("text-seg-result.txt"));
//Output word frequency statistics
wordFrequencyStatistics.dump("file-seg-statistics-result.txt");

Word frequency statistics of the first sentence:

1,Rain 2
2,Tomorrow 2
3,Molecule 2
4,Course 1
5,Lecture 1
6,Combination 1
7,Atom 1
8,Go 1
9,Cheng 1
10,About 1
11,And 1
12,Also 1
13,Have 1
14,Of 1
15,1

Word frequency statistics of the second sentence:

1,Participle 2
2,Of 2
3,Based on 1
4,word 1
5,Component 1
6,Dictionary 1
7,ngram 1
8,Multiple 1
9,Implementation 1
10,And 1
11,Utilization 1
12,Disambiguation 1
13,Chinese word segmentation 1
14,Algorithm 1
15,Yes 1
16,Distributed 1
17,1
18,Provide 1
19,Model 1
20,Come 1
21,One 1
22,Java 1	

22. Text similarity:

The word segmentation component provides a variety of text similarity computation methods:

Method 1: cosine similarity, which evaluates the similarity of two vectors by calculating the cosine of the angle between them

Implementation class: org.apdplat.word.analysis.CosineTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new CosineTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.67
 I love shopping and he is a hacker similarity score: 0.0
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.0
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 2: simple common words, which evaluates the similarity of two documents by dividing the total number of characters of the words they share by the number of characters of the longer document

Implementation class: org.apdplat.word.analysis.SimpleTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new SimpleTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score between I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.5
 I love shopping and he is a hacker similarity score: 0.0
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.0
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 3: edit distance, which evaluates the similarity of two strings by calculating the minimum number of edit operations required to convert one string into the other

Implementation class: org.apdplat.word.analysis.EditDistanceTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new EditDistanceTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.5
 I love shopping and he is a hacker similarity score: 0.0
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.0
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 4: SimHash + Hamming distance, which first uses SimHash to map texts of different lengths to equal-length fingerprints, and then calculates the Hamming distance between them

Implementation class: org.apdplat.word.analysis.SimHashPlusHammingDistanceTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new SimHashPlusHammingDistanceTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.95
 I love shopping and he is a hacker similarity score: 0.83
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.86
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 5: Jaccard similarity coefficient, which evaluates the similarity of two sets by dividing the size of their intersection by the size of their union

Implementation class: org.apdplat.word.analysis.JaccardTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new JaccardTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.5
 I love shopping and he is a hacker similarity score: 0.0
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.0
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 6: Euclidean distance, which evaluates the similarity of two vectors by calculating the distance between the two points they define

Implementation class: org.apdplat.word.analysis.EuclideanDistanceTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new EuclideanDistanceTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.41
 I love shopping and he is a hacker similarity score: 0.29
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.29
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 7: Manhattan distance, which evaluates the similarity of two points by calculating the sum of the absolute differences of their coordinates in the standard coordinate system

Implementation class: org.apdplat.word.analysis.ManhattanDistanceTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new ManhattanDistanceTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.33
 I love shopping and he is a hacker similarity score: 0.14
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.14
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 8: Jaro distance, a type of edit distance

Implementation class: org.apdplat.word.analysis.JaroDistanceTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new JaroDistanceTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

Similarity score of I love shopping and I love shopping: 1.0
 I love shopping and I love reading similarity score: 0.67
 I love shopping and he is a hacker similarity score: 0.0
 Similarity score of I love reading and I love reading: 1.0
 I love reading and he is a hacker similarity score: 0.0
 Similarity score between he is a hacker and he is a hacker: 1.0

Method 9: Jaro–Winkler distance, an extension of the Jaro distance

Implementation class: org.apdplat.word.analysis.JaroWinklerDistanceTextSimilarity

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new JaroWinklerDistanceTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

I love shopping and I love shopping similarity score: 1.0
I love shopping and I love reading similarity score: 0.73
I love shopping and he is a hacker similarity score: 0.0
I love reading and I love reading similarity score: 1.0
I love reading and he is a hacker similarity score: 0.0
He is a hacker and he is a hacker similarity score: 1.0

Method 10: Sørensen–Dice coefficient, which evaluates the similarity of two sets as twice the size of their intersection divided by the sum of their sizes

Implementation class: org.apdplat.word.analysis.SørensenDiceCoefficientTextSimilarity
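
In symbols, assuming (as the scores below suggest) that the two sets are the words obtained by segmenting each text:

	dice(A, B) = 2 * |A ∩ B| / (|A| + |B|)

The score ranges from 0 (no shared words) to 1 (identical word sets).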

The usage is as follows:

String text1 = "I love shopping";
String text2 = "I love reading";
String text3 = "He's a hacker";
TextSimilarity textSimilarity = new SørensenDiceCoefficientTextSimilarity();
double score1pk1 = textSimilarity.similarScore(text1, text1);
double score1pk2 = textSimilarity.similarScore(text1, text2);
double score1pk3 = textSimilarity.similarScore(text1, text3);
double score2pk2 = textSimilarity.similarScore(text2, text2);
double score2pk3 = textSimilarity.similarScore(text2, text3);
double score3pk3 = textSimilarity.similarScore(text3, text3);
System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1);
System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2);
System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3);
System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2);
System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3);
System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);

The operation results are as follows:

I love shopping and I love shopping similarity score: 1.0
I love shopping and I love reading similarity score: 0.67
I love shopping and he is a hacker similarity score: 0.0
I love reading and I love reading similarity score: 1.0
I love reading and he is a hacker similarity score: 0.0
He is a hacker and he is a hacker similarity score: 1.0
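
As a sanity check, here is a minimal, self-contained sketch of that formula over plain word lists. It is illustrative only: the hand-written lists stand in for the segmentation step, and this is not the library's actual implementation.

import java.util.*;

public class DiceSketch {
    // Sørensen–Dice coefficient over two word lists:
    // 2 * |A ∩ B| / (|A| + |B|)
    static double dice(List<String> a, List<String> b) {
        Set<String> setA = new HashSet<>(a);
        Set<String> setB = new HashSet<>(b);
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        return 2.0 * intersection.size() / (setA.size() + setB.size());
    }

    public static void main(String[] args) {
        // Hand-segmented stand-ins for WordSegmenter.seg(...) output
        List<String> text1 = Arrays.asList("I", "love", "shopping");
        List<String> text2 = Arrays.asList("I", "love", "reading");
        // Two of three words shared: 2 * 2 / (3 + 3) ≈ 0.67, as in the results above
        System.out.println(dice(text1, text2));
    }
}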

23. Judging the probability that a sentence is meaningful:

Run the following command:
unix-like:
	chmod +x sentence-identify.sh && ./sentence-identify.sh
windows:
	./sentence-identify.bat
Running the org.apdplat.word.analysis.SentenceIdentify class produces results such as the following:

1. sentence: I am a man and you are a woman, probability: 0.71428573
2. sentence: I am a person, probability: 0.6666667
3. sentence: I love reading, probability: 0.5
4. sentence: I love learning., probability: 0.5
5. sentence: Fati's room, martial arts master of your generation, change to the entry point, probability: 0.2857143
6. sentence: General supervision office of high voltage line tower with apparent porosity, probability: 0.2857143
7. sentence: Wang Jiejun reports that it's not impossible to perform a ridge wall with hay and Vera, probability: 0.25
8. sentence: At eight or nine o'clock, the scenery of the mountain has experienced changes in the world, and pulushenko's Huaihe town is happy to play a simulated flight, probability: 0.22222222
9. sentence: Level mission area book of the dead barnar has no brain to pull people's hearts and lungs to review lessons Lin Youli typhoon shelter, probability: 0.2
10. sentence: Participating party: Journal of Botany Bai Shanye runs wildly in the shadow and rides the white horse wuzishan castle. He hesitates at Yueyang airport, probability: 0.2

You can then type a sentence at the command-line prompt and press Enter to get its score.

For example, type a sentence and press Enter: Strive for China's rise
The program returns the following results:
Random words: [by, China, rise, and, strive, struggle]
Generated sentence: Strive for China's rise
Sentence probability: 1.0

For example, type a sentence and press Enter: Is the memory of the human brain stored in bioelectricity or in cells?
The program returns the following results:
Random words: [human brain, of, memory, yes, preservation, stay, biology, electric, upper, still, stay, cells, in]
Generated sentence: Is the memory of the human brain stored in bioelectricity or in cells?
Sentence probability: 0.8333333
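
The scoring internals are not documented here, but the probabilities above are consistent with a simple bigram-coverage heuristic: the fraction of adjacent word pairs in the segmented sentence that are attested in a reference corpus. Below is a minimal sketch of that idea; it is hypothetical, and the actual SentenceIdentify logic may differ.

import java.util.*;

public class SentenceScoreSketch {
    // Hypothetical heuristic: score a segmented sentence by the fraction
    // of its adjacent word pairs found in a known-bigram set.
    static double score(List<String> words, Set<String> knownBigrams) {
        if (words.size() < 2) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < words.size() - 1; i++) {
            if (knownBigrams.contains(words.get(i) + " " + words.get(i + 1))) {
                hits++;
            }
        }
        return (double) hits / (words.size() - 1);
    }

    public static void main(String[] args) {
        // Toy bigram set standing in for real corpus statistics
        Set<String> bigrams = new HashSet<>(Arrays.asList("I love", "love reading"));
        // Both adjacent pairs are attested, so the score is 1.0
        System.out.println(score(Arrays.asList("I", "love", "reading"), bigrams));
    }
}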

Evaluation of the word segmentation algorithms:

1. word segmentation, maximum Ngram score algorithm:
Segmentation speed: 370.9714 characters/millisecond
Line perfect rate: 66.55%  Line error rate: 33.44%  Total lines: 2533709  Perfect lines: 1686210  Error lines: 847499
Character perfect rate: 60.94%  Character error rate: 39.05%  Total characters: 28374490  Perfect characters: 17293964  Error characters: 11080526
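
The rates are simple ratios over the test corpus. For the entry above:

	line perfect rate = 1686210 / 2533709 ≈ 66.55%
	character perfect rate = 17293964 / 28374490 ≈ 60.94%

(The percentages appear to be truncated rather than rounded, which is why each pair of complementary rates sums to 99.99%.)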

2. word segmentation, minimum word count algorithm:
Segmentation speed: 330.1586 characters/millisecond
Line perfect rate: 65.67%  Line error rate: 34.32%  Total lines: 2533709  Perfect lines: 1663958  Error lines: 869751
Character perfect rate: 60.12%  Character error rate: 39.87%  Total characters: 28374490  Perfect characters: 17059641  Error characters: 11314849

3. word segmentation, full-segmentation algorithm:
Segmentation speed: 62.960262 characters/millisecond
Line perfect rate: 57.2%  Line error rate: 42.79%  Total lines: 2533709  Perfect lines: 1449288  Error lines: 1084421
Character perfect rate: 47.95%  Character error rate: 52.04%  Total characters: 28374490  Perfect characters: 13605742  Error characters: 14768748

4. word segmentation, bidirectional maximum-minimum matching algorithm:
Segmentation speed: 462.87158 characters/millisecond
Line perfect rate: 53.06%  Line error rate: 46.93%  Total lines: 2533709  Perfect lines: 1344624  Error lines: 1189085
Character perfect rate: 43.07%  Character error rate: 56.92%  Total characters: 28374490  Perfect characters: 12221610  Error characters: 16152880

5. word segmentation, bidirectional minimum matching algorithm:
Segmentation speed: 967.68604 characters/millisecond
Line perfect rate: 46.34%  Line error rate: 53.65%  Total lines: 2533709  Perfect lines: 1174276  Error lines: 1359433
Character perfect rate: 36.07%  Character error rate: 63.92%  Total characters: 28374490  Perfect characters: 10236574  Error characters: 18137916

6. word segmentation, bidirectional maximum matching algorithm:
Segmentation speed: 661.148 characters/millisecond
Line perfect rate: 46.18%  Line error rate: 53.81%  Total lines: 2533709  Perfect lines: 1170075  Error lines: 1363634
Character perfect rate: 35.65%  Character error rate: 64.34%  Total characters: 28374490  Perfect characters: 10117122  Error characters: 18257368

7. word segmentation, forward maximum matching algorithm:
Segmentation speed: 1567.1318 characters/millisecond
Line perfect rate: 41.88%  Line error rate: 58.11%  Total lines: 2533709  Perfect lines: 1061189  Error lines: 1472520
Character perfect rate: 31.35%  Character error rate: 68.64%  Total characters: 28374490  Perfect characters: 8896173  Error characters: 19478317

8. word segmentation, reverse maximum matching algorithm:
Segmentation speed: 1232.6017 characters/millisecond
Line perfect rate: 41.69%  Line error rate: 58.3%  Total lines: 2533709  Perfect lines: 1056515  Error lines: 1477194
Character perfect rate: 30.98%  Character error rate: 69.01%  Total characters: 28374490  Perfect characters: 8792532  Error characters: 19581958

9. word segmentation, reverse minimum matching algorithm:
Segmentation speed: 1936.9575 characters/millisecond
Line perfect rate: 41.42%  Line error rate: 58.57%  Total lines: 2533709  Perfect lines: 1049673  Error lines: 1484036
Character perfect rate: 31.34%  Character error rate: 68.65%  Total characters: 28374490  Perfect characters: 8893622  Error characters: 19480868

10. word segmentation, forward minimum matching algorithm:
Segmentation speed: 2228.9465 characters/millisecond
Line perfect rate: 36.7%  Line error rate: 63.29%  Total lines: 2533709  Perfect lines: 930069  Error lines: 1603640
Character perfect rate: 26.72%  Character error rate: 73.27%  Total characters: 28374490  Perfect characters: 7583741  Error characters: 20790749

 

Topics: NLP