The Git repository of the word segmentation project can be hard to reach, so its content is reproduced below for easy viewing.
How to use word segmentation:
1. Quick experience
Run the script demo-word.bat in the project root directory to quickly try out the segmentation effect.
    usage: command [text] [input] [output]
The optional values of command are: demo, text, file. For example:
    demo
    text Yang Shangchuan is the author of the APDPlat application-level product development platform
    file d:/text.txt d:/word.txt
    exit
2. Word segmentation of text
Remove stop words:
    List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");
Keep stop words:
    List<Word> words = WordSegmenter.segWithStopWords("Yang Shangchuan is the author of the APDPlat application-level product development platform");
    System.out.println(words);
Output:
    Remove stop words: [Yang Shangchuan, apdplat, application-level, product, development platform, author]
    Keep stop words: [Yang Shangchuan, is, apdplat, application-level, product, development platform, of, author]
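For reference, a complete minimal program putting the above together might look like this (the import paths org.apdplat.word.WordSegmenter and org.apdplat.word.segmentation.Word and the getText() accessor are assumptions based on the library's conventions, so verify them against the word-x.x.jar you use):

    import java.util.List;
    import org.apdplat.word.WordSegmenter;     // assumed package location
    import org.apdplat.word.segmentation.Word; // assumed package location

    public class QuickStart {
        public static void main(String[] args) {
            // Segment a sentence with stop words removed
            List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");
            // Print the whole list, then each word on its own line
            System.out.println(words);
            for (Word word : words) {
                System.out.println(word.getText()); // getText() is an assumed accessor
            }
        }
    }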
3. Word segmentation of documents
    String input = "d:/text.txt";
    String output = "d:/word.txt";
Remove stop words:
    WordSegmenter.seg(new File(input), new File(output));
Keep stop words:
    WordSegmenter.segWithStopWords(new File(input), new File(output));
4. Custom configuration file
The default configuration file is word.conf on the classpath, packaged inside word-x.x.jar. The custom configuration file is word.local.conf on the classpath, which is provided by the user. If a custom configuration item has the same name as a default one, the custom item overrides the default. The configuration files are encoded in UTF-8.
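For example, a minimal word.local.conf that overrides two default items could look like this (the item names dic.path and stopwords.path are taken from the sections below; the paths are placeholders to adapt):

    # word.local.conf (UTF-8 encoded), placed on the classpath
    # Override the dictionary path
    dic.path=classpath:dic.txt,d:/custom_dic
    # Override the stop-word dictionary path
    stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic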
5. Custom user dictionary
A user-defined dictionary consists of one or more folders or files; absolute or relative paths can be used. The dictionary is composed of multiple dictionary files encoded in UTF-8. Each dictionary file is a text file in which one line represents one word. The paths can be specified through system properties or configuration files, with multiple paths separated by commas; dictionary files under the classpath need the prefix classpath: before the relative path.
There are three ways to specify the paths:
Method 1: programmatically (highest priority):
    WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
    DictionaryFactory.reload(); // after changing the dictionary path, reload the dictionary
Method 2: Java virtual machine startup parameter (medium priority):
    java -Ddic.path=classpath:dic.txt,d:/custom_dic
Method 3: configuration file (lowest priority):
    Use the file word.local.conf under the classpath to specify the configuration:
    dic.path=classpath:dic.txt,d:/custom_dic
If nothing is specified, the dic.txt dictionary file under the classpath is used by default.
In addition, the dictionary can also be maintained in code at runtime, as follows:
    // Single operations
    // Add a custom word
    DictionaryFactory.getDictionary().add("Yang Shangchuan");
    // Remove a custom word
    DictionaryFactory.getDictionary().remove("Liu Shishi");
    // Batch operations
    List<String> words = new ArrayList<>();
    words.add("Lau Andy");
    words.add("Jing Tian");
    words.add("Zhao Liying");
    // Add a batch of custom words
    DictionaryFactory.getDictionary().addAll(words);
    // Remove a batch of custom words
    DictionaryFactory.getDictionary().removeAll(words);
6. Custom stop-word dictionary
The usage is similar to that of the custom user dictionary. The configuration item is: stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic
7. Automatically detect dictionary changes
Changes to the user-defined dictionary and the user-defined stop-word dictionary are detected automatically. This covers files and folders under the classpath as well as absolute and relative paths outside the classpath. For example:
    classpath:dic.txt,classpath:custom_dic_dir,d:/dic_more.txt,d:/DIC_DIR,D:/DIC2_DIR,my_dic_dir,my_dic_file.txt
    classpath:stopwords.txt,classpath:custom_stopwords_dic_dir,d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt
8. Explicitly specify word segmentation algorithm
When segmenting text, you can explicitly specify a particular segmentation algorithm, for example:
    WordSegmenter.seg("APDPlat application-level product development platform", SegmentationAlgorithm.BidirectionalMaximumMatching);
The optional values of SegmentationAlgorithm are:
    Forward maximum matching: MaximumMatching
    Reverse maximum matching: ReverseMaximumMatching
    Forward minimum matching: MinimumMatching
    Reverse minimum matching: ReverseMinimumMatching
    Bidirectional maximum matching: BidirectionalMaximumMatching
    Bidirectional minimum matching: BidirectionalMinimumMatching
    Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching
    Full segmentation: FullSegmentation
    Minimal word count: MinimalWordCount
    Maximum Ngram score: MaxNgramScore
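To compare how the algorithms behave on the same text, you can iterate over the enumeration; a small sketch (the import paths are assumptions based on the library's conventions):

    import org.apdplat.word.WordSegmenter;                      // assumed package location
    import org.apdplat.word.segmentation.SegmentationAlgorithm; // assumed package location

    public class AlgorithmComparison {
        public static void main(String[] args) {
            String text = "APDPlat application-level product development platform";
            // Segment the same text with every available algorithm and print the result
            for (SegmentationAlgorithm algorithm : SegmentationAlgorithm.values()) {
                System.out.println(algorithm.name() + " : " + WordSegmenter.seg(text, algorithm));
            }
        }
    }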
9. Word segmentation effect evaluation
Run the script evaluation.bat in the project root directory to evaluate the segmentation effect. The test text used in the evaluation has 2,533,709 lines and 28,374,490 characters in total. The evaluation results are located in the target/evaluation directory:
    corpus-text.txt is the manually annotated segmented text, with words separated by spaces
    test-text.txt is the test text, obtained by splitting corpus-text.txt at punctuation into multiple lines
    standard-text.txt is the manually annotated text corresponding to the test text, used as the standard for judging whether the segmentation is correct
    result-text-***.txt, where *** is the name of a segmentation algorithm, is the segmentation result produced by word
    perfect-result-***.txt, where *** is the name of a segmentation algorithm, is the text whose segmentation result is completely consistent with the manual annotation standard
    wrong-result-***.txt, where *** is the name of a segmentation algorithm, is the text whose segmentation result is inconsistent with the manual annotation standard
A small sketch of how a perfect-match rate over these files can be computed follows.
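The following sketch computes the proportion of lines on which one algorithm's result matches the manual annotation exactly (a conceptual example based on the file layout described above, not the project's own evaluation code; the algorithm name in the file name is a placeholder):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class EvaluationSketch {
        public static void main(String[] args) throws Exception {
            List<String> standard = Files.readAllLines(Paths.get("target/evaluation/standard-text.txt"));
            List<String> result = Files.readAllLines(Paths.get("target/evaluation/result-text-BidirectionalMaximumMatching.txt"));
            int perfect = 0;
            int total = Math.min(standard.size(), result.size());
            for (int i = 0; i < total; i++) {
                // A line is perfect only if it matches the manual annotation exactly
                if (standard.get(i).trim().equals(result.get(i).trim())) {
                    perfect++;
                }
            }
            System.out.println("Perfect match rate: " + (100.0 * perfect / total) + "%");
        }
    }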
10. Distributed Chinese word segmentation
1. In the custom configuration file word.conf or word.local.conf, set all *.path configuration items to HTTP resources and specify the redis.* configuration items:
    # Dictionary
    dic.path=http://localhost:8080/word_web/resources/dic.txt
    # Part-of-speech tagging data
    part.of.speech.dic.path=http://localhost:8080/word_web/resources/part_of_speech_dic.txt
    # Part-of-speech description data
    part.of.speech.des.path=http://localhost:8080/word_web/resources/part_of_speech_des.txt
    # Bigram model
    bigram.path=http://localhost:8080/word_web/resources/bigram.txt
    # Trigram model
    trigram.path=http://localhost:8080/word_web/resources/trigram.txt
    # Stop-word dictionary
    stopwords.path=http://localhost:8080/word_web/resources/stopwords.txt
    # Punctuation marks used to split text
    punctuation.path=http://localhost:8080/word_web/resources/punctuation.txt
    # Surnames (the Hundred Family Surnames)
    surname.path=http://localhost:8080/word_web/resources/surname.txt
    # Quantifiers
    quantifier.path=http://localhost:8080/word_web/resources/quantifier.txt
    # Whether to use the redis publish/subscribe service to detect HTTP resource changes in real time
    redis.enable=false
    # redis host
    redis.host=localhost
    # redis port
    redis.port=6379
2. Configure and start the redis server. All segmenters subscribe to the redis server; when the redis server receives a user instruction to add or delete a resource, it notifies all segmenters to perform the corresponding operation.
3. Configure and start the web server that provides the HTTP resources, i.e. deploy the project https://github.com/ysc/word_web to port 8080 of tomcat.
    // Notify all segmenters to add the word "Yang Shangchuan"
    http://localhost:8080/word_web/admin/dic.jsp?action=add&dic=Yang Shangchuan
    // Notify all segmenters to delete the word "notebook"
    http://localhost:8080/word_web/admin/dic.jsp?action=remove&dic=notebook
After dic.jsp receives the user's request, it delivers the message to the redis server, and the redis server publishes the message to all subscribed segmenters.
11. Part of speech tagging
Take the word segmentation result as the input parameter and call the process method of the PartOfSpeechTagging class; the part of speech is saved in the partOfSpeech field of the Word class. For example:
    List<Word> words = WordSegmenter.segWithStopWords("I love China");
    System.out.println("Without part-of-speech tags: "+words);
    // Part-of-speech tagging
    PartOfSpeechTagging.process(words);
    System.out.println("With part-of-speech tags: "+words);
Output:
    Without part-of-speech tags: [I, love, China]
    With part-of-speech tags: [I/r, love/v, China/ns]
12. Refine
Let's look at a segmentation example:
    List<Word> words = WordSegmenter.segWithStopWords("China's working class and the broad working masses should unite more closely around the Party Central Committee");
    System.out.println(words);
The result is as follows:
    [our country, working class, and, vast, working masses, want, more, close together, land, unite, stay, the Central Committee of the Communist Party of China (CPC), around]
Suppose the segmentation result we want is:
    [our country, worker, class, and, vast, labour, masses, want, more, close together, land, unite, stay, the Central Committee of the Communist Party of China (CPC), around]
That is, "working class" should be subdivided into "worker" and "class", and "working masses" into "labour" and "masses". What should we do? We can add the following content to the file specified by the word.refine.path configuration item (by default classpath:word_refine.txt); for splitting, the left-hand side is the word to split and the right-hand side lists its parts separated by spaces:
    working class=worker class
    working masses=labour masses
Then we refine the segmentation result:
    words = WordRefiner.refine(words);
    System.out.println(words);
In this way we achieve the desired effect:
    [our country, worker, class, and, vast, labour, masses, want, more, close together, land, unite, stay, the Central Committee of the Communist Party of China (CPC), around]
Let's look at another segmentation example:
    List<Word> words = WordSegmenter.segWithStopWords("New achievements on the great journey of realizing the goal of 'two centenaries'");
    System.out.println(words);
The result is as follows:
    [stay, realization, two, a hundred years, goal, of, great, journey, upper, recreate, new, achievement]
Suppose the segmentation result we want is:
    [stay, realization, two centenaries, goal, of, great journey, upper, recreate, new, achievement]
That is, "two" and "a hundred years" should be merged into "two centenaries", and "great" and "journey" into "great journey". What should we do? We can add the following content to the same word.refine.path file; for merging, the left-hand side lists the words to merge separated by spaces and the right-hand side is the single merged word (the distinction is hard to render here, since the underlying Chinese words contain no spaces):
    two centenaries=two centenaries
    great journey=great journey
Then we refine the segmentation result:
    words = WordRefiner.refine(words);
    System.out.println(words);
In this way we achieve the desired effect:
    [stay, realization, two centenaries, goal, of, great journey, upper, recreate, new, achievement]
13. Synonym tagging
    List<Word> words = WordSegmenter.segWithStopWords("Chu Limo tried every means to retrieve the memory for Ruthless");
    System.out.println(words);
The result is as follows:
    [Chu Limo, make every attempt, by, Ruthless, retrieve, memory]
Apply synonym tagging:
    SynonymTagging.process(words);
    System.out.println(words);
The result is as follows:
    [Chu Limo, make every attempt[long intentional, make every effort, try various devices, tax one's ingenuity], by, Ruthless, retrieve, memory[Image]]
If indirect synonyms are enabled:
    SynonymTagging.process(words, false);
    System.out.println(words);
The result is as follows:
    [Chu Limo, make every attempt[long intentional, make every effort, try various devices, tax one's ingenuity], by, Ruthless, retrieve, memory[image, Image]]
Another example:
    List<Word> words = WordSegmenter.segWithStopWords("Older people with strong hands tend to live longer");
    System.out.println(words);
The result is as follows:
    [hand strength, large, of, the elderly, often, more, longevity]
Apply synonym tagging:
    SynonymTagging.process(words);
    System.out.println(words);
The result is as follows:
    [hand strength, large, of, the elderly[old man], often[often, Often, often], more, longevity[long life, longevity]]
If indirect synonyms are enabled:
    SynonymTagging.process(words, false);
    System.out.println(words);
The result is as follows:
    [hand strength, large, of, the elderly[old man], often[as usual, commonly, generally, ordinary, often, Chang ri, ordinary, usually, usual, weekdays, peacetime, usual, daily, everyday ordinary, often, ordinary, Often, general, Su ri, often, popular, usually], more, longevity[long life, longevity]]
Take the word "make every attempt" as an example: you can obtain its synonyms through the getSynonym() method of Word, for example:
    System.out.println(word.getSynonym());
The result is as follows:
    [long intentional, make every effort, try various devices, tax one's ingenuity]
Note: if there are no synonyms, getSynonym() returns an empty collection: Collections.emptyList()
The difference between indirect synonyms and direct synonyms is as follows:
Assume: A and B are synonyms, A and C are synonyms, B and D are synonyms, and C and E are synonyms.
Then: for A, {A, B, C} are direct synonyms; for B, {A, B, D} are direct synonyms; for C, {A, C, E} are direct synonyms; for {A, B, C}, {A, B, C, D, E} are indirect synonyms.
14. Antonym tagging
    List<Word> words = WordSegmenter.segWithStopWords("What movies are worth watching at the beginning of May?");
    System.out.println(words);
The result is as follows:
    [5, at the beginning of the month, have, which, film, worth, watch]
Apply antonym tagging:
    AntonymTagging.process(words);
    System.out.println(words);
The result is as follows:
    [5, at the beginning of the month[end of month, end of the month, the end of the month], have, which, film, worth, watch]
Another example:
    List<Word> words = WordSegmenter.segWithStopWords("Because the work was not in place and the service was imperfect, something unpleasant happened to a customer during dinner. The restaurant should make a sincere apology to the customer instead of being perfunctory.");
    System.out.println(words);
The result is as follows:
    [because, work, not in place, service, imperfect, cause, customer, stay, have meals, time, happen, unpleasant, of, thing, restaurant, aspect, should, towards, customer, make, sincere, of, apologize, instead of, do things carelessly]
Apply antonym tagging:
    AntonymTagging.process(words);
    System.out.println(words);
The result is as follows:
    [because, work, not in place, service, imperfect, cause, customer, stay, have meals, time, happen, unpleasant, of, thing, restaurant, aspect, should, towards, customer, make, sincere[fool, FALSE, false, sinister and crafty], of, apologize, instead of, do things carelessly[be strict in one's demands, be conscientious and do one's best, strain every nerve, refine on, a matter of conscience]]
Take the word "at the beginning of the month" as an example: you can obtain its antonyms through the getAntonym() method of Word, for example:
    System.out.println(word.getAntonym());
The result is as follows:
    [end of month, end of the month, the end of the month]
Note: if there are no antonyms, getAntonym() returns an empty collection: Collections.emptyList()
15. Pinyin tagging
    List<Word> words = WordSegmenter.segWithStopWords("Since its release on April 12, the mainland China box office of Fast and Furious 7 has exceeded RMB 2 billion in just two weeks");
    System.out.println(words);
The result is as follows:
    [speed, and, passion, 7, of, China, inland, box office, since, 4 month, 12 day, show, since, stay, short, two weeks, within, breach, 20 hundred million, RMB]
Apply Pinyin tagging:
    PinyinTagging.process(words);
    System.out.println(words);
The result is as follows:
    [speed sd sudu, and y yu, passion jq jiqing, 7, of d de, China zg zhongguo, inland nd neidi, box office pf piaofang, since z zi, 4 month, 12 day, show sy shangying, since yl yilai, stay z zai, short dd duanduan, two weeks lz liangzhou, within n nei, breach tp tupo, 20 hundred million, RMB rmb renminbi]
Take the word "speed" as an example: you can obtain the complete Pinyin through the getFullPinYin() method of Word, e.g. sudu, and the acronym Pinyin through the getAcronymPinYin() method, e.g. sd.
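The two accessors can be combined after tagging, for example (getText() is an assumed accessor for the word's text):

    List<Word> words = WordSegmenter.segWithStopWords("I love China");
    // Tag first, then read both Pinyin forms of every word
    PinyinTagging.process(words);
    for (Word word : words) {
        System.out.println(word.getText() + " " + word.getAcronymPinYin() + " " + word.getFullPinYin());
    }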
16. Lucene plug-in:
1. Construct the word analyzer ChineseWordAnalyzer:
    Analyzer analyzer = new ChineseWordAnalyzer();
If you need to use a specific segmentation algorithm, specify it through the constructor:
    Analyzer analyzer = new ChineseWordAnalyzer(SegmentationAlgorithm.FullSegmentation);
If not specified, the bidirectional maximum matching algorithm SegmentationAlgorithm.BidirectionalMaximumMatching is used by default; see the enumeration class SegmentationAlgorithm for the available algorithms
2. Use the word analyzer to segment text:
    TokenStream tokenStream = analyzer.tokenStream("text", "Yang Shangchuan is the author of the APDPlat application-level product development platform");
    // Prepare for consumption
    tokenStream.reset();
    // Start consuming
    while(tokenStream.incrementToken()){
        // The word itself
        CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        // The start and end offsets of the word in the text
        OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
        // The position increment of the word
        PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
        LOGGER.info(charTermAttribute.toString()+" ("+offsetAttribute.startOffset()+" - "+offsetAttribute.endOffset()+") "+positionIncrementAttribute.getPositionIncrement());
    }
    // Finished consuming
    tokenStream.close();
3. Use the word analyzer to build a Lucene index:
    Directory directory = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, config);
4. Use the word analyzer to query a Lucene index:
    QueryParser queryParser = new QueryParser("text", analyzer);
    Query query = queryParser.parse("text:Yang Shangchuan");
    TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
17. Solr plug-in:
1. Download word-1.3.jar from: http://search.maven.org/remotecontent?filepath=org/apdplat/word/1.3/word-1.3.jar
2. Create the directory solr-5.2.0/example/solr/lib and copy word-1.3.jar into the lib directory
3. Configure the schema to specify the tokenizer: in the file solr-5.2.0/example/solr/collection1/conf/schema.xml, replace every occurrence of <tokenizer class="solr.WhitespaceTokenizerFactory"/> and <tokenizer class="solr.StandardTokenizerFactory"/> with <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>, and remove all filter tags
4. If you need to use a specific segmentation algorithm:
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"/>
The optional values of segAlgorithm are:
    Forward maximum matching: MaximumMatching
    Reverse maximum matching: ReverseMaximumMatching
    Forward minimum matching: MinimumMatching
    Reverse minimum matching: ReverseMinimumMatching
    Bidirectional maximum matching: BidirectionalMaximumMatching
    Bidirectional minimum matching: BidirectionalMinimumMatching
    Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching
    Full segmentation: FullSegmentation
    Minimal word count: MinimalWordCount
    Maximum Ngram score: MaxNgramScore
If not specified, the bidirectional maximum matching algorithm BidirectionalMaximumMatching is used by default
5. If you need to specify a particular configuration file:
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching" conf="solr-5.2.0/example/solr/nutch/conf/word.local.conf"/>
For the configurable items in word.local.conf, see the word.conf file inside word-1.3.jar. If not specified, the default configuration file word.conf inside word-1.3.jar is used
18. ElasticSearch plug-in:
1. Open a command line and change to the elasticsearch root directory:
    cd elasticsearch-5.4.3
2. Install the word segmentation plug-in:
    wget http://apdplat.org/word/archive/v1.4.1.zip
    mkdir plugins/word
    unzip -d plugins/word v1.4.1.zip
Note: if the elasticsearch version is greater than 5.4.3, for example 5.6.4, change the configuration in the file plugins/word/plugin-descriptor.properties to: elasticsearch.version=5.6.4
3. Start ElasticSearch:
    bin/elasticsearch
4. Test the effect by visiting the following URL in the Chrome browser:
    http://localhost:9200/_analyze?analyzer=word&text=Yang Shangchuan is the author of the APDPlat application-level product development platform
19. Luke plugin:
1. Download http://luke.googlecode.com/files/lukeall-4.0.0-ALPHA.jar (may not be accessible from mainland China)
2. Download and unzip the Java Chinese word segmentation component word-1.0-bin.zip: http://pan.baidu.com/s/1dDziDFz
3. Unzip the 4 jar packages in the word-1.0-bin/word-1.0 folder of the unzipped component into the current folder. Use a compression tool such as WinRAR to open lukeall-4.0.0-ALPHA.jar, and drag everything in the current folder except the META-INF folder and the .jar, .bat, .html and word.local.conf files into lukeall-4.0.0-ALPHA.jar
4. Execute the command java -jar lukeall-4.0.0-ALPHA.jar to start luke; in the Analysis section of the Search tab you can select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer
5. In "Available analyzers found on the current classpath" on the Plugins tab you can also select org.apdplat.word.lucene.ChineseWordAnalyzer
Note: if you want to integrate another version of the word segmenter yourself, run mvn install in the project root directory to compile the project, then run mvn dependency:copy-dependencies to copy the dependent jar packages; the target/dependency/ directory will then contain all dependencies. Among them, target/dependency/slf4j-api-1.6.4.jar is the logging framework used by the word segmenter; target/dependency/logback-classic-0.9.28.jar and target/dependency/logback-core-0.9.28.jar are the recommended logging implementation, whose configuration file is at target/classes/logback.xml; target/word-1.3.jar is the main jar of the word segmenter. If you need to customize the dictionary, modify the segmenter configuration file target/classes/word.conf
Download the integrated Luke plug-in (applicable to Lucene 4.0.0): lukeall-4.0.0-ALPHA-with-word-1.0.jar
Download the integrated Luke plug-in (applicable to Lucene 4.10.3): lukeall-4.10.3-with-word-1.2.jar
20. Obtaining related words by computing word contexts:
How can we obtain related words by computing the contexts of words?
Context is defined as follows: in a text, the context of any word consists of the N words before it and the N words after it. Related words are defined as follows: the more similar the contexts of two words, the more similar, and hence more related, the two words are.
The algorithm consists of two steps:
1. The context of each word is computed from a large corpus, and each context is represented as a word vector. 2. The problem of computing the similarity of two words is reduced to computing the similarity of their contexts; by computing the similarity of the contexts we obtain the similarity of the words, and the more similar two words are, the more related they are. A sketch of both steps is given below.
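The two steps can be sketched as follows (a conceptual sketch of the algorithm described above, not the project's internal code; the 1/d weighting is one plausible reading of the 1/N weights shown further below):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ContextSimilaritySketch {
        // Step 1: accumulate a context vector for every word over a window of n words on each side
        public static Map<String, Map<String, Double>> buildContexts(List<List<String>> sentences, int n) {
            Map<String, Map<String, Double>> contexts = new HashMap<>();
            for (List<String> sentence : sentences) {
                for (int i = 0; i < sentence.size(); i++) {
                    Map<String, Double> context = contexts.computeIfAbsent(sentence.get(i), k -> new HashMap<>());
                    for (int j = Math.max(0, i - n); j <= Math.min(sentence.size() - 1, i + n); j++) {
                        if (j == i) continue;
                        // A neighbor at distance d contributes 1/d to its weight
                        context.merge(sentence.get(j), 1.0 / Math.abs(j - i), Double::sum);
                    }
                }
            }
            return contexts;
        }

        // Step 2: the similarity of two words is the cosine similarity of their context vectors
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
                normA += e.getValue() * e.getValue();
            }
            for (double w : b.values()) normB += w * w;
            return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }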
The method of use is as follows:
1. Use the built-in corpus of the word segmentation project: run the script demo-word-vector-corpus.bat or demo-word-vector-corpus.sh in the project root directory
2. Use your own text content: run the script demo-word-vector-file.bat or demo-word-vector-file.sh in the project root directory
Because the corpus is large, startup takes a long time; please wait patiently. An example follows.
Suppose we want to analyze the related words of Lanzhou. We run the script demo-word-vector-corpus.sh; after a successful start the command line prompts:
    Start initializing the model
    Model initialization complete
    You can enter the command sa=cos to specify the similarity algorithm. The available algorithms are:
        1. sa=cos, cosine similarity
        2. sa=edi, edit distance
        3. sa=euc, Euclidean distance
        4. sa=sim, simple common words
        5. sa=jac, Jaccard similarity coefficient
        6. sa=man, Manhattan distance
        7. sa=shh, SimHash + Hamming distance
        8. sa=ja, Jaro distance
        9. sa=jaw, Jaro-Winkler distance
        10. sa=sd, Sørensen-Dice coefficient
    You can enter the command limit=15 to specify the number of results to display
    You can enter the command exit to quit the program
    Enter a word or command to query:
Enter Lanzhou and press enter. The result shows:
    Related words of Lanzhou (EditDistanceTextSimilarity):
    ----------------------------------------------------------
    1. Lanzhou 1.0
    2. Beijing 0.21
    3. Fuzhou 0.2
    4. Taiyuan 0.19
    5. Chengdu 0.17
    6. Xi'an 0.17
    7. Harbin 0.17
    8. Nanning 0.17
    9. Guiyang 0.16
    10. Qingyang 0.15
    11. Shenyang 0.14
    12. Hefei 0.14
    13. Datong 0.14
    14. Lhasa 0.13
    15. Xining 0.13
    ----------------------------------------------------------
The result shows the related words of Lanzhou, each followed by its correlation score. Lanzhou and Lanzhou are the same word, so the correlation is naturally 1. From this result we can ask: why are these words related? What is the clue? First, all these words are nouns; second, they are all place names, in fact names of large cities. We can also observe an interesting phenomenon: words of the same part of speech, such as place names, tend to be used in a consistent way. The relatedness is derived from the context.
The number after each word in a context is its weight, an accumulated value of 1/N. Let's look at the contexts of these words:

Lanzhou: [Military region 1.0, Gansu 0.78205127, New area 0.7692308, University 0.42307693, Lanzhou, Gansu 0.41025642, Truck 0.3846154, Xi'an 0.32051283, Our newspaper 0.2948718, Xinhua News Agency 0.2820513, Lanzhou New Area 0.26923078, Hold 0.23076923, Send to 0.21794872, China 0.20512821, Lanzhou 0.20512821, Railway station 0.20512821, Railway 0.17948718, Attend 0.15384616, Xining 0.15384616, Direction 0.15384616, Chengdu 0.14102565, Police 0.14102565, Construction 0.12820514, Municipal Party committee 0.12820514, Come to 0.12820514, One 0.12820514, Center 0.115384616, Refinery 0.102564104, Enter 0.102564104, From 0.102564104, Hold 0.102564104]

Beijing: [Xinhua News Agency 1.0, Our newspaper 0.7119143, Hold 0.19384204, Shanghai 0.17831326, Time 0.16385542, Railway Bureau 0.1394913, West Station 0.13226238, Youth Daily 0.12717536, Morning news 0.11700134, Municipal Party committee 0.1145917, Region 0.11218206, Hold 0.10200803, City 0.08299866, Current 0.07951807, Come to 0.06961178, Military region 0.06827309, International 0.066398926, Center 0.063453816, Beijing time 0.06184739, People 0.059973225, Work 0.05863454, Metro 0.057563588, Beijing Railway Bureau 0.056492638, Hospital 0.055421688, Fly to 0.05381526, Capital 0.053547524, China 0.053547524, Where 0.05274431, Today 0.052208837, Satellite TV 0.05167336]

Fuzhou: [Railway station 1.0, New area 0.46666667, Fuzhou railway station 0.45555556, Evening news 0.2962963, Reporter 0.2777778, Work 0.27407408, Come to 0.24814814, Citizen 0.23333333, Our newspaper 0.22222222, University 0.21851853, Urban 0.2074074, Municipal Party committee 0.19259259, Hold 0.19259259, Gulou District 0.18518518, Netizen 0.18148148, Reach 0.17037037, To 0.16296296, Current 0.14074074, Branch 0.14074074, One 0.12962963, City 0.12962963, East Street 0.12222222, Fuzhou evening news 0.12222222, Xinhua News Agency 0.11851852, Railway 0.11851852, Hold 0.11481482, Go to 0.11481482, Development 0.11481482, Push 0.11111111, Fuzhou 0.11111111]

Taiyuan: [Shanxi 1.0, Taiyuan, Shanxi 0.6136364, Our newspaper 0.39772728, Xinhua News Agency 0.3409091, Railway station 0.26136363, Jinan 0.25, Railway 0.23863636, Beijing 0.22727273, Launch 0.1590909, International 0.1590909, Return 0.14772727, Corundum 0.13636364, From 0.13636364, Publish 0.13636364, Work 0.125, Center 0.125, Municipal Party committee 0.11363637, Bank 0.11363637, Railway Bureau 0.10227273, Xi'an 0.09090909, Group 0.09090909, Public security 0.09090909, To 0.09090909, For example 0.07954545, Finance 0.07954545, Train ticket 0.07954545, Datong 0.06818182, Shanxi Province 0.06818182, Military division 0.06818182, Leave 0.06818182]

Chengdu: [Business daily 1.0, Chengdu Business Daily 0.4117647, Military region 0.1875, Railway Bureau 0.17830883, Beijing 0.17463236, Our newspaper 0.17095588, Chongqing 0.15441176, Tell 0.15441176, Traffic police 0.14338236, Direction 0.1360294, Reporter 0.13419117, Plain 0.121323526, Sichuan 0.1194853, Changsha 0.11764706, Polytechnic University 0.0992647, From 0.09375, Xinhua News Agency 0.09191176, To 0.090073526, Chengdu Railway Bureau 0.08455882, Railway 0.080882356, Hold 0.07904412, Citizen 0.075367644, Municipal Party committee 0.073529415, Company 0.07169118, Guangzhou 0.07169118, Xi'an 0.0680147, Pixian 0.060661763, Work 0.060661763, Urban 0.05882353, Evening news 0.05882353]

Xi'an: [Railway station 1.0, Incident 0.75, Traffic 0.7058824, Construction 0.5882353, Metro 0.5882353, Xianyang 0.5588235, Come to 0.5294118, Citizen 0.50735295, University 0.5, Railway 0.5, Delegation 0.5, Railway Bureau 0.49264705, Company 0.4852941, Wuhan 0.4632353, Qujiang 0.44117647, Power supply 0.42647058, Xinhua News Agency 0.4117647, Xi'an railway station 0.4117647, Beijing 0.3602941, Jiaotong University 0.3602941, Our newspaper 0.34558824, Xi'an Incident 0.3382353, City 0.31617647, City 0.31617647, Settled in 0.30882353, Municipal Party committee 0.29411766, International 0.2867647, Chengdong 0.2867647, Chengdu 0.2720588, Hold 0.25]

Harbin: [Polytechnic University 1.0, Railway station 0.41584158, Harbin University of Technology 0.36138615, Industry 0.25742576, Direction 0.23762377, Xinhua News Agency 0.20792079, To 0.18811882, Harbin railway station 0.18316832, At 0.17821783, University 0.17326732, Railway Bureau 0.15841584, From 0.15346535, Minimum 0.14356436, Beijing 0.12871288, Our newspaper 0.12376238, Heilongjiang Province 0.12376238, Publish 0.11386139, China 0.10891089, Fly to 0.0990099, Heilongjiang 0.08415841, Shenyang 0.07920792, Project 0.07920792, Near 0.074257426, Municipal Party committee 0.06930693, Aircraft 0.06930693, Shanghai 0.06930693, Candidate 0.06930693, Enter 0.06930693, Stop 0.06930693, Economy 0.06435644]

Nanning: [Guangxi 1.0, Railway Bureau 0.8, Nanning, Guangxi 0.62222224, Our newspaper 0.54444444, Xinhua News Agency 0.36666667, Nanning Railway Bureau 0.31111112, Municipal Party committee 0.26666668, Liuzhou 0.18888889, Guilin 0.17777778, Railway 0.15555556, Xingning District 0.14444445, Come to 0.11111111, To 0.11111111, Go to 0.11111111, Public security 0.11111111, Work 0.11111111, To 0.11111111, City 0.08888889, Beautiful 0.08888889, Hold 0.08888889, Engaged in 0.08888889, Guantang 0.08888889, Property market 0.08888889, Branch office 0.07777778, Nanning Municipal Party committee 0.07777778, Motor car 0.07777778, Occurrence 0.07777778, Hold 0.07777778, Xixiang 0.06666667, Mayor 0.06666667]

Guiyang: [Newspaper 1.0, Chongqing 0.73333335, Xinhua News Agency 0.46666667, Direction 0.43333334, Go to 0.4, Brothers 0.4, City 0.4, Home 0.33333334, Xi'an 0.26666668, Chengdu 0.26666668, Street 0.26666668, Evening news 0.26666668, Irrelevant 0.26666668, Hangzhou 0.23333333, Involving 0.2, And 0.2, City 0.2, Netizen 0.2, Zhengzhou 0.16666667, Nanning 0.16666667, Changsha 0.16666667, Wuhan 0.16666667, Stall 0.16666667, Municipal Party committee 0.13333334, Kunming 0.13333334, Anshun 0.13333334, Come to 0.13333334, Hegemony 0.13333334, Top four 0.13333334, Railway 0.13333334]

Qingyang: [Gansu 1.0, Qingyang, Gansu 0.8, Gansu Province 0.4, Region 0.4, Old area 0.3, Forest 0.2, Pingliang 0.2, Zhenyuan County 0.1, Revolution 0.1, Han Fengting 0.1, Traffic 0.1, Lanzhou forest brigade 0.1, Brigade 0.1, Lanzhou 0.1, Xifeng 0.1, Send 0.1, One 0.1, License plate 0.1, From 0.1]

Shenyang: [Military region 1.0, Evening news 0.5123967, Direction 0.3181818, Our newspaper 0.27272728, Shenyang Evening News 0.23553719, Xinhua News Agency 0.20661157, Shenyang Military Region 0.18595041, Military region team 0.15289256, Sea lions 0.14876033, Automation office 0.14049587, This time 0.14049587, Economic Zone 0.1322314, China 0.12809917, Dalian 0.12809917, Uncle 0.12809917, Municipal Party committee 0.12396694, One family 0.11570248, High speed 0.11570248, International 0.11157025, Train ticket 0.11157025, Faku 0.10743801, University 0.10330579, Changchun 0.10330579, Direct to 0.09917355, Shenzhen 0.09090909, Shanghai 0.08677686, Reporter 0.08677686, Sea lion 0.08264463, Aunt 0.08264463, Two digit 0.08264463]

Hefei: [Railway station 1.0, Citizen 0.8181818, Urban 0.53333336, Property market 0.4848485, Hefei railway station 0.4121212, Railway 0.38787878, Anhui 0.36969697, Reach 0.36363637, Market 0.34545454, Last week 0.3030303, Wuhu 0.2969697, Hold 0.28484848, Reporter 0.27272728, Become 0.27272728, Come to 0.26666668, Hefei, Anhui 0.24242425, City 0.24242425, Economic circle 0.24242425, Bus 0.24242425, Current 0.23636363, Our newspaper 0.21818182, This year 0.21818182, Takeoff 0.21818182, Car 0.21212122, Substance 0.2060606, Hefei property market 0.2060606, Airport 0.2060606, Industry 0.19393939, Title 0.18181819, Wild 0.16969697]

Datong: [University 1.0, Railway 0.52380955, Shanxi 0.5, Securities 0.33333334, Datong University 0.33333334, Shanxi Province 0.23809524, This time 0.23809524, Datong, Shanxi 0.1904762, World 0.1904762, World Datong 0.1904762, Street 0.16666667, Taiyuan 0.14285715, Municipal Party committee 0.14285715, Shanghai 0.14285715, Police station 0.14285715, Public security department 0.14285715, Japan 0.14285715, Forward 0.14285715, Yuncheng 0.11904762, Military division 0.0952381, Mining Bureau 0.0952381, Primary school 0.0952381, Attend 0.0952381, Item 0.0952381, Secondary school 0.0952381, Water plant 0.0952381, Depot 0.0952381, To 0.0952381, Datong securities 0.0952381, Campaign 0.071428575]

Lhasa: [Railway station 1.0, Xinhua News Agency 0.91935486, Tibet 0.7580645, Urban 0.61290324, Our newspaper 0.58064514, Hold 0.5645161, Customs 0.5483871, City 0.48387095, Lhasa railway station 0.4032258, Municipal Party committee 0.38709676, Chengdu 0.37096775, Gongga 0.3548387, Opening 0.32258064, Publish 0.30645162, Lhasa, Tibet 0.2580645, Meeting 0.2580645, Airport 0.22580644, Closing 0.22580644, Grand 0.22580644, Nyingchi 0.20967741, Hold 0.19354838, Open 0.19354838, Business department 0.19354838, Citizen 0.17741935, Market 0.17741935, Economy 0.17741935, Center 0.17741935, Air 0.17741935, Become 0.17741935, People 0.16129032]

Xining: [Xinhua News Agency 1.0, Shanghai 0.8235294, Lanzhou 0.3529412, Rolling 0.3529412, Our newspaper 0.29411766, Qinghai 0.29411766, Investigation 0.23529412, Dangjie 0.23529412, Special steel 0.1764706, Direction 0.1764706, Branch 0.1764706, Asking for bribes 0.1764706, Beijing 0.14705883, But 0.14705883, Lhasa 0.11764706, We 0.11764706, Title 0.11764706, Traffic police 0.11764706, Delegation 0.11764706, Process 0.0882353, Yinchuan 0.0882353, Ticket 0.0882353, Preparation 0.0882353, Transfer 0.0882353, Attend 0.0882353, January 0.05882353, Test Bureau 0.05882353, February 0.05882353, Region 0.05882353, Serious 0.05882353]

Finally, let's look at the related words of Lanzhou computed with the ten similarity algorithms:
----------------------------------------------------------
Related words of Lanzhou (CosineTextSimilarity):
1. Lanzhou 1.0
2. Shenyang 0.5
3. Beijing Military Region 0.47
4. Logistics Department 0.46
5. Shenyang Military Region 0.46
6. General hospital 0.46
7. Xinjiang Military Region 0.46
8. Commander 0.42
9. Lanzhou, Gansu 0.42
10. Lanzhou New Area 0.42
11. A division 0.39
12. Zhengpu port 0.38
13. Xixian 0.38
14. Tianshui 0.37
15. Zheng Dong 0.37
Time elapsed: 25 seconds, 572 milliseconds
----------------------------------------------------------
Related words of Lanzhou (EditDistanceTextSimilarity):
1. Lanzhou 1.0
2. Beijing 0.21
3. Fuzhou 0.2
4. Taiyuan 0.19
5. Chengdu 0.17
6. Nanning 0.17
7. Xi'an 0.17
8. Harbin 0.17
9. Guiyang 0.16
10. Qingyang 0.15
11. Hefei 0.14
12. Datong 0.14
13. Shenyang 0.14
14. Perth 0.13
15. Lhasa 0.13
Time elapsed: 44 seconds, 253 milliseconds
----------------------------------------------------------
Related words of Lanzhou (EuclideanDistanceTextSimilarity):
1. Lanzhou 1.0
2. Logistics Department 0.37
3. Beijing Military Region 0.37
4. Xinjiang Military Region 0.37
5. Shenyang 0.37
6. Shenyang Military Region 0.37
7. General hospital 0.37
8. Shanghai Pudong New Area 0.36
9. Zhengpu port 0.36
10. Pudong New Area 0.36
11. Lanzhou, Gansu 0.36
12. Xixian 0.36
13. Xixian new area 0.36
14. Zhengding new area 0.36
15. Commander 0.36
Time elapsed: 24 seconds, 710 milliseconds
----------------------------------------------------------
Related words of Lanzhou (SimpleTextSimilarity):
1. Lanzhou 1.0
2. Fuzhou 0.36
3. Xi'an 0.33
4. Li Hongqi 0.33
5. China Financial Information Center 0.33
6. Nantes 0.32
7. Cartagena 0.32
8. Harbin 0.3
9. Wuhan 0.3
10. Dacry 0.3
11. Chuxiong 0.29
12. Zhu Mengkui 0.29
13. Yue Feifei 0.29
14. Changsha 0.28
15. LV Guoqing 0.28
Time elapsed: 21 seconds, 918 milliseconds
----------------------------------------------------------
Related words of Lanzhou (JaccardTextSimilarity):
1. Lanzhou 1.0
2. Fuzhou 0.22
3. Xi'an 0.2
4. Harbin 0.18
5. Beijing 0.18
6. Wuhan 0.18
7. Chengdu 0.18
8. Changsha 0.15
9. Taiyuan 0.15
10. Guiyang 0.15
11. Shenyang 0.15
12. Guangzhou 0.15
13. Lhasa 0.15
14. Nanchang 0.15
15. Changchun 0.13
Time elapsed: 19 seconds, 717 milliseconds
----------------------------------------------------------
Related words of Lanzhou (ManhattanDistanceTextSimilarity):
1. Lanzhou 1.0
2. Shanghai Pudong New Area 0.11
3. Shaanxi Xixian new area 0.11
4. Lanzhou, Gansu 0.11
5. Beijing Military Region 0.11
6. Xinjiang Military Region 0.11
7. Xixian 0.11
8. Zhengding new area 0.11
9. Tianfu new area 0.11
10. Shenyang Military Region 0.11
11. National new area 0.11
12. Lanzhou New Area 0.11
13. Xiake 0.1
14. Threat theory 0.1
15. One or two months 0.1
Time elapsed: 23 seconds, 857 milliseconds
----------------------------------------------------------
Related words of Lanzhou (SimHashPlusHammingDistanceTextSimilarity):
1. Lanzhou 1.0
2. Fish water 0.96
3. Feng dao 0.95
4. Press release 0.95
5. Science 0.95
6. Property company 0.95
7. Active serviceman 0.95
8. Who 0.95
9. Zhang fu 0.94
10. Announcement 0.94
11. Information publishing 0.94
12. Initiative 0.94
13. Liquid medicine 0.94
14. Archaeological excavation 0.94
15. Public release 0.94
Time elapsed: 5 minutes, 57 seconds, 339 milliseconds
----------------------------------------------------------
Related words of Lanzhou (JaroDistanceTextSimilarity):
1. Lanzhou 1.0
2. Changsha 0.49
3. Harbin 0.49
4. Fuzhou 0.48
5. Taiyuan 0.47
6. Qingyang 0.46
7. Jinan 0.46
8. Beijing 0.45
9. Chengdu 0.45
10. Zhangjiaming 0.45
11. Xi'an 0.45
12. Sun Yong 0.45
13. Chuxiong 0.44
14. Fuzhou railway station 0.44
15. Nanning 0.44
Time elapsed: 12 seconds, 718 milliseconds
----------------------------------------------------------
Related words of Lanzhou (JaroWinklerDistanceTextSimilarity):
1. Lanzhou 1.0
2. Lhasa 0.56
3. Nanning 0.55
4. Imperial court 0.55
5. Public judgment 0.54
6. Samond 0.53
7. World class 0.53
8. Lakeside 0.53
9. Large and small 0.52
10. General election 0.52
11. Seventh session 0.52
12. Bake 0.51
13. Wuping County 0.51
14. Moscow 0.51
15. Retraining 0.51
Time elapsed: 16 seconds, 723 milliseconds
----------------------------------------------------------
Related words of Lanzhou (SørensenDiceCoefficientTextSimilarity):
1. Lanzhou 1.0
2. Fuzhou 0.37
3. Xi'an 0.33
4. Harbin 0.3
5. Beijing 0.3
6. Wuhan 0.3
7. Chengdu 0.3
8. Changsha 0.27
9. Taiyuan 0.27
10. Guiyang 0.27
11. Shenyang 0.27
12. Guangzhou 0.27
13. Lhasa 0.27
14. Nanchang 0.27
15. Changchun 0.23
Time elapsed: 19 seconds, 852 milliseconds
----------------------------------------------------------
21. Word frequency statistics:
The class org.apdplat.word.WordFrequencyStatistics provides the word frequency statistics function.
The command line script is invoked as follows:
Write the text for which word frequencies are required to a file, e.g. text.txt, then run:
    chmod +x wfs.sh && ./wfs.sh -textFile=text.txt -statisticsResultFile=statistics-result.txt
After the program finishes, open the file statistics-result.txt to view the word frequency statistics
It is called from a program as follows:
    // Word frequency statistics settings
    WordFrequencyStatistics wordFrequencyStatistics = new WordFrequencyStatistics();
    wordFrequencyStatistics.setRemoveStopWord(false);
    wordFrequencyStatistics.setResultPath("word-frequency-statistics.txt");
    wordFrequencyStatistics.setSegmentationAlgorithm(SegmentationAlgorithm.MaxNgramScore);
    // Segment the first sentence
    wordFrequencyStatistics.seg("It's raining tomorrow to combine molecules. Tomorrow there's a course on molecules and atoms. If it rains, I have to go to class");
    // Output the word frequency statistics
    wordFrequencyStatistics.dump();
    // Prepare a document
    Files.write(Paths.get("text-to-seg.txt"), Arrays.asList("The word segmenter is a Java distributed Chinese word segmentation component; it provides a variety of dictionary-based segmentation algorithms and uses an ngram model to eliminate ambiguity."));
    // Clear the previous statistics
    wordFrequencyStatistics.reset();
    // Segment the document
    wordFrequencyStatistics.seg(new File("text-to-seg.txt"), new File("text-seg-result.txt"));
    // Output the word frequency statistics
    wordFrequencyStatistics.dump("file-seg-statistics-result.txt");
Word frequency statistics of the first sentence:
    1. Rain 2
    2. Tomorrow 2
    3. Molecule 2
    4. Course 1
    5. Lecture 1
    6. Combination 1
    7. Atom 1
    8. Go 1
    9. Cheng 1
    10. About 1
    11. And 1
    12. Also 1
    13. Have 1
    14. Of 1
    15. 1
Word frequency statistics of the second sentence:
    1. Participle 2
    2. Of 2
    3. Based on 1
    4. word 1
    5. Component 1
    6. Dictionary 1
    7. ngram 1
    8. Multiple 1
    9. Implementation 1
    10. And 1
    11. Utilization 1
    12. Disambiguation 1
    13. Chinese word segmentation 1
    14. Algorithm 1
    15. Yes 1
    16. Distributed 1
    17. 1
    18. Provide 1
    19. Model 1
    20. Come 1
    21. One 1
    22. Java 1
22. Text similarity:
The word segmentation component provides a variety of text similarity computation methods:
Method 1: cosine similarity, which evaluates the similarity of two vectors by computing the cosine of the angle between them
Implementation class: org.apdplat.word.analysis.CosineTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new CosineTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.67
    I love shopping and He's a hacker similarity score: 0.0
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.0
    He's a hacker and He's a hacker similarity score: 1.0
Method 2: simple common words, which evaluates the similarity of two documents by dividing the total number of characters in their shared words by the number of characters in the longer document
Implementation class: org.apdplat.word.analysis.SimpleTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new SimpleTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.5
    I love shopping and He's a hacker similarity score: 0.0
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.0
    He's a hacker and He's a hacker similarity score: 1.0
Method 3: edit distance, which evaluates the similarity of two strings by computing the minimum number of edit operations required to transform one string into the other
Implementation class: org.apdplat.word.analysis.EditDistanceTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new EditDistanceTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.5
    I love shopping and He's a hacker similarity score: 0.0
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.0
    He's a hacker and He's a hacker similarity score: 1.0
Method 4: SimHash + Hamming distance. SimHash first maps texts of different lengths to fixed-length fingerprints, and the Hamming distance between the fingerprints is then computed (a conceptual sketch follows the example output below)
Implementation class: org.apdplat.word.analysis.SimHashPlusHammingDistanceTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new SimHashPlusHammingDistanceTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.95
    I love shopping and He's a hacker similarity score: 0.83
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.86
    He's a hacker and He's a hacker similarity score: 1.0
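The SimHash idea described in Method 4 can be sketched as follows: hash every word of a text to a 64-bit value, accumulate +1 or -1 per bit position, take the signs as the fingerprint, and score two fingerprints by 1 minus their normalized Hamming distance (a conceptual sketch, not the library's implementation; real SimHash also weights words, e.g. by frequency):

    import java.util.List;

    public class SimHashSketch {
        // Build a 64-bit SimHash fingerprint from a list of segmented words
        public static long simHash(List<String> words) {
            int[] acc = new int[64];
            for (String word : words) {
                long h = hash64(word);
                for (int i = 0; i < 64; i++) {
                    acc[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
                }
            }
            long fingerprint = 0L;
            for (int i = 0; i < 64; i++) {
                if (acc[i] > 0) fingerprint |= 1L << i;
            }
            return fingerprint;
        }

        // Similarity = 1 - hammingDistance / 64
        public static double similarity(long a, long b) {
            return 1.0 - Long.bitCount(a ^ b) / 64.0;
        }

        // A simple stable 64-bit string hash (FNV-1a); any stable 64-bit hash works
        private static long hash64(String s) {
            long h = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001b3L;
            }
            return h;
        }
    }

Because even quite different texts share many fingerprint bits by chance, scores cluster near 1.0, which matches the output above where unrelated sentences still score 0.83 and 0.86.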
Method 5: Jaccard similarity coefficient, which evaluates the similarity of two sets by dividing the size of their intersection by the size of their union (a sketch of the formula follows the example output below)
Implementation class: org.apdplat.word.analysis.JaccardTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new JaccardTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.5
    I love shopping and He's a hacker similarity score: 0.0
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.0
    He's a hacker and He's a hacker similarity score: 1.0
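The formula in Method 5 can be sketched directly over the two texts' segmented word sets (a conceptual sketch of the formula, not the library's implementation):

    import java.util.HashSet;
    import java.util.Set;

    public class JaccardSketch {
        // Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|
        public static double jaccard(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
        }
    }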
Method 6: Euclidean distance, which evaluates the similarity of two points by computing the straight-line distance between them
Implementation class: org.apdplat.word.analysis.EuclideanDistanceTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new EuclideanDistanceTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.41
    I love shopping and He's a hacker similarity score: 0.29
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.29
    He's a hacker and He's a hacker similarity score: 1.0
Method 7: Manhattan distance, which evaluates the similarity of two points by computing the sum of the absolute differences of their coordinates in a standard coordinate system
Implementation class: org.apdplat.word.analysis.ManhattanDistanceTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new ManhattanDistanceTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.33
    I love shopping and He's a hacker similarity score: 0.14
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.14
    He's a hacker and He's a hacker similarity score: 1.0
Method 8: Jaro Distance, a type of editing distance
Implementation class: org.apdplat.word.analysis.JaroDistanceTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new JaroDistanceTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.67
    I love shopping and He's a hacker similarity score: 0.0
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.0
    He's a hacker and He's a hacker similarity score: 1.0
Method 9: Jaro-Winkler distance, an extension of the Jaro distance
Implementation class: org.apdplat.word.analysis.JaroWinklerDistanceTextSimilarity
The usage is as follows:
    String text1 = "I love shopping";
    String text2 = "I love reading";
    String text3 = "He's a hacker";
    TextSimilarity textSimilarity = new JaroWinklerDistanceTextSimilarity();
    double score1pk1 = textSimilarity.similarScore(text1, text1);
    double score1pk2 = textSimilarity.similarScore(text1, text2);
    double score1pk3 = textSimilarity.similarScore(text1, text3);
    double score2pk2 = textSimilarity.similarScore(text2, text2);
    double score2pk3 = textSimilarity.similarScore(text2, text3);
    double score3pk3 = textSimilarity.similarScore(text3, text3);
    System.out.println(text1+" and "+text1+" similarity score: "+score1pk1);
    System.out.println(text1+" and "+text2+" similarity score: "+score1pk2);
    System.out.println(text1+" and "+text3+" similarity score: "+score1pk3);
    System.out.println(text2+" and "+text2+" similarity score: "+score2pk2);
    System.out.println(text2+" and "+text3+" similarity score: "+score2pk3);
    System.out.println(text3+" and "+text3+" similarity score: "+score3pk3);
The operation results are as follows:
    I love shopping and I love shopping similarity score: 1.0
    I love shopping and I love reading similarity score: 0.73
    I love shopping and He's a hacker similarity score: 0.0
    I love reading and I love reading similarity score: 1.0
    I love reading and He's a hacker similarity score: 0.0
    He's a hacker and He's a hacker similarity score: 1.0
Method 10: Sørensen-Dice coefficient, which measures the similarity of two sets as twice the size of their intersection divided by the sum of their sizes
Implementation class: org.apdplat.word.analysis.SørensenDiceCoefficientTextSimilarity
The usage is as follows:
String text1 = "I love shopping"; String text2 = "I love reading"; String text3 = "He's a hacker"; TextSimilarity textSimilarity = new SørensenDiceCoefficientTextSimilarity(); double score1pk1 = textSimilarity.similarScore(text1, text1); double score1pk2 = textSimilarity.similarScore(text1, text2); double score1pk3 = textSimilarity.similarScore(text1, text3); double score2pk2 = textSimilarity.similarScore(text2, text2); double score2pk3 = textSimilarity.similarScore(text2, text3); double score3pk3 = textSimilarity.similarScore(text3, text3); System.out.println(text1+" and "+text1+" Similarity score:"+score1pk1); System.out.println(text1+" and "+text2+" Similarity score:"+score1pk2); System.out.println(text1+" and "+text3+" Similarity score:"+score1pk3); System.out.println(text2+" and "+text2+" Similarity score:"+score2pk2); System.out.println(text2+" and "+text3+" Similarity score:"+score2pk3); System.out.println(text3+" and "+text3+" Similarity score:"+score3pk3);
The operation results are as follows:
I love shopping and I love shopping Similarity score: 1.0
I love shopping and I love reading Similarity score: 0.67
I love shopping and He's a hacker Similarity score: 0.0
I love reading and I love reading Similarity score: 1.0
I love reading and He's a hacker Similarity score: 0.0
He's a hacker and He's a hacker Similarity score: 1.0
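The formula is Dice(A, B) = 2|A ∩ B| / (|A| + |B|) over the two texts' word sets; with two of three words shared, 2 * 2 / (3 + 3) ≈ 0.67, as reported above. A minimal sketch of the set arithmetic, for illustration only (not the library's code):

import java.util.*;

public class DiceSketch {
    // Sørensen-Dice coefficient: 2 * |A ∩ B| / (|A| + |B|).
    static double dice(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        // Two of three words shared: 2*2 / (3+3) = 0.666..., the 0.67 above
        System.out.println(dice(
                new HashSet<>(Arrays.asList("I", "love", "shopping")),
                new HashSet<>(Arrays.asList("I", "love", "reading"))));
    }
}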
23. Judging the probability that a sentence is meaningful:
Run one of the following commands:
unix-like: chmod +x sentence-identify.sh && ./sentence-identify.sh
windows: ./sentence-identify.bat
to execute the org.apdplat.word.analysis.SentenceIdentify class. Sample results:
1. sentence: I am a man and you are a woman, probability: 0.71428573
2. sentence: I am a person, probability: 0.6666667
3. sentence: I love reading, probability: 0.5
4. sentence: I love learning., probability: 0.5
5. sentence: Fati's room, martial arts master of your generation, change to the entry point, probability: 0.2857143
6. sentence: General supervision office of high voltage line tower with apparent porosity, probability: 0.2857143
7. sentence: Wang Jiejun reports that it's not impossible to perform a ridge wall with hay and Vera, probability: 0.25
8. sentence: At eight or nine o'clock, the scenery of the mountain has experienced changes in the world, and pulushenko's Huaihe town is happy to play a simulated flight, probability: 0.22222222
9. sentence: Level mission area book of the dead barnar has no brain to pull people's hearts and lungs to review lessons Lin Youli typhoon shelter, probability: 0.2
10. sentence: Participating party: Journal of Botany Bai Shanye runs wildly in the shadow and rides the white horse wuzishan castle. He hesitates at Yueyang airport, probability: 0.2
You can then type a sentence at the command-line prompt and press Enter to get its score.
For example, enter the sentence: Strive for China's rise
The program returns:
Random words: [by, China, rise, and, strive, struggle]
Generated sentence: Strive for China's rise
Sentence probability: 1.0
For example, enter the sentence: Is the memory of the human brain stored in bioelectricity or in cells?
The program returns:
Random words: [human brain, of, memory, yes, preservation, stay, biology, electric, upper, still, stay, Cells, in]
Generated sentence: Is the memory of the human brain stored in bioelectricity or in cells?
Sentence probability: 0.8333333
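The reported probabilities are consistent with scoring a sentence as the fraction of adjacent word pairs that a bigram model recognizes: for instance, a four-word segmentation in which 2 of the 3 adjacent pairs are known bigrams would score 0.6666667, and 10 of 12 pairs in a 13-word segmentation would score 0.8333333. This is an assumption about the scoring, not something the source confirms; the sketch below only illustrates that reading, and knownBigrams is a hypothetical stand-in for the library's bigram statistics:

import java.util.*;

public class SentenceScoreSketch {
    // Conceptual sketch: probability = recognized adjacent word pairs / total adjacent pairs.
    // knownBigrams is a hypothetical placeholder keyed as "first:second";
    // the real library would consult its own bigram model instead.
    static double sentenceProbability(List<String> words, Set<String> knownBigrams) {
        if (words.size() < 2) return 0.0;
        int hits = 0;
        for (int i = 0; i < words.size() - 1; i++) {
            if (knownBigrams.contains(words.get(i) + ":" + words.get(i + 1))) hits++;
        }
        return hits / (double) (words.size() - 1);
    }
}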
Accuracy and speed evaluation of the word segmentation algorithms:
1. word segmentation, maximum Ngram score algorithm:
Segmentation speed: 370.9714 characters/millisecond
Line perfect rate: 66.55%  Line error rate: 33.44%  Total lines: 2533709  Perfect lines: 1686210  Wrong lines: 847499
Word perfect rate: 60.94%  Word error rate: 39.05%  Total words: 28374490  Perfect words: 17293964  Wrong words: 11080526
2. word segmentation, minimum word count algorithm:
Segmentation speed: 330.1586 characters/millisecond
Line perfect rate: 65.67%  Line error rate: 34.32%  Total lines: 2533709  Perfect lines: 1663958  Wrong lines: 869751
Word perfect rate: 60.12%  Word error rate: 39.87%  Total words: 28374490  Perfect words: 17059641  Wrong words: 11314849
3. word segmentation, full segmentation algorithm:
Segmentation speed: 62.960262 characters/millisecond
Line perfect rate: 57.2%  Line error rate: 42.79%  Total lines: 2533709  Perfect lines: 1449288  Wrong lines: 1084421
Word perfect rate: 47.95%  Word error rate: 52.04%  Total words: 28374490  Perfect words: 13605742  Wrong words: 14768748
4. word segmentation, bidirectional maximum-minimum matching algorithm:
Segmentation speed: 462.87158 characters/millisecond
Line perfect rate: 53.06%  Line error rate: 46.93%  Total lines: 2533709  Perfect lines: 1344624  Wrong lines: 1189085
Word perfect rate: 43.07%  Word error rate: 56.92%  Total words: 28374490  Perfect words: 12221610  Wrong words: 16152880
5. word segmentation, bidirectional minimum matching algorithm:
Segmentation speed: 967.68604 characters/millisecond
Line perfect rate: 46.34%  Line error rate: 53.65%  Total lines: 2533709  Perfect lines: 1174276  Wrong lines: 1359433
Word perfect rate: 36.07%  Word error rate: 63.92%  Total words: 28374490  Perfect words: 10236574  Wrong words: 18137916
6. word segmentation, bidirectional maximum matching algorithm:
Segmentation speed: 661.148 characters/millisecond
Line perfect rate: 46.18%  Line error rate: 53.81%  Total lines: 2533709  Perfect lines: 1170075  Wrong lines: 1363634
Word perfect rate: 35.65%  Word error rate: 64.34%  Total words: 28374490  Perfect words: 10117122  Wrong words: 18257368
7. word segmentation, forward maximum matching algorithm:
Segmentation speed: 1567.1318 characters/millisecond
Line perfect rate: 41.88%  Line error rate: 58.11%  Total lines: 2533709  Perfect lines: 1061189  Wrong lines: 1472520
Word perfect rate: 31.35%  Word error rate: 68.64%  Total words: 28374490  Perfect words: 8896173  Wrong words: 19478317
8. word segmentation, reverse maximum matching algorithm:
Segmentation speed: 1232.6017 characters/millisecond
Line perfect rate: 41.69%  Line error rate: 58.3%  Total lines: 2533709  Perfect lines: 1056515  Wrong lines: 1477194
Word perfect rate: 30.98%  Word error rate: 69.01%  Total words: 28374490  Perfect words: 8792532  Wrong words: 19581958
9. word segmentation, reverse minimum matching algorithm:
Segmentation speed: 1936.9575 characters/millisecond
Line perfect rate: 41.42%  Line error rate: 58.57%  Total lines: 2533709  Perfect lines: 1049673  Wrong lines: 1484036
Word perfect rate: 31.34%  Word error rate: 68.65%  Total words: 28374490  Perfect words: 8893622  Wrong words: 19480868
10. word segmentation, forward minimum matching algorithm:
Segmentation speed: 2228.9465 characters/millisecond
Line perfect rate: 36.7%  Line error rate: 63.29%  Total lines: 2533709  Perfect lines: 930069  Wrong lines: 1603640
Word perfect rate: 26.72%  Word error rate: 73.27%  Total words: 28374490  Perfect words: 7583741  Wrong words: 20790749