Java implementation of Chinese word frequency statistics

Posted by markjia on Mon, 30 Dec 2019 16:59:07 +0100

Yesterday I needed to compute Chinese word frequency statistics. Searching Baidu turned up a lot of clickbait articles whose titles did not match their content, so here is a simple record of my own implementation.

 

Unlike word frequency statistics for English, the difficulty with Chinese lies in segmenting the text into words. Fortunately, there are many excellent ready-made libraries for this; the ansj_seg library is used here.
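To make the difficulty concrete, here is a minimal sketch using only the JDK. The class name `WhitespaceSplitDemo` and both sample sentences are made up for illustration. Splitting on whitespace yields usable tokens for English, but returns a Chinese sentence as one unsplit token, which is why a segmentation library is needed.

```java
import java.util.Arrays;

public class WhitespaceSplitDemo {

    /** Splits text on runs of whitespace, which is often enough for English. */
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String english = "the quick brown fox";   // hypothetical sample
        String chinese = "我爱自然语言处理";          // hypothetical sample, no spaces between words

        // English: one token per word
        System.out.println(Arrays.toString(splitOnWhitespace(english))); // [the, quick, brown, fox]

        // Chinese: the whole sentence comes back as a single token
        System.out.println(splitOnWhitespace(chinese).length);           // 1
    }
}
```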

 

Add the dependency first, either by downloading the jars or through Maven:

Download jar
  • Visit http://maven.nlpcn.org/org/ansj/ and download ansj_seg, preferably the latest version.
    • Download nlp-lang.jar as well. Its version must match ansj_seg; the Maven dependency declared inside the jar shows which version is expected. Generally, the latest ansj_seg works with the latest nlp-lang.
  • Import the jars into Eclipse and start your program.
Maven
        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>5.1.1</version>
        </dependency>

The basic usage is:

 String str = "welcome to ansj_seg. If you have any problem here, you can contact me. I will do my best to help you. Ansj_seg is faster, more accurate and more free!";
 System.out.println(ToAnalysis.parse(str));
 
This prints the segmented terms as comma-separated word/part-of-speech pairs, ending with:

 .../d,free/a,!

 

Here's the code:

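  // Imports needed by the methods in this listing
  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.List;
  import java.util.Map;
  import org.ansj.splitWord.analysis.ToAnalysis;

  public static void wordFrequency() throws IOException {
        Map<String, Integer> map = new HashMap<>();

        // Segment the article; toStringWithOutNature() joins the terms with ","
        String article = getString();
        String result = ToAnalysis.parse(article).toStringWithOutNature();
        String[] words = result.split(",");

        for (String word : words) {
            String str = word.trim();
            // Filter white space
            if (str.equals(""))
                continue;
            // Filter some high frequency symbols (ASCII and full-width forms)
            else if (str.matches("[()（）.,，。+\\-\"“”:：?？\\s]"))
                continue;
            // Filter out tokens of length 1
            else if (str.length() < 2)
                continue;

            // Count the trimmed token, so that " word" and "word" share one key
            map.merge(str, 1, Integer::sum);
        }

        // Print the raw counts (HashMap order, not sorted)
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }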
        // Repeatedly pull out the most frequent entry to build a list sorted by descending count.
        // Note that this empties the map.
        List<Map.Entry<String, Integer>> list = new ArrayList<>();
        Map.Entry<String, Integer> entry;
        while ((entry = getMax(map)) != null) {
            list.add(entry);
        }
        System.out.println(Arrays.toString(list.toArray()));
  }

  /**
   * Find the entry with the largest value in the map, return the entry, and delete it from the map.
   * @param map word counts; modified by this call
   * @return the entry with the highest count, or null if the map is empty
   */
  public static Map.Entry<String, Integer> getMax(Map<String, Integer> map) {
        if (map.isEmpty()) {
            return null;
        }
        Map.Entry<String, Integer> maxEntry = null;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            if (maxEntry == null || entry.getValue() > maxEntry.getValue()) {
                maxEntry = entry;
            }
        }
        map.remove(maxEntry.getKey());
        return maxEntry;
  }

  /**
   * Read the article text to be segmented from a file.
   * The content comes from a popular Jianshu article: https://www.jianshu.com/p/5b37403f6ba6
   * @return the file content as a single string, with line breaks removed
   * @throws IOException if the file cannot be read
   */
  public static String getString() throws IOException {
        StringBuilder strBuilder = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream even if reading fails.
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(new File("/home/as_/IdeaProjects/SpringMaven/article-txt")),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                strBuilder.append(line);
            }
        }
        return strBuilder.toString();
  }
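The getMax approach above scans the whole map once per entry and empties the map along the way. As an alternative, here is a minimal sketch that counts tokens with Map.merge and sorts a copy of the entries, leaving the map intact. It uses only the JDK: the class name `WordCountSketch`, its method names, and the sample tokens are made up for illustration, and in the real program the tokens would come from the ansj_seg result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    /** Counts each token after trimming; empty and single-character tokens are skipped. */
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            String word = token.trim();
            if (word.length() < 2) {
                continue;
            }
            // Insert 1 for a new word, otherwise add 1 to the existing count
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the entries ordered by descending count; ties are ordered by key. The map is not modified. */
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical, already segmented tokens
        List<String> tokens = Arrays.asList("中文", "分词", "中文", "统计", "的", "中文", "分词");
        for (Map.Entry<String, Integer> e : sortByCount(count(tokens))) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
        // Prints:
        // 中文: 3
        // 分词: 2
        // 统计: 1
    }
}
```

Sorting a copy costs O(n log n) instead of the O(n²) of repeated maximum extraction, and the counts stay available afterwards, for example to print only the top few entries with `subList`.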

 

Finally, here is a screenshot of the output:
