Below is working code that uses an httpclient-based crawler to collect information about Chinese characters from a website. I ran into some character-encoding problems along the way. I had previously seen colleagues extract page information with HTML-parsing classes instead of regular expressions; I tried that approach several times, but the results were poor, even though a page crawler is otherwise easy to write. This exercise ran into similar difficulties, so I went back to extracting with regular expressions. The code is shared here for reference. Key information has not been removed.
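Before the full listing, a short note on the character-format problem mentioned above. The listing sets `DEFAULT_CHARSET = GB2312`, which suggests the site serves GB2312-encoded pages; that is an inference, not something the article states. Under that assumption, the sketch below shows why the charset has to be applied when the raw bytes are first decoded. The class name `CharsetDecodeSketch` and the method `decode` are hypothetical and are not part of the author's framework.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeSketch {

    /** Decodes a raw response body with the charset the page is assumed to use. */
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source file stays ASCII.
        String expected = "\u6c49\u5b57";

        // Simulate a response body as a GB2312 server would send it.
        byte[] body = expected.getBytes(Charset.forName("GB2312"));

        String right = decode(body, "GB2312");
        String wrong = new String(body, StandardCharsets.UTF_8);

        System.out.println(right.equals(expected)); // prints true
        System.out.println(wrong.equals(expected)); // prints false: the bytes are not valid UTF-8
    }
}
```

Once a string has been decoded with the wrong charset, re-encoding and decoding it with a single charset, as in `new String(content.getBytes(UTF_8), UTF_8)`, returns the same string and cannot repair it.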
public static void main(String[] args) throws SQLException {
    DEFAULT_CHARSET = GB2312;
    List<String> list = WriteRead.readTxtFileByLine(LONG_Path + "word.log");
    list.forEach(py -> {
        getPYAndWord(py);
    });
    testOver();
}

/**
 * Crawl the page for one pinyin syllable and store each character with its link.
 */
public static void getPYAndWord(String py) {
    output(py);
    String url = "http://zd.diyifanwen.com/zidian/py/" + py + ".htm";
    HttpGet httpGet = getHttpGet(url);
    JSONObject response = getHttpResponse(httpGet);
    // output(response);
    String content = response.getString("content");
    // Encoding and decoding with the same charset leaves the string unchanged.
    String all = new String(content.getBytes(UTF_8), UTF_8);
    // Collected characters; this local list shares its name with the regexAll helper.
    List<String> regexAll = new ArrayList<>();
    List<String> alllist = regexAll(all, "http://zd.d.*?>[\\u4e00-\\u9FFF]<");
    output(alllist.size());
    alllist.forEach(line -> {
        String murl = regexAll(line, "http://zd.diyifanwen.com/zidian/\\w/\\d+.htm").get(0);
        // Same character range as the pattern above, so get(0) always has a match.
        String mword = regexAll(line, ">[\\u4e00-\\u9FFF]<").get(0);
        regexAll.add(mword);
        output(murl, mword);
        String sql = "INSERT INTO chinese_dictionary_word (word,url) VALUES (\"%s\",\"%s\");";
        sql = String.format(sql, mword.replaceAll("<|>", EMPTY), murl);
        output(sql);
        MySqlTest.sendWork(sql);
    });
    String str = regexAll.toString().replaceAll("<|>|\\[|\\]", EMPTY);
    String sql = "INSERT INTO chinese_dictionary_py_word (py,words) VALUES (\"%s\",\"%s\");";
    sql = String.format(sql, py, str);
    output(sql);
    MySqlTest.sendWork(sql);
    sleep(2);
}

/**
 * Get the pinyin list page.
 *
 * @return the page content
 */
public static String getPY() {
    String url = "http://zd.diyifanwen.com/zidian/py/";
    HttpGet httpGet = getHttpGet(url);
    JSONObject response = getHttpResponse(httpGet);
    // output(response);
    String content = response.getString("content");
    byte[] bytes = content.getBytes(UTF_8);
    String all = new String(bytes, UTF_8);
    Log.log("content", all);
    return all;
}

/**
 * Get all the initials and their pinyin syllables.
 *
 * @param all the content of the pinyin list page
 */
public static void getAllPY(String all) {
    // The label is kept as it appears in this article. The offset of the initial
    // is derived from the label's length instead of being hard-coded as 5.
    String label = "Pinyin initials";
    List<String> list = regexAll(all, "<dt class=\"pyTitle\">" + label + "\\w+</dt>" + LINE + ".+/dd>");
    list.forEach(s -> {
        int num = s.indexOf(label);
        String first = s.substring(num + label.length(), num + label.length() + 1);
        List<String> list1 = regexAll(s, "http://zd.diyifanwen.com/zidian/py/\\w+.htm");
        list1.forEach(str -> {
            int one = str.indexOf("/py/");
            int two = str.lastIndexOf(".");
            String second = str.substring(one + 4, two);
            String sql = "INSERT INTO chinese_dictionary_py (first_word,all_word) VALUES (\"%s\",\"%s\");";
            String sqlEnd = String.format(sql, first, second);
            MySqlTest.sendWork(sqlEnd);
        });
    });
}

/**
 * Check that every pinyin syllable on the page is already in the database.
 *
 * @param all the content of the pinyin list page
 */
public static void checkPY(String all) {
    List<String> list = regexAll(all, "zidian/py/\\w+.htm");
    list.forEach(str -> {
        int one = str.indexOf("/py/");
        int two = str.lastIndexOf(".");
        String second = str.substring(one + 4, two);
        output(second);
        String sql = "SELECT * FROM chinese_dictionary_py WHERE all_word = \"%s\";";
        String sq = String.format(sql, second);
        ResultSet resultSet = MySqlTest.excuteQuerySql(sq);
        try {
            // Print the query for any syllable that has no row yet.
            if (!resultSet.next()) output(sq);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    });
}

/**
 * Read the pinyin syllables already stored in the database and save them to a file.
 *
 * @throws SQLException
 */
public static void getAllPY() throws SQLException {
    List<String> word = new ArrayList<>();
    ResultSet resultSet = MySqlTest.excuteQuerySql("SELECT all_word FROM chinese_dictionary_py;");
    while (resultSet.next()) {
        String string = resultSet.getString(1);
        word.add(string);
    }
    Save.saveStringList(word, "word");
}
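The listing above relies on a `regexAll(text, regex)` helper that the article does not show. A minimal sketch of what such a helper might look like, assuming it simply returns every match in order, is given here using only `java.util.regex`. The class name `RegexAllSketch` and the sample markup are made up for illustration; the real page structure may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexAllSketch {

    /** Returns every substring of text that matches regex, in order of appearance. */
    static List<String> regexAll(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup in the shape the crawler's patterns expect.
        String html = "<a href=\"http://zd.diyifanwen.com/zidian/a/1.htm\">\u554a</a>"
                + "<a href=\"http://zd.diyifanwen.com/zidian/a/2.htm\">\u963f</a>";

        // Step 1: cut the page into fragments that run from a link to its character.
        List<String> lines = regexAll(html, "http://zd\\.d.*?>[\\u4e00-\\u9FFF]<");

        // Step 2: pull the link and the character out of each fragment.
        for (String line : lines) {
            String url = regexAll(line, "http://zd\\.diyifanwen\\.com/zidian/\\w/\\d+\\.htm").get(0);
            String word = regexAll(line, ">[\\u4e00-\\u9FFF]<").get(0).replaceAll("<|>", "");
            System.out.println(word + " -> " + url);
        }
    }
}
```

The dots in the URL patterns are escaped in this sketch so that they match only a literal period; an unescaped `.` matches any character.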
The results are as follows:
The detailed explanation of each Chinese character is not crawled; only the link to its page is saved.