Preface
Reference article:
https://www.cnblogs.com/yangzhilong/p/3530700.html
https://www.cnblogs.com/liushaofeng89/p/4873086.html
Recently, users reported that part of our province/city/district data was missing. After searching Baidu for a while, I decided to pull the data myself. The data comes from the National Bureau of Statistics; what I pulled is the 2019 edition (dated 2020-02-25).
Provincial Data Source: National Bureau of Statistics
http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/
I use jsoup in Java. For jsoup usage, see the following article: https://www.open-open.com/jsoup/
Start
1. Prepare a table named region_directory
CREATE TABLE `region_directory` (
  `id` int(32) NOT NULL AUTO_INCREMENT,
  `pid` int(32) DEFAULT NULL COMMENT 'Parent ID',
  `name` varchar(64) DEFAULT NULL COMMENT 'Region name',
  `name_CN` varchar(64) DEFAULT NULL COMMENT 'English name of region',
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
  `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Modification time',
  `create_user` varchar(255) DEFAULT NULL COMMENT 'Created by',
  `update_user` varchar(255) DEFAULT NULL COMMENT 'Modified by',
  `is_open` char(2) DEFAULT NULL COMMENT 'Open or not (0 means not open 1 means open)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2421 DEFAULT CHARSET=utf8 COMMENT='Geographical table';
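To make the role of pid concrete, here is a minimal sketch in plain Java (no database) of how a region's full path can be rebuilt by following pid up to the top level. The RegionPath class, its Row type, and the sample ids are hypothetical and only for illustration; it assumes that pid = 0 marks a top-level row.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for rows of region_directory (illustration only).
final class RegionPath {

    /** One row: id, parent id (0 = top level, an assumption) and name. */
    static final class Row {
        final int id;
        final int pid;
        final String name;

        Row(int id, int pid, String name) {
            this.id = id;
            this.pid = pid;
            this.name = name;
        }
    }

    /** Walks the pid links from the given row up to the top level and joins the names. */
    static String pathOf(int id, Map<Integer, Row> rowsById) {
        Deque<String> names = new ArrayDeque<>();
        Row row = rowsById.get(id);
        while (row != null) {
            names.addFirst(row.name);
            row = rowsById.get(row.pid); // pid 0 has no row, which ends the walk
        }
        return String.join(" / ", names);
    }

    /** Builds a tiny sample table; the ids are made up. */
    static Map<Integer, Row> sample() {
        Map<Integer, Row> rows = new HashMap<>();
        rows.put(1, new Row(1, 0, "Beijing"));
        rows.put(2, new Row(2, 1, "Municipal district"));
        rows.put(3, new Row(3, 2, "Dongcheng District"));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(pathOf(3, sample()));
    }
}
```

Against the real table the same walk would be a chain of lookups by id, or a recursive query if the database supports one.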
2. Add the jsoup dependency to the pom file.
Newer versions are available officially; I use a version that is widely used.
<!-- jsoup HTML parser library @ https://jsoup.org/ -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
3. The code that pulls the data is mainly in getRegionDirectory.
4. One thing to note: in getRegionDirectory, the name variable holds the name of each province-level region. I added a check on it so that only Beijing's data is pulled at first. The reason is that the full data set is relatively large; if too much is pulled in one run, the connection returns a 502. Many websites now guard against this kind of aggressive crawling, so keep this in mind.
4.1 This is the 502 error mentioned above:
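One common way to soften transient failures such as the 502 above is to retry a failed fetch a few times with a pause in between. The following is a minimal sketch, not the code used in this article: RetryingFetcher, its parameters, and the delay values are assumptions, and the fetch is passed in as a Callable so that any loader (for example a call to getDocument) could be wrapped.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of the article's code): retries a failing call
// with a growing pause between attempts.
final class RetryingFetcher {

    /**
     * Runs the task up to maxAttempts times. After each failed attempt except
     * the last it sleeps baseDelayMillis * attemptNumber, and it rethrows the
     * last exception if every attempt fails.
     */
    static <T> T fetchWithRetry(Callable<T> task, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMillis * attempt); // linear back-off
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: fails twice, then succeeds.
        String page = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("HTTP 502");
            }
            return "<html>ok</html>";
        }, 5, 10L);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

Retrying does not replace pulling the data in smaller batches; it only helps with occasional failures, and the delays should stay polite to the target site.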
5. Next, call the data-pulling endpoint (GET /region/getRegionDirectory) in the browser:
The console prints the following data:
Data saved to the database:
6. All the code used in this article
RegionDirectoryController
package com.bos.controller.basic;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.bos.data.model.RegionDirectoryModel;
import com.bos.data.model.vo.basic.RegionVo;
import com.bos.data.repositories.jpa.setting.RegionDirectoryJPARepository;
import com.google.common.base.Strings;
import com.google.gson.Gson;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.transaction.interceptor.TransactionAspectSupport;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author tanghh
 * @Date 2020/6/23 10:37
 */
@RestController
@RequestMapping(value = "/region")
public class RegionDirectoryController {
    private Logger logger = LoggerFactory.getLogger(RegionDirectoryController.class);

    @Autowired
    private RegionDirectoryJPARepository regionDirectoryJPARepository;

    private static List<String> types = new ArrayList<>();
    private static List<String> specialCitys = new ArrayList<>();

    /** province */
    public static final String LEVEL_PROVINCE = "provincetr";
    /** city */
    public static final String LEVEL_CITY = "citytr";
    /** area (county level) */
    public static final String LEVEL_COUNTY = "countytr";
    /** street (town level) */
    public static final String LEVEL_TOWN = "towntr";
    /** neighborhood committee (village level) */
    public static final String LEVEL_VILLAGE = "villagetr";

    public static final int LEVEL_MODE_STRING = 1;
    public static final int LEVEL_MODE_NUMBER = 2;
    public static final String CHARSET = "GBK";

    static {
        types.add(LEVEL_PROVINCE);
        types.add(LEVEL_CITY);
        types.add(LEVEL_COUNTY);
        types.add(LEVEL_TOWN);
        types.add(LEVEL_VILLAGE);
    }

    /**
     * Special cities: they belong to LEVEL_CITY, but their next level skips LEVEL_COUNTY and goes straight to LEVEL_TOWN.
     * The data set is too large to check every city one by one; add any city found to behave this way here.
     */
    static {
        specialCitys.add("Dongguan City");
        specialCitys.add("Zhongshan City");
        specialCitys.add("Danzhou City");
    }

    //**************************Please modify the following values according to the actual situation*************************************
    /**
     * Home page to crawl
     */
    public static final String webUrl = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html";
    /**
     * Save path
     */
    public static final String savePath = "C:/project/latestEbo/ebo-web/ebo/src/main/resources/china.json";
    /**
     * Scope of the crawl: the whole country ("China") or a single province-level name, such as Guangdong Province or Beijing
     */
    public static final String AREA = "China";
    public static int TARGET_LEVEL = 3;
    /**
     * How the crawl depth is expressed:
     * LEVEL_MODE_STRING -- the level is derived from the row class; LEVEL_MODE_NUMBER -- the level is the numeric depth
     */
    public static int LEVEL_MODE = LEVEL_MODE_NUMBER;
    //**************************Please modify the above values according to the actual situation*************************************

    @GetMapping(value = "/getRegionDirectory")
    public void getRegionDirectory() {
        // id of the saved province row; it becomes the pid of the city rows
        int id = 0;
        try {
            System.out.println("Start grabbing,Please wait!!!");
            System.out.println("Grab range:" + AREA + ",Extraction Method (1--Character 2--number): " + LEVEL_MODE
                    + ",Grab level:" + TARGET_LEVEL + "(Mode is character: 1--province,2--city,3--county,4--town,5--village;)");
            long starttime = System.currentTimeMillis();
            RegionVo region = new RegionVo("000000000000", "China", 0);
            region.child = new ArrayList<>();
            Document doc = getDocument(webUrl);
            Elements provincetr = doc.getElementsByClass(LEVEL_PROVINCE);
            for (Element e : provincetr) {
                Elements a = e.getElementsByTag("a");
                for (Element ea : a) {
                    // Get the absolute path
                    String nextUrl = ea.attr("abs:href");
                    String[] arr = nextUrl.split("/");
                    // e.g. ".../11.html" -> "11" + ten zeros = 12-digit code
                    String code = arr[arr.length - 1].split("\\.")[0] + "0000000000";
                    String name = ea.text();
                    // Pull Beijing only, to keep the request volume low (see step 4).
                    // Note: the live pages are in Chinese, so name holds the Chinese text (e.g. 北京市);
                    // the name literals in this listing are English renderings.
                    if (name.equals("Beijing")) {
                        if (AREA.equals("China") || AREA.equals(name)) {
                            System.out.println(name);
                            RegionDirectoryModel regionDirectoryModel = new RegionDirectoryModel();
                            regionDirectoryModel.setPid(0);
                            regionDirectoryModel.setName(name);
                            RegionDirectoryModel newModel = regionDirectoryJPARepository.save(regionDirectoryModel);
                            id = newModel.getId();
                            System.out.println(id);
                            RegionVo child = new RegionVo(code, name, 1);
                            region.child.add(child);
                            int currentlevel = LEVEL_MODE == LEVEL_MODE_STRING ? getLevel(LEVEL_PROVINCE) : child.level;
                            // Indicates that further parsing is needed
                            if (currentlevel < TARGET_LEVEL) {
                                parseNext(types.get(1), nextUrl, child);
                            }
                        }
                    }
                }
            }
            // Convert the tree to JSON and save the city and county rows
            String jsonStr = new Gson().toJson(region);
            JSONObject jsonObject = JSONObject.parseObject(jsonStr);
            JSONArray childJsonArray = jsonObject.getJSONArray("child");
            for (int i = 0; i < childJsonArray.size(); i++) {
                JSONObject childJsonObject = (JSONObject) childJsonArray.get(i);
                JSONArray jsonArray = childJsonObject.getJSONArray("child");
                for (Object o : jsonArray) {
                    JSONObject itemJsonObject = (JSONObject) o;
                    // City rows hang off the province id; county rows hang off their own city's id.
                    // (Previously the city id overwrote id, so a second city would get the first city as parent.)
                    int cityId = id;
                    if (!Strings.isNullOrEmpty(itemJsonObject.getString("name"))) {
                        RegionDirectoryModel regionDirectoryModel = new RegionDirectoryModel();
                        regionDirectoryModel.setPid(id);
                        regionDirectoryModel.setName(itemJsonObject.getString("name"));
                        RegionDirectoryModel newModel = regionDirectoryJPARepository.saveAndFlush(regionDirectoryModel);
                        cityId = newModel.getId();
                    }
                    JSONArray finalChildJsonArray = itemJsonObject.getJSONArray("child");
                    for (Object o1 : finalChildJsonArray) {
                        JSONObject finalJsonObject = (JSONObject) o1;
                        RegionDirectoryModel regionDirectoryModel = new RegionDirectoryModel();
                        regionDirectoryModel.setPid(cityId);
                        regionDirectoryModel.setName(finalJsonObject.getString("name"));
                        regionDirectoryJPARepository.save(regionDirectoryModel);
                    }
                }
            }
            long endtime = System.currentTimeMillis();
            System.out.println("Grab finished!!! Time consuming:" + (endtime - starttime) / 1000 / 60 + "min");
        } catch (Exception e) {
            logger.error("Failed to get province data", e);
            // Needs an active transaction (e.g. @Transactional on this method);
            // without one, this call itself throws NoTransactionException.
            TransactionAspectSupport.currentTransactionStatus().setRollbackOnly();
        }
    }

    private static Document getDocument(String url) throws IOException {
        // Close the stream once jsoup has parsed the page
        try (InputStream in = new URL(url).openStream()) {
            return Jsoup.parse(in, CHARSET, url);
        }
    }

    /**
     * @param type See LEVEL_*
     * @return 1-based level of the given row class
     */
    private static int getLevel(String type) {
        return types.indexOf(type) + 1;
    }

    private static void saveJson(RegionVo region) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File(savePath)))) {
            bw.write(new Gson().toJson(region));
            bw.flush();
        }
    }

    /**
     * Parse the next level of data
     *
     * @param type   See LEVEL_*
     * @param url    Web page url to grab
     * @param region Node that receives the parsed children
     * @throws Exception
     */
    public static void parseNext(String type, String url, RegionVo region) throws Exception {
        region.child = new ArrayList<>();
        Document doc = getDocument(url);
        Elements es = doc.getElementsByClass(type);
        if (LEVEL_VILLAGE.equals(type)) {
            // <tr class="villagetr">: code in the first cell, name in the third
            for (Element e : es) {
                Elements tds = e.getElementsByTag("td");
                String code = tds.get(0).text();
                String name = tds.get(2).text();
                RegionVo child = new RegionVo(code, name, region.level + 1);
                region.child.add(child);
                System.out.println(space(child.level) + name);
            }
        } else {
            // There are two situations that need to be handled
            // The first type: <tr class="countytr"><td>130101000000</td><td>municipal district</td></tr>
            // The second type: <tr class="countytr"><td><a href="01/130102.html">130102000000</a></td><td><a href="01/130102.html">Chang'an District</a></td></tr>
            for (Element e : es) {
                String code = null;
                String name = null;
                String nextUrl = null;
                Elements a = e.getElementsByTag("a");
                if (a.isEmpty()) {
                    // The first case: plain cells, no link to a lower level
                    Elements tds = e.getElementsByTag("td");
                    code = tds.get(0).text();
                    name = tds.get(1).text();
                } else {
                    // 13/1301.html
                    nextUrl = a.get(0).attr("abs:href");
                    code = a.get(0).text();
                    name = a.get(1).text();
                }
                RegionVo child = new RegionVo(code, name, region.level + 1);
                region.child.add(child);
                System.out.println(space(child.level) + name);
                int currentlevel = LEVEL_MODE == LEVEL_MODE_STRING ? getLevel(type) : child.level;
                if (!a.isEmpty() && currentlevel < TARGET_LEVEL) {
                    // For cities such as Dongguan City, the level after LEVEL_CITY is LEVEL_TOWN, not LEVEL_COUNTY, so they need special treatment here
                    String nextType = null;
                    if (LEVEL_MODE == LEVEL_MODE_NUMBER && (specialCitys.contains(name))) {
                        nextType = LEVEL_TOWN;
                    } else {
                        nextType = types.get(types.indexOf(type) + 1);
                    }
                    parseNext(nextType, nextUrl, child);
                }
            }
        }
    }

    private static String space(int level) {
        if (level > 5) {
            return "";
        }
        // One space of indentation per level
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < level; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }
}
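Two small rules inside the controller are easy to miss: how the 12-digit province code is derived from a link, and how the row class for the next page is chosen, including the special cities. The sketch below isolates them in plain Java so they can be tried without the website. RegionRules and its method names are hypothetical; the city names are the English renderings used in this article (the live site uses Chinese names); and, unlike the controller, this version only applies the special-city rule at the city level.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-alone sketch of two rules used by the crawler (names are illustrative).
final class RegionRules {

    // Row classes in crawl order, mirroring the LEVEL_* constants.
    static final List<String> TYPES =
            Arrays.asList("provincetr", "citytr", "countytr", "towntr", "villagetr");

    // Cities whose pages link straight to towns (English placeholders).
    static final Set<String> SPECIAL_CITIES =
            new HashSet<>(Arrays.asList("Dongguan City", "Zhongshan City", "Danzhou City"));

    /** Pads the file name of a province link (e.g. ".../11.html") to a 12-digit code. */
    static String provinceCode(String url) {
        String file = url.substring(url.lastIndexOf('/') + 1); // "11.html"
        String prefix = file.split("\\.")[0];                   // "11"
        return prefix + "0000000000";
    }

    /** Returns the row class to look for on the next page, or null if there is none. */
    static String nextType(String currentType, String regionName) {
        if ("citytr".equals(currentType) && SPECIAL_CITIES.contains(regionName)) {
            return "towntr"; // these cities have no county level
        }
        int next = TYPES.indexOf(currentType) + 1;
        return (next > 0 && next < TYPES.size()) ? TYPES.get(next) : null;
    }

    public static void main(String[] args) {
        System.out.println(provinceCode("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/11.html"));
        System.out.println(nextType("citytr", "Dongguan City"));
    }
}
```

Returning null below the village level avoids the index error that a plain types.get(index + 1) would raise if it were ever called for the last level.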
RegionVo
package com.bos.data.model.vo.basic;

import lombok.Data;

import java.util.List;

/**
 * @Author tanghh
 * @Date 2020/6/23 10:41
 */
@Data
public class RegionVo {
    /**
     * code
     */
    public String code;
    /**
     * name
     */
    public String name;
    /**
     * Current level
     */
    public int level;
    /**
     * Child regions
     */
    public List<RegionVo> child;

    public RegionVo(String code, String name, int level) {
        this.code = code;
        this.name = name;
        this.level = level;
    }
}
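When a tree like RegionVo is written to a table with a pid column, each child row needs the id of its own parent rather than the id of whichever row was saved last. The sketch below shows that bookkeeping in plain Java, with a list index standing in for the auto-increment key. RegionFlattener, Node, and Row are hypothetical stand-ins, and the sample names are only examples.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a region tree into flat (id, pid, name) rows.
final class RegionFlattener {

    /** Minimal tree node, similar in shape to RegionVo (hypothetical stand-in). */
    static final class Node {
        final String name;
        final List<Node> child = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node n) {
            child.add(n);
            return this;
        }
    }

    /** One flattened row. */
    static final class Row {
        final int id;
        final int pid;
        final String name;

        Row(int id, int pid, String name) {
            this.id = id;
            this.pid = pid;
            this.name = name;
        }
    }

    /** Depth-first walk; the root gets pid 0 and every child gets its parent's id. */
    static List<Row> flatten(Node root) {
        List<Row> rows = new ArrayList<>();
        visit(root, 0, rows);
        return rows;
    }

    private static void visit(Node node, int pid, List<Row> out) {
        int id = out.size() + 1; // stands in for an auto-increment primary key
        out.add(new Row(id, pid, node.name));
        for (Node c : node.child) {
            visit(c, id, out);
        }
    }

    /** Sample tree: one province with two cities, each with one district. */
    static Node sample() {
        return new Node("Hebei Province")
                .add(new Node("Shijiazhuang City").add(new Node("Chang'an District")))
                .add(new Node("Tangshan City").add(new Node("Lunan District")));
    }

    public static void main(String[] args) {
        for (Row r : flatten(sample())) {
            System.out.println(r.id + " " + r.pid + " " + r.name);
        }
    }
}
```

Passing the parent id down as a parameter keeps it correct however many siblings there are at each level.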
RegionDirectoryModel
package com.bos.data.model;

import javax.persistence.*;
import java.io.Serializable;
import java.sql.Timestamp;
import java.util.Objects;

/**
 * @author luojie 2018/7/4
 */
@Entity
@Table(name = "region_directory", schema = "test", catalog = "")
public class RegionDirectoryModel implements Serializable {
    private Integer id;
    private Integer pid;
    private String name;
    private String nameCn;
    private String isOpen = "0";
    private Timestamp createTime;
    private Timestamp updateTime;
    private String createUser;
    private String updateUser;

    @Id
    @Column(name = "id")
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @Basic
    @Column(name = "name")
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Basic
    @Column(name = "name_CN")
    public String getNameCn() {
        return nameCn;
    }

    public void setNameCn(String nameCn) {
        this.nameCn = nameCn;
    }

    @Basic
    @Column(name = "pid")
    public Integer getPid() {
        return pid;
    }

    public void setPid(Integer pid) {
        this.pid = pid;
    }

    @Basic
    @Column(name = "is_open")
    public String getIsOpen() {
        return isOpen;
    }

    public void setIsOpen(String isOpen) {
        this.isOpen = isOpen;
    }

    @Basic
    @Column(name = "create_time")
    public Timestamp getCreateTime() {
        return createTime;
    }

    public void setCreateTime(Timestamp createTime) {
        this.createTime = createTime;
    }

    @Basic
    @Column(name = "update_time")
    public Timestamp getUpdateTime() {
        return updateTime;
    }

    public void setUpdateTime(Timestamp updateTime) {
        this.updateTime = updateTime;
    }

    @Basic
    @Column(name = "create_user")
    public String getCreateUser() {
        return createUser;
    }

    public void setCreateUser(String createUser) {
        this.createUser = createUser;
    }

    @Basic
    @Column(name = "update_user")
    public String getUpdateUser() {
        return updateUser;
    }

    public void setUpdateUser(String updateUser) {
        this.updateUser = updateUser;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        RegionDirectoryModel that = (RegionDirectoryModel) o;
        // id is an Integer, so compare by value; == would compare object references
        return Objects.equals(id, that.id)
                && Objects.equals(name, that.name)
                && Objects.equals(nameCn, that.nameCn)
                && Objects.equals(pid, that.pid);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name, nameCn, pid);
    }
}
That's the end of this article.
If you found it useful, please leave a comment and a like.
The next post will contain the data for all provinces.