Jsoup
After we fetch a page, we need to parse it. String-processing utilities or regular expressions can do the job, but both are costly to develop and maintain, so we use a dedicated tool for parsing HTML pages instead.
1.1. Introduction to jsoup
jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
The main functions of jsoup are as follows:
- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.
Add the following dependencies to pom.xml:
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <!--test-->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <!--tool-->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
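As a rough illustration of the three capabilities listed above, the sketch below parses HTML from a string, finds a link with a CSS selector, and then rewrites its attribute and text. The class name JsoupQuickStart and the sample markup are invented for this example; only the jsoup calls (Jsoup.parse, title, select, attr, text) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQuickStart {

    // Hypothetical sample markup, not taken from any real page.
    public static final String HTML =
            "<html><head><title>City list</title></head>"
            + "<body><a id=\"home\" href=\"/index.html\">Home</a></body></html>";

    /** Parses the sample HTML from a string and returns the page title. */
    public static String readTitle() {
        Document document = Jsoup.parse(HTML);
        return document.title();
    }

    /** Finds the link with a CSS selector, then changes its attribute and text. */
    public static Element rewriteLink() {
        Document document = Jsoup.parse(HTML);
        Element link = document.select("a#home").first();
        link.attr("href", "/start.html"); // overwrite an existing attribute
        link.text("Start");               // replace the text content
        return link;
    }

    public static void main(String[] args) {
        System.out.println(readTitle());
        System.out.println(rewriteLink().outerHtml());
    }
}
```

Parsing from a string keeps the example independent of the network; Jsoup.parse also has overloads for a URL and for a File, which the tests later in this section use.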
Use the DOM to traverse documents
Element acquisition
- Find an element by id: getElementById
- Find elements by tag name: getElementsByTag
- Find elements by class name: getElementsByClass
- Find elements by attribute: getElementsByAttribute
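The four lookup methods can be tried against an inline HTML string, which makes the results easy to check. This is only a sketch: the class name DomLookupSketch and the sample markup (with the city_bj id, s_name class, and abc attribute borrowed from the names used in this section) are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomLookupSketch {

    // Hypothetical sample page; the ids and class names are illustrative only.
    public static final String HTML =
            "<html><head><title>Cities</title></head><body>"
            + "<div class=\"city_con\">"
            + "<h3 id=\"city_bj\">Beijing</h3>"
            + "<ul><li class=\"s_name\">Haidian</li><li class=\"s_name\">Chaoyang</li></ul>"
            + "<span abc=\"123\">Marker</span>"
            + "</div></body></html>";

    public static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = parse();

        // By id: returns a single Element, or null when no element has that id.
        Element byId = document.getElementById("city_bj");
        System.out.println(byId.text());

        // By tag name: returns every matching element in document order.
        Elements byTag = document.getElementsByTag("li");
        System.out.println(byTag.size());

        // By class name: last() picks the final match.
        Elements byClass = document.getElementsByClass("s_name");
        System.out.println(byClass.last().text());

        // By attribute name, regardless of the attribute's value.
        Elements byAttribute = document.getElementsByAttribute("abc");
        System.out.println(byAttribute.first().text());
    }
}
```

Note that getElementById returns one Element while the other three return an Elements collection, so first() and last() can return null when nothing matched.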
Getting data from elements
- Get the id from an element: id
- Get the class name from an element: className
- Get the value of an attribute from an element: attr
- Get all attributes from an element: attributes
- Get the text content from an element: text
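A small sketch of these accessors on a single element follows. The class name ElementDataSketch and the markup are hypothetical; the element deliberately carries two class names and a nested tag so that the behaviour of className and text is visible.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementDataSketch {

    // Hypothetical markup: one element with an id, two classes, and a custom attribute.
    public static final String HTML =
            "<div id=\"test\" class=\"class_a class_b\" abc=\"123\">Hello <b>world</b></div>";

    /** Parses the sample markup and returns the element with id "test". */
    public static Element sample() {
        return Jsoup.parse(HTML).getElementById("test");
    }

    public static void main(String[] args) {
        Element element = sample();

        System.out.println("id: " + element.id());
        // className() returns the raw class attribute, so both names appear.
        System.out.println("className: " + element.className());
        System.out.println("attr: " + element.attr("abc"));
        // A missing attribute yields an empty string rather than null.
        System.out.println("missing attr is empty: " + element.attr("missing").isEmpty());
        System.out.println("attribute count: " + element.attributes().size());
        // text() includes the text of child elements such as <b>.
        System.out.println("text: " + element.text());
    }
}
```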
Find elements using selector syntax
jsoup supports a selector syntax similar to CSS (or jQuery) selectors, which makes lookups powerful and flexible. The select method is available on Document, Element, and Elements objects. It is context-sensitive: it searches only within the object it is called on, so it can filter inside a specific element, and calls can be chained.
The select method returns an Elements collection, which provides a set of methods for extracting and processing the results.
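To make the context-sensitive behaviour concrete, the sketch below runs the same li query against the whole document, against an Elements result, and against a single Element. The class name SelectContextSketch and the two-block sample markup are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectContextSketch {

    // Hypothetical markup with two separate lists.
    public static final String HTML =
            "<div class=\"city_con\"><ul><li>Beijing</li><li>Shanghai</li></ul></div>"
            + "<div class=\"other\"><ul><li>Tokyo</li></ul></div>";

    public static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = parse();

        // On the Document: every li in the page.
        Elements all = document.select("li");
        System.out.println(all.size());

        // Chained on Elements: only li elements inside the .city_con matches.
        Elements chained = document.select(".city_con").select("li");
        System.out.println(chained.size());

        // On a single Element: the search is limited to that element's subtree.
        Element container = document.select(".city_con").first();
        System.out.println(container.select("li").first().text());

        // Elements.text() joins the text of all matches with a space.
        System.out.println(chained.text());
    }
}
```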
Selector overview
tagname: find elements by tag name, such as span
#id: find elements by ID, such as #city_bj
.class: find elements by class name, such as .class_a
[attribute]: find elements that have an attribute, such as [abc]
[attr=value]: find elements by attribute value, such as [class=s_name]
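The basic selectors can be compared side by side on a known HTML string. In this sketch the class name BasicSelectorSketch, the helper methods count and text, and the markup are all hypothetical. One li carries two class names to show a difference that is easy to miss: .s_name matches any element whose class list contains s_name, while [class=s_name] compares against the whole attribute value.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectorSketch {

    // Hypothetical markup; the last li has two class names on purpose.
    public static final String HTML =
            "<div class=\"city_con\">"
            + "<h3 id=\"city_bj\">Beijing</h3>"
            + "<ul>"
            + "<li class=\"class_a\">Haidian</li>"
            + "<li class=\"s_name\">Chaoyang</li>"
            + "<li class=\"s_name big\">Dongcheng</li>"
            + "</ul>"
            + "<span abc=\"1\">Marker</span><span>Plain</span>"
            + "</div>";

    private static final Document DOCUMENT = Jsoup.parse(HTML);

    /** Number of elements matched by the given selector. */
    public static int count(String cssQuery) {
        return DOCUMENT.select(cssQuery).size();
    }

    /** Combined text of all elements matched by the given selector. */
    public static String text(String cssQuery) {
        return DOCUMENT.select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println("span: " + count("span"));
        System.out.println("#city_bj: " + text("#city_bj"));
        System.out.println(".class_a: " + text(".class_a"));
        System.out.println("[abc]: " + text("[abc]"));
        // Class selector matches both li elements that list s_name.
        System.out.println(".s_name: " + count(".s_name"));
        // Attribute-value selector matches only the exact value "s_name".
        System.out.println("[class=s_name]: " + count("[class=s_name]"));
    }
}
```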
Combining selectors
el#id: element + ID, such as h3#city_bj
el.class: element + class, such as li.class_a
el[attr]: element + attribute name, such as span[abc]
Any combination, such as span[abc].s_name
ancestor child: find descendants of an element, such as .city_con li, which finds all li under "city_con"
parent > child: find direct children of a parent element, such as .city_con > ul > li, which finds the ul elements that are direct children of city_con and then the li elements that are direct children of those ul
parent > *: find all direct child elements of a parent element, such as .city_con > *
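The difference between the descendant and direct-child combinators is easiest to see with nested lists. The following sketch is an assumption-laden illustration: the class name CombinedSelectorSketch, the count helper, and the markup (which nests a second ul inside one li) are invented for this purpose.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinedSelectorSketch {

    // Hypothetical markup: the second li contains a nested list.
    public static final String HTML =
            "<div class=\"city_con\">"
            + "<h3 id=\"city_bj\">Beijing</h3>"
            + "<ul>"
            + "<li class=\"class_a\">Haidian</li>"
            + "<li>Chaoyang<ul><li>Sanlitun</li></ul></li>"
            + "</ul>"
            + "<span abc=\"1\" class=\"s_name\">Marker</span>"
            + "</div>";

    private static final Document DOCUMENT = Jsoup.parse(HTML);

    /** Number of elements matched by the given selector. */
    public static int count(String cssQuery) {
        return DOCUMENT.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Element combined with id, class, attribute, or several of them.
        System.out.println("h3#city_bj: " + count("h3#city_bj"));
        System.out.println("li.class_a: " + count("li.class_a"));
        System.out.println("span[abc]: " + count("span[abc]"));
        System.out.println("span[abc].s_name: " + count("span[abc].s_name"));

        // Descendant combinator: includes the li inside the nested list.
        System.out.println(".city_con li: " + count(".city_con li"));
        // Child combinator: only li elements directly under the outer ul.
        System.out.println(".city_con > ul > li: " + count(".city_con > ul > li"));
        // All direct children of .city_con: the h3, the outer ul, and the span.
        System.out.println(".city_con > *: " + count(".city_con > *"));
    }
}
```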
Code testing
    import java.io.File;
    import java.net.URL;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.junit.Test;

    public class JsoupTest {

        @Test
        public void testJsoupUrl() throws Exception {
            // Parse a URL; the second argument is the timeout in milliseconds
            Document document = Jsoup.parse(new URL("http://www.jingdong.com/"), 1000);

            // Get the contents of the title element
            Element title = document.getElementsByTag("title").first();
            System.out.println(title.text());
        }

        @Test
        public void testJsoupHtml() throws Exception {
            // Parse a file
            Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");

            // Get the contents of the title element
            Element title = document.getElementsByTag("title").first();
            System.out.println(title.text());

            // 1. Find an element by id: getElementById
            Element element = document.getElementById("city_bj");
            System.out.println(element.text());

            // 2. Find elements by tag name: getElementsByTag
            element = document.getElementsByTag("title").first();
            System.out.println(element.text());

            // 3. Find elements by class name: getElementsByClass
            element = document.getElementsByClass("s_name").last();
            System.out.println(element.text());

            // 4. Find elements by attribute name: getElementsByAttribute
            element = document.getElementsByAttribute("abc").first();
            System.out.println("abc:" + element.text());

            // Find elements by attribute name and value: getElementsByAttributeValue
            element = document.getElementsByAttributeValue("class", "city_con").first();
            System.out.println(element.text());
        }

        /**
         * Reading data from an element
         *
         * @throws Exception
         */
        @Test
        public void testJsoupHtml2() throws Exception {
            // Parse a file
            Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");

            // Get the element
            Element element = document.getElementById("test");

            // 1. Get the id from the element: id
            String str = element.id();
            System.out.println("id:" + str);

            // 2. Get the class name from the element: className
            str = element.className();
            System.out.println("className:" + str);

            // 3. Get the value of an attribute from the element: attr
            str = element.attr("id");
            System.out.println("attr:" + str);

            // 4. Get all attributes from the element: attributes
            str = element.attributes().toString();
            System.out.println("attributes:" + str);

            // 5. Get the text content from the element: text
            str = element.text();
            System.out.println("text:" + str);
        }

        /**
         * CSS selectors
         *
         * @throws Exception
         */
        @Test
        public void testJsoupHtml3() throws Exception {
            // Parse a file
            Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");

            // tagname: find elements by tag name, such as span
            Elements span = document.select("span");
            for (Element element : span) {
                System.out.println("span:" + element.text());
            }

            // #id: find elements by ID, such as #city_bj
            String str = document.select("#city_bj").text();
            System.out.println("#city_bj:" + str);

            // .class: find elements by class name, such as .class_a
            str = document.select(".class_a").text();
            System.out.println(".class_a:" + str);

            // [attribute]: find elements that have an attribute, such as [abc]
            str = document.select("[abc]").text();
            System.out.println("[abc]:" + str);

            // [attr=value]: find elements by attribute value, such as [class=s_name]
            str = document.select("[class=s_name]").text();
            System.out.println("[class=s_name]:" + str);
        }

        /**
         * Combined selectors
         *
         * @throws Exception
         */
        @Test
        public void testJsoupHtml4() throws Exception {
            // Parse a file
            Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");

            // el#id: element + ID, such as h3#city_bj
            String str = document.select("h3#city_bj").text();
            System.out.println("h3#city_bj:" + str);

            // el.class: element + class, such as li.class_a
            str = document.select("li.class_a").text();
            System.out.println("li.class_a:" + str);

            // el[attr]: element + attribute name, such as span[abc]
            str = document.select("span[abc]").text();
            System.out.println("span[abc]:" + str);

            // Any combination, such as span[abc].s_name
            str = document.select("span[abc].s_name").text();
            System.out.println("span[abc].s_name:" + str);

            // ancestor child: find descendants of an element,
            // such as .city_con li, which finds all li under "city_con"
            str = document.select(".city_con li").text();
            System.out.println(".city_con li:" + str);

            // parent > child: find direct children of a parent element,
            // such as .city_con > ul > li, which finds the ul directly under
            // city_con and then the li directly under those ul
            str = document.select(".city_con > ul > li").text();
            System.out.println(".city_con > ul > li:" + str);

            // parent > *: find all direct child elements of a parent element,
            // such as .city_con > *
            str = document.select(".city_con > *").text();
            System.out.println(".city_con > *:" + str);
        }
    }