Web crawler: jsoup parsing

Posted by JonnyThunder on Wed, 07 Aug 2019 08:50:14 +0200

Jsoup

After fetching a page, we need to parse it. Pages can be parsed with string-processing utilities or regular expressions, but both approaches carry a high development cost, so it is better to use a dedicated HTML parsing library.

1.1. Introduction to jsoup

jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

The main functions of jsoup are as follows:

  1. Parse HTML from a URL, file, or string;
  2. Find and extract data using DOM traversal or CSS selectors;
  3. Manipulate HTML elements, attributes, and text.

Add the following Maven dependencies to the project:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
<!--test-->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
<!--tool-->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>
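
As a rough illustration of the three main functions listed above (parse, find, manipulate), here is a minimal sketch. The HTML string, the class name JsoupQuickStart, and the helper method names are made up for this example; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class JsoupQuickStart {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>Cities</title></head><body>"
      + "<ul class=\"city_con\">"
      + "<li><a href=\"/bj\">Beijing</a></li>"
      + "<li><a href=\"/sh\">Shanghai</a></li>"
      + "</ul></body></html>";

    // 1. Parse HTML from a string.
    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // 2. Find data with a CSS selector: the text of every link in the list.
    static String linkTexts(Document doc) {
        Elements links = doc.select("ul.city_con a");
        return links.text(); // texts of all matches, joined by single spaces
    }

    // 3. Manipulate an element: change the href and the text of the first link.
    static Element rewriteFirstLink(Document doc) {
        Element first = doc.select("a").first();
        first.attr("href", "/beijing").text("Beijing (capital)");
        return first;
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(doc.title());
        System.out.println(linkTexts(doc));
        System.out.println(rewriteFirstLink(doc).outerHtml());
    }
}
```

Parsing from a string keeps the example independent of the network and of local files; the same Document API applies when the input comes from a URL or a file.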

Use the DOM to traverse documents

Element acquisition

  1. getElementById: find an element by its id
  2. getElementsByTag: find elements by tag name
  3. getElementsByClass: find elements by class name
  4. getElementsByAttribute: find elements that have a given attribute
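
The four lookup methods above can be tried without any external file. The sketch below runs them against an inline HTML string; the markup and the class name DomLookupSketch are assumptions for illustration, with ids and class names borrowed from this article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomLookupSketch {

    // Hypothetical markup; the ids, classes and the "abc" attribute mirror names used in this article.
    static final String HTML =
        "<html><head><title>City list</title></head><body>"
      + "<h3 id=\"city_bj\">Beijing</h3>"
      + "<span class=\"s_name\" abc=\"123\">Guangzhou</span>"
      + "<span class=\"s_name\">Shenzhen</span>"
      + "</body></html>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();

        // By id: returns a single Element, or null when the id is absent.
        Element byId = doc.getElementById("city_bj");
        System.out.println(byId.text());

        // By tag name: returns an Elements collection (possibly empty).
        Elements spans = doc.getElementsByTag("span");
        System.out.println(spans.size());

        // By class name: take the last match.
        Element lastName = doc.getElementsByClass("s_name").last();
        System.out.println(lastName.text());

        // By attribute presence: take the first match.
        Element withAbc = doc.getElementsByAttribute("abc").first();
        System.out.println(withAbc.text());

        // A missing id yields null, so check before dereferencing.
        Element missing = doc.getElementById("no_such_id");
        System.out.println(missing == null);
    }
}
```

Note that first() and last() also return null on an empty result, so real crawler code would usually check for null before calling text().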

Getting data from elements

  1. Get the id of an element: id()
  2. Get the class name of an element: className()
  3. Get the value of one attribute: attr(key)
  4. Get all attributes of an element: attributes()
  5. Get the text content of an element: text()
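
A small sketch of these five accessors follows. The div element, its attribute values, and the class name ElementDataSketch are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

class ElementDataSketch {

    // Hypothetical fragment; jsoup wraps it in <html><body> when parsing.
    static final String HTML =
        "<div id=\"test\" class=\"box highlight\" data-region=\"north\">Hello <b>jsoup</b></div>";

    static Element target() {
        return Jsoup.parse(HTML).getElementById("test");
    }

    public static void main(String[] args) {
        Element el = target();

        // 1. The id attribute.
        System.out.println("id: " + el.id());

        // 2. The raw class attribute; classNames() would split it into a set.
        System.out.println("className: " + el.className());

        // 3. A single attribute value; a missing attribute yields "" rather than null.
        System.out.println("data-region: " + el.attr("data-region"));
        System.out.println("missing is empty: " + el.attr("missing").isEmpty());

        // 4. All attributes, iterated as key/value pairs.
        for (Attribute attribute : el.attributes()) {
            System.out.println(attribute.getKey() + " = " + attribute.getValue());
        }

        // 5. The text of the element and its descendants, with tags removed.
        System.out.println("text: " + el.text());
    }
}
```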

Find elements using selector syntax

jsoup supports a selector syntax similar to CSS (or jQuery), which makes lookups powerful and flexible. The select method is available on Document, Element, and Elements objects. It is context-sensitive: it searches only within the object it is called on, so you can filter inside a specific element or chain several select calls.

The select method returns an Elements collection, which provides a set of methods to extract and process the results.
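
To make the context-sensitive behaviour concrete, here is a minimal sketch that runs the same query against a Document, a single Element, and an Elements collection. The markup and the class name SelectContextSketch are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectContextSketch {

    // Hypothetical markup with two separate lists.
    static final String HTML =
        "<div class=\"city_con\"><ul><li>Beijing</li><li>Shanghai</li></ul></div>"
      + "<div class=\"other\"><ul><li>Tokyo</li></ul></div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();

        // On the Document: every li in the page is a candidate.
        Elements all = doc.select("li");
        System.out.println(all.size());

        // On one Element: only that element's subtree is searched.
        Element cityCon = doc.select("div.city_con").first();
        Elements inCityCon = cityCon.select("li");
        System.out.println(inCityCon.text());

        // On an Elements result: chained select calls narrow the search step by step.
        Elements chained = doc.select("div.other").select("li");
        System.out.println(chained.text());
    }
}
```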

Selector overview

tagname: find elements by tag name, such as: span

#id: find elements by ID, such as: #city_bj

.class: find elements by class name, such as: .class_a

[attribute]: find elements that have an attribute, such as: [abc]

[attr=value]: find elements by attribute value, such as: [class=s_name]
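
The five basic selectors can be compared side by side with a short sketch. The markup, the class name BasicSelectorSketch, and the helper textOf are hypothetical; the second span deliberately carries two classes to show how [class=s_name] differs from .s_name.

```java
import org.jsoup.Jsoup;

class BasicSelectorSketch {

    // Hypothetical markup reusing the ids and class names from this article.
    static final String HTML =
        "<h3 id=\"city_bj\">Beijing</h3>"
      + "<ul><li class=\"class_a\">Shanghai</li><li>Tianjin</li></ul>"
      + "<span abc=\"1\" class=\"s_name\">Guangzhou</span>"
      + "<span class=\"s_name other\">Shenzhen</span>";

    // Runs one selector and returns the combined text of all matches.
    static String textOf(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println("span -> " + textOf("span"));
        System.out.println("#city_bj -> " + textOf("#city_bj"));
        System.out.println(".class_a -> " + textOf(".class_a"));
        System.out.println("[abc] -> " + textOf("[abc]"));
        // [class=s_name] compares the whole attribute value, so the span with
        // class="s_name other" is not matched; .s_name matches both spans.
        System.out.println("[class=s_name] -> " + textOf("[class=s_name]"));
        System.out.println(".s_name -> " + textOf(".s_name"));
    }
}
```

When an element may carry several classes, the .class form is usually the safer choice than [class=value].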

Selector combinations

el#id: element + ID, such as: h3#city_bj

el.class: element + class, such as: li.class_a

el[attr]: element + attribute name, such as: span[abc]

Any combination, such as: span[abc].s_name

ancestor child: find descendants of an element, such as: .city_con li finds all li under "city_con"

parent > child: find direct children of a parent element, such as: .city_con > ul > li finds the ul elements that are direct children of city_con, and then the li elements that are direct children of those ul

parent > *: find all direct child elements of a parent element
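
The difference between the descendant form (ancestor child) and the direct-child form (parent > child) is easiest to see with a nested list. The following sketch counts matches for each combination; the markup, the class name CombinedSelectorSketch, and the helper count are assumptions made for this example.

```java
import org.jsoup.Jsoup;

class CombinedSelectorSketch {

    // Hypothetical markup: one li contains a nested ul, so "descendant" and
    // "direct child" selectors give different results.
    static final String HTML =
        "<div class=\"city_con\">"
      + "<h3 id=\"city_bj\">Beijing</h3>"
      + "<ul>"
      + "<li class=\"class_a\">Haidian</li>"
      + "<li>Chaoyang<ul><li>Wangjing</li></ul></li>"
      + "</ul>"
      + "<span abc=\"1\" class=\"s_name\">note</span>"
      + "</div>";

    // Returns how many elements a selector matches in the sample markup.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("h3#city_bj"));          // element + ID
        System.out.println(count("li.class_a"));          // element + class
        System.out.println(count("span[abc].s_name"));    // element + attribute + class
        System.out.println(count(".city_con li"));        // every li at any depth
        System.out.println(count(".city_con > ul > li")); // only li of the top-level ul
        System.out.println(count(".city_con > *"));       // direct children: h3, ul, span
    }
}
```

With this markup the descendant selector also picks up the nested li, while the direct-child chain stops at the first level.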

Code testing

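import java.io.File;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

public class JsoupTest {

    @Test
    public void testJsoupUrl() throws Exception {
        // Parse a URL; the second argument is the timeout in milliseconds
        Document document = Jsoup.parse(new URL("http://www.jingdong.com/"), 1000);
        // Get the content of the <title> element
        Element title = document.getElementsByTag("title").first();
        System.out.println(title.text());
    }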

    @Test
    public void testJsoupHtml() throws Exception {
        // Parse a local file
        Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");
        // Get the content of the <title> element
        Element title = document.getElementsByTag("title").first();
        System.out.println(title.text());
        // 1. Find an element by id: getElementById
        Element element = document.getElementById("city_bj");
        System.out.println(element.text());
        // 2. Find elements by tag name: getElementsByTag
        element = document.getElementsByTag("title").first();
        System.out.println(element.text());
        // 3. Find elements by class name: getElementsByClass
        element = document.getElementsByClass("s_name").last();
        System.out.println(element.text());
        // 4. Find elements that have an attribute: getElementsByAttribute
        element = document.getElementsByAttribute("abc").first();
        System.out.println("abc:" + element.text());
        // Find elements by attribute name and value: getElementsByAttributeValue
        element = document.getElementsByAttributeValue("class", "city_con").first();
        System.out.println(element.text());
    }

    /**
     * Getting data from an element
     *
     * @throws Exception
     */
    @Test
    public void testJsoupHtml2() throws Exception {
        // Parse a local file
        Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");
        // Get the element to inspect
        Element element = document.getElementById("test");
        // 1. Get the id of the element: id()
        String str = element.id();
        System.out.println("id:" + str);
        // 2. Get the class name of the element: className()
        str = element.className();
        System.out.println("className:" + str);
        // 3. Get the value of one attribute: attr(key)
        str = element.attr("id");
        System.out.println("attr:" + str);
        // 4. Get all attributes of the element: attributes()
        str = element.attributes().toString();
        System.out.println("attributes:" + str);
        // 5. Get the text content of the element: text()
        str = element.text();
        System.out.println("text:" + str);
    }

    /**
     * CSS selectors
     *
     * @throws Exception
     */
    @Test
    public void testJsoupHtml3() throws Exception {
        // Parse a local file
        Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");
        // tagname: find elements by tag name, such as: span
        Elements span = document.select("span");
        for (Element element : span) {
            System.out.println("span:" + element.text());
        }
        // #id: find elements by ID, such as: #city_bj
        String str = document.select("#city_bj").text();
        System.out.println("#city_bj:" + str);
        // .class: find elements by class name, such as: .class_a
        str = document.select(".class_a").text();
        System.out.println(".class_a:" + str);
        // [attribute]: find elements that have an attribute, such as: [abc]
        str = document.select("[abc]").text();
        System.out.println("[abc]:" + str);
        // [attr=value]: find elements by attribute value, such as: [class=s_name]
        str = document.select("[class=s_name]").text();
        System.out.println("[class=s_name]:" + str);
    }

    /**
     * Combined selectors
     *
     * @throws Exception
     */
    @Test
    public void testJsoupHtml4() throws Exception {
        // Parse a local file
        Document document = Jsoup.parse(new File("F:\\boss\\1.html"), "UTF-8");
        // el#id: element + ID, such as: h3#city_bj
        String str = document.select("h3#city_bj").text();
        System.out.println("h3#city_bj:" + str);
        // el.class: element + class, such as: li.class_a
        str = document.select("li.class_a").text();
        System.out.println("li.class_a:" + str);
        // el[attr]: element + attribute name, such as: span[abc]
        str = document.select("span[abc]").text();
        System.out.println("span[abc]:" + str);
        // Any combination, such as: span[abc].s_name
        str = document.select("span[abc].s_name").text();
        System.out.println("span[abc].s_name:" + str);
        // ancestor child: find descendants of an element, such as: .city_con li finds all li under "city_con"
        str = document.select(".city_con li").text();
        System.out.println(".city_con li:" + str);
        // parent > child: find direct children of a parent element.
        // For example, .city_con > ul > li finds the ul directly under city_con, then the li directly under those ul
        str = document.select(".city_con > ul > li").text();
        System.out.println(".city_con > ul > li:" + str);
        // parent > *: find all direct child elements of a parent, such as: .city_con > *
        str = document.select(".city_con > *").text();
        System.out.println(".city_con > *:" + str);
    }
}
