Summary of Java crawler using WebMagic

Posted by jil on Tue, 01 Mar 2022 05:59:19 +0100

WebMagic introduction

The WebMagic project code is divided into two parts: core and extension. The core part is a simplified, modular crawler implementation, while the extension part provides some convenient and practical features. WebMagic's architecture is modeled on Scrapy, and the goal is to be as modular as possible and to reflect the functional characteristics of a crawler.

The core part provides a very simple and flexible API, so a crawler can be written without much change to your normal development style.

The WebMagic extension provides some convenient features, such as an annotation mode for writing crawlers. It also has some commonly used components built in to make crawler development easier.
WebMagic Chinese documentation

Notes collected while using it

1. Only HTTP can be crawled, not HTTPS

Reference blog (scheme 1: modify the source code; scheme 2: nginx reverse proxy)
For scheme 2, configure nginx as follows:

server {
    listen       8808;
    # server_name  partner.domain.com;
    # server_name does not have to be set; just change the port number and access the proxy directly via localhost:<port>.
    # If you do want to set server_name, it also needs to be configured in the hosts file.

    location / {
        proxy_set_header Host github.com;
        proxy_pass https://github.com/;
    }
}

Now https://github.com/ can be accessed through http://localhost:8808/ .
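With the proxy in place, the crawler is simply pointed at the local HTTP address instead of the HTTPS site. A minimal sketch (the GithubPageProcessor class name is a placeholder for your own PageProcessor implementation):

import us.codecraft.webmagic.Spider;

public class ProxyCrawlerDemo {
    public static void main(String[] args) {
        // Crawl the HTTPS site through the local nginx reverse proxy on port 8808.
        // GithubPageProcessor is a placeholder for your own PageProcessor implementation.
        Spider.create(new GithubPageProcessor())
                .addUrl("http://localhost:8808/") // proxied by nginx to https://github.com/
                .thread(1)
                .run();
    }
}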

2. Set cookies to simulate login

Since the crawled page requires login, cookies can be used to simulate a logged-in session.

2.1. How to set cookies in code

See section 4.4 "Configuration, startup and termination of the crawler" in the WebMagic Chinese documentation.
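As a rough sketch of what that section describes, the cookie can be added to the Site object inside the PageProcessor. The domain and cookie value below are placeholders; the cookie name "session" is the key found during troubleshooting in 2.2:

import us.codecraft.webmagic.Site;

// Inside your PageProcessor implementation
private Site site = Site.me()
        .setDomain("example.com")                 // placeholder: domain of the crawled site
        .addCookie("session", "xxxxxxxxxxxxxxxx") // cookie value copied from a logged-in browser
        .setRetryTimes(3)
        .setSleepTime(1000);

@Override
public Site getSite() {
    return site;
}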

2.2. How to find out which cookie is required for login

Open the browser settings and look at "Cookies and stored data". Remove the values one by one; the one whose removal forces the page to ask for login again is the value required for login. (After troubleshooting, the key turned out to be session.)

3. Regular expression

3.1. Brief introduction

A regular expression (English: Regular Expression, often abbreviated as regex, regexp or RE in code) is a concept from computer science. Regular expressions are often used to search for and replace text that matches a pattern (rule).

3.2. Used in this project
\w    Matches letters, digits and underscores. Equivalent to [A-Za-z0-9_].
[\w-] Matches letters, digits, underscores and the hyphen ([A-Za-z0-9_] plus the minus sign).
+     Matches the preceding subexpression one or more times (at least once). For example, "zo+" matches "zo" and "zoo", but not "z". Equivalent to {1,}.
.     Matches any single character except '\n' and '\r'. To match any character including '\n' and '\r', use a pattern like [\s\S].
*     Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo" and "zoo". Equivalent to {0,}.
.*    Matches any single character any number of times, i.e. matches any string. For example, <a href="/news/"><li class="on">all</li></a> can be matched by <a.*</a>.

Difference between [\w-] and [\w]
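As a quick illustration of the tokens above (a self-contained sketch, not code from the project), the pattern below uses [\w-]+ and a lazy .*? to pull the href out of the example anchor tag:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String html = "<a href=\"/news/\"><li class=\"on\">all</li></a>";
        // [\w-]+ matches letters, digits, underscore and hyphen; .*? matches any characters lazily
        Pattern p = Pattern.compile("<a href=\"(/[\\w-]+/)\">.*?</a>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints /news/
        }
    }
}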

4. Store the crawled data in the database

In the official documentation, section "4.3 Using Pipeline to save results" only covers outputting results to the console and two schemes for generating local JSON files. At the end of that section it says that Chapter 7 will "implement features such as saving results to files and databases". It turns out there is no Chapter 7 ==.

4.1. Code

Reference: WebMagic crawls the website data and saves it to a local file.
Write a class yourself that implements the Pipeline interface,
then put the logic for inserting the crawled data into the database in that class.

@Component
public class TestJsonFilePipeline extends FilePersistentBase implements Pipeline {
    private Logger logger = LoggerFactory.getLogger(this.getClass());

    // Static self-reference so the Spring-injected mapper can be used from any instance (see 4.2)
    public static TestJsonFilePipeline testJsonFilePipeline;

    @PostConstruct
    public void init(){
        testJsonFilePipeline = this;
    }

    @Autowired
    private LinkMapper linkMapper;

    public TestJsonFilePipeline() {
        this.setPath("/data/webmagic");
    }

    public TestJsonFilePipeline(String path) {
        this.setPath(path);
    }

    /**
     * The key part:
     * save the data to MySQL
     * @param resultItems
     */
    private void toMysql(ResultItems resultItems){
        final List<String> listLink = JSON.parseArray(JSON.toJSONString(resultItems.get("link")), String.class);
        final List<String> listTitle = JSON.parseArray(JSON.toJSONString(resultItems.get("title")), String.class);
        for (int j = 0; j < listLink.size(); j++) {
            Link link = new Link();
            link.setLink(listLink.get(j));
            link.setTitle(listTitle.get(j));
            link.setPage(resultItems.get("page").toString());
            testJsonFilePipeline.linkMapper.insert(link);
        }
    }

    public void process(ResultItems resultItems, Task task) {
        // Directory that would be used for file output; kept from the original FilePipeline logic
        String path = this.path + PATH_SEPERATOR + task.getUUID() + PATH_SEPERATOR;

        toMysql(resultItems);
    }
}
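For completeness, a rough sketch of how such a pipeline might be wired up. The PageProcessor below is hypothetical; it only illustrates that the "link", "title" and "page" keys read by toMysql() are the ones put into ResultItems via page.putField(), and how the custom pipeline is registered on the Spider:

public class LinkPageProcessor implements PageProcessor {
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Hypothetical XPath expressions; adjust them to the structure of the target page
        page.putField("link", page.getHtml().xpath("//a/@href").all());
        page.putField("title", page.getHtml().xpath("//a/text()").all());
        page.putField("page", page.getUrl().get());
    }

    @Override
    public Site getSite() {
        return site;
    }
}

// e.g. in a main method or a Spring CommandLineRunner: register the custom pipeline when starting the crawler
Spider.create(new LinkPageProcessor())
        .addUrl("http://localhost:8808/")
        .addPipeline(new TestJsonFilePipeline())
        .run();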

4.2. How to use the @Autowired annotation in ordinary (non-controller) classes

Reference: using the Spring @Autowired annotation in a static utils tool class or an ordinary non-controller class.
Using method 1:

@Component 
public class TestUtils {
    @Autowired
    private ItemService itemService;
    
    @Autowired
    private ItemMapper itemMapper;
    
    public static TestUtils testUtils;
    
    @PostConstruct
    public void init() {    
        testUtils = this;
    } 
    
    // Example of using service and mapper interfaces inside a static method of the utils class: just call "testUtils.xxx.method()"
    public static void test(Item record){
        testUtils.itemMapper.insert(record);
        testUtils.itemService.queryAll();
    }
}

Annotate the init method with the annotation below; the init() method name can be anything you like. Note: nothing else needs to go in the init() method besides assigning the static reference. Just follow this example; there is no need to dig through the noise elsewhere online.
@PostConstruct

Topics: Java crawler