SWCJ crawler framework

Posted by sheen.andola on Thu, 27 Jan 2022 01:48:38 +0100

SWCJ

What is SWCJ

SWCJ is a Java crawler framework based on jsoup and regular expressions. It separates your crawler definition from your code, reducing coupling.

Your crawler is configured not through code but through a configuration file, which means that when your requirements change you can simply edit the configuration file instead of modifying your code.

What can it do

It can make your configuration hell even more hellish (just kidding).

It lets you produce a working crawler from a few simple configuration entries.

About the author and the imperfections (the author has a strong desire to survive)

This is just the immature work of a 14-year-old; it took me only three days to write, so it may have many imperfections. I am constantly improving it. At present it is barely usable, so I am publishing it first: SWCJ crawler framework: a framework for implementing crawlers through configuration files (gitee.com) (open source)

Optimization goals: better exception handling, better parameter passing, better return-value retrieval, and better handling of extra methods

See the link for the jar package

Detailed usage

First, you need a configuration file; some of its elements are optional.

A complete example (analyzed below)

1. Import the jar package (obviously). The framework is still rough at the moment and has not been uploaded to Maven yet.

2. Define an interface

package com.midream.sheep; // must match <parentInterface class="com.midream.sheep.test"/> in the configuration file

import com.midream.sheep.swsj.Annotation.WebSpider;

public interface test {
    @WebSpider("getHtml") // the annotation id, return value and parameters must match the <url> in the configuration file
    String[] getData(int count);

    @WebSpider("getli") // multiple methods are supported; this one takes no parameter
    String[] getLi();
}

3. Configuration file

<?xml version="1.0" encoding="UTF-8" ?>
<SWCL>
    <config>
        <constructionSpace isAbsolute="false" workSpace="E:\Temporary documents"/>
        <timeout value="10000"/>
        <createTactics isCache="true"/>
    </config>
    <swc id="getHtml">
        <cookies>
        </cookies>
        <parentInterface class="com.midream.sheep.test"/>
        <url name="getHtml" inPutType="int" inPutName="count">
            <type type="GET"/>
            <url path="https://pic.netbian.com/index_#{count}.html"/>
            <parseProgram  isHtml="true">
                <jsoup>
                    <pa>
                        #main>div.slist>ul.clearfix>li>a
                    </pa>
                </jsoup>
            </parseProgram>
            <returnType type="String[]"/>
        </url>
        <url name="getli" inPutType="" inPutName="">
            <type type="GET"/>
            <url path="https://pic.netbian.com/index_5.html"/>
            <parseProgram  isHtml="true">
                <jsoup>
                    <pa>
                        #main>div.slist>ul.clearfix>li
                    </pa>
                </jsoup>
            </parseProgram>
            <returnType type="String[]"/>
        </url>
    </swc>
</SWCL>

One method passes a parameter and the other does not; the passed parameter changes the value substituted into the URL (the #{count} placeholder above).

4. Calling the crawler

XmlFactory xf = null;
try {
    xf = new XmlFactory(XmlFactory.class.getClassLoader().getResource("").getPath() + "com/midream/sheep/test.xml");
    test getHtml = (test) xf.getWebSpider("getHtml");
    String[] li = getHtml.getLi();
    for (String s : li) {
        System.out.println(s);
    }
} catch (Exception e) { // the factory's exact exception types depend on the framework
    e.printStackTrace();
}

Construct an XmlFactory from the configuration file, then obtain the crawler class from the factory (note: a cast to your interface is required).
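The snippet above only calls the no-argument method. Here is a fuller sketch that also exercises the parameterized method; note that the XmlFactory import path is an assumption (check the jar for the actual package):

package com.midream.sheep; // same package as the test interface, so no extra import is needed

import com.midream.sheep.swsj.XmlFactory; // assumed package path; verify against the jar

public class SpiderDemo {
    public static void main(String[] args) {
        try {
            // Build the factory from the configuration file on the classpath
            XmlFactory xf = new XmlFactory(
                    XmlFactory.class.getClassLoader().getResource("").getPath()
                            + "com/midream/sheep/test.xml");
            // The factory returns the generated crawler; cast it to the interface
            test spider = (test) xf.getWebSpider("getHtml");
            // #{count} in the configured URL is replaced by the argument,
            // so this fetches https://pic.netbian.com/index_2.html
            for (String s : spider.getData(2)) {
                System.out.println(s);
            }
        } catch (Exception e) { // exact exception types are not documented
            e.printStackTrace();
        }
    }
}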

5. The configuration file in detail

Fully annotated xml:

<?xml version="1.0" encoding="UTF-8" ?>
<SWCL>
    <!--Global configuration-->
    <config>
        <!--The generated bytecode is stored in the workspace
        isAbsolute -> whether the path is a relative path
        workSpace -> the folder path
        -->
        <constructionSpace isAbsolute="false" workSpace="E:\Temporary documents"/>
        <!--Timeout; a request that exceeds this time throws an exception
        value -> the timeout in milliseconds
        -->
        <timeout value="10000"/>
        <!--userAgent data
        value -> a concrete userAgent string
        -->
        <userAgent>
            <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.62</value>
            <value>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)</value>
            <value>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)</value>
        </userAgent>
        <!--Class-creation strategy
        isCache="true": cached; the generated class is loaded directly from in-memory bytecode
        isCache="false": not cached; the generated class is written to a local .class file and loaded from there
        -->
        <createTactics isCache="false"/>
    </config>
    <!--A specific crawler class
        id -> the identifier used to fetch this crawler from the factory
        (the inPutType/inPutName attributes on <url> below may be left empty)-->
    <swc id="getHtml">
        <!--cookies text used by this crawler
        Format: key:value;key:value;...
        -->
        <cookies>
            uuid_tt_dd=4646545646-1642571061362-956268; UserName=xmdymcsheepsir;
        </cookies>
        <!--Parent interface; the crawler is called through this interface-->
        <parentInterface class="com.midream.sheep.TestWeb"/>
        <!--Request configuration
        One configuration corresponds to one method
        -->
        <url name="getHtml" inPutType="" inPutName="">
            <!--Request type
            Currently only POST and GET requests are supported
            type="POST|GET"
            -->
            <type type="GET"/>
            <!--url link-->
            <url path="https://pic.netbian.com/index_#{count}.html"/>
            <!--HTML parsing scheme
            <regular> and <jsoup> cannot be used at the same time
            <regular> -> regular expression; the special value ALL returns the whole text
            <jsoup> -> jsoup configuration-->
            <parseProgram isHtml="true">
<!--                <regular reg="href="/>-->
                <!--jsoup parsing can be split into multiple layers,
                i.e. each <pa> is one parsing pass
                -->
                <jsoup>
                    <!--pa -> a selector applied to the target Document
                    #id selects by id
                    tagName selects by tag name
                    name.class selects by class
                    -->
                    <pa>
                        #main>div.slist>ul>li>a
                    </pa>
                </jsoup>
            </parseProgram>
            <!--Return type
            Primitive types are written directly;
            reference types must use the fully qualified class name, e.g. java.lang.String
            -->
            <returnType type="String[]"/>
        </url>
    </swc>
</SWCL>

Tag-by-tag explanation

SWCL

The root tag of the configuration file and the entry point the program recognizes; it is required.

config

This is the global configuration. It can be omitted, but configuring the workspace is recommended.

constructionSpace

The workspace: the folder where the generated bytecode is written.

isAbsolute -- whether the path is a relative path; a relative path is resolved against the current project

workSpace -- the specific folder path

timeout

The request timeout, 1000 ms by default. It can be customized, or the element omitted.

userAgent

Simulates a browser to bypass checks. Each value is one specific userAgent string; one is configured by default, so the element can be omitted.

createTactics

The creation strategy: isCache -- whether the generated class is cached (loaded directly from in-memory bytecode) or written to a local class file and loaded from disk.

swc

A specific crawler class.

id -- the identifier used to fetch the crawler class from the factory

cookies

Cookies to carry while crawling, in the key:value;... format shown above.
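As a rough illustration of that format (a hand-written jsoup sketch, not the framework's actual code), the cookie text can be pictured as being split into pairs and attached to the request:

import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.util.HashMap;
import java.util.Map;

public class CookieSketch {
    // Illustrative only: split "key:value;key:value;..." and attach the pairs to a jsoup request
    static Connection withCookies(String url, String cookieText) {
        Map<String, String> cookies = new HashMap<>();
        for (String pair : cookieText.split(";")) {
            String[] kv = pair.trim().split(":", 2);
            if (kv.length == 2) {
                cookies.put(kv[0].trim(), kv[1].trim());
            }
        }
        return Jsoup.connect(url).cookies(cookies);
    }
}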

parentInterface

The parent interface; the crawler's methods are called through this interface.

class -- the fully qualified name of the interface

url

A specific crawler method.

name -- the identifier bound to the method's @WebSpider annotation

inPutType, inPutName -- the type and name of the incoming value; the value is referenced in the URL as #{inPutName} and must be consistent with the interface's method signature
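The #{inPutName} substitution can be pictured as a simple string replacement (an illustrative sketch; the framework's generated code may differ):

public class PlaceholderSketch {
    // Illustrative only: bind a method argument to the #{name} placeholder in the configured path
    static String bind(String path, String inPutName, Object value) {
        return path.replace("#{" + inPutName + "}", String.valueOf(value));
    }

    public static void main(String[] args) {
        // prints https://pic.netbian.com/index_3.html
        System.out.println(bind("https://pic.netbian.com/index_#{count}.html", "count", 3));
    }
}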

type

The request type: GET and POST are available (POST support is not complete yet).
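In hand-written jsoup terms, the two request types correspond roughly to the following (a sketch under the assumption that the framework delegates to jsoup's Connection):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RequestTypeSketch {
    public static void main(String[] args) throws Exception {
        String url = "https://pic.netbian.com/index_5.html";
        Document viaGet = Jsoup.connect(url).get();   // <type type="GET"/>
        Document viaPost = Jsoup.connect(url).post(); // <type type="POST"/> (servers may reject POST here)
        System.out.println(viaGet.title());
        System.out.println(viaPost.title());
    }
}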

url

path -- the specific hyperlink to request

parseProgram

There are two parsing strategies: jsoup and regular expressions (the latter is not recommended).

See the comments in the xml above for how to choose between them.
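For reference, a hand-written jsoup equivalent of the getHtml configuration might look like this (a sketch; it assumes the framework returns each selected element as its outer HTML, which the docs do not confirm):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupEquivalent {
    // Rough hand-written equivalent of <swc id="getHtml"> above
    public static String[] getData(int count) throws Exception {
        Document doc = Jsoup.connect("https://pic.netbian.com/index_" + count + ".html")
                .timeout(10000) // <timeout value="10000"/>
                .get();         // <type type="GET"/>
        Elements els = doc.select("#main>div.slist>ul.clearfix>li>a"); // the <pa> selector
        return els.stream().map(e -> e.outerHtml()).toArray(String[]::new);
    }
}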

returnType

The return value type. Only String and String[] are supported for now.

That is the end of the article; you are welcome to offer your opinions. See the link above for the current optimization ideas.

Topics: Java