SWSJ
What is SWSJ
SWSJ is a Java crawler framework based on jsoup and regular expressions. It separates your crawler logic from your code, reducing coupling.
Your crawler is configured not through code but through a configuration file, which means that when some of your requirements change, you can modify the configuration file directly without touching your code.
What can it do
It can make your configuration hell even more hellish (just kidding).
It lets you build a working crawler out of a few simple configuration entries.
About the author and the imperfections (the author has a strong survival instinct)
This is just the immature work of a 14-year-old, and it took only three days to finish, so it may have many imperfections. I am constantly improving it; at present it is barely usable, so I am releasing it as is: SWSJ crawler framework: a framework for implementing crawlers through configuration files (gitee.com) (open source)
Optimization goals: better exception handling, better parameter passing, better return value retrieval, and better handling of extra methods.
See the link for the jar package
Detailed usage
First, you need a configuration file; some of its elements are optional.
A concrete example follows (see below for the analysis).
1. Import the jar package (obviously). The framework is not yet polished, and it has not been uploaded to Maven.
2. Define an interface
import com.midream.sheep.swsj.Annotation.WebSpider;

public interface test {
    @WebSpider("getHtml") // the id; the return type and parameters must match the configuration file
    String[] getData(int count);

    @WebSpider("getli") // multiple methods are supported; this one takes no parameter
    String[] getLi();
}
3. Configuration file
<?xml version="1.0" encoding="UTF-8" ?>
<SWCL>
    <config>
        <constructionSpace isAbsolute="false" workSpace="E:\Temporary documents"/>
        <timeout value="10000"/>
        <createTactics isCache="true"/>
    </config>
    <swc id="getHtml">
        <cookies>
        </cookies>
        <parentInterface class="com.midream.sheep.test"/>
        <url name="getHtml" inPutType="int" inPutName="count">
            <type type="GET"/>
            <url path="https://pic.netbian.com/index_#{count}.html"/>
            <parseProgram isHtml="true">
                <jsoup>
                    <pa>
                        #main>div.slist>ul.clearfix>li>a
                    </pa>
                </jsoup>
            </parseProgram>
            <returnType type="String[]"/>
        </url>
        <url name="getli" inPutType="" inPutName="">
            <type type="GET"/>
            <url path="https://pic.netbian.com/index_5.html"/>
            <parseProgram isHtml="true">
                <jsoup>
                    <pa>
                        #main>div.slist>ul.clearfix>li
                    </pa>
                </jsoup>
            </parseProgram>
            <returnType type="String[]"/>
        </url>
    </swc>
</SWCL>
This defines two methods: one takes a parameter, the other does not. The passed parameter changes the value substituted into the URL.
4. Calling the crawler
XmlFactory xf = null;
try {
    xf = new XmlFactory(XmlFactory.class.getClassLoader().getResource("").getPath() + "com/midream/sheep/test.xml");
    test getHtml = (test) xf.getWebSpider("getHtml");
    String[] li = getHtml.getLi();
    for (String s : li) {
        System.out.println(s);
    }
} catch (Exception e) {
    e.printStackTrace();
}
Call XmlFactory to load the configuration, then obtain the crawler class through the factory (note: an explicit cast to the interface is required).
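The parameterized method works the same way: the argument replaces the #{count} placeholder from the configuration. A minimal sketch reusing the factory above (the page index 3 is an arbitrary example value):

test spider = (test) xf.getWebSpider("getHtml"); // the same explicit cast as above
String[] images = spider.getData(3);             // requests https://pic.netbian.com/index_3.html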
5. The full XML, explained
<?xml version="1.0" encoding="UTF-8" ?>
<SWCL>
    <!-- Global configuration -->
    <config>
        <!-- Workspace where the generated bytecode is stored
             isAbsolute -> whether the path is absolute
             workSpace  -> the folder path -->
        <constructionSpace isAbsolute="false" workSpace="E:\Temporary documents"/>
        <!-- Timeout: if a request takes longer than this, an exception is thrown
             value -> the timeout in milliseconds -->
        <timeout value="10000"/>
        <!-- userAgent data
             value -> a concrete userAgent string -->
        <userAgent>
            <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.62</value>
            <value>User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)</value>
            <value>User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)</value>
        </userAgent>
        <!-- Crawler creation strategy
             cache: cached, the generated class is converted to bytecode directly
             load:  not cached, the generated class is written out locally as a class file and loaded -->
        <createTactics isCache="false"/>
    </config>
    <!-- A specific crawler class
         id -> the identifier used when fetching the crawler -->
    <swc id="getHtml">
        <!-- Cookies used by this crawler, as text in key:value;... format -->
        <cookies>
            uuid_tt_dd=4646545646-1642571061362-956268; UserName=xmdymcsheepsir;
        </cookies>
        <!-- The parent interface; the crawler is called through this interface -->
        <parentInterface class="com.midream.sheep.TestWeb"/>
        <!-- Request configuration: one configuration corresponds to one method
             inPutType -> the type of the incoming value (may be empty)
             inPutName -> the identifier used for the incoming value (may be empty) -->
        <url name="getHtml" inPutType="" inPutName="">
            <!-- Request type: currently only POST and GET are supported
                 type="POST||GET" -->
            <type type="GET"/>
            <!-- The url link -->
            <url path="https://pic.netbian.com/index_#{count}.html"/>
            <!-- HTML parsing scheme; the two cannot be used at the same time
                 <regular> regular expression; the special value ALL returns all the text
                 <jsoup>   jsoup configuration -->
            <parseProgram isHtml="false">
                <!-- <regular reg="href="/> -->
                <!-- jsoup can parse in multiple layers: each <pa> is one parsing pass -->
                <jsoup>
                    <!-- pa: configurable selectors for the target document
                         #id        select by id
                         htmlName   select by tag name
                         name.class select by class -->
                    <pa>
                        #main>div.slist>ul>li>a
                    </pa>
                </jsoup>
            </parseProgram>
            <!-- Return type: basic data types are used directly;
                 reference types must use the full class name, e.g. java.lang.String -->
            <returnType type="String[]"/>
        </url>
    </swc>
</SWCL>
Explanation
SWCL
This is the root tag; it is the marker the program looks for, and it is required.
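A minimal skeleton of the file structure, following the examples above (the id value mySpider is a placeholder):

<?xml version="1.0" encoding="UTF-8" ?>
<SWCL>
    <config>
        <!-- global configuration, optional -->
    </config>
    <swc id="mySpider">
        <!-- one crawler class -->
    </swc>
</SWCL>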
config
The global configuration. It can be omitted entirely, but configuring the workspace is recommended.
constructionSpace
The workspace: the folder where generated bytecode is written.
isAbsolute -- whether the path is absolute; a relative path is resolved against the current project
workSpace -- the specific folder path
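Both forms side by side, with hypothetical folder paths:

<!-- a path resolved against the current project -->
<constructionSpace isAbsolute="false" workSpace="build/swsj"/>
<!-- an absolute path -->
<constructionSpace isAbsolute="true" workSpace="E:\swsj\classes"/>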
timeout
The request timeout, 1000 ms by default; it can be customized or omitted.
userAgent
Simulates a browser to bypass checks. Each value is one specific userAgent string. One is configured by default, so this can be omitted.
createTactics
The creation strategy: isCache controls whether the generated class is cached (converted to bytecode directly in memory) or written out locally and loaded.
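The two strategies side by side, with the behavior described in the annotated XML above:

<!-- cached: the generated class is turned into bytecode directly in memory -->
<createTactics isCache="true"/>
<!-- not cached: the generated class is written to the workspace as a class file, then loaded -->
<createTactics isCache="false"/>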
swc
A specific crawler class.
id -- the identifier used to fetch the generated crawler class
cookies
Cookies to carry while crawling.
parentInterface
The parent interface; methods are called through this interface.
class -- the fully qualified name of the interface
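The class value must point at the interface defined in step 2. For the example interface above (which lives in the package com.midream.sheep):

<parentInterface class="com.midream.sheep.test"/>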
url
A specific crawler method.
name -- the method identifier; it must match the value of the method's @WebSpider annotation
inPutType, inPutName -- the type of the incoming value and the name used to reference it in the URL (as #{inPutName}); both must be consistent with the interface
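For instance, the mapping used in the step-3 example: the annotation value matches the url name, and the int parameter count is referenced in the path as #{count}.

The interface side:

@WebSpider("getHtml")
String[] getData(int count);

The configuration side:

<url name="getHtml" inPutType="int" inPutName="count">
    <url path="https://pic.netbian.com/index_#{count}.html"/>
    <!-- type, parseProgram and returnType as in step 3 -->
</url>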
type
type -- the request type; GET and POST (not yet fully polished) are available.
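For example, a hypothetical POST request (all the examples above use GET):

<type type="POST"/>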
url
path -- the specific hyperlink to request
parseProgram
There are two parsing strategies: jsoup and regular expressions (not recommended).
See the XML comments above for how to choose a strategy.
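A sketch of the regular-expression strategy, pieced together from the commented-out line and the notes in the annotated XML above (treat the exact syntax as an assumption):

<parseProgram isHtml="false">
    <!-- ALL is the special value that returns all the text -->
    <regular reg="ALL"/>
</parseProgram>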
returnType
The return value type. Only String and String[] are supported for now.
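Both supported forms, following the comment in the annotated XML (the full-class-name spelling is an assumption based on that comment):

<returnType type="String[]"/>
<!-- reference types use the full class name: -->
<returnType type="java.lang.String"/>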
That is the end of the article. You are welcome to share your opinions; see the link for the current optimization plans.