WebMagic introduction
The WebMagic project code is divided into two parts: core and extension. The core part is a simplified, modular crawler implementation, while the extension part adds some convenient, practical features. WebMagic's architecture is modeled on Scrapy, and the goal is to be as modular as possible and to reflect the functional characteristics of a crawler.
The core part provides a very simple and flexible API for writing a crawler without changing the usual development model.
The WebMagic extension provides some convenient features, such as writing crawlers in annotation mode, and also ships with some common components to make crawler development easier.
Chinese documentation of WebMagic
Notes from using WebMagic
1. Only http can be crawled, not https
Reference blog (scheme 1: modify the source code, scheme 2: nginx reverse proxy)
For scheme 2, configure nginx as follows:
```nginx
server {
    listen 8808;
    # server_name partner.domain.com;
    # server_name can be left unset; just change the port number and access the proxy via localhost:<port>.
    # If you do want to set server_name, it also needs to be configured in the hosts file.
    location / {
        proxy_set_header Host github.com;
        proxy_pass https://github.com/;
    }
}
```
Now https://github.com/ can be accessed through http://localhost:8808/.
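Before pointing the crawler at the proxy, a quick check like the sketch below (assuming the nginx server above is running locally on port 8808) confirms that the https page can now be fetched over plain http:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProxyCheck {
    public static void main(String[] args) throws Exception {
        // Plain http request to the local nginx proxy; nginx forwards it to https://github.com/.
        HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:8808/").openConnection();
        System.out.println("Status: " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            System.out.println("First line: " + in.readLine());
        }
    }
}
```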
2. Set cookies to simulate login
Since the pages being crawled require a login, cookies can be used to simulate a logged-in session.
2.1. How to set cookies in code
See section 4.4 "Configuration, startup and termination of the crawler" in the Chinese documentation of WebMagic.
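A minimal sketch of what this can look like, assuming the cookie key is session (see 2.2 below); the cookie value, user agent and start URL are placeholders. The cookie is set on the Site object so that every request carries it:

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class CookieLoginProcessor implements PageProcessor {

    // "session" is the cookie key found to be required for login (see 2.2);
    // replace the value with the one copied from your browser.
    private final Site site = Site.me()
            .addCookie("session", "your-session-value")
            .setUserAgent("Mozilla/5.0")
            .setRetryTimes(3)
            .setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Extract whatever the logged-in page exposes, e.g. its title.
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new CookieLoginProcessor())
              .addUrl("http://localhost:8808/")   // example start URL
              .run();
    }
}
```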
2.2. How to find which cookie is required for login
Open the browser settings and look at "Cookies and stored data". Remove the stored values one by one; the value whose removal forces the page to ask for login again is the one required for login. (After troubleshooting, the key turned out to be session.)
3. Regular expression
3.1. Brief introduction
A regular expression (often abbreviated as regex, regexp or RE in code) is a concept from computer science. Regular expressions are typically used to find and replace text that matches a certain pattern (rule).
3.2. Patterns used in this project
| Pattern | Meaning |
| --- | --- |
| \w | Matches a letter, digit, or underscore. Equivalent to [A-Za-z0-9_]. |
| [\w-] | Matches a letter, digit, underscore, or hyphen. Equivalent to [A-Za-z0-9_] plus the minus sign. |
| + | Matches the preceding subexpression one or more times (at least once). For example, "zo+" matches "zo" and "zoo", but not "z". Equivalent to {1,}. |
| . | Matches any single character except '\n' and '\r'. To match any character including '\n' and '\r', use a pattern like [\s\S]. |
| * | Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo" and "zoo". Equivalent to {0,}. |
| .* | Matches any single character any number of times, i.e. any string. For example, <a href="/news/"><li class="on">all</li></a> is matched by <a.*</a>. |
Difference between [\w-] and [\w]
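As a quick sanity check of the patterns above, the following small sketch uses java.util.regex (the sample strings are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // [\w-]+ : one or more letters, digits, underscores or hyphens
        System.out.println("abc_123-xyz".matches("[\\w-]+"));   // true
        System.out.println("abc 123".matches("[\\w-]+"));       // false (space not allowed)

        // <a.*</a> : matches any <a ...>...</a> fragment on a single line
        Matcher m = Pattern.compile("<a.*</a>")
                .matcher("<a href=\"/news/\"><li class=\"on\">all</li></a>");
        System.out.println(m.find());                            // true
    }
}
```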
4. Store the crawled data in a database
The official documentation section "4.3 Using Pipeline to save results" only covers printing results to the console and two schemes for generating local JSON files. At the end of that section it says Chapter 7 will "implement a series of functions such as saving results to files and databases", but that Chapter 7 turned out to be a disappointment.
4.1. Code
WebMagic crawls website data and saves it to a local file
Write your own class that implements the Pipeline interface, then put the logic for inserting the crawled data into the database in that class.
```java
import java.util.List;

import javax.annotation.PostConstruct;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.alibaba.fastjson.JSON;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.utils.FilePersistentBase;

// Link and LinkMapper are project-specific entity and mapper classes.
// @Component makes this a Spring bean so @Autowired and @PostConstruct take effect.
@Component
public class TestJsonFilePipeline extends FilePersistentBase implements Pipeline {

    private Logger logger = LoggerFactory.getLogger(this.getClass());

    public static TestJsonFilePipeline testJsonFilePipeline;

    @PostConstruct
    public void init() {
        testJsonFilePipeline = this;
    }

    @Autowired
    private LinkMapper linkMapper;

    public TestJsonFilePipeline() {
        this.setPath("/data/webmagic");
    }

    public TestJsonFilePipeline(String path) {
        this.setPath(path);
    }

    /**
     * Key part: save the extracted data to MySQL.
     */
    private void toMysql(ResultItems resultItems) {
        final List<String> listLink = JSON.parseArray(JSON.toJSONString(resultItems.get("link")), String.class);
        final List<String> listTitle = JSON.parseArray(JSON.toJSONString(resultItems.get("title")), String.class);
        for (int j = 0; j < listLink.size(); j++) {
            Link link = new Link();
            link.setLink(listLink.get(j));
            link.setTitle(listTitle.get(j));
            link.setPage(resultItems.get("page").toString());
            // Go through the static instance so the injected mapper is available
            // even if WebMagic instantiated this pipeline with "new".
            testJsonFilePipeline.linkMapper.insert(link);
        }
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String path = this.path + PATH_SEPERATOR + task.getUUID() + PATH_SEPERATOR;
        toMysql(resultItems);
    }
}
```
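To actually use this pipeline, register it on the Spider when starting the crawl. A sketch, where MyPageProcessor is a placeholder for whatever PageProcessor puts the link, title and page fields into the ResultItems:

```java
import us.codecraft.webmagic.Spider;

public class CrawlerLauncher {
    public static void main(String[] args) {
        Spider.create(new MyPageProcessor())            // placeholder PageProcessor
              .addUrl("http://localhost:8808/")         // example start URL
              .addPipeline(new TestJsonFilePipeline())  // results now go to MySQL via LinkMapper
              .thread(2)
              .run();
    }
}
```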
4.2. How to use the @Autowired annotation in ordinary classes
Using the Spring @Autowired annotation in static utility classes and ordinary non-controller classes
For method 1:
```java
@Component
public class TestUtils {

    @Autowired
    private ItemService itemService;

    @Autowired
    private ItemMapper itemMapper;

    public static TestUtils testUtils;

    @PostConstruct
    public void init() {
        testUtils = this;
    }

    // Example of using the service and mapper interfaces from a static utility method:
    // just call "testUtils.xxx.method(...)".
    public static void test(Item record) {
        testUtils.itemMapper.insert(record);
        testUtils.itemService.queryAll();
    }
}
```
Annotate the init() method with the annotation below. The init() method can be named anything you like, and apart from assigning the static instance it does not need to contain anything else; this pattern works as-is.
@PostConstruct