Windbell crawler tutorial: quickly create a Windbell crawler

Posted by ExpertAlmost on Thu, 19 Dec 2019 15:27:50 +0100

There are two main ways to construct a Windbell crawler:

Construction from a crawler rule object

        // Create an extraction rule.
        // The rule uses the XPATH extractor; the XPath expression is //h1[@class='topic-_XJ6ViSR']/text(),
        // and the extractor's order of application is 0.
        FieldExtractRule extractRule = new FieldExtractRule(Rule.XPATH, "//h1[@class='topic-_XJ6ViSR']/text()", "", 0);

        // Create an extraction item
        ContentItem contentItem = new ContentItem();
        contentItem
                .setFiledName("name") // extraction item code; must not be empty
                .setName("News headlines") // extraction item name; may be left unset
                .setRules(Arrays.asList(extractRule)); // set the extraction rules

        // Create the crawler builder
        CrawlerBuilder builder = CrawlerBuilder.create()
                .startUrl("https://news.ifeng.com/c/7sMchCLy5se") // the crawler's start URL
                // The crawler first extracts all URLs from the content of each requested page,
                // then puts the links that fully match this rule into the link pool as seed links for later requests.
                // If unset, every extracted link containing the domain keyword (here, ifeng) is put into the link pool.
                .addLinkRule("http[s]?://news\\.ifeng\\.com/.*") // link extraction rule; multiple rules may be added
                // Rule for content pages. Multiple content-page rules may be set, separated by half-width commas.
                // Content extraction is performed only on URLs that fully match this rule;
                // if unset, all links under the domain are treated as content pages.
                .extractUrl("https://news\\.ifeng\\.com/c/[A-Za-z0-9]+") // content-page rule
                .addExtractItem(contentItem) // add an extraction item; multiple items may be set, only one is used here for demonstration
                .interval(8); // average interval between crawls, in seconds; defaults to 10 if unset. This keeps the crawl frequency low enough to avoid being blocked by the server.
        // Build the crawler rule
        CrawlerRule rule = builder.build();
        // Create a simple crawler instance from the rule
        Crawler crawler = Crawler.create(rule);
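The link rule and content-page rule above are ordinary regular expressions matched against the full URL. A quick standalone check with `java.util.regex` (independent of the crawler library; the listing-page URL below is a made-up example) shows which URLs would be treated as seed links and which as content pages:

```java
import java.util.regex.Pattern;

public class RuleCheck {
    public static void main(String[] args) {
        // Content-page rule from the tutorial: extraction runs only on full matches
        String extractRule = "https://news\\.ifeng\\.com/c/[A-Za-z0-9]+";
        // Link rule: fully matching links are fed back into the link pool
        String linkRule = "http[s]?://news\\.ifeng\\.com/.*";

        // The start URL is both a valid seed link and a content page
        System.out.println(Pattern.matches(extractRule, "https://news.ifeng.com/c/7sMchCLy5se")); // true

        // A hypothetical listing page matches the link rule but not the content rule,
        // so it is crawled for further links without content extraction
        String listing = "https://news.ifeng.com/listpage/4550/1/list.shtml";
        System.out.println(Pattern.matches(extractRule, listing)); // false
        System.out.println(Pattern.matches(linkRule, listing));    // true
    }
}
```
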

Construction with the crawler builder

        // Create an extraction rule.
        // The rule uses the XPATH extractor; the XPath expression is //h1[@class='topic-_XJ6ViSR']/text(),
        // and the extractor's order of application is 0.
        FieldExtractRule extractRule = new FieldExtractRule(Rule.XPATH, "//h1[@class='topic-_XJ6ViSR']/text()", "", 0);

        // Create an extraction item
        ContentItem contentItem = new ContentItem();
        contentItem
                .setFiledName("name") // extraction item code; must not be empty
                .setName("News headlines") // extraction item name; may be left unset
                .setRules(Arrays.asList(extractRule)); // set the extraction rules

        // Create the crawler builder
        CrawlerBuilder builder = CrawlerBuilder.create()
                .startUrl("https://news.ifeng.com/c/7sMchCLy5se") // the crawler's start URL
                // The crawler first extracts all URLs from the content of each requested page,
                // then puts the links that fully match this rule into the link pool as seed links for later requests.
                // If unset, every extracted link containing the domain keyword (here, ifeng) is put into the link pool.
                .addLinkRule("http[s]?://news\\.ifeng\\.com/.*") // link extraction rule; multiple rules may be added
                // Rule for content pages. Multiple content-page rules may be set, separated by half-width commas.
                // Content extraction is performed only on URLs that fully match this rule;
                // if unset, all links under the domain are treated as content pages.
                .extractUrl("https://news\\.ifeng\\.com/c/[A-Za-z0-9]+") // content-page rule
                .addExtractItem(contentItem) // add an extraction item; multiple items may be set, only one is used here for demonstration
                .interval(8); // average interval between crawls, in seconds; defaults to 10 if unset. This keeps the crawl frequency low enough to avoid being blocked by the server.

        // Create a simple crawler instance directly from the builder
        Crawler crawler = builder.creatCrawler();

Whichever construction method is used, the minimum information required to generate a crawler includes the following two items:

  • The crawler's start URL
  • At least one content extraction item

After a crawler is started successfully, it is assigned a unique random name that distinguishes it from other crawlers. It can be retrieved as follows:

crawler.getName()

Once a crawler instance has been started, its attributes must not be modified; otherwise the crawler may behave abnormally.

Related resources:

Official documentation · Online API · Source code
