Golang Crawlers: Using the colly Library

Posted by Mikell on Mon, 07 Oct 2019 22:35:11 +0200

1. Introduction to colly

Colly is a Go framework for building web scrapers. With Colly you can build anything from a simple scraper to a complex asynchronous web crawler that handles millions of pages. Colly provides an API for making network requests and for processing the content received (for example, by interacting with the DOM tree of an HTML document).

2. Installation and Import

Install:
go get -u github.com/gocolly/colly/...
Import:
import "github.com/gocolly/colly"

3. Basic Usage

Colly's main entity is the Collector object. The Collector manages network communication and is responsible for executing the attached callbacks while a collector job is running. To use colly, you must first initialize a Collector:

Basic example:

// Instantiate the default collector
c := colly.NewCollector(
	// Only visit these domains: hackerspaces.org, wiki.hackerspaces.org
	colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
)

// Set the User-Agent so that requests look like they come from a browser
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"

// Callback for every a element that has an href attribute
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	link := e.Attr("href")
	fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	// Visit the link found on the page
	// Only links within AllowedDomains are actually visited
	c.Visit(e.Request.AbsoluteURL(link))
})

// Called before each request is sent; prints the URL being requested
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL.String())
})

// Start crawling at https://hackerspaces.org
c.Visit("https://hackerspaces.org/")
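
The basic example relies on two ideas: an href found on a page is resolved against that page's URL, and only URLs on allowed hosts are followed. The sketch below illustrates both with the standard library only. It is not colly's implementation: resolveLink and isAllowed are hypothetical helper names, and exact hostname matching is an assumption made to keep the example small.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink (hypothetical helper) turns an href found on a page into an
// absolute URL, resolved relative to the URL of the page it was found on.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// isAllowed (hypothetical helper) reports whether the hostname of rawURL
// exactly matches one of the allowed domains. Exact matching is an
// assumption of this sketch.
func isAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, domain := range allowed {
		if u.Hostname() == domain {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"hackerspaces.org", "wiki.hackerspaces.org"}
	hrefs := []string{"/about", "https://wiki.hackerspaces.org/List", "https://example.com/"}
	for _, href := range hrefs {
		abs, err := resolveLink("https://hackerspaces.org/", href)
		if err != nil {
			continue // skip hrefs that cannot be parsed
		}
		fmt.Printf("%s allowed=%v\n", abs, isAllowed(abs, allowed))
	}
}
```

In the colly example, e.Request.AbsoluteURL and colly.AllowedDomains play these two roles, so the callback can pass every link to c.Visit without filtering by hand.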

Callback functions, in the order they are called:
OnRequest: Called before a request is sent
OnError: Called if an error occurs during the request
OnResponse: Called after a response is received
OnHTML: Called after OnResponse if the received content is HTML
OnXML: Called after OnHTML if the received content is HTML or XML
OnScraped: Called after the OnXML callbacks

Simulated login example:
c := colly.NewCollector()
// User authentication
err := c.Post("http://example.com/login", map[string]string{"username": "admin", "password": "admin"})
if err != nil {
	log.Fatal(err)
}

// Attach callbacks after login
c.OnResponse(func(r *colly.Response) {
	log.Println("response received", r.StatusCode)
})
c.Visit("https://example.com/")
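
A login like this only helps if the session cookie set by the login response is sent again on later requests. The sketch below shows that mechanism with net/http and a cookie jar against a local test server. It is an illustration under stated assumptions, not colly code: the server, the cookie name session, and the helpers newLoginServer and loginAndFetch are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
)

// newLoginServer starts a hypothetical test site: POST /login sets a session
// cookie for admin/admin, and / only greets clients that send the cookie back.
func newLoginServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		if r.FormValue("username") == "admin" && r.FormValue("password") == "admin" {
			http.SetCookie(w, &http.Cookie{Name: "session", Value: "ok", Path: "/"})
			return
		}
		http.Error(w, "bad credentials", http.StatusUnauthorized)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err != nil || c.Value != "ok" {
			http.Error(w, "login required", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "welcome")
	})
	return httptest.NewServer(mux)
}

// loginAndFetch posts the login form, then requests the front page with the
// same client, so the cookie stored in the jar is sent automatically.
func loginAndFetch(base string) (int, string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return 0, "", err
	}
	client := &http.Client{Jar: jar}

	form := url.Values{"username": {"admin"}, "password": {"admin"}}
	resp, err := client.PostForm(base+"/login", form)
	if err != nil {
		return 0, "", err
	}
	resp.Body.Close()

	resp, err = client.Get(base + "/")
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	srv := newLoginServer()
	defer srv.Close()

	status, body, err := loginAndFetch(srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(status, body)
}
```

The colly example follows the same pattern: c.Post and the later c.Visit go through the same Collector, which is why the login has to happen before the pages that need it are visited.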

Queue usage example:
// Requires the queue package: import "github.com/gocolly/colly/queue"
url := "https://httpbin.org/delay/1"
c := colly.NewCollector(colly.AllowURLRevisit())

// Create a request queue with two consumer threads
q, _ := queue.New(
	2, // Number of consumer threads
	&queue.InMemoryQueueStorage{MaxSize: 10000}, // In-memory storage holding at most 10000 queued requests
)

c.OnRequest(func(r *colly.Request) {
	fmt.Println("visiting", r.URL)
	if r.ID < 15 {
		r2, err := r.New("GET", fmt.Sprintf("%s?x=%v", url, r.ID), nil)
		if err == nil {
			q.AddRequest(r2)
		}
	}
})

for i := 0; i < 5; i++ {
	// Add the URL to the queue
	q.AddURL(fmt.Sprintf("%s?n=%d", url, i))
}
// Consume the queued requests with the collector
q.Run(c)
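
The idea of a fixed number of consumer threads draining a queue can be sketched with goroutines and a channel. This is a simplified standard-library illustration, not how colly's queue package is implemented: runQueue is a hypothetical helper, and unlike the colly example it works on a fixed list of URLs and does not accept new items while running.

```go
package main

import (
	"fmt"
	"sync"
)

// runQueue (hypothetical helper) lets a fixed number of consumer goroutines
// drain urls, calling visit for each one, and returns how many were visited.
func runQueue(consumers int, urls []string, visit func(string)) int {
	// Fill a buffered channel up front and close it so consumers stop
	// once the queue is empty.
	jobs := make(chan string, len(urls))
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)

	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		visited int
	)
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				visit(u)
				mu.Lock()
				visited++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return visited
}

func main() {
	base := "https://httpbin.org/delay/1"
	var urls []string
	for i := 0; i < 5; i++ {
		urls = append(urls, fmt.Sprintf("%s?n=%d", base, i))
	}
	// The visit function only prints here; no network request is made.
	n := runQueue(2, urls, func(u string) { fmt.Println("visiting", u) })
	fmt.Println("visited", n, "URLs")
}
```

With two consumers, at most two URLs are handled at the same time, which is the role of the first argument to queue.New in the colly example.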

For more examples, see the official examples: https://github.com/gocolly/colly/tree/master/_examples
