Hand-Write a Java Web Crawler in Half an Hour! (Complete Source Code Attached, Worth Bookmarking)


Hello, I'm glacier~~

Recently, I was working on a search-related project that needed to crawl links from the web and store them in an index library. There are plenty of powerful open-source crawlers, but to understand how they work I wrote a simple one myself as a learning exercise.

Just do it. Let's start!

First, a quick introduction to what each class does:

  • DownloadPage.java downloads the page source for a given hyperlink.

  • FunctionUtils.java provides static utility methods: regular-expression matching of page links, extracting the elements of a URL, deciding whether to create a file, normalizing a page's links into standard absolute URLs, and extracting the target content from the page source.

  • HrefOfPage.java extracts the hyperlinks from the page source.

  • UrlDataHanding.java ties the other classes together, from fetching a URL through data acquisition to data processing.

  • UrlQueue.java holds the queue of unvisited URLs.

  • VisitedUrlQueue.java holds the queue of visited URLs.

The following describes the source code of each class:

The DownloadPage.java class depends on the Apache HttpClient component.

package com.sreach.spider;

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

/**
 * @author binghe
 */
public class DownloadPage {

	/**
	 * Grab web content according to URL
	 * 
	 * @param url
	 * @return
	 */
	public static String getContentFormUrl(String url) {
		/* Instantiate an HttpClient client */
		HttpClient client = new DefaultHttpClient();
		HttpGet getHttp = new HttpGet(url);

		String content = null;

		HttpResponse response;
		try {
			/* Obtain information carrier */
			response = client.execute(getHttp);
			HttpEntity entity = response.getEntity();

			VisitedUrlQueue.addElem(url);

			if (entity != null) {
				/* Convert to text information */
				content = EntityUtils.toString(entity);

				/* Judge whether it meets the conditions for downloading the web page source code to the local */
				if (FunctionUtils.isCreateFile(url)
						&& FunctionUtils.isHasGoalContent(content) != -1) {
					FunctionUtils.createFile(
							FunctionUtils.getGoalContent(content), url);
				}
			}

		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			client.getConnectionManager().shutdown();
		}

		return content;
	}

}
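
Note that DefaultHttpClient has been deprecated since HttpClient 4.3. As a rough sketch of my own (the class name DownloadPageModern and method getContentFromUrl are not part of the original project), the same download with the newer CloseableHttpClient API might look like this:

package com.sreach.spider;

import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class DownloadPageModern {

	/** Fetch the page body for a URL, or return null if the request fails. */
	public static String getContentFromUrl(String url) {
		// HttpClients.createDefault() replaces the deprecated DefaultHttpClient;
		// try-with-resources closes the response and client automatically
		try (CloseableHttpClient client = HttpClients.createDefault();
				CloseableHttpResponse response = client.execute(new HttpGet(url))) {
			return response.getEntity() != null
					? EntityUtils.toString(response.getEntity(), "UTF-8")
					: null;
		} catch (IOException e) {
			e.printStackTrace();
			return null;
		}
	}
}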

All methods in the FunctionUtils.java class are static:

package com.sreach.spider;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * @author binghe
 */
public class FunctionUtils {

	/**
	 * Regular expressions that match hyperlinks
	 */
	private static String pat = "http://www\\.oschina\\.net/code/explore/.*/\\w+\\.[a-zA-Z]+";
	private static Pattern pattern = Pattern.compile(pat);

	private static BufferedWriter writer = null;

	/**
	 * Crawler search depth
	 */
	public static int depth = 0;

	/**
	 * Split the URL with "/" to get the elements of the hyperlink
	 * 
	 * @param url
	 * @return
	 */
	public static String[] divUrl(String url) {
		return url.split("/");
	}

	/**
	 * Determine whether to create a file
	 * 
	 * @param url
	 * @return
	 */
	public static boolean isCreateFile(String url) {
		Matcher matcher = pattern.matcher(url);

		return matcher.matches();
	}

	/**
	 * Create corresponding file
	 * 
	 * @param content
	 * @param urlPath
	 */
	public static void createFile(String content, String urlPath) {
		/* Split url */
		String[] elems = divUrl(urlPath);
		StringBuffer path = new StringBuffer();

		File file = null;
		for (int i = 1; i < elems.length; i++) {
			if (i != elems.length - 1) {

				path.append(elems[i]);
				path.append(File.separator);
				file = new File("D:" + File.separator + path.toString());

			}

			if (i == elems.length - 1) {
				Pattern pattern = Pattern.compile("\\w+\\.[a-zA-Z]+");
				Matcher matcher = pattern.matcher(elems[i]);
				if ((matcher.matches())) {
					if (!file.exists()) {
						file.mkdirs();
					}
					String[] fileName = elems[i].split("\\.");
					file = new File("D:" + File.separator + path.toString()
							+ File.separator + fileName[0] + ".txt");
					try {
						file.createNewFile();
						writer = new BufferedWriter(new OutputStreamWriter(
								new FileOutputStream(file)));
						writer.write(content);
						writer.flush();
						writer.close();
						System.out.println("File created successfully");
					} catch (IOException e) {
						e.printStackTrace();
					}

				}
			}

		}
	}

	/**
	 * Convert an internal or external page link into a complete, standard URL
	 * 
	 * @param href
	 * @return
	 */
	public static String getHrefOfInOut(String href) {
		/* Internal and external links are finally transformed into a complete link format */
		String resultHref = null;

		/* Determine whether it is an external link */
		if (href.startsWith("http://")) {
			resultHref = href;
		} else {
			/* If it is an internal link, the complete link address will be added. Other formats will be ignored and will not be processed, such as: a href = "#" */
			if (href.startsWith("/")) {
				resultHref = "http://www.oschina.net" + href;
			}
		}

		return resultHref;
	}

	/**
	 * Intercept the target content of the web page source file
	 * 
	 * @param content
	 * @return
	 */
	public static String getGoalContent(String content) {
		int sign = content.indexOf("<pre class=\"");
		String signContent = content.substring(sign);

		int start = signContent.indexOf(">");
		int end = signContent.indexOf("</pre>");

		return signContent.substring(start + 1, end);
	}

	/**
	 * Check whether there is a target file in the web page source file
	 * 
	 * @param content
	 * @return
	 */
	public static int isHasGoalContent(String content) {
		return content.indexOf("<pre class=\"");
	}

}
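
A quick sanity check of these utilities (my own illustration, not part of the original source): the regex accepts source-file URLs under /code/explore/, and getHrefOfInOut() turns site-relative links into absolute ones.

package com.sreach.spider;

public class FunctionUtilsDemo {
	public static void main(String[] args) {
		// true: matches the /code/explore/.../file.ext pattern
		System.out.println(FunctionUtils.isCreateFile(
				"http://www.oschina.net/code/explore/achartengine/client/AndroidManifest.xml"));
		// false: no file name at the end of the path
		System.out.println(FunctionUtils.isCreateFile(
				"http://www.oschina.net/code/explore"));
		// "http://www.oschina.net/code/explore/foo": relative link completed
		System.out.println(FunctionUtils.getHrefOfInOut("/code/explore/foo"));
		// null: href="#" style links are ignored
		System.out.println(FunctionUtils.getHrefOfInOut("#"));
	}
}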

The HrefOfPage.java class extracts the hyperlinks from a page:

package com.sreach.spider;
/**
 * @author binghe
 *
 */
public class HrefOfPage {
	/**
	 * Get the hyperlink in the page source code
	 */
	public static void getHrefOfContent(String content) {
		System.out.println("start");
		String[] contents = content.split("<a href=\"");
		for (int i = 1; i < contents.length; i++) {
			int endHref = contents[i].indexOf("\"");

			// Normalize the raw href into an absolute URL; null means an
			// unsupported format such as href="#"
			String href = FunctionUtils.getHrefOfInOut(contents[i].substring(
					0, endHref));

			if (href != null) {

				if (!UrlQueue.isContains(href)
						&& href.indexOf("/code/explore") != -1
						&& !VisitedUrlQueue.isContains(href)) {
					UrlQueue.addElem(href);
				}
			}
		}

		System.out.println(UrlQueue.size() + "--Number of links queued");
		System.out.println(VisitedUrlQueue.size() + "--Number of pages processed");

	}

}
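
To see the extraction in action, you can feed getHrefOfContent() a small HTML fragment (a sketch of my own; the fragment and expected output are illustrative):

package com.sreach.spider;

public class HrefOfPageDemo {
	public static void main(String[] args) {
		// Two links: a relative one that survives the /code/explore filter,
		// and an unrelated one that is dropped
		String html = "<a href=\"/code/explore/demo/Main.java\">src</a>"
				+ "<a href=\"/home\">home</a>";
		HrefOfPage.getHrefOfContent(html);
		// Prints http://www.oschina.net/code/explore/demo/Main.java
		System.out.println(UrlQueue.outElem());
	}
}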

The UrlDataHanding.java class obtains URLs from the unvisited queue, downloads the pages, parses out new URLs, and saves visited URLs. It implements the Runnable interface:

package com.sreach.spider;
/**
 * @author binghe
 *
 */
public class UrlDataHanding implements Runnable {
	/**
	 * Download the corresponding page and analyze the URL corresponding to the page and put it in the unreached queue.
	 * 
	 * @param url
	 */
	public void dataHanding(String url) {
		String content = DownloadPage.getContentFormUrl(url);
		if (content != null) { // null when the download failed
			HrefOfPage.getHrefOfContent(content);
		}
	}

	public void run() {
		while (!UrlQueue.isEmpty()) {
			dataHanding(UrlQueue.outElem());
		}
	}
}
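
One caveat: with several threads sharing the queue, it can empty out between the isEmpty() check and outElem(), which throws a NoSuchElementException; a worker can also exit early while another thread is still about to enqueue URLs. A sketch of a safer loop built on java.util.concurrent (my suggestion, not the original design):

package com.sreach.spider;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class SafeUrlWorker implements Runnable {
	private final BlockingQueue<String> queue;

	public SafeUrlWorker(BlockingQueue<String> queue) {
		this.queue = queue;
	}

	@Override
	public void run() {
		try {
			String url;
			// poll() removes atomically, so there is no check-then-act race
			// between isEmpty() and outElem(); a null return means the queue
			// stayed empty for 2 seconds, so the worker shuts down
			while ((url = queue.poll(2, TimeUnit.SECONDS)) != null) {
				// fetch and parse here, as dataHanding(url) does;
				// newly found links would go back into this queue
			}
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt();
		}
	}
}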

The UrlQueue.java class stores the queue of unvisited URLs:

package com.sreach.spider;

import java.util.LinkedList;
/**
 * @author binghe
 *
 */
public class UrlQueue {
	/** Hyperlink queue */
	public static LinkedList<String> urlQueue = new LinkedList<String>();

	/** Maximum number of hyperlinks the queue is meant to hold (declared but not enforced in this version) */
	public static final int MAX_SIZE = 10000;

	public synchronized static void addElem(String url) {
		urlQueue.add(url);
	}

	public synchronized static String outElem() {
		return urlQueue.removeFirst();
	}

	public synchronized static boolean isEmpty() {
		return urlQueue.isEmpty();
	}

	/* These are also called from worker threads, so they are synchronized too */
	public synchronized static int size() {
		return urlQueue.size();
	}

	public synchronized static boolean isContains(String url) {
		return urlQueue.contains(url);
	}

}

VisitedUrlQueue.java stores the visited URLs in a HashSet, since each visited URL needs to be stored exactly once, which is precisely what a set guarantees:

package com.sreach.spider;

import java.util.HashSet;

/**
 * url queue accessed
 * @author binghe
 * 
 */
public class VisitedUrlQueue {
	public static HashSet<String> visitedUrlQueue = new HashSet<String>();

	public synchronized static void addElem(String url) {
		visitedUrlQueue.add(url);
	}

	public synchronized static boolean isContains(String url) {
		return visitedUrlQueue.contains(url);
	}

	public synchronized static int size() {
		return visitedUrlQueue.size();
	}
}
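
For what it's worth, on Java 8+ the synchronization can be pushed into the collection itself; a sketch of a variant of my own (not the original class) using a concurrent set:

package com.sreach.spider;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class VisitedUrlSet {
	// ConcurrentHashMap.newKeySet() returns a thread-safe Set<String> with
	// the same "each URL stored once" guarantee as HashSet
	private static final Set<String> VISITED = ConcurrentHashMap.newKeySet();

	public static void addElem(String url) {
		VISITED.add(url);
	}

	public static boolean isContains(String url) {
		return VISITED.contains(url);
	}

	public static int size() {
		return VISITED.size();
	}
}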

The Test.java class drives the crawler:

import com.sreach.spider.UrlDataHanding;
import com.sreach.spider.UrlQueue;
/**
 * @author binghe
 *
 */
public class Test {
	public static void main(String[] args) {
		String url = "http://www.oschina.net/code/explore/achartengine/client/AndroidManifest.xml";
		String url1 = "http://www.oschina.net/code/explore";
		String url2 = "http://www.oschina.net/code/explore/achartengine";
		String url3 = "http://www.oschina.net/code/explore/achartengine/client";

		UrlQueue.addElem(url);
		UrlQueue.addElem(url1);
		UrlQueue.addElem(url2);
		UrlQueue.addElem(url3);

		UrlDataHanding[] url_Handings = new UrlDataHanding[10];

		for (int i = 0; i < 10; i++) {
			url_Handings[i] = new UrlDataHanding();
			new Thread(url_Handings[i]).start();
		}

	}
}

Note: since I crawled oschina.net, the URL regular expression is specific to that site and will need to be modified for other websites. You could also move it into an XML configuration file, as sketched below.
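
For example, a minimal sketch of loading the pattern from an XML properties file (the file name crawler-config.xml and the key url.pattern are my own choices for illustration):

package com.sreach.spider;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.regex.Pattern;

public class CrawlerConfig {

	// Expected crawler-config.xml (standard java.util.Properties XML format):
	// <?xml version="1.0" encoding="UTF-8"?>
	// <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
	// <properties>
	//   <entry key="url.pattern">http://www\.oschina\.net/code/explore/.*/\w+\.[a-zA-Z]+</entry>
	// </properties>

	public static Pattern loadUrlPattern(String path) throws IOException {
		Properties props = new Properties();
		try (FileInputStream in = new FileInputStream(path)) {
			props.loadFromXML(in);
		}
		return Pattern.compile(props.getProperty("url.pattern"));
	}
}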

One final word

If you want to join a big tech company, get promoted and get a raise, or feel stuck in your current job, feel free to reach out to me privately. I hope some of my experience can help you~~

Well, that's all for today. If this helped, give it a like, save it, and leave a comment. I'm glacier, and I'll see you next time~~

Topics: Back-end, Crawler