Hello, I'm glacier~~
Recently I was working on a search-related project and needed to crawl some links from the web and store them in an index. Although there are many powerful open-source crawlers, I wrote a simple web crawler myself as a learning exercise, to understand how crawling works.
Just do it. Let's start!
First, a brief introduction to what each class does:
- DownloadPage.java downloads the page source for a given hyperlink.
- FunctionUtils.java provides static utility methods: regular-expression matching of page links, splitting a URL into its elements, deciding whether a file should be created, converting page URLs into a standard form, and extracting the target content from the page source.
- HrefOfPage.java extracts the hyperlinks from the page source.
- UrlDataHanding.java ties the other classes together, from taking a URL to fetching its data and processing it.
- UrlQueue.java holds the queue of URLs that have not been visited yet.
- VisitedUrlQueue.java holds the queue of URLs that have already been visited.
The following describes the source code of each class:
DownloadPage.java class needs HttpClient component.
```java
package com.sreach.spider;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

/**
 * @author binghe
 */
public class DownloadPage {

    /**
     * Fetch the web page content for the given URL
     *
     * @param url
     * @return
     */
    public static String getContentFormUrl(String url) {
        /* Instantiate an HttpClient client */
        HttpClient client = new DefaultHttpClient();
        HttpGet getHttp = new HttpGet(url);

        String content = null;

        HttpResponse response;
        try {
            /* Execute the request and obtain the response */
            response = client.execute(getHttp);
            HttpEntity entity = response.getEntity();
            VisitedUrlQueue.addElem(url);
            if (entity != null) {
                /* Convert the entity to text */
                content = EntityUtils.toString(entity);

                /* Check whether the page source should be saved locally */
                if (FunctionUtils.isCreateFile(url)
                        && FunctionUtils.isHasGoalContent(content) != -1) {
                    FunctionUtils.createFile(
                            FunctionUtils.getGoalContent(content), url);
                }
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            client.getConnectionManager().shutdown();
        }
        return content;
    }
}
```
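A side note: DefaultHttpClient is deprecated in HttpClient 4.3 and later. If you are on a newer version, the same download step could be sketched as below; the class and method names here are my own illustrative naming, not part of the project above.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Sketch only: the same "fetch page body as a String" step with the
// non-deprecated HttpClient 4.3+ API. Names are illustrative.
public class PageDownloader {

    /** Returns the page body as a String, or null if the request fails. */
    public static String fetch(String url) {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return response.getEntity() != null
                    ? EntityUtils.toString(response.getEntity())
                    : null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}
```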
All methods in the FunctionUtils.java class are static.
```java
package com.sreach.spider;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author binghe
 */
public class FunctionUtils {

    /**
     * Regular expression that matches hyperlinks
     */
    private static String pat = "http://www\\.oschina\\.net/code/explore/.*/\\w+\\.[a-zA-Z]+";
    private static Pattern pattern = Pattern.compile(pat);

    private static BufferedWriter writer = null;

    /**
     * Crawler search depth
     */
    public static int depth = 0;

    /**
     * Split the URL on "/" to get the elements of the hyperlink
     *
     * @param url
     * @return
     */
    public static String[] divUrl(String url) {
        return url.split("/");
    }

    /**
     * Determine whether a file should be created for this URL
     *
     * @param url
     * @return
     */
    public static boolean isCreateFile(String url) {
        Matcher matcher = pattern.matcher(url);
        return matcher.matches();
    }

    /**
     * Create the corresponding file
     *
     * @param content
     * @param urlPath
     */
    public static void createFile(String content, String urlPath) {
        /* Split the url */
        String[] elems = divUrl(urlPath);
        StringBuffer path = new StringBuffer();

        File file = null;
        for (int i = 1; i < elems.length; i++) {
            if (i != elems.length - 1) {
                path.append(elems[i]);
                path.append(File.separator);
                file = new File("D:" + File.separator + path.toString());
            }

            if (i == elems.length - 1) {
                Pattern pattern = Pattern.compile("\\w+\\.[a-zA-Z]+");
                Matcher matcher = pattern.matcher(elems[i]);
                if ((matcher.matches())) {
                    if (!file.exists()) {
                        file.mkdirs();
                    }
                    String[] fileName = elems[i].split("\\.");
                    file = new File("D:" + File.separator + path.toString()
                            + File.separator + fileName[0] + ".txt");
                    try {
                        file.createNewFile();
                        writer = new BufferedWriter(new OutputStreamWriter(
                                new FileOutputStream(file)));
                        writer.write(content);
                        writer.flush();
                        writer.close();
                        System.out.println("File created successfully");
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    /**
     * Take a hyperlink from the page and convert it into a complete URL
     *
     * @param href
     * @return
     */
    public static String getHrefOfInOut(String href) {
        /* Internal and external links are both converted into a complete link */
        String resultHref = null;

        /* Determine whether it is an external link */
        if (href.startsWith("http://")) {
            resultHref = href;
        } else {
            /* If it is an internal link, prepend the site address; other formats
               such as href="#" are ignored */
            if (href.startsWith("/")) {
                resultHref = "http://www.oschina.net" + href;
            }
        }
        return resultHref;
    }

    /**
     * Extract the target content from the page source
     *
     * @param content
     * @return
     */
    public static String getGoalContent(String content) {
        int sign = content.indexOf("<pre class=\"");
        String signContent = content.substring(sign);

        int start = signContent.indexOf(">");
        int end = signContent.indexOf("</pre>");

        return signContent.substring(start + 1, end);
    }

    /**
     * Check whether the page source contains the target content
     *
     * @param content
     * @return
     */
    public static int isHasGoalContent(String content) {
        return content.indexOf("<pre class=\"");
    }
}
```
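To make the matching rule above concrete, here is a tiny throwaway check (not part of the crawler itself) showing which kind of URL the hyperlink pattern accepts and which it rejects:

```java
import java.util.regex.Pattern;

// Quick illustration of the hyperlink pattern used in FunctionUtils.
public class PatternCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
                "http://www\\.oschina\\.net/code/explore/.*/\\w+\\.[a-zA-Z]+");
        // Matches: the URL ends with a file-like segment (AndroidManifest.xml)
        System.out.println(p.matcher(
                "http://www.oschina.net/code/explore/achartengine/client/AndroidManifest.xml")
                .matches()); // true
        // Does not match: no trailing file name after /code/explore/
        System.out.println(p.matcher(
                "http://www.oschina.net/code/explore/achartengine")
                .matches()); // false
    }
}
```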
The HrefOfPage.java class extracts the hyperlinks from the page source.
```java
package com.sreach.spider;

/**
 * @author binghe
 */
public class HrefOfPage {

    /**
     * Get the hyperlinks in the page source
     */
    public static void getHrefOfContent(String content) {
        System.out.println("start");
        String[] contents = content.split("<a href=\"");
        for (int i = 1; i < contents.length; i++) {
            int endHref = contents[i].indexOf("\"");

            String aHref = FunctionUtils.getHrefOfInOut(contents[i].substring(
                    0, endHref));

            if (aHref != null) {
                String href = FunctionUtils.getHrefOfInOut(aHref);

                if (!UrlQueue.isContains(href)
                        && href.indexOf("/code/explore") != -1
                        && !VisitedUrlQueue.isContains(href)) {
                    UrlQueue.addElem(href);
                }
            }
        }

        System.out.println(UrlQueue.size() + "--Number of links captured");
        System.out.println(VisitedUrlQueue.size() + "--Number of pages processed");
    }
}
```
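Splitting on `<a href="` works for this simple case, but if adding a dependency is acceptable, link extraction is more robust with an HTML parser. Purely as an optional sketch, assuming the Jsoup library (which the project above does not use):

```java
package com.sreach.spider;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Optional alternative sketch: extract hyperlinks with Jsoup instead of string splitting.
public class JsoupHrefExtractor {

    public static void extract(String html, String baseUri) {
        // baseUri lets absUrl() resolve relative links such as href="/code/explore/..."
        Document doc = Jsoup.parse(html, baseUri);
        for (Element a : doc.select("a[href]")) {
            String href = a.absUrl("href");
            if (!href.isEmpty() && href.contains("/code/explore")
                    && !UrlQueue.isContains(href)
                    && !VisitedUrlQueue.isContains(href)) {
                UrlQueue.addElem(href);
            }
        }
    }
}
```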
The UrlDataHanding.java class ties the other classes together: it takes URLs from the unvisited queue, downloads the pages, parses out new URLs, and records the visited URLs. It implements the Runnable interface.
```java
package com.sreach.spider;

/**
 * @author binghe
 */
public class UrlDataHanding implements Runnable {

    /**
     * Download the corresponding page, parse the URLs it contains,
     * and put them into the unvisited queue.
     *
     * @param url
     */
    public void dataHanding(String url) {
        HrefOfPage.getHrefOfContent(DownloadPage.getContentFormUrl(url));
    }

    public void run() {
        while (!UrlQueue.isEmpty()) {
            dataHanding(UrlQueue.outElem());
        }
    }
}
```
The UrlQueue.java class stores the queue of URLs that have not been visited yet.
```java
package com.sreach.spider;

import java.util.LinkedList;

/**
 * @author binghe
 */
public class UrlQueue {

    /** Hyperlink queue */
    public static LinkedList<String> urlQueue = new LinkedList<String>();

    /** Maximum number of hyperlinks in the queue */
    public static final int MAX_SIZE = 10000;

    public synchronized static void addElem(String url) {
        urlQueue.add(url);
    }

    public synchronized static String outElem() {
        return urlQueue.removeFirst();
    }

    public synchronized static boolean isEmpty() {
        return urlQueue.isEmpty();
    }

    public static int size() {
        return urlQueue.size();
    }

    public static boolean isContains(String url) {
        return urlQueue.contains(url);
    }
}
```
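One thing to be aware of: size() and isContains() are not synchronized, and in the worker loop a thread can see the queue as non-empty and then have another thread drain it before outElem() runs, which would make removeFirst() throw. If you prefer to lean on java.util.concurrent instead of manual synchronization, a possible variant looks like the sketch below (my sketch, not one of the original classes); poll() returning null also removes that race between isEmpty() and outElem().

```java
package com.sreach.spider;

import java.util.concurrent.LinkedBlockingQueue;

// Sketch: a thread-safe unvisited-URL queue backed by LinkedBlockingQueue.
public class BlockingUrlQueue {

    // Bounded at 10000 entries, mirroring MAX_SIZE in the original class.
    private static final LinkedBlockingQueue<String> QUEUE =
            new LinkedBlockingQueue<String>(10000);

    public static void addElem(String url) {
        QUEUE.offer(url); // silently drops the URL if the queue is full
    }

    public static String outElem() {
        return QUEUE.poll(); // returns null instead of throwing when empty
    }

    public static boolean isEmpty() {
        return QUEUE.isEmpty();
    }

    public static boolean isContains(String url) {
        return QUEUE.contains(url);
    }
}
```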
The VisitedUrlQueue.java class stores the URLs that have already been visited. A HashSet is used because each visited URL needs to be stored only once, which is exactly what a set guarantees.
```java
package com.sreach.spider;

import java.util.HashSet;

/**
 * Queue of visited URLs
 *
 * @author binghe
 */
public class VisitedUrlQueue {

    public static HashSet<String> visitedUrlQueue = new HashSet<String>();

    public synchronized static void addElem(String url) {
        visitedUrlQueue.add(url);
    }

    public synchronized static boolean isContains(String url) {
        return visitedUrlQueue.contains(url);
    }

    public synchronized static int size() {
        return visitedUrlQueue.size();
    }
}
```
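The same effect can be achieved by wrapping the set once rather than synchronizing each static method. This is just a minimal alternative sketch, not the original class:

```java
package com.sreach.spider;

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: visited-URL set wrapped with Collections.synchronizedSet,
// so every operation locks on the same underlying set.
public class SynchronizedVisitedSet {

    private static final Set<String> VISITED =
            Collections.synchronizedSet(new HashSet<String>());

    public static void addElem(String url) {
        VISITED.add(url);
    }

    public static boolean isContains(String url) {
        return VISITED.contains(url);
    }

    public static int size() {
        return VISITED.size();
    }
}
```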
The Test.java class is the test driver.
```java
import java.sql.SQLException;

import com.sreach.spider.UrlDataHanding;
import com.sreach.spider.UrlQueue;

/**
 * @author binghe
 */
public class Test {
    public static void main(String[] args) throws SQLException {
        String url = "http://www.oschina.net/code/explore/achartengine/client/AndroidManifest.xml";
        String url1 = "http://www.oschina.net/code/explore";
        String url2 = "http://www.oschina.net/code/explore/achartengine";
        String url3 = "http://www.oschina.net/code/explore/achartengine/client";

        UrlQueue.addElem(url);
        UrlQueue.addElem(url1);
        UrlQueue.addElem(url2);
        UrlQueue.addElem(url3);

        UrlDataHanding[] url_Handings = new UrlDataHanding[10];

        for (int i = 0; i < 10; i++) {
            url_Handings[i] = new UrlDataHanding();
            new Thread(url_Handings[i]).start();
        }
    }
}
```
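Starting ten raw threads works, but an ExecutorService makes the workers easier to manage. A possible variant of the test driver (my sketch; the class name PoolTest is hypothetical):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.sreach.spider.UrlDataHanding;
import com.sreach.spider.UrlQueue;

// Sketch: same test flow, but with a fixed thread pool instead of raw threads.
public class PoolTest {
    public static void main(String[] args) {
        UrlQueue.addElem("http://www.oschina.net/code/explore");

        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            pool.execute(new UrlDataHanding()); // UrlDataHanding implements Runnable
        }
        pool.shutdown(); // stop accepting tasks; workers exit once the URL queue drains
    }
}
```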
Note: since what I crawled is oschina, the URL regular expression is specific to that site and needs to be modified for other websites. You could also move it into an XML or properties configuration file; a small sketch of that idea follows.
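As a rough sketch of that idea (the file name spider.properties and the key url.pattern are hypothetical, just for illustration):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.regex.Pattern;

// Sketch: load the crawl pattern from an external properties file instead of
// hard-coding it in FunctionUtils. File name and key are hypothetical.
public class SpiderConfig {

    public static Pattern loadPattern(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return Pattern.compile(props.getProperty("url.pattern"));
    }

    public static void main(String[] args) throws IOException {
        Pattern p = loadPattern("spider.properties");
        System.out.println(p.pattern());
    }
}
```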
In closing
If you want to join a big tech company, get promoted and get a raise, or feel stuck in your current job, feel free to reach out to me privately. I hope some of my experience can help you~~
Well, that's all for today. If this helped, give it a like, save it, and leave a comment. I'm glacier, and I'll see you next time~~