Nutch2.1+Hbase+Solr to quickly build a crawler and search engine (fast, basically within 2 hours)

Posted by MAXIEDECIMAL on Thu, 26 Sep 2019 07:49:34 +0200

Note: This method is for quick experience or small data volume, not suitable for large data volume production environment.

Environmental preparation:

  1. Centos7
  2. Nutch2.2.1
  3. JAVA1.8
  4. ant1.9.14
  5. HBase 0.90.4 (stand-alone version)
  6. solr7.7

Relevant download address:

Links: https://pan.baidu.com/s/1Tut2CcKoJ9-G-HBq8zexMQ Extraction code: v75v

Start installation

  1. Default installation of jdk, ant (in fact, decompression configuration of environment variables will not be able to Baidu once)

  2. Install hbase stand-alone version

    1. Download decompression
      	wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
      	tar zxf hbase-0.90.4.tar.gz
      	# Or use the package I provided directly
      
    2. To configure
    	<configuration>
    	  <property>
    		<name>hbase.rootdir</name>
    		<value>/data/hbase</value>
    	  </property>
    	  <property>
    		<name>hbase.zookeeper.property.dataDir</name>
    		<value>/data/zookeeper</value>
    	  </property>
    	</configuration>
    

    Description: The hbase.rootdir directory is used to store information about HBase. The default value is / tmp/hbase-${user.name}/hbase; the hbase.zookeeper.property.dataDir directory is used to store information about zookeeper (HBase has zookeeper built in), and the default value is / tmp/hbase-${user.name}/zookeeper. 3. boot

    ./bin/start-hbase.sh

  3. solr installation configuration

    1. Download and install
    	wget https://mirrors.cnnic.cn/apache/lucene/solr/7.7.2/solr-7.7.2-src.tgz
    	tar -zxvf solr-7.7.2-src.tgz
    	./bin/solr  start  -force   //start-up
    
    1. Add core reference https://blog.csdn.net/weixin_39082031/article/details/78924909

    After adding, remember to restart start transposition restart

  4. Nutch Edit Installation (don't forget the pre-ant configuration)

    1. download
    	wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
    	tar zxf apache-nutch-2.2.1-src.tar.gz
    
    1. Configuration modification

    conf/nutch-site.xml

    	<property>
    	  <name>storage.data.store.class</name>
    	  <value>org.apache.gora.hbase.store.HBaseStore</value>
    	  <description>Default class for storing data</description>
    	</property>
    

    ivy/ivy.xml

    	<!-- Uncomment this to use HBase as Gora backend. -->
    	<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
    

    conf/gora.properties

    	gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
    
    1. Compile ant runtime Here is particularly slow, you can optimize the speed of ivy by Baidu, you can also download it like this. If you encounter failure, you can download the package yourself and put it in the wrong path.

      After success: generate two directories, runtime and build. The following configuration file modifications are all changed to the following files in runtime/local

    2. Adding seed url

    	#In the directory you want to store
    	mkdir /data/urls
    	vim seed.txt
    	#Add url to grab
    	http://www.dxy.cn/
    
    1. Setting url filtering rules (optional)
    	#Comment out this line
    	# skip URLs containing certain characters as probable queries, etc.
    	#-[?*!@=]
    	# accept anything else
    	#Comment out this line
    	#+.
    	+^http:\/\/heart\.dxy\.cn\/article\/[0-9]+$
    
    1. Configure the agent name (must be configured or errors will be reported)
    	<property>
    	  <name>http.agent.name</name>
    	  <value>My Nutch Spider</value>
    	</property>
    
  5. The last step is to configure Solr to support nutch stored data structure (schema), modify / data/solr-7.7.2/server/solr/jkj_core/conf/managed-schema file, and restart Solr

    Add a new configuration section (just put it in < schema>)

    	<!-- New field for nutch  start-->
    
    	<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
    		  <analyzer>
    			<tokenizer class="solr.StandardTokenizerFactory"/>
    			   <filter class="solr.LowerCaseFilterFactory"/>
    			   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
    		  </analyzer>
    		</fieldType>
    
    		<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
    
    		<!-- A Trie based date field for faster date range queries and date faceting. -->
    		<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
    
    		<!-- core fields -->
    		<field name="batchId" type="string" stored="true" indexed="false"/>
    		<field name="digest" type="string" stored="true" indexed="false"/>
    		<field name="boost" type="pfloat" stored="true" indexed="false"/>
    
    		<!-- fields for index-basic plugin -->
    		<field name="host" type="url" stored="false" indexed="true"/>
    		<field name="url" type="url" stored="true" indexed="true" required="true"/>
    		<field name="orig" type="url" stored="true" indexed="true" />
    		<!-- stored=true for highlighting, use term vectors  and positions for fast highlighting -->
    		<field name="content" type="text_general" stored="true" indexed="true"/>
    		<field name="title" type="text_general" stored="true" indexed="true"/>
    		<field name="cache" type="string" stored="true" indexed="false"/>
    		<field name="tstamp" type="date" stored="true" indexed="false"/>
    
    		<!-- catch-all field -->
    		<field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>
    
    		<!-- fields for index-anchor plugin -->
    		<field name="anchor" type="text_general" stored="true" indexed="true"
    			multiValued="true"/>
    
    		<!-- fields for index-more plugin -->
    		<field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    		<field name="contentLength" type="string" stored="true" indexed="false"/>
    		<field name="lastModified" type="date" stored="true" indexed="false"/>
    		<field name="date" type="tdate" stored="true" indexed="true"/>
    
    		<!-- fields for languageidentifier plugin -->
    		<field name="lang" type="string" stored="true" indexed="true"/>
    
    		<!-- fields for subcollection plugin -->
    		<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
    
    		<!-- fields for feed plugin (tag is also used by microformats-reltag)-->
    		<field name="author" type="string" stored="true" indexed="true"/>
    		<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    		<field name="feed" type="string" stored="true" indexed="true"/>
    		<field name="publishedDate" type="date" stored="true" indexed="true"/>
    		<field name="updatedDate" type="date" stored="true" indexed="true"/>
    
    		<!-- fields for creativecommons plugin -->
    		<field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
    
    		<!-- fields for tld plugin -->    
    		<field name="tld" type="string" stored="false" indexed="false"/>
    
    		<!-- New field for nutch  end-->
    
  6. Start nutch grab

    	# The bin directory is Runtime / bin under nutch
    	./bin/crawl ~/urls/ jkj http://192.168.1.61:8983/solr/jkj_core 2
    	~/urls/ It's the directory where I store the crawled files.  jkj It's my designated store in hbase Medium id(As you can understand, create tables automatically
    	http://192.168.1.61:8983/solr/jkj_core solr creates the address of collection
    	2 For grabbing depth
    

7. View the results through solr or hbase

Topics: Programming HBase solr Apache Zookeeper