Introduction
Apache Nutch is an open-source web crawler. It takes a list of seed URLs, generates further URLs to fetch from them, and then parses the content of each web page by stripping the HTML tags. It is one of the most established open-source crawlers available today. This guide is written for version 2.3.1, the latest version at the time of writing, installed on a CentOS 7.0 Linux server.
Requirements
The major difference between Nutch 1 and Nutch 2 is that Nutch 2 stores all results in a data store. The default data store is Apache HBase, but you can also use MongoDB. Because Nutch 2.3.1 is currently distributed only as source code and not as a binary, you will need to compile the code locally after adjusting all the configuration parameters. The build tool of choice is Apache Ant.
Install Java
If you don't already have Java installed, install it using Yum.
sudo -i
yum install java-1.8.0-openjdk.x86_64
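After the install it can help to confirm where Java ended up, since the HBase scripts may expect JAVA_HOME to be set. The following is a minimal sketch, not part of the original steps; the helper name java_home_of is hypothetical, and it assumes GNU readlink as shipped with CentOS.

```shell
# Print the JAVA_HOME implied by a java executable path (hypothetical helper).
#   $1 = path to a java binary, for example the output of: command -v java
java_home_of() {
    local java_bin real
    java_bin="$1"
    # Follow the symlinks that yum/alternatives set up to the real binary.
    real="$(readlink -f "$java_bin")" || return 1
    # Strip the trailing /bin/java to get the installation directory.
    dirname "$(dirname "$real")"
}
```

For example, export JAVA_HOME="$(java_home_of "$(command -v java)")" would set the variable for the current shell session.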
Install Apache HBase
cd /tmp
wget http://archive.apache.org/dist/hbase/hbase-0.98.8/hbase-0.98.8-hadoop2-bin.tar.gz
cd /usr/share
tar zxf /tmp/hbase-0.98.8-hadoop2-bin.tar.gz
Edit: /usr/share/hbase-0.98.8-hadoop2/conf/hbase-site.xml
Insert this block inside the <configuration> element to set up the storage location and run HBase in standalone mode:
<property>
<name>hbase.rootdir</name>
<value>/usr/share/hbase-0.98.8-hadoop2/data/</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
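To show where the two properties sit in the finished file, here is a rough sketch that writes a complete, minimal hbase-site.xml. The helper name write_hbase_site is hypothetical, and it overwrites the target file, so treat it as an illustration rather than a required step.

```shell
# Write a minimal standalone hbase-site.xml (hypothetical helper).
#   $1 = path of the file to write (it is overwritten)
#   $2 = HBase data directory, used as hbase.rootdir
write_hbase_site() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>$2</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
EOF
}
```

With the paths used in this guide, the call would be write_hbase_site /usr/share/hbase-0.98.8-hadoop2/conf/hbase-site.xml /usr/share/hbase-0.98.8-hadoop2/data/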
Commands:
/usr/share/hbase-0.98.8-hadoop2/bin/start-hbase.sh
/usr/share/hbase-0.98.8-hadoop2/bin/hbase shell
At this point, if you are taken to the HBase prompt, you have installed HBase successfully. To further test, try listing existing tables and creating new ones.
> list
> create 'test','cf'
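The same check can also be run without staying in the interactive prompt by piping a command into the HBase shell. This is only a sketch; the function name hbase_list_tables is hypothetical, and it assumes HBase has already been started with start-hbase.sh.

```shell
# Run "list" in the HBase shell non-interactively and print its output.
#   $1 = path to the hbase launcher script
hbase_list_tables() {
    # The shell reads the command from standard input and exits at end of input.
    echo "list" | "$1" shell
}
```

With the install location from this guide, that would be hbase_list_tables /usr/share/hbase-0.98.8-hadoop2/bin/hbase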
Install Apache Ant
yum install ant
Install Nutch
cd /tmp
wget http://apache.mesi.com.ar/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
cd /usr/share
tar zxf /tmp/apache-nutch-2.3.1-src.tar.gz
Edit /usr/share/apache-nutch-2.3.1/conf/nutch-site.xml and add the following inside the <configuration> element:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
In the same file, set the Elasticsearch cluster name (the elastic.cluster property) to match the cluster name in the Elasticsearch config.
Edit /usr/share/apache-nutch-2.3.1/ivy/ivy.xml
Ensure the following lines are there and uncommented:
<dependency org="org.apache.gora" name="gora-core" rev="0.6.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />
Edit /usr/share/apache-nutch-2.3.1/conf/gora.properties
Ensure this is there:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
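Before compiling, a quick check that the setting is present and not commented out can save a rebuild. A minimal sketch follows; the function name gora_uses_hbase is hypothetical.

```shell
# Succeed only if the file contains an active (uncommented) HBaseStore setting.
#   $1 = path to gora.properties
gora_uses_hbase() {
    # Anchoring at line start means a line beginning with "#" does not match.
    grep -Eq '^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*org\.apache\.gora\.hbase\.store\.HBaseStore[[:space:]]*$' "$1"
}
```

For example: gora_uses_hbase /usr/share/apache-nutch-2.3.1/conf/gora.properties && echo "gora.properties OK"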
Compile Nutch from source code:
cd /usr/share/apache-nutch-2.3.1
ant runtime
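If the build fails, run ant clean, correct the configuration files, and then build again:
ant clean
ant runtime
If the build is successful, a compiled runtime folder is created inside /usr/share/apache-nutch-2.3.1.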
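One way to confirm the build produced a usable runtime is to look for the nutch launcher under runtime/local. This is a sketch under that assumption; the function name nutch_runtime_ready is hypothetical.

```shell
# Succeed if the local runtime contains an executable nutch launcher.
#   $1 = Nutch source directory, e.g. /usr/share/apache-nutch-2.3.1
nutch_runtime_ready() {
    [ -x "$1/runtime/local/bin/nutch" ]
}
```

For example: nutch_runtime_ready /usr/share/apache-nutch-2.3.1 && echo "runtime is ready"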
Nutch Commands
The following commands run through a simple scenario consisting of a web crawl followed by indexing to Elasticsearch (ES). Make sure you run them inside the runtime/local folder.
mkdir urls
echo "https://en.wikipedia.org" > urls/seed.txt
bin/nutch inject urls/seed.txt
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all
The last command sends the data over to ES. It will not work until ES is installed and running. A different but similar command is used for other indexers such as Solr.
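The sequence above could be wrapped in a small function so that it stops at the first failing step instead of carrying on with stale data. This is only a sketch: the function name crawl_round and the NUTCH_HOME and TOP_N variables are assumptions, with defaults taken from the paths and values used in this guide.

```shell
# Defaults follow this guide; both can be overridden from the environment.
NUTCH_HOME="${NUTCH_HOME:-/usr/share/apache-nutch-2.3.1/runtime/local}"
TOP_N="${TOP_N:-40}"

# Run one full crawl-and-index round. The body is a subshell, so the
# cd does not change the caller's working directory.
crawl_round() (
    cd "$NUTCH_HOME" || exit 1
    # Each step runs only if the previous one succeeded.
    bin/nutch inject urls/seed.txt &&
    bin/nutch generate -topN "$TOP_N" &&
    bin/nutch fetch -all &&
    bin/nutch parse -all &&
    bin/nutch updatedb -all &&
    bin/nutch index -all
)
```

Saved in a script file that ends with a call to crawl_round, this can be scheduled with cron. As an example with an arbitrary schedule, a crontab line such as 0 2 * * * /scripts/scrape would run it every day at 02:00.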
Automation
The commands above can be included in a script file (/scripts/scrape). Set up a cron job with crontab to run the script at the desired time interval.
Install Elasticsearch
The version of ES that works with this whole setup is 1.7.3. For installing ES, refer to this guide: http://mrstevenzhao.blogspot.com/2016/06/elasticsearch-install-on-linux.html