Steven Zhao's Blog: July 2016

Introduction

Apache Nutch is a web scraper. It takes a list of seed URLs, generates further URLs from the links it finds, and then parses the content of each web page by stripping the HTML tags. It is often treated as the gold standard among open-source web scrapers. This guide is written specifically for version 2.3.1, the latest 2.x version at the time of writing, installed on a CentOS 7.0 Linux server.

Requirements

The major difference between Nutch 1 and Nutch 2 is that Nutch 2 stores all results in a data store. The default data store is Apache HBase, but you can also use MongoDB. Since Nutch 2.3.1 is currently distributed as source code only, with no binary release, you will need to compile the code locally after adjusting the configuration parameters. The build tool of choice is Apache Ant.

Install Java

If you don't already have Java installed, install OpenJDK 8 using Yum.

sudo -i
yum install java-1.8.0-openjdk.x86_64
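
HBase and Ant usually need to know where Java lives. The sketch below shows one way to derive a JAVA_HOME candidate from the installed java binary. The function name java_home_from_binary is made up for this example, and it assumes the real binary sits at <JAVA_HOME>/bin/java once symlinks are resolved, which may not hold on every system.

```shell
# Derive a JAVA_HOME candidate from the path of a java executable.
# Assumption: after resolving symlinks, the binary is at <JAVA_HOME>/bin/java.
java_home_from_binary() {
    local java_bin resolved
    java_bin="$1"
    # readlink -f follows the whole symlink chain (e.g. /usr/bin/java -> /etc/alternatives/java -> ...)
    resolved="$(readlink -f "$java_bin")" || return 1
    # Strip the trailing /bin/java to get the installation directory
    printf '%s\n' "${resolved%/bin/java}"
}

# Example use (not executed here):
#   export JAVA_HOME="$(java_home_from_binary "$(command -v java)")"
#   java -version
```

If java -version prints a 1.8.0 version string after this, the install worked.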

Install Apache HBase

cd /tmp
wget http://archive.apache.org/dist/hbase/hbase-0.98.8/hbase-0.98.8-hadoop2-bin.tar.gz
cd /usr/share
tar zxf /tmp/hbase-0.98.8-hadoop2-bin.tar.gz

Edit: /usr/share/hbase-0.98.8-hadoop2/conf/hbase-site.xml

Insert this block inside the <configuration> element to set up the storage location and keep HBase in standalone (non-distributed) mode:

<property>
<name>hbase.rootdir</name>
<value>/usr/share/hbase-0.98.8-hadoop2/data/</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
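
For reference, here is a sketch of what the whole file could look like once the two properties are wrapped in the <configuration> root element. It is written as a small shell function so the file can be generated in one step. The function name write_hbase_site is made up for this example, and note that it overwrites the target file, so back up any existing settings first.

```shell
# Write a minimal standalone hbase-site.xml.
# $1: path of the file to write (overwritten!)
# $2: value for hbase.rootdir (the HBase data directory)
write_hbase_site() {
    local target="$1" rootdir="$2"
    cat > "$target" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>${rootdir}</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
EOF
}

# Example use with the paths from this guide (not executed here):
#   write_hbase_site /usr/share/hbase-0.98.8-hadoop2/conf/hbase-site.xml \
#                    /usr/share/hbase-0.98.8-hadoop2/data/
```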

Start HBase and open the HBase shell:

/usr/share/hbase-0.98.8-hadoop2/bin/start-hbase.sh
/usr/share/hbase-0.98.8-hadoop2/bin/hbase shell

At this point, if you are taken to the HBase prompt, you have installed HBase successfully. To test further, try listing the existing tables and creating a new one.

> list
> create 'test','cf'
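
The same check can be run without typing into the prompt, and the test table can be removed afterwards. This is only a sketch: it assumes the HBase shell reads commands from standard input when they are piped in, and the function name hbase_smoke_test and the HBASE_BIN variable are made up for this example. Run it instead of the manual commands, since creating the 'test' table a second time reports an error.

```shell
# Path to the hbase launcher; override HBASE_BIN if HBase lives elsewhere.
HBASE_BIN="${HBASE_BIN:-/usr/share/hbase-0.98.8-hadoop2/bin/hbase}"

# Create a throwaway table, list tables, then clean up again.
# A table has to be disabled before it can be dropped.
hbase_smoke_test() {
    "$HBASE_BIN" shell <<'EOF'
create 'test', 'cf'
list
disable 'test'
drop 'test'
EOF
}

# Example use (not executed here):
#   hbase_smoke_test
```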

Install Apache Ant

yum install ant

Install Nutch

cd /tmp
wget http://apache.mesi.com.ar/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
cd /usr/share
tar zxf /tmp/apache-nutch-2.3.1-src.tar.gz

Edit /usr/share/apache-nutch-2.3.1/conf/nutch-site.xml and add the following inside the <configuration> element:

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

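If you plan to index into Elasticsearch, also set the Elasticsearch cluster name in this file (the elastic.cluster property) so that it matches the cluster name in the Elasticsearch config.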

Edit /usr/share/apache-nutch-2.3.1/ivy/ivy.xml
Ensure the following lines are there and uncommented:

<dependency org="org.apache.gora" name="gora-core" rev="0.6.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />

Edit /usr/share/apache-nutch-2.3.1/conf/gora.properties
Ensure this is there:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
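
Since a failed build costs time, it can help to check the settings before compiling. The sketch below greps the two config files for the HBase store class. The function name check_nutch_hbase_config is made up for this example. It does not inspect ivy.xml, because a simple grep cannot tell whether a dependency sits inside a multi-line XML comment, so that file still needs a manual look.

```shell
# Default install location from this guide; pass another path as $1 to override.
NUTCH_HOME="${NUTCH_HOME:-/usr/share/apache-nutch-2.3.1}"

# Return 0 if both config files point at the HBase store, 1 otherwise.
check_nutch_hbase_config() {
    local home="${1:-$NUTCH_HOME}" status=0
    local store='org\.apache\.gora\.hbase\.store\.HBaseStore'

    # nutch-site.xml should mention the HBase store class
    if ! grep -q "$store" "$home/conf/nutch-site.xml"; then
        echo "nutch-site.xml: HBaseStore not configured" >&2
        status=1
    fi

    # gora.properties needs the setting at the start of a line (not commented out)
    if ! grep -q "^gora\.datastore\.default=$store" "$home/conf/gora.properties"; then
        echo "gora.properties: gora.datastore.default not set to HBaseStore" >&2
        status=1
    fi

    return "$status"
}

# Example use (not executed here):
#   check_nutch_hbase_config && echo "config looks ready for the build"
```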

Compile Nutch from source code:

cd /usr/share/apache-nutch-2.3.1
ant runtime

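If the build fails, clean the build output:

ant clean

Update the configs as needed, then build again:

ant runtime

If the build succeeds, a compiled runtime folder is created inside /usr/share/apache-nutch-2.3.1.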
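
The clean-and-rebuild cycle can be wrapped in one step that also confirms the runtime was produced. This is a sketch only: the function name rebuild_nutch and the ANT_BIN variable are made up for this example, and it assumes a successful build leaves the launcher at runtime/local/bin/nutch.

```shell
# Default install location from this guide; override if Nutch lives elsewhere.
NUTCH_HOME="${NUTCH_HOME:-/usr/share/apache-nutch-2.3.1}"
# Ant executable; override ANT_BIN if ant is not on the PATH.
ANT_BIN="${ANT_BIN:-ant}"

# Clean, rebuild, and verify that the local runtime launcher exists.
rebuild_nutch() {
    # Run in a subshell so the caller's working directory is left alone
    ( cd "$NUTCH_HOME" && "$ANT_BIN" clean && "$ANT_BIN" runtime ) || return 1
    [ -f "$NUTCH_HOME/runtime/local/bin/nutch" ]
}

# Example use (not executed here):
#   rebuild_nutch && echo "runtime/local is ready"
```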

Nutch Commands

The following commands run through a simple scenario consisting of a web crawl followed by indexing to Elasticsearch (ES).

Make sure you run these inside the runtime/local folder.

mkdir urls
echo "https://en.wikipedia.org" > urls/seed.txt
bin/nutch inject urls/seed.txt
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all

The index command sends the data over to ES, so it will not work until ES is installed and running. A different but similar command is used for other indexers such as Solr.
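
A single pass only fetches the URLs generated from the seed list, so a deeper crawl means repeating the generate, fetch, parse and updatedb steps. The sketch below chains the commands above so that each step runs only if the previous one succeeded. The function names crawl_round and crawl, the NUTCH_BIN variable and the number of rounds are all made up for this example.

```shell
# Nutch launcher, relative to runtime/local; override NUTCH_BIN if needed.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# One crawl round: pick up to $1 URLs, fetch them, parse them, update the db.
crawl_round() {
    local top_n="${1:-40}"
    "$NUTCH_BIN" generate -topN "$top_n" &&
    "$NUTCH_BIN" fetch -all &&
    "$NUTCH_BIN" parse -all &&
    "$NUTCH_BIN" updatedb -all
}

# Inject the seeds, run the given number of rounds, then index.
# $1: seed file, $2: rounds (default 1), $3: topN per round (default 40)
crawl() {
    local seed="$1" rounds="${2:-1}" top_n="${3:-40}" i=1
    "$NUTCH_BIN" inject "$seed" || return 1
    while [ "$i" -le "$rounds" ]; do
        crawl_round "$top_n" || return 1
        i=$((i + 1))
    done
    "$NUTCH_BIN" index -all
}

# Example use from inside runtime/local (not executed here):
#   crawl urls/seed.txt 2 40
```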

Automation

The commands above can be included in a script file (/scripts/scrape). Set up a cron job with crontab to run the script at the desired time interval.
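
One thing to watch with cron is a crawl that takes longer than the interval, so that two runs overlap. The sketch below is one simple way to guard against that inside /scripts/scrape, using a lock directory. The function name run_exclusive, the LOCK_DIR path, the run_crawl name and the log file path in the crontab example are all assumptions made for this example.

```shell
# Lock directory used to detect a run that is still in progress (hypothetical path).
LOCK_DIR="${LOCK_DIR:-/tmp/nutch-scrape.lock}"

# Run the given command unless a previous run still holds the lock.
run_exclusive() {
    # mkdir either creates the directory or fails, so it works as a simple lock
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "previous run still in progress, skipping" >&2
        return 0
    fi
    "$@"
    local status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}

# Example use inside /scripts/scrape (not executed here), where run_crawl is a
# function or script holding the Nutch commands:
#   run_exclusive run_crawl
#
# Example crontab entry (crontab -e) for a nightly run at 02:00:
#   0 2 * * * /scripts/scrape >> /var/log/nutch-scrape.log 2>&1
```

If the script is killed mid-run, the lock directory stays behind and has to be removed by hand before the next run will start.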

Install Elasticsearch

The version of ES that works with this setup is 1.7.3. Refer to this guide for installing ES:

http://mrstevenzhao.blogspot.com/2016/06/elasticsearch-install-on-linux.html

Tuesday, July 5, 2016

Install Apache Nutch 2.3.1 On Linux