Tuesday, July 5, 2016

Install Apache Nutch 2.3.1 On Linux

Introduction

Apache Nutch is an open-source web crawler.  It takes a list of seed URLs, generates fetch lists, downloads the pages, and then parses the content of each page by stripping HTML tags and extracting outlinks for the next crawl round.  It is one of the best-known open-source crawlers available today.  This guide is specifically written for version 2.3.1, the latest version as of this writing, installed on a CentOS 7 Linux server.

Requirements

The major difference between Nutch 1 and Nutch 2 is that Nutch 2 stores all crawl data in an external data store (via Apache Gora).  The data store used in this guide is Apache HBase, but other backends such as MongoDB are also supported.  Since Nutch 2.3.1 is currently distributed as source code only, not as a binary, you will need to compile it locally after adjusting the configuration parameters.  The build tool used is Apache Ant.

Install Java

If you don't already have Java installed, install the latest OpenJDK package using yum.

sudo -i
yum install java-1.8.0-openjdk.x86_64
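
To confirm the installation, check that Java is available on the path:

java -version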


Install Apache HBase

cd /tmp
wget http://archive.apache.org/dist/hbase/hbase-0.98.8/hbase-0.98.8-hadoop2-bin.tar.gz
cd /usr/share
tar zxf /tmp/hbase-0.98.8-hadoop2-bin.tar.gz

Edit: /usr/share/hbase-0.98.8-hadoop2/conf/hbase-site.xml

Insert this block of properties inside the <configuration> element to set up the storage location:

<property>
  <name>hbase.rootdir</name>
  <value>/usr/share/hbase-0.98.8-hadoop2/data/</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>false</value>
</property>

Commands:
/usr/share/hbase-0.98.8-hadoop2/bin/start-hbase.sh
/usr/share/hbase-0.98.8-hadoop2/bin/hbase shell

At this point, if you are taken to the HBase prompt, you have installed HBase successfully.  To further test, try listing existing tables and creating new ones.

> list
> create 'test','cf'

Install Apache Ant

yum install ant

Install Nutch

cd /tmp
wget http://apache.mesi.com.ar/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
cd /usr/share
tar zxf /tmp/apache-nutch-2.3.1-src.tar.gz


Edit /usr/share/apache-nutch-2.3.1/conf/nutch-site.xml and add the following inside the <configuration> element:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
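
Nutch also refuses to fetch pages unless an HTTP agent name is configured.  If your nutch-site.xml does not set one yet, add a property along these lines (the value here is just an example; pick your own crawler name):

<property>
  <name>http.agent.name</name>
  <!-- example value; choose a name identifying your crawler -->
  <value>MyNutchCrawler</value>
  <description>HTTP User-Agent name used by the fetcher</description>
</property>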



If you plan to index to Elasticsearch, also set the Elasticsearch cluster name in nutch-site.xml so it matches the cluster name in your Elasticsearch config.
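
For a local, default Elasticsearch 1.7.3 install, the indexer properties might look like the sketch below (the host, port, and cluster name are assumptions; adjust them to match your elasticsearch.yml):

<!-- sketch: values must match your Elasticsearch configuration -->
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>Must match cluster.name in elasticsearch.yml</description>
</property>
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>Elasticsearch transport port</description>
</property>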


Edit /usr/share/apache-nutch-2.3.1/ivy/ivy.xml
Ensure the following lines are there and uncommented:

<dependency org="org.apache.gora" name="gora-core" rev="0.6.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />


Edit /usr/share/apache-nutch-2.3.1/conf/gora.properties
Ensure this is there:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
 
Compile Nutch from source code:

cd /usr/share/apache-nutch-2.3.1
ant runtime

If the build fails:

ant clean

Fix the configuration files, then rebuild:

ant runtime


If the build succeeds, a compiled runtime folder is created under /usr/share/apache-nutch-2.3.1/runtime.
 

Nutch Commands

The following commands can be used to run through a simple scenario consisting of a web crawl and indexing to ES.

Make sure you run these from inside the runtime/local folder (/usr/share/apache-nutch-2.3.1/runtime/local).

mkdir urls
echo "https://en.wikipedia.org" > urls/seed.txt
bin/nutch inject urls/seed.txt
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all

The last command sends the crawl data over to Elasticsearch; it will not work until ES is installed and running.  A different but similar command is used for other indexers such as Solr.
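
Even before indexing, you can sanity-check that the crawl data landed in HBase by looking at the webpage table from the HBase shell (webpage is the default table name when no -crawlId is passed to the commands above):

/usr/share/hbase-0.98.8-hadoop2/bin/hbase shell
> list
> scan 'webpage', {LIMIT => 1}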

Automation

The commands above can be included in a script file (/scripts/scrape).  Set up a cron job to run the script at the desired interval, as sketched below.
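
A minimal sketch of such a script, assuming the layout used in this guide (the paths, -topN value, and 2 AM schedule are just examples):

#!/bin/bash
# /scripts/scrape - example crawl script; adjust paths and -topN to your setup
cd /usr/share/apache-nutch-2.3.1/runtime/local

bin/nutch inject urls/seed.txt
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all

Crontab entry (add it with crontab -e) to run the script every night at 2 AM:

0 2 * * * /scripts/scrape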

Install Elasticsearch

The Elasticsearch version that works with this setup is 1.7.3.  Please refer to the guide here for installing ES:

http://mrstevenzhao.blogspot.com/2016/06/elasticsearch-install-on-linux.html

 
