Friday, May 13, 2016

Apache Solr Install w/ MongoDB Indexing

Introduction

Solr is an awesome search platform.  Built on the trusted and beloved Lucene library, Solr offers everything Lucene offers and more: replication and sharding for horizontal scaling, faceted filtering of results, and multi-core separation.  SolrCloud is a Solr deployment mode with sharding in place for real-time distributed reading and writing across a farm of servers, designed to handle extra storage and traffic.  Traditional Solr setups can also be distributed, but only in the form of replication, where multiple slave nodes periodically pull from the master index and all search queries are served by the slaves.  That is pull replication, whereas SolrCloud implements real-time push replication.


Solr vs ElasticSearch (ES)

Both are Java-based and both are built on Lucene, so what sets them apart?  For the most part they are very similar, with a few differences.  For the average business case that needs a highly scalable search platform, either one will be fine.  These are some of the differences between the two platforms.

  1. ES is a newer platform, which gives it the vibe of being more modern, and distributed operation is handled out-of-the-box with minimal configuration and setup steps.  SolrCloud requires a separate coordination system such as Apache ZooKeeper to keep all the shards in sync.  This means additional setup steps compared to ES, which ships with its built-in Zen Discovery module and can be distributed with very little effort.  With this in mind, you could say Solr setups are for more serious developers, since more in-depth knowledge is required to get things up and running.  Conversely, ES setups are known to develop long-term problems: because the initial setup is not very demanding and "anyone can get it up and running," the people running them may lack the knowledge to maintain them.
  2. Solr is an older product by a few years, and the community reflects that: it is bigger and more established, with more resources available on the web to address issues you may encounter.  Solr also has better official documentation.
  3. ES has a more robust analytics suite, and that data proves useful for purposes such as marketing reports.


Steps To Install Solr 6.0.0 in Linux

  1. yum -y update
    Ensure everything is up to date first
  2. java -version
    Check java version
  3. yum list available java*
    Check all available java versions in the YUM package manager
  4. yum install java-1.8.0-openjdk.x86_64
    Install Java 8 (OpenJDK) if it is not installed already
  5. cd /tmp
    Change the working directory to temp to prepare for download
  6. wget http://apache.org/dist/lucene/solr/6.0.0/solr-6.0.0.tgz
    Download the solr install
  7. tar -zxvf solr-6.0.0.tgz
    Uncompress file
  8. cd solr-6.0.0
  9. bin/install_solr_service.sh /tmp/solr-6.0.0.tgz
    Run the installer script as root to install Solr as a service

    The Solr service is now installed. The installation dir is /opt/solr and the data dir is /var/solr.

    A solr user and group are created by the script; they own the service and are used to create Solr cores.
  10. View admin web UI here:

    http://localhost:8983/solr/

    or

    http://IPADDRESS:8983/solr/ (remotely)

    For remote access this may require updating the firewall to open the default port, 8983.
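
    On a system using firewalld (an assumption; adapt for iptables or whatever your distribution runs), opening the port might look like this:

    sudo firewall-cmd --permanent --add-port=8983/tcp
    sudo firewall-cmd --reload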

Create Sample Core and Load Documents

  1. sudo chown -R solr:solr /var/solr
    Make the solr user the owner of the data dir
  2. cd /opt/solr
  3. su - solr
    Switch to the solr user
  4. bin/solr create -c documents
    Create the documents core; if the permissions are correct you will not get errors
  5. bin/post -c documents docs
    Load the core with the bundled test HTML docs (Solr's own documentation); skip this step outside of testing
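
To sanity-check the load, you can query the core directly over HTTP (a minimal example, assuming the documents core and the default port):

  curl "http://localhost:8983/solr/documents/select?q=*:*&rows=1"

The response should report a numFound value greater than zero if the documents were indexed.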

Operations

All configuration data points reside in the core's folder in the file system.  In our case, the path is:

/var/solr/data/CORE_NAME/conf

Changes to the config files will only take effect after restarting Solr.  These are the commands:
  1. service solr start
  2. service solr stop
  3. service solr restart
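
If a change only affects a single core, you can alternatively reload just that core through the CoreAdmin API instead of restarting the whole service (a minimal example, assuming the documents core created earlier):

  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=documents"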
 

Data Import Handler and MongoDB

The example above is a simple scenario where static files on the same file system are pulled in and indexed.  In a real environment, you may have to index data records from a relational database or even a NoSQL database such as MongoDB.  This is done using the DataImportHandler.  It supports indexing remote files and database records, along with other sources, but there is no native support for indexing Mongo collections and documents; at this time a custom solution is needed.  You will need to download the latest versions of the following JAR files:

  1. solr-dataimporthandler-x.x.x.jar 
  2. solr-mongo-importer-x.x.x.jar 
  3. mongo-java-driver-x.x.jar
 
These files have to be dropped into:

/opt/solr/dist

The core configuration files are at /var/solr/data/CORE_NAME/conf.
In this folder are XML files that define the system behavior.  You have to make the system aware of the JAR files that were dropped in by updating the solrconfig.xml file.  Add these alongside the other lib declarations:

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-mongo-importer-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="mongo-java-driver-.*\.jar" />
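
Before restarting, a quick way to confirm the JARs are in place (filenames vary by version):

  ls /opt/solr/dist | grep -E 'dataimporthandler|mongo'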


The schema file is managed-schema (note: in Solr 6 it has no .xml extension).  This file is where all field names and definitions are stored.  Add the custom fields for the indexes here.  They will look something like this:

<field name="firstName" type="string" indexed="true" stored="true"/>
<field name="email" type="string" indexed="true" stored="true"/>
etc... 
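
Because the schema is managed, you can alternatively add fields through the Schema API instead of editing the file by hand (a sketch, assuming the documents core from earlier):

  curl -X POST -H 'Content-type:application/json' \
    --data-binary '{"add-field":{"name":"firstName","type":"string","indexed":true,"stored":true}}' \
    http://localhost:8983/solr/documents/schema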

In the solrconfig.xml file you will need to indicate that a DataImportHandler will be used to handle external data imports. Please add:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

In the same conf directory, create a new data-config.xml file and put in:

<dataConfig>
<dataSource name="MongoSource" type="MongoDataSource" host="localhost" port="27017" database="dbname" username="userid" password="password"/>
<document name="import">
     <entity processor="MongoEntityProcessor"            
             datasource="MongoSource"
             transformer="MongoMapperTransformer"
             name="Users"
             collection="users"
             query="">

            <field column="_id"  name="id" mongoField="_id"/>  
            <field column="email"  name="email" mongoField="email"/>              
            <field column="firstName" name="firstName" mongoField="firstName"/> 
            <field column="lastName" name="lastName" mongoField="lastName"/>
           
            <entity name="Address"
              processor="MongoEntityProcessor"
              query="{'email':'${User.email}'}"
              collection="users"
              datasource="MongoSource"
              transformer="script:addressDataTransformer">
            </entity>
           
            <entity name="Collection"
              processor="MongoEntityProcessor"
              query="{'email':'${User.email}'}"
              collection="users"
              datasource="MongoSource"
              transformer="script:userCollectionDataTransformer">
            </entity>            
                   
       </entity>
      
       <entity processor="MongoEntityProcessor"            
             datasource="MongoSource"
             transformer="MongoMapperTransformer"
             name="Jobs"
             collection="jobs"
             query="">

            <field column="_id"  name="id" mongoField="_id"/>  
            <field column="location"  name="location" mongoField="location"/>    
            <field column="title"  name="title" mongoField="title"/>    
            <field column="description"  name="description" mongoField="description"/>                   

            <entity name="Collection"
              processor="MongoEntityProcessor"
              query=""
              collection="jobs"
              datasource="MongoSource"
              transformer="script:jobCollectionDataTransformer">
            </entity>
                   
       </entity>
 </document>

 <script><![CDATA[
function addressDataTransformer(row){
    // Flatten the nested "address" sub-document onto the top-level row
    // so each part can be indexed as its own Solr field.
    var ret = row;

    if (row.get("address") !== null) {
        var address = row.get("address");
        if (address.get("address1") !== null) {
            ret.put("address1", address.get("address1").toString());
        }
        if (address.get("address2") !== null) {
            ret.put("address2", address.get("address2").toString());
        }
        if (address.get("city") !== null) {
            ret.put("city", address.get("city").toString());
        }
        if (address.get("state") !== null) {
            ret.put("state", address.get("state").toString());
        }
        if (address.get("zip") !== null) {
            ret.put("zip", address.get("zip").toString());
        }
    }
    return ret;
}

function userCollectionDataTransformer(row){
    // Tag each user row with a static "collection" field value.
    var ret = row;
    ret.put("collection", "users");
    return ret;
}

function jobCollectionDataTransformer(row){
    // Tag each job row with a static "collection" field value.
    var ret = row;
    ret.put("collection", "jobs");
    return ret;
}
]]></script>
</dataConfig>



This example is a fairly complex one.  It shows flat one-to-one fields from a Mongo document mapped to index fields, static field values added via transformers, and custom script transformers that map nested Mongo document fields to index fields.
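
For reference, here is a hypothetical shape of a users document that the address transformer above assumes (field names are illustrative, shown as a mongo shell insert):

  db.users.insert({
      email: "jdoe@example.com",
      firstName: "John",
      lastName: "Doe",
      address: {
          address1: "123 Main St",
          address2: "Apt 4",
          city: "Springfield",
          state: "IL",
          zip: "62704"
      }
  });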


At this point the DataImportHandler should be set up. Load up the web UI; if the Data Import section does not display an error message, there are no errors in the configuration files. The only thing needed now is to tweak the schema field values and configuration data points to customize system behavior and performance.


Automating Data Import

Now that the DataImportHandler is set up to import correctly, the final step is to schedule it to run periodically.  The simplest way to do this is to hit the URL that triggers a full or delta import, like this:

http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import

In a Linux environment, you can create a crontab task that curls this URL at a scheduled interval.  The Data Import section of the web UI will also show the last time the index was updated, whether the import was triggered via the UI or via a URL request.
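
For example, a crontab entry (added via crontab -e) that kicks off a full import every night at 2 AM might look like this (CORE_NAME and the schedule are placeholders to adapt):

  # Nightly full import at 2 AM; output is discarded
  0 2 * * * curl -s "http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import" > /dev/null 2>&1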


Summary

Solr is a very powerful tool, and the advantage of pairing Solr with MongoDB is that you separate the data store from the search engine.  The data store can focus on collecting data while the search engine focuses on queries and indexing.  You can also scale each component separately, which is essential for data and resource redundancy.
