Friday, May 13, 2016

Apache Solr Install w/ MongoDB Indexing

Introduction

Solr is an awesome search platform.  Built on the trusted and beloved Lucene library, Solr offers everything Lucene offers and more: replication and sharding for horizontal scaling, faceted filtering of results, and multi-core separation.  SolrCloud is a Solr deployment mode with sharding in place for real-time distributed reading and writing across a farm of servers, designed to handle extra storage and traffic.  Traditional Solr setups can also be distributed, but only in the form of replication, where multiple slave nodes periodically pull from the master index and all search queries are served by the slaves.  That is pull replication, whereas SolrCloud implements real-time push replication.


Solr vs ElasticSearch (ES)

Both are Java-based and both are built on Lucene, so what sets them apart?  For the most part they are very similar, with a few differences.  For the average business case that needs a highly scalable search platform, either one will be fine.  These are some of the differences between the two platforms.

  1. ES is a newer platform, which gives it the vibe of being more modern, and distributed operation is handled out-of-the-box with minimal configuration and setup steps.  SolrCloud requires a separate coordination system such as Apache ZooKeeper to keep all the shards in sync.  This means additional setup steps compared to ES, which ships with its built-in Zen Discovery module and can be distributed with very little effort.  With this in mind, you could say Solr setups are for more serious developers, since more in-depth knowledge is required to get things up and running.  Conversely, ES setups are known to develop long-term problems: because the initial setup is not very demanding and "anyone can get it up and running," the people running them may lack the knowledge to maintain them.
  2. Solr is an older product by a few years, and the community reflects that: it is bigger and more established, with more resources available on the web to address issues you may encounter.  Solr also has better official documentation.
  3. ES has a more robust analytics suite, and that data proves useful for purposes such as marketing reports.


Steps To Install Solr 6.0.0 in Linux

  1. yum -y update
    Ensure everything is up to date first
  2. java -version
    Check java version
  3. yum list available java*
    Check all available java versions in the YUM package manager
  4. yum install java-1.8.0-openjdk.x86_64
    Install Java 8 (OpenJDK) if it is not installed already
  5. cd /tmp
    Change the working directory to temp to prepare for download
  6. wget http://apache.org/dist/lucene/solr/6.0.0/solr-6.0.0.tgz
    Download the solr install
  7. tar -zxvf solr-6.0.0.tgz
    Uncompress file
  8. cd solr-6.0.0
  9. bin/install_solr_service.sh /tmp/solr-6.0.0.tgz
    Run the installer script as root to install Solr as a service

    The Solr service is now installed. The installation dir is /opt/solr and the data dir is /var/solr.

    A solr user and group are created by the script; they own the service and are used to create Solr cores.
  10. View admin web UI here:

    http://localhost:8983/solr/

    or

    http://IPADDRESS:8983/solr/ (remotely)

    For remote access this may require updating the firewall to open the default port, 8983.
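
    On a system using firewalld (an assumption; adapt for iptables or whatever your distribution runs), opening the port might look like this:

    sudo firewall-cmd --permanent --add-port=8983/tcp
    sudo firewall-cmd --reload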

Create Sample Core and Load Documents

  1. sudo chown -R solr:solr /var/solr
    Make the solr user the owner of the data dir
  2. cd /opt/solr
  3. su - solr
    Switch to the solr user
  4. bin/solr create -c documents
    Create the documents core; if the permissions are correct you will not get errors
  5. bin/post -c documents docs
    Load the core with the bundled test HTML docs (Solr's own documentation); skip this step outside of testing
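
To sanity-check the load, you can query the core directly over HTTP (a minimal example, assuming the documents core and the default port):

  curl "http://localhost:8983/solr/documents/select?q=*:*&rows=1"

The response should report a numFound value greater than zero if the documents were indexed.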

Operations

All configuration data points reside in the core's folder in the file system.  In our case, the path is:

/var/solr/data/CORE_NAME/conf

Changes to the config files will only take effect after restarting Solr.  These are the commands:
  1. service solr start
  2. service solr stop
  3. service solr restart
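
If a change only affects a single core, you can alternatively reload just that core through the CoreAdmin API instead of restarting the whole service (a minimal example, assuming the documents core created earlier):

  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=documents"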
 

Data Import Handler and MongoDB

The example above is a simple scenario where static files on the same file system are pulled in and indexed.  In a real environment, you may have to index data records from a relational database or even a NoSQL database such as MongoDB.  This is done using the DataImportHandler.  It supports indexing remote files and database records, along with other sources, but there is no native support for indexing Mongo collections and documents; at this time a custom solution is needed.  You will need to download the latest versions of the following JAR files:

  1. solr-dataimporthandler-x.x.x.jar 
  2. solr-mongo-importer-x.x.x.jar 
  3. mongo-java-driver-x.x.jar
 
These files have to be dropped into:

/opt/solr/dist

The core configuration files are at /var/solr/data/CORE_NAME/conf.
In this folder are XML files that define the system behavior.  You have to make the system aware of the JAR files that were dropped in by updating the solrconfig.xml file.  Add these alongside the other lib declarations:

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-mongo-importer-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="mongo-java-driver-.*\.jar" />
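
Before restarting, a quick way to confirm the JARs are in place (filenames vary by version):

  ls /opt/solr/dist | grep -E 'dataimporthandler|mongo'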


The schema file is managed-schema (note: in Solr 6 it has no .xml extension).  This file is where all field names and definitions are stored.  Add the custom fields for the indexes here.  They will look something like this:

<field name="firstName" type="string" indexed="true" stored="true"/>
<field name="email" type="string" indexed="true" stored="true"/>
etc... 
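
Because the schema is managed, you can alternatively add fields through the Schema API instead of editing the file by hand (a sketch, assuming the documents core from earlier):

  curl -X POST -H 'Content-type:application/json' \
    --data-binary '{"add-field":{"name":"firstName","type":"string","indexed":true,"stored":true}}' \
    http://localhost:8983/solr/documents/schema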

In the solrconfig.xml file you will need to indicate that a DataImportHandler will be used to handle external data imports. Please add:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

In the same conf directory, create a new data-config.xml file and put in:

<dataConfig>
<dataSource name="MongoSource" type="MongoDataSource" host="localhost" port="27017" database="dbname" username="userid" password="password"/>
<document name="import">
     <entity processor="MongoEntityProcessor"            
             datasource="MongoSource"
             transformer="MongoMapperTransformer"
             name="Users"
             collection="users"
             query="">

            <field column="_id"  name="id" mongoField="_id"/>  
            <field column="email"  name="email" mongoField="email"/>              
            <field column="firstName" name="firstName" mongoField="firstName"/> 
            <field column="lastName" name="lastName" mongoField="lastName"/>
           
            <entity name="Address"
              processor="MongoEntityProcessor"
              query="{'email':'${User.email}'}"
              collection="users"
              datasource="MongoSource"
              transformer="script:addressDataTransformer">
            </entity>
           
            <entity name="Collection"
              processor="MongoEntityProcessor"
              query="{'email':'${User.email}'}"
              collection="users"
              datasource="MongoSource"
              transformer="script:userCollectionDataTransformer">
            </entity>            
                   
       </entity>
      
       <entity processor="MongoEntityProcessor"            
             datasource="MongoSource"
             transformer="MongoMapperTransformer"
             name="Jobs"
             collection="jobs"
             query="">

            <field column="_id"  name="id" mongoField="_id"/>  
            <field column="location"  name="location" mongoField="location"/>    
            <field column="title"  name="title" mongoField="title"/>    
            <field column="description"  name="description" mongoField="description"/>                   

            <entity name="Collection"
              processor="MongoEntityProcessor"
              query=""
              collection="jobs"
              datasource="MongoSource"
              transformer="script:jobCollectionDataTransformer">
            </entity>
                   
       </entity>
 </document>

 <script><![CDATA[
function addressDataTransformer(row){
    // Flatten the nested "address" sub-document onto the top-level row
    // so each part can be indexed as its own Solr field.
    var ret = row;

    if (row.get("address") !== null) {
        var address = row.get("address");
        if (address.get("address1") !== null) {
            ret.put("address1", address.get("address1").toString());
        }
        if (address.get("address2") !== null) {
            ret.put("address2", address.get("address2").toString());
        }
        if (address.get("city") !== null) {
            ret.put("city", address.get("city").toString());
        }
        if (address.get("state") !== null) {
            ret.put("state", address.get("state").toString());
        }
        if (address.get("zip") !== null) {
            ret.put("zip", address.get("zip").toString());
        }
    }
    return ret;
}

function userCollectionDataTransformer(row){
    // Tag each user row with a static "collection" field value.
    var ret = row;
    ret.put("collection", "users");
    return ret;
}

function jobCollectionDataTransformer(row){
    // Tag each job row with a static "collection" field value.
    var ret = row;
    ret.put("collection", "jobs");
    return ret;
}
]]></script>
</dataConfig>



This example is a fairly complex one.  It shows flat one-to-one fields from a Mongo document mapped to index fields, static field values added via transformers, and custom script transformers that map nested Mongo document fields to index fields.
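
For reference, here is a hypothetical shape of a users document that the address transformer above assumes (field names are illustrative, shown as a mongo shell insert):

  db.users.insert({
      email: "jdoe@example.com",
      firstName: "John",
      lastName: "Doe",
      address: {
          address1: "123 Main St",
          address2: "Apt 4",
          city: "Springfield",
          state: "IL",
          zip: "62704"
      }
  });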


At this point the DataImportHandler should be set up. Load up the web UI; if the Data Import section does not display an error message, there are no errors in the configuration files. The only thing needed now is to tweak the schema field values and configuration data points to customize system behavior and performance.


Automating Data Import

Now that the DataImportHandler is set up to import correctly, the final step is to schedule it to run periodically.  The simplest way to do this is to hit the URL that triggers a full or delta import, like this:

http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import

In a Linux environment, you can create a crontab task that curls this URL at a scheduled interval.  The Data Import section of the web UI will also show the last time the index was updated, whether the import was triggered via the UI or via a URL request.
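
For example, a crontab entry (added via crontab -e) that kicks off a full import every night at 2 AM might look like this (CORE_NAME and the schedule are placeholders to adapt):

  # Nightly full import at 2 AM; output is discarded
  0 2 * * * curl -s "http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import" > /dev/null 2>&1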


Summary

Solr is a very powerful tool, and the advantage of pairing Solr with MongoDB is that you separate the data store from the search engine.  The data store can focus on collecting data while the search engine focuses on queries and indexing.  You can also scale each component separately, which is essential for data and resource redundancy.
