Quick guide on how to index data from IBM Domino databases in the Apache Solr search engine


Full Text indexing has always been a great feature of IBM Notes and Domino. In the old days it was rare to see other systems have Full Text Indexing and it was really a unique and useful feature of IBM Notes and Domino.


Unfortunately for IBM Notes and Domino two things changed that advantage.
- IBM really did not keep on improving the Full Text engine, a new engine arrived in release 5, but since then only minor upgrades have been done to the engine.
- New search engine appeared with Doug Cutting created the Java based Full Text engine Lucene. By 2001 it was part of Apache as Open Source and has grown ever since then.
It has since been the foundation or inspiration of most Full Text engines.

Apache Solr is an Enterprise Search Engine based on Lucene. It is free and Open Source, so why not have a look at it?.

At IBM Connect 2014 IBM announced that Solr will be used to index mail database in a later major release...what ever that means...

I will give you a quick tutorial to set it up and have it running on some of your Domino data fast.

What is Apache Solr?


From their website:

"SolrTM is the popular, blazing fast open source enterprise search platform from the Apache LuceneTM project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. .."

The goal of this quick tutorial.

- setting up a Solr server for testing
- have Solr index a Domino database
- query the data

1. The IBM Domino database to be indexed
The database to be index will be a simple web enabled database.

One form which has 3 fields

Subject - a Text field
Body - a Rich Text field
Attachments - a Rich Text field for attachments

2. Installation of Apache Solr
For this tutorial I will take the easy route.
This means downloading Apache Solr 4.10 and just use a modified version of the included example.
Start by downloading Solr here at http://lucene.apache.org/solr/.
Unpack the files files anywhere you want.
In this tutorial I will just run it on my desktop pc.

3. Preparing Solr for Domino data
The most important files in Solr are the files schema.xml and solrconfig.xml.

solr-conf.jpg

Solr can handle data as dynamic or static fields. We will define are few static fields

In the schema.xml file we will add

<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="body" type="text_general" indexed="true" stored="true"/>
<field name="docurl" type="string" indexed="true" stored="true"/>
<field name="domino_doc_type" type="string" indexed="true" stored="true"/>

the field ID is very important, it is the key of the Solr document. We will use the UNID of the Domino as the ID in Solr documents

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

4. Getting data from IBM Domino into Solr
There are different ways of getting data into Solr.
You can use the Data Import Handler (DIH) which is part of Solr and import data from RDMS, XML, data from file systems or websites, You can use Apache Nutch, Apache Manifold and more.

What I will use is the Java API called SolrJ.

For Domino documents the steps are:

- Connect to the Solr server.
- for each Domino document create a Solr document and fill in the fields with data from the Domino document
- if the Domino document has attachments upload them to the server
- for every 1000 documents commit the Solr documents to the server

I will be using DIIOP to get the data from the Domino database. This is not exactly the fastest way to get data but for the this purpose it is fine.

- Connect to server

public class SolrImporter {
Collection<SolrInputDocument> solrDocs;
HttpSolrServer solrServer;
String solrUrl = "http://127.0.0.1:8983/solr/";

public void init() {
solrServer = new HttpSolrServer(solrUrl);
solrServer.setSoTimeout(10000);
solrServer.setConnectionTimeout(10000);
solrServer.setDefaultMaxConnectionsPerHost(100);
solrServer.setMaxTotalConnections(100);
solrServer.setFollowRedirects(false);
solrServer.setAllowCompression(true);
solrServer.setMaxRetries(1);

try {
solrServer.deleteByQuery("*:*");// delete everything!
} catch (SolrServerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}

In this demo every time I import data, ... all data in Solr is deleted first.
In real life you would of course update Solr documents by its ID.

- Creating the Solr documents:

In Solr data is saved in documents, which are much like Domino documents.

SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", doc.getUniversalID(), 1.0f);
solrDoc.addField("subject", doc.getItemValueString("subject"), 1.0f);
solrDoc.addField("body", ((RichTextItem)doc.getFirstItem("Body")).getFormattedText(false, 0, 0), 1.0f);
solrDoc.addField("docurl", "http://jezzper.com/"+ db.getFilePath()+"/0/"+doc.getUniversalID());
solrDoc.addField("domino_doc_type","document");

//add to collection of docs to be submitted
boolean result = solrDocs.add(solrDoc);

Notice the last parameter, it gives you the possibility to "boost" the value (greater weight) in the search index or the opposite. A number larger than 1 boosts

For performance sake don't commit after every Solr document.
Here I commit the document for every 1000 documents to the Solr server

if ((i % 1000) == 0) {
solrServer.add(solrDocs);
solrServer.commit();
solrDocs.clear();
}

- For attachments I will use another appraoch:
For each attachment, I will extract the attachment to the file system and then upload the file to the Solr server

while (e.hasMoreElements()) {
EmbeddedObject eo = (EmbeddedObject) e.nextElement();
if (eo.getType() == EmbeddedObject.EMBED_ATTACHMENT) {
try {
eo.extractFile("c:\\extracts\\" + eo.getSource());
} catch (Exception e2) {
e2.printStackTrace();
}
try {
AttachmentImporter.Upload(solrServer,"c:\\extracts\\" + eo.getSource(), doc.getUniversalID() + "."+eo.getSource());
} catch (Exception e1) {
e1.printStackTrace();
}
}
}

Solr uses Apache Tika which can detect and extract metadata and text content from various document types.

We will send the attachments to the Solr server which will detect the Content Type and automatically extract Meta data and content from the files.

The benefit of this is that it is very easy to do, but on the other hand you do not want to send a 2 GB AVI file to the server just to get the Meta data extracted.
In that case you might want to consider a solution where you extract the meta data yourself and only save these in a Solr document.

For ID we will use the Domino UNID +"."+attachment Internal name

public static void Upload(HttpSolrServer solrServer,String fileName,String id) throws IOException, SolrServerException {
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.setParam("literal.id", id);
req.setParam("fmap.content", "text");
req.setParam("commit","true");
req.setParam("literal.attachment_name",fileName.substring(fileName.lastIndexOf('\\') + 1));
req.setParam("literal.domino_doc_type","attachment");

File attFile =new File(fileName);
// empty content type to let Tika itself find the content type
req.addFile(attFile,"");

//new test parm
req.setParam("uprefix", "attr_");
req.setParam("fmap.content", "attr_content");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

// workaround, this must be done to be able to delete tmp attachment file in Windows
req=null;
System.gc();
System.out.println( "Is attachment deleted from tmp directory? " + attFile.delete());
}

5. The Solr administrator console
You can mange, analyze and query the Solr server from a browser at http://localhost:8983/solr/



Select the core called "Collection1" and you will get the menu to analyze and query the data



6. Query the data
The basic form of query in Solr is

Field:Seach Text

so Body:pdf means search for text pdf in the body field

The response result can be served in many formats JSON, XML, CSV etc. when querying the Solr server.
Another way is to use SolrJ again query and get the result and handle it in for example in an Xpage, or use the JSON output in Dojo.

You can get can the data from SolrJ into the Solr server as Java Beans, and also get the result back as Java Beans

An example of a document returned as JSON in a result :

{ "id": "4A4DF8CF83AFFC8FC1257D6000416FCB",
"subject": "Elvis Costello",
"body": "Elvis Costello was in Denmark and visited the Elvis Presly Graceland museum in \nRanders\n\nA picture of Elvis Costello ",
"docurl": "http://jezzper.com/jezzper/Solr.nsf/0/4A4DF8CF83AFFC8FC1257D6000416FCB",
"domino_doc_type": "document",
"_version_": 1480501411752968200
},

A example of all 7 documents returned (3 Domino documents and 4 attachments). :

Search *:* gives:

As XML: Search result.xmlSearch result.xml
As JSON Search result.JSONSearch result.JSON


This was just quick example of getting up and running to play with Domino data in Solr.
There is much more to Solr so I will probably do some more blogging on Solr over the next months, if I can find the time.

Downloads
Domino data export to Solr Search Engine Source Code
Libraries needed for SolJ 4.10
Demo Domino database for Solr indexing


Posted on 09/27/2014 03:26:16 PM CEDT