CCM/FileNet search index fails in IBM Connections 4.5 due to special character
The customer told me that his search index never completed correctly when Connections was initially deployed and now users are complaining that search results do not contain CCM documents.
The customer had tried recreating the index but to no avail and called me to take a look.
I first enabled trace on one of the infrastructure nodes (*=info:com.ibm.connections.search.index.indexing.*=all:com.ibm.connections.search.seedlist.*=all:com.ibm.connections.httpClient.*=all:com.ibm.connections.search.index.indexing.EcmFilesIndexer=all) as detailed in http://www-01.ibm.com/support/docview.wss?uid=swg21636559
I then created a background index, as detailed in "Creating a background index", and tailed the trace.log and SystemOut.log. To create the background index I ran the following commands on the Windows server.
wsadmin.bat -lang jython -username wasadmin -password ********
SearchService.startBackgroundIndex("c:/IBM/Connections/background/crawl", "c:/IBM/Connections/background/extracted", "c:/IBM/Connections/background/index", "ecm_files")
I found that the indexing process finished abruptly about 3500 documents in (with another 6500 odd remaining).
[10/09/14 09:15:59:293 BST] 0000007a SeedlistPagin < com.ibm.connections.search.seedlist.parser.impl.SeedlistPaginationHandler resolve RETURN https://connections.acme.com/dm/atom/library/8DB6D184-AAF5-41F3-A28D-D1B7BEF17967%3BC11D230C-66A5-4CEB-8906-EAB19DFE0B8D/document/%7B5DEBC165-CDF6-4672-8300-A3345507867F%7D/media/%33%35%20%28%32%30%31%34%29%20%34%33%2d%38%35%20%54%68%65%20%53%79%73%74%65%6d%73%20%54%61%6e%74%6164%66?follow=true
[10/09/14 09:15:59:293 BST] 0000007a SystemErr R [Fatal Error] :23466:346: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
[10/09/14 09:15:59:293 BST] 0000007a SeedlistEntry 2 com.ibm.connections.search.seedlist.crawler.impl.SeedlistEntryIterator hasNext CLFRW0063E: SAX parser error.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
I took the URL (which has been edited), logged in using an administrative account and was served a PDF. I initially believed that the contents of the document must have caused the problem, so I uploaded the same document to a 4.5 CR4 server I run in the lab, but I couldn't reproduce the problem.
I raised a PMR and they came back and said that the problem was likely to be due to a special character in the description and not in the document itself.
I looked at the trace.log and found a reference to the seedlist XML that was being processed at the time.
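To illustrate why a single character in a description field can stop the crawl, here is a minimal sketch in Python. It uses Python's bundled parser rather than the Xerces parser that appears in the stack trace, and the element name and sample text are made up for the example, but the rule is the same: XML 1.0 does not allow the control character 0x02 in element content, so the parser rejects the whole document.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment):
    """Return True if the fragment is well-formed XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical description values, not taken from the customer's seedlist.
print(parses_as_xml("<description>Systems report</description>"))      # True
print(parses_as_xml("<description>Systems\x02report</description>"))   # False
```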
[10/09/14 09:52:26:121 BST] 0000007a SeedlistPersi > com.ibm.connections.search.seedlist.crawler.impl.SeedlistPersistenceManager getSeedlistDirs ENTRY ecm_files
[10/09/14 09:52:26:121 BST] 0000007a SeedlistPersi < com.ibm.connections.search.seedlist.crawler.impl.SeedlistPersistenceManager getSeedlistDirs RETURN ecm_files, [c:IBMConnectionsbackgroundcrawlseedlists-ecm_files-initial-1410267828454]
[10/09/14 09:52:26:121 BST] 0000007a SeedlistPersi < com.ibm.connections.search.seedlist.crawler.impl.SeedlistPersistenceManager getSeedlistDir RETURN c:IBMConnectionsbackgroundcrawlseedlists-ecm_files-initial-1410267828454
[10/09/14 09:52:26:121 BST] 0000007a SeedlistFetch 3 seedlistFile = [c:IBMConnectionsbackgroundcrawlseedlists-ecm_files-initial-14102678284541410267828454-00007.xml]
[10/09/14 09:52:26:121 BST] 0000007a SeedlistFetch 2 Retrieving seedlist content: https://connections.acme.com/dm/atom/seedlist/myserver?useLocalFS=true&Start=3500&Action=GetDocuments&Format=xml&Range=500
[10/09/14 09:52:26:121 BST] 0000007a SeedlistFetch 3 Retrieving seedlist from file: 1410267828454-00007.xml
I opened the XML in Notepad++, searched for the document name which I had obtained from the URL previously, and found a match. One of the fields contained the invalid control character.
I told the customer which community and library the document resided in, and the customer confirmed that the description data could not be viewed in the web browser. The customer then edited the field via the FileNet interface, and once the special character was removed the data showed in the web browser.
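Searching by eye works for one document, but a small script can list every offending character in a seedlist file in one pass. The following is a rough sketch, not something I ran at the customer: the function names are my own, it assumes the seedlist files are UTF-8, and the line and column numbers it reports may not match the parser's message exactly.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are allowed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (line, column, code point) for each character XML 1.0 forbids."""
    hits = []
    # split("\n") rather than splitlines(), which would also split on some of
    # the control characters we are looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHARS.finditer(line):
            hits.append((line_no, match.start() + 1, ord(match.group())))
    return hits


def scan_seedlist_file(path):
    """Scan one seedlist XML file (assumed to be UTF-8) for invalid characters."""
    # newline="" leaves line endings untouched so line numbers follow the file.
    with open(path, encoding="utf-8", errors="replace", newline="") as handle:
        return find_invalid_xml_chars(handle.read())
```

Calling `scan_seedlist_file` on each XML file in the crawl directory would show all the bad descriptions up front, instead of finding them one failed index run at a time.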
To check whether the index would build correctly after this change, I ran the background index again but wrote the files to a new location. If you run the command again against the same location as the initial background index, it will fail because the seedlist will not have been recreated and the original special character is retained.
To speed things up, copy the extracted files from the previous location to the new extracted location. This customer had over ten thousand CCM documents, so extracting them all again was time consuming.
I had to repeat this process four times before all the special characters were removed. Once I had an INDEX.READY file, I repeated the process for all the applications by copying over the extracted files and running SearchService.startBackgroundIndex("c:/IBM/Connections/background/crawl", "c:/IBM/Connections/background/extracted", "c:/IBM/Connections/background/index", "all_configured"), which built an index successfully.
I then used the steps in the IBM wiki to replace the current index with the new one.
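The copy can be done in Explorer, but if you expect to repeat it several times a short script saves some clicking. This is only a sketch: the function name is my own, the directory names in the commented example are hypothetical, and it needs Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def reuse_extracted_files(previous_dir, new_dir):
    """Copy previously extracted text into the new run's extracted directory.

    Returns the number of files present in the new directory afterwards.
    """
    src = Path(previous_dir)
    dst = Path(new_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no extracted files at {src}")
    # dirs_exist_ok lets the copy proceed if the new directory already exists.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())


# Example call with hypothetical locations for the first and second runs:
# reuse_extracted_files("c:/IBM/Connections/background/extracted",
#                       "c:/IBM/Connections/background2/extracted")
```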
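When repeating the run several times, it helps to check for the INDEX.READY marker rather than hunting through folders. The sketch below searches everything under a given root, because I have not stated above exactly where the marker is written. The function name is my own.

```python
from pathlib import Path


def find_index_ready(root):
    """Return every INDEX.READY marker file found below root, sorted by path."""
    return sorted(p for p in Path(root).rglob("INDEX.READY") if p.is_file())


# Example call against the background index root used in this post:
# find_index_ready("c:/IBM/Connections/background")
```

An empty list means the run has not produced a marker yet, or failed before finishing.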
It turns out that the customer used a scripted import facility to import all the documents into CCM and this process introduced these characters.
Sep 22, 2014