Monday's post prompted some fascinating feedback and discussion. Predictably, one of the questions raised was, "but what about search?" The quick answer is that, instead of settling for the "one size fits all" approach of using Domino's full-text indexer, this presents an opportunity to tailor a search methodology to each specific application, providing fine-grained control over how relevance is determined... instead of just an "educated guess".
As just one example of how this can be extremely useful, I'll detail how search will be handled in watrCoolr, even though the implementation is entirely in my head at this point... and, therefore, entirely theoretical. As such, it's quite possible this simply won't work, but I'd honestly be rather surprised if it didn't.
One of the properties of the Chat class is called "transcript", and it's a ConcurrentSkipListMap (basically, just a TreeMap that's thread-safe). Because this class implements SortedMap (the entry key is the millisecond timestamp of each message, represented as a Long), we can bind a repeat control directly to the values... the storage mechanism itself ensures it always stays correctly sorted, even if we're adding messages later that were replicated from other servers. Each value in the map is a Message object that stores only a few properties: timestamp, author, content, and locale code (e.g. "en-us") based on the browser preferences when the message is posted. So within the repeat, we can just bind three controls directly to the properties of the message instance: a computed date field, a name field, and a rich text field. And, for bonus points, surround the whole thing in a div with a "lang" attribute set to the locale code. So if a user is accessing the app via Chrome, and messages are being posted in numerous languages, they can just right click the web page, select "Translate to X" (where X is their preferred language), and Chrome has a decent shot of being able to display the entire history in their native tongue.
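To make that a bit more concrete, here's a rough sketch of what those two classes might look like. To be clear, none of this code actually exists yet; the names and structure below are just my working assumptions:

```java
import java.io.Serializable;
import java.util.concurrent.ConcurrentSkipListMap;

// Message.java - one immutable chunk of chat content
public class Message implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long timestamp;  // milliseconds; doubles as the transcript key
    private final String author;
    private final String content;
    private final String locale;   // e.g. "en-us", from the browser preferences

    public Message(long timestamp, String author, String content, String locale) {
        this.timestamp = timestamp;
        this.author = author;
        this.content = content;
        this.locale = locale;
    }

    public long getTimestamp() { return timestamp; }
    public String getAuthor() { return author; }
    public String getContent() { return content; }
    public String getLocale() { return locale; }
}

// Chat.java - the transcript keeps itself sorted by timestamp, even when
// messages replicated from other servers arrive out of order
public class Chat implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String chatId;
    private final ConcurrentSkipListMap<Long, Message> transcript =
            new ConcurrentSkipListMap<Long, Message>();

    public Chat(String chatId) {
        this.chatId = chatId;
    }

    public String getChatId() { return chatId; }

    public ConcurrentSkipListMap<Long, Message> getTranscript() {
        return transcript;
    }

    public void post(Message message) {
        transcript.put(Long.valueOf(message.getTimestamp()), message);
    }
}
```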
That's all nifty, of course, but doesn't address the search use case. So how's this? Suppose this had already been in place for a year, and a couple weeks ago someone searched for the phrase "lotusphere abstract deadline". We want to treat this as an "AND" search, so we split the query, and start looking for a match for all three terms.
For each Chat instance (all of which are in the applicationScope, so we don't have to even touch disk to do the following), we grab the transcript map, and call its descendingMap method. Since, again, it's a sorted map, this returns a view of the transcript (backed by the same entries in memory, not copies), just in reverse order. Then we just iterate through all the messages. The first time we find any of the terms, we construct a new SearchResult instance. This will keep track of which chat contains the result, and the timestamp of the newest message that contained any of the terms. If that message contained all the terms, then its timestamp is also used to indicate the oldest message for this search result. Otherwise, we keep scanning until we've either found all the terms or run out of messages.
But even if we find all the terms, if there are any messages remaining, we keep scanning: we've finished the first search result, so that gets added to a list, but there might be additional references to these terms even within the same chat.
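In sketch form, the scan of a single chat might look roughly like this. It assumes a SearchResult class along the lines of the one I'll describe in a moment, which just tracks the chat and the newest/oldest matching timestamps:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;

public class ChatScanner {

    // Scan one chat, newest message first; a single chat can produce
    // multiple SearchResults if the terms cluster in more than one place
    public static List<SearchResult> scanChat(Chat chat, String query) {
        List<String> terms =
                Arrays.asList(query.toLowerCase(Locale.ENGLISH).split("\\s+"));
        List<SearchResult> results = new ArrayList<SearchResult>();

        // reverse-order view of the transcript: no copying, just pointers
        NavigableMap<Long, Message> newestFirst = chat.getTranscript().descendingMap();

        SearchResult current = null;   // the result we're currently building
        Set<String> remaining = null;  // terms we still haven't found for it

        for (Map.Entry<Long, Message> entry : newestFirst.entrySet()) {
            String content =
                    entry.getValue().getContent().toLowerCase(Locale.ENGLISH);

            boolean matchesSomething = false;
            for (String term : terms) {
                if (content.contains(term)) {
                    matchesSomething = true;
                    break;
                }
            }
            if (!matchesSomething) {
                continue;
            }

            if (current == null) {
                // first hit: this message is the newest end of a new result
                current = new SearchResult(chat, entry.getKey());
                remaining = new HashSet<String>(terms);
            }

            // cross off whichever terms this message satisfies
            for (Iterator<String> it = remaining.iterator(); it.hasNext();) {
                if (content.contains(it.next())) {
                    it.remove();
                }
            }

            if (remaining.isEmpty()) {
                // all terms accounted for: this (older) message closes the result,
                // then we keep scanning in case the terms show up again further back
                current.setOldestTimestamp(entry.getKey());
                results.add(current);
                current = null;
            }
        }
        // if we ran out of messages with terms still missing, the partial
        // result is simply discarded
        return results;
    }
}
```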
Suppose that, for instance, in February, someone posted, "it was great to see you all at lotusphere, can't wait 'til next year". Then, in June, someone else asked, "what does it mean if a class is defined as abstract?"... then, in September, someone said, "sorry i haven't had a chance to respond to your question, big deadline tomorrow". This means we have a perfectly valid search result... but one which has no chance of answering the user's question.
In contrast, if the day before this user performed the search, someone said, "i'm thinking of submitting a lotusphere abstract", and someone responded, "better hurry, deadline is nov 6"... now we can answer the user's question. How do we know which result is more likely to do so? Proximity.
Once we've scanned every chat, we now have a list of all valid search results. Each stores a pointer to the chat it came from, and the timestamps of the newest and oldest messages that were found in order to satisfy the search criteria. What I forgot to mention earlier is that, since we know what the newest and oldest are (even if they're the same message), we can grab an inclusive subMap from the original transcript using those keys. In other words, this gives us a separate map of the first message to contain one or more of the terms, the last message that contained one or more, and any messages between. So relevance becomes simple: we sort the results ascending by map size. In the above examples, one search result would have a subMap size of 2; the other might be in the thousands. If we had, instead, found a message that said, "the deadline for submitting a lotusphere abstract is tomorrow", we have a map size of 1: a single message contains every term that was searched for, and is almost certain to provide the user the precise information they were searching for.
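So the SearchResult sketch is really just a holder for two keys and a pointer to the chat, with the subMap doing all the heavy lifting (again, speculative code, not anything that's been written yet):

```java
import java.io.Serializable;
import java.util.SortedMap;

// A search result is nothing but pointers back into the in-memory transcript
public class SearchResult implements Serializable, Comparable<SearchResult> {
    private static final long serialVersionUID = 1L;

    private final Chat chat;             // the chat this result came from
    private final Long newestTimestamp;  // newest message matching any term
    private Long oldestTimestamp;        // oldest message needed to cover all terms

    public SearchResult(Chat chat, Long newestTimestamp) {
        this.chat = chat;
        this.newestTimestamp = newestTimestamp;
        this.oldestTimestamp = newestTimestamp;  // pushed older as the scan continues
    }

    public void setOldestTimestamp(Long oldestTimestamp) {
        this.oldestTimestamp = oldestTimestamp;
    }

    public Chat getChat() { return chat; }
    public Long getNewestTimestamp() { return newestTimestamp; }
    public Long getOldestTimestamp() { return oldestTimestamp; }

    // inclusive view of every message from the oldest match to the newest
    public SortedMap<Long, Message> getMessages() {
        return chat.getTranscript().subMap(oldestTimestamp, true, newestTimestamp, true);
    }

    // relevance is proximity: the smaller the subMap, the better the result
    public int getSize() {
        return getMessages().size();
    }

    // natural ordering puts the smallest (most relevant) results first;
    // a real SearchResultCollection would need a tiebreaker for equal sizes
    public int compareTo(SearchResult other) {
        return this.getSize() - other.getSize();
    }
}
```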
As a result, we can construct a SearchResultCollection that sorts the SearchResult instances ascending by size, and bind a repeat to the collection on the search page. The contents of the repeat are fairly simple, but powerful:
a link to the chat, with a fragment of "#8675309" (where Jenny's phone number would be replaced by an actual timestamp), so if the user clicks through, the page scrolls directly to the message that began the search result
a date field indicating the start time of the search result (to be able to distinguish between multiple results from the same page)
a rich text field bound to the oldest message... in the case of a map size of 1, we've now answered their question directly in the search result screen... no need to click through
for some additional UI candy, a tooltip bound to the search result link; inside it, a nested repeat bound to the actual search result map... so, if the user hovers over the link, they'll see a portion of the chat: the specific portion that satisfies their search... so, again, the user may never need to click through
The crazy part is that all of this is in memory. Not only are we searching in-memory data to begin with, but everything about the result collection is just a set of strategically organized pointers to the same in-memory data. So, again, in theory, this should be lightning fast compared to searching even a full-text index... and, unlike a full-text index, we'll get real time results. This efficiency gives us the luxury of factoring in additional weighting criteria beyond simple proximity.
One possibility would be contextual author reputation: for example, if I mention the word "theme" in a chat about Lotusphere, there's at least a small possibility that the message containing that word would be snarky, and not particularly useful. If, on the other hand, I'm using the same term in a chat about XPage development, there's at least a small possibility that the content might actually be useful to someone. One of the features I'm planning on adding is the ability for people to +1 (or some other form of "social reinforcement") specific messages in a chat. So imagine how this type of custom search engine could gradually "learn" who the subject matter experts are on any given term... the more searches that are performed that include a specific word, the more messages will be found that contain that word; as more of these matching messages are found, some of them will have been voted up; over time, patterns would slowly emerge. If you're searching for information about HTML5 local storage, for instance, a search result that includes messages posted by Mark Hughes might be listed higher than another search result just because the user community has acknowledged that he's a guru on that subject.
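Just to illustrate the sort of thing I mean, a weighting tweak like the following could be bolted onto the SearchResult sketch above. The plusOnes map here is entirely hypothetical (a message-timestamp-to-vote-count map that would exist once the "+1" feature does), and so is the math:

```java
import java.util.Map;

public class ReputationWeighting {

    // Entirely speculative: shrink a result's effective size a bit for every +1
    // its messages have received, so socially-reinforced answers drift upward.
    // plusOnes is a hypothetical map of message timestamp to vote count.
    public static double weightedSize(SearchResult result, Map<Long, Integer> plusOnes) {
        int votes = 0;
        for (Long timestamp : result.getMessages().keySet()) {
            Integer v = plusOnes.get(timestamp);
            if (v != null) {
                votes += v;
            }
        }
        // log dampening, so votes nudge the ordering rather than swamp proximity
        return result.getSize() / (1.0 + Math.log1p(votes));
    }
}
```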
So there's just one final concern to address regarding this whole approach: scanning the entirety of every chat for every search. Sure, it's in memory, so it'll be fast. But if the app is heavily used over a period of years, that's a lot of content to scan through every time. Surely there's a way to make this even more efficient. And we can do it very easily using my old friend, the predictable UNID technique. The reason this amuses me is that, while I've used this technique for all sorts of functionality over the past three years, it was originally inspired by Andrei Kouvchinnikov's DigestSearch, and this is the closest I've yet come to using it in the way he originally did.
The idea is simple but elegant: hash the search string, and we have a valid UNID. That doesn't necessarily mean a document exists with that UNID... just that it's syntactically valid. So suppose we create a separate database just for storing search results. When a search is performed, we hash the query (lowercased first, since case matters to the hash), then check the search result database to see if there's already a document with that UNID. Since search result documents are all that we store in that database, if a document exists, it's guaranteed to be the stored result of a previous run of the same search.
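The hashing itself is trivial. I'm assuming an MD5 digest here, purely because 128 bits renders as exactly 32 hex characters, which happens to be the length of a valid UNID:

```java
import java.security.MessageDigest;
import java.util.Locale;

public class QueryDigest {

    // Turn a query string into a syntactically valid (but predictable) UNID:
    // the MD5 digest of the lowercased query, rendered as 32 hex characters
    public static String toUnid(String query) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    query.toLowerCase(Locale.ENGLISH).getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException("Unable to hash query: " + query, e);
        }
    }
}
```

Checking for a prior result is then just a getDocumentByUNID call against the search result database, wrapped in a try/catch, since that call complains when no such document exists.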
Let's assume for now that no match is found. So we create a new document and assign it the hash as its UNID. Then we proceed as described above. When we have our SearchResultCollection and are ready to display it to the user, we first serialize the entire collection to a MIME entity on our new search result document. Hence, if someone else ever performs the same search, we can instantly retrieve the digest document, deserialize the MIME entity, and we immediately have the prior SearchResultCollection in memory, in the exact state it was in the last time the same search was performed. But we're not done yet... there might be new content available.
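For the curious, the storage side might look something like the following. This assumes the session's ConvertMIME property has already been switched off and that SearchResultCollection is Serializable, and it skips all error handling and recycling... consider it a sketch of the idea, not the actual implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

import javax.xml.bind.DatatypeConverter;

import lotus.domino.Document;
import lotus.domino.MIMEEntity;
import lotus.domino.Session;
import lotus.domino.Stream;

public class ResultCache {

    // Serialize the collection to Base64 text and park it in a MIME entity
    // on the digest document whose UNID is the hash of the query
    public static void store(Session session, Document digestDoc,
            SearchResultCollection results) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(results);
        out.close();

        Stream stream = session.createStream();
        stream.writeText(DatatypeConverter.printBase64Binary(bytes.toByteArray()));

        MIMEEntity entity = digestDoc.createMIMEEntity("SearchResults");
        entity.setContentFromText(stream,
                "application/x-java-serialized-object", MIMEEntity.ENC_NONE);
        stream.close();

        digestDoc.closeMIMEEntities(true, "SearchResults");
        digestDoc.save();
    }
}
```

Retrieval is just the mirror image: getMIMEEntity("SearchResults"), getContentAsText(), Base64-decode, and deserialize with an ObjectInputStream.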
If the SearchResultCollection has a separate property that is a chatId-to-timestamp map, then as we're searching each chat, we can store the timestamp of the newest message in each. In addition to subMap, sorted maps also support a notion of tailMap. So when we search each chat, if it's the first time we've searched for this particular query, we search the entirety of each chat transcript. If, however, we retrieved a prior SearchResultCollection, and a given chat's ID is listed in the timestamp map, then we've already searched that chat up to the point in time that timestamp represents. So... instead of grabbing the entire transcript, we just grab an exclusive tailMap, and search only that. If any new valid search results are found, we add them to the SearchResultCollection, and allow its built-in sorting algorithm to ensure each new result appears in the correct order based on its relevance within the overall collection... but, whether or not any valid result is found, we update the timestamp map so that the updated collection knows how much new content it searched.
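In sketch form, deciding how much of each transcript actually needs to be scanned is just a lookup against that timestamp map (lastSearched here is the chatId-to-timestamp map described above):

```java
import java.util.Map;
import java.util.NavigableMap;

public class IncrementalScope {

    // Everything on a first search; only what's new on subsequent searches
    public static NavigableMap<Long, Message> scopeToScan(Chat chat,
            Map<String, Long> lastSearched) {
        Long lastScanned = lastSearched.get(chat.getChatId());
        if (lastScanned == null) {
            // this query has never touched this chat: scan the whole transcript
            return chat.getTranscript();
        }
        // exclusive tailMap: only messages posted since the previous scan
        return chat.getTranscript().tailMap(lastScanned, false);
    }
}
```

Whatever comes back from that is what gets handed (in descending order) to the same scan described earlier, and the timestamp map gets bumped to the newest key afterwards, whether or not any new results turned up.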
And, just like that, we've implemented incremental search.
Now, obviously, there are some caveats. Primary among them is that this only works because, once created, individual chunks of content never change, and (theoretically) never disappear. Granted, when retrieving a prior SearchResultCollection, we'd have to check to see if each chat is still flagged to allow public search. Come to think of it, this whole model probably needs to account for security (we search everything so we can store the results, but when returning the collection to the repeat, only include search results that were found in a public chat or one that the current user is a member of, etc.). But, assuming security is accounted for, if a given message set used to be a valid result for a given query, it still is and always will be. The same cannot be said for most Domino applications: field values change. Statuses get updated as documents move through a workflow. Tasks are reassigned. Dollar amounts are recalculated to account for new exchange rates. So the specific search strategy outlined above is only valid for this specific application.
And that's the whole point. We're leveraging the strengths of the platform as it currently exists to meet a context-specific need. To be precise, we're acknowledging that the ability to share memory across what have traditionally been separate memory contexts - even amongst all active users - is something the platform never had before the arrival of XPages, and this new capability gives us infinite control over how we direct users to pertinent content in ways we never could before.
Domino's full-text index capabilities have always been useful, and continue to provide a serviceable generic approach to locating content that matches certain criteria. But massive power is now at our fingertips. We've barely scratched the surface of what we can now do in Domino applications. The tight coupling between data and user interface, to which we had so long been accustomed, would previously have made implementing a feature like the completely custom search engine described above such a massive undertaking that it likely wouldn't have occurred to us to try it in the first place... and, even if it had, it would most likely have seemed like more trouble than it was worth. But now... this stuff is easy. I know it might not seem like it, because we've never done this before. Strictly speaking, I still haven't. To be absolutely sure this would actually work, I still have to write the code. But I know it works. This is just core Java stuff. Anybody who's been working with Java as long as I've been working with Domino would probably look at this approach and think, "well... duh". And that's awesome. Capabilities that are so core to the language that a true Java developer just takes them for granted provide us opportunities we never dreamed of back in the R7 days...
...imagine what ideas we'll dream up once more of us actually learn the language.