Sunday, August 8, 2021

DDM probes + Cluster Replicator + intensively used database = Database open error: [filepath.nsf]: Database is currently in use by you or another user

 Hi

Here is an issue I spent a few days on.

We had two Domino R11.0.1 FP3 servers in a cluster. Then we added a third Domino server of the same version to the cluster, and after that we started to see many errors like this:

08-08-2021 18:09:53   Database open error: <filepath.nsf>: Database is currently in use by you or another user
08-08-2021 18:09:53   Database open error: <filepath.nsf>: Database is currently in use by you or another user
08-08-2021 18:09:53   Database open error: <filepath.nsf>: Database is currently in use by you or another user
08-08-2021 18:09:53   Database open error: <filepath.nsf>: Database is currently in use by you or another user
08-08-2021 18:09:53   Database open error: <filepath.nsf>: Database is currently in use by you or another user

Despite the errors in the logs, all our applications worked fine. The only consequence was that log.nsf became very hard to use, because we could get ~5K such messages per day. The errors referred to a few different databases, but not to all the databases we had, and they were logged on all three Domino servers.
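To see which databases are affected and how many messages arrive per day, the log lines can be counted with a small script. This is only a sketch: it assumes the log has been exported to plain text in exactly the format shown above, and the function name `count_in_use_errors` and the sample paths are made up for illustration.

```python
import re
from collections import Counter

# Matches lines in the format quoted above, e.g.
# "08-08-2021 18:09:53   Database open error: <path>: Database is currently in use by you or another user"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}-\d{2}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"Database open error: (?P<db>.+?): "
    r"Database is currently in use by you or another user\s*$"
)

def count_in_use_errors(lines):
    """Return (per_database, per_day) Counters for the 'in use' error."""
    per_db, per_day = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            per_db[m["db"]] += 1
            per_day[m["date"]] += 1
    return per_db, per_day

# Hypothetical sample lines; in practice, pass an open text file instead.
sample = [
    "08-08-2021 18:09:53   Database open error: apps/a.nsf: Database is currently in use by you or another user",
    "08-08-2021 18:09:54   Database open error: apps/b.nsf: Database is currently in use by you or another user",
    "09-08-2021 07:00:01   Database open error: apps/a.nsf: Database is currently in use by you or another user",
    "09-08-2021 07:00:02   Some unrelated message",
]
per_db, per_day = count_in_use_errors(sample)
for db, n in per_db.most_common():
    print(db, n)
```

Running the same count over logs from before and after a configuration change gives a simple way to check whether the change made a difference.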

I googled and found these two articles without a solution:

So, at the end of the day, I didn't find anything helpful on the Internet, except for the point that the error message was produced by Cluster Replicator. My first attempts to resolve it myself led nowhere, so I decided to contact HCL.

Unfortunately, they couldn't help me either. They gave some basic advice, such as deleting the database and replicating it again, restarting Cluster Replicator and so on.

So I had no choice but to try to resolve it myself.

After spending two days on it, I discovered that the cause of the issue was DDM probes. We had always used DDM probes for server monitoring and hadn't changed anything there for a long time, which is why it was so hard to catch. But after I disabled them, the error messages stopped appearing.

Now, let me give you my explanation, which is pure theory but looks reasonable to me.

We rolled out the third Domino server because we planned to launch a new Web application that was supposed to attract many new users from the Internet, and we wanted to be sure that our Domino environment would handle the increased workload. We were using an external load balancer to route users to our Domino cluster, and the third server was expected to take a good share of the load.

After we added the third Domino server, we launched the application as well and, as we expected, we got more Web users. So, basically, this is what happened:

  • More users than before worked with just a few databases (a kind of entry point for the new Web application);
  • Three Cluster Replicators (instead of two) pushed data to replicas on the other servers, and they had to do it more often than before;
  • DDM worked as before, but it seems that the server tasks executing DDM probes needed exclusive access to a database for a short period;
  • Cluster Replicators worked intensively with a few databases and often failed to open them while DDM probes were running. Each time, Cluster Replicator logged that message and kept the changes in memory until it could open the database later.
That's basically it.
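The theory can be illustrated with a toy model. This is not how Domino is implemented; it only assumes, as described above, that a probe sometimes holds a database exclusively and that the replicator retries later while keeping changes in memory. The function name `simulate` and its parameters are invented for this sketch.

```python
def simulate(ticks, probe_busy, changes_at):
    """Step through time and return (errors, flushed).

    probe_busy: set of ticks at which a probe holds the database exclusively
    changes_at: set of ticks at which one new change arrives for replication
    """
    pending = 0    # changes kept in memory, waiting to be pushed
    errors = []    # ticks at which the open failed (one log line each)
    flushed = []   # (tick, number_of_changes_pushed)
    for t in range(ticks):
        if t in changes_at:
            pending += 1
        if pending == 0:
            continue                      # nothing to replicate at this tick
        if t in probe_busy:
            errors.append(t)              # "Database is currently in use..."
        else:
            flushed.append((t, pending))  # open succeeded, push everything
            pending = 0
    return errors, flushed

changes = {1, 2, 3, 5, 6}
with_probes = simulate(10, probe_busy={2, 3, 6}, changes_at=changes)
without_probes = simulate(10, probe_busy=set(), changes_at=changes)
print(with_probes)
print(without_probes)
```

In this model every change is eventually pushed in both runs. The probes only add error entries and a short delay, which matches the observation that the applications kept working while the log filled up.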
Maybe only specific probes caused this message. I did not check that; I only tested the following:
  • disabling the probes stopped the messages
  • enabling the same probes again brought the messages back
It is hard even to ask HCL to fix anything, because it is good to be able to see such messages: they can help to identify important issues. On the other hand, having log.nsf spammed with 5K messages per day is not acceptable either. The only idea I have is that it might make sense to introduce another notes.ini parameter to suppress the message, but I am not sure.



4 comments:

  1. We have seen this error occasionally (every couple of months) for years now, but we never found the root cause. I will look into this on our servers and try to find out whether the DDM probes might be the root cause for us too.

    Replies
    1. If it is not a big deal for you, can you please let me know if it helps?

  2. Hello Yuriy,

    I am from HCL Technical Support. I was able to fix the error for one of my customers by disabling the DDM probes. Thank you for your article, it helped. I will check further to understand why the Cluster Replicator task is causing the error, and I will post here if I find anything useful. Thank you.
