Archive for the ‘Monitoring’ Category

Missed it by that much: Dead Drive

We have two servers in a cluster in Brazil. The hardware is HP ProLiant DL380 G5. The disks are five 146GB 10,000RPM hot-pluggable SAS drives in a RAID 5 array. We have HP Insight Management Agents and HP Systems Management Homepage set up on those servers, with the HP Event Notifier configured to send our admin group notifications for any hardware events through the Domino SMTP server on the primary server in the cluster.

Today, the local administrator in Brazil sent us a photo of the front of the server. In the photo, one of the drives had a red light on it. I connected to the server and looked at HP SMH and the HP Integrated Management Log (IML). Sure enough, one of the drives was dead and had been acting up since January 31st. But why no notification?

Creating corrective actions with events4.nsf (Part 2 Disaster Strikes!)

After a few days of running the corrective actions that I blogged about in this previous post, I have had a couple of disasters and a couple of letdowns.

Corrective Action 1: It doesn’t work. The database name is not passed to the command, so the compact task has nothing to work on.
Here’s an excerpt from my log.
01/23/2009 04:39:54 PM Error compacting mail\somemailbox.nsf: Database is corrupt — Cannot allocate space
01/23/2009 04:39:54 PM Database compactor process shutdown

> load compact -c
01/23/2009 04:39:54 PM Database compactor error: File does not exist
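One possible workaround, sketched here in Python, is to pull the database path out of the error text before building the command. This is only an illustration: the regular expression is inferred from the single log line above, the function names are hypothetical, and none of this is a built-in Domino feature. It assumes some wrapper script can see the console line, and the argument order simply follows the command as written in this post.

```python
import re

# Pattern inferred from the log excerpt above (an assumption, not a Domino spec):
#   Error compacting mail\somemailbox.nsf: Database is corrupt ...
COMPACT_ERROR = re.compile(
    r"Error compacting (?P<db>.+?\.nsf): Database is corrupt"
)


def database_from_event(line):
    """Return the database path named in a compact error line, or None."""
    match = COMPACT_ERROR.search(line)
    return match.group("db") if match else None


def compact_command(line):
    """Build a console command for the database named in the line.

    Returns None when the line names no database, so the caller can skip
    the corrective action instead of running compact with no target.
    """
    db = database_from_event(line)
    if db is None:
        return None
    # Argument order follows the command quoted in this post.
    return "load compact -c " + db
```

Returning None for lines without a database name is deliberate: it avoids the "File does not exist" case shown above, where compact was started with no target.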

Corrective Action 2: We haven’t had a mail.box corrupt yet, so I don’t know if this one works.

Corrective Action 3: Originally, I had this Event Handler set so that the trigger was “a built-in or add-in task” and I chose the error message that started out with something like “Unable to update activity in Log for”. However, when that error occurred, nothing happened. I looked in my statrep.nsf and found that the event type was mismatched.

So I redefined the event handler so that the Trigger was set to “Any event that matches a criteria”, with the event type set to any, the severity set to any, and the event text set to “Unable to update activity document in log database.” The corrective action was nserver.exe with the argument -c “load fixup log.nsf”.

I thought all was fine now. The next thing I knew, one of our main servers had locked up with hundreds of error messages on the console. It turns out that every time the server logged “Unable to update activity document in log database”, the handler tried to run fixup on log.nsf. The message repeated over and over when statlog ran at 5AM, so hundreds of tasks were spawned to run fixup on log.nsf.

> 01/22/2009 05:01:36 AM Unable to write to database – database e:\Lotus\Domino\Data\mail\database1.nsf would exceed its disk quota of 1024000 KB by 1 GB.
01/22/2009 05:01:36 AM Unable to update activity document in log database for mail\database1.nsf: Unable to write to database because database would exceed its disk quota.
> load fixup log.nsf
> 01/22/2009 05:01:36 AM Database Fixup: Started
01/22/2009 05:01:36 AM Performing consistency check on log.nsf…
> load fixup log.nsf
> 01/22/2009 05:01:37 AM Database Fixup: Started
01/22/2009 05:01:37 AM Unable to fixup database e:\Lotus\Domino\Data\log.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:37 AM Database Fixup: Shutdown
> 01/22/2009 05:01:37 AM Unable to write to database – database e:\Lotus\Domino\Data\mail\database2.nsf would exceed its disk quota of 1024000 KB by 92 MB.
01/22/2009 05:01:37 AM Unable to update activity document in log database for mail\database2.nsf: Unable to write to database because database would exceed its disk quota.
01/22/2009 05:01:37 AM Unable to update activity document in log database for mail\database3.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:37 AM Unable to update activity document in log database for mail\database4.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:38 AM Unable to update activity document in log database for mail\database5.nsf: The database is being taken off-line and cannot be opened.
> load fixup log.nsf
> 01/22/2009 05:01:38 AM Unable to update activity document in log database for mail\database6.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:38 AM Unable to update activity document in log database for mail\database7.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:38 AM Database Fixup: Started
01/22/2009 05:01:38 AM Unable to fixup database e:\Lotus\Domino\Data\log.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:38 AM Database Fixup: Shutdown
> load fixup log.nsf
> 01/22/2009 05:01:38 AM Unable to update activity document in log database for mail\database8.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:38 AM Database Fixup: Started
01/22/2009 05:01:38 AM Unable to fixup database e:\Lotus\Domino\Data\log.nsf: The database is being taken off-line and cannot be opened.
01/22/2009 05:01:38 AM Database Fixup: Shutdown
01/22/2009 05:01:38 AM Unable to update activity document in log database for mail\database9.nsf: The database is being taken off-line and cannot be opened.
> load fixup log.nsf

The error on the Domino server:

Poke buffer already contains characters

NOT GOOD.

I only wish that we could have an option to only run once.

I’ve had to disable this corrective action and another one that basically did the same thing, except that the trigger text was “Cannot write to log file”.
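A run-once guard of the kind wished for above could live in a small wrapper that the event handler calls instead of launching fixup directly. The Python below is a minimal sketch under that assumption; `try_acquire`, `run_guarded`, the stamp file, and the cooldown value are all hypothetical names and choices, not anything Domino provides. It is also not a hardened lock: two processes that both see a stale stamp at the same instant could still both proceed.

```python
import os
import time


def try_acquire(stamp_path, cooldown_seconds, now=None):
    """Return True if no run has been recorded within the cooldown window.

    The stamp file's modification time records the last accepted run.
    """
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(stamp_path) < cooldown_seconds:
            return False  # a run was accepted recently; skip this one
        os.remove(stamp_path)  # the stamp is stale; clear it
    except OSError:
        pass  # no stamp yet, or another process already removed it
    try:
        # O_EXCL makes creation fail if another process got there first.
        fd = os.open(stamp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    os.utime(stamp_path, (now, now))  # record when this run was accepted
    return True


def run_guarded(stamp_path, cooldown_seconds, action, now=None):
    """Call action() only if the cooldown gate allows it; report whether it ran."""
    if try_acquire(stamp_path, cooldown_seconds, now):
        action()
        return True
    return False
```

With a cooldown of, say, ten minutes, a burst of hundreds of matching events in the same few seconds would result in one call to the action, which in this case would be whatever launches the fixup on log.nsf.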

Creating corrective actions with events4.nsf (with screenshots)

For a long time, I’ve been frustrated with the massive number of messages that Domino sometimes sends us over the weekend, late at night, or early in the morning indicating that a mail.box or some other database is corrupt.

Here’s a list of the common event generators/notifications we receive and a step-by-step guide to how I set up corrective actions to fix the problem and prevent the event task from spewing thousands of notification messages.

  • Event/notification: Error compacting mail\somedatabase.nsf: Database is corrupt — Cannot allocate space
  • Corrective Action: New event handler runs “load compact -c”

  • Event/notification: Router: Mailbox file mail1.box is corrupt
  • Corrective Action: New event handler runs “load fixup”

  • Event/notification: Unable to update activity document in log database for bookmark.nsf: Database is corrupt — Cannot allocate space
  • Corrective Action: New event handler runs “load fixup log.nsf”

This should significantly reduce the number of messages that are generated by runaway event monitor notifications.
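The three pairings above amount to a small lookup from event text to console command. The Python below is just a way to picture that table, not how events4.nsf stores it: the patterns are inferred from the sample messages in the list, the names are hypothetical, and where the command's target is cut off in the list it is left out here as well.

```python
import re

# (pattern, command) pairs taken from the list above.
# The patterns are assumptions inferred from the sample messages.
RULES = [
    (re.compile(r"Error compacting .+: Database is corrupt"),
     "load compact -c"),  # target database not shown in the list
    (re.compile(r"Router: Mailbox file .+ is corrupt"),
     "load fixup"),  # target mailbox not shown in the list
    (re.compile(r"Unable to update activity document in log database"),
     "load fixup log.nsf"),
]


def corrective_action(event_text):
    """Return the command for the first matching rule, or None if none match."""
    for pattern, command in RULES:
        if pattern.search(event_text):
            return command
    return None
```

Event text that matches none of the patterns returns None, which corresponds to sending the notification with no corrective action.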

The following events cannot have event handlers run on them.

  • Event/notification: Unable to store document in MailServer1/MyDomain mail\usermailbox.nsf (NoteID = 1264886) from mail\usermailbox.nsf (NoteID = 1209126): Database is corrupt — Cannot allocate space
  • Reason: The command would have to be issued on the remote server, and there is no way to pass an argument to a remote server on R6.5.6.

  • Event/notification: Unable to replicate MailServer1/MyDomain mail\usermailbox.nsf: Database is corrupt — Cannot allocate space
  • Reason: The command would have to be issued on the remote server, and there is no way to pass an argument to a remote server on R6.5.6.

  • Event/notification: Database is corrupt
  • Reason: This event is generated by the router task based upon messages in the mail routing view in the log. The corrupt database is named on a separate line in the log from the “Database is corrupt” text, so there is no way to determine which database to run a command on.


The ease and benefit of setting up diagnostics collection and LND Fault Reports database

I set up diagnostics collection and the Lotus Notes/Domino Fault Reports database for the first time today. It only took about 15 minutes.

I created a new database called Lotus Notes/Domino Fault Reports (faultreport.nsf) from the lndfr.ntf template. I set this up on a central server that I also use as my main monitoring server, which initiates mail probes to all servers. Then I created a mail-in database document for this database with the name Lotus Notes/Domino Fault Reports.

I then edited the * Server configuration document and set the diagnostics tab with the following settings:


HP Systems Insight Management – who uses it?

I just want to mention that I love HP (and previously Compaq) servers. I love installing Windows 2003 from a SmartStart DVD and having all of the HP Insight Management agents and the Windows drivers for the various system components installed when it’s all said and done.

I’ve been playing with the HP Systems Insight Management Console for the last few months. It’s a great tool with lots of granular monitoring, patch and firmware deployment, etc.

I just realized this week that you could go to HP’s website and install what’s called a ProLiant Support Pack.
