Missed it by that much: Dead Drive
We have two servers in a cluster in Brazil. The hardware is HP ProLiant DL380 G5. The disks are five hot-pluggable 146GB 10,000RPM SAS drives in a RAID 5 array. We have HP Insight Management Agents and HP Systems Management Homepage (SMH) set up on those servers, with the HP Event Notifier configured to send our admin group notifications for any hardware events through the SMTP server running on the Domino server of the primary server in the cluster.
Today, the local administrator in Brazil sent us a photo of the front of the server. In the photo, one of the drives had a red light on it. I connected to the server and looked at HP SMH and the HP Integrated Management Log (IML). Sure enough, one of the drives was dead and had been acting up since January 31st. But why no notification?
I took a deep look at NotesBR1, the primary server in the cluster, to determine why we did not receive any notification that the drive had gone bad.
The Event Notifier is set up correctly, and I was able to send two Test Trap Messages, which were properly received.
The HP Integrated Management Log (IML) shows this:
We should be receiving email notifications for each of these warnings, especially the critical ones.
I noticed this in the Event Log:
This is from the Domino server log:
02/07/2009 07:11:59 AM SMTP Server: Starting…
02/07/2009 07:12:03 AM SMTP Server: Started
When the Event Notifier (CimNotify) service starts up, it tries to open a connection to the SMTP server to make sure it is configured properly, but does not send a message. If it cannot open an SMTP connection, the service fails and generates this event log event.
What is happening is that the hardware events (errors and warnings) are generated when the server starts. Notice the times: 7:10AM, 7:11AM, and 7:12AM. This is right after the weekly Saturday morning 7AM maintenance reboot. The Event Notifier fails to start because Domino has not finished starting up yet, so it cannot open an SMTP connection. The hardware events are then generated and appear in the IML, but they cannot be sent via email because the Event Notifier service is not running.
Domino eventually starts up and Event Notifier eventually starts – so that’s why I was able to send test trap messages.
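The race described above comes down to one question: is the SMTP port accepting connections at the moment the notifier starts? Here is a minimal Python sketch of that kind of probe, with a simple retry loop. It is only an illustration, not how CimNotify works internally and not the fix applied on these servers. The function names, port 25, and the retry counts are all assumptions.

```python
import socket
import time


def smtp_port_open(host, port=25, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # Open and immediately close the connection; no message is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: nothing is listening yet.
        return False


def wait_for_smtp(host, port=25, attempts=10, delay=30.0):
    """Probe host:port up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if smtp_port_open(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Hypothetical usage: "localhost" stands in for a server that relays through itself.
    print("SMTP reachable:", wait_for_smtp("localhost", attempts=3, delay=5.0))
```

A single probe right after the reboot would fail on a server that points to itself, because the Domino SMTP task does not report "Started" until a few minutes later. A retried probe would eventually succeed.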
So the bottom line is that there is a flaw in the way that I’ve configured the Event Notifier on several of the servers. This would explain why, over the past several months, we have only received Event Notifications from NotesBR2 and other even-numbered (secondary) cluster servers. The even-numbered servers point to odd-numbered (primary) servers as their SMTP servers, while NotesBR1 (and the other odd-numbered servers) point to themselves.
The problem does not occur on the even-numbered servers because when the Event Notifier service tries to start, it connects to the odd-numbered server’s SMTP server, which is already running, and the Event Notifier service proceeds to start up properly.
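To spot which servers are exposed to this, you can list each server alongside the SMTP host its Event Notifier uses and flag any that relay through themselves. The sketch below assumes that mapping has been collected by hand. The dictionary and function name are illustrative, and only the NotesBR1 and NotesBR2 names come from this post.

```python
def self_relaying_servers(smtp_targets):
    """Return the servers whose configured SMTP host is the server itself."""
    return sorted(
        server
        for server, smtp_host in smtp_targets.items()
        if server.lower() == smtp_host.lower()
    )


if __name__ == "__main__":
    # Hand-collected mapping: server name -> SMTP host used by its Event Notifier.
    smtp_targets = {
        "NotesBR1": "NotesBR1",  # primary, points to itself
        "NotesBR2": "NotesBR1",  # secondary, points to its primary
    }
    print(self_relaying_servers(smtp_targets))
```

Any server this reports depends on its own Domino SMTP task being up before the notifier starts.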
Happy Days!
p.s. Yes, the title is a reference to Get Smart.
And I thought my client who had coded for an old R5 server to route mail had it wrong. He had it right! Up until we decommissioned his SMTP server, LOL!
You had the right idea. Suggest you use the next primary server as the SMTP server so 1 would use 3 and 3 would use 1?
But you probably already thought of this.
Keith, these are geographically separated machines. So 2 Domino servers in Brazil, let’s say. I didn’t want to route the mail to the next closest server that was running SMTP, in Hong Kong or New York for example, just for a notification.
The secondary servers in the cluster actually do not have SMTP running. They serve as webmail/DWA servers and sit out in the DMZ, but clients can fail over to them if the primary cluster mate is down.