Rolling back from a hosting provider
We made the decision to roll back half of our hosted Domino servers from our newly outsourced hosting facility.
Most of our servers are clustered, and we use clustering for failover, not load balancing, so our odd-numbered servers are primary and the even-numbered servers are their failover cluster mates.
Since the production environment at the hosted facility had gone down, and the disaster recovery site wasn’t being backed up, we wanted to fall back the even-numbered servers to our original in-house facility. This way, at least we would have a backup, or fallback plan, if our servers went down again at the outsource facility.
First, we removed the machines from the cluster in the NAB.
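Cluster membership changes made in the directory are carried out by the Administration Process, so rather than waiting for the next AdminP interval, you can nudge it along at the console. For example, assuming adminp is running with its defaults:
> tell adminp process all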
What we did next was to copy names.nsf (our secondary address book), admin4.nsf, dc.nsf, da.nsf, and events.nsf to a USB disk on our local admin machine. We probably could have gotten away with just names.nsf, admin4.nsf, and dc.nsf, but just to be sure, we copied the others. We also probably could have copied domcfg.nsf and any resource reservations databases to be very, very sure, but we didn’t this time.
We also connected to the remote (outsourced) odd-numbered machines and copied their notes.ini files, in case anything had changed in the 3 weeks since we had cut over to the host provider. We then reconfigured the outsourced even-numbered machines with a non-production IP address and had them powered down.
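One thing worth double-checking when you re-address a Domino box: if the server binds its port to a specific address in notes.ini rather than taking whatever the OS hands it, that line has to change as well. A sketch with a made-up address, assuming the standard TCPIP port name:
TCPIP_TcpIpAddress=0,10.10.10.52
Otherwise the port may fail to bind when the server comes up on the new IP.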
Then we went to the physical old servers (in-house), turned them on, made sure the IP addresses were configured as the production IP addresses, and copied over the administrative databases. We copied the old notes.ini files to the USB stick and brought them back to our admin desk. We then compared the in-house notes.ini files with the outsourced ones, noted any changes, and copied the updated notes.ini to our in-house machines.
After a couple of tricks on the networking side of things to make sure that there wasn’t any problem in physically connecting to the old, in-house machines, we powered them up just as they were.
We manually replicated the even-numbered machines with the odd-numbered machines before adding them back to the cluster together, but the way we chose to do this was to issue separate replication commands by directory. This way we could get each directory replicating at the same time with multiple replicators. Since these servers have 8 processors in them, we could realistically get 8 replicators going without any problem (and probably more).
example:
> load replica serverodd/domain mail
> load replica serverodd/domain apps
> load replica serverodd/domain acct
> load replica serverodd/domain marketing
This way replication does not step on itself either. If you just type “load replica” eight times, the replicators will eventually all get stuck trying to replicate the same database and actually slow things down a bit.
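As a side note, the per-directory commands work because “load replica” with a server name starts an extra replicator just for that run, which exits when it finishes. For scheduled replication, the number of replicator tasks the server keeps running is governed by the REPLICATORS setting in notes.ini, which is read at startup. A quick sketch, assuming you want eight of them to match the processor count:
> set configuration REPLICATORS=8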
Since the network was still connected, and we had a fast pipe between the outsource facility and the subnets that both server rooms sit on, the replication started to catch up pretty quickly.
We had to catch replication up on 3 weeks of changes. The old servers that remained in-house had been turned off after the cut-over, so any changes and updates made since then were missing from the even-numbered servers in the in-house facility, but of course they existed on the odd-numbered servers that had been live and in production at the outsource facility.
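If you want to keep an eye on a catch-up like this, the console will tell you what each replicator is working on. For example:
> show tasks
lists every replicator task along with its current activity, so you can see at a glance whether all eight are busy or sitting idle.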
Overall, it took us 2 to 3 hours (not including some quick planning to try and think through any “gotchas”) to undo a year of project planning to move our servers to an outsource facility.
Those 2 to 3 hours meant bringing 4 busy Domino servers back home.
We performed this operation from 3 to 7 AM, so by 9 AM, when the business day started, the replication had caught up.
We did have a few minor problems, such as new users and archives that had been created in the meantime and now existed on the odd-numbered servers but not the even-numbered ones. The only small downside to all of this is that when we create archives, we don’t always put them on both cluster mates, to save space. A few archives had been created on the even-numbered cluster mate at the outsourcing facility but not on the odd-numbered cluster mate, so we couldn’t replicate them manually from the odd-numbered server to the even-numbered server in-house.
The solution to this problem was to restore those databases from the even-numbered server’s backup image onto the odd-numbered server at the outsource facility; from there we could create a replica on the even-numbered server in-house.
A week later, the outsource facility still does not know why all of our Domino servers went down at the same time, other than that it was SAN related, and they are still looking into it.
Through all this, I’ve created this ideajam.net idea:
Create a way to synch the existence of databases between cluster members