We recently had a client who wanted to perform a DR test of their Exchange 2010 DAG. The DAG consisted of a single all-in-one server in production and a single all-in-one server in DR. The procedure for this test was to disconnect all network connectivity between prod and DR, shut down the Exchange server and the domain controller, snapshot them, and then start them back up.
Now, we can all agree that snapshots and domain controllers are an inherently dangerous combination, so it's up to you to have your ducks in a row so that nothing replicates back to production. That discussion is outside the scope of this article.
Initially, they had trouble bringing up the databases in DR, as well as many components of the DAG. This article will walk through an example and try to make sense of what's causing these issues.
So, here is our setup: a two-node DAG stretched across two sites.
PHDC-SOAEXC01 – Prod all in one Exchange Server
PROD-DC01 – Prod domain controller
PHDC-SOADC01 – Primary witness server
SFDC-SOAEXC01 – DR all in one Exchange Server
DR-DC01 – DR domain controller
SFDC-SOADC01 – Alternate witness server
The DAG name is SOA-DAG-01 and the Active Directory Sites are:
Prod = PH
DR = SF
So in our scenario, we shut down both PHDC-SOAEXC01 and PHDC-SOADC01. This causes the databases in DR to dismount, because the DR server has lost quorum.
Now, in a DR "test", we would shut down the DR Exchange server and the DR domain controller to snapshot them. I just want to warn you: DO NOT EVER roll a domain controller back to a snapshot in a production environment. This is a purely hypothetical setup. Rant over.
Now, in our case, we have snapshotted and rebooted DR-DC01 and SFDC-SOAEXC01. When we open the Exchange Management Console, we see that the DR server's databases are in a Failed state:
Now, let's start running through the DR activation steps. Here is what the process should normally be:
- Stop the mailbox servers in the prod site
- Stop the cluster service on all mailbox servers in the DR site
- Restore the mailbox servers in the DR site, evicting the prod servers from the cluster
After step 3, the databases should mount, but as you will see, they won't, and I'll try to explain why.
So, step 1, let's mark the prod servers as down:
You should expect to see some errors; this is completely expected because the prod site is unreachable, hence the -ConfigurationOnly option:
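The command for this step isn't shown above, but based on the site names in our setup (prod = PH), it would look something like this:

```powershell
# Mark the prod (PH) site's mailbox servers as stopped in the DAG.
# -ConfigurationOnly updates the DAG configuration in Active Directory
# without trying to contact the unreachable prod servers.
Stop-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite PH -ConfigurationOnly
```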
Now, step 2, we will stop the Cluster service on SFDC-SOAEXC01 with the following PowerShell command:
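The command itself is missing from the original text; run locally on SFDC-SOAEXC01, it would simply be:

```powershell
# Stop the Windows Failover Cluster service on the DR node
Stop-Service ClusSvc
```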
Now, step 3, we will restore the DAG with just the servers in DR:
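This is the same Restore-DatabaseAvailabilityGroup command we will run again later, scoped to the DR (SF) Active Directory site:

```powershell
# Restore the DAG using only the members in the SF (DR) site,
# evicting the stopped prod servers from the cluster.
Restore-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF
```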
You may get an error stating:

Server 'PHDC-SOAEXC01' in database availability group 'SOA-DAG-01' is marked to be stopped, but couldn't be removed from the cluster. Error: A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"EvictClusterNodeEx('PHDC-SOAEXC01.SOA.corp')" failed with 0x46.
Simply re-run the command again and it should complete:
So now we should have the databases mounted, and we should be able to see the prod servers as stopped by running the following command:
Get-DatabaseAvailabilityGroup -Status | FL
But, behold, we get an error stating: "GetDagNetworkConfig failed on the server. Error: The NetworkManager has not yet been initialized."
So, here is the first roadblock. What happened is that since the DR server is a single node, it uses the boot time of the alternate file share witness (AFSW) to determine whether it is allowed to form quorum. A one-node cluster always has quorum on its own, so this check exists to prevent split-brain. Tim McMichael does a great job of explaining it in his blog post. Essentially, the boot time is stored in the registry of the Exchange server under:
If the Exchange server was rebooted more recently than the AFSW, it will not form quorum. So how do we fix this? We can start by rebooting the AFSW to see what behavior changes.
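To see which side of the comparison you are on, you can compare last boot times yourself. This is a hypothetical check for our lab (server names are from the setup above), using standard WMI queries rather than anything Exchange-specific:

```powershell
# Compare the last boot time of the DR Exchange node against the
# alternate file share witness (AFSW). If the Exchange node booted
# more recently, it will refuse to form quorum on its own.
$exchange = Get-WmiObject Win32_OperatingSystem -ComputerName SFDC-SOAEXC01
$witness  = Get-WmiObject Win32_OperatingSystem -ComputerName SFDC-SOADC01
$exBoot = $exchange.ConvertToDateTime($exchange.LastBootUpTime)
$wtBoot = $witness.ConvertToDateTime($witness.LastBootUpTime)
if ($exBoot -gt $wtBoot) {
    Write-Host "Exchange node booted after the AFSW - quorum will not form"
}
```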
After we do so, we can re-run:
Get-DatabaseAvailabilityGroup -Status | FL
Now we get the network and stopped-server info, but some entries are in a broken state, and we get a message that the DAG witness is in a failed state:
Note that the WitnessShareInUse field reports InvalidConfiguration.
We have to re-run our Restore-DatabaseAvailabilityGroup command to resolve this:
Restore-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF
Now if we re-run Get-DatabaseAvailabilityGroup -Status | FL, we get the expected output:
Now we see that WitnessShareInUse is set to the alternate witness.
So, are the databases mounted? If we check, they are no longer Failed, but are "Disconnected and Resyncing":
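The check itself isn't shown above; one way to see the copy status on the DR node is with the standard copy-status cmdlet:

```powershell
# Show the status of all database copies on the DR Exchange server
Get-MailboxDatabaseCopyStatus -Server SFDC-SOAEXC01
```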
We need to force the server in DR to start because of the single-node quorum issue. This can be done with the following command:
Start-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF
Now the database is mounted:
As you can see, both the snapshot-and-reboot sequence of the test itself and the single-node cluster design contribute to these issues. The boot time of the alternate file share witness is also critical to what the node can do when it restarts.
Hopefully you find the info useful! Happy Holidays to all!