Witness Server Boot Time, GetDagNetworkConfig and the pain of Exchange 2010 DR Tests

Exchange 2010, High Availability

 

We recently had a client who wanted to perform a DR test of their Exchange 2010 DAG.  The DAG consisted of a single all-in-one server in production and a single all-in-one server in DR.  The procedure for this test was to disconnect all network connectivity between prod and DR, shut down the Exchange server and the domain controller in DR, snapshot them, and then start them back up.

Now, we can all agree that snapshots and domain controllers are inherently dangerous, so it’s up to you to have your ducks in a row and make sure that nothing replicates back to production.  That discussion is outside the scope of this article.

Now, initially they had trouble bringing up the databases in DR, as well as many components of the DAG.  This article will walk through an example, and try to make sense of what’s causing these issues.

So, here is our setup: a two-node DAG cluster stretched across two sites.

Production

PHDC-SOAEXC01 – Prod all in one Exchange Server

PROD-DC01 – Prod domain controller

PHDC-SOADC01 – Primary witness server

DR

SFDC-SOAEXC01 – DR all in one Exchange Server

DR-DC01 – DR domain controller

SFDC-SOADC01 – Alternate witness server

The DAG name is SOA-DAG-01 and the Active Directory Sites are:

Prod = PH

DR = SF

So in our scenario, we shut down both PHDC-SOAEXC01 and PHDC-SOADC01.  This causes the databases in DR to dismount, because the DR server has lost quorum.
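
To make the quorum loss concrete, here is a small sketch of the vote arithmetic. It assumes the usual Node and File Share Majority model for a two-member DAG (two node votes plus one witness vote). The variable names are purely illustrative and nothing here queries the cluster.

```powershell
# Illustrative arithmetic only; this does not talk to the cluster.
$nodeVotes    = 2   # PHDC-SOAEXC01 and SFDC-SOAEXC01
$witnessVotes = 1   # file share witness on PHDC-SOADC01
$totalVotes   = $nodeVotes + $witnessVotes

# A majority is more than half of all votes.
$votesNeeded  = [math]::Floor($totalVotes / 2) + 1

# With prod and the primary witness offline, only the DR node can vote.
$votesInDR    = 1
$hasQuorum    = $votesInDR -ge $votesNeeded

"Votes needed: $votesNeeded, votes reachable from DR: $votesInDR, quorum: $hasQuorum"
```

With two of three votes required and only one reachable, the DR node cannot keep quorum on its own, which is why the databases dismount.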

Now, in a DR “test”, we would shut down the DR Exchange server and the DR domain controller to snapshot them.  I just want to warn you: DO NOT EVER roll a domain controller back to a snapshot in a production environment.  This is a purely hypothetical setup.  Rant over.

Now, in our case, we have snapshotted and rebooted DR-DC01 and SFDC-SOAEXC01.  When we open the Exchange Management Console, we see that the DR server’s databases are in a failed state:

image

Now, let’s start running through the DR activation steps.  Here is what the process should normally be:

  1. Stop the mailbox servers in the prod site
  2. Stop the cluster service on all mailbox servers in the DR site
  3. Restore the mailbox servers in the DR site, evicting the prod servers from the cluster

After step 3, the databases should mount, but as you will see, they won’t, and I’ll try to explain why.

So, step 1, let’s mark the prod servers as down:

Stop-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite PH -ConfigurationOnly

You should expect to see some errors.  This is completely expected because the prod site is unavailable, hence the -ConfigurationOnly option:

image

Now, step 2, we will stop the Cluster service on SFDC-SOAEXC01 with the PowerShell command:

Stop-Service ClusSvc
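
Before moving on, it may be worth confirming that the service really stopped. A minimal check, assuming it is run locally on SFDC-SOAEXC01:

```powershell
# Confirm the Cluster service is stopped before running the restore.
Get-Service ClusSvc | Format-List Name, Status
```

The Status field should read Stopped.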

Now, step 3, we will restore the DAG with just the servers in DR:

Restore-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF

You may get an error stating:

Server ‘PHDC-SOAEXC01’ in database availability group ‘SOA-DAG-01’ is marked to be stopped, but couldn’t be removed from the cluster. Error: A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API ‘"EvictClusterNodeEx(‘PHDC-SOAEXC01.SOA.corp’) failed with 0x46.

Simply re-run the command and it should complete:

image

So now, we should have the databases mounted, and we should be able to see the prod servers as stopped by running the following command:

Get-DatabaseAvailabilityGroup -Status | FL

But, behold, we get an error stating: GetDagNetworkConfig failed on the server.  Error: the NetworkManager has not yet been initialized.

image

So, here is the first roadblock.  Because the DR server is now a one-node cluster, it uses the boot time of the alternate file share witness (AFSW) to determine whether it is allowed to form quorum.  A one-node cluster would otherwise always have quorum, so this check is there to prevent split brain.  Tim McMichael does a great job of explaining it in his blog post.  Essentially, the boot time is stored in the registry of the Exchange server under:

HKEY_LOCAL_MACHINE\Software\Microsoft\ExchangeServer\v14\Replay\Parameters

If the Exchange server was rebooted more recently than the AFSW, it will not form quorum.  So how do we fix it?  We can start by rebooting the AFSW to see what behavior changes.
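
If you want to see where you stand before rebooting anything, you can compare the two boot times yourself. The sketch below is only an illustration of the rule as described above, not how Exchange implements it internally. Both function names are hypothetical helpers rather than Exchange cmdlets, and it assumes WMI access to both servers.

```powershell
# Hypothetical helper, not an Exchange cmdlet: returns $true when the
# witness booted after the DAG member, which per the description above
# is the case in which the member is allowed to form quorum.
function Test-WitnessBootedLater {
    param(
        [datetime]$MemberBootTime,
        [datetime]$WitnessBootTime
    )
    return ($WitnessBootTime -gt $MemberBootTime)
}

# Hypothetical helper: reads a computer's last boot time through WMI.
function Get-LastBootTime {
    param([string]$ComputerName)
    $os = Get-WmiObject -Class Win32_OperatingSystem -ComputerName $ComputerName
    return $os.ConvertToDateTime($os.LastBootUpTime)
}

# Example usage with the servers from this article:
# $member  = Get-LastBootTime -ComputerName 'SFDC-SOAEXC01'
# $witness = Get-LastBootTime -ComputerName 'SFDC-SOADC01'
# Test-WitnessBootedLater -MemberBootTime $member -WitnessBootTime $witness
```

A result of False would match the situation in this article, where the Exchange server was rebooted after the AFSW.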

After we do so, we can re-run:

Get-DatabaseAvailabilityGroup -Status | FL

Now, we get the network and stopped servers info, but there are some entries that are in a broken state, and we get the message that the DAG witness is in a failed state:

image

Note that the WitnessShareInUse field reports InvalidConfiguration.

We have to re-run our Restore-DatabaseAvailabilityGroup command to resolve this:

Restore-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF

Now if we re-run Get-DatabaseAvailabilityGroup -Status | FL we get the expected output:

image

Now, we see that the WitnessShareInUse is set to the alternate.
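
The full FL output is long, so it can help to narrow it down to the fields that matter for this walkthrough. A minimal sketch, assuming it is run from the Exchange Management Shell on the DR server:

```powershell
# Show only the DAG status fields relevant to this walkthrough.
Get-DatabaseAvailabilityGroup SOA-DAG-01 -Status |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers,
                OperationalServers, WitnessServer, AlternateWitnessServer,
                WitnessShareInUse
```

At this point the prod server should be listed under StoppedMailboxServers and WitnessShareInUse should read Alternate.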

So, are the databases mounted!? If we check, they are no longer failed, but are “Disconnected and Resyncing”

image

We need to force the server in DR to start because of the single node quorum issue.  This can be done with the following command:

Start-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF

Now the database is mounted:

image
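
If you prefer the shell over the console for checking database state, something along these lines should show the same information. It is a sketch that assumes the Exchange Management Shell and the DR server name used in this article:

```powershell
# List the status of every database copy hosted on the DR server.
Get-MailboxDatabaseCopyStatus -Server SFDC-SOAEXC01 |
    Format-Table Name, Status, CopyQueueLength, ContentIndexState -AutoSize
```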

So, as you can see, the way the test is carried out affects what happens during the DR test, and the single-node cluster setup contributes to the issue as well.  The boot time of the alternate file share witness is also extremely important to what the node can do when it restarts.

Hopefully you find the info useful!  Happy Holidays to all!

Comments:

  1. KRISHNA

    I am getting the below error when going through the above article.
    [PS] C:\Windows\system32>Restore-DatabaseAvailabilityGroup DAG -ActiveDirectorySite Site2

    Confirm
    Are you sure you want to perform this action?
    Restoring Mailbox servers for Active Directory site “Site2” in database availability group “DAG”.
    [Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is “Y”): a
    WARNING: The operation wasn’t successful because an error was encountered. You may find more details in log file
    “C:\ExchangeSetupLogs\DagTasks\dagtask_2013-05-10_07-59-29.367_restore-databaseavailabilitygroup.log”.
    A database availability group administrative operation failed. Error: Unable to form quorum for database availability group ‘DAG’. Please try the operation again, or run the Restore-DatabaseAvailabilityGroup cmdlet and specify the site with servers known to be running.

  2. KRISHNA

    After failover the DR site was working, but after restarting the DR site all mail databases are unable to mount. I get “unable to form a quorum” and “An Active Manager operation failed” in the DR site, and the Cluster service is in a Disabled state. Because of this the DAG network does not initialise or show in EMC.
    Please help for the same.

  3. Kman

    Is the above article accurate for Exchange 2013 as well? Especially when it comes to the last remaining DAG member not forming the cluster if the AFSW wasn’t rebooted last?

    Thanks

