Category Archives: High Availability

Configure an Exchange 2013 DAG on Windows Server 2012 R2 With No Administrative Access Point

DAG, Exchange 2013, High Availability

Exchange 2013 SP1 introduced support for Windows Server 2012 R2, and with it support for a new Windows Server 2012 R2 feature: failover clusters without an administrative access point. You can now create a DAG that does not need a separate IP address for the DAG itself on each subnet, and that no longer creates the cluster name object (CNO), the computer account you would normally see in Active Directory. The benefit of this feature is reduced complexity: you no longer need to manage the computer account for the DAG or assign IP addresses for each subnet on which the cluster operates. There are some downsides, but they shouldn't affect Exchange admins much. Mainly, since there is no IP address and no CNO, you cannot use the Windows Failover Cluster admin tools to connect to the cluster remotely; you need to run PowerShell locally against a cluster node. With Exchange this shouldn't be much of a problem, as almost all management of the cluster is handled with Exchange tools through management of the DAG itself.
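Since there is no CNO or cluster IP to connect to remotely, any cluster-level checks are done locally on a DAG member. Here is a minimal sketch using the FailoverClusters PowerShell module, run directly on one of the nodes (the AdministrativeAccessPoint property is my assumption of the most useful thing to look at; it should report None for this type of cluster):

# Run locally on a DAG member once the DAG has been created
Import-Module FailoverClusters
Get-Cluster | Format-List Name,AdministrativeAccessPoint
Get-ClusterNode | Format-Table Name,State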

In our example, we have two servers in separate AD sites that we are going to configure in our DAG:

PHDC-SOAE13MBX2

SFDC-SOAE13MBX2

We will create a DAG named SOA-DAG-2013. Previously, this would have been the name of the CNO that Exchange created underneath. It is now essentially a label stamped on all the nodes for management purposes; no CNO is created.

If we log in to the EAC and navigate to Servers->Database Availability Groups, we can create the DAG by clicking the plus sign:

image

Enter in the information for the DAG, and remember to specify your Witness Server.  It should be another Exchange 2013 Server in your primary datacenter location that is not also a member of the DAG.  We will specify one IP address of 255.255.255.255:

image

If we are doing this in PowerShell, the syntax is different:

New-DatabaseAvailabilityGroup -Name SOA-DAG-2013 -DatabaseAvailabilityGroupIPAddresses ([System.Net.IPAddress]::None) -WitnessServer NYDC-SOAE13CAS1.soa.corp -WitnessDirectory c:\WitnessDirectory\SOA-DAG-2013

 

image

 

Now, from here, building the DAG follows the same steps as before. Let's add the mailbox servers to the DAG. If you don't already have Windows Failover Clustering installed, these steps will install it for you.

From the EAC, under Database Availability Groups, select the DAG name and click on the server with the gear icon:

image

Add your servers to the DAG and click Save:

 

image

From the Exchange Management Shell:

Add-DatabaseAvailabilityGroupServer -Identity SOA-DAG-2013 -MailboxServer SFDC-SOAE13MBX2

image

And you're all set. The DAG has been configured with no administrative access point.

If we check the properties of the DAG in the EAC we can see the IP address is listed as 255.255.255.255:

 

image

And even though we passed [System.Net.IPAddress]::None in the PowerShell command, if we check the IP address in PowerShell, we still see only 255.255.255.255 listed as an IP address:

 

image
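If you want to reproduce that check yourself, a quick sketch from the Exchange Management Shell (I'm using wildcards on the property names since they vary slightly between builds):

Get-DatabaseAvailabilityGroup SOA-DAG-2013 | Format-List Name,Servers,WitnessServer,*IpAddress*,*Ipv4*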

Setting Up a Database Availability Group in Exchange 2013

DAG, Exchange 2013, High Availability

We'll walk through the steps of setting up a Database Availability Group, or DAG for short, in Exchange 2013. In our setup we have two mailbox servers:

PHDC-SOAE13MBX1 – IP Address of 10.220.10.2

SFDC-SOAE13MBX1 – IP Address of 10.10.10.60

There are a few requirements we need to meet before we can set up the DAG, and to ensure things run well once we do.

Requirement #1 – Windows Failover Clustering

For starters, Exchange DAGs use Windows Failover Clustering as the foundation of the DAG. This means that Windows Failover Clustering must be installed on each mailbox server that will be a member of the DAG. The Exchange 2013 DAG setup will perform the install for you, so there is no need to do it ahead of time. What you do need to know is that the underlying Windows operating system must be able to install it, meaning your Windows OS must be one of the following:

  • Windows Server 2008 R2 Enterprise or Datacenter
  • Windows Server 2012 Standard or Datacenter

Requirement #2 – Same Operating System

Since Windows Failover Clustering has this requirement, so do Exchange 2013 DAGs. All members of the DAG need to run the same OS version, meaning you cannot have one DAG member running 2008 R2 Enterprise and another running 2012 Standard.

Requirement #3 – One DAG per Server

Each mailbox server can only be a member of one DAG at a time. 

Requirement #4 – You Need an IP Address for the DAG in Each Subnet That Contains a Mailbox Server

Since the DAG is a cluster, the cluster needs an IP address in each subnet that contains a mailbox server. This is separate from the mailbox server's IP address. In our example, we are spread across two subnets:

10.220.10.0/24

10.10.10.0/24

This means we need an extra IP address for each subnet to assign to the DAG.  In our case we will use:

10.220.10.6

10.10.10.6

Requirement #5 – Name of DAG

The name of the DAG needs to meet NetBIOS requirements, meaning 15 characters or fewer.

In our example, we will use SOA-DAG-13.

Requirement #6 – Witness Server

We need a witness server that will be used when there is an even number of members in the DAG and a tie-breaking vote is needed. Best practice is to use an Exchange 2013 CAS server. Realistically, any Windows server will do, but you need to add the Exchange Trusted Subsystem group to the local Administrators group on that server before you can use it.

In our example we will use PHDC-SOAE13CAS1 and use the directory of C:\Witness\SOA-DAG-E13.soa.corp
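If you do use a non-Exchange server as the witness, granting the required rights is a one-liner run on that server. This is a sketch that assumes your domain's NetBIOS name is SOA:

net localgroup Administrators "SOA\Exchange Trusted Subsystem" /add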

Optional Requirement – Replication Networks

While it is not required, it is certainly a best practice to create a replication network. With that, you would have an extra NIC on each DAG member dedicated to replication traffic only. In larger installations this is certainly recommended, as seeding and replication can easily use a significant portion of the bandwidth.
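If you do add dedicated replication NICs, this is roughly what defining the replication network would look like once the DAG exists. The network name and subnet below are made up for illustration, and in Exchange 2013 you first have to switch the DAG to manual network configuration:

Set-DatabaseAvailabilityGroup SOA-DAG-13 -ManualDagNetworkConfiguration $true
New-DatabaseAvailabilityGroupNetwork -DatabaseAvailabilityGroup SOA-DAG-13 -Name "ReplicationNetwork" -Subnets 192.168.50.0/24 -ReplicationEnabled:$true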

So let’s get started. 

Pre-Stage the DAG CNO

The first thing we need to do is pre-stage the computer account for the DAG. Open AD Users and Computers and navigate to an OU that contains both of your mailbox servers. Right click and create a new computer object, filling in the details as necessary:

image

Next, right click the account you just created and select Disable Account

image

Now, we need to assign permissions to this account so that the mailbox servers, as well as the Exchange Trusted Subsystem group, are allowed to manipulate the object.

In AD Users and Computers, go to View->Advanced Features

image

Right click the account you created and go to Properties->Security tab.

Add the following objects to the computer account with Full Control

  • Exchange Trusted Subsystem
  • Each mailbox server

(You only really need to add the first mailbox server that you are using to create the DAG, but to keep things uniform you can add both.)

image

image

image

Ensure that AD replication has finished before moving on.
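If you prefer to pre-stage the disabled computer account from PowerShell instead of the GUI, a minimal sketch using the ActiveDirectory module looks like this (the OU path is hypothetical, and you still assign the permissions as shown above):

Import-Module ActiveDirectory
New-ADComputer -Name "SOA-DAG-13" -Path "OU=Servers,DC=soa,DC=corp" -Enabled $false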

Creating the DAG

First, let's log into the Exchange Admin Center on either of the servers. You can get there by going to https://servername/ecp .

Navigate to Servers->Database Availability Groups.  In our example, you can see my pre-existing Exchange 2010 DAG:

image

Click on the + symbol at top to create a new DAG, and enter in the information:

image

Note that we have added the name of the DAG, the witness server, and the witness directory. Then we add the IPs we have assigned to the DAG itself underneath. Click Save to create the DAG.
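If you would rather create the DAG from the Exchange Management Shell, the equivalent command should look roughly like this, using the values from our example:

New-DatabaseAvailabilityGroup -Name SOA-DAG-13 -WitnessServer PHDC-SOAE13CAS1 -WitnessDirectory C:\Witness\SOA-DAG-E13.soa.corp -DatabaseAvailabilityGroupIPAddresses 10.220.10.6,10.10.10.6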

Now, if we double click and open the DAG, we will note there is nothing in it.  We need to add mailbox servers to it:

image

Back on the DAG screen, select your DAG object, then select the little server with the gear on it.  This will allow us to manage the membership of the DAG.

image

Select your mailbox servers:

image

 

And click save to begin adding them to the DAG:

image

If you didn't install Windows Failover Clustering beforehand, this will install it on each node for you. The entire process can take about five minutes per server.

Eventually the DAG will be complete:

image
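The same membership change can be made from the Exchange Management Shell; a sketch for our two servers:

Add-DatabaseAvailabilityGroupServer -Identity SOA-DAG-13 -MailboxServer PHDC-SOAE13MBX1
Add-DatabaseAvailabilityGroupServer -Identity SOA-DAG-13 -MailboxServer SFDC-SOAE13MBX1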

Setting DAC Mode on the DAG

Once your DAG is done, there is one last step you should take. On an Exchange 2013 server, open the Exchange Management Shell and run the following command to enable DAC mode:

Set-DatabaseAvailabilityGroup -Identity SOA-DAG-13 -DatacenterActivationMode DagOnly

image

The purpose of DAC mode is to help prevent split brain, and it also allows you to use the Stop-DatabaseAvailabilityGroup and Start-DatabaseAvailabilityGroup commands for datacenter switchovers.
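To confirm the setting took effect, a quick check:

Get-DatabaseAvailabilityGroup SOA-DAG-13 | Format-List Name,DatacenterActivationMode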

Now you're all set; the last thing to do is to add copies of the databases on each DAG member.

Enjoy!

How to Use Managed Availability in Exchange 2013 with your Load Balancer

Exchange 2013, High Availability, Managed Availability, Netscaler

One of the major changes in Exchange 2013 is the concept of Managed Availability. I won't go too deep into it, but it is the ability of Exchange 2013 to monitor itself, detect problems, and attempt to resolve them. One of the added bonuses is that Managed Availability knows when a particular application is working and able to serve data. One specific place we can take advantage of this with third party tools is with hardware load balancers.

One of the jobs of a hardware load balancer is to detect the health of the servers it is load balancing, something Managed Availability is already doing itself! Hardware load balancers can detect health in a variety of ways. The most basic is ping, which just checks whether the host responds; the obvious problem is that the host could be up while none of its services are. The next is to check whether a port is accessible; here you configure the load balancer to check whether, say, port 443 is alive. This is better than ping, but it doesn't check whether the application behind the port is actually working, just that something answers on 443. Instead, we can use Managed Availability with our hardware load balancer to check whether the application itself is actually healthy.

How do we do that? Say we want an HLB to check whether OWA is healthy on a server. The normal path would be https://servername/owa, right? Well, if you navigate to https://servername/owa/healthcheck.htm, you will get a page generated on the spot indicating whether OWA is working on that server.

For example, say we have two Exchange 2013 servers:

PHDC-SOAE13CAS1 – 10.220.10.3

PHDC-SOAE13CAS2 – 10.220.10.4

And we want to publish OWA through an HLB as email.company.com at IP address 10.220.10.1

If we navigate to https://PHDC-SOAE13CAS1/owa/healthcheck.htm on this server with working OWA, we get the following page:

image

Not a lot to it, but essentially it's returning a 200 OK response indicating the service is working. If the service were not working, this page would not be generated. So we can have our HLB check whether it gets a 200 OK response from a particular server.
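You can run the same test the load balancer will run from PowerShell. A quick sketch, assuming the certificate on the server is trusted for that name (otherwise the request will throw a certificate error):

# 200 means the OWA health set on that server is reporting healthy
(Invoke-WebRequest -Uri "https://PHDC-SOAE13CAS1/owa/healthcheck.htm" -UseBasicParsing).StatusCode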

We want to configure these checks in our load balancer for services such as OWA, ActiveSync, Outlook Anywhere, and so on. We will configure Exchange 2013 behind Citrix's NetScaler in this example; the configuration will be similar on other HLBs, but we'll go through the NetScaler steps here.

On the NetScaler, go to Load Balancing->Monitors and click Add to create a new monitor. Here, we will create a custom monitor so that the NetScaler can poll that web page. Name the monitor MONITOR-EXCHANGE2013_OWA and set the type to HTTP-ECV. Be sure to select the Secure check box at the bottom, as this will be over SSL. Leave the other options at their defaults:

image

Next, click on the Special Parameters tab.  In the Send String box enter in GET /owa/healthcheck.htm.  In the Receive String box, enter in 200 OK:

image

Click Create to save, and navigate over to Load Balancing->Service Groups

Add a new Service Group, name it Exchange2013-OWA, and set the protocol to SSL. Enter the IP addresses of your CAS servers and set their ports to 443

image

Next, click on the Monitors tab. Find the MONITOR-EXCHANGE2013_OWA monitor we created above and add it to the Configured Monitors selection

image

Click on the SSL Settings tab and select the SSL certificate that you will use to publish the Netscaler service to the internet.  I have a preloaded one named Lab-2013 that I will be using:

image

Click Create to save the Service Group.

If we check, our service group should be reporting as up. That's good; it means our monitor is working!

image

Next let's go to Load Balancing->Virtual Servers

Create a new Virtual Server and name it Exchange2013_OWA, set the Protocol to SSL, and assign it an IP, in our case 10.220.10.1. Leave the SSL Port at 443.

image

Select the Service Groups tab and select the service group Exchange2013-OWA we created earlier:

image

Then click on the SSL Settings tab and select the same certificate as you did on the service group, in our case Lab-2013:

image

Click Create to save the virtual server.

Next, make sure your DNS record is pointing to the IP address of the virtual server, and let's try to log in:

image

There we go! There is our OWA page! But the question is, how do we test that our monitor is working? That's easy. On PHDC-SOAE13CAS2, let's go into IIS Manager and stop the MSExchangeOWAAppPool:

image

If we check our Netscaler, we see that one of the servers is now being reported as down:

image

If we try to telnet to that server on port 443:

image

We can see it works fine:

image

I know it doesn't show much, but it proves that the server is still listening on port 443. This also shows why using Managed Availability with your HLB is much better: a standard port check would have reported the server as healthy and kept sending user requests to it, even though OWA isn't actually working. Because we are using Managed Availability, we are passing application-layer health knowledge on to our HLB.

If we try to go back to OWA:

image

The HLB sees that one server is down and sends everything to the server that's still up. But if both servers have their application pools stopped:

We get an HTTP error stating the service is unavailable:

image

This works for all of the Exchange web services, so you can create separate monitors just by appending healthcheck.htm to the end of each URL. For ActiveSync it's https://servername/Microsoft-Server-ActiveSync/healthcheck.htm. The only one with a stipulation is OWA, which requires Forms Based Authentication to be enabled for it to serve a HealthCheck.htm page.
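If you want to spot-check all of the health pages on a server before building the monitors, here is a rough sketch. The list of virtual directories is my assumption; trim it to the services you actually publish:

$server = "PHDC-SOAE13CAS1"
$vdirs = "owa","ecp","EWS","Microsoft-Server-ActiveSync","rpc","Autodiscover","oab"
foreach ($v in $vdirs) {
    try {
        # Each virtual directory exposes its own healthcheck.htm
        $code = (Invoke-WebRequest -Uri "https://$server/$v/healthcheck.htm" -UseBasicParsing).StatusCode
    } catch {
        $code = "FAILED"
    }
    "{0,-30} {1}" -f $v, $code
}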

I hope you have found this helpful, and hopefully it will save you some configuration steps (and gain you some uptime!) on your hardware load balancer.

Witness Server Boot Time, GetDagNetworkConfig and the pain of Exchange 2010 DR Tests

Exchange 2010, High Availability

 

So we recently had a client who wanted to perform a DR test of their Exchange 2010 DAG. The DAG consisted of a single all-in-one server in production and a single all-in-one server in DR. The procedure for this test was to disconnect all network connectivity between prod and DR, shut down the Exchange server and the domain controller, snapshot them, and then start them back up.

Now, we can all agree that snapshots and domain controllers are inherently dangerous, so it's up to you to ensure that you have your ducks in a row so that this doesn't replicate back to production. That discussion is outside the scope of this article.

Now, initially they had trouble bringing up the databases in DR, as well as many components of the DAG.  This article will walk through an example, and try to make sense of what’s causing these issues.

So here is our setup: we have a two-node DAG stretched across two sites.

Production

PHDC-SOAEXC01 – Prod all in one Exchange Server

PROD-DC01 – Prod domain controller

PHDC-SOADC01 – Primary witness server

DR

SFDC-SOAEXC01 – DR all in one Exchange Server

DR-DC01 – DR domain controller

SFDC-SOADC01 – Alternate witness server

The DAG name is SOA-DAG-01 and the Active Directory Sites are:

Prod = PH

DR = SF

So in our scenario, we shut down both PHDC-SOAEXC01 and PHDC-SOADC01. This will cause the databases in DR to dismount because the DR server has lost quorum.

Now, in a DR "test", we would shut down the DR Exchange server and the DR domain controller to snapshot them. I just want to warn you: DO NOT EVER roll a domain controller back to a snapshot in a production environment. This is a purely hypothetical setup. Rant over.

Now, in our case, we have snapshotted and rebooted DR-DC01 and SFDC-SOAEXC01. When we open the Exchange Management Console, we see that the DR server's databases are in a failed state:

image

Now, lets start running through the DR activation steps.  Here is what the process should normally be:

  1. Stop the mailbox servers in the prod site
  2. Stop the cluster service on all mailbox servers in the DR site
  3. Restore the mailbox servers in the DR site, evicting the prod servers from the cluster

After step 3, the databases should mount, but as you will see, they won't, and I'll try to explain why.

So, step 1, lets mark the prod servers as down:

Stop-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite PH -ConfigurationOnly

You should expect to see some errors; this is completely expected because the prod site is unreachable, hence the -ConfigurationOnly option:

image

Now, step 2, we will stop the clustering service on SFDC-SOAEXC01 with the PowerShell command:

Stop-Service ClusSvc

Now, step 3, we will restore the dag with just the servers in DR:

Restore-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF

You may get an error stating:

Server 'PHDC-SOAEXC01' in database availability group 'SOA-DAG-01' is marked to be stopped, but couldn't be removed from the cluster. Error: A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API 'EvictClusterNodeEx('PHDC-SOAEXC01.SOA.corp') failed with 0x46'.

Simply re-run the command again and it should complete:

image

So now, we should have the databases mounted, and we should be able to see the prod servers as stopped by running the following command:

Get-DatabaseAvailabilityGroup -Status | FL

But behold, we get an error stating "GetDagNetworkConfig failed on the server. Error: The NetworkManager has not yet been initialized":

image

So here is the first roadblock. What happened is that since the DR server is a one-node cluster, it uses the boot time of the alternate file share witness to determine whether it is allowed to form quorum. A single node could otherwise always form quorum by itself, so this check exists to prevent split brain. Tim McMichael does a great job of explaining it in his blog post. Essentially, the boot time is stored in the registry of the Exchange server under:

HKEY_LOCAL_MACHINE\Software\Microsoft\ExchangeServer\v14\Replay\Parameters
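If you are curious what is stored there, you can dump the key on the DR node. Treat this as exploratory, since the exact value names vary:

Get-ItemProperty -Path "HKLM:\Software\Microsoft\ExchangeServer\v14\Replay\Parameters"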

The Exchange server checks whether it was rebooted more recently than the AFSW; if it was, it will not form quorum. So how do we fix this? We can start by rebooting the AFSW to see what behavior changes.

After we do so, we can re-run:

Get-DatabaseAvailabilityGroup -Status | FL

Now, we get the network and stopped servers info, but there are some entries that are in a broken state, and we get the message that the DAG witness is in a failed state:

image

Note that the WitnessShareInUse field reports InvalidConfiguration

We have to re-run our Restore-DatabaseAvailabilityGroup command to resolve this:

Restore-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF

Now if we re-run Get-DatabaseAvailabilityGroup -Status | FL, we get the expected output:

image

Now, we see that the WitnessShareInUse is set to the alternate.

So, are the databases mounted!? If we check, they are no longer failed, but are “Disconnected and Resyncing”

image

We need to force the server in DR to start because of the single node quorum issue.  This can be done with the following command:

Start-DatabaseAvailabilityGroup SOA-DAG-01 -ActiveDirectorySite SF

Now the database is mounted:

image

So, you can see how the way the test is performed affects what happens during the DR test, and how the single-node cluster setup itself can cause this issue. The boot time of the alternate file share witness is also extremely important to what the node can do when it restarts.

Hopefully you find the info useful!  Happy Holidays to all!

How to View Disconnected Mailboxes and Purge Disconnected Mailboxes from Exchange 2010

Exchange 2010, High Availability

 

To view disconnected mailboxes, essentially mailboxes that have been deleted from their user accounts, you first need to ensure that Exchange has gone through and cleaned the database. This ensures that it marks the mailbox as deleted. If your database is MDB36, run the following command:

Clean-MailboxDatabase MDB36

image

Exchange gives no result from the command.  But now you can view Disconnected mailboxes through the “Disconnected Mailbox” view in the EMC:

image


You can also view it in the shell by running the following command:

Get-MailboxStatistics -Database MDB36 | where {$_.DisconnectDate -ne $null}

image

And you will receive the following output:

image

By default, Exchange 2010 keeps disconnected mailboxes in the database for 14 days. But say you want to remove this mailbox now and return its white space to the database. You need to remove the mailbox from the shell.
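As an aside, that 14-day window is the database's MailboxRetention setting, which you can check or change. A quick sketch (the 30-day value is just an example):

Get-MailboxDatabase MDB36 | Format-List Name,MailboxRetention
Set-MailboxDatabase MDB36 -MailboxRetention 30.00:00:00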

To do that, first get the GUID for the mailbox by running the command:

Get-MailboxStatistics -Database MDB36 | where {$_.DisconnectDate -ne $null} | select DisplayName,MailboxGUID

image

And you will receive the following output:

image

Now run the following command to remove the mailbox:

Remove-Mailbox -Database MDB36 -StoreMailboxIdentity 7b40b106-5941-4de0-9fce-27ede21c474e

image

You'll receive a confirmation prompt; just accept it, and you're all set:

image

Enjoy!

Exchange 2010 Databases Fail to Replicate or Seed

Exchange 2010, High Availability

 

I recently had a colleague who ran into an issue with an Exchange 2010 migration. He could fail the mailbox databases over to DR with no issue, but that's where the trouble started.

The production database would start to report a high copy queue length that increased as more activity occurred on the DB. The production database, pure and simple, was not receiving the transaction logs from the newly activated database in DR.

The setup was simple: two nodes, one in production and one in DR, with an FSW. The nodes were all-in-one 2010 boxes, with one NIC for MAPI and one NIC for replication.

My colleague also informed me that he had some trouble initially seeding the database.  All roads pointed to an issue with replication.  We quickly checked his replication network settings and found the following setup:

image

His DAG network had the replication subnets for both sites under one network object. Once we moved them into their own separate networks:

image
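For reference, making that change from the shell would look roughly like this. The DAG name, network names, and subnets below are placeholders for illustration:

New-DatabaseAvailabilityGroupNetwork -DatabaseAvailabilityGroup DAG1 -Name "Replication-DR" -Subnets 192.168.20.0/24 -ReplicationEnabled:$true
Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\Replication-Prod" -Subnets 192.168.10.0/24 -ReplicationEnabled:$true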

Everything went back to normal!

Till the next time!

Users Receive a Login Prompt After a Database Failover in Exchange 2010

Client Access, Exchange 2010, High Availability, Outlook Anywhere

 

When performing a database switchover or failover in Exchange 2010, especially a planned one, it is supposed to be as seamless as possible for the end user. An Outlook user, for example, will receive a small pop-up informing them that the connection to Exchange has been lost, and their Outlook may hang for a couple of seconds before reconnecting and resuming normal behavior.

I recently ran into an issue where this was not the case.  Users would call the helpdesk as soon as a database failover was initiated with complaints that their outlook was prompting them for a login:

Jan. 2601 09.54

Well, needless to say, this is not expected behavior. After a little troubleshooting that involved a packet capture, it turned out that the Outlook clients were making an HTTPS call to the CAS servers at that moment. It was an attempt to connect to them over Outlook Anywhere, and this was the reason for the login prompt. When I checked the Outlook clients, they did in fact have the Outlook Anywhere settings enabled, due to Autodiscover. To check the settings in Outlook 2007, navigate to Tools->Account Settings->Change->More Settings->Connection. If yours is enabled, it will look like the following:

Jan. 2603 09.55

Under Exchange Proxy Settings, you’ll find the settings enabled:

Jan. 2604 09.55

Since these were internal clients and had no need to use Outlook Anywhere, simply unselect Connect to Microsoft Exchange using HTTP:

Jan. 2606 09.58

This will stop the Outlook client from connecting over Outlook Anywhere. The only issue is that if you use automatic profile generation through Group Policy, it leverages Autodiscover, so it will keep putting the setting back. You can do one of two things. The first is to delete the Outlook Anywhere provider using the Remove-OutlookProvider command, which is NOT recommended, as it will stop Autodiscover from publishing Outlook Anywhere GLOBALLY.
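For reference, you can list the Autodiscover providers from the shell before deciding; the EXPR provider is the one tied to the Outlook Anywhere settings:

Get-OutlookProvider | Format-Table Name,Server,CertPrincipalName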

The second is to use Group Policy. Create a blank GPO named something like Disable Outlook Anywhere Settings. Download the Outlook Anywhere ADM template from here, and import it as an administrative template under User Configuration. You'll now have the Outlook Anywhere (RPC/HTTP) options available in Group Policy:

Jan. 2608 11.23

The only value you need to edit here is the RPC/HTTP Connection Flags setting:

Jan. 2609 11.24

Edit the setting, set it to Enabled, and choose No Flags

Jan. 2610 11.25

After it has been applied, this will disable Connect to Microsoft Exchange using HTTP in the Outlook clients; notice how it's greyed out:

Jan. 2611 11.25

Once this GPO has been applied to all your users, you should be able to fail over databases without the users receiving a login prompt.

How to Update DAG Members to SP1 in Exchange 2010

Exchange 2010, High Availability

 

Exchange 2010 SP1 has been released and comes with a slew of new and exciting features. Since we are all clamoring to get it installed in our environments, we should discuss exactly how to upgrade the members of our DAG so as to provide zero downtime to our users and get our systems patched correctly.

From a high level view, remember we should be patching servers in the following order:

  1. Client Access Servers
  2. Hub Transport Servers
  3. Edge Transport Servers
  4. Mailbox Servers

The process we discuss in this article can and should be applied to ALL updates to members of a DAG, not just major updates like service packs.

In our environment we have three total nodes:

  1. NYDAGNODE1
  2. NYDAGNODE2
  3. BOSDAGNODE1

Let’s start with BOSDAGNODE1.  Step one would be to move all active databases on BOSDAGNODE1 to another server.  You want to ensure that it does not have any active databases on it.  We can accomplish this in two ways.

In the EMC, navigate to Server Configuration->Mailbox. Select the server in question and then select Switchover Server:

Sep. 2301 10.43

Then you can either automatically choose a target server, or specify one yourself:

image

Or, through the shell, you can run the command:

Move-ActiveMailboxDatabase -Server DKPBOSDAGNODE1

Sep. 2302 10.45

This will do the same thing as the console, and will automatically choose a target server.

Next, we want to block DKPBOSDAGNODE1 from activating its databases. This will prevent other servers from failing over to it.

Set-MailboxServer DKPBOSDAGNODE1 -DatabaseCopyAutoActivationPolicy Blocked

Sep. 2303 10.47
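Before kicking off the upgrade, it is worth confirming the node really has no active copies left; a quick check:

# Every copy on this node should show Healthy rather than Mounted before you patch it
Get-MailboxDatabaseCopyStatus -Server DKPBOSDAGNODE1 | Format-Table Name,Status,CopyQueueLength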

Now you can upgrade the server to Exchange 2010 SP1!

After it's finished, run the following command to re-enable activation on the node:

Set-MailboxServer DKPBOSDAGNODE1 -DatabaseCopyAutoActivationPolicy Unrestricted

Sep. 2304 10.50

Now you can proceed with the other nodes until you're finished!

One caveat you need to be aware of: Exchange 2010 RTM servers can fail over TO a DAG member running Exchange 2010 SP1, but a server running Exchange 2010 SP1 cannot fail over to a DAG member NOT running Exchange 2010 SP1. So make sure your entire DAG is upgraded in a timely fashion!

As for your schedule, note that I chose to do Boston, our DR site, first, because NY would still be able to fail over to it in case of a DR situation. Next, I can do one node at a time in NY. This allows me to keep my mailboxes local to my production site while ensuring that I am covered should a DR situation arise.

How to Manage a Datacenter Failure or Disaster Recovery Scenario in Exchange 2010 – Part 3

Exchange 2010, High Availability

In Part 3 of this series, we are going to discuss how to fail back to your primary site after whatever condition caused it to go offline has been resolved.

So when we last left off, we had activated all of our databases in our disaster recovery site. I showed you how we had to stop the mailbox servers in the primary Active Directory site, which then allowed us to activate our disaster recovery site.

So now, let's say the flood that occurred at the New York site has been dealt with, all the water has been removed, and thankfully it did not damage our equipment. We are ready to move everything back to NYC. We begin by powering on the servers in NY. Since our DAG is in DAC mode, the NY servers do not try to mount their copies of the databases. Instead, they begin copying over any log files to bring their copies of the databases up to date with the DR site. Wait until all the databases are reporting a status of Healthy:

Now, remember, we told the DAG that the NY servers were down with the Stop-DatabaseAvailabilityGroup command from Part 2:

Notice how NYMB01 and NYMB02 are both listed as Stopped Mailbox Servers. You will get this error in the console if you try to activate the database on one of those two servers:

What we have to do is start these servers in the DAG, making them viable copies for activation again. We do this with the Start-DatabaseAvailabilityGroup command. Again, we could use the -MailboxServer parameter and specify each server, separating them with a comma, or we can specify the whole site with the -ActiveDirectorySite option as below:

Start-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite NYC

Now if we run the Get-DatabaseAvailabilityGroup command, all servers are listed as started

Now, Microsoft recommends dismounting the databases that are going to be moved back to the primary datacenter site. This will mean that you will need to get a maintenance window to perform this action, as the databases will be offline. Keep in mind that you do not NEED to dismount the databases; it is just recommended by Microsoft.

Now we can activate the database. You can do this either from the shell with the Move-ActiveMailboxDatabase MDB01 -ActivateOnServer NYMB02 command:

Or through the EMC with the Activate a Database Copy Wizard:

Now the database should be activated. One issue that I have seen pop up is that the catalog index is corrupted. This is the index used for full-text search. If this is the case, you may have to reseed the catalog index with the following command:

Update-MailboxDatabaseCopy MDB02\NYMB02 -SourceServer DRMB01 -CatalogOnly

This command will reseed the content index from the server DRMB01. You should now be able to activate the database copy.

Continue by moving all active database copies to their respective servers. If you dismounted the databases as stated above, now mount these databases.
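For a single database, that step looks roughly like this; repeat it for each database, matching it to the NY server that should host the active copy. The dismount and mount lines only apply if you chose to dismount as recommended above:

Dismount-Database MDB01 -Confirm:$false
Move-ActiveMailboxDatabase MDB01 -ActivateOnServer NYMB02
Mount-Database MDB01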

Now, our clients are still connected, but they are still connecting through DRHT01.nygiants.com because it is set as the RpcClientAccessServer:

17-Dec09 15.08

We simply need to run the command Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer NYHT01.nygiants.com to change this setting:

If we check the properties of the databases, we see that NYHT01.nygiants.com is again set as the Rpc Client Access server:

Now, our Outlook clients will be connecting through NYHT01 for their access:

Well, that’s it!

In this three-part series, I showed you how to activate the backup datacenter in the event that the primary datacenter became unavailable. We discussed the theory behind the process, mainly the Datacenter Activation Coordinator, the actual work needed to do it, as well as how to fail back when the primary site comes back online.

If you have any questions, feel free to email me at ponzekap2 at gmail dot com.

How to Manage a Datacenter Failure or Disaster Recovery Scenario in Exchange 2010 – Part 2

Exchange 2010, High Availability

 

In the first article of this series, we went over some of the premises of Exchange 2010, Database Availability Groups (DAGs), and the Datacenter Activation Coordinator. We discussed our test environment, as well as how, in theory, an Exchange 2010 DAG handles the failure of a datacenter.

In this article, I'll show you how to actually activate your disaster recovery site, should your primary site go down.

One of the first things we need to take into consideration is how the clients, most importantly Outlook, connect to the DAG. We discussed in the first article how there is a new service on Exchange 2010 Client Access Servers, the Microsoft Exchange RPC Client Access service. Outlook clients now connect to the Client Access Server, and the Client Access Server connects to the Mailbox Server; Outlook clients no longer connect directly to the Mailbox Servers. This becomes significant in a disaster recovery situation.

Let's look at the output of the command:

Get-MailboxDatabase | Select Name,*rpc*

09-Dec02 18.20 

The RpcClientAccessServer value on a particular mailbox database indicates that connections to that database are passed through the Client Access Server listed. So, for all the databases above, all Outlook connections have to go through NYHT01.nygiants.com. (If you have more than one Client Access Server per site, which you should for redundancy and load balancing, you can create a load-balanced array using either Windows Network Load Balancing or a hardware load balancer, and change this value to point to the array's host name.) Let's look at our Outlook client, and we notice that all connections are being passed to NYHT01.nygiants.com:

09-Dec03 18.23

Alright, now, we are ready to start failing over servers!  My test user, Paul Ponzeka, is located on a mailbox database named MDB02 which is running on server NYMB02:

09-Dec04 18.27

So, to simulate a datacenter failure, we’ll just pull the power on ALL NY servers:

NYDC01

NYHT01

NYMB01

NYMB02

NY-XP1 (client machine)

So now all of our NY machines are off; let's check the status of the databases on our DR server located in Boston:

09-Dec05 18.40

Well, that's not good, huh? Notice how the DBs for the two copies in NY are listed as ServiceDown under copy status? Also note that the copy on DRMB01, the Mailbox Server in Boston, is Disconnected and Healthy. The reason the DB is not mounted is the DAC mode enabled on the DAG, which we discussed in Part 1 of this series. DRMB01 dismounts ALL of its mailbox databases because a majority of DAG members is not available, thus it can't make quorum. It dismounts to prevent a possible split brain scenario, but in this case we REALLY need to get this database activated. How do we do that? We need to remove the NY servers as active participating members of the DAG. We do this through the Exchange Management Shell.

If we note the current status of the DAG with the command:

Get-DatabaseAvailabilityGroup | select Name,*server*

09-Dec06 18.47

Notice that the Servers value lists DRMB01, NYMB01 and NYMB02 as servers in the DAG, and lists them again under StartedMailboxServers? Well, we need to tell the DAG that NYMB01 and NYMB02 are no longer "started" or operational. We use the following command for that:

Stop-DatabaseAvailabilityGroup.

Our command can specify each server that’s down, one by one:

Stop-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer NYMB01,NYMB02

Or, since we lost the entire NY site, we can specify the whole Active Directory site:

Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC

Now, since our mailbox servers in NY are actually unreachable, we want to specify the -ConfigurationOnly option at the end of this command. Otherwise, the command attempts to actually stop the mailbox services on every mailbox server in the NYC site, which causes it to take an extremely long time to complete:

09-Dec07 18.55

Now, if we re-run the command:

Get-DatabaseAvailabilityGroup | select Name,*server*

09-Dec08 18.56

We notice that the NY mailbox servers, NYMB01 and NYMB02 are both now listed as Stopped Mailbox Servers.

Next, ensure that the clustering service is stopped on DRMB01:

09-Dec09 18.57

Now, we want to tell the DR site that it should restart with the new settings (both NY servers missing):

Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite DR

09-Dec10 18.59

You will see a progress bar indicating it's adjusting quorum and the cluster for the new settings. Don't be alarmed if you see an error about the command not being able to contact the downed mailbox servers.

If we return to the Exchange Management Console, all our databases in the DAG have been mounted on DRMB01!

09-Dec11 19.01

Great right?  But our clients are still having trouble connecting.  What’s the problem?

09-Dec12 19.02

The reason is that NYHT01.nygiants.com is still listed as the RPC Client Access Server for this database:

09-Dec13 19.05

We have two choices. The first is to change the DNS record for NYHT01.nygiants.com to be a CNAME for DRHT01.nygiants.com. The second, and faster, method is to change the RPC Client Access Server to DRHT01.nygiants.com with the following command:

Set-MailboxDatabase MDB02 -RpcClientAccessServer DRHT01.nygiants.com

09-Dec14 19.07

To do it for every DB that you failed over is simple:

10-Dec01 08.54
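The screenshot above is essentially the piped form of the same command, which should look something like this:

Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer DRHT01.nygiants.com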

Now, back to the Outlook client and voila!

09-Dec17 19.11

Now all your messaging services are back up and running in your disaster recovery site, with limited downtime for your end users.

In this article we discussed how to fail over a datacenter to your backup or disaster recovery datacenter, should your primary go offline.

In the next and final part of this series, I'll show you how to fail back to the primary datacenter site, which some admins think is even more terrifying than failing over!

Stay Tuned!