
How to Manage a Datacenter Failure or Disaster Recovery Scenario in Exchange 2010 – Part 2

Exchange 2010, High Availability


In the first article of this series, we went over some of the premises of Exchange 2010, Database Availability Groups (DAGs), and Datacenter Activation Coordination (DAC) mode.  We discussed our test environment, as well as how, in theory, an Exchange 2010 DAG handles the failure of a datacenter.

In this article, I'll show you how to actually activate your disaster recovery site, should your primary site go down.

One of the first thing’s we need to take into consideration is how the clients, most importantly Outlook, connect to the DAG.  We spoke in the first article how there is a new service in Exchange 2010 Client Access Servers, called the Microsoft Exchange RPC Client Access Service.  Outlook clients now connect to the Client Access Server, and the Client Access Server connects to the Mailbox Server.  This means Outlook clients don’t connect directly to the Mailbox Server’s anymore.  This becomes significant in a disaster recovery situation.  

Let's look at the output of the command:

Get-MailboxDatabase | Select Name,*rpc*

[Screenshot: Get-MailboxDatabase output listing each database's RpcClientAccessServer value]

The RpcClientAccessServer value on a particular mailbox database indicates that connections to that database are passed through the Client Access Server listed.  So, for all the databases above, all Outlook connections have to go through NYHT01.nygiants.com.  (If you have more than one Client Access Server per site, which you should for redundancy and load balancing, you can create a cluster using either Windows Network Load Balancing or a hardware load balancer, and change this value to point to the cluster host name; a sketch of that change follows the screenshot below.)  Looking at our Outlook client, we notice that all connections are being passed to NYHT01.nygiants.com:

[Screenshot: Outlook connection status showing all connections directed to NYHT01.nygiants.com]
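As an aside, if you did front your Client Access Servers with a load-balanced name, repointing a database at it would look something like this (the host name casarray.nygiants.com is hypothetical, purely for illustration):

Set-MailboxDatabase MDB01 -RpcClientAccessServer casarray.nygiants.com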

Alright, now we are ready to start failing over servers!  My test user, Paul Ponzeka, is located on a mailbox database named MDB02, which is running on server NYMB02:

[Screenshot: test user's mailbox located on database MDB02, mounted on NYMB02]

So, to simulate a datacenter failure, we’ll just pull the power on ALL NY servers:

NYDC01

NYHT01

NYMB01

NYMB02

NY-XP1 (client machine)

So now, all of our NY machines are off.  Let's check the status of the databases from our DR servers located in Boston.
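From the Exchange Shell, something like this returns the status of every copy of a given database:

Get-MailboxDatabaseCopyStatus -Identity MDB02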

[Screenshot: database copy status showing the two NY copies as ServiceDown and the DRMB01 copy as Disconnected and Healthy]

Well, that’s not good huh?  Notice how the DB’s for the two copies in NY are listed as ServiceDown under copy status?  Also note that DRMB01, the Mailbox Server in Boston’s copy of the database is Disconnected and Healthy.  The reason the DB is not mounted, is because of the DAC mode enabled on the DAG, which we discussed in Part 1 of this series. DRMB01 dismounts ALL its Mailbox Databases because there are not a majority number of DAG members available, this it can’t make quorum. It dismounts to prevent a possible split brain scenario, but in this case, we REALLY need to get this activated.  How do we do this?  What we need to do is remove the NY server members as being active participating members of the DAG.  We do this through the Exchange Shell.

If we note the current status of the DAG with the command:

Get-DatabaseAvailabilityGroup | select Name,*server*

[Screenshot: Get-DatabaseAvailabilityGroup output listing the Servers and StartedMailboxServers values]

Notice that the Servers value lists DRMB01, NYMB01, and NYMB02 as servers in the DAG, and lists them again under StartedMailboxServers?  Well, we need to tell the DAG that NYMB01 and NYMB02 are no longer "started," or operational.  We use the following command for that:

Stop-DatabaseAvailabilityGroup

Our command can specify each server that’s down, one by one:

Stop-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer NYMB01,NYMB02

Or, since we lost the entire NY site, we can specify the whole Active Directory site:

Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC

Now, since our Mailbox Servers in NY are actually unreachable, we want to add the -ConfigurationOnly switch to the end of this command.  Otherwise, the command attempts to actually stop the Mailbox services on every Mailbox Server in the NYC site, which causes it to take an extremely long time to complete:
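So, putting it together, the full command we run is:

Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC -ConfigurationOnly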

[Screenshot: Stop-DatabaseAvailabilityGroup running with the -ConfigurationOnly switch]

Now, if we re-run the command:

Get-DatabaseAvailabilityGroup | select Name,*server*

[Screenshot: Get-DatabaseAvailabilityGroup output with NYMB01 and NYMB02 now listed under StoppedMailboxServers]

We notice that the NY Mailbox Servers, NYMB01 and NYMB02, are both now listed under StoppedMailboxServers.

Next, ensure that the clustering service is stopped on DRMB01:
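One way to check it, and stop it if needed, right from PowerShell on DRMB01 (ClusSvc is the service name for Windows Failover Clustering; net stop clussvc from a command prompt works just as well):

Get-Service ClusSvc    # check whether the Cluster service is still running
Stop-Service ClusSvc   # stop it before running Restore-DatabaseAvailabilityGroup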

[Screenshot: Cluster service stopped on DRMB01]

Now, we want to tell the DAG in the DR site to restart with the new settings (both NY servers missing):

Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite DR

[Screenshot: Restore-DatabaseAvailabilityGroup in progress]

You will see a progress bar indicating it's adjusting the quorum and the cluster for the new settings.  Don't be alarmed if you see an error about the command not being able to contact the downed Mailbox Servers.

If we return to the Exchange Management Console, all our databases in the DAG have been mounted on DRMB01!

[Screenshot: Exchange Management Console showing all of the DAG's databases mounted on DRMB01]
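You can verify the same thing from the shell; something like this lists each database and whether it's mounted (the -Status switch is what populates the Mounted column):

Get-MailboxDatabase -Status | Select Name,Server,Mounted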

Great, right?  But our clients are still having trouble connecting.  What's the problem?

[Screenshot: Outlook client unable to connect]

The reason is that NYHT01.nygiants.com is still listed as the RPC Client Access Server for this database.
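That's quick to verify from the shell:

Get-MailboxDatabase MDB02 | Select Name,RpcClientAccessServer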

[Screenshot: MDB02 still listing NYHT01.nygiants.com as its RpcClientAccessServer]

We have two choices.  The first is to change DNS so that NYHT01.nygiants.com becomes a CNAME for DRHT01.nygiants.com.  The second, and faster, method is to change the RPC Client Access Server to DRHT01.nygiants.com with the following command:

Set-MailboxDatabase MDB02 -RpcClientAccessServer DRHT01.nygiants.com

[Screenshot: Set-MailboxDatabase changing the RpcClientAccessServer to DRHT01.nygiants.com]

Doing it for every database you failed over is simple:
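A one-liner along these lines takes care of it, repointing every database that is still aimed at the downed Client Access Server (a sketch; adjust the Where filter to match your environment):

Get-MailboxDatabase | Where {$_.RpcClientAccessServer -like "NYHT01*"} | Set-MailboxDatabase -RpcClientAccessServer DRHT01.nygiants.com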

[Screenshot: updating the RpcClientAccessServer for every failed-over database]

Now, back to the Outlook client and voilà!

[Screenshot: Outlook connected again, now through DRHT01.nygiants.com]

Now all your messaging services are back up and running in your disaster recovery site, with limited downtime for your end users.

In this article we discussed how to fail over a datacenter to your backup or disaster recovery datacenter, should your primary go offline.

In the next and final part of this series, I'll show you how to fail back to the primary datacenter site, which some admins think is even more terrifying than failing over!

Stay Tuned!

Creating a Database Availability Group in Exchange 2010 – Part 1

Exchange 2010, High Availability


As you may or may not have heard, Microsoft has announced that the next version of their messaging suite, Microsoft Exchange 2010, will be available later this year!  The new version of Exchange brings many improvements and feature additions on top of Exchange 2007.  One of the most exciting ones announced was a feature called Database Availability Groups, or DAGs.  In this four-part series, we'll go over the concepts of DAGs, and how to get one working and test it out.

So, what is a DAG?  A DAG is the evolution of the CCR and SCR functionality that was introduced in Exchange 2007.  CCR allowed you to keep two copies of your databases in a cluster, protecting against both server failure and corruption of the database.  SCR allowed you to add site resiliency to your Exchange design by replicating data to your disaster recovery site and activating it if needed.  CCR and SCR have now been rolled into the DAG feature.  The best part is that most of the legwork, as well as the activation of the data, is automatic!  It still uses the concept of log shipping for the replication, although it's been much improved.  Let's get into a little bit of how it works.

The first thing you need to know is that the storage architecture in Exchange 2010 is different from that of Exchange 2007.  To start with, there are no more storage groups; transaction logs and checkpoint files are all based off of the mailbox database now.  Microsoft was moving away from storage groups anyway, especially when you consider that any type of replication in Exchange 2007 required a maximum of one database per storage group.  The next big change is that databases are no longer objects of a server, but objects of the Exchange organization itself.  What exactly does that mean?  Well, take a look at this screenshot from the Exchange 2010 Management Console:


[Screenshot: Exchange 2010 Management Console, Organization Configuration node, showing two databases and the database copies that make up each]

If you notice, I am under the Organization Configuration node, not the Server Configuration node.  The picture shows two databases, each hosted on a different server.  On the bottom half of the screen, the console lists the database copies that make up each particular database.  A DAG consists of multiple copies of a set of databases, any of which can be activated as the active copy at any time.  You can have up to 16 servers in a DAG, meaning you can have up to 16 separate copies of one database!  For example, you could have two servers in your main datacenter, each with a copy of one database for high availability in your main site, and then a third copy of the database in your disaster recovery site, in case you lost your main datacenter.  Members of a DAG do not have to be members of the same AD site, like stretched 2008 CCR clusters did.  Each of these copies can be activated at any time, automatically if you have a failure, or manually by running some commands.

The last major point is that no client connects directly to a Mailbox Server anymore, including Outlook.  Outlook clients connect to a Client Access Server, just like a POP or IMAP client does to reach its mailbox.  This allows for incredibly quick failovers (30 seconds or less) of the Outlook client to a new copy of the database.

So now that we have an idea of the high-level concept, let's take a look at actually setting one up.  Here is my lab environment.  I have two separate AD sites.  New York has two subnets, 10.1.1.0/24 and 172.16.1.0/24, and London has two subnets, 192.168.1.0/24 and 172.17.1.0/24.  In NY, production traffic will run over the 10.1.1.0/24 network, and replication and heartbeat over 172.16.1.0/24.  In London, production is 192.168.1.0/24, and replication and heartbeat 172.17.1.0/24.  Now, since these are two separate sites, both replication networks need to be able to contact each other, which means they need to be routable to each other; in our case they are.  You can use a stretched VLAN instead, but that is a much more complicated scenario for no real benefit.  In each site, I have a single Domain Controller that is also a Client Access Server and Hub Transport Server, as well as one machine with just the Mailbox role installed.  It should be noted that one of the coolest features of the DAG is that the Mailbox role does not have to be installed by itself for a server to be part of a DAG.  You can have any combination of roles installed, and it will still work EXACTLY the same.  Below is a Visio diagram of the setup:

[Diagram: lab environment with the NY and London sites, showing the client and replication networks for each]

Exchange 2010 is already installed on all of the servers; you do NO DAG customization during the install, as it is all done afterwards.  This means you do not have to re-install Exchange if you decide down the road to make a server part of a DAG.  Let's take a look at the network configuration, starting with the NY server.

[Screenshot: network connections on the NY server showing the Client and Replication NICs]

I have two NIC’s, one labeled “Client” and one labeled “Replication”.  The client NIC, is configured as normal, with an IP, Subnet Mask, Gateway, all the regular stuff.  The replication NIC should only be configured with an IP and subnet, NO DEFAULT GATEWAY:

[Screenshot: Replication NIC configured with an IP address and subnet mask only, no default gateway]
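For reference, the same configuration can be applied from the command line with netsh; the address 172.16.1.10 here is hypothetical, and note that no gateway is specified:

netsh interface ipv4 set address name="Replication" static 172.16.1.10 255.255.255.0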

Now, select the Advanced button and go to the DNS tab.  At the bottom, uncheck the box "Register this connection's addresses in DNS":

[Screenshot: DNS tab with "Register this connection's addresses in DNS" unchecked]

Next, select the WINS tab and choose the radio button to disable NetBIOS over TCP/IP:

[Screenshot: WINS tab with NetBIOS over TCP/IP disabled]

After this, select OK to save your settings and return to the Network Connections screen.  Select Advanced -> Adapters and Bindings, and make sure your production or "Client" NIC is listed above "Replication":

[Screenshot: adapter binding order with the Client NIC listed above Replication]

Now, you may be wondering about the missing default gateway on the heartbeat network.  If you add a default gateway on two different NICs, Windows presents you with a warning:

[Screenshot: Windows warning about configuring default gateways on multiple NICs]

Hmm, seems like this most certainly pertains to us.  Also, DAGs still use the Windows Failover Clustering feature of Windows Server 2008, and having a configuration with a default gateway on the replication or heartbeat NIC is not supported, as very odd behavior can be exhibited.  So the question is: if the networks are routed, how do we tell the replication NIC on one node how to get to the replication networks of the other nodes?  For this, we add static routes to the individual servers' routing tables.  Tim McMichael had a great article about this, and you can read it here.

So, on the NY node, we want it to contact the LN node’s replication network of 172.17.1.0/24 on its replication network of 172.16.1.0/24.  The gateway on the NY side is 172.16.1.254, so we run the following command:

route add 172.17.1.0 MASK 255.255.255.0 172.16.1.254 -p

[Screenshot: route add command completing successfully]

The -p switch makes the route persistent across reboots.  We can check that it was successful with the route print command:

[Screenshot: route print output showing the persistent route to 172.17.1.0/24]

So now, all replication and heartbeat traffic should pass through the specific replication NIC, over the replication network, to the replication NIC of the London node.  Repeat this step for ALL your replication networks, on all nodes.  For the London node, with a gateway on the London replication network of 172.17.1.254, the command back to NY would be:

route add 172.16.1.0 MASK 255.255.255.0 172.17.1.254 -p

Okay, that does it for Part 1 of this series.  We went over the basic concepts of the DAG and how to set up the networking for it.  In the next part, we'll go over how to create the DAG and add nodes to it.  Stay tuned.