Category Archives: High Availability

How to Manage a Datacenter Failure or Disaster Recovery Scenario in Exchange 2010 – Part 1

Exchange 2010, High Availability

 

Exchange 2010 introduced several high availability and disaster recovery features; the one that receives the most publicity is the Database Availability Group (or DAG for short).  In short, a DAG replicates mailbox databases to other servers in the DAG, and a replica can be activated automatically within 30 seconds, restoring users' access to their mailboxes.  For more information, see my article series on DAGs here.

The automatic failover is great for High Availability within a datacenter, or even across datacenters.  For instance, consider the following diagram:

[Diagram: DAG database copy layout across the NY and Boston servers]

Here, the green copies are the active copies, the ones users are actually accessing for their mailboxes.  The yellow and red copies can be activated should the active copy go offline.  Consider the possibility that MDB01 on server NYMB01 goes offline; the copy on NYMB02 would be activated automatically within 30 seconds.  Next, the drive holding database MDB01 on server NYMB02 fails, causing THIS copy to go offline as well.  In that case, the copy of MDB01 on DRMB01 in Boston would be activated within 30 seconds, and users would be able to access their mailboxes across the WAN link to Boston!  This is all part of the design of the DAG, and is great from a High Availability standpoint. 

But, as we know, High Availability and Disaster Recovery are COMPLETELY separate.  High Availability means providing your users with high uptime, or access to the application.  Disaster Recovery is the ability of the application to function when a catastrophic event happens, such as the destruction of a datacenter or, worse, the building holding the datacenter.  This last part is what we will cover in these articles.  To do so, there is a feature of DAGs that we need to talk about, and that is the Datacenter Activation Coordinator, or DAC.

DAC is a setting on a DAG that has three or more member mailbox servers and is extended to multiple Active Directory sites.  So, a higher-level view of our Exchange environment is below:

[Diagram: DAG1 spanning the NY and Boston sites]

NYMB01, NYMB02 and DRMB01 are all part of the same DAG, let's call it DAG1, and all these servers are located in the NYGIANTS.COM domain.

Now, our DAG fits the criteria for DAC mode: three or more member servers, spread across multiple Active Directory sites.  So, what is DAC mode?

DAC mode is, quite simply, a mechanism to prevent the possibility of a split brain in your Exchange environment.  Consider the following scenario.  As per the first diagram, you have MDB01 and MDB02, and both active copies run in NY, on NYMB01 and NYMB02 respectively.  NYHT01 is running the file share witness.  A file share witness is a server that only participates if the DAG has an even number of members; it's used to "break" any tie in voting regarding whether a server is down or not.  The NY site is connected via a WAN connection to Boston, where DRMB01 hosts replicas of MDB01 and MDB02.  Say there is a cut to the WAN connection, and for whatever reason NY and Boston can no longer communicate, but neither side is truly offline.  The Boston side, since it can no longer contact the NY servers, assumes they are down, mounts its copies of MDB01 and MDB02, and marks them as active.  Since NY is still operational, it STILL has its copies of MDB01 and MDB02 mounted and active.  This is a split brain scenario: both sites believe they are the rightful owner of the databases, and have thus mounted their respective copies.  This would cause a divergence in data.  For example, if an outside user sends an email to a user at nygiants.com and it's received in NY, it gets delivered to his mailbox in NY.  If another user sends the same nygiants.com user an email and it's received by Boston, it gets delivered to that user's mailbox in Boston.  Each mailbox is now different, which is a huge problem.  This is the issue with a split brain scenario, and it is what DAC was built to protect against. 

DAC does this by preventing the DR servers from mounting their databases.  DAC requires that a majority of the DAG members be available for the DAG to be able to make an operational decision, in this case the DR servers mounting their databases.  A DAG that has the majority of its member servers available is said to have quorum.  So, in our previous example where the line was severed, DR would NOT mount its databases.  Why not?  Because the DAG consists of 3 total members: NYMB01, NYMB02 and DRMB01.  According to DRMB01, it is the only surviving server, which is 1 out of 3 and not a majority, hence it cannot mount its databases.  Now, if you look at the first diagram, you will notice that MDB03 is green on DRMB01, meaning that the active copy of MDB03 is running on DRMB01.  Well, what happens in this scenario, where the WAN connection was cut?  Won't one of the NY servers mount MDB03?  Since DRMB01 has MDB03 already mounted, won't this cause the EXACT split brain scenario we are trying to avoid?  No.  Why not?  Remember how I said that the DAG needs to be able to make quorum?  In this case, since DRMB01 cannot make quorum, it is forced to dismount any database it has running.  In the event log, you'll see the following message:

[Screenshot: event log entry showing the database being dismounted for lack of quorum]

So, DRMB01 dismounts MDB03, which is instead mounted and activated in NY.  This is how the split brain scenario is avoided. 
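DAC mode is not on by default; it is enabled per DAG from the Exchange Management Shell.  As a sketch against the environment described above (the DAG name DAG1 comes from this article; substitute your own):

```powershell
# Enable Datacenter Activation Coordination (DAC) mode on the DAG.
# DagOnly is the only non-default value in Exchange 2010.
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly

# Verify the setting took effect.
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name,DatacenterActivationMode
```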

So what does this mean if there really is a need for a datacenter failover?  At one site I work at, there was a broken pipe in the tenant space above them, causing a flood that threatened to destroy their datacenter.  If the datacenter had been destroyed, how would we activate DR?  We'll go over that in Part 2 of this series.

In this article, we discussed mainly the theory and thought process behind DAGs, the Datacenter Activation Coordinator, and the concept of quorum with regards to the cluster. In the next article, we'll jump in and do an actual datacenter failover. 

Creating a Database Availability Group in Exchange 2010 – Part 3

Exchange 2010, High Availability

 

In Part 2 of this series, we created the Database Availability Group, and added both NYDAGNODE1 and LNDAGNODE1 to it.

In Part 3 of this series, we are going to configure the networks properly, and create some databases for the DAG.

As we noted last time, by default, every network for every node in a DAG is configured for replication.  We only want replication to occur over certain networks, namely 172.16.1.0 for NY and 172.17.1.0 for London.  If we navigate to Organization Configuration –> Mailbox, select the Database Availability Group tab and select DAG01, we see all the networks listed.

[Screenshot: DAG networks listed for DAG01]

Just a side note: if you right-click a network label, DagNetwork01 for example, you can rename it to something more descriptive.

[Screenshot: renaming a DAG network]

Now, for the two production networks, when you're in the properties page, un-check the "Replication Enabled" check box:

[Screenshot: the Replication Enabled check box un-checked]

Now it states, for both NY and LN Production, that replication is disabled.  This will ensure that all replication occurs over a dedicated network.

[Screenshot: replication disabled on both production networks]
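The same renaming and replication changes can be made from the shell.  A sketch, assuming a default DagNetwork name from the console and the descriptive names 'NY-Production' and 'LN-Production' (both are my placeholders; your numbering and labels may differ):

```powershell
# Rename an auto-generated network to something descriptive.
Set-DatabaseAvailabilityGroupNetwork -Identity 'DAG01\DagNetwork01' -Name 'NY-Production'

# Disable replication on the production networks, so replication
# traffic is confined to the dedicated replication networks.
Set-DatabaseAvailabilityGroupNetwork -Identity 'DAG01\NY-Production' -ReplicationEnabled:$false
Set-DatabaseAvailabilityGroupNetwork -Identity 'DAG01\LN-Production' -ReplicationEnabled:$false

# Review the result.
Get-DatabaseAvailabilityGroupNetwork -Identity DAG01 | Format-Table Name,ReplicationEnabled
```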

Now, it's time to add some databases to the DAG!  Move on to the Database Management tab.  You will note two databases here, both of which are the default ones for each server. 

[Screenshot: the Database Management tab showing the two default databases]

We can add these existing databases to the DAG by right-clicking each of them and selecting "Add Mailbox Database Copy".  You then select any free server to host a copy; in our case:

[Screenshot: the Add Mailbox Database Copy wizard]

Select Add, and it will add NYDAGNODE1 as a replica for the database.  Note the preferred list sequence number.  This indicates that NYDAGNODE1's copy of this database should be the second copy activated, should something happen to preferred list sequence number 1, which is the original copy on LNDAGNODE1.

The PowerShell command we could have run is listed:

[Screenshot: the Add-MailboxDatabaseCopy command generated by the wizard]
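As a hedged reconstruction of what that command looks like for this environment (the database name is a placeholder, since the default database names aren't shown in the text):

```powershell
# Add NYDAGNODE1 as the second-preference copy of the database
# whose original copy lives on LNDAGNODE1.
Add-MailboxDatabaseCopy -Identity 'Mailbox Database LN' -MailboxServer NYDAGNODE1 -ActivationPreference 2
```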

Now note, we have one database whose Copy Status is Mounted, and one whose Copy Status is Healthy.  Healthy means it's not in production; it is a replica. 

[Screenshot: database copies showing Mounted and Healthy statuses]

Note how it lists the servers that are hosting the database, as well as the Copy Queue Length, the Replay Queue Length, and the Preferred List Sequence Number.  The copy queue length is how many transaction logs are waiting to be copied to the node, the replay queue length is how many are waiting to be replayed into the database on that node, and the preferred list sequence number indicates which copy of the database Exchange should activate next, if the currently mounted one becomes unavailable.
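The same health information is available from the shell; a sketch (again, the database name is a placeholder):

```powershell
# Show the status and queue lengths for every copy of the database.
Get-MailboxDatabaseCopyStatus -Identity 'Mailbox Database LN' |
    Format-Table Name,Status,CopyQueueLength,ReplayQueueLength
```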

Adding a new mailbox database is very similar.  You create a new mailbox database, and select a node to host the first copy:

[Screenshot: the New Mailbox Database wizard]

And then just add extra copies as we did above.  All servers in the same DAG should have the same drive letter or mount point configuration.  This is because all copies will have the same path to the EDB file, as well as the same transaction log and system files paths.  Also, since mailbox databases are now objects of the organization, you need to ensure that their names are unique throughout the entire Exchange organization.
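As a sketch, creating a new database and seeding a second copy entirely from the shell might look like this (the name MDB03 and the path are my assumptions; adjust to your own layout, which must match on every DAG member):

```powershell
# Create the database on the node that will host the first copy.
# Because databases are organization-wide objects, 'MDB03' must be
# unique in the entire Exchange organization.
New-MailboxDatabase -Name 'MDB03' -Server NYDAGNODE1 -EdbFilePath 'C:\Databases\MDB03\MDB03.edb'

# Mount it, then add a replica on the other DAG member.
Mount-Database -Identity 'MDB03'
Add-MailboxDatabaseCopy -Identity 'MDB03' -MailboxServer LNDAGNODE1
```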

So, that's it for the third part of this series.  In this part, we configured the networks for replication, added copies of existing databases, and created new databases and copies in our DAG.

In the next part, I'll show you how to fail over to different copies of the mailbox databases, and the impact this has on the end user.

Creating a Database Availability Group in Exchange 2010 – Part 2

Exchange 2010, High Availability

 

In Part 1 of this series, we went over the basic concepts of the Database Availability Group, or DAG, and then went into how to set up the Networking for the DAG.  In this next section, we’ll cover how to create the DAG, and then add servers to that DAG.

The first thing we need to do is actually create the DAG.  In the Exchange Management Console, under Organization Configuration -> Mailbox, navigate to the Database Availability Group tab:

[Screenshot: the Database Availability Group tab]

Click on the “New Database Availability Group” Action.  You’ll be presented with the following screen:

[Screenshot: the New Database Availability Group wizard]

  1. Database Availability Group Name – This is just the name of the DAG.
  2. File Share Witness Share – This is the UNC path of a file share witness, most likely on an HT server.  This is used when there is an even number of servers in the DAG, to break ties in a majority vote.
  3. File Share Witness Directory – This is where the share is located on the server who is hosting it.  It will be created for you automatically.
  4. Network Encryption and Network Compression we’ll leave at the default.

With a Hub Transport server installed on NYDC01, and wanting the share to come from a folder called "DAG01" on the C: drive of that server, our screen will look like this:

[Screenshot: the completed New Database Availability Group wizard]

After hitting Next, you'll see the PowerShell command that could have been run:

New-DatabaseAvailabilityGroup -Name 'DAG01' -FileShareWitnessShare '\\NYDC01\DAG01' -FileShareWitnessDirectory 'C:\DAG01'

Now, you’ll have a DAG created, but with no member servers in it:

[Screenshot: DAG01 created with no member servers]

Now, let's add NYDAGNODE1 to it.  A couple of things should be noted.  First, DAGs require the Windows Failover Clustering feature to be installed.  If this isn't installed when you go to add a node to the DAG, the command will install it for you; it will just take a little bit longer.  The second issue is that we are using the Beta release of Exchange 2010.  There seems to be an issue with the Exchange Console being able to remotely initiate the installation.  To get around this when using the Beta, just make sure to install the Windows Failover Clustering feature from Server Manager yourself on all the nodes.  This will also help speed things up.

[Screenshot: installing the Failover Clustering feature from Server Manager]

Okay, so on to adding the first node to the DAG.  When you add the first node to a DAG, the DAG gets assigned an IP address.  If you do this through the Exchange Management Console, the DAG will retrieve an address through DHCP.  I'm not a huge fan of this, so I like to use the Exchange Management Shell, because you can statically assign an IP address to the DAG.  I'll show you both ways, though.  For the Exchange Management Console, navigate to Organization Configuration –> Mailbox and select the Database Availability Group tab.  Here you will see DAG01 listed, the DAG we created before.  Right-click it and select "Manage Database Availability Group Membership"; you'll be presented with this screen:

[Screenshot: the Manage Database Availability Group Membership dialog]

Now select the green Add button, then select NYDAGNODE1 and click OK.

[Screenshot: NYDAGNODE1 selected for addition]

You could now select Manage.  This would ensure the server had Failover Clustering installed (installing it if it didn't), and then add it to the DAG.  It would also retrieve an IP address from a DHCP server.  We won't finish this here; we'll do it in the shell. 

The command is really simple. 

Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer NYDAGNODE1 -DatabaseAvailabilityGroupIpAddresses 10.1.1.3

[Screenshot: running Add-DatabaseAvailabilityGroupServer in the shell]

This will add the server NYDAGNODE1 to the DAG DAG01, and assign the DAG the IP address 10.1.1.3.
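Once it completes, you can verify the membership and the assigned address from the shell.  A sketch (the property names are my recollection of the Exchange 2010 object; check yours with Format-List * if they differ):

```powershell
# Confirm the member servers and the statically assigned DAG IP.
Get-DatabaseAvailabilityGroup -Identity DAG01 |
    Format-List Name,Servers,DatabaseAvailabilityGroupIpv4Addresses
```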

Let the command run; it can take some time.  You'll see output similar to the below as it creates the cluster and adds the server to the DAG:

[Screenshot: progress output while the cluster is created]

Once the command finishes, you’ll see NYDAGNODE1 listed as a member server:

[Screenshot: NYDAGNODE1 listed as a member server]

If we now ping that IP, we see that we are getting a successful return:

[Screenshot: successful ping replies from 10.1.1.3]

Now, add the second node, LNDAGNODE1.  It works the same way as above, for the console or the shell.  If you use the shell, you can now omit the -DatabaseAvailabilityGroupIpAddresses parameter.  (Remember to log on locally to LNDAGNODE1, as the Beta fails when trying to do it remotely.  Also, it seems you need to use the Exchange Management Shell (Local) icon to add the second node successfully.)  The end result should look like this:

[Screenshot: both member servers listed in DAG01]
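For the shell route, the second node's command is just the first one without the IP parameter, since the DAG already has its address:

```powershell
# Run locally on LNDAGNODE1 (the Beta cannot initiate this remotely).
Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer LNDAGNODE1
```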

Note that there are now two member servers in the DAG, and the bottom half of the screen lists the networks and their status.  By default, ALL of the networks are configured for replication.  We'll configure this differently in the next part.

In this part, we created a DAG and added two members to it.  In the third part of this series, we'll configure the replication networks, create some databases, and set them up for replication!

Creating a Database Availability Group in Exchange 2010 – Part 1

Exchange 2010, High Availability

 

As you may or may not have heard, Microsoft has announced that the next version of their messaging suite, Microsoft Exchange 2010, will be available later this year!  The new version of Exchange hosts many improvements and feature additions on top of Exchange 2007.  One of the most exciting ones announced was a feature called Database Availability Groups, or DAGs.  In this four-part series, we'll go over the concepts of DAGs, and how to get one working and test it out. 

So, what is a DAG?  A DAG is the evolution of the CCR and SCR functionality that was introduced in Exchange 2007.  CCR allowed you to keep two copies of your databases in a cluster, protecting against both server failure and corruption of the database.  SCR allowed you to add site resiliency to your Exchange design by replicating data to your disaster recovery site and activating it if needed.  CCR and SCR have now been rolled into the DAG feature.  The best part is that most of the legwork, as well as the activation of the data, is automatic!  It still uses the concept of log shipping for the replication, although it's been much improved.  Let's get into a little bit of how it works.

The first thing you need to know is that in Exchange 2010, the storage architecture is different from that of Exchange 2007.  To start, there are no more storage groups.  Transaction logs and checkpoint files are all based off of the mailbox database now.  Microsoft was moving away from storage groups, especially when you consider that any type of replication in Exchange 2007 required a maximum of one database per storage group.  The next big change is that databases are no longer objects of a server, but objects of the Exchange organization itself.  What exactly does that mean?  Well, take a look at this screenshot of the Exchange 2010 management console:

 

[Screenshot: database copies shown under Organization Configuration]

Notice that I am under the Organization Configuration node, not the Server Configuration node.  The picture shows two databases, each hosted on a different server.  On the bottom half of the screen, the console lists the database copies that make up this particular database.  A DAG consists of multiple copies of a set of databases, any of which can be activated as the active copy at any time.  You can have up to 16 servers in a DAG, meaning you can have up to 16 separate copies of one database!  For example, you could have two servers in your main datacenter, each with a copy of one database for high availability in your main site, and then a third copy of the database in your disaster recovery site, in case you lost your main datacenter.  Members of a DAG do not have to be members of the same AD site, like stretched 2008 CCR clusters did.  Each of these copies can be activated at any time, automatically if you have a failure, or manually by running some commands.
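For the manual case, the switchover is a single shell command.  A sketch, using a database name borrowed from earlier in this archive as a placeholder:

```powershell
# Manually move the active copy of a database to another DAG member.
Move-ActiveMailboxDatabase -Identity 'MDB01' -ActivateOnServer NYDAGNODE1
```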

The last major point is that no client connects directly to a mailbox server anymore, including Outlook.  Outlook clients connect to a Client Access Server, just like a POP or IMAP client does to reach its mailbox.  This allows for incredibly quick failovers (30 seconds or less) of the Outlook client to a new copy of the database. 

So now that we have an idea of the high-level concept, let's take a look at actually setting one up.  Here is my lab environment.  I have two separate AD sites.  New York has two subnets, 10.1.1.0/24 and 172.16.1.0/24, and London has two subnets, 192.168.1.0/24 and 172.17.1.0/24.  In NY, production traffic will occur over the 10.1.1.0/24 network, and replication and heartbeat over 172.16.1.0/24.  In London, production is 192.168.1.0/24, and replication and heartbeat 172.17.1.0/24.  Now, since these are two separate sites, both replication networks need to be able to contact each other.  This means both networks need to be routable to each other, which in our case they are.  You can use a stretched VLAN instead, but that is a much more complicated scenario for no true benefit.  In each site, I have a single Domain Controller that is also a Client Access Server and Hub Transport server, as well as one machine with just the Mailbox role installed.  It should be noted that one of the coolest features of the DAG is that the Mailbox role does not have to be installed by itself for a server to be part of a DAG.  You can have any combination of roles installed, and it will still work EXACTLY the same.  Below is a Visio of the setup:

[Diagram: lab environment with the NY and London sites]

Exchange 2010 is installed on all machines; you do NO DAG customization during the install, it is all done afterwards.  This means you do not have to re-install Exchange if you decide down the road to make a server part of a DAG.  Let's take a look at the network configuration.  First, the NY server. 

[Screenshot: the two network connections on the NY server]

I have two NICs, one labeled "Client" and one labeled "Replication".  The Client NIC is configured as normal, with an IP, subnet mask, gateway, all the regular stuff.  The Replication NIC should only be configured with an IP and subnet mask, and NO DEFAULT GATEWAY:

[Screenshot: Replication NIC configured without a default gateway]

Now, select the Advanced button, and select the DNS tab.  At the bottom, un-check the box "Register this connection's addresses in DNS":

[Screenshot: DNS registration disabled on the Replication NIC]

Next select the WINS tab and select the radio button to disable NetBIOS over TCP/IP:

[Screenshot: NetBIOS over TCP/IP disabled on the WINS tab]

After this, select OK to save your settings and return to the Network Connections screen.  Select Advanced->Adapters and Bindings.  Make sure your production or “Client” NIC is listed above “Replication”:

[Screenshot: binding order with Client listed above Replication]

Now, you may be wondering about the missing default gateway on the heartbeat network.  If you add a default gateway on two different NICs, Windows presents you with a warning:

[Screenshot: Windows warning about multiple default gateways]

Hmm, seems like this most certainly pertains to us.  Also, DAGs still use the Windows Failover Clustering feature of Windows Server 2008.  Having a configuration with a default gateway on the replication or heartbeat NIC is not supported, as very odd behavior can be exhibited.  So the question is asked: if the networks are routed, how do we tell the replication NIC on one node how to get to the replication networks of the other nodes?  For this, we add static routes to the individual servers' routing tables.  Tim McMichael had a great article about this, and you can read it here.

So, on the NY node, we want it to contact the LN node's replication network of 172.17.1.0/24 from its own replication network of 172.16.1.0/24.  The gateway on the NY side is 172.16.1.254, so we run the following command:

route add 172.17.1.0 MASK 255.255.255.0 172.16.1.254 -p

[Screenshot: adding the persistent static route]

The -p makes the route persistent across reboots.  We can check that it was successful with the route print command:

[Screenshot: route print output showing the persistent route]

So now, all replication and heartbeat traffic should pass out the specific replication NIC, over the replication network, to the replication NIC of the London node.  Repeat this step for ALL your replication networks, on all nodes.  For the London node, with a gateway on the London replication network of 172.17.1.254, the command back to NY would be:

route add 172.16.1.0 MASK 255.255.255.0 172.17.1.254 -p

Okay, that does it for part 1 of this series.  We went over the basic concepts of the DAG, and how to set up the networking for it.  In the next section, we’ll go over how to create the DAG, and add nodes to it.  Stay tuned.