With SQL Server Release 2012, Microsoft has introduced a new high-availability technology named “AlwaysOn”. It comes integrated with the database software and extends Microsoft’s high-availability portfolio. AlwaysOn is intended to eventually replace Database Mirroring entirely (refer to High Availability Solutions – MS TechNet).
In this post, we would like to familiarize you with the AlwaysOn technology, but also investigate a constellation of problems that can arise from its use.
What is AlwaysOn?
The SQL Server AlwaysOn (AO) technology has been extensively researched and documented online. We would like to point you to the FAQ on Microsoft’s TechNet website (Microsoft TechNet – AlwaysOn FAQ) and this insightful blog post by Juergen Thomas: SQL Server 2012 AlwaysOn – What is it?.
To put it simply:
AlwaysOn is an alternative high-availability solution for SQL Server databases that is available alongside Database Mirroring (DBM) and SQL Server Clustering. Its objective is to combine the advantages of these two HA solutions, while eliminating all of their disadvantages.
AO mitigates the complexity of a clustered SQL Server instance by means of locally installed SQL Server instances with locally attached storage. As with DBM, the database exists locally on each server.
Other than with DBM, however, a database secured with AO can be addressed using a unique DNS name / IP address. The application that uses the database is not aware of a potential failover and does not have to address another database server at runtime.
Today, a wide variety of business processes is mapped in IT systems and their execution depends on the availability and proper functioning of the systems they are based on. Particularly when it comes to systems that drive business-critical processes, downtime can result in significant financial losses. Users and administrators need to know that the databases that are protected by high-availability technologies will be available at any time even in case of error (hardware failures and the like). But does AlwaysOn function reliably and as expected for every conceivable error?
How does SQL Server / the database protected by AlwaysOn respond if the primary connection is lost (failure of the hard disks or SSDs)?
To answer this question, we set up the following test environment:
- Operating system platform Windows Server 2012 R2
- SQL Server 2012 SP1 CU2 (the exact patch level is irrelevant, the following applies to all SQL Server 2012 versions starting with RTM)
- iSCSI-connected (internet Small Computer System Interface) LUNs for the test database protected by AO
- Setup of Windows Server Failover Clustering
- Provision of a test database on the iSCSI LUNs
Setup of an SQL Server AO Availability Group
Figure 1: AlwaysOn test setup
As illustrated in figure 2, the databases on both AlwaysOn test servers are currently synchronized.
Figure 2: Synchronous databases in AO
Now we will try to simulate the failure of the primary disk connection by setting the hard disk on which the data files of the primary database reside to offline in the Disk Manager.
One would expect SQL Server 2012 to detect the failure and respond accordingly by triggering a failover.
Figure 3: Simulated hard disk failure
Upon switching to the AO dashboard and refreshing the AlwaysOn status, one will quickly see that despite the “failure” of the hard disk it still considers the databases to be synchronous. Even the SQL Server error log of the primary database instance does not indicate any problems seconds after the “failure” of the hard disk on which the data files of the database reside.
But as soon as one tries to access any database object in this database (in this case using SQL Server Management Studio), the following error message appears:
Figure 4: Message: DB cannot be accessed anymore, DB check is recommended
Based on this error message, an administrator would assume that the hard disk on which the data files of the primary database reside is defective and counteract by triggering a failover with AlwaysOn in order to open the database on the secondary side, which is assumed to be synchronous. It is fairly likely, however, that in this state SQL Server will no longer be able to fail over to the up-to-then synchronous second database! The SQL Server error log of the secondary database instance displays the following error message:
AlwaysOn: The local replica of availability group ‘AOGroupSLM’ is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.
The availability replica for availability group ‘AOGroupSLM’ on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum mode. If the cluster was started in Force Quorum mode or the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
The behavior shown by AlwaysOn in such cases is, at first, unexpected:
No failover is triggered if the status of a database protected by AO changes. Why is this the case?
Because AlwaysOn does not perform an availability check on the database level. Other than Database Mirroring, AO just like SQL clustering allows for the logical grouping of multiple databases that can be failed over together if necessary. This must be considered when looking at its behavior. With AlwaysOn, availability is not checked and assessed on the level of individual database, but rather on the level of the SQL Server instance and the AO Availability Group. This is what sets it apart from the Database Mirroring behavior, which checks the availability of the individually protected databases and triggers a failover as soon as the availability of the database on the primary server is affected.
The following information can be found in SAP Note 1772688 (login required):
“The latter point also forced a change in the granularity of health checks or indications which would trigger a failover. In DBM the close of a database as result of e.g. a disk failure will trigger a failover (if automatic failover is configured). Whereas in AlwaysOn the health checks are mainly around the Availability Group and the SQL Server instance running the Availability Group as primary. But health detection is not checking the health of the individual databases which are contained in an Availability Group. As a result a database within an Availability
Group can be closed, triggered e.g. by a disk failure without the AG moving into an unhealthy state or triggering a failover (if automatic failover got configured). This is a drastic change in behavior and coverage area compared to DBM. Especially in the case of disk failures which as a secondary effect cause affected SQL Server databases to close, DBM triggered a failover, whereas AlwaysOn in its default settings won’t trigger a failover.”
- Following a storage defect on the primary side, the databases are assumed to be out of sync, making it necessary to open the secondary database, which is actually synchronous, as if it were an asynchronously replicated database (‚with allow_data_loss’).
The actual status of the secondary AO database, however, is identical to that of the secondary DBM database for a similar database error. The databases are equally synchronous or asynchronous with both technologies, but AlwaysOn is more conservative in its assessment than DBM and changes the status of the secondary database to out-of-sync (which is always is case an asynchronously mirrored databases or asynchronous AO).
The behavior when log or data files are lost is the same for both AlwaysOn and Database Mirroring until the protected database is closed. Here, the difference between the two HA solutions lies in their level of automation. AlwaysOn will only provide high availability for as long as the integrity of the database files of the current primary database is maintained. In cases, however, where the primary database is lost due to problems such as storage/SSD failure or iSCSI connection loss, the database, depending on its runtime status, is marked as “out of sync” and requires manual intervention. Database Mirroring on the other hand will, depending on its configuration, automatically fail over to the mirror when data or log files are lost.
From a data consistency perspective, there is no reason not to use the SQL Server AlwaysOn high availability solution, if SLAs are continued to be met after manual intervention and starting the mirror.