Sql-server – Failover Cluster Client Access Point Only Responds to Pings on the Owner Node

azureclusterfailoversql serversql-server-2012

Background

We're running two VMs (Windows Server 2012 R2) on Azure with SQL Server installed, set up as an availability group. We also, of course, have another VM as a dedicated DC. These are all connected through a single Virtual network. This setup was working well for us, and I was able to connect to SQL from my local, physical machine without any issue, but then the spending limit was reached on the account and it de-provisioned everything. We removed the limit and I allocated all the servers using the same VHDs again, with all settings (presumably) restored, but I am no longer able to access the SQL Server.

Name Definitions

In the interest of best explaining this, we'll call the two nodes SQL1 and SQL2, the availability group SQL-AG, the availability group listener SQL-Listener, and the Cloud Service that this is all running through (with the appropriate Endpoints set up) SQL-CloudService. SQL1 is the owner of the failover cluster role (and, in accord, has the replica role of primary) and SQL2 is the secondary.

Scenario

I am able to RDP to both servers, and use SSMS from SQL1 and connect to SQL-Listener, and view the SQL-AG dashboard, which reports everything as healthy and synchronized.

On SQL2, I am not able to connect to SQL-Listener. I am also not able to connect to SQL-CloudService from my local machine, which worked before as well. Both systems return the error,

Cannot connect to SQL-Listener.

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 53)

The network path was not found

When I go on SQL1 and connect through SSMS, I can tell SQL-AG to fail over to SQL2. It does this successfully. However, after doing that, I am no longer able to connect to SQL-Listener from SQL1, but I am from SQL2.

Long story short, I am able to connect with SSMS to the availability group listener only from the system that is marked with the replica role of primary.

The Real Problem

I don't really need to be able to do all of this, but I do need to be able to get at the SQL Server from my local machine over the internet, and I am assuming that these problems are caused by the same underlying issue since they give the same error message.

Things I've Found Along the Way

Not surprisingly given the error message and the situation, but I am not able to ping SQL-Listener unless it is running on the machine from which I initiate the ping. When SQL1 is marked as Primary, I can ping it without issue from SQL1, but when I try to from SQL2, it successfully looks up the IP with DNS, but comes back with "Reply from [SQL2's IP]: Destination host unreachable." When I failover SQL-AG, the same problem occurs in the other direction. I am, however, always able to ping SQL1 from SQL2 and vice-versa. Because of this, I am inclined to believe it a Failover Cluster problem, not a SQL one. Hence the title of this question.

I've also found that the Firewall seems untouched. This is consistent, I'd say, with the ping problem, but monitoring on the Firewall shows no attempts at SQL Server from any remote machines (my local one or the non-owning VM).

It's deducible from what I've said already, but it seems noteworthy to point out that even through the Cloud Service, I am not able to touch the firewall with port 1433. I'm not entirely sure why this would be, since the direct-to-server route, should, I assume, be pushing it, well, directly to the server. Thus I would expect an item in the log representing this, but there are tons of items and none of them are that.

Not surprisingly given the ping issues, I am also able to get to the report server URL (which resembles http://sql-listener/ReportServer) locally on the owner node, but not remotely from the other one.

I am able to connect to either SQL Server from the other if I specify the computer's name (SQL1 or SQL2, compared to SQL-Listener). This makes it all the stranger, to me anyway, that I can't seem to get through the Cloud Service. I would think that this means it's listening wherever it should be, and given that I never had to tell Azure to point to SQL-Listener, I wouldn't expect that to make any difference. So maybe I am just reading this whole situation wrong.

Troubleshooting Steps I've Taken

  • Reboot all machines involved
  • Ensure all IPs are static and what we expect them to be
  • Make sure the firewall is set up properly
  • Shut down each SQL server (on Azure, this de-allocates the VM, so it's much more serious than just rebooting) and boot them up again.
  • Delete and recreate the Failover Cluster role's client access point (and with it, the Availability Group Listener)
  • Recreated the Cloud Service endpoints (although this no longer seems like it could help anything, since that was before I knew there was a problem between servers)
  • Attempted connecting to the server with the IP address explicitly stated ("tcp:[SQL-Listener's IP]"). This comes back with a network-related/instance specific error saying "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."

Thoughts I've had

  • Could it be something to do with subnets? They sure seem to be on the same ones, but I can imagine that causing some weird issues like this.
  • Does anyone know of anything in particular that Azure does when it shuts down servers for running over a spending limit? Is there just one setting somewhere that gets changed that I haven't noticed?

Best Answer

So this turned out to be, as expected, a pretty silly mistake. I had forgotten all the steps required to set up Availability Groups to work with Azure, as outlined here.

Because the de-allocation of the cloud service changed its IP, the SQL-Listener was listening on the wrong IP address. I had thought about that and addressed it by deleting and recreating the listener, but I had totally overlooked all the steps that I, embarrassingly, had personally performed to set up the listener in the first place. So after an hour on the phone with Microsoft Support, we finally got everything set up again. It's all back up and working now.

Related Topic