This incident occurred on SQL Server 2019 hosted on Azure VM as Always-On Cluster. It is a four-node cluster with two nodes being synchronous and the other two being asynchronous. In one of the async nodes, one database changed the status to “Not Synchronizing / Pending Recovery”. As per the AG dashboard, a few databases had a large redo queue. Except for the database mentioned above, all others were healthy.
You usually run the command below to get the database ONLINE in this situation.
ALTER DATABASE [Database Name] SET HADR RESUME
But immediately, the above command got blocked with another session with the wait type as HADR_WORKITEM_COMPLETED. It is not a familiar wait type.
After digging in a bit more on the blocker session, it revealed a background task. However, sp_who2 gave little info too. “Unknown Token” appears as the command. We even tried to kill the session, but SQL Server did not allow us to KILL it as it was an internal session.
I tracked down the task to the OS worker level by using the DMVs below:
select * from sys.dm_os_tasks where task_address=0x0000036B2021BC28 select * from sys.dm_os_waiting_tasks where waiting_task_address=0x0000036B2021BC28 select * from sys.dm_os_workers where worker_address=0x000003477819E160
The particular task’s return_code appeared as “258 = TIMEOUT” at the task level. It looks like this background process is trying to start, but it permanently moved to
suspended immediately because of some time-out happening internally.
So, we have no option other than rebooting the node at this point. Once we did that system came up clean state, but that database was In Recovery again. Then we ran the RESUME command mentioned above, which worked with no issues.
All the databases are now in a healthy state.
About the cover picture: I took this picture during a visit to Sri Lanka Aug 2022. Peakview of Kantale Dam.