Merge pull request #36705 from mhyon/20260224-geo-replication-redo

Carolyn135 · web-flow · commit 016f291672f7 · 2026-03-03T08:28:26.000Z
20260224 geo replication redo
diff --git a/azure-sql/database/active-geo-replication-overview.md b/azure-sql/database/active-geo-replication-overview.md
@@ -106,6 +106,9 @@ Both the primary and geo-secondary are required to have the same service tier. I
 
 Another consequence of an imbalanced geo-secondary configuration is that after failover, application performance can suffer due to insufficient compute capacity of the new primary. In that case, it's necessary to scale up the database to have sufficient resources, which might take significant time, and requires a [high availability](high-availability-sla-local-zone-redundancy.md) failover at the end of the scale up process, which can interrupt application workloads.
 
+> [!TIP]
+> For detailed troubleshooting guidance on lag with geo-replication, see [Troubleshoot geo-replication redo lag](troubleshoot-geo-replication-redo.md).
+
 If you decide to create the geo-secondary with a different configuration, you should monitor log I/O rate on the primary over time. This lets you estimate the minimal compute size of the geo-secondary required to sustain the replication load. For example, if your primary database is P6 (1000 DTU) and its log I/O is sustained at 50%, the geo-secondary needs to be at least P4 (500 DTU). To retrieve historical log I/O data, use the [sys.resource_stats](/sql/relational-databases/system-catalog-views/sys-resource-stats-azure-sql-database) view. To retrieve recent log I/O data with higher granularity that better reflects short-term spikes, use the [sys.dm_db_resource_stats](/sql/relational-databases/system-dynamic-management-views/sys-dm-db-resource-stats-azure-sql-database) view.
 
 > [!TIP]  
@@ -243,6 +246,10 @@ Active geo-replication can also be managed programmatically using T-SQL, Azure P
 
 ---
 
+## Troubleshooting
+
+For more information on troubleshooting geo-replication lag, see [Troubleshoot geo-replication and redo lag](troubleshoot-geo-replication-redo.md).
+
 ## Related content
 
 Configure active geo-replication: 
diff --git a/azure-sql/database/failover-group-sql-db.md b/azure-sql/database/failover-group-sql-db.md
@@ -160,7 +160,7 @@ A typical Azure application uses multiple Azure services and consists of multipl
 If an outage occurs in the primary region, recent transactions might not have been replicated to the geo-secondary and there might be data loss if a forced failover is performed.
 
 > [!IMPORTANT]
-> Elastic pools with 800 or fewer DTUs or 8 or fewer vCores, and more than 250 databases can encounter issues including longer planned geo-failovers and degraded performance. These issues are more likely to occur for write intensive workloads when geo-replicas are widely separated by geography, or when multiple secondary geo-replicas are used for each database. A symptom of these issues is an increase in geo-replication lag over time, potentially leading to a more extensive data loss in an outage. This lag can be monitored using [sys.dm_geo_replication_link_status](/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database). If these issues occur, then mitigation includes scaling up the pool to have more DTUs or vCores, or reducing the number of geo-replicated databases in the pool.
+> Elastic pools with 800 or fewer DTUs or 8 or fewer vCores, and more than 250 databases can encounter issues including longer planned geo-failovers and degraded performance. These issues are more likely to occur for write-intensive workloads when geo-replicas are widely separated by geography, or when multiple secondary geo-replicas are used for each database. A symptom of these issues is an increase in geo-replication lag over time, potentially leading to more extensive data loss in an outage. This lag can be monitored using [sys.dm_geo_replication_link_status](/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database). If these issues occur, then mitigation includes scaling up the pool to have more DTUs or vCores, or reducing the number of geo-replicated databases in the pool. For detailed troubleshooting guidance on redo lag issues, see [Troubleshoot geo-replication redo lag](troubleshoot-geo-replication-redo.md).
 
 
 <a id="failback"></a>
@@ -221,3 +221,4 @@ In a scenario where high availability is enabled on the primary database, and th
 - To learn about Azure SQL Database automated backups, see [SQL Database automated backups](automated-backups-overview.md).
 - To learn about using automated backups for recovery, see [Restore a database from the service-initiated backups](recovery-using-backups.md).
 - To learn about authentication requirements for a new primary server and database, see [SQL Database security after disaster recovery](active-geo-replication-security-configure.md).
+- To troubleshoot geo-replication redo lag, see [Troubleshoot geo-replication redo lag](troubleshoot-geo-replication-redo.md).
diff --git a/azure-sql/database/troubleshoot-geo-replication-redo.md b/azure-sql/database/troubleshoot-geo-replication-redo.md
@@ -0,0 +1,76 @@
+---
+title: Troubleshoot Geo-Replication and Redo Lag
+titleSuffix: Azure SQL Database
+description: Learn how to understand and troubleshoot geo-replication and redo lag in Azure SQL Database.
+author: WilliamDAssafMSFT
+ms.author: wiassaf
+ms.reviewer: mahyon, randolphwest
+ms.date: 03/02/2026
+ms.service: azure-sql-database
+ms.subservice: high-availability
+ms.topic: troubleshooting
+ms.custom:
+  - azure-sql-split
+monikerRange: "=azuresql || =azuresql-db"
+---
+
+# Troubleshoot geo-replication and redo lag
+
+[!INCLUDE [appliesto-sqldb](../includes/appliesto-sqldb.md)]
+
+In active geo-replication, the geo-secondary replica continuously receives and applies transaction log records from the primary. When the secondary replica can't apply logs as fast as the primary generates them, a backlog builds (redo queue) and the time gap increases (redo lag). This situation can affect read-only freshness on the secondary and increase failover time.
+
+- **Redo queue**: The volume of transaction log records that have been shipped to the secondary but not yet applied there.
+- **Redo lag**: The elapsed time between transaction commit on the primary and completion of replay on the secondary.
+
+Geo-replication is asynchronous. Redo lag on the secondary replica doesn't cause waits on the primary, but it means that data on the secondary can fall behind the primary.
+
+## Symptoms
+
+- Stale data on the secondary for read-only workloads (reporting, analytics, or offloaded reads).
+- Longer failover time, which lengthens recovery and can put the Recovery Time Objective (RTO) at risk.
+- Sustained resource pressure on the secondary, reducing its ability to catch up.
+- To confirm redo lag, query the [sys.dm_database_replica_states](/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database?view=azuresqldb-current&preserve-view=true) DMV. Redo lag is present when `redo_queue_size > 0` and growing, and `secondary_lag_seconds` is increasing.
+
+## Why redo backlog grows
+
+Although the secondary database is read-only, it still maintains a transaction log for internal operations, including replaying log records from the primary. When the redo queue grows, the secondary must retain more transaction log data. 
+
+This situation can lead to:
+
+- Transaction log growth on the secondary.
+- Higher storage consumption, which can affect cost and performance.
+- Potential throttling scenarios when thresholds are exceeded.
+
+## Impact of replica size mismatch
+
+You should configure the primary and geo-secondary replica with the same service level objective (SLO), backup storage redundancy, [compute tier](service-tiers-sql-database-vcore.md#compute) (provisioned or serverless), and compute size (DTUs or vCores). 
+
+If you configure a secondary database with a lower compute size than the primary database, you might experience:
+
+- Resource contention on the secondary (CPU, I/O), which slows down redo operations.
+- Inability to keep up with the transaction log generation rate of the primary.
+- Increased redo queue size, which worsens lag and reduces replication effectiveness.
+
+## Recommendations
+
+To reduce redo lag and maintain replication health and efficient log usage on the secondary:
+
+- Align SLO and compute sizes. Ensure the secondary database has the same service tier and compute size as the primary.
+  - Configure geo-secondary: [Active geo-replication](active-geo-replication-overview.md#configure-geo-secondary)
+  - Scale a single database: [Scale single database resources in Azure SQL Database](single-database-scale.md)
+  - Scale an elastic pool: [Scale elastic pool resources in Azure SQL Database](elastic-pool-scale.md)
+  - Cost considerations: [Plan and manage costs for Azure SQL Database](cost-management.md)
+
+- Monitor regularly. Use dynamic management views (DMVs) such as [sys.dm_database_replica_states](/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database?view=azuresqldb-current&preserve-view=true) to track redo lag and queue size. Redo lag is confirmed when `redo_queue_size > 0` and growing, and `secondary_lag_seconds` is increasing.
+
+- Optimize workloads:
+
+  - Reduce long-running transactions on the secondary and high log generation spikes on the primary.
+    - Avoid large index rebuilds during peak times. Rebuilds can acquire schema modification (Sch-M) locks, which might block the redo thread on the secondary and contribute to redo queue build-up.
+
+## Related content
+
+- [Active geo-replication](active-geo-replication-overview.md)
+- [Configure active geo-replication and failover](active-geo-replication-configure-portal.md)
+- [Monitor geo-replication lag](active-geo-replication-overview.md#monitor-geo-replication-lag)
diff --git a/azure-sql/database/troubleshoot-memory-errors-issues.md b/azure-sql/database/troubleshoot-memory-errors-issues.md
@@ -190,6 +190,7 @@ If out of memory errors persist in Azure SQL Database, file an Azure support req
 - [Performance Center for SQL Server Database Engine and Azure SQL Database](/sql/relational-databases/performance/performance-center-for-sql-server-database-engine-and-azure-sql-database)
 - [Troubleshooting connectivity issues and other errors with Azure SQL Database and Azure SQL Managed Instance](troubleshoot-common-errors-issues.md)
 - [Troubleshoot transient connection errors in SQL Database and SQL Managed Instance](troubleshoot-common-connectivity-issues.md)
+- [Troubleshoot transaction log errors](troubleshoot-transaction-log-errors-issues.md)
 - [Demonstrating Intelligent Query Processing](https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/intelligent-query-processing)
 - [Resource management in Azure SQL Database](resource-limits-logical-server.md#memory)
 - [Blog: A new way to troubleshoot out-of-memory errors in the database engine](https://techcommunity.microsoft.com/t5/azure-sql-blog/a-new-way-to-troubleshoot-out-of-memory-errors-in-the-database/ba-p/3271926)
diff --git a/azure-sql/database/troubleshoot-transaction-log-errors-issues.md b/azure-sql/database/troubleshoot-transaction-log-errors-issues.md
@@ -151,4 +151,6 @@ To resolve this issue, try the following methods:
 - [Understand and resolve Azure SQL Database blocking problems](understand-resolve-blocking.md?view=azuresql-db&preserve-view=true#gather-blocking-information)
 - [Troubleshooting connectivity issues and other errors with Azure SQL Database and Azure SQL Managed Instance](troubleshoot-common-errors-issues.md?view=azuresql-db&preserve-view=true)
 - [Troubleshoot transient connection errors in Azure SQL Database and SQL Managed Instance](troubleshoot-common-connectivity-issues.md?view=azuresql-db&preserve-view=true)
+- [Troubleshoot geo-replication redo lag](troubleshoot-geo-replication-redo.md?view=azuresql-db&preserve-view=true)
+- [Troubleshoot out of memory errors](troubleshoot-memory-errors-issues.md?view=azuresql-db&preserve-view=true)
 - [Video: Data Loading Best Practices on Azure SQL Database](/shows/data-exposed/data-loading-best-practices-on-azure-sql-database?WT.mc_id=dataexposed-c9-niner)
diff --git a/azure-sql/toc.yml b/azure-sql/toc.yml
@@ -2036,7 +2036,9 @@
       href: database/troubleshoot-common-connectivity-issues.md
     - name: Troubleshoot out of memory errors
       href: database/troubleshoot-memory-errors-issues.md
-    - name: Import/Export service hangs
+    - name: Troubleshoot geo-replication lag
+      href: database/troubleshoot-geo-replication-redo.md
+    - name: Troubleshoot Import/Export service
       href: database/database-import-export-hang.md
     - name: Transaction log errors in Azure SQL Database
       href: database/troubleshoot-transaction-log-errors-issues.md
diff --git a/docs/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database.md b/docs/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database.md
@@ -76,7 +76,12 @@ Returns state information for each database that participates in primary and sec
 
 Requires `VIEW DATABASE STATE` permission on the database.
 
+## Remarks
+
+For more information on troubleshooting geo-replication redo lag in Azure SQL Database, see [Troubleshoot geo-replication redo lag](/azure/azure-sql/database/troubleshoot-geo-replication-redo?view=azuresql-db&preserve-view=true).
+
 ## Related content
 
 - [What is an Always On availability group?](../../database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server.md)
 - [Monitor Availability Groups (Transact-SQL)](../../database-engine/availability-groups/windows/monitor-availability-groups-transact-sql.md)
+- [sys.dm_geo_replication_link_status (Azure SQL Database and Azure SQL Managed Instance)](sys-dm-geo-replication-link-status-azure-sql-database.md)
diff --git a/docs/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database.md b/docs/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database.md
@@ -5,7 +5,7 @@ description: Contains a row for each replication link between primary and second
 author: rwestMSFT
 ms.author: randolphwest
 ms.reviewer: wiassaf
-ms.date: 06/13/2025
+ms.date: 02/26/2026
 ms.service: azure-sql-database
 ms.topic: reference
 f1_keywords:
@@ -24,7 +24,7 @@ monikerRange: "=azuresqldb-current || =azuresqldb-mi-current"
 
 [!INCLUDE[Azure SQL Database Azure SQL Managed Instance](../../includes/applies-to-version/asdb-asdbmi.md)]
 
-Contains a row for each replication link between primary and secondary databases in a geo-replication partnership. This includes both primary and secondary databases. If more than one continuous replication link exists for a given primary database, this table contains a row for each of the relationships. The view is created in all databases, including the `master` database. However, querying this view in the `master` database returns an empty set.
+Contains a row for each replication link between primary and secondary databases in a geo-replication partnership. This includes both primary and secondary databases. If more than one continuous replication link exists for a given primary database, this view contains a row for each of the relationships.
 
 |Column name|Data type|Description|  
 |-----------------|---------------|-----------------|  
@@ -41,13 +41,18 @@ Contains a row for each replication link between primary and secondary databases
 | `secondary_allow_connections_desc` |**nvarchar(256)**|No<br /><br /> All|  
 | `last_commit` |**datetimeoffset**|The time of last transaction committed to the database. If retrieved on the primary database, it indicates the last commit time on the primary database. If retrieved on the secondary database, it indicates the last commit time on the secondary database. If retrieved on the secondary database when the primary of the replication link is down, it indicates until what point the secondary has caught up.|
 
-> [!NOTE]  
->  If the replication relationship is terminated by removing the secondary database, the row for that database in the `sys.dm_geo_replication_link_status` view disappears.  
-
 ## Permissions
 
 Requires the `VIEW DATABASE STATE` permission in the database.  
 
+## Remarks
+
+If the replication relationship is terminated by removing the secondary database, the row for that database in the `sys.dm_geo_replication_link_status` view disappears.
+
+The view is created in all databases, including the `master` database. However, querying this view in the `master` database returns an empty set.
+
+For more information on troubleshooting geo-replication redo lag in Azure SQL Database, see [Troubleshoot geo-replication redo lag](/azure/azure-sql/database/troubleshoot-geo-replication-redo?view=azuresql-db&preserve-view=true).
+
 ## Examples
 
 This Transact-SQL query shows replication lags and last replication time of secondary databases.  
@@ -63,7 +68,7 @@ FROM sys.dm_geo_replication_link_status;
 
 ## Related content
 
-- [ALTER DATABASE (Transact-SQL)](../../t-sql/statements/alter-database-transact-sql.md)
+- [sys.dm_database_replica_states (Azure SQL Database)](sys-dm-database-replica-states-azure-sql-database.md)
 - [sys.geo_replication_links (Azure SQL Database)](sys-geo-replication-links-azure-sql-database.md)
 - [sys.dm_operation_status (Azure SQL Database)](sys-dm-operation-status-azure-sql-database.md)
 - [sp_wait_for_database_copy_sync](../system-stored-procedures/sp-wait-for-database-copy-sync-transact-sql.md)