Online Indexes Failures in Azure for VLDBs

Recently we had a requirement that was caused by a very unique situation. A table with a column of datatype nvarchar(max) was inadvertently populated with some text in excess of 20MB of data/row. This resulted in the table blowing up from around 3GB to over 250GB and later 350GB. Naturally the sudden surge in size of the database was immediately noticed and we needed to rectify the issue. The environment was an Azure database and as you can see the growth in DB size resulted into issues due to reaching the service tier (max DB size allowed is only 200GB). So we needed to fix the problem.

Here is how we went about it

Since the application had reached its service tier limit we needed to bring the size of the database down ASAP. Mainly we needed to ensure there were free data pages available for new data. Since the database was already in its limit we could not do a batch by batch update to NULL the column as this would make the t-Log grow. So we ended up dropping the column and recreating it. This immediately made the database available for new instructions but the size of the database still remained approx. 300GB because the pages that were assigned have not been released back to OS, in other words Shrunk.

So we now had the table accessible but the size of the database needed to come down. Traditionally we are offered two solutions to achieve this within SQL Server:-

DBCC CLEANTABLE

Reclaims space from dropped variable-length columns in tables or indexed views

As the above description makes clear DBCC CLEAN table can fix data pages for cases where there has been a change to variable length columns in this case Dropped. Which is exactly what we did. However we encountered a different problem now. The database was constantly in use and as a result it was taking far too long to acquire the required locks on the underlying table. The second problem we faced with this approach was the fact that the service tier assigns only a limited amount of resources from the elastic pool so we could not supercharge this command by allocating more resources and thus hoping it would complete sooner. After waiting for over 12 hours we realized this was a no go. So we moved on to the next step.

REBUILD INDEX.

Now in order to reclaim the space back we would need to create a whole new set of data pages which are rebuilt when we launch the clustered index rebuild option. Since the clustered index rebuild will allocated and reassign data into new datapages it would automatically take care of the unwanted space allocated to the nvarchar (max) column. We hoped because of the options available in REBUILD INDEX we would see better results than with DBCC CLEAN TABLE. Initially we issued a straight forward command to rebuild the index such as the one below:-

In its most simple form you can rebuild an index with the below command

ALTER INDEX [PK_On_Time] ON [dbo].[On_Time] REBUILD

But this would not work for us since we have other columns of database ntext etc which can only be rebuilt OFFLINE. If you try to rebuild a clustered index on a table with these LOB datatype you would get the below error:-

Msg 2725, Level 16, State 2, Line 3

An online operation cannot be performed for index ‘PK_On_Time’ because the index contains column ‘OriginState’ of data type text, ntext, image or FILESTREAM. For a non-clustered index, the column could be an include column of the index. For a clustered index, the column could be any column of the table. If DROP_EXISTING is used, the column could be part of a new or old index. The operation must be performed offline.

So we had only one choice really and that was to rebuild the index OFFLINE

USE [madworks] GO

ALTER INDEX [PK_On_Time] ON [dbo].[On_Time] REBUILD PARTITION = ALL
    WITH (
            PAD_INDEX = OFF
            ,/** OFF by default therefore fillfactor of 100% is used for intermediate page in the b tree , If ON then fillfactor from sys.indexes is used not an issue in most cases**/
            STATISTICS_NORECOMPUTE = OFF
            ,/** OFF by default means that DBA needs to recompute stats later , its good to leave off by default since it adds to the time taken to rebuild. However query plans maybe affected as a result of plans not being up to date so recompute stats as close the index operation as possible.**/
            SORT_IN_TEMPDB = OFF
            ,/** default is OFF if table is small usually sorting is done in memory and this setting is ignored. if table is large setting this option to On can speed up rebuild operation by using the multiple files allocted to tempdb as part of tempdb optimization. In Azure tempDb files are on special optimized disk so usually a good idea to set to ON in Azure. **/
            ONLINE = OFF
            ,/** DEFAULT IS OFF , exclusively locks the table until the rebuild operation is complete. This makes the table unavailable causing other queries to fail. set to On for large tables/ frequently accessed table since index rebuild can go on for hours ad system will become unresponsive. in our case we are fored to set it OFF because of the ntext columns etc. SETTING OFF make the rebuild finish faster but table is unsable till then else vice versa**/
            ALLOW_ROW_LOCKS = OFF
            ,/** tells SQL Server to acquire only row level locks when rebuilding the IX , can make lock management heavy due to large number of locks acquired.**/
            ALLOW_PAGE_LOCKS = OFF
            ,/** tells SQL Server to acquire only page level locks when rebuilding the IX , can make lock management lighter due to large number of locks acquired.but other queried might be waiting for locks longer. SETTING BOTH ROW AND PAGE LOCKS TO OFF WILL RESULT IN TABLE ACTING LIKE AN OFFLINE REBUILD i.e not available to other queries for DML operation. **/
            RESUMABLE = ON
            ) GO

So armed with the above script we went ahead trying to rebuild the Clustered index. At this point performing an OFFLINE rebuild naturally means we cannot access the table in the meantime so we tried to do this during off peak hours but the operation would not complete even after 8 hours of running mostly due to Azure service tier limitations. We constantly kept running into excessive waits for WRITELOG and other associated wait types. So clearly INDEX rebuild also wasn’t going to work because it never completes within the SLA and the rollback was killing us every day.

Finally left with no other choice we went with the below approach.

  • Copy the data from the table into a staging table.
  • Drop FKs and indexes and constraints on the original table and any FKs so that it acted as a standalone table.
  • Truncated the stand alone table.
  • Repopulated it with data from the Staging table
  • Rebuilt the index (which completed in about 20 min) – since the table size is now 3 GB and we aren’t trying to rebuild a 300 GB table.
  • Recreated all FK, indexes etc
  • Reseed the identity column
  • SHRINK DATABASE

The whole thing took about 2 hours to complete and test.