NULL Management, SPARSE Columns, Vertical partitioning and a query

When discussing DW design I often take a lot of time to emphasize the impact Nulls have on query performance, aggregations, design, storage and much more. While this is not new to most seasoned DW developers it is often overlooked during the design phase and added at a later point in time mostly after testing etc. In the interest of putting it out there I figured I will post the topic here as well as a script I created few years back to identify suitable columns in the table.

http://enabledbusinesssolutions.com/blog/which-is-better-count-or-count-id/

Here is a post on how nulls affect COUNT behavior and performance, the default measure added for every measure group is count so I guess this is important.

Now NULLS Occupy space and as a result a wide table that has a large number of columns which are mostly NULL should ideally be carved out into its own table with sparse columns which is what we call vertical partitioning. In fact I have often seen implementations where the wide table is converted into key value pairs so as to avoid having as many nulls in the table.

ID Name AGE PhNumber SSN

1 Jayanth 4 123-456-7890 NULL

2 James 5 987-064-1234 123456

ID FName FValue

1 Name Jayanth

1 PhNumber 123-456-7890

1 Age 4

2 Name James

2 Age 5

2 PhNumber 987-064-1234

2 SSN 123456

While the second approach does in fact eliminate nulls, it also significantly adds to the number of rows while at the same time requiring complex pivoting of data to bring it back to a record set format that users can understand. I.e. the second approach works fine for OLTP databases where all records of an individual might need to be fetched but doesn’t lend itself to aggregation which is what the DW would require.

The query below identifies nullable columns within a table and then check if there are actual null values in those nullable columns. It then either generates a view or actually creates a view with only those columns which are having values for all rows.

USE [madworks]
GO
/****** Object: StoredProcedure [dbo].[CreateNotNullViews] Script Date: 6/23/2015 1:35:32 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[CreateNotNullViews]
(@Schema_name NVARCHAR(256),
@table_name NVARCHAR(256),
@Defonly bit
)
AS
BEGIN
/***********************************************************
Author name :- Jayanth Kurup
Created date :- 201204012
Purpose:- Creates view definition or view with all columns in the table
that have not null values .
@defonyl paramter accepts values of 1 or 0
1 = schema definition (it wont create the actual view)
0 = it will create the view.

***********************************************************/

SET NOCOUNT ON
;
WITH CTE AS (
SELECT * FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @TABLE_NAME
AND TABLE_SCHEMA =@SCHEMA_NAME
)

SELECT ROW_NUMBER() OVER( ORDER BY c.COLUMN_NAME) AS Colids , c.COLUMN_NAME , d.texts , null AS 'NullColumn'
INTO #tempcolstore
FROM CTE c
CROSS APPLY (SELECT ' IF EXISTS (SELECT 1 FROM '+@schema_name +'.'+@table_name +' WHERE '+c.COLUMN_NAME+' IS NULL)
BEGIN
SELECT '+CHAR(39)+c.COLUMN_NAME+char(39)+' AS NULLCOLUMN
END' as texts)d
WHERE IS_NULLABLE = 'YES'

---- THE ABOVE CODE TELLS THE LOGIC TO IGNORE NULLABLE COLUMNS

DECLARE @colid INT
SET @colid = 1
DECLARE @NullLIST TABLE (Columnname nvarchar(256))
WHILE @colid < = (select MAX(Colids) from #tempcolstore) BEGIN DECLARE @sql NVARCHAR(MAX) SET @SQL = ( SELECT TEXTS FROM #TEMPCOLSTORE WHERE COLIDS =@COLID) --PRINT @SQL INSERT INTO @NULLLIST EXEC SP_EXECUTESQL @SQL SET @colid = @colid +1 END ---- THE ABOVE CODE PERFORMS THE NULL EXISTANCE CHECK SET @SQL = (select 'CREATE VIEW '+@Schema_name+'.vw_'+@table_name +' AS SELECT ')+ (select distinct Replace((select distinct u2.COLUMN_NAME + ',' as 'data()' from INFORMATION_SCHEMA.COLUMNS u2 WHERE TABLE_NAME = @TABLE_NAME AND TABLE_SCHEMA = @SCHEMA_NAME AND COLUMN_NAME NOT IN (SELECT COLUMNNAME FROM @NULLLIST ) FOR XML PATH('')) + '$', ',$', '') as Roles from INFORMATION_SCHEMA.COLUMNS u1 WHERE TABLE_NAME = @TABLE_NAME AND TABLE_SCHEMA = @SCHEMA_NAME AND COLUMN_NAME NOT IN (SELECT COLUMNNAME FROM @NULLLIST ))+ ' FROM '+@schema_name+'.'+@table_name --- THE ABOVE CODE CREATES THE VIEW DEFINITION IF @DEFONLY =1 BEGIN Select @SQL as ViewOutput END ELSE BEGIN EXEC SP_EXECUTESQL @SQL PRINT 'SUCCESS :- View created' END --- PERFORMS ACTION AS PER USER INPUT. DROP TABLE #TEMPCOLSTORE ---- CLEAN UP END

If the column is nullable but doesn't have any null values then the script will include that column as part of the view definition say for example the Phnumber column in the above table, if the column is not nullable it will always appear in the view definition say for example the Name column in the same table and if the column is nullable and has at least 1 null value in the column it will be ignored from the view definition like the SSN column.

Therefore the query will create the table structure for the parent table in vertical partitioning. Why bother with all this, because it's better to implement proper design in retrospect than to not do it at all; a case in point is horizontal partitioning which almost never gets implement proactively.