Full Text Search on Varbinary (max) Columns and using the Right filters

I love Full text search in MS SQL Server. Although I haven’t seen it being used a lot in client locations, I do use it a lot within my own company as well as for some of my clients who have very specific requirements. One of the more common requirements we find is document search. Typically this would be like searching a resume or a product catalog for products that meet a specific filter criteria. While the solution is a mix of multiple features like FILESTREAM or Filetable and Full Text Search; in this example I am focusing mostly on the Full Text Search aspect of it.

The Important thing to keep in mind when working with FULLTEXT search on varbinary columns is that it is not the same as FTS on a varchar column. Mainly because the data is stored in VARBINARY format FTS requires a file type column to be defined as well. This column helps full text search identify the type of file the content comes from.


CREATE TABLE [dbo].[filecontent](
[fileid] [int] NULL,
[Filecontent] [varbinary](max) NULL,
[filevid] [int] IDENTITY(1,1) NOT NULL,
[doctype] [varchar](10) NULL,
PRIMARY KEY CLUSTERED
(
[filevid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO


For example in the above create table script the doctype column of varchar (10) helps differentiate the contents as coming from a pdf file or a doc file or docx file. Without this column you wouldn’t be able to proceed with creating the full text index on varbinary columns.

Once you have a filetype or doc type column the next thing to keep in mind is that by default FTS doesn’t understand all doc types. To know which doc types FTS understands you can run the below query.


select * from sys.fulltext_document_types


If you do not find the filetype you are looking for, e.g. in this example docx, you can download the filter from the MS website.

http://www.microsoft.com/en-us/download/details.aspx?id=17062

Once the appropriate filter has been installed you need; run the below commands in SSMS


EXEC sp_fulltext_service 'update_languages';
EXEC sp_fulltext_service 'load_os_resources', 1;
EXEC sp_fulltext_service 'restart_all_fdhosts';


And then restart the Full text Service within the service manager console. If you already have a FT index on the table then you need to drop and recreate it.

In order to create the full text index you can simply right click the table and select Full Text Index > Define Full text Index

Make sure you have PK on the table before pressing Next

Select the Unique Index you want to use and then Press Next

In the Type Column select the name of the Column which stores the file type, and select the language to use.

If you want FTS to automatically track changes to the file content select the same and press Next, choose Manual if the file contents are updated infrequently.

Select the FTS catalog you have created else check “Create a new Catalog” and give it a name. Press Next

If you would like the FTS to be populated at a predefined schedule the configure it in the below screen, it’s very similar to scheduling a job. Press Next

You will see a summary

Press Finish

Once the FTS index is created, you can force the index to be populated by – clicking Start Full Population below.

Once the index is populated you can query the FTS catalog as shown below

Some common example where you can use Full text search is to search for digital documents in an online library as I am doing above, I am searching my training content for posts related to SQL Server Analysis Services. Another way this was implemented was for a client of mine who used expensive parsing software to extract resume content and store the data in a varchar column to perform searches, once we migrated to FileTable and full text search we were able to improve search relevance as well as overall ETL performance by a factor of 10.

I recently use a combination of Full text search + SSIS Fuzzy grouping + term extraction to generate a list of Tags for all the blog posts I have uploaded so far. Naturally there are other ways you can use FTS as well a simple example would the search bar on most websites where the html content is saved in a varbinary column so that relevant web pages can be displayed in the search results much like how google indexes web pages.