Wednesday, January 23, 2008

Cost of MAX in SQL Server 2005

Microsoft SQL Server 2005 introduced the MAX keyword for defining variable-length data types such as VARCHAR (and its Unicode counterpart NVARCHAR) and VARBINARY. It has given database architects and developers a more robust choice when designing data models and writing T-SQL code. There have been many discussions of the internals and of the advantages and disadvantages of using MAX. Below I summarize the main points.

  • The storage engine treats a MAX type as a hybrid of the normal variable-length type and the LOB type. That is, if a value in a MAX-typed column is at most 8,000 bytes long, it is treated like the normal variable-length type - stored on an in-row data page or a row-overflow data page. Otherwise, it is treated as a LOB type - stored on LOB data pages.
  • Because of this internal flexibility, the data types text, ntext and image can be effectively replaced by the appropriate MAX type when working exclusively in SQL 2005 and later environments.
  • A VARCHAR(MAX) variable is very handy for storing pre-validated XML data because of the potentially explosive size of XML data. Use a TRY … CATCH block to validate the text for XML well-formedness or against an XML schema.
  • NVARCHAR(MAX) parameters are convenient for defining CLR functions/triggers/stored procedures in T-SQL. A lot of people, including me, mistakenly believed that the .Net type SqlString would translate into NVARCHAR(MAX). Actually, it translates into NVARCHAR(4000); it is SqlChars that translates into NVARCHAR(MAX).
  • Although I have not verified it myself, some people claim statistics cannot be automatically created on MAX-typed columns, limiting the Optimizer's options. But statistics can be created manually for a column of any type. A MAX column can also be covered by an index if necessary - in SQL 2005 it cannot be an index key column, but it can be an included column in a nonclustered index (be very careful here - the unlimited column size could make the index take a huge amount of storage).
  • Similarly, the Optimizer uses the column size as one of the metrics for determining a query plan. MAX certainly gives the Optimizer no clue in that regard.
    The unknown column size also makes it difficult or impossible to calculate the number of rows per page and hence to plan for storage requirements.
  • As with any variable-length column, updates that increase the data length can cause page splits, adversely impacting performance.
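As a sketch of the XML-validation point above (the variable name and sample data are illustrative, not from any production code):

```sql
-- Stage raw text in an NVARCHAR(MAX) variable, then test XML well-formedness.
DECLARE @raw NVARCHAR(MAX);
SET @raw = N'<order id="1"><item>widget</item></order>';

BEGIN TRY
    DECLARE @doc XML;
    SET @doc = CAST(@raw AS XML);  -- raises an error here if @raw is not well-formed
    PRINT 'Well-formed XML';
END TRY
BEGIN CATCH
    PRINT 'Invalid XML: ' + ERROR_MESSAGE();
END CATCH;
```

The same pattern works for schema validation by declaring @doc against an XML SCHEMA COLLECTION instead of plain XML.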
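And the manual-statistics workaround mentioned above can be sketched like this (the table and column names are hypothetical):

```sql
-- Assuming a table dbo.Documents with a Body VARCHAR(MAX) column,
-- create single-column statistics by hand since auto-create
-- reportedly skips MAX-typed columns:
CREATE STATISTICS st_Documents_Body
    ON dbo.Documents (Body);
```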

In this post, I'd like to share an additional observation on using MAX variables. I have found that, depending on the application, there can be a huge cost associated with using a MAX variable over its normal variable-length counterpart. As a disclaimer, I personally have not seen this finding documented anywhere, but if that is due to my ignorance, please forgive me.
Here is some background.
We have two scalar CLR Regex functions in T-SQL, one for identifying a match (util.uf_RegexIsMatch), and the other for finding all matches and then doing a replacement (util.uf_RegexReplace). A colleague of mine recently brought to my attention a counter-intuitive test result: when applied to a table of half a million normal VARCHAR records, util.uf_RegexIsMatch consistently ran 3 times slower than util.uf_RegexReplace. That was very surprising. Without knowing exactly how the two Regex methods are implemented in .Net, intuitively IsMatch does less work than Replace, so we expected the opposite: if not 3 times faster, then at least as fast. Comparison tests in C# confirmed that intuition.
It did not make sense, so I started a little investigation. I made sure the query plans were all right: a simple table scan, and after repeated runs all the data was in cache (SET STATISTICS IO ON showed no physical reads, and the logical reads were the same for both). That meant both had the same IO cost, and the only difference had to be in the functions themselves. Then I looked at the CREATE FUNCTION statements and noticed that the input parameter of util.uf_RegexIsMatch was of type NVARCHAR(MAX) while that of util.uf_RegexReplace was NVARCHAR(4000). I knew the type had to be a Unicode type because that is required by all CLR objects. But could the MAX have been the cause? It was hard to believe, but I decided to give it a try. I recreated util.uf_RegexIsMatch with NVARCHAR(4000) for the input parameter and re-ran the tests. Voila! The comparison went in the right direction. Out of curiosity, I also made both input parameters NVARCHAR(MAX), which produced the same expected - and obviously slower - result.
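The fix can be sketched as follows. The function name comes from the post; the assembly, class, and method names in the EXTERNAL NAME clause are assumptions, since the post does not show them:

```sql
-- Recreate the CLR wrapper with NVARCHAR(4000), matching util.uf_RegexReplace,
-- instead of the original NVARCHAR(MAX) input parameter.
CREATE FUNCTION util.uf_RegexIsMatch
(
    @input   NVARCHAR(4000),
    @pattern NVARCHAR(4000)
)
RETURNS BIT  -- SqlBoolean on the .Net side maps to BIT
AS EXTERNAL NAME RegexUtil.[RegexFunctions].IsMatch;  -- hypothetical assembly.class.method
GO
```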
In our case, we do not really need MAX, or even 4000, because the table column that feeds the input parameters of the functions is only VARCHAR(50). I tried other lengths smaller than 4000 and found no performance difference compared to 4000, so we standardized on NVARCHAR(4000) for the two functions, and they will be our default functions in production. If necessary, we can create counterparts with NVARCHAR(MAX) to handle special cases.
Is the above just a special and isolated case or does it have general validity?
Further investigation has shown that MAX in general has a negative performance impact not only on CLR functions, but also on T-SQL user-defined functions. I suspect the same is true for stored procedures. In addition, it is true even for ad hoc query batches, which can be demonstrated by comparing the following two simple batches. Be sure to check "Discard results after execution" in your SSMS options so that no display work contaminates the test.
Note that a number after GO tells the query processor to execute the batch that many times.
---- Batch 1: normal variable-length variable
---- (the statements below are representative; the post's exact batch bodies were lost)
DECLARE @v VARCHAR(8000);
SET @v = REPLICATE('a', 8000);
GO 50000
---- Batch 2: MAX
DECLARE @v VARCHAR(MAX);
SET @v = REPLICATE('a', 8000);
GO 50000

On my notebook running XP and the SQL Server 2008 (Katmai) CTP, Batch 1 completes in 12 seconds and Batch 2 in 16 seconds, consistently. That is a difference of over 30%!

In conclusion, the flexibility that comes with MAX is not free. The performance difference relative to the normal variable-length counterpart can be unexpectedly large. We need to evaluate each situation to really justify its use, and even then we may be better off creating two versions of the objects (functions, stored procedures): one for normal production scenarios and one for special cases. I personally would like to study how SQLOS allocates and de-allocates memory for MAX variables to better understand the range of performance differences.
