Sunday, January 27, 2008

SET Settings for Stored Procedure

A stored procedure is unique in that it is able to control its execution context via the "environmental" SET settings. (Since a trigger is a special kind of stored procedure, can a trigger change its SET settings?). For example, we can set TRANSACTION LEVEL at the beginning of the stored procedure to control all statements in it and to provide the default for its inner stored procedures if any.
According to BOL,

  • The scope of the settings SET in the stored procedure is the stored procedure itself and those settings do not affect the calling (execution) environment.
  • Settings not explicitly set in the stored procedure default to those set at the execution time except for two special settings: ANSI_NULLS and QUOTED_IDENTIFIER.
  • Those two settings at the creation time of the stored procedure (implicitly) exclusively control the respective behavior, ignoring the corresponding settings at the execution time or the settings SET within the stored procedure. That is, those two settings are constant for the stored procedure after its creation, and to change them, it has to be altered or dropped and created with different settings.

We can find out the two settings at the time of stored procedure creation by querying:

Why are those two settings special?
QUOTED_IDENTIFIER is actually a parse-time setting and the parser needs to determine if a statement like SELECT * FROM "table" is valid or not based on the setting. It makes sense that for consistency, the same behavior applied during the parsing is adopted at the time of execution without regard to any different setting. (Other parse-time settings such at FIPS_FLAGGER, OFFSETS, PARSEONLY, do not affect the execution results).
ANSI_NULLS setting can directly affect query results. It is understandable that this settting deserves the consistency treatment. The question is why other settings do not have the same special treatment. Specifically, CONCAT_NULL_YIELDS_NULL also affect query results. The optimizer needs to know these other settings when it is looking for the "best" query plan. So, SQL's resolution is that each cached plan
for a stored procedure has these setting included as part of the plan. Any change in them in the execution context will cause the optimizer to recompile and generate another plan.

Another question that has been in my mind for quite some time and still has no clear answer: what execution context is an activation stored procedure in? Since there is no explictly created session to run such a procedure, will the settings defaults to the db level settings (except for the two special settings (QUOTED_IDENTIFIER and ANSI_NULLS ), of course - they were set at the creation time) . But the default db settings are all off. Is there a risk here?
The same applies to other client apps (besides the native sqlclient used by SSMS) such as ADO.Net. BOL seems to imply that ODBC and OLEDB have automatic settings.

Wednesday, January 23, 2008

Cost of MAX in SQL Server 2005

Microsoft SQL Server 2005 introduced keyword MAX for defining variable-length data types such as VARCHAR (and its Unicode counterpart NVARCHAR), and VARBINARY. It has given database architects and developers a more robust choice when designing data models and writing T-SQL codes. There are many discussions on the internals, the advantages and disadvantages of using MAX. Below I summarize the main points.

  • The storage engine treats a MAX type as a hybrid of the normal variable-length type and LOB type. That is, if a column with MAX type has length <= 8,000 bytes, then it will be treated as the normal-length type - stored on an in-row data page or a row-overflow data page. Otherwise, the column will be treated as a LOB type - stored on a LOB data page.
  • Because of the above internal flexibility, data types text, ntext and image can be virtually replaced by the appropriate MAX type if working exclusively in SQL 2005 and later environments.
  • A VARCHAR(MAX) variable is very handy to store pre-validated XML data because of the potentially explosive size of XML data. Use TRY … CATCH block to validate text data for XML well-formedness or XML schema.
  • NVARCHAR(MAX) parameters are convenient to use for defining CLR functions/triggers/stored procedures in T-SQL. A lot of people, including me, misunderstood that .Net type SqlString would translate into NVARCHAR(MAX). Actually, it translates into NVARCHAR(4000) and SqlChars translates into NVARCHAR(MAX).
  • Although I have not verified it myself, some people claim statistics can not be automatically created on MAX type columns, therefore limiting the Optimizer's ability. But, statistics can created manually for a column of any type. Also, we could create an index on it if necessary (be very careful here - the unlimited column size could make the index take a huge amount of storage).
  • Similarly, the Optimizer uses the column size as one of the metrics to determine the query plan. MAX certainly gives the Optimizer no clue in that regard.
    The unknown column size can cause difficulty/impossibility in calculating the number of rows per page and hence planning for the storage requirements.
  • As common to any variable-length columns, updates with increased data length have the potential of causing page splits, adversely impacting performance.

In this post, I'd like to show an additional observation on using MAX variables. I have found that depending on the application, there can be a huge cost associated with using a MAX variable over its normal variable-length counterpart. As a disclaimer, I personally have not come across any such findings documented anywhere, but if that is due to my ignorance, please forgive me.
Here is some background.
We have two scalar CLR Regex functions in T-SQL, one for identifying a match (util.uf_RegexIsMatch), and the other for finding all matches first and then doing a replacement (util.uf_RegexReplace). A colleague of mine recently brought to my attention a counter-intuitive test result: when applied on a table of a half million normal VARCHAR records, util.uf_RegexIsMatch ran 3 times slower than util.uf_RegexReplace and consistently. That was very surprising. Without knowing exactly how the two Regex functions are implemented in .Net, intuitively IsMatch is doing less work than Replace, and so we were expecting the other way around and if not 3 times fasters, at least as fast, which was proven to be the case when doing comparison tests in C#.
It did not make sense. So I started a little investigation. I made sure the query plans were all right: simple table scan and after repeated runs, all the data was in the cache (SET STATISTICS IO ON showed no physical reads and only logical reads and they were the same for both). That meant both had the same IO cost and the only difference must have been in the functions themselves. Then, I looked at the CREATE FUNCTION syntax and noticed that the input parameter for util.uf_RegexIsMatch was of type NVARCHAR(MAX) and that for util.uf_RegexReplace NVARCHAR(4000). I knew the type had to be a Unicode type because that is required by all CLR objects. But could the MAX have been the cause? It was hard to believe but I decided to give it a try. I recreated the function util.uf_RegexIsMatch by using NVARCHAR(4000) for the input parameter, and re-ran the tests. Whala! The comparison went to the right direction. Also as a curiosity, making both input parameters of the same type of NVARCHAR(MAX) had the same expected and obviously slower comparison result.
In our case, we do not really need MAX, or even 4000 because the column in the table to feed the input parameters of the functions is only VARCHAR(50). I played with other numbers smaller than 4000 for the variable-length and found that made no performance difference as compared to 4000, so we standardized on NVARCHAR(4000) for the two functions. And they will be our default functions used in production. If necessary, we can create their counterparts with NVARCHAR(MAX) to handle special cases.
Is the above just a special and isolated case or does it have general validity?
Further investigation has shown that MAX in general has negative performance impact not only on CLR functions, but also T-SQL user defined functions. I suspect it is also true for stored procedures. In addition, it is true even for ad hoc query batches, which can be simulated by comparing the following two simple batches. Be sure to check "Discard results after executive" in your SSMS options so that no display will contaminate the test.
Note that a number after GO tells the query processor to execute the batch that many times.
---- Batch 1: normal variable-length variable
GO 50000
---- Batch 2: MAX
GO 50000

On my notebook running XP and SQL Server 2008 (Katmai) CTE, Batch 1 completes in 12 secs and Batch 2 completes in 16 secs, consistently. That is a difference of over 30%!

In conclusion, the flexibility that comes with MAX is not free. The difference in performance from using the normal variable-length counterpart can be unexpectedly huge. We need to evaluate each situation to really justify its use and even in that case, we may be better off creating two versions of the objects (functions, stored procedures), one for handling the normal production scenarios and the other for the special cases. I personally would like to study and know more about how SQL OS allocates and de-allocate memory for MAX variables to understand the range of differences in performances.