Thursday, June 19, 2008

Entity Framework - Stored Procedure (5)

In Entity Framework - Stored Procedure (3) and Entity Framework - Stored Procedure (4),
we created stored procedures on the SQL Server and customized the entity model so that EF will use those stored procedures in place of dynamic queries.

In this blog, I discuss INSERT in detail. I want to understand how EF handles the following:
IDENTITY Key - How does EF retrieve the server-generated IDENTITY value?
GUID Key - How does EF retrieve the server-generated UNIQUEIDENTIFIER value?
TIMESTAMP Field - It is not part of the primary key, but is server-generated. I use such fields in the data model for the purposes of concurrency control.
Multi-table INSERT - With tables having referential relations, how does EF sequence the INSERTs and set the generated master key value at the child level?

IDENTITY Key
ef.Policy.policyId is an IDENTITY key. For EF to retrieve the key value of the just-inserted record and update the entity object, we must do the following:
In the stored procedure ef.qp_InsertPolicy, we have a result set: SELECT SCOPE_IDENTITY() AS policyId, ...;
In SSDL, under the <EntityType> for Policy, the policyId <Property> has the attribute StoreGeneratedPattern="Identity".
In MSL, under the <InsertFunction> mapping, we have the element <ResultBinding Name="policyId" ColumnName="policyId" />.
Now run this Console program:
using (WORKModel.WORKEntities we = new WORKModel.WORKEntities())
{
try
{
WORKModel.Policy p = new WORKModel.Policy();
p.expDt = DateTime.Parse("2008/09/01");
p.riskState = "AZ";
p.description = "An AZ Policy";
we.AddToPolicies(p);
we.SaveChanges();
Console.WriteLine("Newly Generated policyId: {0}", p.policyId);
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("Exception Occured: {0}", e.Message);
Console.ReadLine();
}
}
We get
Newly Generated policyId: 43
And on the SQL Server, this is executed:
exec [ef].[qp_InsertPolicy] @dtExpDt='2008-09-01 00:00:00',@vcRiskState='AZ',@vcDescription='An AZ Policy';
Cool! Notice that we did not explicitly set the policyId property of p, but after SaveChanges(), p.policyId was updated with the IDENTITY key SQL Server generated for the newly inserted record. Also note that our stored procedure [ef].[qp_InsertPolicy] was executed instead of a dynamic query.
GUID Key
ef.Driver.driverId is the primary key, of type UNIQUEIDENTIFIER, with a default constraint setting it to NEWSEQUENTIALID() if NULL. It would be great if EF could treat GUIDs the same way as IDENTITY. Unfortunately, since SQL Server has no facility to retrieve a newly generated NEWSEQUENTIALID() the way SCOPE_IDENTITY() works for IDENTITY, Microsoft decided that EF would not support server-generated value types other than IDENTITY as long as the field is part of the primary key. See http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2808788&SiteID=1. So the workaround is to generate the GUID in the parameterless (default) constructor and set the property there. (Why the parameterless constructor? Because it is the most common way to create a new instance. Of course, if the entity class had other, non-default constructors, we should do the same in those.) Since Driver (and DriverName as well) is a partial class, add a new class file WORK.Custom.cs (as opposed to the auto-generated WORK.Designer.cs) which includes the following:
public partial class Driver : global::System.Data.Objects.DataClasses.EntityObject
{
public Driver()
{
this.driverId = Guid.NewGuid();
}
}
public partial class DriverName : global::System.Data.Objects.DataClasses.EntityObject
{
public DriverName()
{
this.nameId = Guid.NewGuid();
}
}
Here we only need the constructor for Driver; the one for DriverName is added for use in the Multi-table INSERT section below. With that, we can run this Console program:
using (WORKModel.WORKEntities we = new WORKModel.WORKEntities())
{
try
{
// Add a driver to an existing policy
WORKModel.Policy p = we.Policies.Include("Drivers")
.Where(w => w.policyId == 43)
.FirstOrDefault();

WORKModel.Driver d = new WORKModel.Driver();
d.age = 40;
d.Policy = p;
we.AddObject(@"Drivers", d); // To use AddObject, we must first set d.Policy as above.
//p.Drivers.Add(d);
we.SaveChanges();
Console.WriteLine("Newly Generated driverId: {0}", d.driverId);
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("Exception Occured: {0}", e.Message);
Console.ReadLine();
}
}
We get
Newly Generated driverId: edbf5506-d6f6-4946-aa14-0a2e0a71689d
And this is executed on the SQL Server:
exec [ef].[qp_InsertDriver] @uidDriverId='EDBF5506-D6F6-4946-AA14-0A2E0A71689D',@iPolicyId=43,@iAge=40;
OK, our stored procedure is used by EF instead of a dynamic query. The issue with this approach is that the .NET Guid.NewGuid() function does not guarantee the order of the GUIDs generated. In other words, we cannot be sure that a GUID generated later is larger than one generated earlier. The implication is that since driverId is the clustered index key, inserts with random driverId values will cause fragmentation. Our SQL NEWSEQUENTIALID() default would solve the problem if it were utilized.
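To see the ordering difference for yourself, here is a small T-SQL sketch (a throwaway temp table; NEWSEQUENTIALID() is only allowed in a DEFAULT constraint):

---- Compare NEWSEQUENTIALID() with NEWID()
CREATE TABLE #g (
seqId UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID(), -- increases monotonically per server start
rndId UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(), -- random
n INT NOT NULL
);
INSERT #g (n) VALUES (1);
INSERT #g (n) VALUES (2);
INSERT #g (n) VALUES (3);
-- Ordering by seqId returns n = 1, 2, 3; ordering by rndId usually does not.
SELECT n, seqId, rndId FROM #g ORDER BY seqId;
SELECT n, seqId, rndId FROM #g ORDER BY rndId;
DROP TABLE #g;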
TIMESTAMP Field
Each of our tables has a TIMESTAMP field, ts. Whenever a record is inserted or updated, that record's ts value increases. Its only purpose is concurrency control. We will look at it more when we discuss UPDATE in the next blog. Here, for INSERT, after a record is added to the database table, we want to retrieve the ts value to update the entity object. To do that, we must do the following:
In each INSERT stored procedure, we add the ts column to the result set, e.g., SELECT SCOPE_IDENTITY() AS policyId, ts FROM ef.TableName WHERE pk = @pk;, for each table.
In SSDL, for each table's <EntityType>, the ts <Property> has the attribute StoreGeneratedPattern="Computed".
In MSL, under each <InsertFunction> mapping, we have a <ResultBinding Name="ts" ColumnName="ts" /> element for each table.
We will see in the Multi-table INSERT section below that the entity object's ts property is updated after the inserts.

Multi-table INSERT
Run this Console program:
using (WORKModel.WORKEntities we = new WORKModel.WORKEntities())
{
try
{
// Multi-table inserts
WORKModel.Policy p = new WORKModel.Policy();
p.expDt = DateTime.Parse("2008/09/01");
p.riskState = "AZ";
p.description = "An AZ Policy";
we.AddToPolicies(p);

WORKModel.Driver d = new WORKModel.Driver();
// Use the public constructor rather than WORKModel.Driver.CreateDriver(), which does not generate the PK.
d.age = 40;
d.Policy = p;
we.AddObject(@"Drivers", d); // To use AddObject, we must first set d.Policy as above.

//p.Drivers.Add(d);
WORKModel.DriverName dn = new WORKModel.DriverName { Driver = d, effDt = DateTime.Parse("2008/5/1"), lastName = "Wang", firstName = "Ting" };
we.AddObject(@"DriverNames", dn);
we.SaveChanges();
Console.WriteLine("Policy.policyId: {0}; Driver.policyId: {1}; Driver.driverId: {2}; DriverName.driverId: {3}.", p.policyId, d.Policy.policyId, d.driverId, dn.Driver.driverId);
Console.WriteLine("\n");
Console.WriteLine("Policy.ts: {0}; Driver.ts: {1}; DriverName.ts: {2}.", p.ts.ToString1(), d.ts.ToString1(), dn.ts.ToString1());
Console.ReadLine();

}
catch (Exception e)
{
Console.WriteLine("Exception Occured: {0}", e.Message);
Console.ReadLine();
}
}
We get
Policy.policyId: 56; Driver.policyId: 56; Driver.driverId: 3eac17c0-2c9c-4e60-b0f4-98bfc679d01b; DriverName.driverId: 3eac17c0-2c9c-4e60-b0f4-98bfc679d01b.

Policy.ts: 00000089; Driver.ts: 000000810; DriverName.ts: 000000811.
And the following was executed, in order:
exec [ef].[qp_InsertPolicy] @dtExpDt='2008-09-01 00:00:00',@vcRiskState='AZ',@vcDescription='An AZ Policy';
exec [ef].[qp_InsertDriver] @uidDriverId='3EAC17C0-2C9C-4E60-B0F4-98BFC679D01B',@iPolicyId=56,@iAge=40;
exec [ef].[qp_InsertDriverName] @uidNameId='875FC71C-533C-47CB-B84D-BF161B8DCD82',@uidDriverId='3EAC17C0-2C9C-4E60-B0F4-98BFC679D01B',@vcLastName='Wang',@vcFirstName='Ting',@vcMiddleName=NULL,@dtEffDt='2008-05-01 00:00:00';
Nice. The policyId generated on the server is retrieved to set p.policyId, which in turn is used as a parameter of the driver-inserting stored procedure. The ts properties on all three entity objects are updated after the inserts. We also note that all of our INSERT stored procedures were used.

Entity Framework - Stored Procedure (4)

In Entity Framework - Stored Procedure (3),
we created stored procedures on the SQL Server; we now move on to customizing the EDM so that EF will use those stored procedures in place of dynamic queries. As mentioned earlier, the current version of EF either uses stored procedures for all three data-changing operations (INSERT, UPDATE, DELETE) or uses none, on a per-entity basis. As a matter of fact, we get a compiler error if, say, for an entity only the INSERT stored procedure is implemented. Unlike the customization for SELECT stored procedures, we do not need to make any changes in CSDL.
So here is the customization necessary:

In SSDL, under the <Schema> element, add a <Function> declaration for each of the nine stored procedures. For the three Policy procedures, they look like this (the Driver and DriverName declarations follow the same pattern, with parameters matching their procedure definitions):

<Function Name="qp_InsertPolicy" Aggregate="false" BuiltIn="false" NiladicFunction="false" IsComposable="false" ParameterTypeSemantics="AllowImplicitConversion" Schema="ef">
  <Parameter Name="dtExpDt" Type="date" Mode="In" />
  <Parameter Name="vcRiskState" Type="varchar" Mode="In" />
  <Parameter Name="vcDescription" Type="varchar" Mode="In" />
</Function>
<Function Name="qp_UpdatePolicy" Aggregate="false" BuiltIn="false" NiladicFunction="false" IsComposable="false" ParameterTypeSemantics="AllowImplicitConversion" Schema="ef">
  <Parameter Name="iPolicyId" Type="int" Mode="In" />
  <Parameter Name="ts_orig" Type="timestamp" Mode="In" />
  <Parameter Name="dtExpDt_curr" Type="date" Mode="In" />
  <Parameter Name="vcRiskState_curr" Type="varchar" Mode="In" />
  <Parameter Name="vcDescription_curr" Type="varchar" Mode="In" />
</Function>
<Function Name="qp_DeletePolicy" Aggregate="false" BuiltIn="false" NiladicFunction="false" IsComposable="false" ParameterTypeSemantics="AllowImplicitConversion" Schema="ef">
  <Parameter Name="iPolicyId" Type="int" Mode="In" />
  <Parameter Name="ts_orig" Type="timestamp" Mode="In" />
</Function>
In MSL, under the

<EntityTypeMapping TypeName="WORKModel.Policy">

element, add

<ModificationFunctionMapping>
  <InsertFunction FunctionName="WORKModel.Store.qp_InsertPolicy">
    <ScalarProperty Name="expDt" ParameterName="dtExpDt" />
    <ScalarProperty Name="riskState" ParameterName="vcRiskState" />
    <ScalarProperty Name="description" ParameterName="vcDescription" />
    <ResultBinding Name="policyId" ColumnName="policyId" />
    <ResultBinding Name="ts" ColumnName="ts" />
  </InsertFunction>
  <UpdateFunction FunctionName="WORKModel.Store.qp_UpdatePolicy">
    <ScalarProperty Name="policyId" ParameterName="iPolicyId" Version="Current" />
    <ScalarProperty Name="ts" ParameterName="ts_orig" Version="Original" />
    <ScalarProperty Name="expDt" ParameterName="dtExpDt_curr" Version="Current" />
    <ScalarProperty Name="riskState" ParameterName="vcRiskState_curr" Version="Current" />
    <ScalarProperty Name="description" ParameterName="vcDescription_curr" Version="Current" />
    <ResultBinding Name="ts" ColumnName="ts" />
  </UpdateFunction>
  <DeleteFunction FunctionName="WORKModel.Store.qp_DeletePolicy">
    <ScalarProperty Name="policyId" ParameterName="iPolicyId" />
    <ScalarProperty Name="ts" ParameterName="ts_orig" />
  </DeleteFunction>
</ModificationFunctionMapping>
Under the

<EntityTypeMapping TypeName="WORKModel.Driver">

element, add

<ModificationFunctionMapping>
  <InsertFunction FunctionName="WORKModel.Store.qp_InsertDriver">
    <ScalarProperty Name="driverId" ParameterName="uidDriverId" />
    <AssociationEnd AssociationSet="FK_ef_Driver_policyId" From="Driver" To="Policy">
      <ScalarProperty Name="policyId" ParameterName="iPolicyId" />
    </AssociationEnd>
    <ScalarProperty Name="age" ParameterName="iAge" />
    <ResultBinding Name="ts" ColumnName="ts" />
  </InsertFunction>
  <UpdateFunction FunctionName="WORKModel.Store.qp_UpdateDriver">
    <ScalarProperty Name="driverId" ParameterName="uidDriverId" Version="Current" />
    <ScalarProperty Name="ts" ParameterName="ts_orig" Version="Original" />
    <AssociationEnd AssociationSet="FK_ef_Driver_policyId" From="Driver" To="Policy">
      <ScalarProperty Name="policyId" ParameterName="iPolicyId_curr" Version="Current" />
    </AssociationEnd>
    <ScalarProperty Name="age" ParameterName="iAge_curr" Version="Current" />
    <ResultBinding Name="ts" ColumnName="ts" />
  </UpdateFunction>
  <DeleteFunction FunctionName="WORKModel.Store.qp_DeleteDriver">
    <ScalarProperty Name="driverId" ParameterName="uidDriverId" />
    <AssociationEnd AssociationSet="FK_ef_Driver_policyId" From="Driver" To="Policy">
      <ScalarProperty Name="policyId" ParameterName="iPolicyId_orig" />
    </AssociationEnd>
    <ScalarProperty Name="ts" ParameterName="ts_orig" />
  </DeleteFunction>
</ModificationFunctionMapping>
Under the

<EntityTypeMapping TypeName="WORKModel.DriverName">

element, add

<ModificationFunctionMapping>
  <InsertFunction FunctionName="WORKModel.Store.qp_InsertDriverName">
    <ScalarProperty Name="nameId" ParameterName="uidNameId" />
    <AssociationEnd AssociationSet="FK_ef_DriverName_driverId" From="DriverName" To="Driver">
      <ScalarProperty Name="driverId" ParameterName="uidDriverId" />
    </AssociationEnd>
    <ScalarProperty Name="lastName" ParameterName="vcLastName" />
    <ScalarProperty Name="firstName" ParameterName="vcFirstName" />
    <ScalarProperty Name="middleName" ParameterName="vcMiddleName" />
    <ScalarProperty Name="effDt" ParameterName="dtEffDt" />
    <ResultBinding Name="ts" ColumnName="ts" />
  </InsertFunction>
  <UpdateFunction FunctionName="WORKModel.Store.qp_UpdateDriverName">
    <ScalarProperty Name="nameId" ParameterName="uidNameId" Version="Current" />
    <ScalarProperty Name="ts" ParameterName="ts_orig" Version="Original" />
    <AssociationEnd AssociationSet="FK_ef_DriverName_driverId" From="DriverName" To="Driver">
      <ScalarProperty Name="driverId" ParameterName="uidDriverId_curr" Version="Current" />
    </AssociationEnd>
    <ScalarProperty Name="lastName" ParameterName="vcLastName_curr" Version="Current" />
    <ScalarProperty Name="firstName" ParameterName="vcFirstName_curr" Version="Current" />
    <ScalarProperty Name="middleName" ParameterName="vcMiddleName_curr" Version="Current" />
    <ScalarProperty Name="effDt" ParameterName="dtEffDt_curr" Version="Current" />
    <ResultBinding Name="ts" ColumnName="ts" />
  </UpdateFunction>
  <DeleteFunction FunctionName="WORKModel.Store.qp_DeleteDriverName">
    <ScalarProperty Name="nameId" ParameterName="uidNameId" />
    <AssociationEnd AssociationSet="FK_ef_DriverName_driverId" From="DriverName" To="Driver">
      <ScalarProperty Name="driverId" ParameterName="uidDriverId_orig" />
    </AssociationEnd>
    <ScalarProperty Name="ts" ParameterName="ts_orig" />
  </DeleteFunction>
</ModificationFunctionMapping>
During the above customization, I have these general observations:
In MSL, the generated <EntityTypeMapping> uses TypeName="IsTypeOf(WORKModel.SomeName)", where SomeName is Policy, Driver, or DriverName, depending on the context. After the <ModificationFunctionMapping> element is added, the compiler complains about more than one <EntityTypeMapping> element under <EntitySetMapping>. After removing IsTypeOf() and keeping only TypeName="WORKModel.SomeName", the compiler is happy.
In MSL, the format of the FunctionName attribute can be very confusing. We have to prefix it with WORKModel.Store. to satisfy the compiler.
In MSL, for the stored procedure parameter mapping, if the column is a foreign key, we need to use an <AssociationEnd> element to enclose the <ScalarProperty> element. Take policyId in the Driver entity, for example: policyId is not a Property of Driver, but comes from the associated Policy object.
For the DELETE stored procedures on tables with foreign keys - Driver and DriverName are such tables - we need to have the foreign key column as one of the input parameters even though the stored procedure itself does not use it. That is because EF needs it to infer the order of deletions. I will discuss that further when I blog about the DELETE stored procedures.

Entity Framework - Stored Procedure (3)

In Entity Framework - Stored Procedure (2),
we explored ways to use the result set of a stored procedure in the ObjectContext to materialize a collection of an entity type. We observed that EF treats such a stored procedure as a strongly-typed function returning a collection of entity objects. We also noted that EF does not treat SELECT stored procedures as "first-class citizens", by which I mean that such stored procedures are not used in the dynamic queries auto-generated for LINQ to Entities queries.

In this and the next few blogs, we review how to use stored procedures in EF to change data in the database. We will see that change-processing stored procedures are indeed treated as "first-class citizens", i.e., they can replace the dynamic queries. I use
Using Stored Procedures for Change Processing in the ADO.NET Entity Framework
as reference.

Due to a limitation of the current EF, for a given table we must either use stored procedures for all of INSERT, UPDATE and DELETE, or use none at all. So we will implement all three types of stored procedures (INSERT, UPDATE and DELETE) for the three tables (ef.Policy, ef.Driver, ef.DriverName) in our data model now, and discuss only the INSERT part for the moment.
In this blog, we introduce the stored procedures as defined on the SQL Server. The next blog will describe the customization of the entity model necessary for EF to use those stored procedures, instead of dynamic queries, to change the store data. Subsequent blogs will describe each type (INSERT, UPDATE, DELETE) of stored procedure in detail.
The following lists the stored procedures created on the SQL Server:

---- Stored Procedures
-- ef.qp_InsertPolicy
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_InsertPolicy]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_InsertPolicy;
GO
CREATE PROCEDURE ef.qp_InsertPolicy (
@dtExpDt DATE,
@vcRiskState VARCHAR(25),
@vcDescription VARCHAR(100)
)
AS
BEGIN
INSERT ef.Policy (expDt, riskState, description) VALUES
(@dtExpDt, @vcRiskState, @vcDescription);
SELECT SCOPE_IDENTITY() AS policyId, P.ts AS ts
FROM ef.Policy P
WHERE policyId = SCOPE_IDENTITY();
END;
GO
GRANT EXEC ON [ef].qp_InsertPolicy TO Public;
GO
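Running the insert procedure by hand (with arbitrary values) shows the one-row result set EF will consume - the new IDENTITY value and the row's ts:

---- Manual smoke test
EXEC ef.qp_InsertPolicy
@dtExpDt = '2008-09-01',
@vcRiskState = 'AZ',
@vcDescription = 'A test policy';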

-- ef.qp_UpdatePolicy
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_UpdatePolicy]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_UpdatePolicy;
GO
CREATE PROCEDURE ef.qp_UpdatePolicy (
@iPolicyId INT,
@ts_orig TIMESTAMP,
@dtExpDt_curr DATE,
@vcRiskState_curr VARCHAR(25),
@vcDescription_curr VARCHAR(100)
)
AS
BEGIN
UPDATE ef.Policy SET
expDt = @dtExpDt_curr,
riskState = @vcRiskState_curr,
description = @vcDescription_curr
WHERE policyId = @iPolicyId AND ts = @ts_orig;
SELECT P.ts
FROM ef.Policy P
WHERE policyId = @iPolicyId;
END;
GO
GRANT EXEC ON [ef].qp_UpdatePolicy TO Public;
GO

-- ef.qp_DeletePolicy
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_DeletePolicy]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_DeletePolicy;
GO
CREATE PROCEDURE ef.qp_DeletePolicy (
@iPolicyId INT,
@ts_orig TIMESTAMP
)
AS
BEGIN
DELETE ef.Policy
WHERE policyId = @iPolicyId AND ts = @ts_orig;
END;
GO
GRANT EXEC ON [ef].qp_DeletePolicy TO Public;
GO

-- ef.qp_InsertDriver
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_InsertDriver]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_InsertDriver;
GO
CREATE PROCEDURE ef.qp_InsertDriver (
@uidDriverId UNIQUEIDENTIFIER,
@iPolicyId INT,
@iAge TINYINT
)
AS
/*
Ideally, we would let the default constraint generate NEWSEQUENTIALID() for driverId,
retrieve that value somewhere (similarly to SCOPE_IDENTITY()), return it as
a result set, and use a <ResultBinding> tag in MSL to send it back to the EDM.
But for a primary key, that technique only works for an IDENTITY field.
This link
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2808788&SiteID=1
proposes a solution via an explicit transaction. We are currently not implementing it.
*/
BEGIN
--SELECT @uidDriverId = COALESCE(@uidDriverId, NEWID());
INSERT ef.Driver (driverId, policyId, age) VALUES
(@uidDriverId, @iPolicyId, @iAge);
SELECT ts
FROM ef.Driver
WHERE driverId = @uidDriverId;
END;
GO
GRANT EXEC ON [ef].qp_InsertDriver TO Public;
GO

-- ef.qp_UpdateDriver
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_UpdateDriver]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_UpdateDriver;
GO
CREATE PROCEDURE ef.qp_UpdateDriver (
@uidDriverId UNIQUEIDENTIFIER,
@ts_orig TIMESTAMP,
@iPolicyId_curr INT,
@iAge_curr TINYINT
)
AS
BEGIN
UPDATE ef.Driver SET
policyId = @iPolicyId_curr,
age = @iAge_curr
WHERE driverId = @uidDriverId AND ts = @ts_orig;
SELECT ts
FROM ef.Driver
WHERE driverId = @uidDriverId;
END;
GO
GRANT EXEC ON [ef].qp_UpdateDriver TO Public;

-- ef.qp_DeleteDriver
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_DeleteDriver]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_DeleteDriver;
GO
CREATE PROCEDURE ef.qp_DeleteDriver (
@uidDriverId UNIQUEIDENTIFIER,
@iPolicyId_orig INT,
@ts_orig TIMESTAMP
)
AS
BEGIN
DELETE ef.Driver
WHERE driverId = @uidDriverId AND ts = @ts_orig;
END;
GO
GRANT EXEC ON [ef].qp_DeleteDriver TO Public;

-- ef.qp_InsertDriverName
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_InsertDriverName]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_InsertDriverName;
GO
CREATE PROCEDURE ef.qp_InsertDriverName (
@uidNameId UNIQUEIDENTIFIER,
@uidDriverId UNIQUEIDENTIFIER,
@vcLastName VARCHAR(100),
@vcFirstName VARCHAR(25),
@vcMiddleName VARCHAR(25),
@dtEffDt DATE
)
AS
BEGIN
INSERT ef.DriverName (nameId, driverId, lastName, firstName, middleName, effDt) VALUES
(@uidNameId, @uidDriverId, @vcLastName, @vcFirstName, @vcMiddleName, @dtEffDt);
SELECT ts
FROM ef.DriverName
WHERE nameId = @uidNameId;
END;
GO
GRANT EXEC ON [ef].qp_InsertDriverName TO Public;
GO

-- ef.qp_UpdateDriverName
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_UpdateDriverName]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_UpdateDriverName;
GO
CREATE PROCEDURE ef.qp_UpdateDriverName (
@uidNameId UNIQUEIDENTIFIER,
@ts_orig TIMESTAMP,
@uidDriverId_curr UNIQUEIDENTIFIER,
@vcLastName_curr VARCHAR(100),
@vcFirstName_curr VARCHAR(25),
@vcMiddleName_curr VARCHAR(25),
@dtEffDt_curr DATE
)
AS
BEGIN
UPDATE ef.DriverName SET
driverid = @uidDriverId_curr,
lastName = @vcLastName_curr,
firstName = @vcFirstName_curr,
middleName = @vcMiddleName_curr,
effDt = @dtEffDt_curr
WHERE nameId = @uidNameId AND ts = @ts_orig;
SELECT ts
FROM ef.DriverName
WHERE nameId = @uidNameId;
END;
GO
GRANT EXEC ON [ef].qp_UpdateDriverName TO Public;
GO

-- ef.qp_DeleteDriverName
IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_DeleteDriverName]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].qp_DeleteDriverName;
GO
CREATE PROCEDURE ef.qp_DeleteDriverName (
@uidNameId UNIQUEIDENTIFIER,
@uidDriverId_orig UNIQUEIDENTIFIER,
@ts_orig TIMESTAMP
)
AS
BEGIN
DELETE ef.DriverName
WHERE nameId = @uidNameId AND ts = @ts_orig;
END;
GO
GRANT EXEC ON [ef].qp_DeleteDriverName TO Public;

Monday, June 16, 2008

Entity Framework - Stored Procedure (2)

In Entity Framework - Stored Procedure (1), I defined the store data model and the object data model.
In this blog, I will explore ways to use a stored procedure to materialize objects based on our defined data model.
Initially, I thought the framework would be able to use a stored procedure in its generated T-SQL instead of doing a SELECT explicitly. To see if that was the case, I created a stored procedure to get all drivers for a given policyId as follows:

IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[ef].[qp_GetDriversPerPolicy]') AND OBJECTPROPERTY([object_id], N'IsProcedure') = 1)
DROP PROCEDURE [ef].[qp_GetDriversPerPolicy];
GO
CREATE PROCEDURE ef.qp_GetDriversPerPolicy (
@iPolicyId INT
)
AS
BEGIN
SELECT D.driverId, D.policyId, D.age, D.ts
FROM ef.Driver D
WHERE D.policyId = @iPolicyId;
END;
GO
GRANT EXEC ON [ef].[qp_GetDriversPerPolicy] TO Public;
GO
-- Example:
EXEC ef.[qp_GetDriversPerPolicy] @iPolicyId = 2;
GO

Then I followed the instructions from
How to: Define a Model with a Stored Procedure
to modify the WORK.edmx file so that:
In SSDL, in the <Schema> element, I added

<Function Name="qp_GetDriversPerPolicy" Aggregate="false" BuiltIn="false" NiladicFunction="false" IsComposable="false" ParameterTypeSemantics="AllowImplicitConversion" Schema="ef">
  <Parameter Name="iPolicyId" Type="int" Mode="In" />
</Function>

In CSDL, in the <EntityContainer> element, I added

<FunctionImport Name="GetDriversPerPolicy" EntitySet="Drivers" ReturnType="Collection(WORKModel.Driver)">
  <Parameter Name="policyId" Mode="In" Type="Int32" />
</FunctionImport>

And in MSL, under <EntityContainerMapping>, I added

<FunctionImportMapping FunctionImportName="GetDriversPerPolicy" FunctionName="WORKModel.Store.qp_GetDriversPerPolicy" />
I saved the XML file WORK.edmx and in VS, I clicked "Run Custom Tool" on item WORK.edmx to update the generated class file.

Then, I created a Console project whose Main program basically ran the following code:

using (WORKModel.WORKEntities ctx = new WORKModel.WORKEntities())
{
try
{

ObjectQuery<WORKModel.Policy> pl =
ctx.Policies//.Include("Drivers")
.Where("it.policyId = @id", new ObjectParameter("id", 2));
var p = pl.Select(s => s).FirstOrDefault();
p.Drivers.Load();

Console.WriteLine("policyId\t expDt\t riskState\t description");
Console.WriteLine("{0}\t {1}\t {2}\t {3}", p.policyId, p.expDt, p.riskState, p.description);

Console.WriteLine("age");
foreach (WORKModel.Driver dr in p.Drivers)
{
Console.WriteLine("{0}", dr.age);
}

Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("Exception Occurred: {0}", e.Message);
Console.ReadLine();
}
}

I was expecting "p.Drivers.Load()" to use our stored procedure, but instead, the following T-SQL was excuted on the SQL server:

exec sp_executesql N'SELECT
1 AS [C1],
[Extent1].[driverId] AS [driverId],
[Extent1].[age] AS [age],
[Extent1].[ts] AS [ts],
[Extent1].[policyId] AS [policyId]
FROM [ef].[Driver] AS [Extent1]
WHERE [Extent1].[policyId] = @EntityKeyValue1',N'@EntityKeyValue1 int',@EntityKeyValue1=2

So the conclusion is that EF currently ignores stored procedures when it loads data into the ObjectContext object instances. It would be nice in future versions to have the choice of using stored procedures in that context if available.

Then, is it possible to use our stored procedure at all in the ObjectContext?
MSDN article How to: Execute a Query Using a Stored Procedure (Entity Framework)
shows that it is possible. To test that, create a Console project with its Main function as follows:
using (WORKModel.WORKEntities ctx = new WORKModel.WORKEntities())
{
try
{
int policyId = 2;
Console.WriteLine("driverId\t age");
foreach (WORKModel.Driver d in ctx.GetDriversPerPolicy(policyId))
{
Console.WriteLine("{0}\t {1}", d.driverId, d.age);
}
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("Exception Occurred: {0}", e.Message);
Console.ReadLine();
}
}
SQL Profiler indeed shows the following has been executed:
exec [ef].[qp_GetDriversPerPolicy] @iPolicyId=2.

EF makes our stored procedure a strongly-typed function, and makes it very convenient to execute a stored procedure explicitly in user code. It uses the ObjectContext's native ADO.NET connection to execute the stored procedure, and we also notice that between two consecutive calls to the stored procedure, SQL Profiler shows a connection reset by "exec sp_reset_connection". That is because EF wants to make sure the execution state for each stored procedure call is initialized. See
What does sp_reset_connection do?

for more information about the details of connection reset.

Sunday, June 15, 2008

Entity Framework with Stored Procedure (1)

Entity Framework, a.k.a. the ADO.NET Entity Data Model, defaults to dynamic SQL queries when it accesses the storage layer to retrieve and manipulate data. That gives the framework tremendous flexibility, and its query generator does a decent job. For example, most of the dynamic queries generated are parameterized, which helps ease the worry about SQL injection. Regarding SQL Server performance, since parameterized queries, like stored procedures, are mostly plan-cacheable, that should not be an issue. The main concern I have is security: dynamic queries require the user to have full data-access permissions on the underlying tables. That is a no-no in most shops, and it leads one to stored procedures. How does EF support stored procedures? I will have a series of blogs exploring various aspects of using stored procedures in EF, using a simple data model throughout the series. So let's start with defining the data model.

---- Create a database named WORK - I am using the Feb 2008 CTP version of SQL Server 2008.
-- There is nothing special in the work that needs features of SQL Server 2008. SQL Server 2005 will work too.
USE master;
-- Create Database
IF EXISTS (SELECT * FROM sys.databases WHERE [name] = 'WORK') BEGIN
DROP DATABASE WORK;
END;

CREATE DATABASE WORK
ON
( NAME = WORK_DAT,
-- TODO: make sure the file name is correct
FILENAME = 'C:\SQL Data - 2008\WORK_Data.MDF',
SIZE = 10MB,
MAXSIZE = UNLIMITED,
FILEGROWTH = 10% )
LOG ON
( NAME = WORK_LOG,
-- TODO: make sure the file name is correct
FILENAME = 'C:\SQL Data - 2008\WORK_Log.LDF',
SIZE = 5MB,
MAXSIZE = UNLIMITED,
FILEGROWTH = 10% );
GO

---- Create a schema, ef, to hold our SQL objects
USE WORK;
CREATE SCHEMA ef AUTHORIZATION dbo;

---- Tables
-- ef.Policy: master table. It has an IDENTITY field. It will be interesting to see how EF handles it.
CREATE TABLE ef.Policy (
policyId INT NOT NULL IDENTITY(1, 1),
expDt DATE NOT NULL,
riskState VARCHAR(25) NOT NULL,
description VARCHAR(100) NULL,
CONSTRAINT PK_ef_Policy PRIMARY KEY CLUSTERED (policyId)
);
-- ef.Driver: a detail table. It references ef.Policy and has a GUID primary key with default.
CREATE TABLE ef.Driver (
driverId UNIQUEIDENTIFIER NOT NULL,
policyId INT NOT NULL,
age TINYINT NULL,
CONSTRAINT PK_ef_Driver PRIMARY KEY CLUSTERED (driverId),
--CONSTRAINT DF_ef_Driver_driverId DEFAULT (NEWSEQUENTIALID()) FOR driverId,
CONSTRAINT FK_ef_Driver_policyId FOREIGN KEY (policyId) REFERENCES ef.Policy (policyId)
);
ALTER TABLE ef.Driver ADD CONSTRAINT DF_ef_Driver_driverId DEFAULT (NEWSEQUENTIALID()) FOR driverId;
-- ef.DriverName: a sub-detail table. It references ef.Driver and has a GUID primary key with default.
IF OBJECT_ID('ef.DriverName') IS NOT NULL DROP TABLE ef.DriverName;
CREATE TABLE ef.DriverName (
nameId UNIQUEIDENTIFIER NOT NULL
CONSTRAINT PK_ef_DriverName PRIMARY KEY CLUSTERED,
driverId UNIQUEIDENTIFIER NOT NULL
CONSTRAINT FK_ef_DriverName_driverId FOREIGN KEY REFERENCES ef.Driver (driverId),
lastName VARCHAR(100) NOT NULL,
firstName VARCHAR(25) NULL,
middleName VARCHAR(25) NULL,
effDt DATE NOT NULL
);
ALTER TABLE ef.DriverName ADD CONSTRAINT DF_ef_DriverName_nameId DEFAULT (NEWSEQUENTIALID()) FOR nameId;
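To have a few rows to query in the upcoming posts, I also seed some data (values are illustrative; driverId and nameId come from their NEWSEQUENTIALID() defaults):

---- Seed data (illustrative)
INSERT ef.Policy (expDt, riskState, description) VALUES ('2008-08-01', 'CA', 'A CA Policy');
INSERT ef.Driver (policyId, age) VALUES (1, 30); -- policyId 1 is the first IDENTITY value
INSERT ef.Driver (policyId, age) VALUES (1, 28);
INSERT ef.DriverName (driverId, lastName, firstName, effDt)
SELECT TOP 1 driverId, 'Doe', 'John', '2008-05-01' FROM ef.Driver;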

After the database objects are created on the SQL Server, I open Visual Studio 2008 SP1 Beta1, which has Beta 3 of the Entity Framework (ADO.NET Entity Data Model). I create a Console application, then add an ADO.NET Entity Data Model item - the wizard leads through the connection to the WORK database above, and I rename the item to WORK.edmx; the other names are left at their defaults.

At this point, I do a little customization:
The auto-generated code names the EntityType and the EntitySet the same. For example, for the Policy entity, both its type and its collection are named Policy. That can be confusing in future coding, so I decided to rename the EntitySet to Policies in this case, and likewise for Driver and DriverName. To do that, we need to make some changes in the XML file WORK.edmx as follows:
In CSDL: under <EntityContainer>, change <EntitySet Name="Policy" ...> to <EntitySet Name="Policies" ...>; if necessary, change the corresponding <End EntitySet="..."> values under each <AssociationSet>.
In MSL: change <EntitySetMapping Name="Policy"> to <EntitySetMapping Name="Policies">.
Save WORK.edmx
In VS, right-click WORK.edmx and Run Custom Tool.
Since the TIMESTAMP field ts is server-generated, is intended for concurrency control (discussed in detail when we implement the UPDATE and INSERT stored procedures), and is not for client code to manipulate, I make its setter private just to be sure.
So now we have a database data model and its related entity object model. In future blogs of this series, we will introduce stored procedures and see how they can be used within EF.

Thursday, March 20, 2008

.NET Assembly and DB Backup/Restore

We recently had to transfer a utility database from our local development server to a remote server hosted at our data center on the East Coast. We backed up the local database and restored it on the remote server. All went smoothly, and DBCC CHECKDB('…') showed no errors. But when we tried to use the CLR functions inside the restored DB, some worked while others did not. A quick analysis showed that those that worked were the simple ones whose assemblies only referenced the .NET assemblies "approved" for SQL, and those that did not work had assemblies referencing .NET assemblies not on the "approved" list, such as System.EnterpriseServices. To give some background: we need those "unapproved" assemblies for our .NET remoting client applications inside SQL; basically, we have made SQL a .NET remoting client which consumes data services with endpoints at our application server. Examples of such data services: SSN decoder, VIN decoder, address parser, geo decoder, White Page search engine. All those data services are implemented as .NET remoting objects hosted in Windows services and are accessed via TCP channels.

Back to our topic. In this particular case, the local server was 32-bit while the remote server was 64-bit. It turns out that SQLCLR automatically loads the (most recent) "approved" assemblies from the server's GAC. So the 64-bit SQL Server always has the newest 64-bit .NET assemblies on the "approved" list ready, and since our user assemblies were built to target "Any CPU", it is no wonder that the CLR functions whose assemblies reference only "approved" assemblies worked immediately after the restore.
The case with "unapproved" assemblies is another story. Their loading is outside SQLCLR's domain, and we have to load them manually. SQLCLR checks the signature of a loaded assembly against the one in the GAC when the loaded assembly is accessed, and reports an error if the signatures do not match, for example after an update. Obviously, the 32-bit assembly carried in the backup could not match its 64-bit counterpart. To fix the problem, ALTER ASSEMBLY [System.EnterpriseServices] FROM … did not work because it was not a compatible upgrade. We ended up dropping all custom assemblies referencing [System.EnterpriseServices] and [System.EnterpriseServices] itself, and recreating all of them in the reverse order.

Friday, March 7, 2008

Output Parameter for sp_executesql

sp_executesql is the preferred way to run dynamic queries because it encourages parameterization, which in turn increases the chances of plan reuse.

Here I will discuss a gotcha when using sp_executesql with output parameters in a loop: the output parameter is not re-initialized in each iteration of the loop; it keeps its value from the previous iteration. Let's look at this simple example:

---- Script 1: Output parameter is not initialized
DECLARE @iCount INT, @vcSQL NVARCHAR(MAX), @iOut INT;
SELECT @iCount = 1;
WHILE (@iCount <= 2)
BEGIN
SELECT @vcSQL =
'
SELECT @i = COALESCE(@i, @j);
';
EXEC sp_executesql @vcSQL,
N'@i INT OUT, @j INT',
@iOut OUT, @iCount;
SELECT @iCount, @iOut;
SELECT @iCount = @iCount + 1;
END;

In Script 1, we have a WHILE loop with 2 iterations. The simple dynamic query defined in @vcSQL has two parameters, @i and @j, as indicated in the execution of sp_executesql, with @i the OUTPUT parameter and @j a normal input parameter.
In the first iteration, @iCount = 1, which is passed to parameter @j. @i starts as NULL and is therefore assigned @j = 1. So SELECT @iCount, @iOut; produces (1, 1).
In the second iteration, @iCount = 2, which again is passed to @j. If @i were initialized, then just as in the first iteration, @i would be assigned @j = 2, and SELECT @iCount, @iOut; would produce (2, 2). But the actual result is (2, 1). That shows that at the start of iteration 2, @i was not initialized to NULL, but kept its value of 1 from iteration 1.

To make sure the output parameter is initialized in each iteration, we need to explicitly assign it a NULL value at the beginning inside the dynamic query as shown in Script 2. Running it produces the desired (1, 1) and (2, 2) from SELECT @iCount, @iOut; after the two iterations.

---- Script 2:
DECLARE @iCount INT, @vcSQL NVARCHAR(MAX), @iOut INT;
SELECT @iCount = 1;
WHILE (@iCount <= 2)
BEGIN
SELECT @vcSQL =
'
SELECT @i = NULL;
SELECT @i = COALESCE(@i, @j);
';
EXEC sp_executesql @vcSQL,
N'@i INT OUT, @j INT',
@iOut OUT, @iCount;
SELECT @iCount, @iOut;
SELECT @iCount = @iCount + 1;
END;

The above is a very simple demonstration. In the real world, @i might participate in complex calculations. For example, it might be an output parameter of a stored procedure, such as @expected_rowcount of sp_table_validation, which behaves differently depending on whether or not @expected_rowcount is NULL. Knowing the difference with and without the explicit initialization helps us achieve the desired results.
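An equivalent fix is to re-initialize the outer variable before each call; since sp_executesql output parameters are input/output, @i starts with whatever value @iOut holds on entry:

---- Script 3: re-initialize the outer variable instead
DECLARE @iCount INT, @vcSQL NVARCHAR(MAX), @iOut INT;
SELECT @iCount = 1;
WHILE (@iCount <= 2)
BEGIN
SELECT @iOut = NULL; -- reset before each call
SELECT @vcSQL =
'
SELECT @i = COALESCE(@i, @j);
';
EXEC sp_executesql @vcSQL,
N'@i INT OUT, @j INT',
@iOut OUT, @iCount;
SELECT @iCount, @iOut;
SELECT @iCount = @iCount + 1;
END;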


Thursday, February 28, 2008

Puzzle - Counterfeit Coins

SQL Server Magazine (February 2008) has the following puzzle:

  • This puzzle is from Clifford Jensen. Suppose you have 10 stacks of coins, with 10 coins in each stack. One stack consists of 10 counterfeit coins and the other 9 stacks each consist of 10 legitimate coins. Each legitimate coin weighs exactly 1 gram. Each counterfeit coin weighs exactly 0.9 grams. You have a digital scale that’s graduated in tenths of grams. Using the scale to take only one reading, determine which stack has the 10 counterfeit coins. You can weigh any number of coins from any number of stacks, but you must weigh them all together (i.e., you can take only one reading from the scale).

It is obvious that the solution has to be one in which coins from all 10 stacks are used in some way. In order to distinguish one stack from another, we need to take a different number of coins from each stack. So one solution is

  1. Label the 10 stacks (arbitrarily) from 1 to 10
  2. Take n coins from stack n, where n = 1, 2, …, 10. Put them on the scale and take a reading.
  3. Focus only on the single decimal digit. We have weighed 1 + 2 + … + 10 = 55 coins; if all were legitimate, the reading would be 55.0 grams, and a counterfeit stack n makes the reading 0.1n grams lighter, i.e., 55 − 0.1n. So if the digit is 9, the counterfeit stack is stack 1; 8: stack 2; 7: stack 3; 6: stack 4; 5: stack 5; 4: stack 6; 3: stack 7; 2: stack 8; 1: stack 9; and 0: stack 10. In short, if we let j be the decimal digit from the reading, then the stack with all counterfeit coins is stack number (10 - j), where j = 0 gives stack 10.

Wednesday, February 20, 2008

Understanding SQL plan_generation_num

SQL Server 2005 provides two important dynamic views, sys.dm_exec_cached_plans (P) and sys.dm_exec_query_stats (S), to help understand cached plans and their statistics. Joining on [plan_handle], P and S have a 1-to-0-or-more relationship. P has a record for every compiled plan in the plan cache, and S can have a record for every query statement in the corresponding cached plan. S can be empty for a cached plan of cache object type "Parse Tree" or "CLR Compiled Proc", or for objects such as system stored procedures, ad hoc queries, prepared statements, etc. Since I am mainly interested in user stored procedures and their compilation/recompilation frequency, for our purposes P and S have a 1-to-many relationship - a cached plan for a stored procedure in P has one or more corresponding records in S, one for each query statement in the stored procedure.

There are a lot of interesting columns in P and S, especially in S, and here I will only discuss what I have learned about plan_generation_num in S. SQL Server 2005 treats the compiled plan for a stored procedure as an array of subplans, one for each query statement. If an individual subplan needs recompilation, it does so without causing the whole plan to recompile. In doing so, SQL Server increments the plan_generation_num on the subplan record to 1 + MAX(plan_generation_num over all subplans). The general distribution of plan_generation_num among all subplans for a given plan is therefore multiple 1's plus distinct numbers > 1, because all subplans start with a plan_generation_num of 1. Appendix A is the query for studying plan_generation_num.

Based on the above discussion, the higher the maximum plan_generation_num for a cached plan, the more frequently the plan has been recompiled. Be aware that it is not recompilation of the whole plan but rather of the individual subplans that builds up the maximum plan_generation_num.
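A compact companion query ranks procedures by their maximum plan_generation_num (the two-argument form of OBJECT_NAME requires SQL Server 2005 SP2 or later):

---- Rank procedures by subplan recompilations
SELECT
obj = OBJECT_NAME(T.objectid, T.dbid),
max_plan_generation_num = MAX(S.plan_generation_num),
statement_count = COUNT(*)
FROM sys.dm_exec_cached_plans P
INNER JOIN sys.dm_exec_query_stats S
ON P.plan_handle = S.plan_handle
CROSS APPLY sys.dm_exec_sql_text(P.plan_handle) T
WHERE P.cacheobjtype = 'Compiled Plan' AND P.objtype = 'Proc'
GROUP BY T.objectid, T.dbid
ORDER BY max_plan_generation_num DESC;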

Appendix A.
Following is a query to list all plan_generation_num for a stored procedure
---- Analyze recompilation via plan_generation_num
DECLARE @vcName VARCHAR(50);
SELECT @vcName = 'qp_Temp';
SELECT
P.plan_handle,
S.statement_start_offset, S.statement_end_offset,
S.plan_generation_num,
S.execution_count,
P.usecounts,
S.creation_time,
S.last_execution_time,
sqlSmt = SUBSTRING(T2.text, (S.statement_start_offset/2) + 1, ((CASE WHEN S.statement_end_offset > 0 THEN S.statement_end_offset ELSE DATALENGTH(T2.text) END - S.statement_start_offset)/2) + 1),
L.query_plan,
TL.query_plan,
obj = OBJECT_NAME(T.objectid),
T.objectid,
P.cacheobjtype,
P.objtype,
T.text
FROM sys.dm_exec_cached_plans P
INNER JOIN sys.dm_exec_query_stats S
ON P.plan_handle = S.plan_handle
OUTER APPLY sys.dm_exec_sql_text(P.plan_handle) T
CROSS APPLY sys.dm_exec_query_plan(P.plan_handle) L
OUTER APPLY sys.dm_exec_sql_text(S.sql_handle) T2
CROSS APPLY sys.dm_exec_text_query_plan (P.plan_handle, DEFAULT, DEFAULT) TL
WHERE P.cacheobjtype = 'Compiled Plan' AND
T.text NOT LIKE '%dm_exec_cached_plans%' AND
P.objtype = 'Proc' AND T.dbid = DB_ID()
AND OBJECT_NAME(T.objectid) = @vcName
ORDER BY P.plan_handle, S.statement_start_offset, S.statement_end_offset

Monday, February 11, 2008

Website Down

A client called about its website being down. Even a simple page request would show a blank screen, but without any browser error. The website was ping-able with decent responses. The IP returned from DNS was correct - I had to check the IP because the client had just renewed the domain registration and I was afraid the renewal had somehow caused the IP associated with the website to change. That was not the case. Just to be sure, I also got confirmation by "View Source" in the browser, which showed the correct site info.

I then reviewed the System event log and noticed a recent installation of Windows security updates KB941644 and KB943485. I tried to call Microsoft Product Support Services at 1-866-PCSAFETY for free security-update support to find out if Microsoft had any knowledge of the issue, but got no answer. I searched the web and found some complaints related to the updates, one in particular about the proxy being messed up and the user having to uninstall the updates to fix the problem. My client did not use a proxy server, but that message told me the updates might have changed something on the web server to interfere with Internet browsing or even page rendering. I checked the web server's log and noticed 404 errors associated with GETs from external IPs - and interestingly, from intranet IPs, browsing the website had no problem. Another important piece of information was that from the web server, I could not browse www.msn.com, www.microsoft.com, etc. - the display showed the same blank screen an Internet user would get trying to browse my client's website, even though I could browse other websites such as www.google.com. With the above facts, I first uninstalled the updates, but that did not fix the problem. So, something else had been changed. I then went to Internet Options and restored the default settings. Voila! Everything started to work again.

That was a very unfortunate experience. My client's production web site was down for quite some time just because Microsoft sent out the updates without any warning about the potential effects. In this case, my client did nothing wrong, and it just followed Microsoft's recommendation. I hope Microsoft will be more thorough in QA and provide better guidance about its future updates.

Monday, February 4, 2008

Session Settings for SSB Activation Procedure

According to BOL, the session settings for internal activation are the default databases options:
"Service Broker executes internally activated service programs on a background session distinct from the connection that created the message. The options set for this session are the default options for the database."
Pasted from <http://msdn2.microsoft.com/en-us/library/ms171585.aspx>

The default options for our database are "OFF" for all the relevant settings. But the activation procedure, when activated, reports that the following settings are turned "ON":
ANSI_NULL_DFLT_ON
ANSI_NULLS
ANSI_PADDING
ANSI_WARNINGS
CONCAT_NULL_YIELDS_NULL
QUOTED_IDENTIFIER
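Inside the activation procedure, such a report can be produced with a query like this against sys.dm_exec_sessions (a sketch; only a few of the available setting columns are listed):

-- Run from inside the activation procedure to log the session's SET options.
SELECT ansi_nulls, ansi_padding, ansi_warnings,
arithabort, concat_null_yields_null, quoted_identifier
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID;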

Where did these settings come from? ANSI_NULLS and QUOTED_IDENTIFIER are understandable because they are creation time settings. The others are not set as the database defaults.

First, I thought those settings came from the set_options attribute of the cached plan. But the following query
SELECT P.usecounts, A.set_options FROM sys.dm_exec_cached_plans P OUTER APPLY sys.dm_exec_plan_attributes(P.plan_handle) A
CROSS APPLY sys.dm_exec_sql_text(P.plan_handle) T
WHERE A.attribute = 'set_options'
AND P.cacheobjtype = 'Compiled Plan' AND P.objtype = 'Proc'
AND T.text LIKE '%procName%'; -- replace procName with the name of the activation procedure
reports that this set of settings is compiled with the cached plan:
ANSI_NULLS
ANSI_PADDING
ANSI_WARNINGS
ARITHABORT
ARITHIGNORE
DISABLE_DEF_CNST_CHK
IMPLICIT_TRANSACTIONS

So that is not it. What is most strange about the set_options of the cached plan is IMPLICIT_TRANSACTIONS which should not be set at all in normal situations because of its unexpected effects on @@TRANCOUNT.

The settings reported from within the activation procedure are the desired-behavior settings, with the exception of the missing ARITHABORT, which is implied by ANSI_WARNINGS. Also, these 6 settings are exactly the same as those for an ADO.NET connection. These 6, except for CONCAT_NULL_YIELDS_NULL, belong to the group of settings controlled by ANSI_DEFAULTS. ANSI_DEFAULTS includes two other settings: CURSOR_CLOSE_ON_COMMIT and IMPLICIT_TRANSACTIONS. The following is quoted from BOL:
"The SQL Native Client ODBC driver and SQL Native Client OLE DB Provider for SQL Server automatically set ANSI_DEFAULTS to ON when connecting. The driver and Provider then set CURSOR_CLOSE_ON_COMMIT and IMPLICIT_TRANSACTIONS to OFF. The OFF settings for SET CURSOR_CLOSE_ON_COMMIT and SET IMPLICIT_TRANSACTIONS can be configured in ODBC data sources, in ODBC connection attributes, or in OLE DB connection properties that are set in the application before connecting to SQL Server. The default for SET ANSI_DEFAULTS is OFF for connections from DB-Library applications."

So it seems that instead of using the database default options, an SSB activation procedure is executed in a session environment similar to the connection of a SQL Native Client OLE DB Provider (such as an SSMS client).

Sunday, January 27, 2008

SET Settings for Stored Procedure

A stored procedure is unique in that it is able to control its execution context via the "environmental" SET settings. (Since a trigger is a special kind of stored procedure, can a trigger change its SET settings?) For example, we can SET TRANSACTION ISOLATION LEVEL at the beginning of a stored procedure to control all statements in it and to provide the default for its inner stored procedures, if any.
According to BOL,

  • The scope of the settings SET in the stored procedure is the stored procedure itself and those settings do not affect the calling (execution) environment.
  • Settings not explicitly set in the stored procedure default to those set at the execution time except for two special settings: ANSI_NULLS and QUOTED_IDENTIFIER.
  • Those two settings at the creation time of the stored procedure (implicitly) exclusively control the respective behavior, ignoring the corresponding settings at the execution time or the settings SET within the stored procedure. That is, those two settings are constant for the stored procedure after its creation, and to change them, it has to be altered or dropped and created with different settings.


We can find out the two settings at the time of stored procedure creation by querying:
SELECT OBJECTPROPERTY(OBJECT_ID('spName'), 'ExecIsQuotedIdentOn');
SELECT OBJECTPROPERTY(OBJECT_ID('spName'), 'ExecIsAnsiNullsOn');

Why are those two settings special?
QUOTED_IDENTIFIER is actually a parse-time setting: the parser needs to determine whether a statement like SELECT * FROM "table" is valid based on the setting. It makes sense that, for consistency, the behavior applied during parsing is also adopted at execution time, without regard to any different setting. (Other parse-time settings, such as FIPS_FLAGGER, OFFSETS, and PARSEONLY, do not affect execution results.)
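A small demonstration of the creation-time stickiness (the procedure name is just for illustration):

SET QUOTED_IDENTIFIER OFF;
GO
-- With QUOTED_IDENTIFIER OFF at parse time, "hello" is a string literal, not an identifier.
CREATE PROCEDURE dbo.qp_QuotedDemo AS SELECT "hello" AS greeting;
GO
SET QUOTED_IDENTIFIER ON;
GO
EXEC dbo.qp_QuotedDemo; -- still returns the literal: the creation-time setting wins
GO
DROP PROCEDURE dbo.qp_QuotedDemo;
GO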
The ANSI_NULLS setting can directly affect query results, so it is understandable that this setting deserves the consistency treatment. The question is why other settings do not get the same special treatment; specifically, CONCAT_NULL_YIELDS_NULL also affects query results. The optimizer needs to know these other settings when it is looking for the "best" query plan. So SQL's resolution is that each cached plan for a stored procedure has these settings included as part of the plan; any change to them in the execution context causes the optimizer to recompile and generate another plan.

Another question that has been on my mind for quite some time and still has no clear answer: what execution context is an activation stored procedure in? Since there is no explicitly created session to run such a procedure, do the settings default to the db-level settings (except for the two special settings, QUOTED_IDENTIFIER and ANSI_NULLS, of course - they were set at creation time)? But the default db settings are all off. Is there a risk here?
The same question applies to other client apps (besides the native SQL client used by SSMS), such as ADO.NET. BOL seems to imply that ODBC and OLE DB have automatic settings.

Wednesday, January 23, 2008

Cost of MAX in SQL Server 2005

Microsoft SQL Server 2005 introduced the keyword MAX for defining variable-length data types such as VARCHAR (and its Unicode counterpart NVARCHAR) and VARBINARY. It has given database architects and developers a more robust choice when designing data models and writing T-SQL code. There are many discussions of the internals and of the advantages and disadvantages of using MAX. Below I summarize the main points.

  • The storage engine treats a MAX type as a hybrid of the normal variable-length type and LOB type. That is, if a column with MAX type has length <= 8,000 bytes, then it will be treated as the normal-length type - stored on an in-row data page or a row-overflow data page. Otherwise, the column will be treated as a LOB type - stored on a LOB data page.
  • Because of the above internal flexibility, data types text, ntext and image can be virtually replaced by the appropriate MAX type if working exclusively in SQL 2005 and later environments.
  • A VARCHAR(MAX) variable is very handy for storing pre-validated XML data because of the potentially explosive size of XML data. Use a TRY … CATCH block to validate text data for XML well-formedness or against an XML schema (see the sketch after this list).
  • NVARCHAR(MAX) parameters are convenient to use for defining CLR functions/triggers/stored procedures in T-SQL. A lot of people, including me, misunderstood that .Net type SqlString would translate into NVARCHAR(MAX). Actually, it translates into NVARCHAR(4000) and SqlChars translates into NVARCHAR(MAX).
  • Although I have not verified it myself, some people claim statistics cannot be automatically created on MAX-type columns, therefore limiting the optimizer's ability. But statistics can be created manually on a column of any type. Also, we could create an index on it if necessary (be very careful here - the unlimited column size could make the index take a huge amount of storage).
  • Similarly, the Optimizer uses the column size as one of the metrics to determine the query plan. MAX certainly gives the Optimizer no clue in that regard.
    The unknown column size can cause difficulty/impossibility in calculating the number of rows per page and hence planning for the storage requirements.
  • As common to any variable-length columns, updates with increased data length have the potential of causing page splits, adversely impacting performance.
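The TRY … CATCH validation mentioned in the list above is simple enough (a minimal sketch):

DECLARE @vc VARCHAR(MAX), @x XML;
SELECT @vc = '<root><item/></root>';
BEGIN TRY
SELECT @x = CAST(@vc AS XML); -- throws if @vc is not well-formed
PRINT 'Well-formed XML';
END TRY
BEGIN CATCH
PRINT 'Not well-formed: ' + ERROR_MESSAGE();
END CATCH;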


In this post, I'd like to show an additional observation on using MAX variables. I have found that depending on the application, there can be a huge cost associated with using a MAX variable over its normal variable-length counterpart. As a disclaimer, I personally have not come across any such findings documented anywhere, but if that is due to my ignorance, please forgive me.
Here is some background.
We have two scalar CLR Regex functions in T-SQL, one for identifying a match (util.uf_RegexIsMatch), and the other for finding all matches and then doing a replacement (util.uf_RegexReplace). A colleague of mine recently brought to my attention a counter-intuitive test result: when applied to a table of half a million normal VARCHAR records, util.uf_RegexIsMatch consistently ran 3 times slower than util.uf_RegexReplace. That was very surprising. Without knowing exactly how the two Regex functions are implemented in .NET, intuitively IsMatch does less work than Replace, so we expected the opposite - if not 3 times faster, at least as fast - which was proven to be the case in comparison tests in C#.
It did not make sense, so I started a little investigation. I made sure the query plans were all right: a simple table scan, and after repeated runs all the data was in cache (SET STATISTICS IO ON showed no physical reads, only logical reads, and they were the same for both). That meant both had the same IO cost, and the only difference must have been in the functions themselves. Then I looked at the CREATE FUNCTION syntax and noticed that the input parameter of util.uf_RegexIsMatch was of type NVARCHAR(MAX) and that of util.uf_RegexReplace NVARCHAR(4000). I knew the type had to be a Unicode type because that is required by all CLR objects. But could the MAX have been the cause? It was hard to believe, but I decided to give it a try. I recreated util.uf_RegexIsMatch with NVARCHAR(4000) for the input parameter and re-ran the tests. Voila! The comparison went in the right direction. Out of curiosity, making both input parameters NVARCHAR(MAX) restored the expected - and obviously slower - comparison result.
In our case, we do not really need MAX, or even 4000 because the column in the table to feed the input parameters of the functions is only VARCHAR(50). I played with other numbers smaller than 4000 for the variable-length and found that made no performance difference as compared to 4000, so we standardized on NVARCHAR(4000) for the two functions. And they will be our default functions used in production. If necessary, we can create their counterparts with NVARCHAR(MAX) to handle special cases.
Is the above just a special and isolated case or does it have general validity?
Further investigation has shown that MAX in general has a negative performance impact not only on CLR functions, but also on T-SQL user-defined functions. I suspect it is also true for stored procedures. In addition, it is true even for ad hoc query batches, which can be simulated by comparing the following two simple batches. Be sure to check "Discard results after execution" in your SSMS options so that no display contaminates the test.
Note that a number after GO tells the query processor to execute the batch that many times.
---- Batch 1: normal variable-length variable
DECLARE @vc VARCHAR(8000);
SELECT @vc = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
GO 50000
---- Batch 2: MAX
DECLARE @vc VARCHAR(MAX);
SELECT @vc = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
GO 50000

On my notebook running XP and a SQL Server 2008 (Katmai) CTP, Batch 1 completes in 12 secs and Batch 2 in 16 secs, consistently. That is a difference of over 30%!


In conclusion, the flexibility that comes with MAX is not free. The difference in performance from using the normal variable-length counterpart can be unexpectedly large. We need to evaluate each situation to really justify its use, and even then, we may be better off creating two versions of the objects (functions, stored procedures), one for the normal production scenarios and the other for the special cases. I personally would like to study how SQL OS allocates and de-allocates memory for MAX variables to understand the range of differences in performance.