SQL Server Query Optimization – Sargable Functions

When a function is used in a SQL join or predicate, the function is sargable (Search ARGument ABLE) if the query optimizer can still make efficient use of indexes to speed up the query. Not all Transact-SQL functions are sargable, so care must be taken to avoid non-sargable functions when writing queries.

In this blog I’m going to demonstrate some workarounds for non-sargable pitfalls.

Setup

Let’s create a table, load it with data (10,000 rows), and create some non-clustered indexes to target when optimizing our queries.

CREATE TABLE Product (
	ProductId int IDENTITY(1,1) NOT NULL,
	ProductName nvarchar(100) NOT NULL,
	ProductLine nchar(2) NULL,
	CreatedDT datetime,
	CONSTRAINT PK_Product PRIMARY KEY CLUSTERED
	(
		ProductId ASC
	)
) ON [PRIMARY];
GO

INSERT INTO Product (ProductName, ProductLine, CreatedDT)
VALUES		('Product 1', 'L1', DATEADD(year, -1, GETDATE())), (' Product 1 ', 'L1', DATEADD(year, -2, GETDATE())),
			('Product 2', 'L1', DATEADD(year, -3, GETDATE())), (' Product 2 ', 'L1', DATEADD(year, -4, GETDATE())),
			('Product 3', 'L2', DATEADD(year, -5, GETDATE())), (' Product 3 ', 'L2', DATEADD(year, -6, GETDATE())),
			('Product 4', 'L2', DATEADD(year, -7, GETDATE())), (' Product 4 ', 'L3', DATEADD(year, -8, GETDATE())),
			('Product 5', 'L4', DATEADD(year, -9, GETDATE())), (' Product 5 ', NULL, DATEADD(year, -10, GETDATE()));
GO 1000

CREATE NONCLUSTERED INDEX IX_Product_ProductName ON dbo.Product
(
	ProductName ASC
) INCLUDE (ProductId) ON [PRIMARY];

CREATE NONCLUSTERED INDEX IX_Product_CreatedDT ON dbo.Product
(
	CreatedDT ASC
) INCLUDE (ProductId, ProductName) ON [PRIMARY];

CREATE NONCLUSTERED INDEX IX_Product_ProductLine ON dbo.Product
(
	ProductLine ASC
) INCLUDE (ProductId, ProductName) ON [PRIMARY];

LTRIM and RTRIM

First off, we want to optimize a query that returns Product records with ProductName equal to “Product 1”. Note that a subset of the products is prefixed and suffixed with spaces, and we want to return these rows too.

Here’s how NOT to do it –

SELECT	ProductId, ProductName
FROM	Product
WHERE	LTRIM(RTRIM(ProductName)) = 'Product 1';

We are targeting the IX_Product_ProductName index, and the query plan shows that the optimizer has chosen to use this index.

query1

However, it is scanning all 10,000 rows of the index in order to return the expected 2000. The LTRIM and RTRIM functions are not sargable.

If ProductName were only suffixed with trailing spaces, and never prefixed with leading ones, we could optimize the query by using the LIKE operator.

SELECT	ProductId, ProductName
FROM	Product
WHERE	ProductName LIKE 'Product 1%';

But as we also need to account for the leading spaces, this will not do for us.

To optimize the query, we add a computed column which uses the LTRIM and RTRIM functions to strip the spaces, and then create a non-clustered index on this column –

ALTER TABLE Product ADD ProductNameTrimmed AS LTRIM(RTRIM(ProductName));

CREATE NONCLUSTERED INDEX IX_Product_ProductNameTrimmed ON dbo.Product
(
	ProductNameTrimmed ASC
) INCLUDE (ProductId, ProductName) ON [PRIMARY];

If we now modify the query to use the computed column in the predicate –

SELECT	ProductId, ProductName
FROM	Product
WHERE	ProductNameTrimmed = 'Product 1';

The query optimizer is now able to perform a seek on the non-clustered index –

query2

The number of logical reads has been reduced from 50 to 16, and the execution time from 193 ms to 67 ms. If the query returned a smaller subset of records, the performance improvement would be even greater.
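As an aside, if you want to reproduce these figures, the logical reads and timings can be captured by wrapping the test queries in SET STATISTICS IO and SET STATISTICS TIME (a minimal sketch; your absolute numbers will differ) –

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT	ProductId, ProductName
FROM	Product
WHERE	ProductNameTrimmed = 'Product 1';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;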

DateTime Functions

If we now query on a specific CreatedDT datetime –

SELECT	ProductID, ProductName, CreatedDT
FROM	Product
WHERE	CreatedDT = '2014-11-21 14:08:42.593';

The optimizer performs an efficient seek on the IX_Product_CreatedDT index.

However, if we want to return rows for a date, irrespective of time, we may try the following widely used query –

SELECT	ProductID, ProductName, CreatedDT
FROM	Product
WHERE	DATEADD(dd, 0, DATEDIFF(dd, 0, CreatedDT)) = '2014-11-21';

But the DATEADD and DATEDIFF functions are not sargable so the query optimizer is unable to perform a seek on the IX_Product_CreatedDT non-clustered index. Instead it scans the clustered index, reading all 10,000 rows.

Instead let’s try the following –

SELECT	ProductID, ProductName, CreatedDT
FROM	Product
WHERE	CONVERT(date, CreatedDT) = '2014-11-21';

The query optimizer is now able to perform the seek on the index.

query3

The logical reads have now been reduced from 54 to 8. The CONVERT function, in this case, is sargable.

If we look at the details of the index seek, we see that the query optimizer is using the following seek predicate –

Seek Keys[1]: 
Start: [Sandbox].[dbo].[Product].CreatedDT > Scalar Operator([Expr1005]), 
End: [Sandbox].[dbo].[Product].CreatedDT < Scalar Operator([Expr1006])

So it is translating the CONVERT predicate into a range seek over the day. This gives us a clue as to another method for enabling the optimizer to make efficient use of the index (note the >= on the lower bound, so that rows stamped exactly at midnight are included) –

SELECT	ProductID, ProductName, CreatedDT
FROM	Product
WHERE	CreatedDT >= '2014-11-21' AND CreatedDT < '2014-11-22';

Now let’s try and return all rows created in year 2014 –

SELECT	ProductID, ProductName, CreatedDT
FROM	Product
WHERE	YEAR(CreatedDT) = 2014;

Not surprisingly, the query optimizer scans all rows as the YEAR function is not sargable.

To enable the optimizer to use the IX_Product_CreatedDT index we can re-write the query as follows –

SELECT	ProductID, ProductName, CreatedDT
FROM	Product
WHERE	CreatedDT >= '2014-01-01' AND CreatedDT < '2015-01-01';

query4

The number of logical reads is reduced from 54 to 8.

ISNULL and COALESCE

The following query returns all Product records with ProductLine equal to “L4”.

SELECT	ProductID, ProductName, ProductLine
FROM	Product
WHERE	ProductLine = 'L4';

The query optimizer performs an efficient seek on the IX_Product_ProductLine index.

query5

The ProductLine column is nullable, and if we want to return the NULL rows as well we could use an ISNULL or COALESCE expression as follows –

SELECT	ProductID, ProductName, ProductLine
FROM	Product
WHERE	COALESCE(ProductLine, 'L4') = 'L4';

However, the optimizer now chooses a non-optimal scan on the index.

query6

The COALESCE function, like the ISNULL function, is not sargable. The fix is simply to change the predicate to use an OR … IS NULL as shown below –

SELECT	ProductID, ProductName, ProductLine
FROM	Product
WHERE	ProductLine = 'L4' OR ProductLine IS NULL;

The optimizer now chooses an optimal seek on the index.

query7

Rule of Thumb

Sargable functions in SQL Server are few and far between. As a rule of thumb, I’d recommend that SQL Server developers without a great deal of query-optimization experience err on the side of caution and avoid applying functions or operators to the columns used in joins or predicates. If there is an imperative to do so, consider whether there is an alternative such as a computed column.

Inheritance Modelling Patterns with Entity Framework 6 Code First

Introduction

In this post I’m going to demonstrate how to model inheritance using Entity Framework 6 Code First. In particular, I’m going to demonstrate the following alternative database mapping strategies –

  • Table Per Hierarchy (TPH)
  • Table Per Concrete Type (TPC)
  • Table Per Type (TPT)

My model is as follows –

contract

Contract, the base class, has a start date, an end date, and a unique identifier. CreditContract and PrepaymentContract derive from Contract and hence inherit its properties. In addition, CreditContract has credit limit and currency properties, whilst PrepaymentContract has prepayment amount and period properties.

Code First Classes

The first step is to create the model classes.

Contract

Contract is the base class for the hierarchy, hence the abstract modifier.

using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

namespace CodeFirstInheritance.Models
{
    public abstract class Contract
    {
        [Key]
        [DatabaseGeneratedAttribute(DatabaseGeneratedOption.Identity)]
        public int ContractID { get; set; }

        [DataType(DataType.Date)]
        [Required]
        public DateTime StartDate { get; set; }

        [DataType(DataType.Date)]
        [Required]
        public DateTime EndDate { get; set; }
    }
}

CreditContract

CreditContract is derived from Contract.

For simplicity I have modelled the CreditLimitCurrency property as a string with data annotations mapping it to a char(3) column. In practice it would be better to model this as an enum.

using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

namespace CodeFirstInheritance.Models
{
    public class CreditContract : Contract
    {
        [Required]
        public float CreditLimit { get; set; }

        [Column(TypeName = "char")]
        [StringLength(3)]
        [Required]
        public string CreditLimitCurrency { get; set; }
    }
}

PrepaymentContract

PrepaymentContract also derives from Contract.

using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

namespace CodeFirstInheritance.Models
{
    public class PrepaymentContract : Contract
    {
        [Required]
        public float PrepaymentAmount { get; set; }

        [Column(TypeName = "char")]
        [StringLength(1)]
        [Required]
        public string PrepaymentPeriod { get; set; }
    }
}

Table Per Hierarchy (TPH)

With the TPH mapping strategy, all classes in the inheritance tree are mapped to a single database table. A database field is used as a discriminator in determining which concrete type each database record should be mapped to.

To instruct Entity Framework to use this strategy, all that is required is to derive a class from the DbContext class and add a DbSet property for the Contract type, whilst NOT adding DbSet properties for CreditContract or PrepaymentContract.

I have also added an override of the OnModelCreating method. I’ll discuss this in a moment.

CodeFirstInheritanceDbContext

using System.Data.Entity;

namespace CodeFirstInheritance.Models
{
    public class CodeFirstInheritanceDbContext : DbContext
    {
        public DbSet<Contract> Contracts { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.Entity<Contract>()
                .Map<CreditContract>(m => m.Requires("Discriminator").HasValue("C"))
                .Map<PrepaymentContract>(m => m.Requires("Discriminator").HasValue("P"));

            base.OnModelCreating(modelBuilder);
        }
    }
}

If we then enable migrations and add an initial migration, the following DbMigration class is generated –

namespace CodeFirstInheritance.Migrations
{
    using System;
    using System.Data.Entity.Migrations;

    public partial class Initial : DbMigration
    {
        public override void Up()
        {
            CreateTable(
                "dbo.Contracts",
                c => new
                    {
                        ContractID = c.Int(nullable: false, identity: true),
                        StartDate = c.DateTime(nullable: false),
                        EndDate = c.DateTime(nullable: false),
                        CreditLimit = c.Single(),
                        CreditLimitCurrency = c.String(maxLength: 3, fixedLength: true, unicode: false),
                        PrepaymentAmount = c.Single(),
                        PrepaymentPeriod = c.String(maxLength: 1, fixedLength: true, unicode: false),
                        Discriminator = c.String(nullable: false, maxLength: 128),
                    })
                .PrimaryKey(t => t.ContractID);

        }

        public override void Down()
        {
            DropTable("dbo.Contracts");
        }
    }
}

A single table, Contracts, is created with fields from all the types. Of note is the Discriminator field, which we did not explicitly specify in our hierarchy. EF has generated this field for us and will use it to discriminate between credit and prepayment contract records.

If we update the database and then execute the following code –

using (var db = new CodeFirstInheritanceDbContext())
{
    CreditContract creditContract = new CreditContract();
    creditContract.StartDate = new DateTime(2015, 1, 1);
    creditContract.EndDate = new DateTime(2015, 12, 31);
    creditContract.CreditLimit = 100000;
    creditContract.CreditLimitCurrency = "USD";
    db.Contracts.Add(creditContract);

    PrepaymentContract prepaymentContract = new PrepaymentContract();
    prepaymentContract.StartDate = new DateTime(2015, 1, 1);
    prepaymentContract.EndDate = new DateTime(2015, 12, 31);
    prepaymentContract.PrepaymentAmount = 10000;
    prepaymentContract.PrepaymentPeriod = "M";
    db.Contracts.Add(prepaymentContract);

    db.SaveChanges();
}

The following records are added to the Contract table –

tphresults

By default, EF uses the names of the types as the discriminator values, which can noticeably increase the size of each record. EF allows us to override these values using the fluent API. This is what I did in the OnModelCreating method of the CodeFirstInheritanceDbContext class, in order to shorten the discriminator to a single character.

We can then query the model across contract types and for specific contract types as follows –

using (var db = new CodeFirstInheritanceDbContext())
{
    var countOfAllContract = db.Contracts.Count();
    var countOfCreditContracts = db.Contracts.OfType<CreditContract>().Count();
}
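If you are curious what this sends to the database, you can log the generated SQL (for example via db.Database.Log in EF6). The OfType<CreditContract>() query boils down to a filter on the discriminator column, roughly along these lines (a simplified sketch, not the exact EF-generated text) –

SELECT COUNT(*)
FROM [dbo].[Contracts]
WHERE [Discriminator] = 'C';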

TPH is the simplest mapping strategy to implement. And as all the data is stored in a single table, no joins are required when querying. It is therefore typically the most performant strategy.

However, there are a few issues –

  • If you look back at the CreditContract and PrepaymentContract classes you’ll note that all the properties are decorated with the [Required] attribute. However, to fit records of both types into a single database table EF has had to generate the fields as nullable. Data integrity has potentially been compromised.
  • It is not clear from the database table which fields relate to which type. This may make life difficult when populating the table from other sources.
  • The table breaks 3rd normal form as the values contained in the sub-type fields depend on the discriminator. 3rd normal form states that fields should only depend on the table key.

Table Per Concrete Type (TPC)

With the TPC mapping strategy, each concrete class is mapped to its own database table. So we get two tables, CreditContract and PrepaymentContract, each having all the fields in its branch of the hierarchy. We do not get a table for the abstract base class.

Before we instruct EF to use this strategy, we need to make a small modification to the Contract class – change the type of the ContractID property from int to Guid. I’ll discuss why in a moment.

using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

namespace CodeFirstInheritance.Models
{
    public abstract class Contract
    {
        [Key]
        [DatabaseGeneratedAttribute(DatabaseGeneratedOption.Identity)]
        public Guid ContractID { get; set; }

        [DataType(DataType.Date)]
        [Required]
        public DateTime StartDate { get; set; }

        [DataType(DataType.Date)]
        [Required]
        public DateTime EndDate { get; set; }
    }
}

As per the TPH strategy, we derive a class from DbContext and add a DbSet property for the Contract type, whilst NOT adding DbSet properties for CreditContract or PrepaymentContract.

We then override the OnModelCreating method and add fluent API code to instruct EF to map the CreditContract entity to a specific table, and to map the inherited properties to that table as well. The table name does not have to be the same as the entity name.

We do the same for PrepaymentContract.

using System.Data.Entity;

namespace CodeFirstInheritance.Models
{
    public class CodeFirstInheritanceDbContext : DbContext
    {
        public DbSet<Contract> Contracts { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.Entity<CreditContract>().Map(m =>
            {
                m.MapInheritedProperties();
                m.ToTable("CreditContract");
            });

            modelBuilder.Entity<PrepaymentContract>().Map(m =>
            {
                m.MapInheritedProperties();
                m.ToTable("PrepaymentContract");
            });   
           
            base.OnModelCreating(modelBuilder);
        }
    }
}

If we then enable migrations and add an initial migration, we can see from the migration script that 2 separate tables are created on the database.

namespace CodeFirstInheritance.Migrations
{
    using System;
    using System.Data.Entity.Migrations;
    
    public partial class Initial : DbMigration
    {
        public override void Up()
        {
            CreateTable(
                "dbo.CreditContract",
                c => new
                    {
                        ContractID = c.Guid(nullable: false, identity: true),
                        StartDate = c.DateTime(nullable: false),
                        EndDate = c.DateTime(nullable: false),
                        CreditLimit = c.Single(nullable: false),
                        CreditLimitCurrency = c.String(nullable: false, maxLength: 3, fixedLength: true, unicode: false),
                    })
                .PrimaryKey(t => t.ContractID);
            
            CreateTable(
                "dbo.PrepaymentContract",
                c => new
                    {
                        ContractID = c.Guid(nullable: false, identity: true),
                        StartDate = c.DateTime(nullable: false),
                        EndDate = c.DateTime(nullable: false),
                        PrepaymentAmount = c.Single(nullable: false),
                        PrepaymentPeriod = c.String(nullable: false, maxLength: 1, fixedLength: true, unicode: false),
                    })
                .PrimaryKey(t => t.ContractID);
            
        }
        
        public override void Down()
        {
            DropTable("dbo.PrepaymentContract");
            DropTable("dbo.CreditContract");
        }
    }
}

Update the database and then execute the following code –

using (var db = new CodeFirstInheritanceDbContext())
{
    CreditContract creditContract = new CreditContract();
    creditContract.StartDate = new DateTime(2015, 1, 1);
    creditContract.EndDate = new DateTime(2015, 12, 31);
    creditContract.CreditLimit = 100000;
    creditContract.CreditLimitCurrency = "USD";
    db.Contracts.Add(creditContract);

    PrepaymentContract prepaymentContract = new PrepaymentContract();
    prepaymentContract.StartDate = new DateTime(2015, 1, 1);
    prepaymentContract.EndDate = new DateTime(2015, 12, 31);
    prepaymentContract.PrepaymentAmount = 10000;
    prepaymentContract.PrepaymentPeriod = "M";
    db.Contracts.Add(prepaymentContract);

    db.SaveChanges();
}

The CreditContract entity has been persisted to the CreditContract table –

tpc1

Whilst the PrepaymentContract entity has been persisted to the PrepaymentContract table –

tpc2

We can then query the entities exactly as we did when the TPH strategy was adopted. The consumer of the model does not need to be aware of the mapping strategy.

We changed the ContractID property from int to Guid because we are asking the database server to generate unique identifiers for our records. The concrete entity types are mapped to separate tables, and the database server does not, and should not, know about the relationship between them. Integer identity values are only unique within the context of a single table, so if the type were int we would soon experience errors similar to the following –

The changes to the database were committed successfully, but an error occurred while updating the object context. The ObjectContext might be in an inconsistent state. Inner exception message: AcceptChanges cannot continue because the object's key values conflict with another object in the ObjectStateManager. Make sure that the key values are unique before calling AcceptChanges.

However, by changing the type to Guid we are, to all intents and purposes, generating globally unique key values.

The downside of doing this is that SQL Server GUID fields are 16 bytes in length, compared to 4 bytes for an int.

Compared to TPH, TPC is less performant when querying across types as the queries require a UNION between the tables. However, it will be equally performant when querying a single type.
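To illustrate the point, a query across all contract types under TPC has to be stitched together from both tables, conceptually something like the following (a simplified sketch, not the literal SQL that EF emits) –

SELECT ContractID, StartDate, EndDate,
       CreditLimit, CreditLimitCurrency,
       NULL AS PrepaymentAmount, NULL AS PrepaymentPeriod
FROM   dbo.CreditContract
UNION ALL
SELECT ContractID, StartDate, EndDate,
       NULL, NULL,
       PrepaymentAmount, PrepaymentPeriod
FROM   dbo.PrepaymentContract;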

As TPC stores the data in separate tables, it allows for non-nullable columns, hence ensuring data integrity. And it is clear which properties are required by which type.

Table Per Type (TPT)

With the Table Per Type (TPT) strategy, both abstract and concrete types are mapped to their own tables in the database, with the inheritance relationships represented by foreign key relationships.

To instruct EF to use this strategy we have 2 options.

The first option is to override the OnModelCreating method and add Fluent API code to instruct EF to map the CreditContract entity to a specific table, and the PrepaymentContract entity to a different specific table.

using System.Data.Entity;

namespace CodeFirstInheritance.Models
{
    public class CodeFirstInheritanceDbContext : DbContext
    {
        public DbSet<Contract> Contracts { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.Entity<CreditContract>().Map(m =>
            {
                m.ToTable("CreditContract");
            });

            modelBuilder.Entity<PrepaymentContract>().Map(m =>
            {
                m.ToTable("PrepaymentContract");
            });   
           
            base.OnModelCreating(modelBuilder);
        }
    }
}

The second option is to annotate the concrete types with the [Table()] attribute.

using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

namespace CodeFirstInheritance.Models
{
    public abstract class Contract
    {
        [Key]
        [DatabaseGeneratedAttribute(DatabaseGeneratedOption.Identity)]
        public int ContractID { get; set; }

        [DataType(DataType.Date)]
        [Required]
        public DateTime StartDate { get; set; }

        [DataType(DataType.Date)]
        [Required]
        public DateTime EndDate { get; set; }
    }

    [Table("CreditContract")]
    public class CreditContract : Contract
    {
        [Required]
        public float CreditLimit { get; set; }

        [Column(TypeName = "char")]
        [StringLength(3)]
        [Required]
        public string CreditLimitCurrency { get; set; }
    }

    [Table("PrepaymentContract")]
    public class PrepaymentContract : Contract
    {
        [Required]
        public float PrepaymentAmount { get; set; }

        [Column(TypeName = "char")]
        [StringLength(1)]
        [Required]
        public string PrepaymentPeriod { get; set; }
    }
}

Enabling migrations and adding an initial migration gives us the following –

namespace CodeFirstInheritance.Migrations
{
    using System;
    using System.Data.Entity.Migrations;

    public partial class Initial : DbMigration
    {
        public override void Up()
        {
            CreateTable(
                "dbo.Contracts",
                c => new
                    {
                        ContractID = c.Int(nullable: false, identity: true),
                        StartDate = c.DateTime(nullable: false),
                        EndDate = c.DateTime(nullable: false),
                    })
                .PrimaryKey(t => t.ContractID);

            CreateTable(
                "dbo.CreditContract",
                c => new
                    {
                        ContractID = c.Int(nullable: false),
                        CreditLimit = c.Single(nullable: false),
                        CreditLimitCurrency = c.String(nullable: false, maxLength: 3, fixedLength: true, unicode: false),
                    })
                .PrimaryKey(t => t.ContractID)
                .ForeignKey("dbo.Contracts", t => t.ContractID)
                .Index(t => t.ContractID);

            CreateTable(
                "dbo.PrepaymentContract",
                c => new
                    {
                        ContractID = c.Int(nullable: false),
                        PrepaymentAmount = c.Single(nullable: false),
                        PrepaymentPeriod = c.String(nullable: false, maxLength: 1, fixedLength: true, unicode: false),
                    })
                .PrimaryKey(t => t.ContractID)
                .ForeignKey("dbo.Contracts", t => t.ContractID)
                .Index(t => t.ContractID);

        }

        public override void Down()
        {
            DropForeignKey("dbo.PrepaymentContract", "ContractID", "dbo.Contracts");
            DropForeignKey("dbo.CreditContract", "ContractID", "dbo.Contracts");
            DropIndex("dbo.PrepaymentContract", new[] { "ContractID" });
            DropIndex("dbo.CreditContract", new[] { "ContractID" });
            DropTable("dbo.PrepaymentContract");
            DropTable("dbo.CreditContract");
            DropTable("dbo.Contracts");
        }
    }
}

3 tables are created, one per type, as shown in the following database schema –

dbdiagram

Update the database and then execute the following code –

using (var db = new CodeFirstInheritanceDbContext())
{
    CreditContract creditContract = new CreditContract();
    creditContract.StartDate = new DateTime(2015, 1, 1);
    creditContract.EndDate = new DateTime(2015, 12, 31);
    creditContract.CreditLimit = 100000;
    creditContract.CreditLimitCurrency = "USD";
    db.Contracts.Add(creditContract);

    PrepaymentContract prepaymentContract = new PrepaymentContract();
    prepaymentContract.StartDate = new DateTime(2015, 1, 1);
    prepaymentContract.EndDate = new DateTime(2015, 12, 31);
    prepaymentContract.PrepaymentAmount = 10000;
    prepaymentContract.PrepaymentPeriod = "M";
    db.Contracts.Add(prepaymentContract);

    db.SaveChanges();
}

And this is what we get in our Contract table –

tpt1

And in the CreditContract table –

tpt2

And finally the PrepaymentContract table –

tpt3

TPT is the least performant of the strategies, but enables the highest degree of database integrity and normalization.
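To see where the cost comes from, materialising a CreditContract under TPT means joining the base table to the derived table, conceptually like this (a simplified sketch of the join EF produces) –

SELECT c.ContractID, c.StartDate, c.EndDate,
       cc.CreditLimit, cc.CreditLimitCurrency
FROM   dbo.Contracts AS c
INNER JOIN dbo.CreditContract AS cc
        ON cc.ContractID = c.ContractID;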

 

 

 

SSIS Data Flows – ADO.NET vs. OLE DB vs. ODBC

There has been much debate in the developer community on which provider to choose when connecting to a SQL Server database.

Performance is often cited as a discriminator, with ADO.NET being less performant than ODBC and OLE DB because it is a managed façade. But performance is often glossed over, as most applications deal only with CRUD operations, where the differences are negligible.

However, as BI developers we typically work with large datasets. We therefore cannot gloss over the performance differences, particularly when implementing SSIS data flow tasks.

A quick search online reveals very little in the way of empirical data on which is faster. Maybe I’ve been searching on the wrong terms, but the only comprehensive test data I’ve come up with is detailed by Gilbert Quevauvilliers in the following article –

https://gqbi.wordpress.com/2013/06/19/ssis-fastest-data-flow-task-item-for-transferring-data-over-the-network/​

He compares various combinations of ADO.NET and OLE DB source and destination components for a relatively large data set – 10,000,000 rows. He found a combination of OLE DB source and destination components to be over 8x faster than ADO.NET for his testing protocol.

Given the scarcity of evidence, I decided to carry out some tests myself.

Create Source and Destination Tables

I used AdventureWorks2014 as my source database.

I created a [Sales].[SalesOrderDetailBIG] table with the same schema as [Sales].[SalesOrderDetail], and then populated it with 10,000,000 records, totalling just under 700 MB. This is the maximum number of records that I will be testing with, and should be more than enough to cover most incremental loading scenarios.

USE AdventureWorks2014
GO
CREATE TABLE [Sales].[SalesOrderDetailBIG](
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] [numeric](38, 6) NOT NULL,
        [rowguid] [uniqueidentifier] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
) ON [PRIMARY]
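The population step isn't shown here, but one simple way to reach roughly 10,000,000 rows is to repeatedly re-insert the contents of [Sales].[SalesOrderDetail] (a sketch, assuming the standard AdventureWorks2014 schema, which holds around 121,000 sales order detail rows) –

INSERT INTO [Sales].[SalesOrderDetailBIG]
        ([SalesOrderID], [CarrierTrackingNumber], [OrderQty], [ProductID],
         [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal],
         [rowguid], [ModifiedDate])
SELECT  [SalesOrderID], [CarrierTrackingNumber], [OrderQty], [ProductID],
        [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal],
        NEWID(), [ModifiedDate]
FROM    [Sales].[SalesOrderDetail];
-- ~121,000 rows per pass, so 83 passes lands just over 10,000,000
GO 83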

My destination is a SQL Server 2014 database on the same server – my underpowered 8GB i5 quad-core development laptop – so there will be no network latency between the source and destination databases. This is not representative of most scenarios, but I’ve only got access to one server at the moment so needs must.

The destination table is as follows –

USE RWDW_Staging 
GO
CREATE TABLE [dbo].[SalesOrderDetail_Staging](
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NOT NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_SalesOrderDetail_Staging] PRIMARY KEY CLUSTERED (
        [SalesOrderID] ASC,
        [SalesOrderDetailID] ASC 
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_Staging] ON [dbo].[SalesOrderDetail_Staging]
(
        [ModifiedDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Please note that the destination table does not include the [LineTotal] and [rowguid] columns, and that it has a composite primary key on [SalesOrderID] and [SalesOrderDetailID] plus a non-clustered index on [ModifiedDate]. More about these features later.

Create Connection Managers

I created 6 connection managers, one each of ADO.NET, OLE DB, and ODBC for the source and destination databases –

Connection Managers

For ADO.NET and OLE DB I used the “.Net Providers\SQLClient Data Provider” and “Native OLE DB\SQL Server Native Client 11.0” data providers.

For ODBC I configured System DSNs for the source and destination, using the “ODBC Driver 11 for SQL Server” driver. According to the SQL Server 2014 developer guide, this is the successor to the “SQL Server Native Client 11.0” driver and should be used in its place –

http://msdn.microsoft.com/en-us/library/cc280510.aspx

I kept the default settings for all the connection managers.

Comparison of ADO.NET, OLE DB, and ODBC Source Components

I wanted to initially compare the source data flow components in isolation from the destination components, so I added 3 data flow tasks to my package, and configured them as follows –

ADO.NET Source Data Flow Task

ADONET Source

By adding a RowCount transformation, I do not need to add a destination. Hence I have isolated the source from the destination. The RowCount transformation adds very little overhead and should not skew the results.

ADONET Source Config

I have selected the LocalHost AdventureWorks2014 ADO NET connection manager, and used a SQL command to select only the columns that I need. Selecting only the required columns rather than all of them can make a significant difference to performance, but it is something that is easily overlooked.

OLE DB Source Data Flow Task

The OLE DB Source test data flow follows the same pattern –

OLEDB Source

The OLE DB Source component allows you to choose the subset of columns to utilise – in this case I have deselected [LineTotal] and [rowguid] –

OLEDB Source Config

So you would think that, rather than having to write a SQL command for the source, you could combine the “Table or View” data access mode with selecting only the columns that you need.

But all is not as it seems. A SQL Server Profiler trace reveals the following –

Profiler

SSIS is still emitting a “select *” statement and so is retrieving all columns.

The answer, as per the ADO.NET source, is to use a SQL Command to select only the columns that are needed. I have done this for the test.
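The SQL command itself isn't visible in the screenshots, but it is simply a column-restricted select along these lines (a reconstruction, excluding [LineTotal] and [rowguid]) –

SELECT  [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
        [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
        [UnitPriceDiscount], [ModifiedDate]
FROM    [Sales].[SalesOrderDetailBIG];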

ODBC Source Data Flow Task

ODBC Source

The ODBC Source component has a BatchSize property. The Profiler trace reveals the following –

Profiler 5

The sp_cursorprepexec system stored procedure is executed to prepare a cursor and populate it with the full data set. The sp_cursorfetch stored procedure is then repeatedly executed to fetch the data 1,000 rows at a time. The final parameter is the batch size specified in the BatchSize property.

I tried a range of BatchSize values and found a negligible difference in performance. I did, however, hit a problem when attempting to retrieve the data in a single batch by setting the BatchSize to 10,000,000. Performance was significantly degraded, as the SSIS internal buffer size was smaller than the batch size, and a large overhead was incurred in allocating more memory to the buffer –

Information: 0x4004800C at ODBC Source, SSIS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 0 buffers were considered and 0 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

Information: 0x4004800F at ODBC Source: Buffer manager allocated 2 megabyte(s) in 1 physical buffer(s).

Information: 0x40048010 at ODBC Source: Component "ODBC Source" (2) owns 2 megabyte(s) physical buffer.

Source Data Flow Component Results

I wanted to see if the number of records affected the results. I therefore modified the SQL commands using a “TOP n” clause, and ran the tests for 10,000,000, 1,000,000, 100,000, and 10,000 records for each data source. I ran each test three times in order to account for chatter on the server. The only other processes running during the testing were background OS and SQL Server processes, but even these can sometimes skew the results.
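For example, the 1,000,000-row runs used a command of this shape (a sketch of the TOP n variant) –

SELECT  TOP (1000000)
        [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
        [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
        [UnitPriceDiscount], [ModifiedDate]
FROM    [Sales].[SalesOrderDetailBIG];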

Results are as follows –

Source   Records     Data (MB)  Elapsed Time (s)  MB/s
ADO.NET  10,000,000  639.727    37.737            16.952
ADO.NET  1,000,000   63.973     4.025             15.894
ADO.NET  100,000     6.397      0.483
ADO.NET  10,000      0.640      0.109
OLE DB   10,000,000  639.727    15.693            40.765
OLE DB   1,000,000   63.973     1.716             37.280
OLE DB   100,000     6.397      0.297
OLE DB   10,000      0.640      0.078
ODBC     10,000,000  639.727    73.461            8.708
ODBC     1,000,000   63.973     6.614             9.672
ODBC     100,000     6.397      0.687
ODBC     10,000      0.640      0.234

Gilbert found some benefit from optimizing the source connection manager by changing the packet size property to the maximum possible value of 32,767, and also by adding the SQL hints “WITH (NOLOCK) OPTION (FAST 10000)”. Neither of these was applicable to my testing protocol: both my databases are on the same server, so the data is not being sent over a network; and I am not competing with other processes for access to the data, so I do not have to wait for locks to be released on the source table.

But as with all SQL hints, you are overriding the optimizer, and so there may or may not be an improvement. You should definitely consider all of the above suggestions when attempting to optimize your processes. Gilbert, to be fair, does say it “sometimes” improves performance.

There’s a good article by Rob Farley on why you should consider using the FAST 10000 hint –

http://sqlblog.com/blogs/rob_farley/archive/2011/02/17/the-ssis-tuning-tip-that-everyone-misses.aspx

Under my testing protocol, the OLE DB source in isolation is approximately 2.5x quicker than the ADO.NET source, and a whopping 3.8x quicker than the ODBC source, for relatively large data sets. For small data sets the difference is much less, but I suspect only because the overhead of initializing the classes and establishing the connection is obscuring the real throughput performance. This is why, in the table, I have left the MB/s field blank for the 100,000 and 10,000 row tests.

Comparison of ADO.NET, OLE DB, and ODBC Destination Components

I will now test the ADO.NET, OLE DB, and ODBC destination components, but to keep things simple I will only be using the OLE DB source to do so.

ADO.NET Destination Data Flow Task

ADONET Destination

ADONET Destination Config

On the ADO.NET Destination component, I have checked the “Use Bulk Insert when possible” option. We obviously prefer bulk inserts to single-row inserts. This option was introduced with SQL Server 2008 R2; if you are using 2008 or earlier, the ADO.NET destination will use single-row inserts and so will be dreadfully slow. Apparently the “Use Bulk Insert when possible” option was added to enable the transfer of data to SQL Azure, given that connecting to SQL Azure using OLE DB is not supported by Microsoft (although it does work).

I ran the data flow task three times for 10,000,000 rows. The results were 2:55.220, 2:54.690, and 2:54.784.

The Profiler trace reveals that the component is performing a bulk insert, but is inserting in batches of 9,557 rows –

Profiler 2

SSIS buffers the data during a data flow. The data flow task has a couple of properties for controlling the buffer size – DefaultBufferMaxRows and DefaultBufferSize. DefaultBufferMaxRows applies an upper limit to the number of rows in the buffer; I set this to 10,000,000, as this is the number of rows that I am transferring. DefaultBufferSize is the size of the buffer in bytes; I set this to the maximum possible, which is 100 MB.

Task Props

In addition, I increased the command timeout property of the ADO.NET destination component from 30 to 300 seconds to avoid timeouts, and kept the BatchSize property at 0, which instructs SSIS to use the same size as the internal buffer.

ADONET Destination Props

The net effect is that the bulk inserts were now being performed in batches of 728,177 rows, constrained by the 100 MB DefaultBufferSize –

Profiler 3

But contrary to expectation, increasing the buffer and batch size actually had a detrimental effect on performance, with the task now taking over four minutes to complete. I tried a range of DefaultBufferMaxRows values, and found 20,000 to produce the best results, as follows – 2:47.483, 2:51.055, and 2:53.297.

However, this is only marginally faster than the starting point of 10,000. I would expect the optimal value to depend very much on context, so it is worth trying out different values, but do not expect it to greatly improve performance with the ADO.NET destination.

As mentioned above, the staging table has a PK constraint, and hence is stored as a B-tree rather than a heap. It also has a non-clustered index, which of course takes resources to maintain.

Dropping the non-clustered index results in the following – 2:11.950, 2:18.155, and 2:19.824 – an average improvement of around 33 seconds. But this must be offset slightly by the 12 seconds it takes to recreate the index once all records have been inserted.

Additionally dropping the PK constraint results in the following – 1:51.344, 1:51.853, and 1:59.183. But it took over two minutes to recreate the constraint, so on balance dropping the PK was detrimental.
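For reference, the index and constraint changes described above look something like this (a sketch; recreation timings will vary) –

-- before the load
DROP INDEX [IX_SalesOrderDetail_Staging] ON [dbo].[SalesOrderDetail_Staging];
-- optionally, for the second experiment, drop the PK as well
ALTER TABLE [dbo].[SalesOrderDetail_Staging] DROP CONSTRAINT [PK_SalesOrderDetail_Staging];

-- after the load
ALTER TABLE [dbo].[SalesOrderDetail_Staging]
    ADD CONSTRAINT [PK_SalesOrderDetail_Staging]
    PRIMARY KEY CLUSTERED ([SalesOrderID] ASC, [SalesOrderDetailID] ASC);
CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_Staging]
    ON [dbo].[SalesOrderDetail_Staging] ([ModifiedDate] ASC);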

OLE DB Destination Data Flow Component

OLEDB Destination

I configured the OLE DB Source component to restrict the columns to the required subset using a SQL command –

OLEDB Destination Config

And I configured the OLE DB Destination component to use the “Table or view – fast load” data access mode, and hence enable bulk rather than row-by-row inserts –

OLEDB Destination Config 2

The staging table had both the PK constraint and the non-clustered index enabled.

My initial results with 10,000,000 rows were as follows – 2:09.981, 2:05.238, and 2:00.589. This is not far off the best results achieved with the ADO.NET Destination, but we’ve only just started.

The destination database is configured to use the bulk-logged recovery model, which aims to minimally log bulk operations; a short sketch for checking and switching the recovery model follows the criteria list below. This should be significantly more performant than the full recovery model, assuming the bulk insert meets the criteria required for minimal logging. The criteria for the target table are as follows –

  • The target table is not being replicated.
  • Table locking is specified (using TABLOCK).
  • If the table has no indexes, data pages are minimally logged.
  • If the table does not have a clustered index but has one or more non-clustered indexes, data pages are always minimally logged. How index pages are logged, however, depends on whether the table is empty –
    • If the table is empty, index pages are minimally logged.
    • If table is non-empty, index pages are fully logged.
  • If the table has a clustered index and is empty, both data and index pages are minimally logged. In contrast, if a table has a clustered index and is non-empty, data pages and index pages are both fully logged regardless of the recovery model.
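As promised above, here is a minimal sketch for checking and, if needed, switching the destination database's recovery model (switch it back to your usual model once testing is finished) –

SELECT  name, recovery_model_desc
FROM    sys.databases
WHERE   name = 'RWDW_Staging';

ALTER DATABASE RWDW_Staging SET RECOVERY BULK_LOGGED;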

My destination table is not being replicated, and the Profiler trace reveals that the TABLOCK hint is being applied with the “insert bulk” statement –

insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

My destination table has a clustered index (the PK) and also a non-clustered index; however, I am truncating the table before executing the data flow task. But there is a caveat – the data is being inserted in batches, and the table is only empty for the first batch. As such, only the first batch is minimally logged.

As with the ADO.NET Destination test, I dropped the non-clustered index but kept the PK constraint. I also increased the DefaultBufferMaxRows and DefaultBufferSize settings –

Task Props 2

And then set the OLE DB Destination component’s “Maximum insert commit size” property to 0, which tells SSIS to commit the rows as a single batch at the end of the data-flow –

OLEDB Destination Config 3

If I were to run the data flow now, the performance would still be relatively slow, as the rows would be written to tempdb and sorted by the PK fields before being written to the destination table. To ensure that the rows are written directly to the destination table, I added an ORDER BY clause to the OLE DB Source SQL command –

OLEDB Destination Config 4
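The amended source command, matching the ORDER hint specified in FastLoadOptions below, is along these lines (a reconstruction) –

SELECT  [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
        [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
        [UnitPriceDiscount], [ModifiedDate]
FROM    [Sales].[SalesOrderDetailBIG]
ORDER BY [SalesOrderID], [SalesOrderDetailID];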

And finally, I added an ORDER hint to the OLE DB Destination component’s FastLoadOptions property, specifying the PK fields –

OLEDB Destination Config 5

My final results for the OLE DB Destination are as follows – 0:43.197, 0:43.493, and 0:43.493.

ODBC Destination Component

ADONET Destination

I configured the Data Flow task and OLE DB Source component as per the ADO.NET and OLE DB tests. I then configured the ODBC Destination component to use the “Table Name – Batch” data access mode as follows –

ODBC Destination Config

In the SQL Server 2014 documentation Microsoft describes the batch load option as follows –

Batch: In this mode the ODBC destination attempts to use the most efficient insertion method based on the perceived ODBC provider capabilities. For most modern ODBC providers, this would mean preparing an INSERT statement with parameters and then using a row-wise array parameter binding (where the array size is controlled by the BatchSize property). If you select Batch and the provider does not support this method, the ODBC destination automatically switches to the Row-by-row mode.

http://msdn.microsoft.com/en-us/library/hh758691.aspx

I executed the task and looked at the Profiler trace, and pretty soon realised that the data was being inserted row-by-row –

Profiler 6

I stopped the task as it was taking too long to complete.

I cannot see any obvious reason for the data being inserted row by row rather than in a batch. I need to investigate this further at a later date and will return and complete the tests when I have a resolution. For now, unfortunately, I am not including the ODBC destination component in the results.

SQL Server Destination Component

Another option for connecting to a destination SQL Server database is to use the SQL Server Destination component. I have excluded this from the test as it is limited to local databases only, so you would be restricting yourself to running the package on the destination server. This may fit with your current deployment plan, but it potentially limits extensibility and scalability. There would have to be good performance benefits for me to consider accepting this restriction, but most commentators say that the performance is very similar to the latest incarnations of the OLE DB Destination. To back this up, the description of the component in the BIDS toolbox includes the following note –

Note: to optimize performance, we recommend using the OLE DB destination instead.

Destination Data Flow Component Results

As per the source data flow component tests, I modified the SQL commands using a “TOP n” clause and ran the tests for 10,000,000, 1,000,000, 100,000, and 10,000 records for each destination. I again ran each test three times in order to account for chatter on the server.

Source  Destination  Records     Data (MB)  Elapsed Time (s)  MB/s
OLE DB  ADO.NET      10,000,000  639.727    136.543           4.685
OLE DB  ADO.NET      1,000,000   63.973     13.004            4.919
OLE DB  ADO.NET      100,000     6.397      1.347
OLE DB  ADO.NET      10,000      0.640      0.265
OLE DB  OLE DB       10,000,000  639.727    43.394            14.742
OLE DB  OLE DB       1,000,000   63.973     4.649             13.760
OLE DB  OLE DB       100,000     6.397      0.577
OLE DB  OLE DB       10,000      0.640      0.156

Why is the OLE DB Destination so much faster than the ADO.NET Destination?

Although not statistically significant, my results suggest that the OLE DB Destination is approximately three times faster than the ADO.NET Destination when transferring reasonably large data sets.

I can think of two reasons that may explain this –

  • The OLE DB Destination component gives us more control over the bulk load. In particular, it allows us to apply hints using the FastLoadOptions to ensure that the data is inserted directly into the destination table, rather than being inserted into the TempDB table prior to sorting in order to accommodate the PK constraint. It also allows us to commit the whole data set as a single batch.
  • The OLE DB destination inserts rows using the following statement –
insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

Compare this to the ADO.NET destination statement –

insert bulk "dbo"."SalesOrderDetail_Staging" ([SalesOrderID] Int, [SalesOrderDetailID] Int, [CarrierTrackingNumber] NVarChar(25) COLLATE Latin1_General_CI_AS, [OrderQty] SmallInt, [ProductID] Int, [SpecialOfferID] Int, [UnitPrice] Money, [UnitPriceDiscount] Money, [ModifiedDate] DateTime)

There are no table hints – in particular, there is no TABLOCK hint. TABLOCK is necessary for minimal logging, so even if we meet all the other criteria, the ADO.NET destination does not request it.

In Conclusion

  • I found the OLE DB Source to be approximately 2.5x faster than the ADO.NET Source, and approximately 3.8x faster than the ODBC Source, when reading a large number of rows from a SQL Server database hosted on the same server as SSIS.
  • When combined with an OLE DB Source, I found the OLE DB Destination to be 3x faster than the ADO.NET Destination when transferring a large number of rows between SQL Server databases hosted on the same server.
  • Although I selected the “Table Name – Batch” data access mode, the ODBC Destination performed row-by-row inserts, which is not acceptable for a large data set. I need to investigate this further to determine a resolution.
  • My source and destination databases were hosted on the same server as SSIS, so my testing protocol does not account for network latencies or constraints. Arguably it is flawed because of this. I will endeavour to repeat the tests on a more realistic infrastructure when one becomes available to me.
  • My results aren’t statistically significant – three observations per population – and my testing protocol is flawed. But I think they add to the anecdotal evidence which suggests that OLE DB should be the first port of call when building SSIS data flows.

Caveat Emptor

OLE DB would appear to be the most performant option for SSIS data flow SQL Server source and destination components.

However, in August 2011 Microsoft announced that they would be deprecating the SQL Server OLE DB provider, with SQL Server 2012 being the last release to support it. True to their word, the provider has not been updated for SQL Server 2014, which is why I have used the 2012 provider “Native OLE DB\SQL Server Native Client 11.0” for my OLE DB connections.

As discussed in the article below, Microsoft are aligning their strategy with ODBC, and they recommend using ODBC for new projects whilst planning for the upgrade of existing projects. I will not be doing either until it is clear that the performance of ODBC has been improved to match that of OLE DB. I’m sure Microsoft are working on it.