Unit of Work, Repository, Entity Framework, and Persistence Ignorance

Let me lay my cards on the table. I’m not an expert on design patterns. But I’ve read some books, some blog posts, and followed many a debate online. I’ve developed a good few applications and have generally gained a good appreciation for the principles, what we are trying to achieve, and what works in practice.

I’m going to present my go-to implementation when starting a new solution. I will discuss the basics of the patterns and principles. Just enough to answer the Why questions.

I am not presenting anything new or unique. These are tried and tested patterns which are industry best practice. If the code is similar, or even the same, as code published by other sources then I apologize for not referencing those sources. I have used this implementation over a number of years and so would not be able to tell you where the ideas came from, or which snippets of code have been copied or created from scratch.

My weapons of choice at the moment are ASP.NET MVC, C#, Entity Framework (Code First), Unity, and Moq. However, the data persistence layer and service layer that I will present are mostly technology agnostic, so it should be easy enough to implement the classes using other technologies.

I will be presenting this information in a series of posts –

1) Unit of Work, Repository, Entity Framework, and Persistence Ignorance

2) Service Layer, MVC Controllers, and Dependency Injection

3) Unit Testing using Moq

Unit Of Work

According to Fowler, a Unit of Work –

“Maintains a list of objects affected by a business transaction and coordinates the writing out of changes and the resolution of concurrency problems.

When you’re pulling data in and out of a database, it’s important to keep track of what you’ve changed; otherwise, that data won’t be written back into the database. Similarly you have to insert new objects you create and remove any objects you delete.”

The Entity Framework’s DbContext class is essentially an implementation of Unit of Work. It manages the connection to the database. It populates entity objects from the database as and when needed. It tracks changes to the entity objects. And it persists these changes to the database, managing transactions and concurrency when doing so.

So most of the work is done for us.

The recommended way of working with DbContext is to derive a class from DbContext and expose the required entities as DbSet properties. So for my BikeStore application, I could create the following class –

using BikeStore.Model;
using System.Data.Entity;

namespace BikeStore.Data
{
    public class BikeStoreContext : DbContext
    {
        public virtual DbSet<Basket> Baskets { get; set; }
        public virtual DbSet<Product> Products { get; set; }
        public virtual DbSet<Customer> Customers { get; set; }
    }
}

I could then create an instance of this class, and then query, add, modify, and delete entities using this instance.

using(var context = new BikeStoreContext())
{
    Product product = context.Products.Where(p => p.ProductId == 1).First();
    product.Name = "CAAD 10";
    context.SaveChanges();
}

If I’m developing using ASP.NET MVC I may choose to include this code in a controller action method, or maybe refactor it a little and include it in a helper class or even in a business logic or service layer class.
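
For illustration, a naive controller action consuming the context directly might look something like the sketch below. This is only a sketch – the ProductController class, its Rename action, and the BikeStore.Web.Controllers namespace are hypothetical names, not part of the sample solution –

using System.Linq;
using System.Web.Mvc;
using BikeStore.Data;
using BikeStore.Model;

namespace BikeStore.Web.Controllers
{
    public class ProductController : Controller
    {
        // Naive approach: the controller creates and consumes the DbContext-derived
        // class directly, tying it to Entity Framework.
        public ActionResult Rename(int id, string newName)
        {
            using (var context = new BikeStoreContext())
            {
                Product product = context.Products.First(p => p.ProductId == id);
                product.Name = newName;
                context.SaveChanges();
            }
            return RedirectToAction("Index");
        }
    }
}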

The problem with the above is that the DbContext and DbSet classes are specific to the System.Data.Entity library. We would need to reference this library in the controller, and as such the controller would now be aware of and dependent on Entity Framework. If we had to switch to another data persistence technology, such as NHibernate, then we would need to re-write and re-test all controller methods that are now dependent on Entity Framework. Admittedly, I have never yet had to do this. But never say never.

A bigger concern is that by creating a dependency between the controllers and Entity Framework, we are making our lives very difficult when it comes to unit testing. We will find it very difficult to isolate the controllers from the database.

So how do we decouple the controllers from the persistence layer, and hence make the controllers persistence ignorant?

IUnitOfWork

The first thing to do is create an IUnitOfWork interface. It’s a very simple interface, with a single method – Save().

using System;

namespace Framework.Data
{
    public interface IUnitOfWork
    {
        void Save();
    }
}

Later on we will create a concrete Entity Framework implementation of this interface, wrapping up the DbContext. DbContext will essentially be hidden from the controllers, and as such the dependency will be removed.

IRepository

But we still need to remove the dependency on DbSet whilst allowing the querying, adding, modifying, and deleting of entities.

We will use the repository pattern to achieve this. According to Fowler, the Repository pattern –

“Mediates between the domain and data mapping layers using a collection-like interface for accessing domain objects.”

Rather than DBSet being the interface for accessing the entities, we will use IEnumerable.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

namespace Framework.Data
{
    public interface IRepository<TEntity> where TEntity : class
    {
        IEnumerable<TEntity> Get(
            Expression<Func<TEntity, bool>> filter = null,
            Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null,
            string includeProperties = "");

        IEnumerable<TEntity> GetAll(
            Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null,
            string includeProperties = "");

        IEnumerable<TEntity> GetTopN(
            int? N,
            Expression<Func<TEntity, bool>> filter = null,
            Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null,
            string includeProperties = "");

        void Add(TEntity entity);

        void Alter(TEntity entity);

        void Remove(TEntity entity);
    }
}

The first method in the interface, Get, enables the retrieval of a collection of entities. It has 3 optional parameters, which are illustrated in the short sketch after this list –

  • filter – which accepts a lambda expression defining which set of entities to retrieve
  • orderBy – which defines how the set should be ordered
  • includeProperties – which accepts a comma-delimited string of related property names to retrieve with each entity.
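
To make these parameters concrete, here is a hedged usage sketch. The Price and Category properties on Product, and the productRepository variable, are assumptions for illustration only – any IRepository<Product> implementation would do –

// filter: only products costing more than 1000
// orderBy: sort the results by name
// includeProperties: eager-load the related Category entities
IEnumerable<Product> expensiveBikes = productRepository.Get(
    filter: p => p.Price > 1000m,
    orderBy: q => q.OrderBy(p => p.Name),
    includeProperties: "Category");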

As its name suggests, the second method, GetAll, enables the retrieval of all entities. Concrete implementations of this method may simply call the Get method. It’s not necessary to include this method in your implementation. It’s there only for convenience.

GetTopN accepts the same parameters as Get, and additionally has a parameter which defines the number of entities to return.

Add, Alter, and Remove all accept an entity, and as their names suggest enable the modification of entities.

EFRepository

We have defined a generic IRepository interface, and could now use this for creating a concrete Entity Framework implementation of Repository for each entity in our system. But we can simplify things by creating a generic Entity Framework implementation. So a single generic class rather than one class per entity.

The generic Entity Framework implementation shown below implements a constructor that takes an instance of DbContext, and it uses this instance to access the required DbSet.

We will create an instance of EFRepository for each entity in our model, and pass the same DbContext into each. We can then save changes to different entity types in a single transaction.
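
As a quick sketch of that idea (the implementation of EFRepository itself follows below; the Name properties used in the object initializers are illustrative assumptions) –

using (var context = new BikeStoreContext())
{
    var productRepository = new EFRepository<Product>(context);
    var customerRepository = new EFRepository<Customer>(context);

    // Both repositories share the same context...
    productRepository.Add(new Product { Name = "SuperSix Evo" });
    customerRepository.Add(new Customer { Name = "A. Rider" });

    // ...so a single SaveChanges call persists both inserts in one transaction
    context.SaveChanges();
}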

using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;
using System.Linq.Expressions;

namespace Framework.Data
{
    public class EFRepository<TEntity> : IRepository<TEntity> where TEntity : class
    {
        private DbContext _db;
        private DbSet<TEntity> _dbSet;

        public EFRepository(DbContext context)
        {
            this._db = context;
            this._dbSet = context.Set<TEntity>();
        }

        public virtual IEnumerable<TEntity> Get(
            Expression<Func<TEntity, bool>> filter = null,
            Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null,
            string includeProperties = "")
        {
            IQueryable<TEntity> query = _dbSet;

            if (filter != null)
                query = query.Where(filter);

            // Eager-load each related property named in the comma-delimited string
            if (includeProperties != null)
                foreach (var includeProperty in includeProperties.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
                    query = query.Include(includeProperty);

            if (orderBy != null)
                query = orderBy(query);

            return query;
        }

        public virtual IEnumerable<TEntity> GetAll(
            Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null,
            string includeProperties = "")
        {
            return this.Get(null, orderBy, includeProperties);
        }

        public virtual IEnumerable<TEntity> GetTopN(
            int? N,
            Expression<Func<TEntity, bool>> filter = null,
            Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null,
            string includeProperties = "")
        {
            if (N == null)
                return this.Get(filter, orderBy, includeProperties);
            else
                return this.Get(filter, orderBy, includeProperties).Take((int)N);
        }

        public virtual void Add(TEntity entity)
        {
            _dbSet.Add(entity);
        }

        public virtual void Remove(TEntity entityToDelete)
        {
            // Detached entities must be attached to the context before they can be removed
            if (_db.Entry(entityToDelete).State == EntityState.Detached)
                _dbSet.Attach(entityToDelete);
            _dbSet.Remove(entityToDelete);
        }

        public virtual void Alter(TEntity entityToUpdate)
        {
            // Attach the entity and mark it as modified so that SaveChanges issues an UPDATE
            _dbSet.Attach(entityToUpdate);
            _db.Entry(entityToUpdate).State = EntityState.Modified;
        }
    }
}

IBikeStoreUnitOfWork

If you look at the namespaces for the classes shown so far, you’ll see that I have included them all in the Framework.Data namespace. This is because there is nothing in them that is application specific.

I started this article by presenting the BikeStoreContext class, and pointing out that by consuming it in our controllers we are tying the controllers to Entity Framework.

I will now replicate the BikeStoreContext class but in a way that hides the implementation of the persistence technology.

Firstly, let’s create an IBikeStoreUnitOfWork interface, which exposes an IRepository property for each entity in our BikeStore model (I have only shown a few of the entities to keep the examples short) –

using BikeStore.Model;
using Framework.Data;

namespace BikeStore.Data
{
    public interface IBikeStoreUnitOfWork : IUnitOfWork
    {
        IRepository<Product> ProductRepository { get; }
        IRepository<Basket> BasketRepository { get; }
        IRepository<Customer> CustomerRepository { get; }
    }
}

BikeStoreUnitOfWork

We now want to create our concrete Entity Framework implementation of this interface.

using BikeStore.Model;
using Framework.Data;
using System;
using System.Data.Entity.Core.Objects;

namespace BikeStore.Data
{
    public class BikeStoreUnitOfWork : IBikeStoreUnitOfWork, IDisposable
    {
        private BikeStoreContext _db = new BikeStoreContext();

        private EFRepository<Product> _productRepository;
        private EFRepository<Basket> _basketRepository;
        private EFRepository<Customer> _customerRepository;

        public IRepository<Product> ProductRepository
        {
            get
            {
                if (_productRepository == null)
                    _productRepository = new EFRepository<Product>(_db);
                return _productRepository;
            }
        }

        public IRepository<Basket> BasketRepository
        {
            get
            {
                if (_basketRepository == null)
                    _basketRepository = new EFRepository<Basket>(_db);
                return _basketRepository;
            }
        }

        public IRepository<Customer> CustomerRepository
        {
            get
            {
                if (_customerRepository == null)
                    _customerRepository = new EFRepository<Customer>(_db);
                return _customerRepository;
            }
        }

        public void Save()
        {
            _db.SaveChanges();
        }

        private bool _disposed = false;

        protected virtual void Dispose(bool disposing)
        {
            if (!this._disposed)
            {
                if (disposing)
                {
                    _db.Dispose();
                }
            }
            this._disposed = true;
        }

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }
    }
}

BikeStoreUnitOfWork contains a single instance of BikeStoreContext, and passes it into the repositories when they are utilized. By sharing this instance across repositories we can include changes to multiple entity types within a single transaction.

BikeStoreUnitOfWork implements the IDisposable interface, and ensures that the context is correctly disposed of.

We can now replace the code shown at the start of the article with the following –

using (var unitOfWork = new BikeStoreUnitOfWork())
{
    Product product = unitOfWork.ProductRepository.Get(p => p.ProductId == 1).First();
    product.Name = "CAAD 10";
    unitOfWork.Save();
}

This code can be included in a controller without the need for the controller to reference the System.Data.Entity assembly. The controller is persistence ignorant. It is no longer aware that Entity Framework is being used in the persistence layer. And we could change the implementation of BikeStoreUnitOfWork to use NHibernate or any other technology, without the need to change any code or references in the controller.
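
Contrast this with the naive controller sketched earlier. As a hedged preview of where this is heading, a controller could depend only on the IBikeStoreUnitOfWork interface – here I have injected the unit of work straight into the controller for brevity, whereas the next post moves this behind a service layer (again, ProductController and Rename are hypothetical names) –

public class ProductController : Controller
{
    private readonly IBikeStoreUnitOfWork _unitOfWork;

    // A concrete implementation (e.g. BikeStoreUnitOfWork) is supplied from outside;
    // the controller itself references only the persistence-ignorant interfaces.
    public ProductController(IBikeStoreUnitOfWork unitOfWork)
    {
        _unitOfWork = unitOfWork;
    }

    public ActionResult Rename(int id, string newName)
    {
        Product product = _unitOfWork.ProductRepository.Get(p => p.ProductId == id).First();
        product.Name = newName;
        _unitOfWork.Save();
        return RedirectToAction("Index");
    }
}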

In the next post in this series I will present the Service Layer pattern, and demonstrate how we can use this to separate business logic from request processing/presentation logic. And how we can use dependency injection to inject concrete implementations of unit of work into the service classes, and concrete implementations of the service classes into the controllers. Hence fully decoupling the controllers from the service layer, and the service layer from the persistence layer. In the third and final post I will talk about how this helps us to isolate the controllers and service layers during unit testing.

SSIS Data Flows – ADO.NET vs. OLE DB vs. ODBC

There has been much debate in the developer community on which provider to choose when connecting to a SQL Server database.

Performance is often cited as a discriminator, with ADO.NET being less performant than ODBC and OLE DB, as it is a managed facade. But the performance differences are often glossed over because most applications deal only with CRUD operations, for which they are negligible.

However, as BI developers we typically work with large datasets. We therefore cannot gloss over the performance differences, particularly when implementing SSIS data flow tasks.

A quick search online reveals very little in the way of empirical data on which is faster. Maybe I’ve been searching on the wrong terms, but the only comprehensive test data I’ve come up with is detailed by  Gilbert Quevauvilliers in the following article –

https://gqbi.wordpress.com/2013/06/19/ssis-fastest-data-flow-task-item-for-transferring-data-over-the-network/

He compares various combinations of ADO.NET and OLE DB source and destination components for a relatively large data set – 10,000,000 rows. He found a combination of OLE DB source and destination components to be over 8x faster than ADO.NET for his testing protocol.

Given the lack of evidence, I decided to carry out some tests myself.

Create Source and Destination Tables

I used AdventureWorks2014 as my source database.

I created a [Sales].[SalesOrderDetailBIG] table with the same schema as [Sales].[SalesOrderDetail]. And then populated it with 10,000,000 records, totalling just under 700MB. This is the maximum number of records that I will be testing with. It should be more than enough records to cover most incremental loading scenarios.

USE AdventureWorks2014
GO
CREATE TABLE [Sales].[SalesOrderDetailBIG](
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] numeric(38,6) NOT NULL,
        [rowguid] [uniqueidentifier] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
) ON [PRIMARY]

My destination is a SQL Server 2014 database on the same server – my underpowered 8GB i5 quad-core development laptop – so there will be no network latency between the source and destination databases. This is not representative of most scenarios, but I’ve only got access to one server at the moment so needs must.

The destination table is as follows –

USE RWDW_Staging 
GO
CREATE TABLE [dbo].[SalesOrderDetail_Staging](
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NOT NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_SalesOrderDetail_Staging] PRIMARY KEY CLUSTERED (
        [SalesOrderID] ASC,
        [SalesOrderDetailID] ASC 
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_Staging] ON [dbo].[SalesOrderDetail_Staging]
(
        [ModifiedDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO​

Please note that the destination table does not include the [LineTotal] and [rowguid] columns. And it has a composite PK on [SalesOrderID] and [SalesOrderDetailID] and a non-clustered index on [ModifiedDate]. More about these features later.

Create Connection Managers

I created 6 connection managers, one each of ADO.NET, OLE DB, and ODBC for the source and destination databases –

[Screenshot: Connection Managers]

For ADO.NET and OLE DB I used the “.Net Providers\SQLClient Data Provider” and “Native OLE DB\SQL Server Native Client 11.0” data providers.

For ODBC I configured System DSNs for the source and destination, using the “ODBC Driver 11 for SQL Server” driver. According to the SQL Server 2014 developer guide, this is the successor to the “SQL Server Native Client 11.0” driver, and should be used in its place –

http://msdn.microsoft.com/en-us/library/cc280510.aspx

I kept the default settings for all the connection managers.

Comparison of ADO.NET, OLE DB, and ODBC Source Components

I wanted to initially compare the source data flow components in isolation from the destination components, so I added 3 data flow tasks to my package, and configured them as follows –

ADO.NET Source Data Flow Task

[Screenshot: ADONET Source]

By adding a RowCount transformation, I do not need to add a destination. Hence I have isolated the source from the destination. The RowCount transformation adds very little overhead and should not skew the results.

[Screenshot: ADONET Source Config]

I have selected the LocalHost AdventureWorks2014 ADO.NET connection manager, and used a SQL command to select only the columns that I need. Retrieving only the required columns, rather than all of them, can make a significant difference to performance, but it is something that is easily overlooked.

OLE DB Source Data Flow Task

The OLE DB Source test data flow follows the same pattern –

[Screenshot: OLEDB Source]

The OLE DB Source component allows you to choose the subset of columns to utilise – in this case I have deselected [LineTotal] and [rowguid] –

[Screenshot: OLEDB Source Config]

So you would think that, rather than having to write a SQL command for the source, you could simply combine the “Table or View” data access mode with selecting only the columns that you need.

But all is not as it seems. A SQL Server Profiler trace reveals the following –

[Screenshot: Profiler]

SSIS is still emitting a “select *” statement and so is retrieving all columns.

The answer, as per the ADO.NET source, is to use a SQL Command to select only the columns that are needed. I have done this for the test.

ODBC Source Data Flow Task

[Screenshot: ODBC Source]

The ODBC Source component has a BatchSize property. The Profiler trace reveals the following –

[Screenshot: Profiler 5]

The sp_cursorprepexec system stored procedure is executed to prepare a cursor and populate it with the full data set. The sp_cursorfetch stored procedure is then repeatedly executed to fetch the data 1000 rows at a time. The final parameter is the batch size specified in the BatchSize property.

I tried a range of BatchSize values, and found negligible difference in performance. I did however hit a problem when attempting to retrieve the data in a single batch by setting the BatchSize to 10,000,000. Performance was significantly degraded, as the SSIS internal buffer size was smaller than the batch size, and as such there was a big overhead incurred in allocating more memory to the buffer –

Information: 0x4004800C at ODBC Source, SSIS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 0 buffers were considered and 0 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

Information: 0x4004800F at ODBC Source: Buffer manager allocated 2 megabyte(s) in 1 physical buffer(s).

Information: 0x40048010 at ODBC Source: Component "ODBC Source" (2) owns 2 megabyte(s) physical buffer.

Source Data Flow Component Results

I wanted to see if the number of records affected the results. I therefore modified the SQL commands using a “TOP n” clause, and ran the tests for 10,000,000, 1,000,000, 100,000, and 10,000 records respectively for each data source. I also ran each test 3 times over in order to account for chatter on the server. The only other processes running during the testing were background OS and SQL Server processes, but sometimes these can skew the results.

Results are as follows –

Source     Records       Data (MB)   Elapsed Time (s)   MB/s
ADO.NET    10,000,000    639.727     37.737             16.952
ADO.NET     1,000,000     63.973      4.025             15.894
ADO.NET       100,000      6.397      0.483
ADO.NET        10,000      0.640      0.109
OLE DB     10,000,000    639.727     15.693             40.765
OLE DB      1,000,000     63.973      1.716             37.280
OLE DB        100,000      6.397      0.297
OLE DB         10,000      0.640      0.078
ODBC       10,000,000    639.727     73.461              8.708
ODBC        1,000,000     63.973      6.614              9.672
ODBC          100,000      6.397      0.687
ODBC           10,000      0.640      0.234

Gilbert found some benefit from optimizing the source connection manager by changing the packet size property to the maximum possible value of 32,767. And also by adding the following SQL hints – “WITH (NOLOCK) OPTION (FAST 10000)”. Neither of these was applicable to my testing protocol, as both my databases are on the same server and so the data is not being sent over a network. Also, I am not competing with other processes for access to the data, and do not have to wait for other processes to release locks on the source table.

But as with all SQL hints, you are overriding the optimizer, and so there may or may not be an improvement. You should definitely consider all of the above suggestions when attempting to optimize your processes. Gilbert, to be fair, does say it “sometimes” improves performance.

There’s a good article by Rob Farley on why you should consider using the FAST 10000 hint –

http://sqlblog.com/blogs/rob_farley/archive/2011/02/17/the-ssis-tuning-tip-that-everyone-misses.aspx

Under my testing protocol, the OLE DB source in isolation is approximately 2.5x quicker than the ADO.NET source, and a whopping 3.8x quicker than the ODBC source, for relatively large data sets. For small data sets the difference is much less, but I suspect only because the overhead of initializing the classes and establishing the connection obscures the real throughput. This is why, in the table, I have left the MB/s field blank for the 100,000 and 10,000 row tests.

Comparison of ADO.NET, OLE DB, and ODBC Destination Components

I will now test the ADO.NET, OLE DB, and ODBC destination components. But to keep things simple I will only be using the OLE DB source to do this.

ADO.NET Destination Data Flow Task

[Screenshot: ADONET Destination]

[Screenshot: ADONET Destination Config]

On the ADO.NET Destination component, I have checked the  “Use Bulk Insert when possible” option. We obviously prefer bulk inserts rather than single row inserts. This option was introduced with 2008 R2. If you are using 2008 or earlier then the ADO.NET destination will use single row inserts and so will be dreadfully slow. Apparently the “Use Bulk Insert when possible” option was added to enable the transfer of data to SQL Azure, given that connecting to SQL Azure using OLE DB is not supported by Microsoft (although it does work).

I ran the data flow task 3 times for 10,000,000 rows. The results were – 2:55.220; 2:54.690; and 2:54.784.

Looking at the Profiler trace reveals that the component is performing a bulk insert, but is inserting in batches of 9557 rows –

[Screenshot: Profiler 2]

SSIS buffers the data during a data flow. The data flow task has a couple of properties for controlling the buffer size – DefaultBufferMaxRows and DefaultBufferSize. DefaultBufferMaxRows applies an upper limit to the number of rows in the buffer. I set this to 10,000,000 as this is the number of rows that I am transferring. DefaultBufferSize is the default size in bytes of the buffer. I set this to the maximum possible, which is 100 MB.

[Screenshot: Task Props]

In addition, I increased the command timeout property of the ADO.NET destination component from 30 to 300 seconds to avoid time outs. And I kept the BatchSize property equal to 0, which instructs SSIS to use the same size as the internal buffer.

[Screenshot: ADONET Destination Props]

The net effect is that the bulk inserts were now being performed in batches of 728,177 rows; and this is constrained by the 100 MB DefaultBufferSize –

[Screenshot: Profiler 3]

But contrary to expectation, increasing the buffer and batch size actually had a detrimental effect on performance, with the task now taking over 4 minutes to complete. I tried a range of DefaultBufferMaxRows values, and found 20,000 to produce the best results, as follows – 2:47.483; 2:51.055; and 2:53.297.

However, this is only marginally faster than the starting point of 10,000. I would expect the optimal value to very much depend on context, so it is worth trying out different values. But do not expect it to greatly improve performance with the ADO.NET destination.

As mentioned above, the staging table has a PK constraint, and hence is a B-tree rather than a heap. And it has a non-clustered index which will of course utilize resources to maintain.

Deleting the non-clustered index results in the following – 2:11.950; 2:18.155; and 2:19.824. So an average improvement of 33 seconds. But this must be offset slightly by the 12 seconds that it takes to recreate the index once all records have been inserted.

Additionally dropping the PK constraint results in the following – 1:51.344; 1:51.853; and 1:59.183. But it took over 2 minutes to recreate the constraint, so on balance dropping the PK was detrimental.

OLE DB Destination Data Flow Component

[Screenshot: OLEDB Destination]

I configured the OLE DB Source component to restrict the columns to the required subset using a SQL command –

[Screenshot: OLEDB Destination Config]

And I configured the OLE DB Destination component to use the “Table or view – fast load” data access mode, and hence enable bulk rather than row-by-row inserts –

[Screenshot: OLEDB Destination Config 2]

The staging table had both the PK constraint and the non-clustered index enabled.

My initial results with 10,000,000 rows were as follows – 2:09.981; 2:05.238; and 2:00.589. This is not far off the best results achieved with the ADO.NET Destination. But we’ve only just started.

The destination database is configured to use the Bulk-logged recovery model, which aims to minimally log bulk operations. This should be significantly more performant than the Full recovery model, assuming the bulk insert meets the criteria required to be minimally logged. The criteria for the target table are as follows –

  • The target table is not being replicated.
  • Table locking is specified (using TABLOCK).
  • If the table has no indexes, data pages are minimally logged.
  • If the table does not have a clustered index but has one or more non-clustered indexes, data pages are always minimally logged. How index pages are logged, however, depends on whether the table is empty –
    • If the table is empty, index pages are minimally logged.
    • If the table is non-empty, index pages are fully logged.
  • If the table has a clustered index and is empty, both data and index pages are minimally logged. In contrast, if a table has a clustered index and is non-empty, data pages and index pages are both fully logged regardless of the recovery model.

My destination table is not being replicated, and the Profiler trace reveals that the TABLOCK hint is being applied with the “insert bulk” statement –

insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

My destination table has a clustered index (PK) and also a non-clustered index. However, I am truncating the table before executing the data flow task. But, there is a caveat – the data is being inserted in batches. The table is only empty for the 1st batch. As such only the 1st batch is being minimally logged.

As with the ADO.NET Destination test, I dropped the non-clustered index, but I have kept the PK constraint. I also increased the DefaultBufferMaxRows and DefaultBufferSize settings –

[Screenshot: Task Props 2]

And then set the OLE DB Destination component’s “Maximum insert commit size” property to 0, which tells SSIS to commit the rows as a single batch at the end of the data-flow –

[Screenshot: OLEDB Destination Config 3]

If I was to run the data flow now, the performance would still be relatively slow, as the rows would be written to TempDB and sorted by the PK fields before being written to the destination table. To ensure that the rows are written directly to the destination table, I added an ORDER BY clause to the OLE DB Source SQL command –

[Screenshot: OLEDB Destination Config 4]

And finally, I added an ORDER hint to the OLE DB Destination component’s FastLoadOptions property, specifying the PK fields –

[Screenshot: OLEDB Destination Config 5]

My final results for the OLE DB Destination are as follows – 0:43.197; 0:43.493; and 0:43.493

ODBC Destination Component

[Screenshot: ODBC Destination data flow]

I configured the Data Flow task and OLE DB Source component as per the ADO.NET and OLE DB tests. I then configured the ODBC Destination component to use the “Table Name – Batch” data access mode as follows –

[Screenshot: ODBC Destination Config]

In the SQL Server 2014 documentation Microsoft describes the batch load option as follows –

Batch: In this mode the ODBC destination attempts to use the most efficient insertion method based on the perceived ODBC provider capabilities. For most modern ODBC providers, this would mean preparing an INSERT statement with parameters and then using a row-wise array parameter binding (where the array size is controlled by the BatchSize property). If you select Batch and the provider does not support this method, the ODBC destination automatically switches to the Row-by-row mode.​

http://msdn.microsoft.com/en-us/library/hh758691.aspx

I executed the task and looked at the Profiler trace, and pretty soon realised that the data was being inserted row-by-row –

[Screenshot: Profiler 6]

I stopped the task as it was taking too long to complete.

I cannot see any obvious reason for the data being inserted row-by-row rather than in a batch. I need to investigate this further at a later date and will return and complete the tests when I have a resolution. Unfortunately, for now I am not including the ODBC destination component in the results.

SQL Server Destination Component

Another option for connecting to a destination SQL Server database is to use the SQL Server Destination component. I have excluded this from the test, as it is limited to local databases only, and so you are restricting yourself to only being able to run the package on the destination server. This may fit with your current deployment plan, but it potentially limits extensibility and scalability. There would have to be good performance benefits for me to consider accepting this restriction, but most commentators say that the performance is very similar to the latest incarnations of the OLE DB Destination. To back this up, the description of the component in the BIDS toolbox includes the following note –

Note: to optimize performance, we recommend using the OLE DB destination instead.

Destination Data Flow Component Results

As per the Source data flow component tests, I modified the SQL commands using a “TOP n” clause, and ran the tests for 10,000,000, 1,000,000, 100,000, and 10,000 records respectively for each destination. And I ran each test 3 times over in order to account for chatter on the server.

Source    Destination   Records       Data (MB)   Elapsed Time (s)   MB/s
OLE DB    ADO.NET       10,000,000    639.727     136.543             4.685
OLE DB    ADO.NET        1,000,000     63.973      13.004             4.919
OLE DB    ADO.NET          100,000      6.397       1.347
OLE DB    ADO.NET           10,000      0.640       0.265
OLE DB    OLE DB        10,000,000    639.727      43.394            14.742
OLE DB    OLE DB         1,000,000     63.973       4.649            13.760
OLE DB    OLE DB           100,000      6.397       0.577
OLE DB    OLE DB            10,000      0.640       0.156

Why is the OLE DB Destination so much faster than the ADO.NET Destination?

Although not statistically significant, my results suggest that the OLE DB Destination is approximately 3 times faster than the ADO.NET Destination when transferring reasonably large data sets.

I can think of 2 reasons that may explain this –

  • The OLE DB Destination component gives us more control over the bulk load. In particular, it allows us to apply hints using the FastLoadOptions to ensure that the data is inserted directly into the destination table, rather than being inserted into the TempDB table prior to sorting in order to accommodate the PK constraint. It also allows us to commit the whole data set as a single batch.
  • The OLE DB destination inserts rows using the following statement –
insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

Compare this to the ADO.NET destination statement –

insert bulk "dbo"."SalesOrderDetail_Staging" ([SalesOrderID] Int, [SalesOrderDetailID] Int, [CarrierTrackingNumber] NVarChar(25) COLLATE Latin1_General_CI_AS, [OrderQty] SmallInt, [ProductID] Int, [SpecialOfferID] Int, [UnitPrice] Money, [UnitPriceDiscount] Money, [ModifiedDate] DateTime)

There are no table hints. And in particular, there is no TABLOCK hint. This is necessary for minimal logging, so even if we meet all the other criteria for minimal logging, the ADO.NET destination does not request it.

In Conclusion

  • I found the OLE DB Source to be approximately 2.5x faster than the ADO.NET Source, and approximately 3.8x faster than the ODBC Source, when reading a large number of rows from a SQL Server database hosted on the same server as SSIS.
  • When combined with an OLE DB Source, I found the OLE DB Destination to be 3x faster than the ADO.NET Destination when transferring a large number of rows between SQL Server databases hosted on the same server.
  • Although I selected the “Table Name – Batch” data access mode, the ODBC Destination performed row-by-row inserts, which is not acceptable for a large data set. I need to investigate this further to determine a resolution.
  • My source and destination databases were hosted on the same server as SSIS. As such my testing protocol does not account for network latencies or constraints. Arguably my testing protocol is flawed because of this. I will endeavour to repeat the tests on a more realistic infrastructure when one becomes available to me.
  • My results aren’t statistically significant – 3 observations per population – and my testing protocol is flawed. But I think my results add to the anecdotal evidence which suggests that OLE DB should be the first port of call when building SSIS data flows.

Caveat Emptor

OLE DB would appear to be the most performant option for SSIS data flow SQL Server source and destination components.

However, in August 2011 Microsoft announced that they would be deprecating the SQL Server OLE DB provider, with SQL Server 2012 being the last release to support it. True to their word, the provider has not been updated for SQL Server 2014, which is why I have used the 2012 provider “Native OLE DB\SQL Server Native Client 11.0” for my OLE DB connections.

As discussed in the article below, they are aligning their strategy with ODBC, and they recommend using ODBC for new projects, whilst planning for the upgrade of existing projects. I will not be doing either until it is clear that the performance of ODBC has been improved to match that of OLE DB. I’m sure Microsoft are working on it.