In my previous blog post I talked about BIML, and how it might revolutionise my approach to creating ETL processes. It's pretty cool and very powerful, but there's a bit of a learning curve, so I decided to look for a different way to achieve the same thing, one that requires less upskill-time, and preferably less development time too.
So, the ideal solution will:
- be quick to build initially, and easy to maintain in the long run.
- allow for parallel data loads to make the best use of the available resources.
- allow for ad-hoc changes to the load or schema without having to open, make changes to, and re-deploy the SSIS package.
I briefly tested several other methods (most of which involved generating large amounts of dynamic SQL and executing that against your source and/or destination). I instead decided to try out an SSIS add-on package called “Data Flow Task Plus”, which I’d never heard of before.
What is it?
A company called CozyRoc has developed a set of new components, and extensions to existing components within SSIS, making them a whole lot more powerful than what you get out of the box. This is nothing new; you can develop your own components relatively easily if you so choose (even I've dabbled with this many moons ago, trying to read CSV files with annoying formatting "features").
Data Flow Plus lets you configure dynamic data flows. You can control various options via package or project parameters, which means less time spent opening packages to edit them when your source or destination schema changes. Basically this means you can create “schema-less” ETL packages which will just transfer data from a source table to a destination table, even if you add or remove (or change) columns! Too good to be true, right?
As they say, the proof is in the pudding, so here’s some pudding… figuratively speaking. Nothing like some green ticks in SSIS to make your afternoon!
That’s the end result of my proof-of-concept, but don’t worry, I’ll step you through it.
First things first: you'll need to go to the CozyRoc website and download the package, either the 32 or 64-bit version depending on your requirements.
Once that’s done and you open Visual Studio, you’ll notice a bunch of new components in your SSIS Toolbox. The only one I’m covering here though is the new Data Flow Task Plus (highlighted), although I may cover more in future as there are a couple that sound interesting (like parallel foreach loops!).
So my plan is to have table metadata stored in a table on the destination (Azure Data Warehouse) database, which is queried by the package and stored in package variables. I’ll then iterate over the list of tables, do my ETL (depending on what kind of load I’m doing), and finally load the data from the source system. Sounds simple enough (… and it is), so let’s get started.
And yeees I know this isn’t really much of an “ETL” process… but “ELT” doesn’t roll off the tongue as easily. :-p
Here’s a SQL script to set up for this proof-of-concept if you want to follow along. It creates 2 databases (a source and a destination), as well as a table to store metadata about the tables I want loaded from one to the other.
```sql
CREATE DATABASE DWSource;
GO
CREATE DATABASE DWDestination;
GO
USE DWDestination;
GO

-- DROP TABLE dbo.LoadConfiguration
CREATE TABLE dbo.LoadConfiguration (
    LoadStream       TINYINT       NOT NULL,
    TableName        NVARCHAR(100) NOT NULL,
    SqlCreateStmt    NVARCHAR(MAX) NOT NULL,
    IndexColumnName  NVARCHAR(100) NOT NULL,
    LoadType         NVARCHAR(20)  NOT NULL,
    ColumnListToLoad NVARCHAR(MAX) NOT NULL
);

-- These are very simplified versions of a few tables in our (Timely's) database.
-- You'll need to create them in the source database if you want to test this yourself.
INSERT LoadConfiguration VALUES (1, 'Booking', REPLACE('CREATE TABLE [dbo].[Booking](
    [BookingId] [int] NOT NULL,
    [CustomerId] [int] NOT NULL,
    [StartDate] [datetime] NOT NULL,
    [EndDate] [datetime] NOT NULL,
    [Price] [money] NULL,
    [BusinessId] [int] NOT NULL
)', 'NOT NULL', 'NULL'), 'BookingId', 'Full', 'BookingId, CustomerId, StartDate, EndDate, Price, BusinessId');

INSERT LoadConfiguration VALUES (1, 'Business', REPLACE('CREATE TABLE [dbo].[Business](
    [BusinessId] [int] NOT NULL,
    [Name] [nvarchar](100) NOT NULL,
    [DateCreated] [datetime] NOT NULL,
    [Description] [nvarchar](max) NULL
)', 'NOT NULL', 'NULL'), 'BusinessId', 'Full', 'BusinessId, Name, DateCreated');

INSERT LoadConfiguration VALUES (1, 'Customer', REPLACE('CREATE TABLE [dbo].[Customer](
    [CustomerId] [int] NOT NULL,
    [BusinessId] [int] NOT NULL,
    [FirstName] [nvarchar](50) NULL,
    [LastName] [nvarchar](50) NULL,
    [DateCreated] [datetime] NOT NULL
)', 'NOT NULL', 'NULL'), 'CustomerId', 'Full', 'CustomerId, BusinessId, FirstName, LastName, DateCreated');
```
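And if you do want to test this yourself, here's the matching script for the source side. These are just the CREATE TABLE statements from the config rows above (with the NOT NULLs intact this time), run against DWSource; nothing here that isn't already in the metadata.

```sql
USE DWSource;
GO

CREATE TABLE [dbo].[Booking](
    [BookingId] [int] NOT NULL,
    [CustomerId] [int] NOT NULL,
    [StartDate] [datetime] NOT NULL,
    [EndDate] [datetime] NOT NULL,
    [Price] [money] NULL,
    [BusinessId] [int] NOT NULL
);

CREATE TABLE [dbo].[Business](
    [BusinessId] [int] NOT NULL,
    [Name] [nvarchar](100) NOT NULL,
    [DateCreated] [datetime] NOT NULL,
    [Description] [nvarchar](max) NULL
);

CREATE TABLE [dbo].[Customer](
    [CustomerId] [int] NOT NULL,
    [BusinessId] [int] NOT NULL,
    [FirstName] [nvarchar](50) NULL,
    [LastName] [nvarchar](50) NULL,
    [DateCreated] [datetime] NOT NULL
);
```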
With this proof-of-concept I want to test that I can create tables, prepare them, and then load only the columns that I want loaded.
Variables & Expressions
A small but important part of creating a package like this is making sure you get your variable expressions right – i.e. make the various SQL statements and values you use as dynamic as possible. As an example here are my variables for this little package. Note the expression column and how values are stitched together when it comes to building SQL commands used by the various components.
From top-to-bottom, we’ve got:
- ColumnListToLoad – this is the list of columns from the source table that I want loaded into the destination table.
- IndexColumnName – the name of the "ID" column that I can use to tell where to load from if doing an incremental load. In the real world I'll probably make the package handle either ID or DateTime columns, because with some tables it will make more sense to load based on a load-date.
- IndexColumnValue – if doing an incremental load, then this variable will be populated with the max value of the IndexColumnName column already loaded into the data warehouse.
- LoadSettings – the System.Object variable which will hold the full result set of the initial SQL query, and feed it into the ForEach loop container. Nom nom nom…
- LoadType – whether we’re doing a Full or Incremental load. Could cater for other load types here too.
- SQL_DeleteStatement – a SQL delete statement based on an expression. If doing an incremental load then this will delete any data that may exist after the current max IndexColumnValue, which should help prevent duplicates.
- SQL_DropStatement – a SQL table drop statement. Probably didn’t need to be a fully dynamic expression, but for some reeeeaally important or large tables, you may want to disable accidental drops by putting something harmless in this variable for those specific tables.
- SQL_LoadStatement – a SQL select statement which will pull the data from the source table. This select statement will make use of the ColumnListToLoad variable, as well as the SQL_WhereClause variable if performing an incremental load.
- SQL_MaxIdValueStatement – SQL statement to get the max Id value and populate the IndexColumnValue variable.
- SQL_WhereClause – snippet of SQL depending on whether we’re performing an incremental load, and the value of the IndexColumnValue variable.
- SqlCreateStatement – The SQL create table statement for the destination table. In this example it’s just an exact copy of the source table. I tend to pull production data across into tables matching the source schema, even if my “ColumnListToLoad” variable means that I’m only loading a subset of columns. This means that if I need to add columns to the load later, I don’t need to change the create scripts.
- TableName – the name of the source (and in this case, destination) table.
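To make that concrete, here's roughly what those expressions evaluate to for the Booking table when doing an incremental load. To be clear, these are my assumptions about how the variables stitch together (and the high-water mark of 1000 is made up), not output copied from the actual package:

```sql
-- Assume the max BookingId already in the warehouse is 1000, so IndexColumnValue = 1000
-- and SQL_WhereClause evaluates to ' WHERE BookingId > 1000'.

-- SQL_MaxIdValueStatement (runs against the destination, populates IndexColumnValue):
SELECT MAX(BookingId) FROM dbo.Booking;

-- SQL_DeleteStatement (clears out anything past the high-water mark to prevent duplicates):
DELETE FROM dbo.Booking WHERE BookingId > 1000;

-- SQL_LoadStatement (ColumnListToLoad + TableName + SQL_WhereClause, runs against the source):
SELECT BookingId, CustomerId, StartDate, EndDate, Price, BusinessId
FROM dbo.Booking
WHERE BookingId > 1000;
```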
Here are the steps in my package (and a chance for you to admire my l33t Windows Snipping tool handwriting skillz!). Note that I'm not going to go into a whole lot of detail here, because the purpose of this post isn't to cover all things SSIS. Instead I'll link to other sites which explain each step or series of steps more clearly.
1. Select from the [LoadConfiguration] table, and stick the result-set into an object variable.
2. Use a ForEach container to loop through each ‘row’ in the above object variable, assigning the individual values to variables scoped to the container.
3. There are separate sequence containers for Full and Incremental loads. Their disabled states are set via an Expression which is based on the value from the [LoadType] column grabbed from the [LoadConfiguration] table above. So, if we’re doing a full load, the Incremental load container will be disabled, and vice versa. Another (possibly better) way of doing this would be to use precedence constraints with expressions to control the path of execution.
4. As above, but for the ‘Incremental’ [LoadType] value…
5. Load data using the new Data Flow Task Plus component. The best way to figure out how to do this is to watch the (rather dry) video from CozyRoc on this page. But basically it involves setting up the component just like you would the normal data flow task, but then removing all columns from the outputs and inputs (using the advanced editor), and leaving only a single "placeholder/dummy" column. This placeholder column is brilliantly named "THUNK_COLUMN".
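For reference, the query behind step 1 is just a straight select over the config table. Mine looks something like this (ordering by LoadStream so the parallel streams group together is my own choice, not something the component requires):

```sql
-- Feeds the full result set into the LoadSettings (System.Object) variable,
-- which the ForEach container then chews through row by row.
SELECT LoadStream, TableName, SqlCreateStmt, IndexColumnName, LoadType, ColumnListToLoad
FROM dbo.LoadConfiguration
ORDER BY LoadStream, TableName;
```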
Here’s another good blog post on a more complex setup using this component and Sharepoint.
Dunno… haven’t finished implementing the real thing yet. But the proof of concept is working well, and it went together pretty quickly, so I’m positive this will work, I think…
I’ll update this post with my thoughts once I’ve got it all working. As usual please let me know if I’ve made any glaring mistakes, or if you’ve got some awesome ideas on how to improve this process further.