Tag: datawarehouse

  • SSIS Data Flow Plus!

    In my previous blog post I talked about BIML, and how it might revolutionise my approach to creating ETL processes.  It’s pretty cool, and very powerful, but there is a bit of a learning curve, so I decided to look for a different way to achieve the same thing, one that required less upskilling time and preferably less development time too.

    So, the ideal solution will:

    • be quick to build initially, and easy to maintain in the long run.
    • allow for parallel data loads to make the best use of the available resources.
    • allow for ad-hoc changes to the load or schema without having to open, make changes to, and re-deploy the SSIS package.

    I briefly tested several other methods (most of which involved generating large amounts of dynamic SQL and executing that against your source and/or destination). I instead decided to try out an SSIS add-on package called “Data Flow Task Plus”, which I’d never heard of before.

    What is it?

    A company called CozyRoc has developed a set of new components, and extensions to existing components within SSIS, making them a whole lot more powerful than what you get out of the box. This is nothing new; you can develop your own components relatively easily if you so choose (even I’ve dabbled with this many moons ago, trying to read CSV files with annoying formatting “features”).

    Data Flow Plus lets you configure dynamic data flows. You can control various options via package or project parameters, which means less time spent opening packages to edit them when your source or destination schema changes. Basically this means you can create “schema-less” ETL packages which will just transfer data from a source table to a destination table, even if you add or remove (or change) columns!  Too good to be true, right?

    The Pudding

    As they say, the proof is in the pudding, so here’s some pudding… figuratively speaking. Nothing like some green ticks in SSIS to make your afternoon!

    That’s the end result of my proof-of-concept, but don’t worry, I’ll step you through it.

    First things first: you’ll need to go to the CozyRoc website and download the package, either 32-bit or 64-bit depending on your requirements.

    Once that’s done and you open Visual Studio, you’ll notice a bunch of new components in your SSIS Toolbox. The only one I’m covering here though is the new Data Flow Task Plus (highlighted), although I may cover more in future as there are a couple that sound interesting (like parallel foreach loops!).

    New Plan

    So my plan is to have table metadata stored in a table on the destination (Azure Data Warehouse) database, which is queried by the package and stored in package variables. I’ll then iterate over the list of tables, do my ETL (depending on what kind of load I’m doing), and finally load the data from the source system. Sounds simple enough (… and it is), so let’s get started.

    And yeees I know this isn’t really much of an “ETL” process… but “ELT” doesn’t roll off the tongue as easily. :-p 

    Here’s a SQL script to set up for this proof-of-concept if you want to follow along. It creates 2 databases (a source and a destination), as well as a table to store metadata about the tables I want loaded from one to the other.

    CREATE DATABASE DWSource; 
    GO 
    CREATE DATABASE DWDestination; 
    GO
    USE DWDestination;
    -- DROP TABLE LoadConfiguration 
    CREATE TABLE dbo.LoadConfiguration ( 
        LoadStream TINYINT NOT NULL, 
        TableName NVARCHAR(100) NOT NULL, 
        SqlCreateStmt NVARCHAR(MAX) NOT NULL, 
        IndexColumnName NVARCHAR(100) NOT NULL, 
        LoadType NVARCHAR(20) NOT NULL, 
        ColumnListToLoad NVARCHAR(MAX) NOT NULL 
        )
    -- These are very simplified versions of a few tables in our (Timely’s) database. You'll need to create them in the source database if you want to test this yourself.
    INSERT LoadConfiguration VALUES (1, 'Booking', REPLACE('CREATE TABLE [dbo].[Booking]( 
        [BookingId] [int] NOT NULL, 
        [CustomerId] [int] NOT NULL, 
        [StartDate] [datetime] NOT NULL, 
        [EndDate] [datetime] NOT NULL, 
        [Price] [money] NULL, 
        [BusinessId] [int] NOT NULL 
    )','NOT NULL','NULL'), 'BookingId', 'Full', 'BookingId, CustomerId, StartDate, EndDate, Price, BusinessId')
    INSERT LoadConfiguration VALUES (1, 'Business', REPLACE('CREATE TABLE [dbo].[Business]( 
        [BusinessId] [int] NOT NULL, 
        [Name] [nvarchar](100) NOT NULL, 
        [DateCreated] [datetime] NOT NULL, 
        [Description] [nvarchar](max) NULL 
    )','NOT NULL','NULL'), 'BusinessId', 'Full', 'BusinessId, Name, DateCreated')
    INSERT LoadConfiguration VALUES (1, 'Customer', REPLACE('CREATE TABLE [dbo].[Customer]( 
        [CustomerId] [int] NOT NULL, 
        [BusinessId] [int] NOT NULL, 
        [FirstName] [nvarchar](50) NULL, 
        [LastName] [nvarchar](50) NULL, 
        [DateCreated] [datetime] NOT NULL 
    )','NOT NULL','NULL'), 'CustomerId', 'Full', 'CustomerId, BusinessId, FirstName, LastName, DateCreated')

    With this proof-of-concept I want to test that I can create tables, prepare them, and then load only the columns that I want loaded.

    Variables & Expressions

    A small but important part of creating a package like this is making sure you get your variable expressions right – i.e. make the various SQL statements and values you use as dynamic as possible.  As an example here are my variables for this little package. Note the expression column and how values are stitched together when it comes to building SQL commands used by the various components.

    From top-to-bottom, we’ve got:

    • ColumnListToLoad – this is the list of columns from the source table that I want loaded into the destination table.
    • IndexColumnName – the name of the “ID” column that I can use to tell where to load from if doing an incremental load. In the real world I’ll probably make the package handle either IDs or DateTime columns, because with some tables it will make more sense to load based on a load-date.
    • IndexColumnValue – if doing an incremental load, then this variable will be populated with the max value of the IndexColumnName column already loaded into the data warehouse.
    • LoadSettings – the System.Object variable which will hold the full result set of the initial SQL query, and feed it into the ForEach loop container. Nom nom nom…
    • LoadType – whether we’re doing a Full or Incremental load. Could cater for other load types here too.
    • SQL_DeleteStatement – a SQL delete statement based on an expression. If doing an incremental load then this will delete any data that may exist after the current max IndexColumnValue, which should help prevent duplicates.
    • SQL_DropStatement – a SQL table drop statement. Probably didn’t need to be a fully dynamic expression, but for some reeeeaally important or large tables, you may want to disable accidental drops by putting something harmless in this variable for those specific tables.
    • SQL_LoadStatement – a SQL select statement which will pull the data from the source table. This select statement will make use of the ColumnListToLoad variable, as well as the SQL_WhereClause variable if performing an incremental load.
    • SQL_MaxIdValueStatement – SQL statement to get the max Id value and populate the IndexColumnValue variable.
    • SQL_WhereClause – snippet of SQL depending on whether we’re performing an incremental load, and the value of the IndexColumnValue variable.
    • SqlCreateStatement – The SQL create table statement for the destination table. In this example it’s just an exact copy of the source table. I tend to pull production data across into tables matching the source schema, even if my “ColumnListToLoad” variable means that I’m only loading a subset of columns. This means that if I need to add columns to the load later, I don’t need to change the create scripts.
    • TableName – the name of the source (and in this case, destination) table.
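
    The actual SSIS expressions live in the package’s variables window, so the following is only a rough Python sketch of how the variables above could be stitched together. The class name, function name, `[dbo]` schema prefix and the exact shape of each statement are my assumptions for illustration, not the expressions used in the package.

```python
from dataclasses import dataclass


@dataclass
class LoadConfig:
    """One row of dbo.LoadConfiguration (only the columns used below)."""
    table_name: str
    index_column_name: str
    load_type: str  # 'Full' or 'Incremental'
    column_list_to_load: str


def build_statements(cfg: LoadConfig, index_column_value: int = 0) -> dict:
    """Build the per-table SQL strings, roughly as the SSIS expressions would."""
    table = f"[dbo].[{cfg.table_name}]"  # assumed schema prefix
    incremental = cfg.load_type == "Incremental"
    # SQL_WhereClause is empty for a full load; otherwise it filters past the max loaded value.
    where = f" WHERE {cfg.index_column_name} > {index_column_value}" if incremental else ""
    return {
        "SQL_WhereClause": where,
        "SQL_LoadStatement": f"SELECT {cfg.column_list_to_load} FROM {table}{where}",
        "SQL_MaxIdValueStatement": f"SELECT ISNULL(MAX({cfg.index_column_name}), 0) FROM {table}",
        "SQL_DeleteStatement": f"DELETE FROM {table} WHERE {cfg.index_column_name} > {index_column_value}",
        "SQL_DropStatement": f"DROP TABLE {table}",
    }


if __name__ == "__main__":
    booking = LoadConfig("Booking", "BookingId", "Incremental", "BookingId, CustomerId, Price")
    for name, sql in build_statements(booking, 1000).items():
        print(f"{name}: {sql}")
```

    The point is simply that every statement is derived from a handful of per-table values, so adding a table to the load means adding a row to [LoadConfiguration] rather than editing the package.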

    The Package

    Here are the steps in my package (and a chance for you to admire my l33t Windows Snipping tool handwriting skillz!). Note that I’m not going to go into a whole lot of detail here, because the purpose of this post isn’t to cover all things SSIS. Instead I’ll link to other sites which explain each step or series of steps more clearly.

    1. Select from the [LoadConfiguration] table, and stick the result-set into an object variable.

    2. Use a ForEach container to loop through each ‘row’ in the above object variable, assigning the individual values to variables scoped to the container.

    3. There are separate sequence containers for Full and Incremental loads. Their disabled states are set via an Expression which is based on the value from the [LoadType] column grabbed from the [LoadConfiguration] table above. So, if we’re doing a full load, the Incremental load container will be disabled, and vice versa. Another (possibly better) way of doing this would be to use precedence constraints with expressions to control the path of execution.

    4. As above, but for the ‘Incremental’ [LoadType] value…

    5. Load data using the new Data Flow Task Plus component. The best way to figure out how to do this is to watch the (rather dry) video from CozyRoc on this page. Basically, it involves setting up the component just like you would the normal data flow task, then removing all columns from the outputs and inputs (using the advanced editor), leaving only a single “placeholder/dummy” column. This placeholder column is brilliantly named “THUNK_COLUMN”.

    Here’s another good blog post on a more complex setup using this component and SharePoint.
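
    To make the control flow in steps 1 to 5 a little more concrete, here is a small Python sketch that plans (but doesn’t execute) the work for each configuration row. The step names inside each branch are my guess at what the Full and Incremental containers do, based on the variables listed earlier; `plan_steps` and `CONFIG_QUERY` are made-up names.

```python
# Step 1: the query whose result set ends up in the LoadSettings object variable.
CONFIG_QUERY = (
    "SELECT LoadStream, TableName, SqlCreateStmt, IndexColumnName, LoadType, ColumnListToLoad "
    "FROM dbo.LoadConfiguration"
)


def plan_steps(rows):
    """Return (table, ordered steps) for each configuration row."""
    plan = []
    for row in rows:  # step 2: the ForEach loop over the result set
        if row["LoadType"] == "Full":
            # step 3 (assumed contents): rebuild the destination table from scratch
            steps = ["drop table", "create table", "load data"]
        elif row["LoadType"] == "Incremental":
            # step 4 (assumed contents): find the high-water mark, clear anything past it
            steps = ["get max index value", "delete rows past max", "load data"]
        else:
            raise ValueError(f"Unknown LoadType: {row['LoadType']!r}")
        plan.append((row["TableName"], steps))  # step 5 is the final 'load data'
    return plan


if __name__ == "__main__":
    rows = [
        {"TableName": "Booking", "LoadType": "Incremental"},
        {"TableName": "Business", "LoadType": "Full"},
    ]
    for table, steps in plan_steps(rows):
        print(table, "->", ", ".join(steps))
```

    In the package itself the branching is done by disabling one of the two sequence containers (or, alternatively, with precedence constraints), not with an if/else, but the effect per table is the same.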

    Conclusion

    Dunno… haven’t finished implementing the real thing yet. But the proof of concept is working well, and it went together pretty quickly, so I’m positive this will work, I think…

    I’ll update this post with my thoughts once I’ve got it all working. As usual please let me know if I’ve made any glaring mistakes, or if you’ve got some awesome ideas on how to improve this process further.

    Cheers,
    Dave

  • BIML, where have you been all my life?

    I’ve used the BIDS Helper Visual Studio add-on for years now, and I’ve seen and heard of BIML, but it’s one of those things I’ve never needed to look into any further than that.  Until I discovered that it’s something that would’ve saved me hours of tedious SSIS work!

    What is it?

    BIML (Business Intelligence Markup Language), or more specifically, BIMLScript, is sort of a mashup of XML and C# code nuggets, allowing you to create SSIS packages and SSAS objects.  This is very much the condensed “DBDave” version – check out the official site for a much more eloquent explanation of what it is.

    Basic Example

    When you open up your SSIS project in Visual Studio, if you’ve got BIDS Helper installed, then when you right-click on the project you have the option of adding a BIML file:

    Screenshot of Visual Studio Solution Explorer showing the option to add a new BIML file under the 'BIML Test' project.

    It’ll create a new file under “Miscellaneous” in your project. Go ahead and open it up and you’ll see something like this:

    Screenshot of a Visual Studio interface showing a BIML script file titled 'BimScript.biml' open on the left and the Solution Explorer on the right, highlighting the BIML Test project structure.

    You can “execute” a BIMLScript by right-clicking on it, and selecting “Generate SSIS Packages”:

    A screenshot showing the context menu options in Visual Studio with the 'Generate SSIS Packages' option highlighted, under the 'BimlScript' folder within the 'Miscellaneous' section.

    Now we can jump in the deep end and paste the following into this new BIML script:

    <Biml xmlns="http://schemas.varigence.com/biml.xsd"> 
        <Connections> 
            <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
            <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
        </Connections> 
        <Projects> 
            <PackageProject Name="BIMLTest"> 
                <Parameters> 
                    <Parameter Name="DateFrom" DataType="DateTime">2016-01-01</Parameter> 
                    <Parameter Name="DestinationDatabase" DataType="String">tempdb</Parameter> 
                    <Parameter Name="DestinationServer" DataType="String">localhost</Parameter> 
                    <Parameter Name="DoTruncate" DataType="Boolean">false</Parameter> 
                    <Parameter Name="SourceDatabase" DataType="String">tempdb</Parameter> 
                    <Parameter Name="SourceServer" DataType="String">localhost</Parameter> 
                </Parameters> 
                <Packages> 
                    <Package PackageName="BIMLTestPackage" /> 
                </Packages> 
            </PackageProject> 
        </Projects> 
        <Packages> 
            <Package Name="BIMLTestPackage" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive"> 
                <Connections> 
                    <Connection ConnectionName="SourceConn"> 
                        <Expressions> 
                            <Expression ExternalProperty="InitialCatalog">@[$Project::SourceDatabase]</Expression> 
                            <Expression ExternalProperty="ServerName">@[$Project::SourceServer]</Expression> 
                        </Expressions> 
                    </Connection> 
                    <Connection ConnectionName="DestinationConn"> 
                        <Expressions> 
                            <Expression ExternalProperty="InitialCatalog">@[$Project::DestinationDatabase]</Expression> 
                            <Expression ExternalProperty="ServerName">@[$Project::DestinationServer]</Expression> 
                        </Expressions> 
                    </Connection> 
                </Connections> 
                <Tasks> 
                    <Container Name="Truncate Destination Table" ConstraintMode="Parallel"> 
                        <Expressions> 
                            <Expression ExternalProperty="Disable">!(@[$Project::DoTruncate])</Expression> 
                        </Expressions> 
                        <Tasks> 
                            <ExecuteSQL Name="Truncate Table" ConnectionName="DestinationConn"> 
                                <DirectInput> 
                                    TRUNCATE TABLE dbo.DWDestinationTableExample; 
                                </DirectInput> 
                            </ExecuteSQL> 
                        </Tasks> 
                    </Container> 
                    <Container Name="Load Table" ConstraintMode="Linear"> 
                        <Tasks> 
                            <Dataflow Name="Load dbo.DWDestinationTableExample"> 
                                <Transformations> 
                                    <OleDbSource Name="Source" ConnectionName="SourceConn"> 
                                        <DirectInput> 
                                            SELECT * FROM dbo.DWSourceTableExample WHERE KeyDate >= ?; 
                                        </DirectInput> 
                                        <Parameters> 
                                            <Parameter Name="0" VariableName="BIMLTest.DateFrom" /> 
                                        </Parameters> 
                                    </OleDbSource> 
                                    <OleDbDestination Name="Destination" ConnectionName="DestinationConn" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="100000"> 
                                        <ExternalTableOutput Table="dbo.DWDestinationTableExample"> 
                                        </ExternalTableOutput> 
                                    </OleDbDestination> 
                                </Transformations> 
                            </Dataflow> 
                        </Tasks> 
                    </Container> 
                </Tasks> 
            </Package> 
        </Packages> 
    </Biml>

    What the… ?!?

    Yeah, okay, let’s step through this to figure out what it does.  I’ll show you what each bit of code results in too, which might help make it more tangible/understandable:

    <Connections> 
        <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
        <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
    </Connections>

    First we set up the connections that will exist within the package. These are just connections to tempdb on my local SQL instance for testing. This bit results in this:

    Screenshot of Connection Managers in Visual Studio displaying DestinationConn and SourceConn.

    Next up, we specify the project and some project parameters that we’re going to use within the package:

    <Projects> 
        <PackageProject Name="BIMLTest"> 
            <Parameters> 
                <Parameter Name="DateFrom" DataType="DateTime">2016-01-01</Parameter> 
                <Parameter Name="DestinationDatabase" DataType="String">tempdb</Parameter> 
                <Parameter Name="DestinationServer" DataType="String">localhost</Parameter> 
                <Parameter Name="DoTruncate" DataType="Boolean">false</Parameter> 
                <Parameter Name="SourceDatabase" DataType="String">tempdb</Parameter> 
                <Parameter Name="SourceServer" DataType="String">localhost</Parameter> 
            </Parameters> 
            <Packages> 
                <Package PackageName="BIMLTestPackage" /> 
            </Packages> 
        </PackageProject> 
    </Projects>

    There are some gotchas regarding project parameters in BIML when using BIDS Helper to check and run your BIMLScript, so keep that in mind.  As per this example, you need to specify the project parameter definitions in here, even if they already exist within your project.

    So because of these issues, I found it simpler just to make sure the parameters already exist, like this:

    Screenshot of the project parameters window in SQL Server Data Tools, displaying parameters for a BIMLTestPackage including Name, Data Type, and Value columns.

    Now we create the package itself, and substitute in some of the project parameters, which in this case we’re using to replace parts of the connection strings for our source and destination connections.

    <Packages>
        <Package Name="BIMLTestPackage" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
            <Connections>
                <Connection ConnectionName="SourceConn">
                    <Expressions>
                        <Expression ExternalProperty="InitialCatalog">@[$Project::SourceDatabase]</Expression>
                        <Expression ExternalProperty="ServerName">@[$Project::SourceServer]</Expression>
                    </Expressions>
                </Connection>
                <Connection ConnectionName="DestinationConn">
                    <Expressions>
                        <Expression ExternalProperty="InitialCatalog">@[$Project::DestinationDatabase]</Expression>
                        <Expression ExternalProperty="ServerName">@[$Project::DestinationServer]</Expression>
                    </Expressions>
                </Connection>
            </Connections>

    This is the same as this part in the user interface:

    A screenshot of a database query results table showing columns for row ID, word, and operator. The table includes various SQL-related terms and their corresponding operators.

    Finally we add the meat to this SSIS sandwich; the components that perform the actual transformation and/or loading of data.

            <Tasks>
                <Container Name="Truncate Destination Table" ConstraintMode="Parallel">
                    <Expressions>
                        <Expression ExternalProperty="Disable">!(@[$Project::DoTruncate])</Expression>
                    </Expressions>
                    <Tasks>
                        <ExecuteSQL Name="Truncate Table" ConnectionName="DestinationConn">
                            <DirectInput>
                                TRUNCATE TABLE dbo.DWDestinationTableExample;
                            </DirectInput>
                        </ExecuteSQL>
                    </Tasks>
                </Container>
                <Container Name="Load Table" ConstraintMode="Linear">
                    <Tasks>
                        <Dataflow Name="Load dbo.DWDestinationTableExample">
                            <Transformations>
                                <OleDbSource Name="Source" ConnectionName="SourceConn">
                                    <DirectInput>
                                        SELECT * FROM dbo.DWSourceTableExample WHERE KeyDate >= ?;
                                    </DirectInput>
                                    <Parameters>
                                        <Parameter Name="0" VariableName="BIMLTest.DateFrom" />
                                    </Parameters>
                                </OleDbSource>
                                <OleDbDestination Name="Destination" ConnectionName="DestinationConn" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="100000">
                                    <ExternalTableOutput Table="dbo.DWDestinationTableExample">
                                    </ExternalTableOutput>
                                </OleDbDestination>
                            </Transformations>
                        </Dataflow>
                    </Tasks>
                </Container>
            </Tasks>
        </Package>
    </Packages>
    </Biml>

    We’ve got an “Execute SQL” component running a truncate of the destination table first.  However, we only want this to run if we’ve set our project parameter “DoTruncate” to true.

    Screenshot of the SSIS package 'BIMLTestPackage' in Visual Studio, showing the 'Truncate Destination Table' task and its properties on the right.

    And lastly a Data Flow task to move data.  This is done using a SQL query with a parameter for a “KeyDate” column, as an illustration of what you might do in a real-life situation.

    Load Table transformation in SSIS package for dbo_DWDestinationTableExample
    Screenshot showing the Set Query Parameters dialog in SQL Server Integration Services (SSIS) with parameter mapping for a Data Flow task.

    Cool! Now what??

    So that’s BIML in a very small nutshell.  Even if that’s all you’re doing with it (i.e. creating pretty basic packages) I think it’s worth doing since it makes source control of your packages SOOOOOO much nicer!

    Imagine getting a pull request from a developer who’s made some SSIS changes, and simply being able to diff the BIML scripts to see exactly what they’ve changed!? :-)

    But wait, there’s more…

    In the scenario that led me to discover BIML, I wanted to create a “dynamic” SSIS package that was driven by metadata stored in a database.  In other words, I could maintain a table with a list of table names that I wanted “ETL’d” from my production system to my data-warehouse, and my magic SSIS package would pick up changes, new tables added, etc without me needing to open and edit one monstrous package.

    This is where the power of BIMLScript and its C# nuggets really shines. It lets you drop in complicated logic in C# code to control and mould the output of the BIML.  So you could look up a list of tables to load, then iterate over that list, creating packages per table.  Check out this post for a lot more detail (and examples) on how to achieve this.
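
    Real BIMLScript does this with C# nuggets inside the .biml file, which I haven’t shown here. Purely to illustrate the “one package per table” idea, here’s a plain Python stand-in that emits the same kind of BIML elements used in the example above. The `build_biml` function and the `Load_<table>` package naming are made up for this sketch, and the output still needs the `SourceConn`/`DestinationConn` connections defined before it would generate anything.

```python
import xml.etree.ElementTree as ET

BIML_NS = "http://schemas.varigence.com/biml.xsd"


def build_biml(tables):
    """Return a BIML document (as text) containing one simple load package per table."""
    biml = ET.Element("Biml", {"xmlns": BIML_NS})
    packages = ET.SubElement(biml, "Packages")
    for table in tables:
        # One package per table, each holding a single source-to-destination data flow.
        package = ET.SubElement(packages, "Package", {"Name": f"Load_{table}", "ConstraintMode": "Linear"})
        tasks = ET.SubElement(package, "Tasks")
        dataflow = ET.SubElement(tasks, "Dataflow", {"Name": f"Load dbo.{table}"})
        transformations = ET.SubElement(dataflow, "Transformations")
        source = ET.SubElement(transformations, "OleDbSource", {"Name": "Source", "ConnectionName": "SourceConn"})
        ET.SubElement(source, "DirectInput").text = f"SELECT * FROM dbo.{table};"
        destination = ET.SubElement(
            transformations, "OleDbDestination", {"Name": "Destination", "ConnectionName": "DestinationConn"}
        )
        ET.SubElement(destination, "ExternalTableOutput", {"Table": f"dbo.{table}"})
    return ET.tostring(biml, encoding="unicode")


if __name__ == "__main__":
    # In practice the list would come from a metadata table rather than being hard-coded.
    print(build_biml(["Booking", "Business", "Customer"]))
```

    The appeal is the same either way: the list of tables lives in data, and the repetitive package XML is generated rather than hand-edited.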

    That’s it for now. There are lots of more detailed examples around if you look for them (Google is your friend), and I just wanted to highlight the possibilities which I didn’t realise were there before. Hopefully you find it as useful as I did.

    Cheers,
    Dave

  • Azure SQL to Azure Data Warehouse ETL

    I’ve recently needed to move data from our transactional database (an Azure SQL database), into an Azure SQL Data Warehouse. A definite case of “harder than it needed to be”…

    What’s an Azure Data Warehouse?

    I’ll assume if you’ve read this far, you know what a SQL database is. But an Azure Data Warehouse is a slightly different beast; it’s another hosted/managed service offered on Microsoft’s Azure cloud infrastructure. They market it as a distributed & elastic database that can support petabytes of data, while offering enterprise-class features.

    Diagram illustrating the architecture of SQL Data Warehouse, including connections to various data sources such as SQL databases, Azure Tables, and Azure Data Lake.

    That essentially means that behind the scenes this “database” is actually a bunch of orchestrated nodes working together (a control node, multiple compute nodes, storage, etc).  Queries against this distributed database are themselves split up and run in parallel across these nodes – i.e. “MPP”, or Massively Parallel Processing. That’s very much it in a nutshell – for a lot more detail though, read this as well.
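
    If the “split up and run in parallel” bit sounds abstract, here’s a toy Python simulation of the idea: rows are spread across a few distributions, each distribution computes a partial result, and the partials are then combined. It runs sequentially on one machine and is in no way how SQL Data Warehouse is actually implemented; the function names, the modulo-based placement and the default of 4 distributions are all assumptions for illustration.

```python
from collections import defaultdict


def distribute(rows, key, distribution_count):
    """Assign each row to a distribution based on its key (modulo stands in for a hash)."""
    distributions = defaultdict(list)
    for row in rows:
        distributions[row[key] % distribution_count].append(row)
    return distributions


def distributed_sum(rows, key, value, distribution_count=4):
    """Sum `value` per distribution, then combine the partial results."""
    distributions = distribute(rows, key, distribution_count)
    # Each partial sum is the piece of work a separate compute node could do on its own.
    partials = {d: sum(r[value] for r in part) for d, part in distributions.items()}
    # The control node then only has to add up one number per distribution.
    return sum(partials.values()), partials


if __name__ == "__main__":
    bookings = [
        {"BusinessId": 1, "Price": 10},
        {"BusinessId": 2, "Price": 20},
        {"BusinessId": 5, "Price": 5},
    ]
    total, partials = distributed_sum(bookings, "BusinessId", "Price")
    print(total, partials)
```

    The takeaway is that the final answer matches a plain sum over all rows, but most of the work can happen independently on each slice of the data.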

    Why use this over other solutions?

    I originally set up an old-school SSAS instance on an Azure VM, backed by a normal SQL Server data warehouse hosted on the same VM. Not very exciting, but it worked.  The struggle was that to get data from our production database (an Azure SQL Database) into this VM required either SSIS packages pulling data across the wire, or a restore of the prod database locally (i.e. onto the VM) and then extracting the data from that using cross-database queries.

    Then I read up on these relatively new Azure Data Warehouses, and I assumed that *surely* there would be a much simpler/better way of moving data directly from one to the other natively, within the “cloud”.

    “Cloud-to-cloud” ETL FTW!

    Tweet from David Curlewis asking if there is a cloud-native way to ETL data from an Azure SQL Database to Azure Data Warehouse, tagged with #azure #azuredw #thisshouldbeeasierthanitis.

    I asked the question, and the consensus seemed to be that Data Factory is the cool new way to move your data around the cloud.  So I gave that a crack. Be warned, you’ll need to brush up on JSON (since you’ll need to be comfy writing/modifying JSON to set up the data sources, control the pipelines, etc).

    All the examples I found seemed to involve Blob-to-SQL, or SQL-to-Blob data loads.  So I figured out how the bits and pieces work together, how to customise the JSON to set up the correct sources, pipelines, etc, and then kicked it off.  It didn’t work… <sadface>

    The issues I ran into were definitely solvable (data type conversion issues mostly) – but given my noob-ness with JSON and Data Factory in general, as well as the fact that it felt really clunky when trying to change schema quickly, I decided to be boring and revert to good ol’ SSIS instead.

    I feel like there’s a huge gap here for someone to build a simpler data load tool for this!  And yes, I did also try using the Data Factory “Copy Wizard” (still in preview at this stage). While it did allow me to set up a basic table copy, I then wanted to modify the JSON pipeline slightly due to some data type issues, and amusingly the Azure Portal threw an error when I saved my changes because the default quota limits pipeline JSON objects to 200KB, and mine was *just* over that. You can request an increase, but I couldn’t be bothered and basically ragequit at this point. 😛

    You see, the problem is that when you’re the sole infrastructure & database guy for a smallish start-up company, you don’t have time to spend a few days learning the ins-and-outs just to set up a basic data transfer. I need something that just works, quickly, so I can move on to solving tickets, optimising database performance, flying, checking on the test/dev environments, etc, etc, etc…

    I’ll keep an eye on the copy wizard though, as I’m sure they’ll improve it over time, and it seems to be the closest to what I’m looking for at this stage.

    It’s not all bad

    Having said all of that, I’m still sticking with SQL Data Warehouse as my BI/BA back-end, and have been impressed with the performance of loads (even just done via SSIS packages) as well as query performance.

    Screenshot of an SSIS package design interface, showing various data flow and control flow tasks for data integration.

    I made sure to split the data load aspects of my package up so as to utilise the parallel nature of SQL Data Warehouse, so I’m guessing that will be helping performance.  I’ve also built some proof-of-concept PowerBI dashboards over the top of the data warehouse, which was ridiculously easy (and quite satisfying).

    Let me know if you’ve had any similar experiences (good or bad) with loading data into SQL Data Warehouse, or moving data around within the cloud.

    Cheers,
    Dave