Tag: sql

  • SSIS Data Flow Plus!

    In my previous blog post I talked about BIML, and how it might revolutionise my approach to creating ETL processes.  It’s pretty cool, and very powerful, but there is a bit of a learning curve, so I decided to look for a different way to achieve the same thing, but that required less upskill-time, and preferably less development time too.

    So, the ideal solution will:

    • be quick to build initially, and easy to maintain in the long run.
    • allow for parallel data loads to make the best use of the available resources.
    • allow for ad-hoc changes to the load or schema without having to open, make changes to, and re-deploy the SSIS package.

    I briefly tested several other methods (most of which involved generating large amounts of dynamic SQL and executing that against your source and/or destination). I instead decided to try out an SSIS add-on package called “Data Flow Task Plus”, which I’d never heard of before.

    What is it?

    A company called CozyRoc has developed a set of new components, and extensions to existing components within SSIS, making them a whole lot more powerful than what you get out of the box. This is nothing new; you can develop your own components relatively easily if you so choose (I’ve even dabbled with this myself many moons ago, trying to read CSV files with annoying formatting “features”).

    Data Flow Plus lets you configure dynamic data flows. You can control various options via package or project parameters, which means less time spent opening packages to edit them when your source or destination schema changes. Basically this means you can create “schema-less” ETL packages which will just transfer data from a source table to a destination table, even if you add or remove (or change) columns!  Too good to be true, right?

    The Pudding

    As they say, the proof is in the pudding, so here’s some pudding… figuratively speaking. Nothing like some green ticks in SSIS to make your afternoon!

    That’s the end result of my proof-of-concept, but don’t worry, I’ll step you through it.

    First-things-first, you’ll need to go to the CozyRoc website and download the package, either 32 or 64-bit depending on your requirements.

    Once that’s done and you open Visual Studio, you’ll notice a bunch of new components in your SSIS Toolbox. The only one I’m covering here though is the new Data Flow Task Plus (highlighted), although I may cover more in future as there are a couple that sound interesting (like parallel foreach loops!).

    New Plan

    So my plan is to have table metadata stored in a table on the destination (Azure Data Warehouse) database, which is queried by the package and stored in package variables. I’ll then iterate over the list of tables, do my ETL (depending on what kind of load I’m doing), and finally load the data from the source system. Sounds simple enough (… and it is), so let’s get started.

    And yeees I know this isn’t really much of an “ETL” process… but “ELT” doesn’t roll off the tongue as easily. :-p 

    Here’s a SQL script to set up for this proof-of-concept if you want to follow along. It creates 2 databases (a source and a destination), as well as a table to store metadata about the tables I want loaded from one to the other.

    CREATE DATABASE DWSource; 
    GO 
    CREATE DATABASE DWDestination; 
    GO
    USE DWDestination;
    -- DROP TABLE LoadConfiguration 
    CREATE TABLE dbo.LoadConfiguration ( 
        LoadStream TINYINT NOT NULL, 
        TableName NVARCHAR(100) NOT NULL, 
        SqlCreateStmt NVARCHAR(MAX) NOT NULL, 
        IndexColumnName NVARCHAR(100) NOT NULL, 
        LoadType NVARCHAR(20) NOT NULL, 
        ColumnListToLoad NVARCHAR(MAX) NOT NULL 
        )
    -- These are very simplified versions of a few tables in our (Timely’s) database. You'll need to create them in the source database if you want to test this yourself.
    INSERT LoadConfiguration VALUES (1, 'Booking', REPLACE('CREATE TABLE [dbo].[Booking]( 
        [BookingId] [int] NOT NULL, 
        [CustomerId] [int] NOT NULL, 
        [StartDate] [datetime] NOT NULL, 
        [EndDate] [datetime] NOT NULL, 
        [Price] [money] NULL, 
        [BusinessId] [int] NOT NULL 
    )','NOT NULL','NULL'), 'BookingId', 'Full', 'BookingId, CustomerId, StartDate, EndDate, Price, BusinessId')
    INSERT LoadConfiguration VALUES (1, 'Business', REPLACE('CREATE TABLE [dbo].[Business]( 
        [BusinessId] [int] NOT NULL, 
        [Name] [nvarchar](100) NOT NULL, 
        [DateCreated] [datetime] NOT NULL, 
        [Description] [nvarchar](max) NULL 
    )','NOT NULL','NULL'), 'BusinessId', 'Full', 'BusinessId, Name, DateCreated')
    INSERT LoadConfiguration VALUES (1, 'Customer', REPLACE('CREATE TABLE [dbo].[Customer]( 
        [CustomerId] [int] NOT NULL, 
        [BusinessId] [int] NOT NULL, 
        [FirstName] [nvarchar](50) NULL, 
        [LastName] [nvarchar](50) NULL, 
        [DateCreated] [datetime] NOT NULL 
    )','NOT NULL','NULL'), 'CustomerId', 'Full', 'CustomerId, BusinessId, FirstName, LastName, DateCreated')

    With this proof-of-concept I want to test that I can create tables, prepare them, and then load only the columns that I want loaded.
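
    If you want to follow along, the source tables can be created from the same definitions that are stored in LoadConfiguration. The following is only a sketch: the column definitions are copied from the script above, but the sample rows are made up purely so there is something to load.

    ```sql
    USE DWSource;
    GO
    -- Column definitions copied from the CREATE statements stored in LoadConfiguration.
    CREATE TABLE dbo.Business (
        BusinessId  INT           NOT NULL,
        Name        NVARCHAR(100) NOT NULL,
        DateCreated DATETIME      NOT NULL,
        Description NVARCHAR(MAX) NULL
    );
    CREATE TABLE dbo.Customer (
        CustomerId  INT          NOT NULL,
        BusinessId  INT          NOT NULL,
        FirstName   NVARCHAR(50) NULL,
        LastName    NVARCHAR(50) NULL,
        DateCreated DATETIME     NOT NULL
    );
    CREATE TABLE dbo.Booking (
        BookingId  INT      NOT NULL,
        CustomerId INT      NOT NULL,
        StartDate  DATETIME NOT NULL,
        EndDate    DATETIME NOT NULL,
        Price      MONEY    NULL,
        BusinessId INT      NOT NULL
    );
    GO
    -- Hypothetical sample rows (not real data).
    INSERT dbo.Business (BusinessId, Name, DateCreated, Description)
    VALUES (1, N'Example Business', '20160101', N'Description is not in ColumnListToLoad');
    INSERT dbo.Customer (CustomerId, BusinessId, FirstName, LastName, DateCreated)
    VALUES (1, 1, N'Jane', N'Doe', '20160102');
    INSERT dbo.Booking (BookingId, CustomerId, StartDate, EndDate, Price, BusinessId)
    VALUES (1, 1, '2016-02-01T09:00:00', '2016-02-01T10:00:00', 50.00, 1);
    ```

    Because Description is left out of the Business row's ColumnListToLoad, it makes a handy check that only the listed columns get loaded.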

    Variables & Expressions

    A small but important part of creating a package like this is making sure you get your variable expressions right – i.e. make the various SQL statements and values you use as dynamic as possible.  As an example here are my variables for this little package. Note the expression column and how values are stitched together when it comes to building SQL commands used by the various components.

    From top-to-bottom, we’ve got:

    • ColumnListToLoad – this is the list of columns from the source table that I want loaded into the destination table.
    • IndexColumnName – the name of the “ID” column that I can use to tell where to load from if doing an incremental load. In the real world I’ll probably make the package handle either IDs or DateTime columns, because with some tables it will make more sense to load based on a load-date.
    • IndexColumnValue – if doing an incremental load, then this variable will be populated with the max value of the index column (IndexColumnName) already loaded into the data warehouse.
    • LoadSettings – the System.Object variable which will hold the full result set of the initial SQL query, and feed it into the ForEach loop container. Nom nom nom…
    • LoadType – whether we’re doing a Full or Incremental load. Could cater for other load types here too.
    • SQL_DeleteStatement – a SQL delete statement based on an expression. If doing an incremental load then this will delete any data that may exist after the current max IndexColumnValue, which should help prevent duplicates.
    • SQL_DropStatement – a SQL table drop statement. Probably didn’t need to be a fully dynamic expression, but for some reeeeaally important or large tables, you may want to disable accidental drops by putting something harmless in this variable for those specific tables.
    • SQL_LoadStatement – a SQL select statement which will pull the data from the source table. This select statement will make use of the ColumnListToLoad variable, as well as the SQL_WhereClause variable if performing an incremental load.
    • SQL_MaxIdValueStatement – SQL statement to get the max Id value and populate the IndexColumnValue variable.
    • SQL_WhereClause – snippet of SQL depending on whether we’re performing an incremental load, and the value of the IndexColumnValue variable.
    • SqlCreateStatement – The SQL create table statement for the destination table. In this example it’s just an exact copy of the source table. I tend to pull production data across into tables matching the source schema, even if my “ColumnListToLoad” variable means that I’m only loading a subset of columns. This means that if I need to add columns to the load later, I don’t need to change the create scripts.
    • TableName – the name of the source (and in this case, destination) table.
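
    To make the “stitching together” a bit more concrete, here is a rough T-SQL equivalent of how those statements could be built from the other variables. It is a sketch only: the real package does this with SSIS expressions, and the sample values (an Incremental load of Booking with a max Id of 1000) are assumed for illustration.

    ```sql
    -- Sample values standing in for one row of LoadConfiguration.
    -- The 'Incremental' load type and the max Id of 1000 are assumptions.
    DECLARE @TableName        NVARCHAR(100) = N'Booking',
            @IndexColumnName  NVARCHAR(100) = N'BookingId',
            @IndexColumnValue BIGINT        = 1000,
            @LoadType         NVARCHAR(20)  = N'Incremental',
            @ColumnListToLoad NVARCHAR(MAX) = N'BookingId, CustomerId, StartDate, EndDate, Price, BusinessId';

    -- SQL_WhereClause: only filters when doing an incremental load.
    DECLARE @SQL_WhereClause NVARCHAR(MAX) =
        CASE WHEN @LoadType = N'Incremental'
             THEN N' WHERE ' + QUOTENAME(@IndexColumnName) + N' > ' + CAST(@IndexColumnValue AS NVARCHAR(20))
             ELSE N''
        END;

    -- SQL_MaxIdValueStatement: run against the destination to populate IndexColumnValue.
    DECLARE @SQL_MaxIdValueStatement NVARCHAR(MAX) =
        N'SELECT ISNULL(MAX(' + QUOTENAME(@IndexColumnName) + N'), 0) FROM dbo.' + QUOTENAME(@TableName) + N';';

    -- SQL_DeleteStatement: removes destination rows past the current max value.
    DECLARE @SQL_DeleteStatement NVARCHAR(MAX) =
        N'DELETE FROM dbo.' + QUOTENAME(@TableName) + @SQL_WhereClause + N';';

    -- SQL_LoadStatement: pulls only the configured columns from the source.
    DECLARE @SQL_LoadStatement NVARCHAR(MAX) =
        N'SELECT ' + @ColumnListToLoad + N' FROM dbo.' + QUOTENAME(@TableName) + @SQL_WhereClause + N';';

    -- Show the generated statements.
    SELECT @SQL_MaxIdValueStatement AS MaxIdValueStatement,
           @SQL_DeleteStatement     AS DeleteStatement,
           @SQL_LoadStatement       AS LoadStatement;
    ```

    Wrapping the table and column names in QUOTENAME is a design choice rather than something taken from the package; it guards against odd characters in names that come from a configuration table.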

    The Package

    Here are the steps in my package (and a chance for you to admire my l33t Windows Snipping tool handwriting skillz!). Note that I’m not going to go into a whole lot of detail here, because the purpose of this post isn’t to cover all things SSIS. Instead I’ll link to other sites which explain each step or series of steps more clearly.

    1. Select from the [LoadConfiguration] table, and stick the result-set into an object variable.

    2. Use a ForEach container to loop through each ‘row’ in the above object variable, assigning the individual values to variables scoped to the container.

    3. There are separate sequence containers for Full and Incremental loads. Their disabled states are set via an Expression which is based on the value from the [LoadType] column grabbed from the [LoadConfiguration] table above. So, if we’re doing a full load, the Incremental load container will be disabled, and vice versa. Another (possibly better) way of doing this would be to use precedence constraints with expressions to control the path of execution.

    4. As above, but for the ‘Incremental’ [LoadType] value…

    5. Load data using the new Data Flow Task Plus component. The best way to figure out how to do this is to watch the (rather dry) video from CozyRoc on this page.  But basically it involves setting up the component just like you would the normal data flow task, but then removing all columns from the outputs and inputs (using the advanced editor), and leaving only a single “placeholder/dummy” column. This placeholder column is brilliantly named “THUNK_COLUMN”.

    Here’s another good blog post on a more complex setup using this component and SharePoint.
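
    For step 1, a query along these lines would do the job. It is a sketch and not necessarily the exact query used in the package; in particular, filtering on LoadStream = 1 is an assumption based on the three sample rows all using stream 1.

    ```sql
    -- Returns one row per table to load; the result set goes into the LoadSettings object variable.
    -- The WHERE clause is an assumed filter (one package instance per load stream).
    SELECT   TableName,
             SqlCreateStmt,
             IndexColumnName,
             LoadType,
             ColumnListToLoad
    FROM     dbo.LoadConfiguration
    WHERE    LoadStream = 1
    ORDER BY TableName;
    ```

    The column order matters here, because the ForEach ADO enumerator maps result columns to variables by zero-based position rather than by name.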

    Conclusion

    Dunno… haven’t finished implementing the real thing yet. But the proof of concept is working well, and it went together pretty quickly, so I’m positive this will work, I think…

    I’ll update this post with my thoughts once I’ve got it all working. As usual please let me know if I’ve made any glaring mistakes, or if you’ve got some awesome ideas on how to improve this process further.

    Cheers,
    Dave

  • BIML, where have you been all my life?

    I’ve used the BIDS Helper Visual Studio add-on for years now, and I’ve seen and heard of BIML, but it’s one of those things I’ve never needed to look into any further than that.  Until I discovered that it’s something that would’ve saved me hours of tedious SSIS work!

    What is it?

    BIML (Business Intelligence Markup Language), or more specifically, BIMLScript, is sort of a mashup of XML and C# code nuggets, allowing you to create SSIS packages and SSAS objects.  This is very much the condensed “DBDave” version – check out the official site for a much more eloquent explanation of what it is.

    Basic Example

    When you open up your SSIS project in Visual Studio, if you’ve got BIDS Helper installed, then when you right-click on the project you have the option of adding a BIML file:

    Screenshot of Visual Studio Solution Explorer showing the option to add a new BIML file under the 'BIML Test' project.

    It’ll create a new file under “Miscellaneous” in your project. Go ahead and open it up and you’ll see something like this:

    Screenshot of a Visual Studio interface showing a BIML script file titled 'BimScript.biml' open on the left and the Solution Explorer on the right, highlighting the BIML Test project structure.

    You can “execute” a BIMLScript by right-clicking on it, and selecting “Generate SSIS Packages”:

    A screenshot showing the context menu options in Visual Studio with the 'Generate SSIS Packages' option highlighted, under the 'BimlScript' folder within the 'Miscellaneous' section.

    Now we can jump in the deep end and paste the following into this new BIML script:

    <Biml xmlns="http://schemas.varigence.com/biml.xsd"> 
        <Connections> 
            <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
            <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
        </Connections> 
        <Projects> 
            <PackageProject Name="BIMLTest"> 
                <Parameters> 
                    <Parameter Name="DateFrom" DataType="DateTime">2016-01-01</Parameter> 
                    <Parameter Name="DestinationDatabase" DataType="String">tempdb</Parameter> 
                    <Parameter Name="DestinationServer" DataType="String">localhost</Parameter> 
                    <Parameter Name="DoTruncate" DataType="Boolean">false</Parameter> 
                    <Parameter Name="SourceDatabase" DataType="String">tempdb</Parameter> 
                    <Parameter Name="SourceServer" DataType="String">localhost</Parameter> 
                </Parameters> 
                <Packages> 
                    <Package PackageName="BIMLTestPackage" /> 
                </Packages> 
            </PackageProject> 
        </Projects> 
        <Packages> 
            <Package Name="BIMLTestPackage" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive"> 
                <Connections> 
                    <Connection ConnectionName="SourceConn"> 
                        <Expressions> 
                            <Expression ExternalProperty="InitialCatalog">@[$Project::SourceDatabase]</Expression>
                            <Expression ExternalProperty="ServerName">@[$Project::SourceServer]</Expression> 
                        </Expressions> 
                    </Connection> 
                    <Connection ConnectionName="DestinationConn"> 
                        <Expressions> 
                            <Expression ExternalProperty="InitialCatalog">@[$Project::DestinationDatabase]</Expression> 
                            <Expression ExternalProperty="ServerName">@[$Project::DestinationServer]</Expression> 
                        </Expressions> 
                    </Connection> 
                </Connections> 
                <Tasks> 
                    <Container Name="Truncate Destination Table" ConstraintMode="Parallel"> 
                        <Expressions> 
                            <Expression ExternalProperty="Disable">!(@[$Project::DoTruncate])</Expression> 
                        </Expressions> 
                        <Tasks> 
                            <ExecuteSQL Name="Truncate Table" ConnectionName="DestinationConn"> 
                                <DirectInput> 
                                    TRUNCATE TABLE dbo.DWDestinationTableExample; 
                                </DirectInput> 
                            </ExecuteSQL> 
                        </Tasks> 
                    </Container> 
                    <Container Name="Load Table" ConstraintMode="Linear"> 
                        <Tasks> 
                            <Dataflow Name="Load dbo.DWDestinationTableExample"> 
                                <Transformations> 
                                    <OleDbSource Name="Source" ConnectionName="SourceConn"> 
                                        <DirectInput> 
                                            SELECT * FROM dbo.DWSourceTableExample WHERE KeyDate >= ?; 
                                        </DirectInput> 
                                        <Parameters> 
                                            <Parameter Name="0" VariableName="BIMLTest.DateFrom" /> 
                                        </Parameters> 
                                    </OleDbSource> 
                                    <OleDbDestination Name="Destination" ConnectionName="DestinationConn" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="100000"> 
                                        <ExternalTableOutput Table="dbo.DWDestinationTableExample"> 
                                        </ExternalTableOutput> 
                                    </OleDbDestination> 
                                </Transformations> 
                            </Dataflow> 
                        </Tasks> 
                    </Container> 
                </Tasks> 
            </Package> 
        </Packages> 
    </Biml>

    What the… ?!?

    Yeah, okay, let’s step through this to figure out what it does.  I’ll show you what each bit of code results in too, which might help make it more tangible/understandable:

    <Connections>
        <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
        <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
    </Connections>

    First we set up the connections that will exist within the package. These are just connections to tempdb on my local SQL instance for testing. This bit results in this:

    Screenshot of Connection Managers in Visual Studio displaying DestinationConn and SourceConn.

    Next up, we specify the project and some project parameters that we’re going to use within the package:

    <Projects> 
        <PackageProject Name="BIMLTest"> 
            <Parameters> 
                <Parameter Name="DateFrom" DataType="DateTime">2016-01-01</Parameter> 
                <Parameter Name="DestinationDatabase" DataType="String">tempdb</Parameter> 
                <Parameter Name="DestinationServer" DataType="String">localhost</Parameter> 
                <Parameter Name="DoTruncate" DataType="Boolean">false</Parameter> 
                <Parameter Name="SourceDatabase" DataType="String">tempdb</Parameter> 
                <Parameter Name="SourceServer" DataType="String">localhost</Parameter> 
            </Parameters> 
            <Packages> 
                <Package PackageName="BIMLTestPackage" /> 
            </Packages> 
        </PackageProject> 
    </Projects>

    There are some gotchas regarding project parameters in BIML when using BIDS Helper to check and run your BIMLScript, so keep that in mind.  As per this example, you need to specify the project parameter definitions in here, even if they already exist within your project.

    So because of these issues, I found it simpler just to make sure the parameters already exist, like this:

    Screenshot of the project parameters window in SQL Server Data Tools, displaying parameters for a BIMLTestPackage including Name, Data Type, and Value columns.

    Now we create the package itself, and substitute in some of the project parameters, which in this case we’re using to replace parts of the connection strings for our source and destination connections.

        <Packages>
            <Package Name="BIMLTestPackage" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
                <Connections>
                    <Connection ConnectionName="SourceConn">
                        <Expressions>
                            <Expression ExternalProperty="InitialCatalog">@[$Project::SourceDatabase]</Expression>
                            <Expression ExternalProperty="ServerName">@[$Project::SourceServer]</Expression>
                        </Expressions>
                    </Connection>
                    <Connection ConnectionName="DestinationConn">
                        <Expressions>
                            <Expression ExternalProperty="InitialCatalog">@[$Project::DestinationDatabase]</Expression>
                            <Expression ExternalProperty="ServerName">@[$Project::DestinationServer]</Expression>
                        </Expressions>
                    </Connection>
                </Connections>

    This is the same as this part in the user interface:

    A screenshot of a database query results table showing columns for row ID, word, and operator. The table includes various SQL-related terms and their corresponding operators.

    Finally we add the meat to this SSIS sandwich; the components that perform the actual transformation and/or loading of data.

                <Tasks>
                    <Container Name="Truncate Destination Table" ConstraintMode="Parallel">
                        <Expressions>
                            <Expression ExternalProperty="Disable">!(@[$Project::DoTruncate])</Expression>
                        </Expressions>
                        <Tasks>
                            <ExecuteSQL Name="Truncate Table" ConnectionName="DestinationConn">
                                <DirectInput>
                                    TRUNCATE TABLE dbo.DWDestinationTableExample;
                                </DirectInput>
                            </ExecuteSQL>
                        </Tasks>
                    </Container>
                    <Container Name="Load Table" ConstraintMode="Linear">
                        <Tasks>
                            <Dataflow Name="Load dbo.DWDestinationTableExample">
                                <Transformations>
                                    <OleDbSource Name="Source" ConnectionName="SourceConn">
                                        <DirectInput>
                                            SELECT * FROM dbo.DWSourceTableExample WHERE KeyDate >= ?;
                                        </DirectInput>
                                        <Parameters>
                                            <Parameter Name="0" VariableName="BIMLTest.DateFrom" />
                                        </Parameters>
                                    </OleDbSource>
                                    <OleDbDestination Name="Destination" ConnectionName="DestinationConn" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="100000">
                                        <ExternalTableOutput Table="dbo.DWDestinationTableExample">
                                        </ExternalTableOutput>
                                    </OleDbDestination>
                                </Transformations>
                            </Dataflow>
                        </Tasks>
                    </Container>
                </Tasks>
            </Package>
        </Packages>
    </Biml>

    We’ve got an “Execute SQL” component running a truncate of the destination table first.  However, we only want this to run if we’ve set our project parameter “DoTruncate” to true.

    Screenshot of the SSIS package 'BIMLTestPackage' in Visual Studio, showing the 'Truncate Destination Table' task and its properties on the right.

    And lastly a Data Flow task to move data.  This is done using a SQL query with a parameter for a “KeyDate” column, as an illustration of what you might do in a real-life situation.

    Load Table transformation in SSIS package for dbo_DWDestinationTableExample
    Screenshot showing the Set Query Parameters dialog in SQL Server Integration Services (SSIS) with parameter mapping for a Data Flow task.

    Cool! Now what??

    So that’s BIML in a very small nutshell.  Even if that’s all you’re doing with it (i.e. creating pretty basic packages) I think it’s worth doing since it makes source control of your packages SOOOOOO much nicer!

    Imagine getting a pull request from a developer who’s made some SSIS changes, and simply being able to diff the BIML scripts to see exactly what they’ve changed!?

    But wait, there’s more…

    In the scenario that led me to discover BIML, I wanted to create a “dynamic” SSIS package that was driven by metadata stored in a database.  In other words, I could maintain a table with a list of table names that I wanted “ETL’d” from my production system to my data-warehouse, and my magic SSIS package would pick up changes, new tables added, etc without me needing to open and edit one monstrous package.

    This is where the power of BIMLScript and its C# nuggets really shines. It lets you drop in complicated logic in C# code to control and mould the output of the BIML.  So you could look up a list of tables to load, then iterate over that list, creating packages per table.  Check out this post for a lot more detail (and examples) on how to achieve this.

    That’s it for now. There are lots of more detailed examples around if you look for them (Google is your friend), and I just wanted to highlight the possibilities which I didn’t realise were there before. Hopefully you find it as useful as I did.

    Cheers,
    Dave

  • Azure SQL to Azure Data Warehouse ETL

    I’ve recently needed to move data from our transactional database (an Azure SQL database), into an Azure SQL Data Warehouse. A definite case of “harder than it needed to be”…

    What’s an Azure Data Warehouse?

    I’ll assume if you’ve read this far, you know what a SQL database is. But an Azure Data Warehouse is a slightly different beast; it’s another hosted/managed service offered on Microsoft’s Azure cloud infrastructure. They market it as a distributed & elastic database that can support petabytes of data, while offering enterprise-class features.

    Diagram illustrating the architecture of SQL Data Warehouse, including connections to various data sources such as SQL databases, Azure Tables, and Azure Data Lake.

    That essentially means that behind the scenes this “database” is actually a bunch of orchestrated nodes working together (a control node, multiple compute nodes, storage, etc).  Queries against this distributed database are themselves split up and run in parallel across these nodes – i.e. “MPP”, or Massively Parallel Processing. That’s very much it in a nutshell – for a lot more detail though, read this as well.

    Why use this over other solutions?

    I originally set up an old-school SSAS instance on an Azure VM, backed by a normal SQL Server data warehouse hosted on the same VM. Not very exciting, but it worked.  The struggle was that to get data from our production database (an Azure SQL Database) into this VM required either SSIS packages pulling data across the wire, or a restore of the prod database locally (i.e. onto the VM) and then extracting the data from that using cross-database queries.

    Then I read up on these relatively new Azure Data Warehouses, and I assumed that *surely* there would be a much simpler/better way of moving data directly from one to the other natively, within the “cloud”.

    “Cloud-to-cloud” ETL FTW!

    Tweet from David Curlewis asking if there is a cloud-native way to ETL data from an Azure SQL Database to Azure Data Warehouse, tagged with #azure #azuredw #thisshouldbeeasierthanitis.

    I asked the question, and the consensus seemed to be that Data Factory is the cool new way to move your data around the cloud.  So I gave that a crack. Be warned, you’ll need to brush up on JSON (since you’ll need to be comfy writing/modifying JSON to set up the data sources, control the pipelines, etc).

    All the examples I found seem to involve Blob-to-SQL, or SQL-to-Blob data loads.  So I figured out how the bits and pieces work together, how to customise the JSON to set up the correct sources, pipelines, etc, and then kicked it off.  It didn’t work… <sadface>

    The issues I ran into were definitely solvable (data type conversion issues mostly) – but given my noob-ness with JSON and Data Factory in general, as well as the fact that it felt really clunky when trying to change schema quickly, I decided to be boring and go back to good ol’ SSIS instead.

    I feel like there’s a huge gap here for someone to build a simpler data load tool for this!  And yes, I did also try using the Data Factory “Copy Wizard” (still in preview at this stage). While it did allow me to set up a basic table copy, I then wanted to modify the JSON pipeline slightly due to some data type issues, and amusingly the Azure Portal threw an error when I saved my changes because the default quota limits pipeline JSON objects to 200KB, and mine was *just* over that. You can request an increase, but I couldn’t be bothered and basically ragequit at this point. 😛

    You see, the problem is that when you’re the sole infrastructure & database guy for a smallish start-up company, you don’t have time to spend a few days learning the ins-and-outs just to set up a basic data transfer. I need something that just works, quickly, so I can move on to solving tickets, optimising database performance, flying, checking on the test/dev environments, etc, etc, etc…

    I’ll keep an eye on the copy wizard though, as I’m sure they’ll improve it over time, and it seems to be the closest to what I’m looking for at this stage.

    It’s not all bad

    Having said all of that, I’m still sticking with SQL Data Warehouse as my BI/BA back-end, and have been impressed with the performance of loads (even just done via SSIS packages) as well as query performance.

    Screenshot of an SSIS package design interface, showing various data flow and control flow tasks for data integration.

    I made sure to split the data load aspects of my package up so as to utilise the parallel nature of SQL Data Warehouse, so I’m guessing that will be helping performance.  I’ve also built some proof-of-concept PowerBI dashboards over the top of the data warehouse, which was ridiculously easy (and quite satisfying).

    Let me know if you’ve had any similar experiences (good or bad) with loading data into SQL Data Warehouse, or moving data around within the cloud.

    Cheers,
    Dave

  • Keeping track of your SQL Server index “inventory”

    Something’s always bugged me when it comes to managing indexes in SQL Server: keeping track of when I created or dropped them.

    You can already check just about everything you could want to know about indexes, via handy DMVs.  Need to figure out how much your indexes are being used? Take a look at sys.dm_db_index_usage_stats.  Or how about which indexes you might need to add to improve performance? Easy, just query sys.dm_db_missing_index_group_stats! You get the idea…
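
    As a quick illustration of the usage DMV, a query like the following lists every index on user tables alongside its read and write counts. It is a minimal sketch; the counters reset when the instance restarts, so treat the numbers as “since last restart” rather than all-time.

    ```sql
    -- Index usage per user table in the current database.
    -- Indexes with no usage row yet show NULL counters (LEFT JOIN).
    SELECT   OBJECT_NAME(i.[object_id]) AS table_name,
             i.name                     AS index_name,
             us.user_seeks,
             us.user_scans,
             us.user_lookups,
             us.user_updates
    FROM     sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS us
             ON  us.[object_id] = i.[object_id]
             AND us.index_id    = i.index_id
             AND us.database_id = DB_ID()
    WHERE    OBJECTPROPERTY(i.[object_id], 'IsUserTable') = 1
    ORDER BY table_name, index_name;
    ```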

    But what about the create date of an index, or when it was last rebuilt, or even when you dropped an index? For this you need to roll up your sleeves and roll your own solution.

    How to skin this cat?

    There are a few ways we can do this.

    Scheduled job

    The first way I used to do this was to just have a scheduled job running fairly regularly (like every 15 – 30 minutes) which checked for any changes to indexes in the database since the last time it ran. Any new ones would be added to the table, changes would be recorded, and dropped indexes would be noted as such.  In fact, I used a version of Kimberly Tripp’s “improved sp_helpindex” to gather and store the index information in a nice format (i.e. with separate columns for included columns, compression, etc).

    This is what the “guts” of the proc look like, just to give you an idea:

    DECLARE TimelyTables CURSOR FAST_FORWARD FOR
    		SELECT	DISTINCT 
    				QUOTENAME(ss.name) + '.' + QUOTENAME(st.name) AS TableName,
    				st.[object_id]
    		FROM	sys.tables AS st
    		JOIN	sys.schemas AS ss ON st.[schema_id] = ss.[schema_id]
    		WHERE	st.is_ms_shipped = 0;
    
    	OPEN TimelyTables;
    	FETCH NEXT FROM TimelyTables INTO @TableName, @ObjectId;
    	WHILE @@FETCH_STATUS = 0
    	BEGIN
    		TRUNCATE TABLE #tt_Indexes;
    
    		RAISERROR('Table: %s (%i)',10,1,@TableName,@ObjectId);
    
    		INSERT	#tt_Indexes
    		EXEC	dbo.sp_helpindex2 @TableName; 
    
    		IF @@ROWCOUNT > 0
    		BEGIN 
    			INSERT	#index_history ([object_id], table_name, index_id, is_disabled, index_name, index_description, index_keys, included_columns, filter_definition, [compression]) 
    			SELECT	@ObjectId, @TableName, t.index_id, t.is_disabled, t.index_name, t.index_description, t.index_keys, t.included_columns, t.filter_definition, t.[compression] 
    			FROM	#tt_Indexes AS t;  
    		END 
    
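    		FETCH NEXT FROM TimelyTables INTO @TableName, @ObjectId;
    	END

    	-- Release the cursor once every table has been processed.
    	CLOSE TimelyTables;
    	DEALLOCATE TimelyTables;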

    Outside of this loop you can then do your MERGE comparison between “current” indexes, and what was recorded in the “index history” table from the previous run.
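
    That MERGE could look something like the sketch below. The table shapes are assumptions: #index_history here is a trimmed-down version of the snapshot built in the loop, and #index_inventory (with its first_seen, last_changed and dropped_on columns) is a hypothetical stand-in for the permanent history table.

    ```sql
    -- This run's snapshot (trimmed-down, assumed shape).
    CREATE TABLE #index_history (
        [object_id]      INT           NOT NULL,
        table_name       NVARCHAR(300) NOT NULL,
        index_id         INT           NOT NULL,
        index_name       SYSNAME       NOT NULL,
        index_keys       NVARCHAR(MAX) NULL,
        included_columns NVARCHAR(MAX) NULL
    );

    -- Hypothetical stand-in for the permanent inventory table.
    CREATE TABLE #index_inventory (
        [object_id]      INT           NOT NULL,
        table_name       NVARCHAR(300) NOT NULL,
        index_id         INT           NOT NULL,
        index_name       SYSNAME       NOT NULL,
        index_keys       NVARCHAR(MAX) NULL,
        included_columns NVARCHAR(MAX) NULL,
        first_seen       DATETIME2(0)  NOT NULL,
        last_changed     DATETIME2(0)  NULL,
        dropped_on       DATETIME2(0)  NULL
    );

    DECLARE @now DATETIME2(0) = SYSUTCDATETIME();

    MERGE #index_inventory AS tgt
    USING #index_history   AS src
        ON  tgt.[object_id] = src.[object_id]
        AND tgt.index_name  = src.index_name
    -- New index: record when it was first seen.
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([object_id], table_name, index_id, index_name, index_keys, included_columns, first_seen)
        VALUES (src.[object_id], src.table_name, src.index_id, src.index_name, src.index_keys, src.included_columns, @now)
    -- Definition changed, or a previously dropped index has reappeared.
    WHEN MATCHED AND (   ISNULL(tgt.index_keys, N'')       <> ISNULL(src.index_keys, N'')
                      OR ISNULL(tgt.included_columns, N'') <> ISNULL(src.included_columns, N'')
                      OR tgt.dropped_on IS NOT NULL) THEN
        UPDATE SET index_keys       = src.index_keys,
                   included_columns = src.included_columns,
                   last_changed     = @now,
                   dropped_on       = NULL
    -- Index no longer exists: flag it as dropped, but keep the row.
    WHEN NOT MATCHED BY SOURCE AND tgt.dropped_on IS NULL THEN
        UPDATE SET dropped_on = @now;
    ```

    Note that the dates recorded this way are only as precise as the job schedule: an index created and dropped between two runs never shows up at all.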

    DDL trigger

    DDL triggers are nothing new, but they can be very useful for auditing schema and login changes, etc. So it makes sense that this is ideally suited to creating an inventory of your indexes (if not all database objects that you might be interested in).  In fact, you can even quite easily create your own poor-man’s source control system, but that’s a different kettle of fish.

    The idea behind DDL triggers is that you specify which ‘events‘ you want them to fire for at a database or server level.  In my case, working with Azure, I’m only interested in database level events.  In fact, in this case I’m only interested in recording the details of any CREATE INDEX, ALTER INDEX, or DROP INDEX statements.  Which looks like this:

    CREATE TRIGGER trg_IndexChangeLog ON DATABASE 
        FOR CREATE_INDEX, ALTER_INDEX, DROP_INDEX 
    AS 
    ...

    Now we just insert the EVENTDATA data into our logging table, like so:

    IF OBJECT_ID('dba.IndexChangeLog') IS NOT NULL 
    BEGIN
    	DECLARE @data XML; 
    	SET @data = EVENTDATA(); 
    
    	INSERT	dba.IndexChangeLog(eventtype, objectname, objecttype, sqlcommand, loginname) 
    	VALUES	( 
    			@data.value('(/EVENT_INSTANCE/EventType)[1]', 'varchar(50)'), 
    			@data.value('(/EVENT_INSTANCE/ObjectName)[1]', 'varchar(256)'), 
    			@data.value('(/EVENT_INSTANCE/ObjectType)[1]', 'varchar(25)'), 
    			@data.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'varchar(max)'), 
    			@data.value('(/EVENT_INSTANCE/LoginName)[1]', 'varchar(256)') 
    			); 
    END;
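
    The logging table itself isn’t shown above, so here is a minimal sketch of what it could look like. The five column names come from the INSERT statement; the data types mirror the types used in the value() calls, and the changelogid and eventdate columns are my own assumptions, added so that each row is identifiable and timestamped. It also assumes the dba schema already exists.

    CREATE TABLE dba.IndexChangeLog
    (
        changelogid INT IDENTITY(1,1) NOT NULL PRIMARY KEY,              -- assumed surrogate key
        eventdate   DATETIME2(0) NOT NULL DEFAULT (SYSUTCDATETIME()),    -- assumed; set automatically on insert
        eventtype   VARCHAR(50)  NULL,
        objectname  VARCHAR(256) NULL,
        objecttype  VARCHAR(25)  NULL,
        sqlcommand  VARCHAR(MAX) NULL,
        loginname   VARCHAR(256) NULL
    );

    -- Most recent index changes first.
    SELECT  TOP (20) eventdate, eventtype, objectname, loginname, sqlcommand
    FROM    dba.IndexChangeLog
    ORDER BY eventdate DESC;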

    EVENTDATA is basically an XML document that contains the important bits you might want to record for auditing purposes… like who ran what, when.  This is what’s available:

    <EVENT_INSTANCE>
        <EventType>event</EventType>
        <PostTime>date-time</PostTime>
        <SPID>spid</SPID>
        <ServerName>name</ServerName>
        <LoginName>login</LoginName>
        <UserName>name</UserName>
        <DatabaseName>name</DatabaseName>
        <SchemaName>name</SchemaName>
        <ObjectName>name</ObjectName>
        <ObjectType>type</ObjectType>
        <TSQLCommand>command</TSQLCommand>
    </EVENT_INSTANCE>

    Best of both worlds?

    The above two methods are ones that I’ve already used successfully and have experience with – and each one offers pros and cons.  I like the detail and “query-ability” of the data I get from the scheduled job solution, but the DDL trigger is simpler and doesn’t rely on scheduled jobs running.  The trigger also clearly has a big advantage in that it’s going to pick up the actual event as it happens, whereas the job may miss stuff between executions. This may or may not be important to you.

    There may be a nice way of combining the two though.  Using Event Notifications, or DDL triggers to insert a payload onto a Service Broker queue, you could have a live and asynchronous system which gathers more detail than what’s available in the EVENTDATA.  I.e. you could have an activated procedure on the end of the SB queue which uses the index name to populate additional details, for example. Let me know in the comments if you give this a go, or if you can see any gotchas.
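
    For anyone wanting to try the Event Notification route, the plumbing might look roughly like the sketch below. I haven’t run this as part of the solutions above; the queue, service, and notification names are made up for illustration, and it assumes Service Broker is enabled in the database (check that your platform supports it at all, since not every hosted offering does).

    -- Hypothetical names throughout.
    CREATE QUEUE dbo.IndexChangeQueue;

    -- The service must use the built-in event notification contract.
    CREATE SERVICE IndexChangeService
        ON QUEUE dbo.IndexChangeQueue
        ([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]);

    -- Same three events as the DDL trigger, delivered asynchronously.
    CREATE EVENT NOTIFICATION IndexChangeNotification
        ON DATABASE
        FOR CREATE_INDEX, ALTER_INDEX, DROP_INDEX
        TO SERVICE 'IndexChangeService', 'current database';

    -- Pull one message off the queue; the body is the same kind of XML as EVENTDATA().
    RECEIVE TOP (1) CAST(message_body AS XML) AS event_data
    FROM    dbo.IndexChangeQueue;

    An activated procedure on the queue would do the RECEIVE in a loop and look up the extra index details before writing to the log table.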

    Cheers,
    Dave

     

  • SQL Server 2016 & Azure Query Store

    SQL Server 2016 & Azure Query Store

    I hadn’t been following the news much regarding SQL Server 2016, so when I did some reading the other day I was quite pleasantly surprised by some of the new features announced (not to mention that it looks like SSMS will finally be getting some much needed love). 🙂

    We’ve been having some frustrating intermittent performance issues recently, and I was struggling to gain much insight into what the issue was (we’re using Azure Databases, so the scope for troubleshooting is a little narrower than for on-premises SQL Servers).  So when I read about the “Query Store” feature available in SQL Server 2016 and Azure Database (v12) I got quite excited.

    What is it?

    I’ll keep this short and sweet since there are already a few good posts out there about the Query Store. Basically this feature allows you to track down queries which have “regressed” (i.e. a query that was performing well, and then all of a sudden turned to crap for no apparent reason).

    Not only can you track them down, you can now “pin” the old (i.e. good) execution plan.  Essentially you’re overriding the optimiser and telling it that you in fact know better.

    Sweet! How do I do it?

    You could do this before now, by forcing plans using USE PLAN query hints, etc.  But the Query Store and its related new shiny UI make it soooo much easier and “trackable”.

    Dashboard displaying a summary of regressed queries in SQL Server, highlighting query execution durations and plans.

    As I said before though, I’m not going to go into details about how to use it. I used this post to figure out how it works, how to force plans, how to monitor how it’s going, etc.

    Ok, so how did it help?

    Our problem was that we were seeing intermittent DTU spikes (remember, we’re on Azure, so this means we were maxing out our premium-tier database’s resources in some way, whether CPU, Disk I/O, etc). We tracked it down to a heavy stored procedure call which was running well 99% of the time, but would get a “bad plan” every now and then.  So we would see a spike in our app response time in New Relic, I’d jump into a query window and run an sp_recompile on this proc, and usually the problem would go away (until the next time).

    Obviously this wasn’t a sustainable approach, so I needed to either rewrite the proc to make it more stable, tweak some indexes, or force a plan.  I fired up the new “Regressed Queries” report (shown above) and it quickly highlighted the problem query.  From there it was a case of selecting the “good” plan, and hitting the “Force Plan” button. Well… I don’t trust buttons so I actually ran the TSQL equivalent, sys.sp_query_store_force_plan.
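
    For reference, the T-SQL route looks something like the sketch below. The procedure name dbo.MyHeavyProc and the query_id / plan_id values are placeholders, not the real ones from our system; you’d take the IDs from the report or from the Query Store catalog views.

    -- List the plans Query Store has recorded for statements inside a given procedure.
    -- dbo.MyHeavyProc is a placeholder name.
    SELECT  q.query_id,
            p.plan_id,
            p.is_forced_plan,
            rs.count_executions,
            rs.avg_duration,            -- reported in microseconds
            rs.last_execution_time
    FROM    sys.query_store_query AS q
    JOIN    sys.query_store_plan AS p ON p.query_id = q.query_id
    JOIN    sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
    WHERE   q.[object_id] = OBJECT_ID(N'dbo.MyHeavyProc')
    ORDER BY q.query_id, p.plan_id, rs.last_execution_time DESC;

    -- Force the "good" plan (placeholder IDs).
    EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;

    -- And to undo it later if needed.
    EXEC sys.sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;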

    Visual representation of tracked queries in SQL Server Management Studio with execution plan and query performance metrics.

    Some interesting observations

    In the above image you can see the forced plan (circles with ticks in them). What seems to happen, which initially threw me, is that when you force a plan, SQL Server generates a new plan which matches the forced plan, but which the Query Store picks up as a different plan.  This is why you see the ticks in the circles up until the point you actually pin the plan, after which you get a new, un-ticked plan.  At first I thought this meant it wasn’t working, but it does indeed seem to stick to this forced plan.

    Other uses

    I’ve also found the reports very useful even when not resorting to forcing plans.  In several cases I’ve found queries which aren’t performing as well as they should be, altered either the query or some underlying indexes, and then seen the (usually) positive resultant new plans.  The image below shows one example: this query was a bit all over the place until I slightly altered an existing index (added 1 included column), and it has since settled on a better, more stable plan (in purple).

    A graphical representation of SQL Server query performance over time, showing various plan IDs with distinct colors, highlighting performance fluctuations and trends.

    Cheers,
    Dave

  • Red Gate Azure backup failures

    Red Gate Azure backup failures

    We use Red Gate Cloud Services to back up our Azure database, as well as some storage accounts to Amazon EC2 storage, etc.  It all works really well 99% of the time, but after a few failures in a short time period I opened a ticket with Red Gate.

    The errors we got weren’t always the same though…

    “Microsoft Backup Service returned an error: Error encountered during the service operation. Database source is not a supported version of SQL Server”

    “Microsoft import/export service returned an error. BadRequest Error encountered during the service operation.”

    “Expected to drop temporary database ‘<database-name>’ but it didn’t exist, this shouldn’t happen. Please check for a temporary database on your SQL Azure Server”

    … as well as a few others.

    In Azure you don’t do a “backup” as such. The process is more like this: create a copy of your live database, export that copy to a bacpac file, then drop the copy. It seemed like the issue here was usually with the last step, where it tries to drop the database copy.
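
    To make those steps a bit more concrete, here is roughly what they look like if you do them by hand in T-SQL. This is only a sketch of the general process, not what Red Gate’s service actually runs; the database names are placeholders, and the statements are run against the master database on the Azure server.

    -- 1. Start the copy (placeholder names). The copy runs asynchronously.
    CREATE DATABASE [MyDb_BackupCopy] AS COPY OF [MyDb];

    -- 2. Wait for the copy to finish before touching it:
    --    state_desc shows COPYING while in progress and ONLINE once complete.
    SELECT  name, state_desc
    FROM    sys.databases
    WHERE   name = N'MyDb_BackupCopy';

    -- 3. Export the copy to a bacpac file using the import/export service
    --    or a client tool (this step is not done in T-SQL).

    -- 4. Drop the copy so it stops costing money.
    DROP DATABASE [MyDb_BackupCopy];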

    Red Gate basically said the errors came from Microsoft’s side of the fence, so I had to open a ticket with them instead.  This made me want to curl up into the foetal position and cry. Instead I forged on and logged the ticket, and after the back-and-forth time-wasting dance they make everyone do, we started getting somewhere (although it wasn’t great news).  They’d passed the ticket to the “product group”, but they had some “internal issues with the log mechanism for the export/import service” so troubleshooting was proving difficult. *sigh*

    Anyway, eventually they suggested that one possible cause of at least some of the errors was that the Red Gate service was trying to access the database copy before it had finished being created. So Red Gate are investigating that at the moment, and I’ll update this post when I hear back from them.  This doesn’t explain the errors regarding the dropping of the database copy (which means that sometimes we end up with a random database copy sitting on the server, costing us money until I manually delete it).

    Hopefully I’ll have some further updates on this soon.

    Dave