Tag: database

  • BIML, where have you been all my life?

    I’ve used the BIDS Helper Visual Studio add-on for years now, and I’ve seen and heard of BIML, but it’s one of those things I’ve never needed to look into any further than that.  Until I discovered that it’s something that would’ve saved me hours of tedious SSIS work!

    What is it?

    BIML (Business Intelligence Markup Language), or more specifically BIMLScript, is sort of a mashup of XML and C# code nuggets that lets you generate SSIS packages (and SSAS objects).  This is very much the condensed “DBDave” version – check out the official site for a much more eloquent explanation of what it is.

    Basic Example

    If you’ve got BIDS Helper installed, then when you open up your SSIS project in Visual Studio and right-click on the project, you’ll have the option of adding a BIML file:

    Screenshot of Visual Studio Solution Explorer showing the option to add a new BIML file under the 'BIML Test' project.

    It’ll create a new file under “Miscellaneous” in your project. Go ahead and open it up and you’ll see something like this:

    Screenshot of a Visual Studio interface showing a BIML script file titled 'BimlScript.biml' open on the left and the Solution Explorer on the right, highlighting the BIML Test project structure.

    You can “execute” a BIMLScript by right-clicking on it, and selecting “Generate SSIS Packages”:

    A screenshot showing the context menu options in Visual Studio with the 'Generate SSIS Packages' option highlighted, under the 'BimlScript' folder within the 'Miscellaneous' section.

    Now we can jump in the deep end and paste the following into this new BIML script:

    <Biml xmlns="http://schemas.varigence.com/biml.xsd"> 
        <Connections> 
            <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
            <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
        </Connections> 
        <Projects> 
            <PackageProject Name="BIMLTest"> 
                <Parameters> 
                    <Parameter Name="DateFrom" DataType="DateTime">2016-01-01</Parameter> 
                    <Parameter Name="DestinationDatabase" DataType="String">tempdb</Parameter> 
                    <Parameter Name="DestinationServer" DataType="String">localhost</Parameter> 
                    <Parameter Name="DoTruncate" DataType="Boolean">false</Parameter> 
                    <Parameter Name="SourceDatabase" DataType="String">tempdb</Parameter> 
                    <Parameter Name="SourceServer" DataType="String">localhost</Parameter> 
                </Parameters> 
                <Packages> 
                    <Package PackageName="BIMLTestPackage" /> 
                </Packages> 
            </PackageProject> 
        </Projects> 
        <Packages> 
            <Package Name="BIMLTestPackage" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive"> 
                <Connections> 
                    <Connection ConnectionName="SourceConn"> 
                        <Expressions> 
                            <Expression ExternalProperty="InitalCatalog">@[$Project::SourceDatabase]</Expression> 
                            <Expression ExternalProperty="ServerName">@[$Project::SourceServer]</Expression> 
                        </Expressions> 
                    </Connection> 
                    <Connection ConnectionName="DestinationConn"> 
                        <Expressions> 
                            <Expression ExternalProperty="InitialCatalog">@[$Project::DestinationDatabase]</Expression> 
                            <Expression ExternalProperty="ServerName">@[$Project::DestinationServer]</Expression> 
                        </Expressions> 
                    </Connection> 
                </Connections> 
                <Tasks> 
                    <Container Name="Truncate Destination Table" ConstraintMode="Parallel"> 
                        <Expressions> 
                            <Expression ExternalProperty="Disable">!(@[$Project::DoTruncate])</Expression> 
                        </Expressions> 
                        <Tasks> 
                            <ExecuteSQL Name="Truncate Table" ConnectionName="DestinationConn"> 
                                <DirectInput> 
                                    TRUNCATE TABLE dbo.DWDestinationTableExample; 
                                </DirectInput> 
                            </ExecuteSQL> 
                        </Tasks> 
                    </Container> 
                    <Container Name="Load Table" ConstraintMode="Linear"> 
                        <Tasks> 
                            <Dataflow Name="Load dbo.DWDestinationTableExample"> 
                                <Transformations> 
                                    <OleDbSource Name="Source" ConnectionName="SourceConn"> 
                                        <DirectInput> 
                                            SELECT * FROM dbo.DWSourceTableExample WHERE KeyDate >= ?; 
                                        </DirectInput> 
                                        <Parameters> 
                                            <Parameter Name="0" VariableName="BIMLTest.DateFrom" /> 
                                        </Parameters> 
                                    </OleDbSource> 
                                    <OleDbDestination Name="Destination" ConnectionName="DestinationConn" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="100000"> 
                                        <ExternalTableOutput Table="dbo.DWDestinationTableExample"> 
                                        </ExternalTableOutput> 
                                    </OleDbDestination> 
                                </Transformations> 
                            </Dataflow> 
                        </Tasks> 
                    </Container> 
                </Tasks> 
            </Package> 
        </Packages> 
    </Biml>

    What the… ?!?

    Yeah, okay, let’s step through this to figure out what it does.  I’ll show you what each bit of code results in too, which might help make it more tangible/understandable:

    <Connections> 
            <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
            <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" /> 
        </Connections>

    First we set up the connections that will exist within the package. These are just connections to tempdb on my local SQL instance for testing. This bit results in the following:

    Screenshot of Connection Managers in Visual Studio displaying DestinationConn and SourceConn.

    Next up, we specify the project and some project parameters that we’re going to use within the package:

    <Projects> 
        <PackageProject Name="BIMLTest"> 
            <Parameters> 
                <Parameter Name="DateFrom" DataType="DateTime">2016-01-01</Parameter> 
                <Parameter Name="DestinationDatabase" DataType="String">tempdb</Parameter> 
                <Parameter Name="DestinationServer" DataType="String">localhost</Parameter> 
                <Parameter Name="DoTruncate" DataType="Boolean">false</Parameter> 
                <Parameter Name="SourceDatabase" DataType="String">tempdb</Parameter> 
                <Parameter Name="SourceServer" DataType="String">localhost</Parameter> 
            </Parameters> 
            <Packages> 
                <Package PackageName="BIMLTestPackage" /> 
            </Packages> 
        </PackageProject> 
    </Projects>

    There are some gotchas regarding project parameters in BIML when using BIDS Helper to check and run your BIMLScript, so keep that in mind.  As per this example, you need to specify the project parameter definitions in here, even if they already exist within your project.

    So because of these issues, I found it simpler just to make sure the parameters already exist, like this:

    Screenshot of the project parameters window in SQL Server Data Tools, displaying parameters for a BIMLTestPackage including Name, Data Type, and Value columns.

    Now we create the package itself, and substitute in some of the package parameters, which in this case we’re using to replace parts of the connection strings for our source and destination connections.

    <Packages>
        <Package Name="BIMLTestPackage" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
            <Connections>
                <Connection ConnectionName="SourceConn">
                    <Expressions>
                        <Expression ExternalProperty="InitialCatalog">@[$Project::SourceDatabase]</Expression>
                        <Expression ExternalProperty="ServerName">@[$Project::SourceServer]</Expression>
                    </Expressions>
                </Connection>
                <Connection ConnectionName="DestinationConn">
                    <Expressions>
                        <Expression ExternalProperty="InitialCatalog">@[$Project::DestinationDatabase]</Expression>
                        <Expression ExternalProperty="ServerName">@[$Project::DestinationServer]</Expression>
                    </Expressions>
                </Connection>
            </Connections>

    This is the same as this part in the user interface:

    Screenshot of the SSIS connection manager property expressions editor, showing the InitialCatalog and ServerName properties mapped to the project parameters.

    Finally, we add the meat to this SSIS sandwich: the components that perform the actual transformation and/or loading of data.

    <Tasks>
        <Container Name="Truncate Destination Table" ConstraintMode="Parallel">
            <Expressions>
                <Expression ExternalProperty="Disable">!(@[$Project::DoTruncate])</Expression>
            </Expressions>
            <Tasks>
                <ExecuteSQL Name="Truncate Table" ConnectionName="DestinationConn">
                    <DirectInput>
                        TRUNCATE TABLE dbo.DWDestinationTableExample;
                    </DirectInput>
                </ExecuteSQL>
            </Tasks>
        </Container>
        <Container Name="Load Table" ConstraintMode="Linear">
            <Tasks>
                <Dataflow Name="Load dbo.DWDestinationTableExample">
                    <Transformations>
                        <OleDbSource Name="Source" ConnectionName="SourceConn">
                            <DirectInput>
                                SELECT * FROM dbo.DWSourceTableExample WHERE KeyDate >= ?;
                            </DirectInput>
                            <Parameters>
                                <Parameter Name="0" VariableName="BIMLTest.DateFrom" />
                            </Parameters>
                        </OleDbSource>
                        <OleDbDestination Name="Destination" ConnectionName="DestinationConn" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="100000">
                            <ExternalTableOutput Table="dbo.DWDestinationTableExample">
                            </ExternalTableOutput>
                        </OleDbDestination>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Container>
    </Tasks>
    </Package>
    </Packages>
    </Biml>

    We’ve got an “Execute SQL” component running a truncate of the destination table first.  However, we only want this to run if we’ve set our project parameter “DoTruncate” to true – which is why the container’s Disable property is driven by an expression that’s the negation of that parameter, !(@[$Project::DoTruncate]).

    Screenshot of the SSIS package 'BIMLTestPackage' in Visual Studio, showing the 'Truncate Destination Table' task and its properties on the right.

    And lastly a Data Flow task to move data.  This is done using a SQL query with a parameter for a “KeyDate” column, as an illustration of what you might do in a real-life situation.

    Load Table transformation in SSIS package for dbo_DWDestinationTableExample
    Screenshot showing the Set Query Parameters dialog in SQL Server Integration Services (SSIS) with parameter mapping for a Data Flow task.

    Cool! Now what??

    So that’s BIML in a very small nutshell.  Even if that’s all you’re doing with it (i.e. creating pretty basic packages) I think it’s worth doing since it makes source control of your packages SOOOOOO much nicer!

    Imagine getting a pull request from a developer who’s made some SSIS changes, and simply being able to diff the BIML scripts to see exactly what they’ve changed!? 🙂

    But wait, there’s more…

    In the scenario that led me to discover BIML, I wanted to create a “dynamic” SSIS package that was driven by metadata stored in a database.  In other words, I could maintain a table with a list of table names that I wanted “ETL’d” from my production system to my data warehouse, and my magic SSIS package would pick up changes, newly added tables, etc. without me needing to open and edit one monstrous package.

    This is where the power of BIMLScript and its C# nuggets really shines. It lets you drop in complicated logic in C# code to control and mould the output of the BIML.  So you could look up a list of tables to load, then iterate over that list, creating a package per table.  Check out this post for a lot more detail (and examples) on how to achieve this.
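
    To give you a taste, here’s a rough BimlScript sketch of the idea (not the exact approach from that post). It assumes a hypothetical metadata table, dbo.ETLTableList, holding the names of the tables to load; the C# nugget reads that list and the BIML below it is emitted once per table:

    <#@ import namespace="System.Data" #>
    <Biml xmlns="http://schemas.varigence.com/biml.xsd">
        <Connections>
            <Connection Name="SourceConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
            <Connection Name="DestinationConn" ConnectionString="Data Source=.;Initial Catalog=tempdb;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
        </Connections>
        <Packages>
            <#
            // Read the list of tables to load from a (hypothetical) metadata table.
            var metadataConnection = "Data Source=.;Initial Catalog=tempdb;Integrated Security=SSPI;";
            var tableList = ExternalDataAccess.GetDataTable(metadataConnection, "SELECT TableName FROM dbo.ETLTableList;");

            // Emit one package per table in the list.
            foreach (DataRow row in tableList.Rows) {
                var tableName = row["TableName"].ToString(); #>
            <Package Name="Load <#=tableName#>" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
                <Tasks>
                    <Dataflow Name="Load dbo.<#=tableName#>">
                        <Transformations>
                            <OleDbSource Name="Source" ConnectionName="SourceConn">
                                <DirectInput>SELECT * FROM dbo.<#=tableName#>;</DirectInput>
                            </OleDbSource>
                            <OleDbDestination Name="Destination" ConnectionName="DestinationConn" UseFastLoadIfAvailable="true">
                                <ExternalTableOutput Table="dbo.<#=tableName#>" />
                            </OleDbDestination>
                        </Transformations>
                    </Dataflow>
                </Tasks>
            </Package>
            <# } #>
        </Packages>
    </Biml>

    Generate the packages from that and you get one load package per row in the metadata table – add a row, regenerate, done.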

    That’s it for now. There’s lots of more detailed examples around if you look for them (Google is your friend), and I just wanted to highlight the possibilities which I didn’t realise were there before. Hopefully you find it as useful as I did.

    Cheers,
    Dave

  • Azure SQL to Azure Data Warehouse ETL

    I’ve recently needed to move data from our transactional database (an Azure SQL database), into an Azure SQL Data Warehouse. A definite case of “harder than it needed to be”…

    What’s an Azure Data Warehouse?

    I’ll assume if you’ve read this far, you know what a SQL database is. But an Azure Data Warehouse is a slightly different beast; it’s another hosted/managed service offered on Microsoft’s Azure cloud infrastructure. They market it as a distributed & elastic database that can support petabytes of data, while offering enterprise-class features.

    Diagram illustrating the architecture of SQL Data Warehouse, including connections to various data sources such as SQL databases, Azure Tables, and Azure Data Lake.

    That essentially means that behind the scenes this “database” is actually a bunch of orchestrated nodes working together (a control node, multiple compute nodes, storage, etc).  Queries against this distributed database are themselves split up and run in parallel across these nodes – i.e. “MPP”, or Massively Parallel Processing. That’s very much it in a nutshell – for a lot more detail though, read this as well.
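
    You do see a hint of that distribution when you create objects, though. As a rough illustration (the table and column names here are made up), a fact table in SQL Data Warehouse can be spread across the compute nodes by hashing a column, something like this:

    CREATE TABLE dbo.FactSales
    (
        SaleId      bigint         NOT NULL,
        CustomerId  int            NOT NULL,
        SaleDate    date           NOT NULL,
        Amount      decimal(18, 2) NOT NULL
    )
    WITH
    (
        -- Rows are distributed across the compute nodes by hashing CustomerId,
        -- and stored in a clustered columnstore index.
        DISTRIBUTION = HASH(CustomerId),
        CLUSTERED COLUMNSTORE INDEX
    );

    Joins and aggregations on the distribution key can then be largely satisfied within each node, which is where a lot of the MPP speed comes from.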

    Why use this over other solutions?

    I originally set up an old-school SSAS instance on an Azure VM, backed by a normal SQL Server data warehouse hosted on the same VM. Not very exciting, but it worked.  The struggle was that to get data from our production database (an Azure SQL Database) into this VM required either SSIS packages pulling data across the wire, or a restore of the prod database locally (i.e. onto the VM) and then extracting the data from that using cross-database queries.

    Then I read up on these relatively new Azure Data Warehouses, and I assumed that *surely* there would be a much simpler/better way of moving data directly from one to the other natively, within the “cloud”.

    “Cloud-to-cloud” ETL FTW!

    Tweet from David Curlewis asking if there is a cloud-native way to ETL data from an Azure SQL Database to Azure Data Warehouse, tagged with #azure #azuredw #thisshouldbeeasierthanitis.

    I asked the question, and the consensus seemed to be that Data Factory is the cool new way to move your data around the cloud.  So I gave that a crack. Be warned, you’ll need to brush up on JSON (since you’ll need to be comfy writing/modifying JSON to set up the data sources, control the pipelines, etc).

    All the examples I found seemed to involve Blob-to-SQL or SQL-to-Blob data loads.  So I figured out how the bits and pieces work together, how to customise the JSON to set up the correct sources, pipelines, etc, and then kicked it off.  It didn’t work… <sadface>

    The issues I ran into were definitely solvable (data type conversion issues mostly) – but given my noob-ness with JSON and Data Factory in general, as well as the fact that it felt really clunky when trying to change schema quickly, I decided to be boring and revert back to good ol’ SSIS instead.

    I feel like there’s a huge gap here for someone to build a simpler data load tool for this!  And yes, I did also try using the Data Factory “Copy Wizard” (still in preview at this stage). While it did allow me to set up a basic table copy, I then wanted to modify the JSON pipeline slightly due to some data type issues, and amusingly the Azure Portal threw an error when I saved my changes because the default quota limits pipeline JSON objects to 200KB, and mine was *just* over that. You can request to have this increased, but I couldn’t be bothered and basically ragequit at this point. 😛

    You see, the problem is that when you’re the sole infrastructure & database guy for a smallish start-up company, you don’t have time to spend a few days learning the ins-and-outs just to set up a basic data transfer. I need something that just works, quickly, so I can move on to solving tickets, optimising database performance, flying, checking on the test/dev environments, etc, etc, etc…

    I’ll keep an eye on the copy wizard though, as I’m sure they’ll improve it over time, and it seems to be the closest to what I’m looking for at this stage.

    It’s not all bad

    Having said all of that, I’m still sticking with SQL Data Warehouse as my BI/BA back-end, and have been impressed with the performance of loads (even just done via SSIS packages) as well as query performance.

    Screenshot of an SSIS package design interface, showing various data flow and control flow tasks for data integration.

    I made sure to split the data load aspects of my package up so as to utilise the parallel nature of SQL Data Warehouse, so I’m guessing that will be helping performance.  I’ve also built some proof-of-concept PowerBI dashboards over the top of the data warehouse, which was ridiculously easy (and quite satisfying).

    Let me know if you’ve had any similar experiences (good or bad) with loading data into SQL Data Warehouse, or moving data around within the cloud.

    Cheers,
    Dave

  • SQL Server 2016 & Azure Query Store

    I hadn’t been following the news much regarding SQL Server 2016, so when I did some reading the other day I was quite pleasantly surprised by some of the new features announced (not to mention that it looks like SSMS will finally be getting some much needed love). 🙂

    We’ve been having some frustrating intermittent performance issues recently, and I was struggling to gain much insight into what the issue was (since we’re using Azure SQL Databases, the scope for troubleshooting is a little narrower than for on-premises SQL Servers).  So when I read about the “Query Store” feature available in SQL Server 2016 and Azure SQL Database (v12), I got quite excited.

    What is it?

    I’ll keep this short and sweet since there are already a few good posts out there about the Query Store. Basically this feature allows you to track down queries which have “regressed” (i.e. a query that was performing well, and then all of a sudden turned to crap for no apparent reason).

    Not only can you track them down, you can now “pin” the old (i.e. good) execution plan.  Essentially you’re overriding the optimiser and telling it that you in fact know better.

    Sweet! How do I do it?

    You could do this before now by forcing plans using USE PLAN query hints, etc.  But the Query Store and its shiny new UI make it soooo much easier and more “trackable”.

    Dashboard displaying a summary of regressed queries in SQL Server, highlighting query execution durations and plans.

    As I said before though, I’m not going to go into details about how to use it. I used this post to figure out how it works, how to force plans, how to monitor how it’s going, etc.

    Ok, so how did it help?

    Our problem was that we were seeing intermittent DTU spikes (remember, we’re on Azure, so this means we were maxing out our premium-tier database’s resources in some way, whether CPU, Disk I/O, etc). We tracked it down to a heavy stored procedure call which was running well 99% of the time, but would get a “bad plan” every now and then.  So we would see a spike in our app response time in New Relic, I’d jump into a query window and run an sp_recompile on this proc, and usually the problem would go away (until the next time).

    Obviously this wasn’t a sustainable approach, so I needed to either rewrite the proc to make it more stable, tweak some indexes, or force a plan.  I fired up the new “Regressed Queries” report (shown above) and it quickly highlighted the problem query.  From there it was a case of selecting the “good” plan, and hitting the “Force Plan” button. Well… I don’t trust buttons so I actually ran the TSQL equivalent, sys.sp_query_store_force_plan.
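
    For reference, the T-SQL route looks roughly like the below. The procedure name and the query/plan IDs are placeholders – you’d plug in the values the first query gives you:

    -- Query Store has to be enabled on the database (this is the SQL Server 2016
    -- syntax; it may already be turned on for your Azure database).
    ALTER DATABASE [MyDatabase] SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE);

    -- Find the plans Query Store has captured for the problem procedure, slowest first.
    SELECT q.query_id, p.plan_id, rs.avg_duration, rs.count_executions, p.last_execution_time
    FROM sys.query_store_query AS q
    JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
    JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
    WHERE q.object_id = OBJECT_ID('dbo.MyHeavyProc')
    ORDER BY rs.avg_duration DESC;

    -- Pin the "good" plan using the query_id and plan_id from above.
    EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;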

    Visual representation of tracked queries in SQL Server Management Studio with execution plan and query performance metrics.

    Some interesting observations

    In the above image you can see the forced plan (circles with ticks in them). What seems to happen, which initially threw me, is that when you force a plan, SQL generates a new plan which matches the forced plan but is picked up as a different plan by the Query Store.  That’s why you see the ticks in the circles up until the point you actually pin the plan, after which you get a new, un-ticked plan.  At first I thought this meant it wasn’t working, but it does indeed stick to the forced plan.
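
    You can also double-check from the catalog views that the plan really is being forced (and hasn’t failed to force), along these lines – again, the query_id is a placeholder:

    SELECT plan_id,
           is_forced_plan,
           force_failure_count,
           last_force_failure_reason_desc,
           last_execution_time
    FROM sys.query_store_plan
    WHERE query_id = 42;   -- the query whose plan you forced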

    Other uses

    I’ve also found the reports very useful even when not resorting to forcing plans.  In several cases I’ve found queries which weren’t performing as well as they should have been, altered the query or some underlying indexes, and then watched the (usually) positive new plans appear – as shown in the image below, where this query was a bit all over the place until I slightly altered an existing index (added one included column), after which it settled on a better, more stable plan (in purple).
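
    The index change itself was nothing fancy – conceptually it was just recreating the existing index with one extra included column, along these lines (the table, index, and column names are made up for illustration):

    -- Rebuild the existing index in place, adding one included column so the
    -- query no longer needs a key lookup.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId, OrderDate)
    INCLUDE (OrderStatus)   -- the newly added included column
    WITH (DROP_EXISTING = ON, ONLINE = ON);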

    A graphical representation of SQL Server query performance over time, showing various plan IDs with distinct colors, highlighting performance fluctuations and trends.

    Cheers,
    Dave

  • SQL Server version information from build numbers

    Just a quick post in case anyone else finds this useful; for a report I’ve created, I wanted to have SQL Server versions displayed in a nice, readable form, rather than the usual build number format (e.g. 11.0.3128.0).

    Server dashboard report

    So I found the very useful SQL Server Builds page (thanks to whoever maintains that), and proceeded to write a function which accepts the build number, and spits out a friendlier “version string”, down to the service pack and cumulative update level of resolution. (I didn’t need anything lower than that, but you could easily add it in yourself).
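
    The real function (downloadable below) maps a long list of build ranges, but the shape of it is roughly this – a scalar function doing a build-number lookup (the function name and the handful of mappings here are just illustrative):

    CREATE FUNCTION dbo.fn_SqlVersionName (@BuildNumber varchar(20))
    RETURNS varchar(100)
    AS
    BEGIN
        -- Match the most specific build ranges first.
        RETURN
            CASE
                WHEN @BuildNumber LIKE '13.0.%'  THEN 'SQL Server 2016'
                WHEN @BuildNumber LIKE '12.0.%'  THEN 'SQL Server 2014'
                WHEN @BuildNumber LIKE '11.0.3%' THEN 'SQL Server 2012 SP1'
                WHEN @BuildNumber LIKE '11.0.%'  THEN 'SQL Server 2012'
                WHEN @BuildNumber LIKE '10.50.%' THEN 'SQL Server 2008 R2'
                ELSE 'Unknown (' + @BuildNumber + ')'
            END;
    END;
    GO

    -- e.g. SELECT dbo.fn_SqlVersionName('11.0.3128.0');  --> SQL Server 2012 SP1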

    Here’s the file, feel free to download, use, hack to pieces, etc. 😉

    function

    Cheers,
    Dave

  • Combining different database technologies

    Bike and a Bunny

    Well the past month or two have been “busy”; I bought an old motorbike and restored it (win!), I sold my main motorbike and bought a new one (bigger win!), I then crashed it (not badly, but enough to ruin my day – boo!), I got the repaired bike back 4 weeks later (yay!) – and that’s just the 2-wheeled events, not the several dental surgeries my 7yo daughter’s needed, or the old “should-we-buy-a-new-house” gem. Phew…

    Of course alongside the above events, work has been as busy as ever. I’m trying to figure out if it’s actually possible for it to be getting busier, when a year ago I thought it was already crazy-busy. Anyway, we’ve got a number of slightly more meaty projects on the go involving major architectural changes, upgrades, migrations, etc – which are stressful but more rewarding than most normal BAU work.

    Anyway – getting to the point of this post, one of the things I’ve been thinking about on the side (to some extent) has involved some NoSQL technologies. Well, one so far, in the form of Redis. Nothing concrete so far, just a VM on my laptop that I’ve been playing with.

    This got me thinking (a dangerous event at the best of times): is there a method of efficiently transferring data between disparate database systems, such as SQL Server and Redis, other than writing higher-level applications to handle the ETL process? I mean, that obviously works (and maybe I just don’t understand enough about how people currently do this), but it seems to me like there should be a more efficient, “lower level” method of pumping a stream of data out of SQL and into Redis (for example). Could a CLR procedure fill the gap here perhaps?
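
    For what it’s worth, the “higher level application” version of this is pretty terse these days. A minimal C# sketch using the StackExchange.Redis client (with a made-up dbo.SearchTerms key-value table) would be something like the below – which is exactly the sort of plumbing I’d love to see pushed down a level:

    using System.Data.SqlClient;
    using StackExchange.Redis;

    class SqlToRedisLoader
    {
        static void Main()
        {
            // Connection details and table name are illustrative only.
            var redis = ConnectionMultiplexer.Connect("localhost");
            IDatabase cache = redis.GetDatabase();

            using (var conn = new SqlConnection("Data Source=.;Initial Catalog=MyAppDb;Integrated Security=SSPI;"))
            {
                conn.Open();
                using (var cmd = new SqlCommand("SELECT TermKey, TermValue FROM dbo.SearchTerms;", conn))
                using (var reader = cmd.ExecuteReader())
                {
                    // Stream each key-value row straight into Redis.
                    while (reader.Read())
                    {
                        cache.StringSet(reader.GetString(0), reader.GetString(1));
                    }
                }
            }
        }
    }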

    The reason I was playing with this idea in the first place was a few loosely related bits of functionality: search autosuggest, a recommendation engine, and a context-sensitive spell checker. All of these require massive sets of fairly narrow data (i.e. key-value pairs) that need to be queried in a predictable manner, very very quickly. All of these are actually already implemented using SQL Server, and handle some not-to-be-sneezed-at loads as it is.

    So SQL Server obviously works to an extent, and performance can be increased significantly by using solid-state storage (SSDs or cards like the FusionIO range), or even further by using system memory (via some kind of RAM drive software). The data itself isn’t mission critical, so redundancy isn’t as crucial, meaning volatile storage can be used.
    But at some point it becomes a lot more difficult to scale SQL to handle increasing load (financially and practically), which is where I think solutions like Redis, MemCacheDB, and quite a few others fit the bill perfectly.

    This is just one example where a key-value store makes sense (to me at least), although there are of course other scenarios where you might need a document store instead (like MongoDB) or the Map/Reduce batch-processing power of something like HBase. In fact, Microsoft already has a Hadoop (or is that “an Hadoop…“?) connector, so I wonder if we will eventually have connectors for other systems as well?

    I guess for now I’m resigned to brushing up on my horribly dusty and inadequate .Net dev skills in order to try out my wacky grand plans of disparate databases talking to each other nicely like good little children. Any devs feel like giving me a hand? :-p

    One random, rambling post; check. 😉

    Cheers
    Dave