Tag: azure

  • Azure SQL to Azure Data Warehouse ETL

    I’ve recently needed to move data from our transactional database (an Azure SQL Database) into an Azure SQL Data Warehouse. A definite case of “harder than it needed to be”…

    What’s an Azure Data Warehouse?

    I’ll assume if you’ve read this far, you know what a SQL database is. But an Azure Data Warehouse is a slightly different beast; it’s another hosted/managed service offered on Microsoft’s Azure cloud infrastructure. They market it as a distributed & elastic database that can support petabytes of data, while offering enterprise-class features.

    [Image: Diagram illustrating the architecture of SQL Data Warehouse, including connections to various data sources such as SQL databases, Azure Tables, and Azure Data Lake.]

    That essentially means that behind the scenes this “database” is actually a bunch of orchestrated nodes working together (a control node, multiple compute nodes, storage, etc).  Queries against this distributed database are themselves split up and run in parallel across these nodes – i.e. “MPP”, or Massively Parallel Processing. That’s very much it in a nutshell – for a lot more detail though, read this as well.
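
    To make that a bit more concrete, the MPP design shows up in plain T-SQL: when you create a table you tell SQL Data Warehouse how to spread its rows across the compute nodes. Here’s a minimal sketch (the table and column names are made up for illustration):

    CREATE TABLE dbo.FactSales
    (
        SaleId     BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        SaleDate   DATE          NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerId),  -- rows are hash-distributed across the nodes by CustomerId
        CLUSTERED COLUMNSTORE INDEX       -- the usual storage choice for large fact tables
    );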

    Why use this over other solutions?

    I originally set up an old-school SSAS instance on an Azure VM, backed by a normal SQL Server data warehouse hosted on the same VM. Not very exciting, but it worked.  The struggle was that getting data from our production database (an Azure SQL Database) into this VM required either SSIS packages pulling data across the wire, or restoring the prod database locally (i.e. onto the VM) and then extracting the data from it using cross-database queries.

    Then I read up on these relatively new Azure Data Warehouses, and I assumed that *surely* there would be a much simpler/better way of moving data directly from one to the other natively, within the “cloud”.

    “Cloud-to-cloud” ETL FTW!

    [Image: Tweet from David Curlewis asking if there is a cloud-native way to ETL data from an Azure SQL Database to Azure Data Warehouse, tagged with #azure #azuredw #thisshouldbeeasierthanitis.]

    I asked the question, and the consensus seemed to be that Data Factory is the cool new way to move your data around the cloud.  So I gave that a crack. Be warned, you’ll need to brush up on JSON (since you’ll need to be comfy writing/modifying JSON to set up the data sources, control the pipelines, etc).

    All the examples I found seemed to involve Blob-to-SQL or SQL-to-Blob data loads.  So I figured out how the bits and pieces work together, how to customise the JSON to set up the correct sources, pipelines, etc, and then kicked it off.  It didn’t work… <sadface>

    The issues I ran into were definitely solvable (data type conversion issues mostly) – but given my noob-ness with JSON and Data Factory in general, as well as the fact that it felt really clunky when trying to change schema quickly, I decided to be boring and revert back to good ol’ SSIS instead.

    I feel like there’s a huge gap here for someone to build a simpler data load tool for this!  And yes, I did also try using the Data Factory “Copy Wizard” (still in preview at this stage). While it did allow me to set up a basic table copy, I then wanted to modify the JSON pipeline slightly due to some data type issues, and amusingly the Azure Portal threw an error when I saved my changes because the default quota limits pipeline JSON objects to 200KB, and mine was *just* over that. You can request to have this increased, but I couldn’t be bothered and basically ragequit at this point. 😛

    You see, the problem is that when you’re the sole infrastructure & database guy for a smallish start-up company, you don’t have time to spend a few days learning the ins and outs just to set up a basic data transfer. I need something that just works, quickly, so I can move on to solving tickets, optimising database performance, flying, checking on the test/dev environments, etc, etc, etc…

    I’ll keep an eye on the copy wizard though, as I’m sure they’ll improve it over time, and it seems to be the closest to what I’m looking for at this stage.

    It’s not all bad

    Having said all of that, I’m still sticking with SQL Data Warehouse as my BI/BA back-end, and have been impressed with the performance of loads (even just done via SSIS packages) as well as query performance.

    [Image: Screenshot of an SSIS package design interface, showing various data flow and control flow tasks for data integration.]

    I made sure to split the data load aspects of my package up so as to utilise the parallel nature of SQL Data Warehouse, so I’m guessing that will be helping performance.  I’ve also built some proof-of-concept PowerBI dashboards over the top of the data warehouse, which was ridiculously easy (and quite satisfying).

    Let me know if you’ve had any similar experiences (good or bad) with loading data into SQL Data Warehouse, or moving data around within the cloud.

    Cheers,
    Dave

  • Keeping track of your SQL Server index “inventory”

    Something’s always bugged me when it comes to managing indexes in SQL Server: keeping track of when I created or dropped them.

    You can already check just about everything you could want to know about indexes via handy DMVs.  Need to figure out how much your indexes are being used? Take a look at sys.dm_db_index_usage_stats.  Or how about which indexes you might need to add to improve performance? Easy, just query sys.dm_db_missing_index_group_stats! You get the idea…
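
    As a quick example, a query along these lines shows how the indexes in the current database are actually being used (the ORDER BY is just my preference):

    -- How much is each index in the current database being read vs written?
    SELECT  OBJECT_NAME(us.[object_id]) AS table_name,
            i.name                      AS index_name,
            us.user_seeks,
            us.user_scans,
            us.user_lookups,
            us.user_updates
    FROM    sys.dm_db_index_usage_stats AS us
    JOIN    sys.indexes AS i
            ON  i.[object_id] = us.[object_id]
            AND i.index_id    = us.index_id
    WHERE   us.database_id = DB_ID()
    ORDER BY us.user_updates DESC;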

    But what about the create date of an index, or when it was last rebuilt, or even when you dropped an index? For this you need to roll up your sleeves and roll your own solution.

    How to skin this cat?

    There are a few ways we can do this.

    Scheduled job

    The first way I used to do this was to just have a scheduled job running fairly regularly (like every 15 – 30 minutes) which checked for any changes to indexes in the database since the last time it ran. Any new ones would be added to the table, changes would be recorded, and dropped indexes would be noted as such.  In fact, I used a version of Kimberly Tripp’s “improved sp_helpindex” to gather and store the index information in a nice format (i.e. with separate columns for included columns, compression, etc).

    This is what the “guts” of the proc look like, just to give you an idea:

    DECLARE TimelyTables CURSOR FAST_FORWARD FOR
    		SELECT	DISTINCT 
    				QUOTENAME(ss.name) + '.' + QUOTENAME(st.name) AS TableName,
    				st.[object_id]
    		FROM	sys.tables AS st
    		JOIN	sys.schemas AS ss ON st.[schema_id] = ss.[schema_id]
    		WHERE	st.is_ms_shipped = 0;
    
    	OPEN TimelyTables;
    	FETCH NEXT FROM TimelyTables INTO @TableName, @ObjectId;
    	WHILE @@FETCH_STATUS = 0
    	BEGIN
    		TRUNCATE TABLE #tt_Indexes;
    
    		RAISERROR('Table: %s (%i)',10,1,@TableName,@ObjectId);
    
    		INSERT	#tt_Indexes
    		EXEC	dbo.sp_helpindex2 @TableName; 
    
    		IF @@ROWCOUNT > 0 
    		BEGIN 
    			INSERT	#index_history ([object_id], table_name, index_id, is_disabled, index_name, index_description, index_keys, included_columns, filter_definition, [compression]) 
    			SELECT	@ObjectId, @TableName, t.index_id, t.is_disabled, t.index_name, t.index_description, t.index_keys, t.included_columns, t.filter_definition, t.[compression] 
    			FROM	#tt_Indexes AS t;  
    		END 
    
    		FETCH NEXT FROM TimelyTables INTO @TableName, @ObjectId;
    	END

    Outside of this loop you can then do your MERGE comparison between “current” indexes, and what was recorded in the “index history” table from the previous run.
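
    The MERGE itself isn’t shown above, but a heavily simplified sketch of the idea looks something like this (dbo.IndexInventory and its columns are placeholder names, not the real proc):

    -- Compare what was just captured (#index_history) against the stored inventory
    MERGE   dbo.IndexInventory AS target
    USING   #index_history     AS source
            ON  target.[object_id] = source.[object_id]
            AND target.index_name  = source.index_name
    WHEN NOT MATCHED BY TARGET THEN        -- a brand new index
        INSERT ([object_id], table_name, index_name, index_keys, included_columns, created_date)
        VALUES (source.[object_id], source.table_name, source.index_name,
                source.index_keys, source.included_columns, GETUTCDATE())
    WHEN MATCHED AND (target.index_keys <> source.index_keys
                  OR  target.included_columns <> source.included_columns) THEN  -- a changed index
        UPDATE SET index_keys       = source.index_keys,
                   included_columns = source.included_columns,
                   last_modified    = GETUTCDATE()
    WHEN NOT MATCHED BY SOURCE THEN        -- the index no longer exists, so flag it as dropped
        UPDATE SET dropped_date = GETUTCDATE();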

    DDL trigger

    DDL triggers are nothing new, but they can be very useful for auditing schema and login changes, etc. So it makes sense that this is ideally suited to creating an inventory of your indexes (if not all database objects that you might be interested in).  In fact, you can even quite easily create your own poor-man’s source control system, but that’s a different kettle of fish.

    The idea behind DDL triggers is that you specify which ‘events’ you want them to fire for, at a database or server level.  In my case, working with Azure, I’m only interested in database-level events; more specifically, in recording the details of any CREATE INDEX, ALTER INDEX, or DROP INDEX statements.  Which looks like this:

    CREATE TRIGGER trg_IndexChangeLog ON DATABASE 
        FOR CREATE_INDEX, ALTER_INDEX, DROP_INDEX 
    AS 
    ...

    Now we just insert the EVENTDATA data into our logging table, like so:

    IF OBJECT_ID('dba.IndexChangeLog') IS NOT NULL 
    BEGIN
    	DECLARE @data XML; 
    	SET @data = EVENTDATA(); 
    
    	INSERT	dba.IndexChangeLog(eventtype, objectname, objecttype, sqlcommand, loginname) 
    	VALUES	( 
    			@data.value('(/EVENT_INSTANCE/EventType)[1]', 'varchar(50)'), 
    			@data.value('(/EVENT_INSTANCE/ObjectName)[1]', 'varchar(256)'), 
    			@data.value('(/EVENT_INSTANCE/ObjectType)[1]', 'varchar(25)'), 
    			@data.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'varchar(max)'), 
    			@data.value('(/EVENT_INSTANCE/LoginName)[1]', 'varchar(256)') 
    			); 
    END;
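
    For reference, there’s nothing special about the dba.IndexChangeLog table itself; a minimal sketch (the column types and the extra logdate column are my own choices, so adjust to taste) would be:

    CREATE TABLE dba.IndexChangeLog
    (
        id         INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        eventtype  VARCHAR(50)   NOT NULL,
        objectname VARCHAR(256)  NOT NULL,
        objecttype VARCHAR(25)   NOT NULL,
        sqlcommand VARCHAR(MAX)  NOT NULL,
        loginname  VARCHAR(256)  NOT NULL,
        logdate    DATETIME      NOT NULL DEFAULT (GETUTCDATE())  -- when the change was recorded
    );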

    EVENTDATA is basically an XML document that contains the important bits you might want to record for auditing purposes… like who ran what, when.  This is what’s available:

    <EVENT_INSTANCE>
        <EventType>event </EventType>
        <PostTime>date-time</PostTime>
        <SPID>spid</SPID>
        <ServerName>name </ServerName>
        <LoginName>login </LoginName>
        <UserName>name</UserName>
        <DatabaseName>name</DatabaseName>
        <SchemaName>name</SchemaName>
        <ObjectName>name</ObjectName>
        <ObjectType>type</ObjectType>
        <TSQLCommand>command</TSQLCommand>
    </EVENT_INSTANCE>

    Best of both worlds?

    The above two methods are ones I’ve already used successfully, and each offers pros and cons.  I like the detail and “query-ability” of the data I get from the scheduled job solution, but the DDL trigger is simpler and doesn’t rely on scheduled jobs running.  The trigger also clearly has a big advantage in that it picks up the actual event as it happens, whereas the job may miss changes between executions. This may or may not be important to you.

    There may be a nice way of combining the two though.  Using Event Notifications, or DDL triggers that insert a payload onto a Service Broker queue, you could have a live and asynchronous system which gathers more detail than what’s available in the EVENTDATA.  For example, you could have an activated procedure on the end of the SB queue which uses the index name to look up and record additional details. Let me know in the comments if you give this a go, or if you can see any gotchas.
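
    As a rough, untested sketch of that idea (the object names are placeholders, it assumes Service Broker is available, and it’s written for boxed SQL Server, so support on Azure SQL Database may differ):

    -- A queue and service to receive the event notification messages
    CREATE QUEUE dba.IndexEventQueue;

    CREATE SERVICE IndexEventService
        ON QUEUE dba.IndexEventQueue
        ([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]);

    -- Fire a notification (instead of a trigger) for the same index events
    CREATE EVENT NOTIFICATION IndexEventNotification
        ON DATABASE
        FOR CREATE_INDEX, ALTER_INDEX, DROP_INDEX
        TO SERVICE 'IndexEventService', 'current database';

    -- An activated procedure on dba.IndexEventQueue would then RECEIVE the EVENTDATA
    -- payload asynchronously and enrich it with whatever extra index details you want.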

    Cheers,
    Dave

  • PowerShell script to update Azure firewall rule

    *Updated 2017-06-28

    I’ve recently moved house, and as a result had to change my broadband plan from cable to ADSL (sad face). This also means I’ve gone from having a fixed IP address to a dynamically assigned one. Usually this wouldn’t be a problem, except when it comes to connecting to the several Azure servers that I manage on a daily basis. Now I need to use the Azure Portal to manually change each server’s firewall settings at least once or twice a week. Painful…

    So I quickly threw together this PS script to do the job for me and thought others out there might find it useful too.

    How’s it work?

    The script accepts an array of Azure SQL Server names, finds your external IP address using ipinfo.io, and then loops through the list of servers. You’ll need to provide a default rule name or modify the function call to pass it in (maybe include it in the array if it’s different for each server?).

    It then checks the current IP address of the specified rule and, if it’s different to your external IP address, updates the firewall rule for you. #Magic

    Import-Module SQLPS -DisableNameChecking 
    Import-Module Azure 
    
    # Run Get-AzurePublishSettingsFile first to download the publish settings file for your Azure subscription
    # Full instructions here: https://docs.microsoft.com/en-us/powershell/module/azure/get-azurepublishsettingsfile?view=azuresmps-4.0.0
    
    Import-AzurePublishSettingsFile "C:\My_oresome_path\Sweet-as-publish-settings-file.publishsettings" # <-- put the path to your publish settings file here
    
    # Now just add your server names to this array... or get fancy and look them up somehow, 
    # whether from a simple text file or something more exotic.
    [array]$AzureServers = @('servername01','servername02','servername03','servername04'); 
    
    # Just a little function to get your current external/public IP address
    function Get-MyIpAddress
    {
        $ip = Invoke-RestMethod http://ipinfo.io/json | Select -exp ip
        return $ip;
    }
    
    # This function does the work of changing the whitelist if necessary
    function Update-MyAzureFirewallRule 
    {
        Param (
            [Parameter(Mandatory=$True,ValueFromPipeline=$True,ValueFromPipelinebyPropertyName=$True)]
            [string]$ServerName,
            [string]$RuleName = 'Put_Your_Rule_Name_Here',
            [string]$IpAddress
            )
    
        # Gets the current rule (so we can see what the IP address is currently set to)
        $CurrentRule = Get-AzureSqlDatabaseServerFirewallRule -RuleName $RuleName -ServerName $ServerName; 
        $CurrentIp = $CurrentRule.StartIpAddress
    
        # If your current IP doesn't match what's in the whitelist, then update it
        if ($CurrentIp -ne $IpAddress)
        {
            Write-Host "Setting firewall rule '$RuleName' on server '$ServerName' to IP address '$IpAddress' (was '$CurrentIp')..."
            Set-AzureSqlDatabaseServerFirewallRule -StartIPAddress $IpAddress -EndIPAddress $IpAddress -RuleName $RuleName -ServerName $ServerName;
        }
        
    }
    
    if ($IpAddress = Get-MyIpAddress)
    {
        Write-Host "Your IP address is $IpAddress"
    
        foreach ($s in $AzureServers) 
        {
            Update-MyAzureFirewallRule -ServerName $s -IpAddress $IpAddress;
        }
    }

    This post provided the inspiration, which I then tweaked to suit my needs. Like I said, it’s quick-and-dirty, so use at your own risk. 😉  I’m no PowerShell guru either, so feel free to let me know if you improve on it.

    Cheers,
    Dave

  • SQL Server 2016 & Azure Query Store

    I hadn’t been following the news much regarding SQL Server 2016, so when I did some reading the other day I was quite pleasantly surprised by some of the new features announced (not to mention that it looks like SSMS will finally be getting some much needed love). 🙂

    We’ve been having some frustrating intermittent performance issues recently, and I was struggling to gain much insight into what the issue was (since we’re using Azure Databases, the scope for troubleshooting is a little narrower than for on-premises SQL Servers).  So when I read about the “Query Store” feature available in SQL Server 2016 and Azure Database (v12) I got quite excited.

    What is it?

    I’ll keep this short and sweet since there’s already a few good posts out there about the Query Store. Basically this feature allows you to track down queries which have “regressed” (i.e. they were performing well, and then all of a sudden turned to crap for no apparent reason).

    Not only can you track them down, you can now “pin” the old (i.e. good) execution plan.  Essentially you’re overriding the optimiser and telling it that you in fact know better.

    Sweet! How do I do it?

    You could do this before now by forcing plans using USE PLAN query hints, etc.  But the Query Store and its related shiny new UI make it soooo much easier and “trackable”.

    [Image: Dashboard displaying a summary of regressed queries in SQL Server, highlighting query execution durations and plans.]

    As I said before though, I’m not going to go into details about how to use it. I used this post to figure out how it works, how to force plans, how to monitor how it’s going, etc.

    Ok, so how did it help?

    Our problem was that we were seeing intermittent DTU spikes (remember, we’re on Azure, so this means we were maxing out our premium-tier database’s resources in some way, whether CPU, Disk I/O, etc). We tracked it down to a heavy stored procedure call which was running well 99% of the time, but would get a “bad plan” every now and then.  So we would see a spike in our app response time in New Relic, I’d jump into a query window and run an sp_recompile on this proc, and usually the problem would go away (until the next time).

    Obviously this wasn’t a sustainable approach, so I needed to either rewrite the proc to make it more stable, tweak some indexes, or force a plan.  I fired up the new “Regressed Queries” report (shown above) and it quickly highlighted the problem query.  From there it was a case of selecting the “good” plan, and hitting the “Force Plan” button. Well… I don’t trust buttons so I actually ran the TSQL equivalent, sys.sp_query_store_force_plan.
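
    If you’d rather do the same in T-SQL, the gist of it is a couple of queries against the Query Store catalog views; the search term and the query/plan IDs below are placeholders, so substitute your own:

    -- Find the query and its plans in the Query Store
    SELECT  qsq.query_id,
            qsp.plan_id,
            qst.query_sql_text
    FROM    sys.query_store_query AS qsq
    JOIN    sys.query_store_plan AS qsp
            ON qsp.query_id = qsq.query_id
    JOIN    sys.query_store_query_text AS qst
            ON qst.query_text_id = qsq.query_text_id
    WHERE   qst.query_sql_text LIKE '%YourProblemProc%';  -- placeholder search term

    -- Pin the known-good plan for that query
    EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;    -- placeholder IDs

    -- ...and if you change your mind later
    EXEC sys.sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;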

    [Image: Visual representation of tracked queries in SQL Server Management Studio with execution plan and query performance metrics.]

    Some interesting observations

    In the above image you can see the forced plan (circles with ticks in them). What seems to happen, which initially threw me, is that when you force a plan, SQL generates a new plan which matches the forced one but is picked up as a different plan by the Query Store.  That’s why you see the ticks in the circles up until the point you actually pin the plan, after which you get a new, un-ticked plan.  At first I thought this meant it wasn’t working, but it does indeed seem to stick to the forced plan.

    Other uses

    I’ve also found the reports very useful even when not resorting to forcing plans.  In several cases I’ve found queries which weren’t performing as well as they should have been, altered the query or some underlying indexes, and then watched the resulting (usually positive) new plans appear; as shown in the image below, where this query was a bit all over the place until I slightly altered an existing index (added one included column), after which it settled on a better, more stable plan (in purple).

    [Image: A graphical representation of SQL Server query performance over time, showing various plan IDs with distinct colors, highlighting performance fluctuations and trends.]

    Cheers,
    Dave

  • Red Gate Azure backup failures

    We use Red Gate Cloud Services to back up our Azure database, as well as some storage accounts to Amazon EC2 storage, etc.  It all works really well 99% of the time, but after a few failures in a short time period I opened a ticket with Red Gate.

    The errors we got weren’t always the same though…

    “Microsoft Backup Service returned an error: Error encountered during the service operation. Database source is not a supported version of SQL Server”

    “Microsoft import/export service returned an error. BadRequest Error encountered during the service operation.”

    “Expected to drop temporary database ‘<database-name>’ but it didn’t exist, this shouldn’t happen. Please check for a temporary database on your SQL Azure Server”

    … as well as a few others.

    In Azure you don’t do a “backup” as such. The process is more like: create a copy of your live database, export that copy to a bacpac file, then drop the copy. It seemed like the issue here was usually with the last step, where it tries to drop the database copy.
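
    For context, the copy and drop steps boil down to something like this (the database names are placeholders, and the bacpac export itself is done by the import/export service rather than in T-SQL):

    -- 1. Create a transactionally consistent copy of the live database (run against master)
    CREATE DATABASE MyDb_Copy AS COPY OF MyDb;

    -- 2. The import/export service exports MyDb_Copy to a .bacpac file in blob storage

    -- 3. Once the export finishes, the copy should be dropped
    DROP DATABASE MyDb_Copy;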

    Red Gate basically said the errors came from Microsoft’s side of the fence, so I had to open a ticket with them instead.  This made me want to curl up into the foetal position and cry. Instead I forged on and logged the ticket, and after the back-and-forth time-wasting dance they make everyone do, we started getting somewhere (although it wasn’t great news).  They’d passed the ticket to the “product group”, but they had some “internal issues with the log mechanism for the export/import service” so troubleshooting was proving difficult. *sigh*

    Anyway, eventually they suggested that one possible cause of at least some of the errors was that the Red Gate service was trying to access the database copy before it had finished being created. So Red Gate are investigating that at the moment, and I’ll update this post when I hear back from them.  This doesn’t explain the errors regarding the dropping of the database copy (which means that sometimes we end up with a random database copy sitting on the server, costing us money until I manually delete it).

    Hopefully I’ll have some further updates on this soon.

    Dave

  • Head in the clouds

    Wow – 2 years and 4 days with no posts.  What a slacker…

    Anyway… I’ve been busy doing a few things in that time.  The most relevant probably being that I moved on from Trade Me near the end of 2014, and now work for Timely.  This is a small company started a couple of years ago by some guys I used to work with at Trade Me, and we offer an online booking/scheduling system for SMBs around the world.

    The interesting change for me from a technical perspective is that Timely is entirely based in “the Cloud” – or Microsoft Azure to be specific.  We have a blog post here which explains our systems at a high level (and even includes a Visio diagram put together by yours truly). 🙂

    That’s already a little out of date though… that’s one of the things about IaaS and PaaS; it’s very easy to move quickly.  One minute there’s nothing, an hour later you’ve spun up a new SQL Server virtual machine with a couple of extra data disks, and configured Reporting Services!

    This is awesome in that you can go from “Hey, I’ve got this cool idea!” to deployed into production in days (or even hours).  Proof-of-concept systems can be spun up and shut down with minimal effort, and Microsoft are pretty good these days at releasing frequent updates to Azure functionality; meaning there’s always fresh temptation to try out the latest beta-features.

    It can also be bad though. If it’s easy to spin up new servers and services, you’ll do it more often, and then you’ve got to manage them (oh yeah, and pay for them). So I’m finding myself getting elbow-deep in PowerShell more than I ever have before in order to script and automate things as much as possible.  Might not seem necessary when you’ve got 1 or 2 servers, but before you know it you’ll have half a dozen or more. 😉

    Anyway, I’ll leave it at that for now since I’ve got a few posts lined up to cover some of these topics in more detail.

    Cheers,
    Dave