I’ve recently needed to move data from our transactional database (an Azure SQL database), into an Azure SQL Data Warehouse. A definite case of “harder than it needed to be”…
What’s an Azure Data Warehouse?
I’ll assume if you’ve read this far, you know what a SQL database is. But an Azure Data Warehouse is a slightly different beast; it’s another hosted/managed service offered on Microsoft’s Azure cloud infrastructure. They market it as a distributed & elastic database that can support petabytes of data, while offering enterprise-class features.
That essentially means that behind the scenes this “database” is actually a bunch of orchestrated nodes working together (a control node, multiple compute nodes, storage, etc). Queries against this distributed database are themselves split up and run in parallel across these nodes – i.e. “MPP”, or Massively Parallel Processing. That’s very much it in a nutshell – for a lot more detail though, read this as well.
Why use this over other solutions?
I originally set up an old-school SSAS instance on an Azure VM, backed by a normal SQL Server data warehouse hosted on the same VM. Not very exciting, but it worked. The struggle was that to get data from our production database (an Azure SQL Database) into this VM required either SSIS packages pulling data across the wire, or a restore of the prod database locally (i.e. onto the VM) and then extracting the data from that using cross-database queries.
Then I read up on these relatively new Azure Data Warehouses, and I assumed that *surely* there would be a much simpler/better way of moving data directly from one to the other natively, within the “cloud”.
“Cloud-to-cloud” ETL FTW!
I asked the question, and the consensus seemed to be that Data Factory is the cool new way to move your data around the cloud. So I gave that a crack. Be warned, you’ll need to brush up on JSON (since you’ll need to be comfy writing/modifying JSON to setup the data sources, control the pipelines, etc).
All the examples I found seem to involve Blob-to-SQL, or SQL-to-Blob data loads. So I figured out how the bits and pieces work together, how to customise the JSON to setup the correct sources, pipelines, etc, and then kicked it off. It didn’t work… <sadface>
The issues I ran into were definitely solvable (data type conversion issues mostly) – but given my noob-ness with JSON and Data Factory in general, as well as the fact that it felt really clunky when trying to change schema quickly, I decided to be boring and revert back to good ol’ SSIS instead.
I feel like there’s a huge gap here for someone to build a simpler data load tool for this! And yes, I did also try using the Data Factory “Copy Wizard” (still in preview at this stage). While it did allow me to setup a basic table copy, I then wanted to modify the JSON pipeline slightly due to some data type issues, and amusingly the Azure Portal threw an error when I saved my changes because the default quota limits pipeline JSON objects to 200KB, and mine was *just* over that. You can request for this to be increased, but I couldn’t be bothered and basically ragequit at this point.
You see, the problem is that when you’re the sole infrastructure & database guy for a smallish start-up company, you don’t have time to spend a few days learning the ins-and-outs just to setup a basic data transfer. I need something that just works, quickly, so I can move on to solving tickets, optimising database performance, flying, checking on the test/dev environments, etc, etc, etc…
I’ll keep an eye on the copy wizard though, as I’m sure they’ll improve it over time, and it seems to be the closest to what I’m looking for at this stage.
It’s not all bad
Having said all of that, I’m still sticking with SQL Data Warehouse as my BI/BA back-end, and have been impressed with the performance of loads (even just done via SSIS packages) as well as query performance.
I made sure to split the data load aspects of my package up so as to utilise the parallel nature of SQL Data Warehouse, so I’m guessing that will be helping performance. I’ve also built some proof-of-concept PowerBI dashboards over the top of the data warehouse, which was ridiculously easy (and quite satisfying).
Let me know if you’ve had any similar experiences (good or bad) with loading data into SQL Data Warehouse, or moving data around within the cloud.
Cheers,
Dave