Parallel Data Warehouse (SQL Servers Big Brother) Presentation

Faster is better. That’s the rule that applies to pretty much everything: cars, broadband, bar service, data warehouses… I went along to Microsoft last week with this in mind as I knew already that their Parallel Data Warehouse is all about speed and they were keen to demo it.
A little bit late to the party it may be but 4 years after being for sale in other Western markets, Microsoft’s Parallel Data Warehouse (PDW) has entered the Australian market. It sits at the extreme end of the Microsoft Data Warehouse product range:
MSDWOfferings
Presented by ex Teradata employee Matthew Winter, it was a truly impressive fact fest that shows what can be achieved performance-wise when you let the vendors configure everything and then lock the box so no pesky client types can tinker in it.
  • Sold as an appliance  (pre-configured hardware & software all in 1 big box)
  • Contains control nodes, compute nodes & storage nodes (several nodes per rack)
  • Compute nodes are virtual machines with their own SQL Server instances
  • Up to 9 compute nodes per rack
  • Virtual machines run in Hyper-V
  • Hyper-V limited to 64 hosts, which is only thing that limits PDW size
  • Virtual machines watched by MS system center, 99.8% same performance as non virtual
  • Uses MPP – massively parallel processing, instead of current Symmetric Multi-Processing
  • Storage: JBOD – just a bunch of disks, not raid
  • Max storage: 6 Ptbytes
  • Cheap data drives, 70 disks, 32 per comp node plus hot swappable
  • Labelled SQL 2012 but actually 2014
  • Base unit is half rack
  • Full rack : 2 Tb processing per hour
  • Shell database records  meta data on a control node
  • Compute nodes do work
  • SQL execution plans are injected with special parallel instructions
  • Cluster column storage used
  • Don’t forget batch mode for column store
  • Won’t be batch mode if you only use auto stats. Collect your own stats
  • Polybase lets you write SQL against Hadoop and PDW, combining data
  • PDW 1000 * faster than Hadoop so good to load from Hadoop first
  • Migration is simple – schema needs distribution keys, nc indexes removed.
  • Identity columns not supported.
  • Price based on 32 core SQL license
  • Microsoft currently training 5 core partners

PDWAppliance

If your pockets are deep enough and need to process data at the kind of speed that is likely to set your hair on fire then PDW is for you. It is ultra appealing to have everything pre-configured and not have to worry about it.

There is only a handful of companies using it in Australia so far but that will change as there is a clear market for it, as demonstrated by the competitors already at the party – Oracle, Teradata, IBM etc…

Hands-On: The Fastest DBMS In The World

The wild world of IT consultancy is full of swings and roundabouts but maybe the best pro from a technology point of view is getting to get to work with a broad range of interesting technologies on the clients behalf. Recently I was on a client site when they got a team from Actian to do a POC on premises of Vectorwise – an very high performance DBMS that I had never encountered in the wild. The best was after they had set up the POC, we (people on the client site) were going to be able to play with it in order to compare it against other alternatives.

First it is important to understand what Vectorwise does well – it is not a database cure-all that does everything excellently. It does analytical style querying well, very, very well. The industry standard point of reference for database performance is the mundanely titled Transaction Processing Perforormance Council (TPC). They have a handful of different comparisons of different database workloads. The most relevent within the data warehouse/BI space is the TPC-H benchmark:

This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

This basically gives us our data warehouse style world championship leaderboard of performance and cost. It looks like this:

TPC-H RESULT (2)

Got your attention yet? Thought so.

Cutting to the chase of our experiments on the client site, Vectorwise was running the heaviest queries the finance team relied on up to 100 x faster than the SQL Server being used already. This wasn’t entirely fair as Vectorwise sat on SSDs and had a bit more memory but those aren’t going to make SQL Server 100 x faster. Less intensive queries ran around 10 x faster than SQL Server, sometimes a little less. The Vectorwise consultants told us a little about the way the CPU cache is used exclusively for processing, which is what gives it it’s large performance advantage. Impressive stuff.

Related: Gartner Data Warehouse vendor magic quandrants

We found problems though, firstly run away queries, which the Vectorwise guys came in and promptly fixed through configuration changes.

The largest problem for the client was potentially user adoption as the variety of SQL used was a subset of ANSI SQL that lacks procedures, functions and data-types – the client site’s analysts (the people that would be writing the SQL) used T-SQL and Informix. Re-training a large team rarely seems like an inviting task.

The final problem was an operational one – the particular server we had set up for us was on a Linux server which would be great if there were any other Linux servers in the company or people skilled in Linux administration. There was a windows flavour available which would have eradicated this problem but was not forthcoming for mysterious management reasons.

It seems like a product that on the one hand provides jaw dropping query performance but only the other gives a slightly buggy, feature poor experience. If it can iron out these headaches before it’s competitors can provide the same performance then Vectorwise will be a lot more prominent in the DW space.