Summary: the new release of Vertica's columnar database can store several related data elements in a single column. Didn't we use to call that a row-oriented database? The benefits seem limited.
Analytical database vendor Vertica yesterday announced its 3.5 release. The main feature is a new architecture called "FlexStore", which can combine several data elements into a single column. This is done for columns that are commonly used together in the same query, such as the “bid” and “asked” prices on a stock transaction or the dimension tables in a star schema (to use the company’s examples).
I was skeptical of this notion when Vertica briefed me two weeks ago, and still am today. Storing multiple elements together is what a row-oriented database does, so it seems fundamentally at odds with Vertica’s column-based model. More concretely, a columnar database scans all entries for each column during a query, so its speed is basically determined by the amount of data. Whether it scans two columns that are one terabyte each or one combined column of two terabytes, it’s still scanning the same two terabytes.
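The scan-cost argument above can be reduced to trivial arithmetic. The sketch below is illustrative only (the sizes are invented, not Vertica benchmarks): a column store reads every entry of each referenced column, so the bytes scanned are the same either way.

```python
# Illustrative arithmetic for the scan-cost argument: a columnar engine
# reads all entries of each column a query touches, so what matters is
# total bytes scanned. Sizes here are made up for illustration.

TB = 1024 ** 4  # one terabyte in bytes

bid_column = 1 * TB
asked_column = 1 * TB

# Stored as two separate one-terabyte columns:
scanned_separate = bid_column + asked_column

# Stored as one combined two-terabyte column:
scanned_combined = bid_column + asked_column

# Same two terabytes scanned in both layouts.
assert scanned_separate == scanned_combined == 2 * TB
```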
Vertica offered two responses to my doubts. One is that it can better compress the data when the two columns are combined, for example by using delta encoding (storing only the change from one value to the next). I’ll buy that, although I suspect the gains won’t be very large.
The other explanation was that data for each column typically ends in one partially-filled data block, leaving a small amount of empty space that must still be read. It’s something like storing 3 ½ cups of water in 1-cup containers: you need four containers, of which three are completely full and one holds the remainder. (Vertica confirmed that it generally fills each block except the “last” one for any column.) Combining the columns therefore reduces the number of partly-empty blocks.
But the saving is just one partially-filled block per column. It's a bit more for small columns like dimension lists, several of which might fit into a single block if combined. I can’t see how a few partially-empty data blocks would have much impact on performance when a good-sized database fills thousands of blocks. (The typical block size, per Vertica, is 1 MB.) And if you don’t have a good-sized database, performance won’t be an issue in the first place.
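Back-of-envelope arithmetic, using Vertica's stated 1 MB block size (the column sizes below are invented), shows how small the saving is: combining two large columns eliminates at most one partially-filled block out of roughly a thousand.

```python
# Back-of-envelope arithmetic for the partially-filled-block argument.
# Block size is 1 MB per Vertica; the column sizes are made up.

BLOCK = 1 * 1024 * 1024  # 1 MB

def blocks_needed(column_bytes):
    # Each column occupies whole blocks, ending in at most one
    # partially filled block: ceiling division.
    return -(-column_bytes // BLOCK)

# Two hypothetical ~500 MB columns, each with a partial last block.
col_a = 500 * 1024 * 1024 + 300 * 1024
col_b = 500 * 1024 * 1024 + 600 * 1024

separate = blocks_needed(col_a) + blocks_needed(col_b)   # 1002 blocks
combined = blocks_needed(col_a + col_b)                  # 1001 blocks

# Combining saves one block out of about a thousand read.
print(separate - combined)  # 1
```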
I was willing to be convinced that I was missing something, but Vertica told me they didn’t have any formal test results available. The best they could offer was that they sometimes saw up to 10% improvement when large tables are involved, mostly from compression. For a system that promises to deliver “query results 50 to 200 times faster than other databases”, a 10% change is immaterial.
The other major component of the Vertica announcement is what it calls “MapReduce integration”, which should definitely not be confused with actually implementing MapReduce within Vertica. (Indeed, the footnotes to Wikipedia’s article on MapReduce show that Vertica CTO Michael Stonebraker has been publicly skeptical of MapReduce, although the nuances are complicated.)
What Vertica has added is a JDBC connector that makes it relatively easy to move data between separate servers running Vertica and Hadoop (an open source implementation of MapReduce). Since SQL databases like Vertica are good at different things than MapReduce, this generally makes sense. Still, it's worth noting that other analytical database vendors, including Greenplum and Aster Data, run MapReduce and SQL on the same hardware.
The 3.5 version of Vertica is scheduled for release this October.