Customer Experience Matrix: analytical database

Showing posts with label analytical database. Show all posts

Friday, September 11, 2009

A Heartwarming Story of Social Media, Family and QlikView

My son works as a sports researcher at a cable television network. His job seems mainly to be looking things up in online databases and, on broadcast days, watching several games simultaneously. It's nice work if you can get it.

In terms of technology, Microsoft Word and Excel meet most of his needs. But I did introduce him to QlikView several years back, and he learned enough to analyze statistics for his college basketball team. When QlikView introduced its free Personal Edition, he decided to use it at work to track a database of college recruiting prospects. Despite (or because of) his lack of technical background, and without any formal QlikView training, he created a very nice system to find prospects based on different characteristics and create ad hoc statistical summaries.

The centerpiece is a map that displays the number of recruits by state. Because this is QlikView, the map is automatically redrawn each time he makes a selection: so he can see recruits for a certain position, or going to a particular school, or whatever. This is the sort of thing that gets sports people excited. In fact, his colleagues were so pleased that there’s talk of using a version of the map on-air.

The only fly in this ointment was that neither he nor I could find a way to get the map to show the numbers for all the states simultaneously. We could get different sized bubbles reflecting the state counts, and we could see the actual figure for each state by hovering over it. Recognizing my own limits as a QlikView developer, I asked for help on the QlikView user forum and from friend on the QlikView consulting staff. The consultant didn't think it was possible, so we let the matter drop.

Fast-forward one month, to yesterday, when I received a notification that someone had responded to my forum query with a solution. It took a couple of tries, and some additional help from forum members, to get it to work on my son’s map. But you can imagine how pleased we were when we finally saw the map as originally envisioned.

This story illustrates quite a bit about QlikView. Building the original map was easy – my son was able to do it with little help, even though QlikView was doing some very sophisticated processing under the hood. (Specifically, on-the-fly data aggregation along user-defined calculated dimensions, without touching the underlying database). But getting the system to do exactly what he needed did take some special knowledge. (He had to use the number of students by state as his primary dimension, not the X/Y map coordinates.) The adjustment took just a few minutes, but only a QlikView expert would realize that’s how you do it.

To generalize a bit more broadly, then, QlikView really does enable non-technical users to do amazing things, and it really is as powerful as its advocates (myself included) like to claim. But users do need some training to be effective – something that advocates are sometimes reluctant to admit.

The story also illustrates the value of social media. QlikView’s forum is an amazing source of help for users of all skill levels. It works because QlikView has a community of highly engaged advocates who are both expert in the product and willing to help each other.

The forum provides several strategic benefits for QlikView: it helps users become successful (thus driving wider adoption); it lets users succeed even if they don’t receive proper training (which many will not, particularly among users of the free Personal Edition); it reduces the need for paid support staff; and it provides a window into common problems and requirements. It also reinforces the commitment of the engaged users themselves, by publicly rewarding their contributions. Although I’ve never discussed the forum with QlikView management, they obviously understand these benefits well enough to justify their continued investments in it.

This isn’t to say that social media would provide the same value to everyone. QlikView fits several specific conditions – enthusiastic expert users, problems that can be solved fairly easily, etc. – that won’t always apply. But as an example what social media can sometimes accomplish, QlikView is a great case study waiting to be written.

Wednesday, August 05, 2009

Vertica Announces 3.5 Release

Summary: the new release of Vertica's columnar database can store several related data elements in a single column. Didn't we used to call that a row-oriented database? Benefits seem limited.

Analytical database vendor Vertica yesterday announced its 3.5 release. The main feature is a new architecture called "Flexstore", which can combine several data elements into a single column. This is done for columns that are commonly used together in the same query, such as the “bid” and “asked” price on a stock transaction or the dimension tables in a star schema (to use the company’s examples).

I was skeptical of this notion when Vertica briefed me two weeks ago, and still am today. Storing multiple elements together is what a row-oriented database does, so it seems fundamentally at odds with Vertica’s column-based model. More concretely, a columnar database scans all entries for each column during a query, so its speed is basically determined by the amount of data. Whether it scans two columns that are one terabyte each or one combined column of two terabytes, it’s still scanning the same two terabytes.

Vertica offered two responses to my doubts. One is that it can better compress the data when the two columns are combined, for example by using delta encoding (storing only the change from one value to the next). I’ll buy that, although I suspect the gains won’t be very large.

The other explanation was that data for each column typically ends in one partially-filled data block, leaving a small amount of empty space that must still be read. It’s something like storing 3 ½ cups of water in 1-cup containers – you need four cups, of which three are completely filled and one holds the remainder. (Vertica confirmed that it generally fills each block except the “last” one for any column.) Combining the columns therefore reduces the number of partly-empty blocks.

But the saving is just one partially-filled block per column. It's a bit more for small columns like dimension lists, several of which might fit into a single block if combined. I can’t see how a few partially-empty data blocks would have much impact on performance when a good size database fills thousands of blocks. (The typical block size, per Vertica, is 1 MB). And if you don’t have a good size database, performance won’t be an issue in the first place.

I was willing to be convinced that I was missing something, but Vertica told me they didn’t have any formal test results available. The best they could offer was that they sometimes saw up to 10% improvement when large tables are involved, mostly from compression. For a system that promises to deliver “query results 50 to 200 times faster than other databases”, a 10% change is immaterial.

The other major component of the Vertica announcement is what it calls “MapReduce integration”, which should definitely not be confused with actually implementing MapReduce within Vertica. (Indeed, the footnotes to Wikipedia’s article on MapReduce show that Vertica CTO Michael Stonebreaker has been publicly skeptical of MapReduce, although the nuances are complicated.)

What Vertica has added is a JDBC connector that makes it relatively easy to move data between separate servers running Vertica and Hadoop (the open source version of MapReduce). Since SQL databases like Vertica are good at different things than MapReduce, this generally makes sense. Still, it's worth noting that other analytical database vendors including Greenplum and Aster Data run MapReduce and SQL on the same hardware.

The 3.5 version of Vertica is scheduled for release this October.

Thursday, July 09, 2009

ParAccel Toots Its Horn and Revs Its Database Engine

Summary: Over the past year, columnar analytical database vendor ParAccel has methodically proven its claims about speed, scalability and easy deployment. Now it's looking to grow fast.

When I first wrote about analytical database vendor ParAccel in a February 2008 post, it was one of several barely distinguishable vendors offering massively parallel, SQL-compatible columnar databases. Their main claim to fame was a record-setting performance on the TPC-H benchmark, but even the significance of that was unclear since few vendors bother with the TPC process.

Since then, ParAccel has delivered an impressive string of accomplishments, including deals with demanding customers (Merkle, PriceChopper, Autometrics, TRX) and an important alliance with EMC to create a “scalable analytic appliance”. To top it off, they recently announced their 2.0 release, a new TPC-H record, and $22 million Series C funding. (Full disclosure: they also hired me to write a white paper.)

Of all these, perhaps the most significant news is that the new TPC-H benchmark comes at the 30 terabyte level.* ParAccel’s previous TPC-H championships were at the 100 GB to 1 TB levels.

The change reflects a general growth in the scale of systems supported by MPP columnar databases. ParAccel reports its largest production installation holds 18 TB of compressed data, which probably translates to something more than 50 TB of input. Segment-leader Vertica reports several production installations larger than 100 TB. Neither had more than 10 TB in production a year ago.

These figures still don’t put the columnar systems in the same ballpark as the petabyte-scale database appliances like Netezza, Greenplum and Aster Data, but they do open up some major new possibilities. In case you’re wondering, ParAccel’s TPC-H results were seven times faster and had 16 times better price / performance than the previous record, held by Oracle.

But pure scalability isn’t the key selling point for ParAccel. More than anything, the company stresses its ability to handle complex queries without specialized data schemas or indexes. This means that existing data structures can be loaded as is and queried immediately. The net result is a much faster “time to answer” than competitive systems, which do tailor schemas and/or indexes to specific questions. It also means that new queries can be answered immediately, without waiting for schema modifications or new indexes.

The 2.0 release extends these advantages with a new query optimizer that handles very complex joins and correlated subqueries; parallel data loading (nearly 9 TB per hour in the TPC-H benchmark) and User Defined Functions; enhanced compression; and “blended scans” that avoid Storage Area Network (SAN) controller bottlenecks by loading SAN data onto compute nodes and querying them directly. It also adds some special features such as Oracle SQL support and column encryption for financial data. Another set of enhancements are designed to provide enterprise-class reliability, availability and manageability, such as back-up and failover. Several of these features are already in production, although the official 2.0 release date is August.

The new release and added funding mark a transition of ParAccel from quiet introduction to full-throated selling. Over the past year, the company has carefully limited its participation in Proof of Concept (POC) competitions, the key selection tool in this segment. This gave it time to refine its POC processes, add system features, and build initial client references. It says it can now complete a typical POC in three days, often leaving while other vendors are still getting started. The company is now ramping up its lead generation and inside sales operations, aiming to grow quickly beyond its dozen-plus existing installations. (To provide some context: Vertica reports more than 100 clients.) We'll see what comes next.

______________
* For some serious doubt-sowing about the new benchmarks, see Daniel Abadi's post (be sure to read the comments) and ParAccel's response. What really matters, as ParAccel points out, is performance in customer POCs. The company says its performance has never been beaten, although there was one tie. (For sheer entertainment, check out the related string on Curt Monash's blog.)

Thursday, January 29, 2009

SQLStream Simplifies Event Stream Processing

I spoke earlier this week with SQLStream, which offers software to execute queries against data streams such as stock market prices, Web logs and credit card transactions. These queries can include on-the-fly calculations such as moving averages, as well as scans for patterns like a sequence of failed log-in attempts. Typical applications include security monitoring, fraud detection, and general business activity monitoring. Marketers can use the queries to identify new leads and select cross-sell and upsell offers. Although the connection is a little less obvious, the system can also be used as an alternative to conventional batch data preparation methods for tasks like customer data integration.

SQLStream’s particular claim to fame is that its queries are almost identical to garden-variety SQL. Other vendors in this space apparently use more proprietary approaches. I say “apparently” because I haven’t researched the competition in any depth. A quick bit of poking around was enough to scare me off: there are many vendors in the space and it is a highly technical topic. It turns out that stream processing is one type of “complex event processing,” a field which has attracted some very smart but contentious experts. To see what I mean, check out Event Processing Thinking (Opher Etzion) and Cyberstrategics Complex Event Processing Blog (Tim Bass). This is clearly not a group to mess with.

That said, SQLStream’s more or less direct competitors seem to include: Coral8, Truviso, Progress Apama, Oracle BAM, TIBCO BusinessEvents, KX Systems, StreamBase and Aleri . For a basic introduction to data stream processing, see this presentation from Truvisio.

Back to SQLStream. As I said, it lets users write what are essentially standard SQL queries that are directed against a data stream rather than a static table. The data stream can be any JDBC-accessible data source, which includes most types of databases and file structures. The system can also accept streams of XML data over HTTP, which includes RSS feeds, Twitter posts and other Web sources. Its queries can also incorporate conventional (non-streaming) relational database tables, which is very useful when you need to compare streamed inputs against more or less static reference information. For example, you might want to check current activity against a customer’s six-month average bank balance or transaction rate.

The advantages of using SQL queries are that there are lots of SQL programmers out there and that SQL is relatively easy to write and understand. The disadvantage (in my opinion; not surprisingly, SQLStream didn’t mention this) is that SQL is really bad at certain kinds of queries, such as queries comparing subsets within the query universe and queries based on record sequence. Lack of sequencing may sound like a pretty big drawback for a stream processing system, but SQLStream compensates by letting queries specify a time “window” of records to analyze. This makes queries such as “more than three transactions in the past minute” quite simple. (The notion of “windows” is common among stream processing systems.) To handle subsets within queries, SQLStream mimics a common SQL technique of converting one complex query into a sequence of simple queries. In SQLStream terms, this means the output of one query can be a stream that is read by another query. These streams can be cascaded indefinitely in what SQLStream calls a “data flow architecture”. Queries can also call external services, such as address verification, and incorporate the results. Query results can be posted as records to a regular database table.

SQLStream does its actual processing by holding all the necessary data in memory. It automatically examines all active queries to determine how long data must be retained: thus, if three different queries need a data element for one, two and three minutes, the system will keep that data in memory for three minutes. SQLStream can run on 64-bit servers, allowing effectively unlimited memory, at least in theory. In practice, it is bound by the physical memory available: if the stream feeds more data than the server can hold, some data will be lost. The vendor is working on strategies to solve this problem, probably by retaining the overflow data and processing it later. For now, the company simply recommends that users make sure they have plenty of extra memory available.

In addition to memory, system throughput depends on processing power. SQLStream currently runs on multi-core, single-server systems and is moving towards multi-node parallel processing. Existing systems process tens of thousands of records per second. By itself, this isn't a terribly meaningful figure, since capacity also depends on record size, query complexity, and data retention windows. In any case, the vendor is aiming to support one million records per second.

SQLStream was founded in 2002 and owns some basic stream processing patents. The product itself was launched only in 2008 and currently has about a dozen customers. Since the company is still seeking to establish itself, pricing is, in their words, “very aggressive”.

If you’re still reading this, you probably have a pretty specific reason for being interested in SQLStream or stream processing in general. But just in case you’re wondering “Why the heck is he writing about this in a marketing blog?” there are actually several reasons. The most obvious is that “real time analytics” and “real time interaction management” are increasingly prominent topics among marketers. Real time analytics provides insights into customer behaviors at either a group level (e.g., trends in keyword response) or for an individual (e.g., estimated lifetime value). Real time interaction management goes beyond insight to recommend individual treatments as the interaction takes place (e.g., which offer to make during a phone call). Both require the type of quick reaction to new data that stream processing can provide.

There is also increasing interest in behavior detection, sometimes called event driven marketing. This monitors customer behaviors for opportunities to initiate an interaction. The concept is not widely adopted, even thought it has proven successful again and again. (For example, Mark Holtom of Eventricity recently shared some very solid research that found event-based contacts were twice as productive as any other type. Unfortunately the details are confidential, but if you contact Mark via Eventriicty perhaps he can elaborate.) I don’t think lack of stream processing technology is the real obstacle to event-based marketing, but perhaps greater awareness of stream processing would stir up interest in behavior detection in general.

Finally, stream processing is important because so much attention has recently been focused on analytical databases that use special storage techniques such as columnar or in-memory structures. These require processing to put the data into the proper format. Some offer incremental updates, but in general the updates run as batch processes and the systems are not tuned for real-time or near-real-time reactions. So it’s worth considering stream processing systems as a complement to that lets companies employ these other technologies without giving up quick response to new data.

I suppose there's one more reason: I think this stuff is really neat. Am I allowed to say that?

Wednesday, December 24, 2008

ADVIZOR's In-Memory Database Supports Powerful Visualization

Back when I was writing a great deal about QlikView, I proposed that its fundamental value came from empowering business analysts to do work for themselves that would otherwise require IT support. (See, for example, this post, which has the virtue of pretty graphics.) This same notion of considering which users do which work has permiated my ideas of usability measurement for demand generation systems and usability in general. But to get back specifically to business intelligence systems, I think there is a particularly large gap between the capabilities available to business analysts and those available to IT. That is, even though the business intelligence systems like Cognos and Business Objects give analysts many ways to slice and present prepared data, they do not let analysts add new data or restructure existing data to meet new needs. This still requires the IT staff to design new data cubes and loading processes.

This gap is partly filled by analytical technologies such as columnar systems and database appliances, which can give good performance without schemas tailored for each task. But those systems are purchased and managed by the IT department, so they still leave analysts largely reliant on IT’s tender mercies.

A much larger portion of the gap is filled by products like QlikView, which the analysts can largely control for themselves. These can be divided into two subcategories: database engines like QlikView and illuminate, and visualization tools like Tableau and TIBCO Spotfire. The first group lets analysts do complex data manipulation and queries without extensive data modeling, while the latter group lets them do complex data exploration and presentation without formal programming. This distinction is not absolute: the database tools offer some presentation functions, and the visualization tools support some data manipulation. Both capabilities must be available for the analysts the work independently.

This brings us to ADVIZOR from ADVIZOR Solutions. ADVIZOR features an in-memory database and some data manipulation, but its primary strength is visualization. This includes at least fifteen chart types, including some with delightfully cool names like Multiscape, Parabox and Data Constellations. Analysts can easily configure these by selecting a few menu options. The charts are also somewhat interactive, allowing users to select records by clicking on a shape or drawing a box around data points. Some settings can be changed within the chart, such as selecting a measure to report on. Others, such as specifying the dimensions, require modifying the chart setup. The distinction won’t matter much to business analysts, who will have the tools build and modify the charts. But final consumers of the analyses typically run a viewer that does not permit changes to the underlying graph configuration.

On the other hand, that in-memory database can link several charts within a dashboard so selections made on one chart are immediately reflected in all others. This is arguably the greatest strength of the system, since it lets users slice data across many dimensions without writing complex queries. Colors are also consistent from one chart to the next, so that, for example, the colors assigned to different customer groups in a bar chart determine the color of the dot assigned to each customer in a scatter plot. Selecting a single group by clicking on its bar would turn the dots of all the other customers to gray. Keeping the excluded records visible in this fashion may yield more insight than simply removing them, although the system could also do that. These adjustments appear almost instantly even where millions of records are involved.

Dashboards can easily be shared with end-users through either a zero-footprint Web client or a downloadable object. Both use the Microsoft .NET platform, so Mac and Linux users need not apply. Images of ADVIZOR dashboards can easily be exported to Office documents, and can actually be manipulated from within Powerpoint if they are connected to the underlying dashboard. It’s also easy to export results such as lists of selected records.

Circling back to that database: it employs technology developed at Bell Labs during the 1990’s to support interactive visualization. The data model itself is a fairly standard one of tables linked by keys. Users can import data from text files, relational databases, Excel, Access, text files, Business Objects or Salesforce.com. They can map the table relationships and add some transformations and calculated fields during or after the import process. Although the mapping and transformations are executed interactively, the system records the sequence so the user can later edit it or repeat it automatically.

The import is fairly quick: the vendor said that an extract of three to four gigabytes across thirty tables runs in about twenty minutes, of which about five minutes is the build itself. The stored data is highly compressed but expands substantially when loaded into RAM: in the previous example, the three to four GB are saved as a 70 MB project file, but need 1.4 GB of RAM. The current version of ADVIZOR runs on 32 bit systems which limits it to 2-4 GB of RAM, although a 64 bit version is on track for release in January 2009. This will allow much larger implementations.

Pricing of ADVIZOR starts at $499 for a desktop version limited to Excel, Access or Salesforce.com source data and without table linking. (A 30-day trial version of this costs $49.) The full version starts at around $10,000, with additional charges for different types of user seats and professional services. Few clients pay less than $20,000 and a typical purchase is $50,000 to $60,000. Most buyers are business analysts or managers with limited technical skills, so the company usually helps set up their initial data loads and applications. ADVIZOR was introduced in 2004 and has several thousand end users, with a particular concentration in fund-raising for higher education. The bulk of ADVIZOR sales come through vendors who have embedded it within their own products.

Wednesday, August 27, 2008

Looking for Differences in MPP Analytical Databases

“You gotta get a gimmick if you wanna get ahead” sing the strippers in the classic musical Gypsy. The same rule seems to apply to analytical databases: each vendor has its own little twist that makes it unique, if not necessarily better than the competition. This applies even, or maybe especially, to the non-columnar systems that use a massively parallel (“shared-nothing”) architecture to handle very large volumes.

You’ll note I didn’t refer to these systems as “appliances”. Most indeed follow the appliance path pioneered by industry leader Netezza, I’ve been contacted by Aster Data, Microsoft), Dataupia, and Kognitio. A review of my notes shows that no two are quite alike.

Let’s start with Dataupia. CEO and founder Foster Hinshaw was also a founder at Netezza, which he left in 2005. Hinshaw still considers Netezza the “killer machine” for large analytical workloads, but positions Dataupia as a more flexible product that can handle conventional reporting in addition to ad hoc analytics. “A data warehouse for the rest of us” is how he puts it.

As it happens, all the vendors in this group stress their ability to handle “mixed workloads”. It’s not clear they mean the same thing, although the phrase may indicate that data can be stored in structures other than only star/snowflake schemas. In any event, the overlap is large enough that I don’t think we can classify Dataupia as unique on that particular dimension. What does set the system apart is its ability to manage “dynamic aggregation” of inputs into the data cubes required by many business intelligence and reporting applications. Cube building is notoriously time-consuming for conventional databases, and although any MPP database can presumably maintain cubes, it appears that Dataupia is especially good at it. This would indeed support Dataupia’s position as more reporting-oriented than its competitors.

The other apparently unique feature of Dataupia is its ability to connect with applications through common relational databases such as Oracle and DB2. None of the other vendors made a similar claim, but I say this is “apparently” unique because Hinshaw said the connection is made via the federation layer built into the common databases, and I don’t know whether other systems could also connect in the same way. In any case, Hinshaw said this approach makes Dataupia look to Oracle like nothing more than some additional table space. So integration with existing applications can’t get much simpler.

One final point about Dataupia is pricing. A 2 terabyte blade costs $19,500, which includes both hardware and software. (Dataupia is a true appliance.) This is a much lower cost than any competitor.

The other true appliance in this group is DATAllegro. When we spoke in April, it was building its nodes with a combination of EMC storage, Cisco networking, Dell servers, Ingres database and the Linux operating system. Presumably the Microsoft acquisition will change those last two. DATAllegro’s contribution was the software to distribute data across and within the hardware nodes and to manage queries against that data. In my world, this falls under the heading of intelligent partitioning, which is not itself unique: in fact, three of the four vendors listed here do it. Of course, the details vary and DATAllegro’s version no doubt has some features that no one else shares. DATAllegro was also unique in requiring a large (12 terabyte) initial configuration, for close to $500,000. This will also probably change under Microsoft management.

Aster Data lets users select and assemble their own hardware rather than providing an appliance. Otherwise, it generally resembles the Dataupia and DATAllegro appliances in that it uses intelligent partitioning to distribute its data. Aster assigns separate nodes to the tasks of data loading, query management, and data storage/query execution. The vendor says this makes it easy to support different types of workloads by adding the appropriate types of nodes. But DATAllegro also has separate loader nodes, and I’m not sure about the other systems. So I’m not going to call that one unique. Aster pricing starts at $100,000 for the first terabyte.

Kognitio resembles Aster in its ability to use any type of hardware: in fact, a single network can combine dissimilar nodes. A more intriguing difference is that Kogitio is the only one of these systems that distributes incoming data in a round-robin fashion, instead of attempting to put related data on the same node. It can do this without creating excessive inter-node traffic because it loads data into memory during query execution—another unique feature among this group. (The trick is that related data is sent to the same node when it's loaded into memory. See the comments on this post for details.)

Kognitio also wins the prize for the oldest (or, as they probably prefer, most mature) technology in this group, tracing its WX2 product to the WhiteCross analytical database of the early 1980’s. (WX2…WhiteCross…get it?) It also has by far the highest list price, of $180,000 per terabyte. But this is clearly negotiable, especially in the U.S. market, which Kognitio entered just this year. (Note: after this post was originally published, Kognitio called to remind me that a. they will build an appliance for you with commodity hardware if you wish and b. they also offer a hosted solution they call Data as a Service. They also note that the price per terabyte drops when you buy more than one.)

Whew. I should probably offer a prize for anybody who can correctly infer which vendors have which features from the above. But I’ll make it easy for you (with apologies that I still haven’t figured out how to do a proper table within Blogger).

______________Dataupia___DATAllegro___Aster Data___Kognitio
Mixed Workload_____Yes________Yes________Yes________Yes
Intelligent Partition___Yes________Yes________Yes________no
Appliance__________Yes________Yes________no________no
Dynamic Aggregation__Yes________no_________no________no
Federated Access_____Yes________no_________no________no
In-Memory Execution__no________no_________no________Yes
Entry Cost per TB___$10K(1)___~$40K(2)______$100K______$180K

(1) $19.5K for 2TB
(2) under $500K for 12TB; pre-acquisition pricing

As I noted earlier, some of these differences may not really matter in general or for your application in particular. In other cases, the real impact depends on the implementation details not captured in such a simplistic list. So don’t take this list for anything more than it is: an interesting overview of the different choices made by analytical database developers.

Wednesday, August 06, 2008

More on QlikView - Curt Monash Blog

I somehow ended up posting some comments on QlikView technology on Curt Monash's DBMS2 blog. This is actually a more detailed description than I've ever posted here about how I think QlikView works. If you're interested in that sort of thing, do take a look.

Tuesday, August 05, 2008

More on Vertica

I finally had a conversation with columnar database developer Vertica last week. They have done such an excellent job explaining their system in white papers and other published materials that most of my questions had already been answered. But it’s always good to hear things straight from the source.

The briefing pretty much confirmed what I already knew and have written here and elsewhere. Specifically, the two big differentiators of Vertica are its use of sorted data and of shared-nothing (MPP) hardware. Loading the data in a sorted order allows certain queries to run quickly because the system need not scan the entire column to find the desired data. Of course, if a query involves data from more than one column, all those columns be stored in the same sequence or must be joined on a common ID. Vertica supports both approaches. Each has its cost. Shared-nothing hardware allows scalability and allows redundant data storage which simplifies recovery.

Our conversation did highlight a few limits that I hadn’t seen clearly before. It turns out that the original releases of Vertica could only support star and snowflake schema databases. I knew Vertical was star schema friendly but didn’t realize the design was required. If I understood correctly, even the new release will not fully support queries across multiple fact tables sharing a dimension table, a fairly common data warehouse design. Vertica’s position is that everybody really should use star/snowflake designs. The other approaches were compromises imposed the limits of traditional row-oriented database engines, which Vertica makes unnecessary. I suspect there are other reasons people might want to use different designs, if only to save the trouble of transforming their source data.

On a somewhat related note, Vertica also clarified that their automated database designer—a major strength of the system—works by analyzing a set of training queries. This is fine so long as workloads are stable, but not so good if they change. A version that monitors actual queries and automatically adjusts the system to new requirements is planned for later this year. Remember that database design is very important to Vertica, since performance depends in part on having the right sorted columns in place. Note also that the automated design will become trickier as the system supports more than start/snowflake schemas. I wouldn’t be surprised to see some limits on the automated designs as a result.

The other bit of hard fact that emerged from the call is that the largest current production database for Vertica is 10 terabytes. The company says new, bigger installations are added all the time, so I’m sure that number will grow. They added that they’ve tested up to 50 TB and are confident the system will scale much higher. I don’t doubt it, since scalability is one of the key benefits of the shared-nothing approach. Vertica also argues that the amount of data is not a terribly relevant measure of scalability—you have to consider response time and workload as well. True enough. I’d certainly consider Vertica for databases much larger than 10 TB. But I’d also do some serious testing at scale before making a purchase.

Saturday, July 12, 2008

Sybase IQ: A Different Kind of Columnar Database (Or Is It The Other Way Around?)

I spent a fair amount of time this past week getting ready for my part in the July 10 DM Radio Webcast on columnar databases. Much of this was spent updating my information on SybaseIQ, whose CTO Irfan Khan was a co-panelist.

Sybase was particularly eager to educate me because I apparently ruffled a few feathers when my July DM Review column described SybaseIQ as a “variation on a columnar database” and listed it separately from other columnar systems. Since IQ has been around for much longer than the other columnar systems and has a vastly larger installed base—over 1,300 customers, as they reminded me several times—the Sybase position seems to be that they should be considered the standard, and everyone else as the variation. (Not that they put it that way.) I can certainly see why it would be frustrating to be set apart from other columnar systems at exactly the moment when columnar technology is finally at the center of attention.

The irony is that I’ve long been fond of SybaseIQ, precisely because I felt its unconventional approach offered advantages that few people recognized. I also feel good about IQ because I wrote about its technology back in 1994, before Sybase purchased it from Expressway Technologies—as I reminded Sybase several times.

In truth, though, that original article was part of the problem. Expressway was an indexing system that used a very clever, and patented, variation on bitmap indexes that allowed calculations within the index itself. Although that technology is still an important feature within SybaseIQ, it now supplements a true column-based data store. Thus, while Expressway was not a columnar database, SybaseIQ is.

I was aware that Sybase had extended Expressway substantially, which is why my DM Review article did refer to them as a type of columnar database. So there was no error in what I wrote. But I’ll admit that until this week’s briefings I didn’t realize just how far SybaseIQ has moved from its bitmapped roots. It now uses seven or nine types of indexes (depending on which document you read), including traditional b-tree indexes and word indexes. Many of its other indexes do use some form of bitmaps, often in conjunction with tokenization (i.e., replacing an actual value with a key that points to a look-up table of actual values. Tokenization saves space when the same value occurs repeatedly, because the key is much smaller than the value itself. Think how much smaller a database is if it stores “MA” instead of “Massachusetts” in its addresses. )

Of course, tokenization is really a data compression technique, so I have a hard time considering a column of tokenized data to be an index. To me, an index is an access mechanism, not the data itself, regardless of how well it’s compressed. Sybase serenely glides over the distinction with the Zen-like aphorism that “the index is the column” (or maybe it was the other way around). I’m not sure I agree, but the point doesn’t seem worth debating

Yet, semantics aside, SybaseIQ’s heavy reliance on “indexes” is a major difference between it and the raft of other systems currently gaining attention as columnar databases: Vertica, ParAccel, Exasol and Calpont among them. These systems do rely heavily on compression of their data columns, but don’t describe (or, presumably, use) these as indexes. In particular, so far as I know, they don’t build different kinds of indexes on the same column, which IQ treats as a main selling point. Some of the other systems store several versions of the same column in different sort sequences, but that’s quite different.

The other very clear distinction between IQ and the other columnar systems is that IQ uses Symmetrical Multi-Processing (SMP) servers to process queries against a unified data store, while the others rely on shared nothing or Massively Multi-Processor (MMP) servers. This reflects a fundamentally different approach to scalability. Sybase scales by having different servers execute different queries simultaneously, relying on its indexes to minimize the amount of data that must be read from the disk. The MPP-based systems scale by partitioning the data so that many servers can work in parallel to scan it quickly. (Naturally, the MPP systems do more than a brute-force column scan; for example, those sorted columns can substantially reduce read volumes.)

It’s possible that understanding these differences would allow someone to judge which type of columnar system works better for a particular application. But I am not that someone. Sybase makes a plausible case that its approach is inherently better for a wider range of ad hoc queries, because it doesn’t depend on how the data is partitioned or sorted. However, I haven’t heard the other vendors’ side of that argument. In any event, actual performance will depend on how the architecture has been implemented. So even a theoretically superior approach will not necessarily deliver better results in real life. Until the industry has a great deal more experience with the MPP systems in particular, the only way to know which database is better for a particular application will be to test them.

The SMP/MPP distinction does raise a question about SybaseIQ’s uniqueness. My original DM Review article actually listed two classes of columnar systems: SMP-based and MPP-based. Other SMP-based systems include Alterian, SmartFocus, Infobright, 1010Data and open-source LucidDB. (The LucidDB site contains some good technical explanations of columnar techniques, incidentally.)

I chose not to list SybaseIQ in the SMP category because I thought its reliance on bitmap techniques makes it significantly different from the others, and in particular because I believed it made IQ substantially more scalable. I’m not so sure about the bitmap part anymore, now that realize SybaseIQ makes less use of bitmaps than I thought, and have found that some of the other vendors use them too. On the other hand, IQ’s proven scalability is still much greater than any of these other systems—Sybase cites installations over 100 TB, while none of the others (possibly excepting Infobright) has an installation over 10 TB.

So where does all this leave us? Regarding SybaseIQ, not so far from where we started: I still say it’s an excellent columnar database that is significantly different from the (MPP-based) columnar databases that are the focus of recent attention. But, to me, the really important word in the preceding sentence is “excellent”, not “columnar”. The point of the original DM Review article was that there are many kinds of analytical databases available, and you should consider them all when assessing which might fit your needs. It would be plain silly to finally look for alternatives to conventional relational databases and immediately restrict yourself to just one other approach.

Customer Experience Matrix