Thursday, May 31, 2007

Zoomix Bridges Statistical and Reference-Based Matching

Tuesday’s post on matching technology prompted a contact from Zoomix, a data quality software vendor just entering the U.S. market.

Zoomix makes the best case I’ve seen for the use of statistical methods without predefined reference data within data quality systems. Basically, they apply machine learning techniques that let users train systems for specific data quality applications. Part of this training involves inferring general rules for matching and data cleansing. That much is fairly standard. But Zoomix also captures specific information, such as the equivalence of terms in different languages, and stores it as well. In other words, Zoomix builds its own reference data by capturing user decisions, rather than relying on external reference databases. (Zoomix isn’t fanatical about its approach: users can also import reference data if they wish.)
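
To make the idea concrete, here is a minimal sketch of a matcher that falls back on statistical similarity but also remembers equivalences a user has confirmed. This is my own illustration, not Zoomix's implementation, which I haven't seen: the class name `LearningMatcher`, its methods, and the 0.85 threshold are all hypothetical, and the similarity score comes from Python's standard `difflib` rather than any trained model.

```python
from difflib import SequenceMatcher


class LearningMatcher:
    """Hypothetical sketch: statistical matching plus self-built reference data."""

    def __init__(self, threshold=0.85):
        # Assumed cutoff for the similarity score; a real system would tune this.
        self.threshold = threshold
        # Learned reference data: normalized term -> canonical term.
        self.canonical = {}

    @staticmethod
    def _norm(term):
        # Lowercase and collapse whitespace before comparing.
        return " ".join(term.lower().split())

    def _resolve(self, term):
        # Use the learned canonical form if one exists, else the normalized term.
        t = self._norm(term)
        return self.canonical.get(t, t)

    def confirm_equivalent(self, a, b):
        """Record a user decision that a and b mean the same thing."""
        source, target = self._resolve(a), self._resolve(b)
        if source == target:
            return
        # Repoint every term that mapped to the old canonical form.
        for term, canon in list(self.canonical.items()):
            if canon == source:
                self.canonical[term] = target
        self.canonical[source] = target

    def is_match(self, a, b):
        ra, rb = self._resolve(a), self._resolve(b)
        if ra == rb:
            return True  # exact or learned equivalence
        # Otherwise fall back on a purely statistical comparison.
        return SequenceMatcher(None, ra, rb).ratio() >= self.threshold


matcher = LearningMatcher()
print(matcher.is_match("colour", "color"))    # similar spelling, no training needed
print(matcher.is_match("tornillo", "screw"))  # unrelated strings before training
matcher.confirm_equivalent("tornillo", "screw")
print(matcher.is_match("Tornillo", "screw"))  # now matched via learned reference data
```

The point of the sketch is the split in `is_match`: string similarity handles spelling variants on its own, while the cross-language pair only matches once a user decision has been captured and stored.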

I suppose self-generated reference data is still reference data, but the larger point is that Zoomix’s approach means it can be applied to any type of data, not just the traditional name and address information for which standard reference files exist. This gives Zoomix the flexibility traditionally associated with purely statistical solutions. Similarly, Zoomix self-generates rules to identify concepts, extract attributes, and build classification schemes. All these capabilities are traditionally associated with external knowledge such as grammars and taxonomies.

These capabilities make Zoomix sound more like text analysis software—think Autonomy, ClearForest (just bought by Reuters) and Inxight (just purchased by BusinessObjects)—than traditional data quality or matching solutions. This conveniently supports the point of Tuesday’s post, that those two worlds are closer than commonly realized (by me, at least).

Zoomix’s approach combines the flexibility of statistical solutions with the power of domain-specific reference data. I might change my mind after I think about it more deeply, but my first impression is that it could mark a significant improvement in data quality techniques.
