One of the most important new trends in business intelligence is data discovery. It departs from traditional business intelligence by emphasizing interactive, visual analytics rather than static reporting. The goal of data discovery is to help people apply their intuition to find meaningful and important information in data. The process is usually iterative: ask a question of the data, view the results visually, and refine the question.
Contrast this with the traditional approach, in which information consumers ask questions, reports are developed to answer them and delivered to the consumers, and those reports often raise further questions that require still more reports.
Data Discovery Approaches
Progressive companies consider data to be a strategic asset and understand its importance in driving innovation, differentiation, and growth. But leveraging data and transforming it into real business value requires a holistic approach to business intelligence and analytics. Such an approach goes beyond the scope of most data visualization tools and differs dramatically from the business intelligence (BI) platforms of years past.
The continuing evolution of data discovery in the enterprise and the cloud is being driven by the trends listed below:
- Big data: On big data projects, data discovery is both more important and more challenging. Not only is the volume of data that must be efficiently processed for discovery larger, but the diversity of sources and formats also causes many traditional methods of data discovery to fail. When a big data initiative also involves high-velocity data that must be profiled rapidly, profiling becomes harder and less feasible with existing toolsets.
- Real-time analytics: The ongoing shift toward (nearly) real-time analytics has created a new class of use cases for data discovery. These use cases are valuable but require data discovery tools that are faster, more automated, and more adaptive.
- Agile analytics and agile business intelligence: Data scientists and business intelligence teams are adopting more agile, iterative methods of turning data into business value. They perform data discovery processes more often and in more diverse ways, such as profiling new data sets for integration, seeking answers to new questions emerging this week based on last week’s analysis, or finding alerts about emerging trends that may warrant new analysis work streams.
Different Data Discovery Techniques
Data discovery techniques vary, but they all aid the user by consolidating data within a defined context. That context enables quick evaluation and, ideally, the creation of actionable information. Three basic methods are normally used to discover, categorize, and present data:
- Metadata: This data discovery option uses automated tools to discover data element semantics within data sets. Relational databases store metadata and use it to describe column and table attributes. A search for possible credit card numbers in a database, for example, could use column attributes (e.g., column name, data type, or data size) to identify columns that might hold credit card numbers. The metadata method is the most common data discovery technique.
- Labels: Data elements can often be grouped based on a descriptive term. The term, or tag, can then be used for subsequent data discovery processes. An important aspect of labels is that they must be applied when the data is created; further tags can then be added over time to provide references or additional information. Labeling is less rigid than metadata and is more commonly used with flat files. This data discovery option becomes increasingly useful as more databases move to Indexed Sequential Access Method (ISAM) or quasi-relational data storage, a cloud database service approach popular for handling rapidly growing data sets.
- Content analysis: This process analyzes data using pattern matching; hashing; and statistical, lexical, or other types of probability analysis. Content analysis is a growing trend across multiple industries, as it has proven successful in data loss prevention (DLP) and web content analysis products.
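As a rough illustration of the metadata and content analysis methods described above, the following Python sketch looks for possible credit card numbers in two ways. It is only a minimal example under stated assumptions: the database is SQLite, the table and column names are hypothetical, and the name hints, size range, and function names are illustrative choices rather than part of any particular discovery tool. The content check pairs a regular expression with the Luhn checksum, a standard validity test for card numbers.

```python
import re
import sqlite3

# Hypothetical hints; a real tool would use a tuned, much larger rule set.
NAME_HINTS = ("card", "credit")
SIZE_PATTERN = re.compile(r"\((\d+)\)")                  # e.g. the 16 in VARCHAR(16)
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")   # 13 to 19 digits, optional separators


def metadata_candidates(conn, table):
    """Metadata method: flag columns whose name or declared size suggests card numbers."""
    candidates = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # The table name is interpolated directly, so it must come from a trusted source.
    for _cid, name, declared_type, *_rest in conn.execute(f'PRAGMA table_info("{table}")'):
        size = SIZE_PATTERN.search(declared_type or "")
        name_hit = any(hint in name.lower() for hint in NAME_HINTS)
        size_hit = size is not None and 13 <= int(size.group(1)) <= 19
        if name_hit or size_hit:
            candidates.append(name)
    return candidates


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:        # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def content_matches(values):
    """Content analysis: return the values that contain a Luhn-valid card-like number."""
    hits = []
    for value in values:
        for match in CARD_PATTERN.finditer(str(value)):
            if luhn_valid(match.group()):
                hits.append(value)
                break
    return hits


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER, full_name TEXT, "
        "card_number VARCHAR(16), notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, "Ada", "4111111111111111", "paid with 4111 1111 1111 1111"),
            (2, "Bob", "4111111111111112", "order 1234567890123 shipped"),
        ],
    )
    print(metadata_candidates(conn, "customers"))
    notes = [row[0] for row in conn.execute("SELECT notes FROM customers")]
    print(content_matches(notes))
```

The two checks complement each other: the metadata pass is cheap because it never reads row data, while the content pass can find card numbers in free-text columns that the metadata gives no hint about.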