The Data Mining Context

Businesses, government agencies, and scientific research teams across many disciplines are collecting and/or generating large volumes of data. Traditional analyses of these datasets involve human experts who manually analyze the data, or specially coded applications that search for patterns of interest. As the data volume continues to grow, these approaches are no longer adequate.
Data mining, sometimes known as Knowledge Discovery in Databases (KDD), gives users tools to sift through vast data stores to learn and recognize patterns, make classifications, verify hypotheses, and detect anomalies. These findings can highlight previously undetected correlations, influence strategic decision-making, and identify new hypotheses that warrant further investigation.
The KDD process involves several steps that take the user along the path from data to information to knowledge. For a successful KDD project, the first step is to determine the objective.

Figure 1
Figure 1 illustrates the individual activities that make up the KDD process after the objective has been identified. These steps include data selection, cleaning, and transformation; model building and pattern discovery; and outcome interpretation and evaluation. Knowledge discovery is highly iterative, and each of these activities may be revisited multiple times.
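The activities named above can be sketched as a chain of small functions. This is only an illustration under stated assumptions, not D2K code: the toy records, the field names, and the one-threshold "model" are all made up for the example, and the model is evaluated on the same records it was built from, which a real project would avoid.

```python
from statistics import mean

# Toy records, invented for illustration only.
RAW = [
    {"temp": 30.0, "humidity": 85.0, "play": "no"},
    {"temp": 27.0, "humidity": 90.0, "play": "no"},
    {"temp": 21.0, "humidity": None, "play": "yes"},  # missing value
    {"temp": 20.0, "humidity": 70.0, "play": "yes"},
    {"temp": 18.0, "humidity": 65.0, "play": "yes"},
    {"temp": 29.0, "humidity": 80.0, "play": "no"},
]


def select(rows, fields):
    """Data selection: keep only the fields relevant to the objective."""
    return [{f: r[f] for f in fields} for r in rows]


def clean(rows):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows, field):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    lo = min(r[field] for r in rows)
    hi = max(r[field] for r in rows)
    span = (hi - lo) or 1.0  # avoid dividing by zero on constant fields
    return [{**r, field: (r[field] - lo) / span} for r in rows]


def build_model(rows, field, label):
    """Model building: a threshold halfway between the two class means.

    Assumes exactly two distinct labels are present.
    """
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r[field])
    means = {c: mean(vs) for c, vs in by_class.items()}
    (c_lo, m_lo), (c_hi, m_hi) = sorted(means.items(), key=lambda kv: kv[1])
    threshold = (m_lo + m_hi) / 2
    return lambda r: c_lo if r[field] <= threshold else c_hi


def evaluate(model, rows, label):
    """Evaluation: fraction of records the model labels correctly."""
    return sum(model(r) == r[label] for r in rows) / len(rows)


def run_kdd(raw):
    """Run the steps in order; any step can be revisited and re-run."""
    rows = select(raw, ["humidity", "play"])
    rows = clean(rows)
    rows = transform(rows, "humidity")
    model = build_model(rows, "humidity", "play")
    return evaluate(model, rows, "play")


if __name__ == "__main__":
    print(run_kdd(RAW))
```

Because each step is a separate function, revisiting one activity (for example, choosing a different field in `select`) only means re-running the chain from that point.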
The results of a successful KDD activity include not only the identification of structural patterns in data (recognition), but also descriptions of the patterns (learning) - descriptions that can impart knowledge, not just yield predictions. [Witten, Frank]
The data selection, cleaning, and transformation activities shown in Figure 1 make up what is often referred to as the data preparation step. The relative effort required for each KDD step is illustrated in Figure 2. Ideally, one would like to reduce the effort spent on data preparation.

Figure 2. Required Effort for each KDD Step
D2K Streamline
The D2K-SL tool will enable end users to perform data mining activities on their current datasets in an easy-to-use environment.
A limited number of data mining tools are available commercially, but their cost is often prohibitively high for many users, especially those just beginning to investigate what KDD can do for them. Some no-cost data mining tools are available, such as the WEKA package[cite], but they are unable to handle data that does not fit entirely in memory. With the D2K-SL tool, we plan to provide a uniform interface that can be used to perform KDD on datasets ranging in size from kilobytes to terabytes. In addition, follow-on releases of D2K-SL will provide transparent access to computing resources from the desktop to the TeraGrid, enabling a new class of data mining algorithms to be explored and supported.
D2K-SL will address a critical need by providing strong support for the data preparation phase of the KDD process, and will use new visualization techniques to support the user in all steps of the data mining activity. Currently, visualizations are used primarily to show the outcome of the process, but are not relied upon in earlier stages. We believe that interactive visualizations will help users understand and prepare their datasets.
Finally, we anticipate support in D2K-SL for mining text and images, in addition to the more traditional mining of numeric and categorical data.





D2K SL Beta-stage Implementation. All Rights Reserved.
D2K and D2K-SL

The Automated Learning Group currently has a data-to-knowledge system, D2K, which supports all phases of the data mining process. D2K was designed to provide data mining professionals with a flexible and extensible architecture or "sandbox" for developing and evaluating the performance, accuracy, and relevance of a range of data mining techniques on a variety of data sets.
D2K modules implement a wide variety of input/output, user interface, data preparation, data mining, and results interpretation and evaluation functionalities. Modules are connected within a visual programming environment to form a data flow itinerary that defines a given data mining application. The D2K infrastructure provides the execution environment for the itinerary, and is responsible for scheduling modules and orchestrating control and data communications between modules.
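To make the idea of an itinerary concrete, the following is a minimal sketch of modules wired into a data flow and run in dependency order. It is not the D2K API: the `Module` and `Itinerary` classes, their methods, and the three example modules are hypothetical names invented for illustration, and the scheduling here is a plain sequential topological sort (Python 3.9+ `graphlib`), far simpler than what the D2K infrastructure is described as doing.

```python
from graphlib import TopologicalSorter


class Module:
    """Hypothetical module: a named function from upstream outputs to one output."""

    def __init__(self, name, func):
        self.name = name
        self.func = func


class Itinerary:
    """Hypothetical itinerary: modules plus the connections between them."""

    def __init__(self):
        self.modules = {}
        self.inputs = {}  # module name -> names of the modules feeding it

    def add(self, module, after=()):
        """Register a module and connect it to its upstream modules."""
        self.modules[module.name] = module
        self.inputs[module.name] = list(after)
        return self

    def run(self):
        """Execute every module once, upstream modules first."""
        results = {}
        for name in TopologicalSorter(self.inputs).static_order():
            args = [results[up] for up in self.inputs[name]]
            results[name] = self.modules[name].func(*args)
        return results


def build_example():
    """Wire three illustrative modules: input, preparation, and summary."""
    return (
        Itinerary()
        .add(Module("load", lambda: [3, None, 1, 2]))
        .add(
            Module("clean", lambda rows: [r for r in rows if r is not None]),
            after=["load"],
        )
        .add(
            Module("summarize", lambda rows: {"n": len(rows), "max": max(rows)}),
            after=["clean"],
        )
    )


if __name__ == "__main__":
    print(build_example().run()["summarize"])
```

Keeping the connections separate from the module code is what lets the same modules be recombined into different applications.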
For the D2K-SL project, we are building upon the current D2K modules and infrastructure to provide a set of pre-defined KDD applications that guide users through the data mining process. The D2K-SL applications will encompass three classic data modeling problems: prediction, discovery, and deviation detection. Users will access these applications through an intuitive interface that hides the complexities of module selection and connection, and presents the user with a "smart" software environment that will reduce, as much as possible, the learning curve associated with the data mining process. D2K-SL will also provide feedback throughout the KDD process, guiding and assisting users as they load and mine their datasets.
Similar to the TurboTax model, the user will be presented with choices relevant only to the data mining process they have selected, and will be allowed to return to earlier steps in the process to refine or reenter significant information. They will be able to perform "experiments" on their data and to build upon previous results. It is our hope that these efforts will allow the novice user to complete the steps of the KDD process and gain knowledge from their data, without overwhelming them with too much flexibility and control.
D2K-SL can utilize a relational database to store and manipulate the data being mined. For the initial release we plan to use Oracle, and we will investigate supporting MySQL and DB2 as time allows, either in the initial or in future releases. The KDD applications distributed with D2K-SL will be composed from D2K modules, and many of those modules will contain SQL commands that interact with a database. The user will not have to interact with the database directly, nor will they be aware of the underlying modules composing the application.
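As a rough illustration of a module that pushes data preparation into the database through SQL, consider the sketch below. It rests on several assumptions: Python's built-in `sqlite3` with an in-memory database stands in for Oracle, and the `customers` table, its columns, the age bands, and the function names are all hypothetical. The point is only that cleaning (dropping incomplete rows) and transformation (binning a numeric column) can run inside the database, so the data never has to be loaded into the application's memory.

```python
import sqlite3


def prepare_in_database(conn):
    """Hypothetical preparation module: clean and transform using SQL only."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS prepared;
        CREATE TABLE prepared AS
        SELECT id,
               income,
               CASE WHEN age < 30 THEN 'young'
                    WHEN age < 60 THEN 'middle'
                    ELSE 'senior'
               END AS age_group
        FROM customers
        WHERE income IS NOT NULL AND age IS NOT NULL;
        """
    )
    conn.commit()


def demo():
    """Load invented rows, run the preparation module, return the result."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real database server
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER, income REAL)"
    )
    conn.executemany(
        "INSERT INTO customers (id, age, income) VALUES (?, ?, ?)",
        [
            (1, 25, 30000.0),
            (2, 45, None),      # missing income: removed by the cleaning step
            (3, 62, 52000.0),
            (4, None, 41000.0),  # missing age: removed by the cleaning step
            (5, 38, 61000.0),
        ],
    )
    prepare_in_database(conn)
    rows = conn.execute("SELECT id, age_group FROM prepared ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

The caller only invokes `prepare_in_database`; the SQL stays inside the module, in the same way that D2K-SL users are not expected to interact with the database directly.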
It is our belief that D2K-SL will enable users to successfully extract knowledge that previously was undetected in their datasets. Furthermore, we believe that by taking advantage of database capabilities, the effort spent on the data preparation step will be dramatically reduced, allowing people to quickly begin mining their data with minimal frustration. With the proper tools, the users can concentrate their efforts on the interpretation of the information exposed. In addition, the D2K-SL product will allow a user to move from shared memory, to cluster, to grid computing by utilizing a database to provide parallel and distributed data access.


