D2K - Data to Knowledge

D2K (Data to Knowledge) is a flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection with information visualization tools. It offers a visual programming environment in which users connect software modules together into a data flow. D2K supplies a set of software modules and application templates, along with a standard API (Application Programming Interface) for software module development. The software modules are reusable components, which facilitates collaboration among developers. These modules and the entire D2K environment are written in Java for flexibility and portability.

Advantages of using D2K to build data mining applications include reduced development time through module reuse and sharing, and access to D2K's distributed computing and parallel processing capabilities.

D2K Components

Figure 3 is a conceptual diagram of the many components that make up D2K. The three inner rings represent the core components, including the infrastructure. The outer ring represents the graphical user interface of the D2K Toolkit, as well as other applications that can be built using the core components.

Figure 3. D2K Components

D2K Infrastructure

The D2K Infrastructure is the processing engine that all other components utilize. It is responsible for controlling the scheduling and execution of knowledge discovery applications on the compute resources. The infrastructure also defines the D2K API.

D2K Modules

D2K Modules are software components that perform data preparation and transformation, implement discovery and learning algorithms, and assist in the interpretation and evaluation of results. A D2K Module is an independent unit that may take input and produce output. Figure 5 identifies the six types of D2K Modules. Each type addresses a step of the knowledge discovery process.
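
The idea of a module as an independent unit with typed inputs and outputs can be sketched in Java as follows. This is a minimal illustration only: the interface SketchModule, its method names, and the MinMaxNormalize class are hypothetical and are not taken from the D2K API. The example behaves like a data prep module that rescales a numeric column.

```java
import java.util.Arrays;

// Hypothetical interface for illustration only; this is not the actual D2K API.
interface SketchModule {
    String[] inputTypes();   // one entry per input port
    String[] outputTypes();  // one entry per output port
    // Consumes one value per input port and produces one value per output port.
    Object[] execute(Object[] inputs);
}

// A data prep style module: rescales a double[] to the range [0, 1].
class MinMaxNormalize implements SketchModule {
    public String[] inputTypes()  { return new String[] {"double[]"}; }
    public String[] outputTypes() { return new String[] {"double[]"}; }

    public Object[] execute(Object[] inputs) {
        double[] values = (double[]) inputs[0];
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Guard against a constant column, where max == min.
            scaled[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return new Object[] {scaled};
    }
}
```

Because the module declares its port types separately from its logic, an environment could check that two modules are compatible before connecting them. How D2K itself performs that check is not described here.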

Figure 4. D2K Module Icon

Figure 4 illustrates the various parts of a module icon. The graphic in the center of the icon represents the module's type. Input and output ports are indicated by small rectangles, and the color of each port specifies the type of data flowing in or out. A progress bar across the top of each module displays that module's execution time as a percentage of the entire itinerary execution time. The bar turns green while the module is executing, then red when it completes. If a "P" is shown, the module has properties that can be set.
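
The progress bar value described above is a simple ratio. The following sketch shows one way it could be computed; the class and method names are hypothetical, and the handling of a zero total is an assumption rather than documented D2K behavior.

```java
// Hypothetical helper; not part of the D2K API.
class ProgressShare {
    // Returns the share of total itinerary time spent in one module, as a percentage.
    static double percentOfItinerary(long moduleMillis, long itineraryMillis) {
        if (itineraryMillis <= 0) {
            return 0.0; // assumption: report 0 when no itinerary time has been recorded
        }
        return 100.0 * moduleMillis / itineraryMillis;
    }
}
```

For example, a module that ran for 250 ms in an itinerary that took 1000 ms in total would show 25 percent.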

D2K Module Types

D2K modules are organized into six categories, enabling users to easily identify the functionality of any given module in an application.

Figure 5. D2K Module Types

User Input

User Input modules require interaction with the user. They generally present a dialog where the user can enter information or manipulate the data in a specific way. For example, modules used to identify how data is to be binned would be user input modules.

Input

Input modules load data from files or databases. For example, modules that read a file, crawl the web, or read from a database would be input modules.

Output

Output modules save data to files or databases or show it to the user on the screen. For example, modules that write files or print information to the console would be output modules.

Data Prep

Data Prep modules perform functions that prepare the data for analysis. Data selection, data cleaning, and data transformation algorithms would be of this type. For example, modules that perform binning or normalization of data would be data prep modules.

Compute

Compute modules typically perform the main calculations for the application. For example, algorithms used to solve data mining problems would be implemented as compute modules.

Visualization

Visualization modules provide visual feedback to the user. For example, modules that display a scatterplot graphical view of the data or a decision tree visualization of the data mining results would be visualization modules.
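
As a rough illustration of how the six categories partition a set of modules, the sketch below tags a few modules with a type and looks them up by category. The module names are invented to match the examples in the text, and the ModuleType and ModuleCatalog names are hypothetical, not D2K classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The six categories named in this section.
enum ModuleType { USER_INPUT, INPUT, OUTPUT, DATA_PREP, COMPUTE, VISUALIZATION }

class ModuleCatalog {
    // Hypothetical module names, chosen to mirror the examples given in the text.
    static final Map<String, ModuleType> EXAMPLES = new LinkedHashMap<>();
    static {
        EXAMPLES.put("ChooseBins", ModuleType.USER_INPUT);
        EXAMPLES.put("ReadFile", ModuleType.INPUT);
        EXAMPLES.put("WriteFile", ModuleType.OUTPUT);
        EXAMPLES.put("BinColumns", ModuleType.DATA_PREP);
        EXAMPLES.put("Normalize", ModuleType.DATA_PREP);
        EXAMPLES.put("BuildDecisionTree", ModuleType.COMPUTE);
        EXAMPLES.put("ScatterPlot", ModuleType.VISUALIZATION);
    }

    // Lists the module names registered under one category, in insertion order.
    static List<String> namesOfType(ModuleType type) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, ModuleType> e : EXAMPLES.entrySet()) {
            if (e.getValue() == type) {
                names.add(e.getKey());
            }
        }
        return names;
    }
}
```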

D2K Itineraries

D2K Itineraries are essentially applications composed of D2K Modules connected together in the D2K Toolkit. Itinerary complexity is limited only by the needs of your project. Pre-built itineraries can be loaded into the D2K Toolkit and incorporated into other itineraries as "nested itineraries." A number of itineraries for use in a variety of data mining problems and domains have been developed and are distributed with the Toolkit.
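
The composition of modules into an itinerary, and the reuse of one itinerary inside another, can be sketched as follows. This is a simplified model under stated assumptions: a linear chain stands in for D2K's general data flow graph, each step is a plain java.util.function.Function, and the Itinerary and ItineraryDemo classes are hypothetical rather than part of the D2K API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical model: an itinerary is an ordered chain of steps.
// It is itself a Function, so it can be used as a step in another itinerary.
class Itinerary implements Function<Object, Object> {
    private final List<Function<Object, Object>> steps = new ArrayList<>();

    // Appends a step and returns this itinerary so calls can be chained.
    Itinerary then(Function<Object, Object> step) {
        steps.add(step);
        return this;
    }

    // Passes the input through every step in order.
    @Override
    public Object apply(Object input) {
        Object value = input;
        for (Function<Object, Object> step : steps) {
            value = step.apply(value);
        }
        return value;
    }
}

class ItineraryDemo {
    static Object run() {
        // A small pre-built itinerary: trim, then lowercase.
        Itinerary cleanup = new Itinerary()
            .then(s -> ((String) s).trim())
            .then(s -> ((String) s).toLowerCase());

        // A larger itinerary that nests the first one, then counts words.
        Itinerary full = new Itinerary()
            .then(cleanup)
            .then(s -> ((String) s).split("\\s+").length);

        return full.apply("  Data To Knowledge  ");
    }
}
```

Making the itinerary implement the same interface as its steps is what allows nesting in this sketch: the outer itinerary does not need to know whether a step is a single module or a whole sub-itinerary.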

D2K Toolkit

The D2K Toolkit provides the most flexible and feature-rich interface for setting up itineraries and controlling knowledge discovery tasks. Very complex data flow graphs can be built to compare the accuracy of different data mining methods, to visualize results, and to save models for later use.

Domain-Specific D2K-Driven Applications

A Domain-Specific D2K-Driven Application presents the knowledge discovery process via an interface that has been completely tailored to a particular domain, hiding the details of module selection and itinerary composition while taking advantage of the underlying D2K Infrastructure. These applications employ D2K functionality in the background, using modules to dynamically construct applications. They present their own specialized user interfaces specific to the tasks being performed.

D2K Streamline (SL)

D2K Streamline (SL) is a new software effort that hides the full complexity of the D2K Toolkit and instead presents the user with pre-configured options corresponding to the steps of the knowledge discovery in databases (KDD) process. The pre-configured options are composed of itinerary stubs and can be tailored to a particular user community. Specific applications can be packaged and presented within the SL framework to a given user community. One of the primary goals in the development of SL is to provide easy technology transfer, so that researchers and collaborators can continue to use software modules already developed by the Automated Learning Group (ALG).

Features & Functionality

Major features that D2K provides to an application developer include:

Using the D2K Toolkit

The remainder of this document focuses on the D2K Toolkit. Toolkit features and functionality are explained in detail to prepare the reader to use the system to solve their own data mining problems.