Using the Workspace


The Workspace occupies the largest area of the D2K Toolkit window, and is the focus of user activity within the Toolkit. The functional components central to the D2K environment - modules, itineraries, and models - are activated by user actions within the Workspace. Examples of such actions include placing modules, connecting modules, setting module properties, and executing itineraries. This chapter is a guide to performing these actions.

Placing Modules

The first step in building an itinerary is to choose modules from the Resource Panel and place them into the Workspace. As described in the previous chapter, D2K follows a data flow paradigm, and each of the various D2K Module types relates to a step in the Knowledge Discovery in Databases (KDD) process. Consequently, the first modules you place will often be Input Modules or User Input Modules. To place one of these types of modules, click on the Modules tab in the Resource Panel to the left of the Workspace.

Modules are displayed in the Resource Panel by their Java package hierarchy or by D2K Module type. For the purposes of this tutorial, make sure your modules are being displayed by type. If they are not, change the Tree Type setting in the D2K Toolkit preferences. After the Modules pane opens, double click on a folder or single click on a tree node to expand the desired module type. Finally, click on a module and drag it over to the Workspace.

This same process can also be used to place entire itineraries as a single module icon into the Workspace. These are called "nested itineraries." The nested itinerary icon is shown in figure 12. To access the collection of itineraries in your toolkit, click the Itineraries tab in the Resource Panel.

Figure 12. Nested itinerary icon

Once modules have been placed in the Workspace, they can be moved by clicking and dragging the module icon. Be careful not to click on an input or output port as this will initiate another operation that will be discussed shortly. While placing and moving modules in the Workspace, you may find it helpful to enable the Snap to Grid option found in the Views menu. While this feature is turned on, you will be able to easily keep modules aligned in the Workspace. The grid itself can be turned on using the Show Grid option in the same menu. The grid size can be edited in the D2K Toolkit Preferences.

Selecting Modules

Modules can be selected by single clicking on them in the Workspace. When a module is selected, other selected items are deselected. If you wish to select multiple modules, hold down the control key as you make the additional selections, or click and drag in the Workspace to select a region of modules. While selected, modules can be moved about the Workspace or deleted using the delete or backspace key.

Labeling Modules

Module icons are displayed in the Workspace with an accompanying label that can be changed, as shown in figure 13. Editing the module label does not change its Java class name. However, the label must remain unique among the other module labels in the itinerary. Module labels are edited in a fashion similar to editing a filename in the Macintosh or Windows operating systems. Simply click on the module label, just below the module icon, and enter the desired text. When finished editing, click in the Workspace or on the module icon. When the module is selected in the Workspace, the label is referred to as the module's alias in the Component Info Pane.

Figure 13. Editing a module label
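
The two labeling rules above (a label must be unique within the itinerary, and renaming never changes the underlying Java class) can be pictured with a small sketch. This is an illustration only, not the Toolkit's implementation; the class `LabelRegistry`, its methods, and the example class names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of module labels (aliases); not the D2K API.
final class LabelRegistry {
    // Maps each unique label to the Java class name of the module it names.
    private final Map<String, String> classNameByLabel = new HashMap<>();

    // Adds a module under a label; fails if the label is already in use.
    void add(String label, String className) {
        if (classNameByLabel.containsKey(label)) {
            throw new IllegalArgumentException("Label already in use: " + label);
        }
        classNameByLabel.put(label, className);
    }

    // Renames a module; the class name it refers to is unchanged.
    void rename(String oldLabel, String newLabel) {
        if (!classNameByLabel.containsKey(oldLabel)) {
            throw new IllegalArgumentException("No such label: " + oldLabel);
        }
        if (!oldLabel.equals(newLabel) && classNameByLabel.containsKey(newLabel)) {
            throw new IllegalArgumentException("Label already in use: " + newLabel);
        }
        classNameByLabel.put(newLabel, classNameByLabel.remove(oldLabel));
    }

    // Returns the class name behind a label, or null if the label is unknown.
    String classNameOf(String label) {
        return classNameByLabel.get(label);
    }
}
```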

Connecting Modules

The power and usefulness of D2K Modules are only realized once they are connected together to form a larger system known as an itinerary. Constructing an itinerary directs the flow of data through a series of software modules, each performing some operation on the data.

The ports of two modules may only be connected if their data types are compatible with one another. Port data types are represented by their color. Each color relates to a different data type. All color/data type relationships are listed in the Legend, found in the Views menu. Although D2K ensures only legal port connections are made, errors resulting from data incompatibilities may occur at runtime.

Figure 14. Legend window
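
To see why a legal connection can still fail at runtime, consider the following sketch. It assumes, purely for illustration, that a connection is allowed when the input port's declared type is the same as, or a supertype of, the output port's declared type. That rule, the class `PortCheck`, and its methods are hypothetical and do not describe D2K's actual checks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; D2K's actual compatibility rules are not reproduced here.
final class PortCheck {
    // Assumed rule: an output may feed an input if the input's declared type
    // is the same as, or a supertype of, the output's declared type.
    static boolean canConnect(Class<?> outputType, Class<?> inputType) {
        return inputType.isAssignableFrom(outputType);
    }

    // A downstream module that accepts any List but expects String elements.
    static int totalLength(List<?> names) {
        int total = 0;
        for (Object name : names) {
            total += ((String) name).length(); // fails at run time if not a String
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(canConnect(ArrayList.class, List.class)); // true
        System.out.println(canConnect(String.class, List.class));    // false

        List<Object> mixed = new ArrayList<>();
        mixed.add("Ada");
        mixed.add(42); // the connection was legal, but the data is not what the module expects
        try {
            totalLength(mixed);
        } catch (ClassCastException e) {
            System.out.println("Runtime incompatibility: " + e.getMessage());
        }
    }
}
```

The connection check looks only at declared types, so it cannot know what a general-purpose container will actually hold when the itinerary runs.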

Readers familiar with the Java programming language should note that port data types are considered compatible when one of several conditions is met:

Before making a connection, first ensure the Connect Tool (Arrow) is selected in the Tool Bar. Then, drag from an output port of one module to the input port of another module to establish a simple module-to-module connection. Figure 15 demonstrates this action. To remove the connection, drag again from the output port to a vacant area of the Workspace.

Figure 15. Connecting two modules

Your data mining application will sometimes require more complex connections than can be created using the Connect Tool. D2K provides two other tools for creating special types of modules used to define connections between multiple outputs and a single input, or between a single output and multiple inputs.

To help you understand when this would be useful, imagine a simple application that needs to load all employee names from a database and print them to the console. In this scenario, we will be using two modules: one that loads the employee names from a single database, and one that prints the resulting strings to the console. A simple module-to-module connection, as described above, can be used to accomplish this objective.

Now suppose our company merges with another company and there are two databases containing employee names. Our application is still required to print all employee names to the console, but our data is no longer located in a single store. In a situation like this, we need a way to direct the output of two modules, each loading names from a different database, to the input of one module that will handle the printing. The Queue multiple inputs tool provides an elegant solution to this more complicated scenario.

Activate the Queue multiple inputs tool by clicking the icon to the right of the Connect Tool. Then, click and drag from a module output to a vacant area of the Workspace. When the mouse button is released, a special Queue multiple inputs module will appear. Control will then be returned to the Connect Tool, allowing additional module outputs to be connected to this special module. At any point after the Queue multiple inputs module has been placed, its output can be connected to a regular module input. Although regular modules having multiple inputs will only fire once all inputs have been satisfied, this is not the case with the Queue multiple inputs module. Instead, data is passed through the module as soon as it arrives at any one of the module inputs.

Figure 16. Queue multiple inputs module
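
The difference between the two firing rules can be sketched as follows. This is a simplified model under stated assumptions, not D2K code; the classes `FiringRules`, `TwoInputModule`, and `QueueModule` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the two firing rules; not the D2K API.
final class FiringRules {
    // A regular two-input module: fires only when both inputs hold a value.
    static final class TwoInputModule {
        private final Deque<String> left = new ArrayDeque<>();
        private final Deque<String> right = new ArrayDeque<>();
        final List<String> fired = new ArrayList<>();

        void offerLeft(String value) { left.add(value); fireIfReady(); }
        void offerRight(String value) { right.add(value); fireIfReady(); }

        private void fireIfReady() {
            while (!left.isEmpty() && !right.isEmpty()) {
                fired.add(left.poll() + "+" + right.poll());
            }
        }
    }

    // A queue-style module: forwards each item as soon as it arrives,
    // regardless of which input it arrived on.
    static final class QueueModule {
        final List<String> forwarded = new ArrayList<>();

        void offer(int inputIndex, String value) { forwarded.add(value); }
    }
}
```

In the employee example, each database loader would call `offer` on its own input, and every name would reach the printing module without waiting for the other loader.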

Activate the Generate multiple outputs tool by clicking the icon to the right of the Queue multiple inputs tool. Generating multiple outputs is achieved using a process very similar to queuing multiple inputs, but the result is quite different. Instead of aggregating multiple inputs into a single data stream, generating multiple outputs takes a single input stream and pushes it to all of its outputs. Although the data stream is conceptually, and visually, multiplied by the number of outputs, it is very important to note that "deep copies" of object data structures are not made as they pass through the module. Each output receives the same object reference. This is especially noteworthy if the data contained in the objects are to be manipulated at a later stage of the itinerary, because a change made along one branch is visible to every other branch. Passing references rather than copies is advantageous because it requires less memory and computation. If a deep copy of an object is needed, the operation should be performed by another module immediately following the module for generating multiple outputs.

Figure 17. Generate multiple outputs module
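
The consequence of sharing one object reference among all outputs can be demonstrated in a few lines of Java. The sketch below is illustrative only; the class `FanOutDemo` and the method `fanOut` are hypothetical stand-ins for the module's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the D2K API.
final class FanOutDemo {
    // Delivers the same object reference to every output, as the text describes.
    static <T> List<T> fanOut(T data, int outputCount) {
        List<T> outputs = new ArrayList<>();
        for (int i = 0; i < outputCount; i++) {
            outputs.add(data); // no copy is made
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("Ada", "Bob"));
        List<List<String>> outputs = fanOut(names, 2);

        outputs.get(0).add("Cy");                  // one branch modifies the data
        System.out.println(outputs.get(1).size()); // 3: the other branch sees the change

        // A branch that must modify the data can work on its own copy instead.
        // Copying the list is sufficient here because Strings are immutable.
        List<String> privateCopy = new ArrayList<>(outputs.get(1));
        privateCopy.clear();
        System.out.println(names.size());          // 3: the shared list is unaffected
    }
}
```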

The modules for queuing multiple inputs and generating multiple outputs are designed to be very general purpose. As a result, far fewer restrictions are placed on the type of data that may flow in or out of these modules. In the case of the module for queuing multiple inputs, the first established connection designates the data type for all remaining inputs as well as the output. In the case of the module for generating multiple outputs, the input data type designates all output data types.
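
As a rough sketch of how the first connection might designate the data type of a Queue multiple inputs module, consider the class below. The class `QueueTyping` is hypothetical, and the rule that later connections must be assignable to the designated type is an assumption made for illustration.

```java
// Hypothetical sketch; not the D2K API.
final class QueueTyping {
    private Class<?> designatedType; // null until the first connection is made

    // Connects an output of the given type. The first connection fixes the
    // type used by all remaining inputs and by the output.
    boolean connect(Class<?> outputType) {
        if (designatedType == null) {
            designatedType = outputType;
            return true;
        }
        // Assumption: later connections must be assignable to the designated type.
        return designatedType.isAssignableFrom(outputType);
    }

    // Returns the designated type, or null if nothing is connected yet.
    Class<?> designatedType() {
        return designatedType;
    }
}
```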

Setting Module Properties

Modules in the Workspace may have properties that can be set. This can be determined by looking for a "P" in the lower left hand corner of a module icon. With itinerary execution stopped, double click on a module icon to edit its properties. A property editor window will appear, as shown in figure 18.

Two methods are available for editing the properties of the modules contained in a nested itinerary. The first is to double click on the nested itinerary icon, which will display the nested itinerary, and then edit the module properties normally. The second is to hold down the control key (Shift key for Macintosh users) and double click on the nested itinerary icon. The latter option will cause a property editor window to appear containing the properties for all modules in the itinerary. Additionally, a special property for enabling or disabling parallelization is displayed at the top. If enabled, the D2K Infrastructure will run copies of the itinerary in parallel during normal itinerary execution. When you are done, close the property editor window to ensure changed properties take effect.

Each instance of a module with settable properties has its own set of properties. Editing the properties of one instance will not change the properties of another. Module properties are saved when an itinerary is saved, and restored when the itinerary is loaded again.

Figure 18. Module property editor

Executing Itineraries

Executing an itinerary is a very straightforward operation. With the itinerary loaded in the Workspace, click once on the Run button in the Tool Bar. Itinerary processing will begin immediately. When a run begins, the Process Busy Bar will animate above the itinerary execution buttons. Since the D2K Toolkit is unable to determine the itinerary's total execution time, the bar simply moves back and forth until execution is complete.

As the itinerary executes, module progress is represented with a color bar at the top of each module icon. The size of the color bar indicates the module's execution time as a percentage of the entire itinerary execution time. A green color bar signifies the module is currently executing. If the color bar is red, module execution has completed.
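
The relationship between a module's execution time and the size of its color bar is simple arithmetic, sketched below. The class `ProgressBars`, the method `shares`, and the use of milliseconds are assumptions for illustration and do not reflect how the Toolkit measures time internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the D2K API.
final class ProgressBars {
    // Returns each module's share of the total execution time, in percent.
    static Map<String, Double> shares(Map<String, Long> millisByModule) {
        long total = 0;
        for (long ms : millisByModule.values()) {
            total += ms;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : millisByModule.entrySet()) {
            // Guard against division by zero before any module has run.
            result.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return result;
    }
}
```

For example, if a loading module has run for 300 ms and a printing module for 100 ms, their bars would cover 75% and 25% of the icon width respectively.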

Clicking the Abort button will completely stop itinerary execution. Clicking the Pause button will temporarily halt execution. Neither state is necessarily reached immediately, because the D2K Toolkit must first allow modules to finish their current operations before processing is truly stopped. An itinerary is completely paused when the Pause button is dimmed or disabled. Itinerary processing has been aborted when only the Run button is enabled.

While an itinerary is paused, users may click the Checkpoint button to save itinerary progress. Support must be provided at the module level for a checkpoint to be successfully saved. A Save File dialog will appear requesting a file name and location for the saved checkpoint. The saved checkpoint file can then be loaded like any other itinerary, although it will not be available in the Itineraries Pane of the Resource Panel.
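
The delay between clicking Pause and the itinerary actually pausing can be understood with a simplified, single-threaded model in which a pause request is honored only between module operations. The class `CooperativeRunner` and its methods are hypothetical; the real D2K scheduler is multi-threaded and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of cooperative pausing; not the D2K scheduler.
final class CooperativeRunner {
    private final List<String> pending;
    private final List<String> completed = new ArrayList<>();
    private boolean pauseRequested;

    CooperativeRunner(List<String> moduleNames) {
        this.pending = new ArrayList<>(moduleNames);
    }

    void requestPause() { pauseRequested = true; }
    void resume() { pauseRequested = false; }

    // Runs modules in order. A pause request is checked only between modules,
    // so the module that is executing when the request arrives finishes first.
    // pauseWhileRunning names the module during which Pause is clicked (or null).
    void run(String pauseWhileRunning) {
        while (!pending.isEmpty() && !pauseRequested) {
            String module = pending.remove(0);
            if (module.equals(pauseWhileRunning)) {
                requestPause(); // simulates the user clicking Pause mid-module
            }
            completed.add(module); // the current operation runs to completion
        }
    }

    List<String> completed() { return completed; }
    List<String> pending() { return pending; }
}
```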

Figure 19. Machine usage

To optimize itinerary performance, it may be desirable to visualize thread and module activity on the machine, or machines, executing your itinerary. Doing so can help identify bottlenecks and poorly allocated system resources. Such problems are hardest to troubleshoot in distributed computing situations. Click Show Machine Usage in the Views menu to enable this feature in D2K. If enabled, a translucent display containing performance graphs for all machines executing the itinerary will overlay the Workspace, as shown in figure 19.

Each graph is labeled with the machine's name and is divided into an upper and a lower region by a brown line. Color bars drawn above the brown line during itinerary processing represent the number of modules waiting for a thread. A maximum of 3 waiting modules can be shown. Color bars drawn below the brown line during itinerary processing represent running threads. The maximum number of threads available to the machine, a setting in the D2K Toolkit preferences, is displayed to the left of the graph. Lines below the brown divider represent each available thread. As the itinerary executes, performance samples are graphed from left to right. The best possible machine utilization has been achieved when all threads are busy and no modules are waiting.

To minimize the overlay during itinerary processing, click the icon in the lower right hand corner. Dispose of the overlay by clicking the icon in the same location when processing has stopped.
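
How a single performance sample maps onto the graph can be sketched as follows. The class `UsageSample` and its fields are hypothetical; only the limit of 3 displayed waiting modules and the definition of best utilization come from the description above.

```java
// Hypothetical sketch of one performance sample; not the D2K API.
final class UsageSample {
    // The display shows at most 3 waiting modules above the divider.
    static final int MAX_WAITING_SHOWN = 3;

    final int maxThreads;     // thread limit from the Toolkit preferences
    final int runningThreads; // drawn below the brown divider
    final int waitingModules; // drawn above the brown divider

    UsageSample(int maxThreads, int runningThreads, int waitingModules) {
        this.maxThreads = maxThreads;
        this.runningThreads = runningThreads;
        this.waitingModules = waitingModules;
    }

    // Number of waiting modules the graph would draw for this sample.
    int waitingShown() {
        return Math.min(waitingModules, MAX_WAITING_SHOWN);
    }

    // True when all threads are busy and no modules are waiting.
    boolean fullyUtilized() {
        return runningThreads == maxThreads && waitingModules == 0;
    }
}
```

A sample with idle threads and waiting modules at the same time would suggest a scheduling or data dependency problem, while a sample with all threads busy and many modules waiting would suggest raising the thread limit if the hardware allows it.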