What We Do

The primary goal of the Automated Learning Group is to extend the state of the art in the field of data mining. Toward that end, we collaborate with researchers to invent new approaches and tools that will become the basis for future commercial software. Development efforts are primarily fueled by data and problems brought to us by our industrial, government, and academic partners. The algorithms and solutions developed are then made available to partners and collaborators through web repositories, tutorials, and direct collaboration with ALG group members. By this process our partners have access to new methods long before they become commercially available.


Data Mining

Rapid advances in both data storage and data capture technologies have resulted in a marked increase in the amount of data being stored in both the business and scientific sectors. Because of these advances, millions of records are being generated, each of them containing tens or hundreds of fields. Many of these data sets are expanding on a daily basis. Traditional analyses of these types of data sets, involving human experts who manually analyze the data, are clearly no longer adequate.

The field of Data Mining has developed in response to the need for machine-oriented, automated methods for analyzing large data sets. Data mining combines work from areas such as statistics, machine learning, pattern recognition, databases, and, more recently, high performance computing. The goal of data mining is to discover interesting and previously unknown information in data sets. Tools for data mining have the ability to parse enormous amounts of data and discover significant patterns and relationships that might otherwise have taken a human being thousands of hours to find.

The methods and applications of data mining are still in their infancy. The next decade will produce a revolution in the use of archived, simulation, and near-real-time data to guide future decisions and research directions.


Research Interests

The field of data mining is relatively new, and there are many areas to explore in terms of methodology and applications. Some examples of basic research of interest to the Automated Learning Group include:

Mining Large-Scale Text Repositories

The need to mine large text corpora continues to grow in importance as the amount of information we produce continues to expand at exponential rates. It is becoming impossible even for experts to monitor the totality of pertinent information that is generated annually in their fields. We are drowning in data. Text mining is prepared to offer some relief.

We are interested in research leading to methods and tools that enable people to understand and organize large amounts of text and to extract the information they need to enhance understanding and foster more effective decision making. We are actively researching unsupervised and supervised learning approaches applied to very large-scale static and streaming data sets. We are focusing on methods that scale across distributed text and mixed-mode data collections. We are also interested in approaches that extend beyond the standard bag-of-words approaches toward more human-like understanding of natural language.
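
As a small illustration of why the bag-of-words representation falls short of human-like understanding, the sketch below (the function name and sentences are made up for illustration, not ALG code) counts words while discarding their order, so two sentences with opposite meanings become indistinguishable.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order entirely."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings (illustrative examples).
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences reduce to the same counts, so the representation
# cannot tell who bit whom.
same_representation = (a == b)
```

Methods that go beyond this representation need to retain some of the structure that the counts throw away.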

Development of Layered Learning Approaches

Given the explosion of available data mining algorithms, a method for finding the best single algorithm or combination of algorithms to solve a problem is becoming increasingly imperative. From an optimization point of view, each D2K itinerary that uses data mining algorithms to solve a problem represents a point in a decision space. Research is underway on methods to automatically select, within D2K, the best algorithmic solution for a given data mining problem.

Optimization algorithms are designed to find the best point(s) in decision space given one or more objectives. Most existing optimization algorithms operate in parameterized Euclidean decision spaces. For this reason a directed graph of D2K modules cannot be directly searched; however, the parameter space of a given D2K itinerary can be. For example, a simple D2K itinerary may consist of two basic modules, one that does a transformation on the data and one that applies a supervised learning algorithm on the transformed examples. Each module has a ParameterSpace that the optimizer can search. ParameterSpaces of each module in the itinerary are combined to create a global ParameterSpace which can be searched by a single optimization algorithm.
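
The idea of merging per-module parameter spaces into one searchable space can be sketched as follows. This is a minimal illustration in Python, not D2K's Java API: the `ParameterSpace` class shown here, the `combine` and `grid` helpers, the module names, and the toy objective are all assumptions made for the example, and a plain grid search stands in for whatever optimizer is actually used.

```python
import itertools
from dataclasses import dataclass


@dataclass
class ParameterSpace:
    """Hypothetical stand-in: parameter name -> (low, high) bounds."""
    bounds: dict


def combine(spaces):
    """Merge per-module spaces into one global space.

    Parameter names are prefixed with the module name to avoid clashes.
    """
    merged = {}
    for module_name, space in spaces.items():
        for param, rng in space.bounds.items():
            merged[f"{module_name}.{param}"] = rng
    return ParameterSpace(merged)


def grid(space, steps):
    """Yield evenly spaced points covering the whole space."""
    names = list(space.bounds)
    axes = []
    for name in names:
        low, high = space.bounds[name]
        axes.append([low + i * (high - low) / (steps - 1) for i in range(steps)])
    for values in itertools.product(*axes):
        yield dict(zip(names, values))


def toy_objective(point):
    """Made-up error surface with its minimum at scale=2.0, k=5.0."""
    return (point["transform.scale"] - 2.0) ** 2 + (point["learner.k"] - 5.0) ** 2


# A two-module itinerary: a data transformation followed by a learner.
transform_space = ParameterSpace({"scale": (0.0, 4.0)})
learner_space = ParameterSpace({"k": (1.0, 9.0)})
global_space = combine({"transform": transform_space, "learner": learner_space})

# A single search over the combined space tunes both modules at once.
best = min(grid(global_space, steps=5), key=toy_objective)
```

Because the search runs over the combined space, interactions between the transformation and the learner are taken into account rather than each module being tuned in isolation.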

Using optimizers to search a large decision space for the best data mining approach produces more robust learning systems that, on average, produce better results than any single data mining approach. There are, however, two downsides. First, the cost of searching for the best solution can be prohibitive compared to trying a single solution. Second, if care is not taken, the use of optimizers can lead to overfitting of the data and optimistic accuracy estimates. To address the first problem, D2K allows a network of machines to work on the same problem simultaneously, which reduces computation time for those with multiple CPUs. To address the second problem, ALG uses a two-tier cross-validation approach, which increases the computational cost by yet another order of magnitude but can reliably measure the accuracy of solutions that involve optimization of itinerary parameters.
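
One common way to realize a two-tier scheme is nested cross-validation, sketched below in plain Python. This is an assumption about the general technique rather than a description of ALG's implementation: the k-nearest-neighbor learner, the fold layout, the candidate values, and the synthetic data are all chosen only for illustration. The inner tier tunes a parameter using training data alone, and the outer tier scores the tuned model on data the tuning never saw.

```python
def k_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def split(data, fold):
    """Return (train, test) lists for one held-out fold of indices."""
    held = set(fold)
    train = [d for i, d in enumerate(data) if i not in held]
    test = [data[i] for i in fold]
    return train, test


def cross_val(data, k_param, n_folds):
    """Inner tier: mean accuracy for one fixed parameter value."""
    scores = []
    for fold in k_folds(len(data), n_folds):
        train, test = split(data, fold)
        scores.append(accuracy(train, test, k_param))
    return sum(scores) / len(scores)


def nested_cross_val(data, candidates, outer_folds, inner_folds):
    """Outer tier: score the tuned model on data unseen during tuning."""
    outer_scores = []
    for fold in k_folds(len(data), outer_folds):
        train, test = split(data, fold)
        # The parameter is chosen using the outer training data only.
        best_k = max(candidates, key=lambda k: cross_val(train, k, inner_folds))
        outer_scores.append(accuracy(train, test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Synthetic, cleanly separable data: label is 1 when x >= 1.0.
data = [(x / 10, 1 if x >= 10 else 0) for x in range(20)]
estimate = nested_cross_val(data, candidates=[1, 3, 5], outer_folds=4, inner_folds=3)
```

The extra cost is visible in the structure: every outer fold runs a full inner cross-validation for every candidate, which is why the approach multiplies the work of an already expensive search.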

Parameter optimization and layered learning are applied in all of ALG's supervised learning applications. Most recently, we have been applying them to mining gene activity from microarrays and to analyzing sensor data.

Mining Image Data

ALG research scientists along with university scientists on multiple campuses work together to create applications that are not only highly useful in the given domain, but as easy and intuitive to use as possible. One domain area within our research focus is that of image analysis. Image data are complex and highly heterogeneous, and intelligent analysis of image data remains a problem area with few functional software tools. The ALG is seeking to address this gap with the development of the I2K - Image to Knowledge suite of tools.

The objective of the I2K suite of tools is to research and develop solutions to real-life problems in the application areas of machine vision, synthetic aperture radar (SAR) target and scene modeling, precision farming, land use and land cover classification, map analysis, video surveillance, bio-informatics, microscopy and medical image processing, geographic information systems (GIS), and advanced sensor environments. The main goal of the I2K research and development is to automate information processing of repetitive, laborious, and tedious analysis tasks and to build user-friendly decision-making systems that operate in automated or semi-automated mode in a variety of applications. The development is based on theoretical foundations of image and video processing, computer vision, statistical modeling, data mining and pattern recognition, software engineering, and sensor design.

I2K tools have been successfully demonstrated in application areas such as bio-informatics, geographic information systems (GIS), hyperspectral image analysis in precision farming, map analysis, and video surveillance.

Using Genetic Algorithms, Interactive Genetic Algorithms and Human-Based Genetic Algorithms to Achieve Policy-making Goals

Modern times challenge organizations and their leaders to adapt quickly to complex circumstances under trying conditions. Data sources are numerous, distributed, and contradictory. Problems are difficult to detect and diagnose, widely dispersed, and constantly changing. Sources of knowledge and expertise are distributed, of varying quality, and difficult to integrate. Moreover, the tools of the trade are increasing in technological sophistication and computational intensity, and they require specialized hardware, software, and maintenance.

The Automated Learning Group is interested in exploring the potential for addressing these issues through the use of Genetic Algorithms (GAs). By applying GAs to a complicated problem domain, stakeholders can generate potential solutions that take into account multiple constraints and requirements.

Currently, the ALG is working with the Illinois Genetic Algorithms Lab (David Goldberg) to develop an environment (DISCUS) for interactive problem solving using Interactive GAs and Human-based GAs integrated into the D2K environment.

In addition, ALG has collaborated with Faculty Fellow Barbara Minsker (Environmental Engineering - UIUC) to develop an environment called EMO. EMO, which stands for Evolutionary Multi-Objective Optimization, provides an application framework with a unique user interface that helps users move through the optimization process in a clear and simplified way. The framework was designed and built so it could be easily re-used for a variety of applications such as engineering design and financial planning. EMO is built on the Automated Learning Group's D2K and D2K-SL software, providing all the sophistication and power of the D2K infrastructure.
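
Multi-objective optimization generally rests on the notion of Pareto dominance: with several competing objectives there is usually no single best answer, only a set of trade-offs. The sketch below illustrates that concept alone. It is not EMO's code and contains no evolutionary search; the function names and the (cost, risk) values are invented for the example, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up candidate designs scored as (cost, risk).
designs = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
front = pareto_front(designs)
```

An evolutionary multi-objective optimizer typically evolves a population toward such a front and then leaves the final choice among the trade-offs to the user.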

Information Design and Visualization

Data mining applications often involve large data sets and multi-dimensional results that can lead to information overload for the user. We believe that to maximize the effectiveness of applications developed within ALG, design and visualization must be included in the process. Incorporating perception techniques of visual hierarchy helps to clarify and group information and speed the interpretation process. Providing interaction methods that allow users to shift the hierarchy depending on their needs makes systems more powerful and customizable. Exploring drill-down methods for traversing levels of detail in information also helps enable user comprehension of information. Ultimately, successful visualization techniques act to simplify data mining results and help users gain knowledge from the data.

The ALG has explored and implemented information design and visualization techniques in many 2-D displays of data. These include Visualization Modules developed for inclusion in the D2K Toolkit, such as visualizations for Naïve Bayesian analyses, Decision Trees, and Apriori, as well as several visualizations developed for Deviation Detection problems. We are currently working on a project with the Mid-America Earthquake (MAE) Center to develop an interactive visualization tool for earthquake risk assessment across regions using D2K in combination with VTK or Java Visualization Tool Kit. In addition, we have several software development projects underway that have benefited from extensive concept and design phases before developer implementation. This, we believe, helps ensure the most intuitive, functional software environment possible.


D2K - Data to Knowledge

To facilitate our research activities, ALG has spent the last few years developing the D2K application environment for data mining. D2K is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection with data and information visualization tools. It offers a visual programming environment that allows users to connect programming modules together to build data mining applications, and it supplies a core set of modules, application templates, and a standard API for software component development. All D2K components are written in Java for maximum flexibility and portability. D2K is currently at version 4.1.








    Copyright © 2004 The Board of Trustees of the University of Illinois, All Rights Reserved