This page is under incremental refinement.
Occasionally, students need to do a project, whether at the course, master's, or doctoral level. When this happens, they often want to know what projects might interest a supervising professor. Generally, it is a good idea for a student to work on things of particular interest to his or her major professor.
The idea of this page is to provide a partial list of what might be of
interest to me.
My own current research interests are focused on aspects of unsupervised data mining, soft computing, natural language, and expert systems. My approach centers on learning.
I am also interested in a variety of areas that are not a major current research focus of mine. If you have an interest in something not listed here, ask me. Perhaps something can be worked out.
The following project ideas are derived from both my research focus and from tangential interests. The amount of work varies significantly from project to project. If something strikes your fancy, please get in contact with me and we can talk about it.
Search a classically organized (formatted) database for relationships between data items that may or may not be explicitly stated in a logical description (schema/view) of the data. The goal is to discover "surprising" relationships. This is particularly interesting for large collections of data with many different attributes. There are several areas of interest.
Data mining can generate many rules. Various strategies can be applied to reduce them; among the most popular is setting heuristic thresholds of various kinds. These thresholds are generally set by a user and applied to all of the rules generated. User-set thresholds are unsatisfactory because (a) the user may be naive and may not set effective thresholds, and (b) the thresholds may not be optimally effective. There is also a problem in using the same threshold setting for all rules regardless of the frequency of the underlying data: useful rules for data attributes that are not common may be ignored.
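As a minimal sketch of the problem, consider global thresholds applied uniformly to every rule (the rules and threshold values below are invented for illustration):

```python
# Sketch: filtering mined association rules with user-set global thresholds.
# The rules and the numbers are illustrative, not from a real dataset.

rules = [
    # (antecedent, consequent, support, confidence)
    ("bread", "milk", 0.40, 0.80),
    ("caviar", "champagne", 0.01, 0.90),  # rare but very strong
    ("beer", "diapers", 0.15, 0.60),
]

MIN_SUPPORT = 0.10      # one global threshold, as criticized above
MIN_CONFIDENCE = 0.70

kept = [r for r in rules
        if r[2] >= MIN_SUPPORT and r[3] >= MIN_CONFIDENCE]

# The rare caviar -> champagne rule is discarded despite its high
# confidence, showing how one global support threshold hides rules
# over infrequent attributes.
for ante, cons, sup, conf in kept:
    print(f"{ante} -> {cons} (support={sup}, confidence={conf})")
```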
Most association rules are developed from categorical 0/1 data. That is, scalar values such as 1, 5, 23, etc. are all reduced to a single categorical value of either 0 or 1, and the association rules are then formed from the categorical values. Obviously, potentially useful results may be lost. The most common way of handling scalar quantities is through binning. It would be interesting to compare the various binning strategies.
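A minimal sketch contrasting two common binning strategies (the data values and bin count are illustrative):

```python
# Sketch comparing two common binning strategies for scalar attributes
# before reduction to categorical values. Data is invented.

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign roughly the same number of values to each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

values = [1, 2, 3, 4, 5, 100]          # one outlier
print(equal_width_bins(values, 3))     # the outlier dominates the range
print(equal_frequency_bins(values, 3)) # the bins stay balanced
```

With equal-width binning, the single outlier pushes all the ordinary values into one bin; equal-frequency binning keeps the bins populated evenly. Which behavior is preferable depends on the data, which is exactly why the comparison is worth studying.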
Most data in the world is qualitative. However, most data mining techniques are designed to handle non-qualitative data. An important question is how to mine qualitative data.
Data mining does not usually attempt to recognize causality in the data. Most of the time, the goal is to recognize when different data items are closely associated. For example, it might turn out that when someone goes to the grocery, they usually buy both bread and milk together (at the same time). We wouldn't normally say that someone bought the bread because they bought the milk, or the other way around. For another example: it might turn out that when someone bought strawberries, they also bought whipped-cream. We might be willing to say that (a) they bought the whipped-cream because they bought the strawberries and (b) the strength of causality was greater for strawberries -> whipped-cream than for whipped-cream -> strawberries. Consequently, there are several research concerns.
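One simple way to quantify that asymmetry is to compare rule confidence in both directions; a sketch, using made-up transactions:

```python
# Sketch: directional confidence as a weak hint of causal direction.
# The transaction data is invented for illustration.

transactions = [
    {"strawberries", "whipped-cream"},
    {"strawberries", "whipped-cream"},
    {"strawberries"},
    {"whipped-cream", "milk"},
    {"whipped-cream", "bread"},
    {"whipped-cream"},
]

def confidence(antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    with_ante = [t for t in transactions if antecedent in t]
    hits = sum(1 for t in with_ante if consequent in t)
    return hits / len(with_ante)

c1 = confidence("strawberries", "whipped-cream")  # 2 of 3 baskets
c2 = confidence("whipped-cream", "strawberries")  # 2 of 5 baskets
# c1 > c2: buying strawberries predicts whipped-cream more strongly
# than the reverse, the asymmetry discussed above. Of course,
# asymmetric confidence alone does not establish causality.
```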
Data values stored in a database can become inconsistent. The initial difficulty is to recognize when a database inconsistency exists. Then, the task is to correct the inconsistency by use of artificial intelligence techniques. There are many different types of inconsistencies. Some of them should be caught by good input editing when whole records are added, but are not. Some are caused by changes to individual attributes. Some of the inconsistencies are subtle; some are straightforward. For example: (a) intentionally identical values/inferences can become different, (b) male employees listed as having borne children, (c) employees of a certain job title having a base salary outside of the range allowed for that job title, (d) person#1 shown as parent of person#2 while person#2 is shown as parent of person#1, (e) chains of subtotals wrong; i.e., salaries of x, y, z for a department and total department salary <> x+y+z, (f) an exempt employee being paid overtime, (g) a person shown as born in 1966 but hired in 1965, (h) a person born in 1967 shown as retired in 1966, etc.
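A few of the straightforward checks above can be sketched as simple rules over hypothetical records (the field names are assumptions, not a real schema):

```python
# Sketch: rule-based detection of some of the inconsistencies listed
# above, e.g. (e) subtotal mismatch, (f) exempt overtime, and
# (g) hired before born. Record fields are hypothetical.

def check_employee(rec):
    problems = []
    if rec["hired_year"] < rec["born_year"]:
        problems.append("hired before born")
    if rec.get("exempt") and rec.get("overtime_pay", 0) > 0:
        problems.append("exempt employee paid overtime")
    return problems

def check_department(dept):
    problems = []
    if sum(dept["salaries"]) != dept["total_salary"]:
        problems.append("department subtotal wrong")
    return problems

emp = {"born_year": 1966, "hired_year": 1965,
       "exempt": True, "overtime_pay": 0}
dept = {"salaries": [40000, 50000, 60000], "total_salary": 155000}
print(check_employee(emp))     # ['hired before born']
print(check_department(dept))  # ['department subtotal wrong']
```

The research interest is, of course, in the subtle cases that such hand-written rules miss, and in correcting (not just flagging) the inconsistency.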
Partition a database using soft computing techniques to:
In the context of a knowledge-based system that is used to control clear text, information retrieval mechanism, determine if:
Do this by
Access different databases that use heterogeneous (i.e., different) DBMS, operating systems, data models, and platforms. Use a knowledge-based system to support cross translation and use.
Expert system rules are currently acquired either directly from human experts or by examining machine-stored cases. Rules can be considered a directed graph with a variety of arc weights and node types. This project would examine the possibility of learning the rules by considering inputs and results. The approach is similar to learning a Bayes net or a connectionist network. For example, a project might take football game statistics (available from the NFL) and project winners and margins of victory. The project would start with a complex graph of conjunctions and disjunctions and would learn the relative weights.
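Reducing the idea to its simplest case, a graph with a single layer of weighted arcs, weight learning might be sketched with perceptron-style updates (the features and game outcomes are invented; a real project would use the deeper conjunctive/disjunctive graph described above):

```python
# Sketch: learning arc weights from inputs and results by
# perceptron-style mistake-driven updates. Games are invented;
# each is (feature vector, 1 if home team won else -1).

games = [
    ([1.0, 0.2], 1),
    ([0.1, 0.9], -1),
    ([0.9, 0.1], 1),
    ([0.2, 1.0], -1),
]

weights = [0.0, 0.0]
for _ in range(20):                      # a few passes over the data
    for x, y in games:
        score = sum(w * xi for w, xi in zip(weights, x))
        pred = 1 if score >= 0 else -1
        if pred != y:                    # adjust weights on a mistake
            weights = [w + 0.1 * y * xi for w, xi in zip(weights, x)]
```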
Consider a knowledge-based system's rules to be at least partially interdependent. The question is how to maintain consistency between the rules, their inferences, and their conclusions.
Consider the effectiveness of using window/non-window systems. Consider the effectiveness of developing an expert system using teams of technically knowledgeable and non-knowledgeable workers, contrasted with (a) separate development by people who are not domain knowledgeable and (b) development by people who are not technically knowledgeable. Accomplish this by adding a user interface to an existing knowledge-based system and empirically measuring the results.
A data-driven knowledge-based system (or expert system) can be described as a directed graph from the knowledge sources (in the figure below: e,f,g, ... ,r) to the decision states (in the figure below: b,c,d). The decision state with the greatest terminal value is considered the best decision. The arcs in the graph are weighted and the evidence is combined by various conjunctive and disjunctive functions. The graph may be of various depths, with intermediate evidence combinations between the knowledge sources and the decision states. An opportunistic knowledge-based system would come to a decision with as little source resolution as is consistent with a satisfactory resolution. Possible resolution heuristics and algorithms have been developed and are available as a working paper.
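Choosing fuzzy min/max as one possible pair of conjunctive and disjunctive functions, the evidence combination might be sketched as follows (the node names follow the figure; the source values and the particular graph wiring are invented):

```python
# Sketch: combining weighted evidence from sources toward decision
# states, using fuzzy min for conjunction and max for disjunction.
# Source values and graph wiring are illustrative.

sources = {"e": 0.9, "f": 0.4, "g": 0.7, "h": 0.2}

def AND(*vals):   # fuzzy conjunction
    return min(vals)

def OR(*vals):    # fuzzy disjunction
    return max(vals)

decisions = {
    "b": AND(sources["e"], sources["f"]),                    # 0.4
    "c": OR(sources["f"], sources["h"]),                     # 0.4
    "d": AND(sources["g"], OR(sources["e"], sources["h"])),  # 0.7
}

# The decision state with the greatest terminal value wins.
best = max(decisions, key=decisions.get)
print(best, decisions[best])
```

An opportunistic version would stop resolving sources as soon as no unresolved source could change which decision state wins.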
There are several substantially different software engineering techniques. A reasonable speculation is that some methods are better for certain kinds of problems. Focusing on requirements analysis, the question is: Are some methodologies better than others for particular problem domains?
The goal is to have the machine read natural language texts and develop the equivalent of human concepts using unsupervised learning.
The goal is to recognize when a joke occurs in written material. Various kinds of humor could be considered. One form of humor is humor by disappointed expectation. The idea is that a reader expects one thing, and instead is presented with another. The humor comes from certain types of surprise.
The issue is whether or not all possible aspects of multi-media learning are useful. If all aspects are useful, are some aspects more useful? Can student pre-testing indicate what learning aspects might be the most useful? Multi-media aspects include: motion, sound, text, and graphics.
Develop a language learning tool that would correct the language learner by matching desired sound patterns with patterns actually produced by the learner.
Given non-scalar data that can be ordered, form clusters of the data. The data and the results both may be linguistic variables. Some form of soft computing will be required.
Form clusters from data without knowing either the desired count of clusters or the seeds (i.e., approximate starting points). Some form of soft computing will be required. The goal is to accept a vector of scalar data and to (a) identify the count of centers and (b) identify the cluster centers.
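One soft-computing approach that identifies both the count of centers and the centers themselves is subtractive clustering; a one-dimensional sketch (the radii, stopping ratio, and data are illustrative choices):

```python
import math

# Sketch: subtractive clustering, which needs neither a cluster count
# nor seeds. Each point's "potential" is its density of neighbors;
# the highest-potential point becomes a center, nearby potential is
# suppressed, and the process repeats until potential falls off.

def subtractive_clustering(points, ra=1.0, stop_ratio=0.15):
    rb = 1.5 * ra                       # suppression radius
    potential = [
        sum(math.exp(-4 * (x - y) ** 2 / ra ** 2) for y in points)
        for x in points
    ]
    centers = []
    first_peak = max(potential)
    while True:
        i = max(range(len(points)), key=lambda j: potential[j])
        if potential[i] < stop_ratio * first_peak:
            break                       # remaining potential too weak
        c = points[i]
        centers.append(c)
        # suppress potential near the accepted center
        potential = [
            p - potential[i] * math.exp(-4 * (x - c) ** 2 / rb ** 2)
            for p, x in zip(potential, points)
        ]
    return centers

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]   # two obvious groups
print(subtractive_clustering(data))      # finds two centers
```

Note that both (a) the count of centers and (b) the centers themselves fall out of the procedure, matching the project goal, at the cost of choosing the radius parameters.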
A common problem in databases is duplicate data. One example that we see often is when we get the same catalog twice from a mail-order company: maybe the company bought your name from two different sources, or maybe you bought something from them two different times and your name was recorded differently. In any case, it would be useful for them to eliminate the duplicates, both to save money on mailing costs and to get a better picture of a customer's buying habits (by forming a profile based on all purchases). Similarly, transactions sometimes get stored twice. A good first attempt at this project would most likely involve cleaning up an address database.
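A first cut at recognizing duplicate address records might use fuzzy string similarity; a sketch with invented records and an arbitrary threshold:

```python
from difflib import SequenceMatcher

# Sketch: flagging probable duplicate customer records by normalized
# fuzzy string similarity. Records and the 0.85 threshold are
# illustrative; real matching would compare fields separately.

def normalize(s):
    return " ".join(s.lower().replace(".", "").split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

records = [
    "John Q. Smith, 12 Oak St.",
    "Jon Q Smith, 12 Oak Street",
    "Mary Jones, 9 Elm Ave.",
]

dupes = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if similarity(records[i], records[j]) > 0.85
]
print(dupes)   # the two spellings of John Smith are paired
```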
Sometimes, it is necessary to intelligently navigate a database. For example, when making airline reservations, there is a sequence of choices that cannot be completely anticipated. This project asks you to develop an intelligent database interface to optimize an airline reservation. The resulting program would access an airline reservation system through Travelocity.Com, Expedia.Com, or Orbitz.Com. Then, it is to make reservations optimized on a weighted query that specifies the importance of price, the importance of airline, the importance of schedule, and other relevant factors.
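The weighted-query part might be sketched as a simple weighted sum over candidate itineraries (the weights, flights, and 0-to-1 desirability scores are invented; a real program would fetch the candidates from the reservation service):

```python
# Sketch: ranking candidate itineraries against a weighted query.
# All data below is illustrative.

weights = {"price": 0.5, "airline": 0.2, "schedule": 0.3}

# Each candidate carries a 0..1 desirability score per factor.
flights = [
    {"name": "A", "price": 0.9, "airline": 0.3, "schedule": 0.5},
    {"name": "B", "price": 0.6, "airline": 0.9, "schedule": 0.9},
    {"name": "C", "price": 0.4, "airline": 0.5, "schedule": 0.2},
]

def score(f):
    return sum(weights[k] * f[k] for k in weights)

best = max(flights, key=score)
print(best["name"], round(score(best), 2))
```

The harder, more interesting part of the project is navigating the site's sequence of choices to gather the candidates in the first place.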
Association rules are a product of data mining. Most association rules are drawn from data whose values have been reduced to Boolean data. The basic approach to developing this kind of rule is well known. In contrast, several methods have been attempted for developing rules from data whose values can range over a scale. These methodologies are still in flux. An empirical study is needed to compare the quantitative association rule methods that produce similar results.