Some Random Musings On Causality

This page is under refinement and incremental development.



Recognizing when data is deterministic or random

An important part of data mining or data exploration is understanding whether there is a relationship between data items. Sometimes data items occur in pairs without having a deterministic relationship; for example, a grocery store shopper may buy both bread and milk at the same time. Most of the time, the purchase of the milk is not caused by the purchase of the bread, nor is the purchase of the bread caused by the purchase of the milk.

However, some data pairs do appear to have a causal relationship; for example, if someone hits a bottle with a hammer, we might expect the bottle to break as a causal effect.

Other events appear to have a partial causality; for example, when someone buys both strawberries and whipped cream, we might say that the purchase of the strawberries partially caused the purchase of the whipped cream.

Recognizing whether one thing causes another is difficult.

Prediction is not the same as causality. Recognizing that a causal relationship existed in the past is not the same as predicting that in the future one thing will occur because of another. For example, knowing that α was a causal (or deterministic) factor for β is different from saying that whenever α occurs, β will deterministically occur (or even probabilistically occur to some degree). There may be other necessary factors.

It may be possible to determine whether a collection of data is random or deterministic using attractor sets from chaos theory [Packard, 1980]. A low-dimensional attractor set would indicate regular, periodic behavior and hence deterministic behavior. On the other hand, high-dimensional results would indicate random behavior. (See Halpern [2000, p. 178].)
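One way to make this concrete is to estimate a correlation dimension from a delay embedding of the time series, in the spirit of Packard et al. [1980] (the Grassberger-Procaccia approach). The sketch below is illustrative, not a method this page prescribes; the embedding dimension, delay, radii, and test signals are all assumptions, and it uses NumPy:

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Build delay vectors [x[t], x[t+tau], ..., x[t+(dim-1)*tau]]."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def correlation_dimension(x, dim=3, tau=1, n_radii=10):
    """Slope of log C(r) vs log r, where C(r) is the fraction of
    point pairs in the embedded space that lie closer than r."""
    pts = delay_embed(np.asarray(x, dtype=float), dim, tau)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    dist = dist[np.triu_indices_from(dist, k=1)]   # distinct pairs only
    # Fit in the small-r scaling region (percentiles chosen heuristically).
    radii = np.logspace(np.log10(np.percentile(dist, 2)),
                        np.log10(np.percentile(dist, 20)), n_radii)
    c = np.array([np.mean(dist < r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

# A periodic signal traces a curve in the embedding space, so its
# estimated dimension should be low; white noise fills the space,
# so its estimate should be markedly higher.
t = np.linspace(0, 40 * np.pi, 1200)
d_periodic = correlation_dimension(np.sin(t))
rng = np.random.default_rng(0)
d_random = correlation_dimension(rng.standard_normal(1200))
print(d_periodic, d_random)
```

On this toy data, the periodic signal's estimate comes out near 1 while the noise estimate is much closer to the embedding dimension; real data needs care with the delay, embedding dimension, and scaling region.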

Causal necessity is not the same thing as causal sufficiency. For example, suppose that for event β to occur, events a, b, and c must all occur. We can say that a, by itself, is necessary but not sufficient.
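The distinction can be checked mechanically with a toy truth table (the rule "β occurs exactly when a, b, and c all occur" is a hypothetical stand-in, not anything this page specifies):

```python
from itertools import product

# Hypothetical rule: beta occurs only when a, b, and c all occur.
def beta(a, b, c):
    return a and b and c

# a is necessary: beta never occurs when a is absent.
necessary = all(not beta(a, b, c)
                for a, b, c in product([False, True], repeat=3) if not a)

# a is not sufficient: a can occur without beta occurring.
not_sufficient = any(a and not beta(a, b, c)
                     for a, b, c in product([False, True], repeat=3))

print(necessary, not_sufficient)   # both hold for this rule
```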

Part of the difficulty arises from identifying data that might be related. Data can have a high dimensionality, with only a relatively few dimensions having some sort of relationship. Similarly, data may have a higher dimensionality than necessary to fully describe a situation. Dimensionality reduction is therefore an important issue in learning from data: some data might be redundant, some irrelevant, and some more important than others. In a large collection of data, the complexity may be unknown.
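As a sketch of what dimensionality reduction can reveal, here is principal component analysis via the SVD (one common technique; this page does not prescribe a specific method) applied to hypothetical data in which five nominal dimensions carry only two independent signals:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)                  # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)    # variance share per component
    return Xc @ Vt[:k].T, explained[:k]

# Hypothetical data: 200 samples, 5 columns, but only 2 independent signals.
rng = np.random.default_rng(1)
z = rng.standard_normal((200, 2))
X = np.column_stack([
    z[:, 0],                                 # signal 1
    z[:, 1],                                 # signal 2
    3 * z[:, 0] - z[:, 1],                   # redundant combination
    0.01 * rng.standard_normal(200),         # near-irrelevant noise
    z[:, 0] + z[:, 1],                       # another redundant combination
])

Xk, share = pca_reduce(X, 2)
print(Xk.shape, share.sum())   # two components capture nearly all variance
```

The point of the toy example is that redundant and irrelevant columns inflate the nominal dimensionality; the variance shares expose how few dimensions actually matter.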

A causal discovery method cannot transcend the prejudices of analysts. Often, the choice of which data points to include and which to leave out, which type of curve to fit (linear, exponential, periodic, etc.), what time increments to use (years, decades, centuries), and other aspects of a model depends on the instincts and preferences of researchers.

The term "causality" is used here in the everyday, informal sense. There are several strict definitions that are not wholly compatible with each other. The sense here is that if one thing occurs because of another, we say that there is a dependent or causal relationship.

Beyond computational issues, there appear to be inherent limits on whether causality can be determined; see, for example, Chaitin [1987, 1990].

References

G. Chaitin [1987] Algorithmic Information Theory, Cambridge University Press

G. Chaitin [1990] "A Random Walk in Arithmetic," New Scientist 125, n. 1709 (March 1990), 44-46

P. Halpern [2000] The Pursuit of Destiny, Perseus, Cambridge, Massachusetts

N. Packard, J. Crutchfield, J. Farmer, R. Shaw [1980] "Geometry from a Time Series," Physical Review Letters, v. 45, n. 9, 712-716



last modified: 22 April 2003