(Version from 2-23-2005)
Discovery of causal knowledge is crucial for advancing research,
developing new technology, and making sound policy, financial, and marketing
decisions. Biologists need to know the factors that cause a disease to devise new
therapeutic procedures. Public health policy makers need to know the factors
that cause an increase in the number of medical errors in order to reduce them.
Epidemiologists seek the factors causing disease in order to prevent it.
Launching a new advertisement campaign requires knowing the factors that affect
consumer behavior regarding the product. Increasing the number of visitors to a
web site requires knowledge of what attracts them to the site.
Classically trained statisticians often quote
the maxim ‘association is not causation’ to indicate that causal discovery is
impossible without experiments. For example, simply observing a high occurrence
of yellow stains on the fingers in patients with lung cancer relative to normal
subjects does not imply a causal relation between cancer and staining (in
reality, heavy smoking causes both, which is why they often co-occur). Similarly,
observing that two items tend to be purchased together with high frequency does
not necessarily imply that increasing the sales of the first item will be followed
by an increase in the sales of the second item.
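The yellow-stain example can be illustrated with a small simulation. The following is only a sketch under assumed, made-up probabilities (none of the numbers come from real data, and none of this is part of Causal Explorer): smoking raises the chance of both stained fingers and cancer, while staining has no effect on cancer at all.

```python
import random

def simulate(n=200_000, seed=0):
    """Draw (smoker, stains, cancer) triples; all probabilities are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.3
        # Stains and cancer each depend only on smoking, never on each other.
        stains = rng.random() < (0.8 if smoker else 0.05)
        cancer = rng.random() < (0.2 if smoker else 0.02)
        rows.append((smoker, stains, cancer))
    return rows

def cancer_rate(rows, stains, smoker=None):
    """Observed cancer frequency among subjects with the given stain status,
    optionally restricted to smokers or to non-smokers."""
    selected = [c for (s, st, c) in rows
                if st == stains and (smoker is None or s == smoker)]
    return sum(selected) / len(selected)

rows = simulate()

# Difference in cancer rate between stained and unstained subjects.
marginal_gap = cancer_rate(rows, True) - cancer_rate(rows, False)
smoker_gap = cancer_rate(rows, True, smoker=True) - cancer_rate(rows, False, smoker=True)
nonsmoker_gap = cancer_rate(rows, True, smoker=False) - cancer_rate(rows, False, smoker=False)

print(f"all subjects: {marginal_gap:+.3f}")
print(f"smokers only: {smoker_gap:+.3f}")
print(f"non-smokers:  {nonsmoker_gap:+.3f}")
```

In this toy setting the gap is clearly positive over all subjects but close to zero within each smoking group, which is the pattern expected when a common cause, rather than a direct causal link, produces the association.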
Unfortunately, discovering causal relations
strictly by randomized experimentation is inefficient and often impractical,
unethical, or simply impossible. Recent advances in the theory of computational
causal discovery prove mathematically, and recent algorithmic research shows
experimentally, that causal discovery from observational data alone is feasible
under broad conditions. Indeed, C. W. J. Granger, who shared the 2003 Nobel Prize
in Economics for his methods of analyzing economic time series, is also known for
a widely used test for detecting causality in observational econometric time
series. The acceptance and application of causal discovery methods are steadily
gaining ground. The following are just a few of the important references in this
emerging and exciting branch of science and technology:
I. Causation, Prediction, and Search by Peter Spirtes, Clark Glymour, Richard Scheines (Second Edition)
II. Causality: Models, Reasoning, and Inference by Judea
III. Learning Bayesian Networks by Richard E. Neapolitan
IV. Computation, Causation, and Discovery by Clark Glymour (Editor), Gregory F. Cooper (Editor)
Causal Explorer (CE) is a library of causal
discovery algorithms authored by the researchers at the Discovery Systems Laboratory
of the Department of Biomedical Informatics at
In addition to the causal discovery methods, CE
contains related variable (feature) selection algorithms. The variable selection
algorithms reduce the dimensionality of the data by selecting the smallest, most
predictive subset of variables. Thus, they can be used to construct smaller and
sometimes more accurate predictive or classification models that are less
costly to operate and easier to interpret and understand. The variable
selection algorithms in CE are based on theories of causal discovery, and the
selected variables have a specific causal interpretation (e.g., they are the
direct causes or direct effects of the variable of interest, or alternatively
the Markov Blanket of the variable of interest).
The CE code emphasizes efficiency, scalability,
and quality of discovery. The implementations of previously published
algorithms included in CE are more efficient than their original
implementations. CE also includes algorithms never before implemented as
computer programs.
A unique advantage of CE is the
inclusion of high-quality proprietary algorithms that scale to very large
datasets, developed by the Discovery Systems Laboratory (DSL) researchers
(patent pending). Example papers describing DSL’s novel causal and variable
selection algorithms are:
1. Using local causal structure to select variables for classification across several biomedical domains with very high dimensionality:
Aliferis CF, Tsamardinos I, Statnikov A. HITON: a novel Markov blanket algorithm for optimal variable selection. AMIA 2003 Annual Symposium Proceedings 2003;21-5. [Article]
2. Recovering local causal structure even when the available sample is low and the number of variables large:
Tsamardinos I, Aliferis CF, Statnikov A. Time and sample efficient discovery of Markov blankets and direct causal relations. Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining (KDD) 2003;673-8. [Article]
3. Learning a complete network of causal relations among all variables efficiently and with high accuracy:
The benefits of using the included algorithms have also been examined on a theoretical basis by the DSL researchers. They have shown that other state-of-the-art predictive technologies such as Support Vector Machines, while successful for classification and prediction, are not suitable for causal discovery (see the following [Article]). They have also explored the formal link between causality and variable selection. This link led to the design of the novel, proprietary, and optimal variable selection algorithms based on Markov Blanket induction, whose output also has a local causal interpretation (see the following [Article]).
Causal Explorer is a library of local and global causal discovery algorithms. Several of those algorithms can also be used for variable selection for classification.
Local causal discovery algorithms determine from observational data which
predictors (variables/observed quantities) causally affect or are affected by a
target variable of interest (under certain conditions). The causal relations
inferred are direct, i.e., if variable A is found to directly causally affect
or be affected by variable B, no other predictor measured in the dataset
causally intervenes between A and B.
Global causal discovery algorithms determine from observational data all direct
causal associations among the variables/observed quantities and their orientation.
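To make the notion of a direct relation concrete, here is a minimal sketch, not taken from Causal Explorer and not one of its algorithms. It assumes a hypothetical linear-Gaussian chain A -> M -> B and uses a first-order partial correlation as a simple stand-in for a conditional independence test. The variable names and noise levels are invented for illustration.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical chain: A influences B only through the measured variable M.
rng = random.Random(1)
n = 20_000
a = [rng.gauss(0, 1) for _ in range(n)]
m = [ai + rng.gauss(0, 1) for ai in a]
b = [mi + rng.gauss(0, 1) for mi in m]

marginal = pearson(a, b)             # A and B are associated overall
conditional = partial_corr(a, b, m)  # association once M is accounted for

print(f"corr(A, B)     = {marginal:.3f}")
print(f"corr(A, B | M) = {conditional:.3f}")
```

Here A and B are clearly correlated, yet the correlation is near zero once M is controlled for, so under these assumptions the A to B relation would not be reported as direct: the measured variable M intervenes between them.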
Causal Explorer can be used to:
(a) Discover the direct causal or probabilistic
relations around a target variable of interest (e.g., disease is directly
caused by and directly causes a set of variables/observed quantities).
(b) Discover the set of all direct causal or
probabilistic relations among the variables (Bayesian Network Learning).
(c) Discover the Markov Blanket of a target
variable of interest, i.e., the smallest subset of variables that contains all
necessary information to predict the target variable; the Markov Blanket
is the smallest subset required to build optimal prediction models
(under certain broad conditions) and corresponds to the direct causes, direct
effects, and direct causes of direct effects of the target variable.
Such algorithms have been frequently employed in analysis of data in psychology,
medicine, biology, weather forecasting, animal breeding, agriculture, financial
modeling, information retrieval, natural language processing, and other fields.
They can be used to automatically construct Decision Support Systems from data
(e.g., for medical diagnosis), or to generate plausible causal hypotheses
(e.g., which gene regulates which).
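The Markov Blanket described in item (c) can be read directly off a causal graph when that graph is already known. The sketch below assumes a small, hypothetical graph given as a list of (cause, effect) pairs. It only illustrates the definition (direct causes, direct effects, and direct causes of direct effects) and is not how Causal Explorer induces a Markov Blanket from data.

```python
def markov_blanket(edges, target):
    """Return the Markov Blanket of `target` in a known directed acyclic graph.

    `edges` is an iterable of (cause, effect) pairs.
    """
    parents = {a for a, b in edges if b == target}    # direct causes
    children = {b for a, b in edges if a == target}   # direct effects
    # Other direct causes of the target's direct effects ("spouses").
    spouses = {a for a, b in edges if b in children and a != target}
    return parents | children | spouses

# Hypothetical graph; all variable names are invented for illustration.
edges = [
    ("Smoking", "Cancer"),
    ("Genetics", "Cancer"),
    ("Cancer", "Fatigue"),
    ("Cancer", "Cough"),
    ("Cold", "Cough"),
    ("Smoking", "YellowFingers"),
]

blanket = markov_blanket(edges, "Cancer")
print(sorted(blanket))
```

In this example YellowFingers is associated with Cancer through Smoking but lies outside the blanket, while Cold is included because it is another direct cause of Cough, a direct effect of the target.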
The algorithms in Causal Explorer include the state of the art in the
field and have been compared against each other in some of the largest
computational experimental studies in the literature (as an example see papers
1, 2, and 3 above; more publications are available at the web site of the
Discovery Systems Laboratory at http://www.dsl-lab.org/). The results of our
studies confirm the applicability of the methods, provide suggestions as to
which algorithm to use in different situations, and indicate the performance
that can be expected of each algorithm. For every algorithm, the conditions
under which it is guaranteed to return correct results are well characterized.
1. Causal Explorer contains our proprietary
algorithms HITON, MMMB, MMPC, and MMHC.
a. HITON has been shown to be a very effective
variable selection algorithm, tested in a variety of datasets in biomedicine
with superior results against several other state-of-the-art algorithms in the
field. HITON selects significantly smaller variable subsets than the other
algorithms we compared it with, without sacrificing prediction power; in
addition, the variables selected have a causal interpretation (they belong in
the Markov Blanket of the variable to be predicted).
b. MMMB and MMPC are local causal algorithms,
shown to outperform the previous state-of-the-art algorithms of similar type.
c. MMHC is a Bayesian Network learning algorithm
shown to outperform the previous state-of-the-art algorithms in a very
extensive empirical evaluation study.
2. In contrast to state-of-the-art methods used
extensively in large-scale data mining (e.g., association rules, decision
trees, regression, various feature selection
procedures), all algorithms in Causal Explorer provide theoretical guarantees
of correctness while scaling up to tens or hundreds of thousands of variables.
3. Causal Explorer provides the most extensive
palette of algorithms of this type, including the best algorithms in existence
in addition to our proprietary ones.
4. Causal Explorer provides some of the most
efficient implementations of state-of-the-art algorithms.
5. The algorithms in Causal Explorer have been
tested in extensive studies to provide suggestions guided by empirical results
as to the appropriateness of the algorithms in different situations.
References: