This Ph.D. thesis (367k compressed postscricpt) is concerned with pruning in relational learning, a subarea of the rapidly growing field of Inductive Logic Programming. In many real-world problems not only the values of the attributes, but also relations between them may be relevant for the definition of the target concept. Relational learning algorithms extend classical propositional learning systems like CN2 and ID3 with the ability to include this kind of knowledge. From a set of positive and negative examples and background knowledge in the form of PROLOG relations, relational learners will try to find a PROLOG program that can derive the positive, but not the negative examples. As real-world problems very often contain erroneous values and incomplete or misclassified examples, noise handling is an important problem that has to be faced. Pruning is the common framework for avoiding this problem of overfitting the noise in the data. The basic idea is to incorporate a bias towards more general and simpler theories in order to avoid fitting overly specific theories that try to find explanations for noisy examples.
Pre-pruning methods deal with noise during learning. Instead of trying to find a theory that is complete and consistent with the given training data, heuristics - so-called stopping criteria - are used to relax this constraint by stopping the learning process although some positive examples might not yet be explained and some of the negative examples may still be covered by the current theory. We review some of the basic algorithms that use this approach, most importantly FOIL which is the ancestor of most relational learning algorithms. Thereafter we introduce a new system, FOSSIL. While it uses the same basic separate-and-conquer learning strategy as FOIL, FOSSIL employs a new search heuristic based on statistical correlation which has several interesting properties that distinguish it from other approaches. Most importantly, we show how the correlation heuristic can effectively deal with noise when used with the simple cutoff stopping criterion. This new simple and efficient criterion does not depend on the number of training examples or on the amount of noise in the data and thus seems to be more robust than alternative approaches.
Another family of algorithms deals with noise after learning. These post-pruning algorithms typically first induce a theory that is complete and consistent with the training data. Then this theory is examined and clauses and literals that only explain characteristics of the particular training set used and thus do not reflect true regularities of the domain are discarded. The quality of the found clauses and literals is commonly evaluated on a separate set of training examples that have not been seen during learning. We review two common post-pruning methods, Reduced Error Pruning (REP) and GROW, a variant that uses a top-down search. While both methods prove to be very effective in noise-handling, they are also very inefficient.
One of the reasons for the inefficiency of post-pruning methods is that the intermediate theory resulting from the initial overfitting phase can be much more complex than the final theory. Thus a lot of effort has to be wasted in generating and subsequently simplifying this intermediate theory. One way for reducing this problem is to combine pre- and post-pruning. For this purpose pre-pruning heuristics are used to reduce (not entirely prevent) the amount of overfitting, so that learning and pruning will be more efficient. Our method, Top-Down Pruning (TDP), uses FOSSIL's simple cutoff stopping criterion to systematically vary the overfitting avoidance bias. Theories pruned to different degrees are generated in a top-down, general-to-specific order. The accuracies of the theories are evaluated on a separate set of data and the most specific theory with an accuracy comparable to the accuracy of the best theory so far will be submitted to a subsequent post-pruning phase. Experiments show that this initial top-down search for a better starting theory can be more efficient than the overfitting phase of classical post-pruning algorithms. As this search will typically return a theory that is closer to the final theory, we can also achieve a significant speed-up in the post-pruning phase along with a slight gain in accuracy.
Motivated by the success of this method, we have developed a more rigorous approach that tightly integrates pre- and post-pruning. Instead of learning an entire theory and pruning it thereafter, Incremental Reduced Error Pruning (I-REP) prunes single clauses right after they have been learned. This new algorithm entirely avoids the learning of an overfitting theory by using post-pruning methods as a pre-pruning stopping criterion and thus significant speedups can be achieved in noisy domains. As it avoids some problems with other approaches that incorporate post-pruning, I-REP also learns more accurate theories.
In a series of experiments in propositional and relational domains we have tried to confirm the quality of our algorithms, and found that in noisy domains, I-REP is superior in terms of both, accuracy and efficiency. However, this advantage was not as clear in propositional domains as it was in domains that require relations. Pre-pruning as used in FOSSIL appears to be the most stable method and should be used when nothing is known about the domain. A setting of 0.3 for its cutoff parameter has yielded good results in almost all tested domains, which illustrates the robustness of this algorithm against varying noise levels and varying training set sizes.
The research has been conducted at the Austrian Research Institute for Artificial Intelligence with support from the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.