## Description of the ILLM system

DMS online rule induction is based on the algorithms of the ILLM (Inductive Learning by Logic Minimization) system. ILLM is a propositional learning system aimed at solving classification problems. The rules produced by ILLM are in CNF, DNF or mixed form, as it will be explained later.

Main differences to other propositional learning sytems (like the method explained earlier) are in the following:

- ILLM separates literal (attribute-value pair) generation and selection process and the rule construction process.
- It has a special literal selection algorithm based on the positive/negative example pairs (p/n pairs). This algorithm provides means for the selection of the reduced (relevant) subset of literals, prior to the rule construction process.
- Rule generation process minimizes the rule length by combining both CNF and DNF in a mixed form.
- Starting from the reduced set of literals, a minimal set of literals sufficient for the rule construction can be selected. This process is the base for noise detection in ILLM

### ILLM literal selection procedure

Most of the propositional inductive learning systems (rule induction systems) generate rules and select significant literals at the same time. Usually, some form of the information gain metric is used in the selection process. In the ILLM code, literal gneration and selection process precedes the rule construction process, and provides a reduced set of literals necessary to build a rule covering all of the examples of the target class.

ILLM literal selection process is not based on the information based metric, but on the positive/negative example pairs (p/n pairs in the rest of the text). Instead of having one covering table (as for the method described before), ILLM has two covering tables: P - table of positive examples and N - table of negative examples. The columns in these tables are all possible (or interesting) literals that can be used in the literal selection. The elements in the tables are true and false values, depending on the literal value for the example.

P and N tables can be reduced if some simple criteria are met. For example a column (literal) can be erased if there is another literal covering the same examples plus some more. Similarly, a row (an example) can be erased if it has true-values in at least all columns as another row.

This is the basic form of the covering tables for the literal selection process in ILLM. It has to be stressed here, that this form of the covering tables provides means for the effective noise handling and/or outlier elimination in the phase of the search for the minimal subset of literals. In the process of noise handling a relation between literal covering weight and significance of particular examples is used to decide which examples might be declared as outliers or noise and possibly eliminated from the following rule generation stage.

The process of literal generation and selection will be described in detail in following sections. The details of the rule construction procedure, which is employed on the online rule induction service of DMS as well, are given at the end. Search method for the minimal subset of literals, as employed on the online DMS service is given next. These concept is called confirmation rules and is explained elsewhere in greater detail (Gamberger, 2000).

ILLM representational formalism: literals

Consider a two-class learning problem where training set *E* consist of positive and negative examples of a concept (*E = P ∪ N*) and examples *e ∈ E* are tuples of truth-values of terms in a hypothesis language. The set of all terms, called literals, is denoted by *L*. Let us represent the training set *E* as a table where rows correspond to training examples and columns correspond to literals. An element in the table has the value *true* when the example satisfies the condition (literal) in the column of the table, otherwise its value is *false*.

A procedure for the generation of the set of all literals

If the training set does not have the form of tuples of truth-values, a transformation to this form is performed in preprocessing of the training set. For attribute-value learning, the transformation procedure is based on analysis of the values of examples in the training set. For each attribute *A _{i}*, let

*v*(

_{ix}*x =*1..

*kip*) be the

*k*different values of the attribute that appear in the positive examples and let

_{ip}*w*(

_{iy}*y =*1..

*k*) be the

_{in}*k*different values appearing

_{in}in the negative examples. The transformation results in a set of literals *L*:

- For discrete attributes
*Ai*, literals of the form*Ai=vix*and*Ai=wiy*are generated. - For continuous attributes
*Ai*, literals of the form*Ai*≤*(vix+wiy)/*2 are created for all neighboring value pairs (*vix, wiy*), and literals literals*Ai > (vix+wiy)/*2 for all neighboring pairs (*wiy, vix*). - For integer valued attributes
*Ai*, literals are generated as if*Ai*were both discrete and continuous, resulting in literals of four different forms:*Ai*≤*(vix+wiy)/*2,*Ai > (vix+wiy)/*2,*Ai=vix*, and*Ai=wiy*.

In the definition of literals for continuous and integer attributes a term neighbouring value pairs is introduced. This term defines such a pair, as two immediate neighbouring values for the paricular attribute, so that they belong to positive and negative example, or vice versa. There is no other example form the training set, whose attribute value comes in between these two values, and these two values must belong to examples of opposite target classes.

Covering tables of p/n pairs of examples

Let the set of training examples *E* be represented by a truth-value table where columns correspond to the set of (positive and negative) literals *L*, and rows are tuples of truth-values of literals, representing training examples *ei*. There are two tables, *P* and *N*, where *P* is the table of positive examples, and *N* the table of negative examples. To enable a formal discussion of the relevance of literals, and of the literal selection process, the following definitions are introduced:

**Definition 1.** A *p/n* pair is a pair of training examples where *p ∈ P* and *n ∈ N*.

**Definition 2.** Literal *l ∈ L* covers a *p/n* pair if in column *l* of the table of training examples *E* the positive example *p* has value *true* and the negative example *n* has value *false*. The set of all *p/n* pairs covered by literal *l* will be denoted by *E(l)*.

**Definition 3.** Literal *l* covers literal *l'* if *E(l') ⊆ E(l)*.

To illustrate the above definitions consider a simple learning problem with 5 training examples: three positive *p*1, *p*2 and *p*3, and two negative *n*1 and *n*2, described by truth-values of literals *li ∈ L*. The truth-value table *E*, showing just some of the truth-values, is given in Table 1.

Table 1: *P *and *N* tables, columns are literals, while rows are examples. Elements of the
table take two values *true *or *false*: *true* - if the literal is true for the example or
*false *- if the literal is not true for the example.

On the value (relevance) of literals

Literal *l*2 in Table 1, seems to be relevant for the formation of the inductive hypothesis since it is true for a positive example and false for both negative examples. This is due to the fact that *l*2 covers a positive example, and does not cover the negative examples, and is thus a reasonable ingredient of the hypothesis that should cover the positive examples and should not cover the negative examples. Literal *l*2 covers two *p/n* pairs: *E(l*2*) = {p*2*/n*1*, p*2*/n*2*}*. Literal *l*8 is inappropriate for constructing a hypothesis, since it does not cover any *p/n* pair: *E(l*8*) = 0*. Literal *l4* seems to be less relevant than *l2* and more relevant than *l8*; it covers only one *p/n* pair: *E(l4) = {p2/n1}*. Literal *l2* covers *l4* and *l8*, and literal *l4* covers *l8*, since *E(l8) ⊆ E(l4) ⊆ E(l2)*. Table 1 provides the following intuition: the more *p/n* pairs a literal covers the more relevant it is for hypothesis formation. This may be formalized by the next definition.

* Definition 4a.* Literal

*l'*is irrelevant if there exists a literal

*l ∈ L*such that

*l*covers

*l'*(

*E(l') ⊆ E(l)*). In other words, literal

*l'*is irrelevant if it covers a subset of

*p/n*pairs covered by some other literal

*l ∈ L*.

Assume that literals are assigned costs (for instance, cost can be a measure of complexity - the more complex the literal, the higher its cost). Let *c(l)* denote the cost of literal *l ∈ L*. The definition of irrelevance needs to be modified to take into account the cost of the literals.

* Definition 4b.* Literal

*l'*is irrelevant if there exists a literal

*l ∈ L*such that

*l*covers

*l'*(

*E(l') ⊆ E(l)*) and the cost of

*l*is lower than the cost of

*l'*(

*c(l)*≤

*c(l')*).

Relevance of literals as explained above is the basis for the literal selection procedure employed in the ILLM system, as well as for the noise handling procedure. Details of this procedures can be found in the literature. We will assume here that we have selected literals and that we can proceed to the rule construction process, which is introduced in the following section.

Rule construction procedure

Using the formalism described above, and the subset of relevant literals, we now proceed to the stage of rule set construction. The algorithm used for constructing, so called confimation rule subsets is given in Figure 1.

Figure 2: Algorithm for the construction of the confirmation rule subset

The procedure at step 3 of the Algorithm **CR**, is a search procedure which maximizes the weight of the new rule **R _{i }**to be added into

**CR**. This search can be exhaustive (combining all literals with high enough covering weight into a rule of predefined length

**L**), or heuristic, when the complexity of the problem is too high. One must note that the rule weight is inversely proportional to the sum of covered example weights.

_{R}Produced confirmation rule sets finaly contain rules which usually do not cover all positive class examples, but are 100% correct with respect to negative class examples. Due to the use of aadaptive example weights *c(e)* in Algorithm **CR**, rules in the **CR** subset are relatively independent, as each new rule added into **CR** is biased to cover those examples whose weight is lower, i.e. those not yet covered, or covered by a smaller number of rules.

This is, in general terms, the method behind the online rule induction system employed at DMS.

Application of ILLM and DMS rule induction

ILLM system is intended for solving classification problems (see the ILLM application page). It generates rules for the target class in the CNF or mixed CNF and DNF form. In conjunction with a special testing algorithm, generated rules can be used as fast labeling algorithms for new examples, hence as online detection or processing algorithms. On the other hand, propositional form of generated rules is very informative and in that respect useful for solving descriptive data mining tasks (see ILLM application in the COIL 2000 challenge). This descriptive value of generated rules is even more pronounced in the DMS online system, where CNF form of rules is translated into simple statements.

### Literature on ILLM system

- D.Gamberger. (1995) A minimization approach to inductive learning. In Proc. of the 8th European Conference on Machine Learning, Springer, pp.151-160. ( .ps file 266kB).
- D.Gamberger and N.Lavrac. (1997) Conditions for Occam's razor applicability and noise elimination. In Proc. of 9th European Conference on Machine Learning, Springer, pp. 108-123, ( .ps file 333kB).
- N.Lavrac, D.Gamberger, and P.Turney. (1998) A relevancy filter for constructive induction. IEEE Intelligent Systems , Vol.13, pp.50-56.
- D.Gamberger and N.Lavrac. (2000) Confirmation rule sets. In Proc. of 4th European Conference on Principles of Data Mining and Knowldge Discovery (PKDD 2000), Springer, pp. 34-43. ( .ps file 289kB).
- D.Gamberger and N.Lavrac. (2001) Filtering noisy instances and outliers. In H.Liu and H.Motoda, editors, Instance Selection and Construction for Data Mining, Kluwer Academic Publishers.
- D.Gamberger, N.Lavrac (2002) Expert-guided subgroup discovery: Methodology and Application. Journal of Artificial Intelligence Research, Vol. 17 pp.501-527. ( .ps file 289kB).

© 2001 LIS - Rudjer Boskovic Institute

Last modified: August 20 2018 19:41:09.