Association rules are mostly used in mining transaction data. Crucial terms in association rules terminology are:
- item (in DM terminology corresponds to attribute-value pair)
- transaction (a set of items; corresponds to example)
- data set (a set of transactions that together contain many different items)
Transactions typically differ in the number of items they contain. Therefore, some transformation (see standard form) might be necessary before transaction data can be mined with most data mining tools.
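One possible transformation can be sketched in Python as follows. This is a minimal illustration, assuming that "standard form" means a fixed-width table with one true/false column per distinct item; the item names and the helper to_standard_form are hypothetical.

```python
# Hypothetical toy transactions; each one is a set of items of varying length.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
    {"milk"},
]

def to_standard_form(transactions):
    """Return (columns, rows) with one boolean column per distinct item."""
    # Collect every distinct item and fix a column order.
    columns = sorted(set().union(*transactions))
    # Each transaction becomes a fixed-length row of True/False flags.
    rows = [[item in t for item in columns] for t in transactions]
    return columns, rows

columns, rows = to_standard_form(transactions)
print(columns)
for row in rows:
    print(row)
```

Every row now has the same length, regardless of how many items the original transaction contained.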
Each transaction in the set tells us which items co-occur in it. Using this data one can create a co-occurrence table that gives the number of times any pair of items (or larger itemset) occurs together in the set of transactions. From the co-occurrence table we can easily establish simple rules like:
R1="Item 1 comes together with Item 2 in 10% of all transactions"
The 10% measures how often these two items co-occur in the set of transactions, and is called the support of the rule. If Item 1 occurs in 15% of the transactions and Item 2 in 20%, then the ratio of the number of transactions that support the rule (10%) to the number of transactions that support the conditional part of the rule (15%) gives the confidence of the rule. In this case the confidence is 0.10/0.15 = 0.67.
We can also form the inverse of R1, which is:
R2="Item 2 comes together with Item 1 in 10% of all transactions"
The confidence of this rule is 0.10/0.20 = 0.5.
What does confidence tell us? Saying that the confidence of the rule is 0.5 is equivalent to saying that when Item 2 occurs in a transaction, there is a 50% chance that Item 1 also occurs in it. The most confident rules seem to be the best ones. A problem arises, however, when Item 1 occurs frequently on its own (let's say in 60% of transactions). In that case the rule has lower confidence (0.5) than a random guess: simply assuming that Item 1 is present would be right 60% of the time. This suggests using another measure, called improvement, which tells how much better a rule is at predicting the result than just assuming the result. Improvement is given by the formula:
I(rule) = p(condition and result) / (p(condition) * p(result))
In this case I(R1)=0.1/(0.15*0.2)=3.33 and, because the formula is symmetric in condition and result, I(R2)=0.1/(0.2*0.15)=3.33 as well. When improvement is greater than 1 the rule is better than random chance. When it is less than 1, it is worse. In our case both R1 and R2 are about 3.3 times better than a random guess.
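The arithmetic for the example can be checked with a short Python sketch. It assumes the usual definitions, confidence = support / p(condition) and improvement = support / (p(condition) * p(result)); the function and variable names are illustrative only.

```python
def confidence(support_rule, p_condition):
    """Share of transactions containing the condition that also contain the result."""
    return support_rule / p_condition

def improvement(support_rule, p_condition, p_result):
    """How much better the rule predicts the result than assuming it blindly."""
    return support_rule / (p_condition * p_result)

# Figures from the example in the text.
support_both = 0.10   # Item 1 and Item 2 together
p_item1 = 0.15
p_item2 = 0.20

conf_r1 = confidence(support_both, p_item1)            # Item 1 -> Item 2
conf_r2 = confidence(support_both, p_item2)            # Item 2 -> Item 1
imp_r1 = improvement(support_both, p_item1, p_item2)
imp_r2 = improvement(support_both, p_item2, p_item1)
print(round(conf_r1, 2), round(conf_r2, 2), round(imp_r1, 2), round(imp_r2, 2))

# Hypothetical variation: Item 1 occurs in 60% of all transactions.
imp_r2_frequent = improvement(support_both, p_item2, 0.60)
print(round(imp_r2_frequent, 2))  # below 1: worse than assuming Item 1 blindly
```

With these figures both rules get the same improvement, since the formula does not depend on which item is taken as the condition. In the 60% variation the improvement of R2 drops below 1 even though its confidence is still 0.5.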
Generating association rules is a multi-step process. The general algorithm is:
- Generate the co-occurrence matrix for single items.
- Generate the co-occurrence matrix for two items. Use this to find rules with two items.
- Generate the co-occurrence matrix for three items. Use this to find rules with three items.
- Continue in the same way for larger itemsets.
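The steps above can be sketched in Python as follows. This is a simplified illustration rather than an optimized algorithm: it counts every itemset of a given size by brute force and only forms rules with a single item as the result. The toy transactions and the names count_itemsets and rules_from_counts are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_itemsets(transactions, size):
    """Count how many transactions contain each itemset of the given size."""
    counts = Counter()
    for t in transactions:
        # Sorting gives every itemset one canonical tuple form.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return counts

def rules_from_counts(transactions, size, min_support=0.0):
    """Build rules 'condition -> result' from itemsets of the given size (>= 2)."""
    n = len(transactions)
    singles = count_itemsets(transactions, 1)
    smaller = count_itemsets(transactions, size - 1)
    larger = count_itemsets(transactions, size)
    rules = []
    for itemset, count in larger.items():
        support = count / n
        if support < min_support:
            continue
        for result in itemset:
            condition = tuple(i for i in itemset if i != result)
            conf = count / smaller[condition]
            imp = conf / (singles[(result,)] / n)
            rules.append((condition, result, support, conf, imp))
    return rules

# Hypothetical toy data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

pair_counts = count_itemsets(transactions, 2)
pair_rules = rules_from_counts(transactions, 2)
for condition, result, support, conf, imp in pair_rules:
    print(condition, "->", result, round(support, 2), round(conf, 2), round(imp, 2))
```

Calling rules_from_counts(transactions, 3) repeats the same step for three-item sets, with a two-item condition and a one-item result.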
Applications of association rules
Association rules are typically used in market analysis (market basket analysis), primarily because of the utility and clarity of their results. They express how important products or services relate to each other, and immediately suggest particular actions. Association rules are used in mining categorical data (items). Besides generating the rules themselves, applying the association rules technique involves two important concerns:
1) Choice of the right set of items
The data used for association rule analysis is typically the detailed transaction data captured at the point of sale. Gathering and using this data is a critical part of applying association rule analysis, and the results depend crucially on the items chosen for analysis. What constitutes a particular item depends on the business (problem) need. Items in stores usually have codes that form hierarchical categories (a taxonomy). These categories help to generalize items and to reduce the number of items used in a study. Dozens or hundreds of items may be reduced to a single generalized item, often corresponding to a single department or type of product.
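A minimal Python sketch of such a generalization is shown below. It assumes a flat lookup from item to category; the taxonomy, the item names and the helper generalize are made up for illustration, and a real store hierarchy would have several levels.

```python
# Hypothetical taxonomy: detailed item -> generalized category.
taxonomy = {
    "skim milk 1l": "dairy",
    "whole milk 1l": "dairy",
    "cheddar": "dairy",
    "white bread": "bakery",
    "rye bread": "bakery",
}

def generalize(transaction, taxonomy):
    """Replace each item by its category; items without a category are kept as-is."""
    return {taxonomy.get(item, item) for item in transaction}

transactions = [
    {"skim milk 1l", "white bread"},
    {"whole milk 1l", "cheddar", "rye bread"},
    {"cheddar", "batteries"},
]

generalized = [generalize(t, taxonomy) for t in transactions]
for t in generalized:
    print(sorted(t))
```

Because each transaction is a set, several detailed items from the same category collapse into one generalized item.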
2) Practical limits imposed by a large number of items appearing in combinations large enough to be interesting
The number of possible itemsets rises exponentially with the number of items. For a grocery store with thousands of different items, the number of support, confidence, and improvement values to calculate quickly rises into the millions as the number of items in the combinations grows. For example, for 1000 products the total number of combinations of three products is 1000*999*998/(3*2*1) = 166,167,000.
Calculating the counts for five or more items can get completely out of hand. In that case the use of taxonomies reduces the number of items to a manageable size.
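The growth can be illustrated with a few lines of Python. The sketch uses math.comb (Python 3.8 or later); the figure of 50 generalized items is a hypothetical example of what a taxonomy might leave, not a number from a real store.

```python
from math import comb  # available in Python 3.8+

def itemset_count(n_items, size):
    """Number of distinct itemsets of the given size drawn from n_items items."""
    return comb(n_items, size)

# Itemsets of size 1 to 5 for 1000 products.
for size in range(1, 6):
    print(size, itemset_count(1000, size))

# Hypothetical: the same products generalized to 50 departments.
triples_detailed = itemset_count(1000, 3)
triples_generalized = itemset_count(50, 3)
print(triples_detailed, triples_generalized)
```

For three-item combinations the count drops from 166,167,000 to 19,600 under this assumption.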
Generally, the strengths of association rule analysis are:
- It produces clear and understandable results.
- It supports undirected data mining (no target attribute).
- It works on data of variable length.
- The computation algorithm it uses is quite simple.
Links to Association rules tutorials:
Tutorial on High Performance Data Mining
by Vipin Kumar and Mahesh Joshi
ARMiner - a client-server data mining application specialized in finding association rules
maintained by L.Cristofor.
© 2001 LIS - Rudjer Boskovic Institute
Last modified: April 18 2014 22:55:05.