THE LUCS-KDD IMPLEMENTATIONS OF THE FP-GROWTH ALGORITHM



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

Thursday 5th February 2003

DISCLAIMER! The following description of the FP-growth algorithm (Han et al. 2000) is that implemented by the author of this WWW page, i.e. it is not identical to that first produced by Jiawei Han, Jian Pet, Yiwen Yin and Runying Mao, but certainly displays all the "salient" features of the FP-growth algorithm.


CONTENTS

1. Introduction.
2. The FP-tree.
3. The FP-growth algorithm.
4. Note on the LUCS-KDD implementation.
5. Critique.
6. Downloading the software.
 
6.1. Compiling.
6.2. Documentation.
7. Running the software.
8. Conclusions.



1. INTRODUCTION


The popular FP-growth Association Rule Mining (ARM) algorirthm (Han et al. 2000) is applied to a particular kind of set enumerationj tree, the FP-tree, alsp developped by Han et al. Both the FP-tree and the FP-growth algorithm are described in the following two sections. A short critique is also provided in Section 3.

The essential difference between the original FP-growth algorithm and the LUCS-KDD (Java) implementation is that a second tree structure, the T-tree (developped by Frans Coenen, Paul Leng and Graham Goulbourne) is used to store the discovered frequent itemsets and subsequently generate the desired ARs.




2. THE FP-TREE

A popular "preprocessing" tree structure is the FP-tree proposed by Han et al. (2000). The FP-tree stores a single item (attribute) at each node, and includes additional links to facilitate processing. These links start from a header table and link together all nodes in the FP-tree which store the same "label", i.e. item.

To illustrate the method, considering the simple data set:

{1, 3, 4}
{2, 4, 5}
{2, 4, 6}

The construction process begins with an initial pass to count support for the single items. Those that fail to meet the support threshold are eliminated, and the others ordered by decreasing frequency. For our illustration we will assume that all 1-itemsets are adequately supported, so the ordering will be {4,2,1,3,5,6}. We then pass through the dataset a second time and produce an initial FP-tree. We commence by reading the first record in the dataset and place this in the FP-tree (Figure 1(a) --- note the links from the header table). We then add the second record; the first element of this is common with an existing node and so we add the new node to the FP-tree structure as shown in Figure 1(b). We then add the last record (Figure 1(c)) to complete the initial FP-tree.

FP TREE

Figure 1: Example FP tree




3. THE FP-GROWTH ALGORITHM

The algorithm, FP-growth, for mining the FP-tree structure is a recursive procedure during which many sub FP-trees and header tables are created. The process commences by examining each item in the header table, starting with the least frequent. For each entry the support value for the item is produced by following the links connecting all occurrences of the current item in the FP-tree. If the item is adequately supported, then for each leaf node a set of ancestor labels is produced (stored in a prefix tree), each of which has a support equivalent to the sum of the leaf node items from which it is generated. If the set of ancestor labels is not null, a new tree is generated with the set of ancestor labels as the dataset, and the process repeated. In our implementation all frequent itemsets thus discovered were placed in a T-tree, thus providing fast access during the final stage of the ARM process, while at the same time providing for the deletion of FP-subtrees and tables created "on route" as the FP-growth algorithm progressed.

For example in Figure 1(c) we would start with attribute 6. This has a support of 1, and assuming this is above the required suppoprt threshold, this would be identified as a frequent set. There are two ancester labels so a new FP-tree is created (Figure 2(a)) using the ancestor labels as the input set /bin/sh: -c: line 1: syntax error near unexpected token `(' /bin/sh: -c: line 1: `new FP-tree is created (Figure 2(a)) using the ancestor labels as the input set ' (note that the support values for the label are equivalent to the leaf node). FP growth is applied again and the frequent sets {2,6} discovered. There are still ancestor nodes so another tree is produced (Figure 2(b)) and the process is continued until there are no more ancestor nodes (Figure 2(c)).

FP GROWRG

Figure 2: FP growth algorithm




4. NOTE ON THE LUCS-KDD IMPLEMENTATION

The LUCS-KDD implementation of FP growth reorders and prunes the input data. Reorders so that the most common single attributes appear first, and pruned so that unsupported 1-itemsets are not included. The effect of this reordering and pruning is to make the running of the code much more computationally efficient (an advantage appropriate to all tree based ARM algorithms). Thus given an input set of the form:

{1 3 4}
{2 4 5}
{2 4 6}

This would be reordered to give:

{4 1 3}
{4 2 5}
{4 2 6}

However, for the code to operate correctly itemset attributes must be numbered sequentially. Thus some renumbering must take place:

4 becomes 1
2 becomes (stays at) 2
1 becomes 3
3 becomes 5
5 becomes (stays at) 5
6 becomes (stays at) 6

The input data now looks like this:

{1 3 4}
{1 2 5}
{1 2 6}

Running the code produces a tree of the form:

(1) 1:3 (ref to null)
(1.1) 2:2 (ref to null)
(1.1.1) 5:1 (ref to null)
(1.1.2) 6:1 (ref to null)
(1.2) 3:1 (ref to null)
(1.2.1) 4:1 (ref to null)

Where given X:Y, X is the attribute number and Y is the associated support. Reordering and pruning can be switched off by removing (commenting out) the lines:

newFPtree.idInputDataOrdering();
newFPtree.recastInputDataAndPruneUnsupportedAtts(); 
newFPtree.setNumOneItemSets();

From the FPgrowthApp.java class in which case this would give an FP tree of the form:

(1) 1:1 (ref to null)
(1.1) 3:1 (ref to null)
(1.1.1) 4:1 (ref to null)
(2) 2:2 (ref to null)
(2.1) 4:2 (ref to 4:1)
(2.1.1) 5:1 (ref to null)
(2.1.2) 6:1 (ref to null)

Note that this unordered tree contains more nodes than the ordered tree.




5. CRITIQUE

The advantages offered by algorithms such as FP-Growth (Han et al. 2000) are partly gained from the ordering process, which reduces the overall size of the input dataset (because unsupported single items are eliminated), and reduces processing time by allowing the most common items to be processed most efficiently. The first advantage can also be realised by other ARM algorithms; anything that reduces the overall size of the input dataset has a positive effect on mining efficiency. However, the advantage will only be significant given sparse datasets and/or high support thresholds, where many 1-itemsets are likely to be unsupported and therefore can be eliminated from the input dataset.

It should also be noteed here that reordering of attributes in datasets according to the support counts for single items has a general benefit with respect to the computational efficiency of set enumeration trees in general.




6. DOWNLOADING THE SOFTWARE


The LUCS KDD FP-growth software comprises four Java source files. These are provided from this WWW page together with an application classes. The source files are as follows:

AssocRuleMining.java FPtree.java TtreeNode.java FPgrowthApp.java TotalSupportTree.java
  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. FPtree.java: The FP growth algorithm.
  3. TotalSupportTree.java: Methods to generate ans manipulate the "Total support" tree data structure (T-tree).
  4. TtreeNode.java: Descrition of a node in the T-tree.

The TtreeNode class is separate to the remaining classes which are arranged in a class hierarchy of the form presented in Figure 3.

 AssocRuleMining
	|
 TotalSupportTree 
	|
     FPtree

Figure 3: Class Hierarchy

The Apriori-TFPC FP-growth application class included here is as follows:

  1. FPgrowthApp.java: FP growth application demonstration class.

There is also a "tar ball" fpGrowth.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf fpGrowth.tgz.


6.1. Compiling

The LUCS-KDD FP-growth software has been implemented in Java using the Java2 SDK(Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using a standard Java compiler:

javac *.java

6.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Document directory.




7. RUNNING THE SOFTWARE


When compiled the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME

The input to the software, in all cases is a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:

R  = {r | r = subset A}

Where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is included using a -F flag.

The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.

Some example invocations, using a discretized/ normalised Pima Indians data set (also available from this site) and each of the three application classes provided by this WWW site, are given below:

java FPgrowthApp -Fpima.D38.N768.C2.num
java FPgrowthApp -Fpima.D38.N768.C2.num -S2.5
java FPgrowthApp -Fpima.D38.N768.C2.num -S1 -C90

(note that the ordering of flags is not significant). The output from each application is a set of ARs (ordered according to confidence) plus some diagnostic information (run time, number of rules generated etc.). The example application also outputs the FP-trees generated. By editting the application source file (FPgrowthApp.java) these outputs can be removed or commented out.




8. CONCLUSIONS


The FP-growth ARM algorithm described here has been use successfully by the LUCS-KDD research group to contrast and compare a variety of ARM algorithms and techniques. The software is available for free, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the LUCS-KDD FP-growth implementation available here is suggested:

  1. Coenen, F. (2003), The LUCS-KDD FP-growth Association Rule Mining Algorithm, http://www.cxc.liv.ac.uk/~frans/KDD/Software/FPgrowth/fpGrowth.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES

Coenen, F., Goulbourne, G. and Leng, P., (2003).
Tree Structures for Mining association Rules. Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, pp25-51.
Goulbourne, G., Coenen, F. and Leng, P. (2000).
Algorithms for Computing Association Rules Using a Partial-Support Tree. Journal of Knowledge-Based Systems, Vol (13), pp141-149.
Han, J., Pei, J. and Yiwen, Y. (2000).
Mining Frequent Patterns Without Candidate Generation. Proceedings ACM-SIGMOD International Conference on Management of Data, ACM Press, pp1-12.



Return to KDD top level page.