1. INTRODUCTION
The current version (3.2) of the LUCS-KDD data generator is available from this site. The data generator is intended to produce data sets for use in the testing of Association Rule Mining (ARM) algorithms, but may well have other uses. The output from the generator is a file of N rows describing a data table with M columns, where each row lists the columns that contain a binary '1'. For example, the row {1, 5, 7, 8} indicates the presence of a binary '1' in columns 1, 5, 7 and 8 of the data table (all other columns have a '0' by default). The following sections detail the development of the generator.
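To make the row format concrete, the following minimal Java sketch expands one output row into its full binary form. The class RowDecoder and the method toBinaryRow are hypothetical names and are not part of the LUCS-KDD code. The sketch also assumes that the column numbers in a row are separated by white space, which the text above does not state.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the LUCS-KDD distribution.
class RowDecoder {

    // Expands a row of 1-based column numbers into a 0/1 array of length numColumns.
    // Assumption: column numbers are separated by white space.
    static int[] toBinaryRow(String line, int numColumns) {
        int[] bits = new int[numColumns];          // every column defaults to 0
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                          // an empty row contains no '1's
            }
            int column = Integer.parseInt(token);
            bits[column - 1] = 1;                  // columns are numbered from 1
        }
        return bits;
    }

    public static void main(String[] args) {
        // The example row {1, 5, 7, 8} in a table with 10 columns.
        System.out.println(Arrays.toString(toBinaryRow("1 5 7 8", 10)));
        // prints [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
    }
}
```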
2. DEVELOPMENT
The principal inputs to the data generator are the values for M and N, where M is the desired number of available attributes (columns) and N the desired number of rows. For each intersection (M,N) a random number R is generated such that 0 <= R < 100. In a naive system we might proceed as follows: if the value of R is less than 50, a '1' is considered to exist in that column and consequently the current value of M (the column number) is output to file. We can express this process using pseudo code as follows:

    For all N loop {
        For all M loop {
            if R < 50 output M
        }
        output carriage return
    }

Given a uniform distribution for the value of R, the resulting output is a data set where 50% of the intersections are (conceptually) represented by a binary '1' and the rest by a '0' (Figure 1). Therefore we could expect 50% of the elements in any column or row to be set to '1'. Because the columns are generated independently, the support for each 1-itemset can be estimated to be 50% (1/2), the support for each pair 25% (1/2*1/2), for each triple 12.5% (1/2*1/2*1/2), and so on.
In other words we will more or less know the support values in advance, which is not ideal for testing ARM algorithms. Data sets produced using this naive approach may thus be regarded as inappropriate.

Figure 1: Distribution of test set data
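The predictability of the naive approach can be checked empirically. The Java sketch below is an illustration only, not the LUCS-KDD code: the names NaiveGenerator, generate and support are hypothetical, and the seed is an arbitrary choice. It builds a table with the naive rule and measures the support of a single column, a pair and a triple.

```java
import java.util.Random;

// Hypothetical sketch of the naive scheme; not the LUCS-KDD implementation.
class NaiveGenerator {

    // Returns an N x M table in which each cell is '1' when R < 50.
    static boolean[][] generate(int numRows, int numCols, long seed) {
        Random random = new Random(seed);
        boolean[][] table = new boolean[numRows][numCols];
        for (int n = 0; n < numRows; n++) {
            for (int m = 0; m < numCols; m++) {
                int r = random.nextInt(100);       // 0 <= R < 100
                table[n][m] = r < 50;
            }
        }
        return table;
    }

    // Fraction of rows that hold a '1' in every listed (0-based) column.
    static double support(boolean[][] table, int... columns) {
        int count = 0;
        for (boolean[] row : table) {
            boolean all = true;
            for (int c : columns) {
                all &= row[c];
            }
            if (all) {
                count++;
            }
        }
        return (double) count / table.length;
    }

    public static void main(String[] args) {
        boolean[][] table = generate(100_000, 5, 42L);
        System.out.printf("single column: %.3f%n", support(table, 0));
        System.out.printf("pair:          %.3f%n", support(table, 0, 1));
        System.out.printf("triple:        %.3f%n", support(table, 0, 1, 2));
    }
}
```

With a large number of rows the three values should lie close to 0.5, 0.25 and 0.125 respectively, whichever columns are chosen.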
A better result can be produced if a random probability factor P is assigned to each column prior to generation of the data set, such that 0 <= P < 100. For each intersection (M,N) a '1' is associated with the intersection if R is less than P, and a '0' otherwise. Thus:

    for all M loop {
        generate random value for Pm
    }
    For all N loop {
        For all M loop {
            if R < Pm output M
        }
        output carriage return
    }

Given a sufficiently large M (number of columns), the distribution of binary ones can be expressed as shown in Figure 2, i.e. some columns will be much more densely "populated" than others. Consequently the resulting data sets can be considered to bear a much closer resemblance to genuine data sets and thus to be more appropriate for testing the proposed ARM algorithms.
However, the overall density (D) of '1's in the data set would always be approximately 50%. It is desirable to provide a facility to vary the density D such that 0 <= D < 100.

Figure 2: Test set generation including variable column density
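The per-column scheme, and the observation that its overall density stays near 50%, can be sketched in Java as follows. This is an illustrative sketch rather than the generator's actual code: the class and method names are hypothetical, and the seed and table size are arbitrary.

```java
import java.util.Random;

// Hypothetical sketch of the per-column scheme; names are illustrative only.
class ColumnDensityGenerator {

    // Draws one probability factor Pm per column, with 0 <= Pm < 100.
    static int[] randomColumnProbabilities(int numCols, Random random) {
        int[] p = new int[numCols];
        for (int m = 0; m < numCols; m++) {
            p[m] = random.nextInt(100);
        }
        return p;
    }

    // Fills an N x M table: cell (n, m) is '1' when a fresh R is less than Pm.
    static boolean[][] generate(int numRows, int[] p, Random random) {
        boolean[][] table = new boolean[numRows][p.length];
        for (int n = 0; n < numRows; n++) {
            for (int m = 0; m < p.length; m++) {
                table[n][m] = random.nextInt(100) < p[m];
            }
        }
        return table;
    }

    // Fraction of all cells in the table that are '1'.
    static double density(boolean[][] table) {
        long ones = 0;
        long cells = 0;
        for (boolean[] row : table) {
            for (boolean cell : row) {
                if (cell) {
                    ones++;
                }
                cells++;
            }
        }
        return (double) ones / cells;
    }

    public static void main(String[] args) {
        Random random = new Random(7L);
        int[] p = randomColumnProbabilities(1000, random);
        boolean[][] table = generate(200, p, random);
        System.out.printf("overall density = %.3f%n", density(table));
    }
}
```

Individual columns now range from almost empty to almost full, but because the Pm values are drawn uniformly their mean is close to 50, so the overall density remains close to one half.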
Conceptually the process of increasing or decreasing the desired overall density D for the resulting output set is equivalent to the rotation of the diagonal in Figure 3 about either the "top-left" or "bottom-right" corner as appropriate (Figures 3a and 3b). The rotation is carried out as follows:

    if (D < 50) rotate about top left corner
    if (D == 50) do nothing
    if (D > 50) rotate about bottom right corner

The current generation process (Version 3), as incorporated into the test set generation algorithm, can be expressed as follows:
    for all M loop {
        generate random value for Pm
        if (D < 50) Pm = Pm-(Pm*(50-D)/50)
        else {
            if (D > 50) Pm = Pm+((100-Pm)*(D-50)/50)
        }
    }
    (generation loops as for Version 2)

Thus we adjust the value of P prior to commencing the generation process. For D < 50 each Pm is scaled by D/50, and for D > 50 the gap between Pm and 100 is scaled by (100-D)/50, so that in both cases the expected overall density is D. Table 1 below gives some sample adjusted values for a range of P given a range of values for D. The code (written in Java) for Version 3.2 of the generation algorithm can be found here.
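The adjustment step can be sketched as a small Java function. This is a hedged illustration, not the Version 3.2 source: DensityAdjuster and adjust are hypothetical names. The D < 50 branch is written here as p - p*(50 - D)/50, which is the same as scaling p by D/50. That form is an assumption, chosen because it gives an expected density of D and approaches the unadjusted value as D approaches 50.

```java
// Hypothetical sketch of the density adjustment; not the Version 3.2 source.
class DensityAdjuster {

    // Adjusts a column probability p (0 <= p < 100) for a target density d (0 <= d <= 100).
    static double adjust(double p, double d) {
        if (d < 50) {
            // Assumed form: lowers p towards 0, equal to p * d / 50.
            return p - (p * (50 - d) / 50);
        }
        if (d > 50) {
            // Raises p towards 100 by a fraction of the remaining gap.
            return p + ((100 - p) * (d - 50) / 50);
        }
        return p;                                  // d == 50: leave p unchanged
    }

    public static void main(String[] args) {
        double[] densities = {10, 25, 50, 75, 90};
        double[] probabilities = {0, 20, 40, 60, 80};
        // One line per starting probability, one column per target density.
        for (double p : probabilities) {
            StringBuilder row = new StringBuilder(String.format("P=%5.1f:", p));
            for (double d : densities) {
                row.append(String.format(" %6.1f", adjust(p, d)));
            }
            System.out.println(row);
        }
    }
}
```

For example, adjust(40, 25) gives 20 and adjust(40, 75) gives 70. The adjusted value is what would then be compared against R in the generation loops.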
Figure 3 (a), (b): Test set generation including input of desired data density
Table 1: Some example adjusted values using version three of the data generator
3. VERSION 3.2 JAVA CODE
The Java code for Version 3.2 of the "in house" test set generator is available:
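The command line version of the generator (GeneratorApp) is called with up to 4 command line arguments, in the following order: the number of columns, the number of rows, the density as a percentage (50 if omitted) and the output file name (testFile if omitted).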
Example calls:

    $ java GeneratorApp 10 10
    $ java GeneratorApp 20 100 25
    $ java GeneratorApp 30 1000 10 example1.dat

The first produces a data set comprising 10 columns, 10 rows and a density of 50% stored in a file called testFile. The second produces a data set comprising 20 columns, 100 rows and a density of 25% stored in a file called testFile. The last produces a data set comprising 30 columns, 1000 rows and a density of 10% stored in a file called example1.dat.
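The argument handling implied by these example calls could be sketched in Java as below. This is not the actual GeneratorApp code. The class GeneratorArgs, its fields and the usage message are hypothetical, and the sketch assumes that the first two arguments are mandatory, which the examples suggest but do not confirm.

```java
// Hypothetical sketch of the argument handling; not the actual GeneratorApp code.
class GeneratorArgs {
    final int numCols;
    final int numRows;
    final int density;       // percentage
    final String fileName;

    GeneratorArgs(int numCols, int numRows, int density, String fileName) {
        this.numCols = numCols;
        this.numRows = numRows;
        this.density = density;
        this.fileName = fileName;
    }

    // Parses <columns> <rows> [density] [file], using the defaults seen in the examples.
    static GeneratorArgs parse(String[] args) {
        // Assumption: columns and rows are required, density and file are optional.
        if (args.length < 2 || args.length > 4) {
            throw new IllegalArgumentException(
                "expected: <columns> <rows> [density] [file]");
        }
        int cols = Integer.parseInt(args[0]);
        int rows = Integer.parseInt(args[1]);
        int density = args.length > 2 ? Integer.parseInt(args[2]) : 50;
        String file = args.length > 3 ? args[3] : "testFile";
        return new GeneratorArgs(cols, rows, density, file);
    }

    public static void main(String[] args) {
        GeneratorArgs parsed = parse(new String[] {"20", "100", "25"});
        System.out.println(parsed.numCols + " columns, " + parsed.numRows + " rows, "
            + parsed.density + "% density, file " + parsed.fileName);
    }
}
```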