(IN HOUSE) TEST SET GENERATOR

Version 3.2

Frans Coenen

Department of Computer Science, The University of Liverpool

http://www2.csc.liv.ac.uk/~frans/cgi-html/DataMining/Generators/generator.html

Monday 13 November 2000

Revised Wednesday 7 February 2001, Friday 9 February 2007


CONTENTS

1. Introduction.
2. Development.
3. Version 3.2 Java code.



1. INTRODUCTION

The current version (3.2) of the LUCS-KDD data generator is available from this site. The data generator is intended to produce data sets for use in the testing of Association Rule Mining (ARM) algorithms, but may very well have other uses. The output from the generator is a file of N rows and M columns. For example, the row:

{1, 5, 7, 8}

indicates the presence of a binary '1' in columns 1, 5, 7 and 8 of the data table (all other columns contain a '0' by default).
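The row encoding above can be sketched in Java as follows. This is an illustrative expansion of an itemset row into the binary table row it represents; the class and method names are hypothetical, not taken from the generator's code.

```java
// Illustrative sketch: expand an itemset row such as {1, 5, 7, 8}
// into the binary table row it represents (column numbering starts at 1).
public class RowExpander {

    public static int[] expand(int[] itemset, int numColumns) {
        int[] row = new int[numColumns]; // all columns '0' by default
        for (int col : itemset) row[col - 1] = 1; // set a '1' in each listed column
        return row;
    }

    public static void main(String[] args) {
        int[] row = expand(new int[] {1, 5, 7, 8}, 10);
        StringBuilder sb = new StringBuilder();
        for (int bit : row) sb.append(bit);
        System.out.println(sb); // prints 1000101100
    }
}
```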

The following sub-sections detail the development of the generator.




2. DEVELOPMENT

The principal inputs to the data generator are the values M and N, where M is the desired number of available attributes (columns) and N the desired number of rows. For each intersection (M,N) a random number R is generated such that 0 <= R < 100. In a naive system we might proceed as follows: if the value of R is less than 50, a '1' is considered to exist in that column and consequently the current value of M is output to file. We can express this process using pseudo code as follows:

For all N loop {
	For all M loop {
		generate random value for R
		if R < 50 output M
		}
	output carriage return
	}

Given a uniform distribution for the value of R, the resulting output is a data set where 50% of the intersections are (conceptually) represented by a binary `1' and the rest by a `0' (Figure 1). We could therefore expect 50% of the elements in any column or row to be set to `1'. Thus the support for each one-itemset can be estimated to be 50% (1/2), the support for each pair 25% (1/2 * 1/2), for each triple 12.5% (1/2 * 1/2 * 1/2), and so on.
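The naive process can be sketched in Java as follows. This is a minimal, assumed implementation for illustration only (class and method names are hypothetical); it draws R per intersection and writes out column numbers, one row per line.

```java
import java.util.Random;

// Minimal sketch of the naive generator: for every (column, row)
// intersection draw R in [0, 100) and output the column number when R < 50.
public class NaiveGenerator {

    public static String generate(int numColumns, int numRows, long seed) {
        Random rand = new Random(seed);
        StringBuilder out = new StringBuilder();
        for (int n = 0; n < numRows; n++) {
            for (int m = 1; m <= numColumns; m++) {
                if (rand.nextInt(100) < 50) out.append(m).append(' ');
            }
            out.append('\n'); // carriage return after each row
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(8, 4, 42L));
    }
}
```

Over a large table, roughly half of all intersections come out as `1', matching the 50% density argument above.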

 

In other words we will more or less know the support values in advance --- not ideal for testing ARM algorithms. Data sets produced using this naive approach may thus be regarded as inappropriate.

Figure 1 Distribution of test set data




A better result can be produced if a random probability factor P is assigned to each column prior to generation of the data set, such that 0 <= P < 100. For each intersection (M,N) a `1' is associated with the intersection if R is less than P, and a `0' otherwise. Thus:

For all M loop {
	generate random value for Pm
	}
For all N loop {
	For all M loop {
		generate random value for R
		if R < Pm output M
		}
	output carriage return
	}

Given a sufficiently large M (number of columns) the distribution of binary ones can be expressed as shown in Figure 2, i.e. some columns will be much more densely "populated" than others. Consequently the resulting data sets produced can be considered to bear a much closer resemblance to genuine data sets and thus more appropriate for testing the proposed ARM algorithms.
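This improved scheme can be sketched in Java as follows. Again the names are illustrative rather than taken from the original code: each column m receives its own random probability P[m] before generation begins, so some columns end up much more densely populated than others.

```java
import java.util.Random;

// Sketch of the improved scheme: assign a random probability P[m] to each
// column before generation, then set a '1' at (m, n) whenever R < P[m].
public class ColumnDensityGenerator {

    public static String generate(int numColumns, int numRows, long seed) {
        Random rand = new Random(seed);
        double[] p = new double[numColumns];
        for (int m = 0; m < numColumns; m++) p[m] = rand.nextInt(100); // 0 <= P < 100
        StringBuilder out = new StringBuilder();
        for (int n = 0; n < numRows; n++) {
            for (int m = 0; m < numColumns; m++) {
                if (rand.nextInt(100) < p[m]) out.append(m + 1).append(' ');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(8, 4, 7L));
    }
}
```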

 

However, the overall distribution (D) of `1's in the data set would always be 50%. It is desirable to provide a facility to vary the distribution D such that 0 <= D < 100.

Figure 2 Test set generation including variable column density




Conceptually the process of increasing/decreasing the desired overall density D for the resulting output set is equivalent to the rotation of the diagonal in Figure 3 about either the "top-left" or "bottom-right" corner as appropriate (Figures 3a and 3b). The rotation is carried out as follows:

if (D < 50) rotate about top left corner
if (D == 50) do nothing
if (D > 50) rotate about bottom right corner

The current generation process (Version 3) incorporated into the test set generation algorithm can be expressed as follows:

 
For all M loop {
	generate random value for Pm
	if (D < 50) Pm = Pm - (Pm*D/50)
	else if (D > 50) Pm = Pm + ((100-Pm)*(D-50)/50)
	}
--- As for Version 2 ---

Thus we adjust the value of P prior to commencing the generation process. Table 1 below gives some sample adjusted values for a range of P given a range of values for D. The code (written in Java) for Version 3.2 of the generation algorithm is listed in Section 3.

Figure 3 Test set generation including input of desired data density: (a) density greater than 50%, (b) density less than 25%

P	D=25	D=50	D=75
25	P = 25 - (25*25/50) = 12.5	P = 25	P = 25 + ((100-25)*(75-50)/50) = 25 + (75*25/50) = 62.5
50	P = 50 - (50*25/50) = 25	P = 50	P = 50 + ((100-50)*(75-50)/50) = 50 + (50*25/50) = 75
75	P = 75 - (75*25/50) = 37.5	P = 75	P = 75 + ((100-75)*(75-50)/50) = 75 + (25*25/50) = 87.5

Table 1: Some example adjusted values using version three of the data generator
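The adjustment used to produce Table 1 can be sketched in Java as follows. The class and method names are illustrative; the arithmetic reproduces the table values exactly.

```java
// Sketch of the Version 3 density adjustment applied to each column
// probability P prior to generation, reproducing the values in Table 1.
public class DensityAdjust {

    public static double adjust(double p, double d) {
        if (d < 50) return p - (p * d / 50);                 // rotate about top-left corner
        if (d > 50) return p + ((100 - p) * (d - 50) / 50);  // rotate about bottom-right corner
        return p;                                            // d == 50: no adjustment
    }

    public static void main(String[] args) {
        System.out.println(adjust(25, 75)); // prints 62.5, as in Table 1
    }
}
```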




3. VERSION 3.2 JAVA CODE

The Java code for Version 3.2 of the "in house" test set generator is available:

  • GeneratorApp.java: Application code to allow the generator to be run from the command line (see notes below).
  • GeneratorGUI_App.java: GUI interface for simpler operation and "batch mode" operation. Batch mode allows a sequence of data sets to be generated with differing numbers of columns and/or numbers of rows and/or density. This is done by specifying start, end and increment values for the number of columns/rows and/or density.
  • Generator.java: The version of the generator that operates with command line arguments.
  • GeneratorControl.java: The control (GUI) module for the GUI version of the generator.
  • GeneratorModel.java: The model (GUI) module for the GUI version of the generator.
  • BatchModeParams.java: Additional GUI to allow input for batch mode operation. (This is what distinguishes Version 3.2 from the earlier 3.1 version).

The command line version of the generator (GeneratorApp) is called with up to 4 command line arguments as follows:

Argument No.	Description	Default Value
1	Number of Columns	1024
2	Number of Rows	1000000
3	Density	50 (%)
4	Output file name	testFile

Example calls:

$java GeneratorApp 10 10

$java GeneratorApp 20 100 25

$java GeneratorApp 30 1000 10 example1.dat

The first produces a data set comprising 10 columns, 10 rows and a density of 50% stored in a file called testFile. The second produces a data set comprising 20 columns, 100 rows and a density of 25% stored in a file called testFile. The last produces a data set comprising 30 columns, 1000 rows and a density of 10% stored in a file called example1.dat.