CHI-SQUARED TESTING FOR ASSESSING CLASSIFICATION RULES


Frans Coenen

Department of Computer Science

The University of Liverpool

13 February 2004

CONTENTS

1. Overview.
 
2. Example.



1. OVERVIEW

Chi-squared testing is used to determine whether two variables are independent of one another. In Chi-squared testing we compare a set of observed values (O) against a set of expected values (E), that is, the values that would be expected if there were no association between the variables. We calculate a value, chi2, using the identity:

chi2 = sum of ((O-E)^2/E)

If the result is above a given critical threshold value then we can say that there is a relationship between the variables; otherwise we have no evidence of a relationship and cannot reject the assumption that the variables are independent.
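
As a minimal sketch of the identity above, the following Python function sums (O-E)^2/E over paired observed and expected values. The name chi_squared and the use of plain lists are illustrative assumptions, not part of the original note.

```python
def chi_squared(observed, expected):
    """Return the sum of (O - E)^2 / E over paired observed/expected values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if e <= 0:
            # The identity divides by E, so every expected value must be positive.
            raise ValueError("expected values must be positive")
        total += (o - e) ** 2 / e
    return total
```

When the observed values equal the expected values the function returns 0.0; larger departures from the expected values give a larger result.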

Given a classification rule we can determine whether the rule is surprising (i.e. unexpected) or not by determining whether there exists some special relationship between the attributes and the classifier, or whether the rule is simply one that we would expect if the attributes and the classifier were independent.




2. EXAMPLE

Given a rule A -> c such that:

support(A)         =  8
support(c)         = 12
support(A + c)     =  6
N (num of records) = 32

We can produce a contingency table (referred to by some authors as a confusion matrix) of observed values as follows:

        c                   !c                          Total
A       sup(A+c)            sup(A)-sup(A+c)             sup(A)
!A      sup(c)-sup(A+c)     N-sup(A)-sup(c)+sup(A+c)    N-sup(A)
Total   sup(c)              N-sup(c)                    N

where:

sup(A+c): the number of true positives
sup(A)-sup(A+c): the number of false negatives
sup(c)-sup(A+c): the number of false positives
N-sup(A)-sup(c)+sup(A+c): the number of true negatives
sup(A): the number of +ve records
N-sup(A): the number of -ve records
sup(c): the number of records covered by the rule
N-sup(c): the number of records NOT covered by the rule

With the values given above this produces:

        c     !c    Total
A        6     2      8
!A       6    18     24
Total   12    20     32
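
The observed table can be derived mechanically from the four support counts. The sketch below is one way to do this in Python; the function name observed_table and its argument names are assumptions made for illustration. The bottom-right cell is taken as N - sup(A) - sup(c) + sup(A+c), which gives 18 for the values above.

```python
def observed_table(sup_a, sup_c, sup_ac, n):
    """Build the 2x2 observed table [[A&c, A&!c], [!A&c, !A&!c]] from supports."""
    table = [
        [sup_ac, sup_a - sup_ac],
        [sup_c - sup_ac, n - sup_a - sup_c + sup_ac],
    ]
    if any(cell < 0 for row in table for cell in row):
        # A negative cell means the supports cannot all hold for N records.
        raise ValueError("supports are inconsistent with N")
    return table


# Supports from the example rule A -> c.
observed = observed_table(sup_a=8, sup_c=12, sup_ac=6, n=32)
```

The row and column totals are not stored; they can be recovered by summing the rows and columns of the returned table.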

The above contingency table is a generalisation of a confusion matrix.

The expected values are then calculated thus:

        c               !c               Total
A       (12*8)/32=3     (20*8)/32=5       8
!A      (12*24)/32=9    (20*24)/32=15    24
Total   12              20               32
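
Each expected value is the product of the corresponding row total and column total, divided by N. A short Python sketch of that calculation follows; the function name expected_table and the nested-list layout are illustrative assumptions.

```python
def expected_table(observed):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed counts for the example rule, rows A and !A, columns c and !c.
expected = expected_table([[6, 2], [6, 18]])
```

Because the expected values are built from the observed totals, the expected table always has the same row and column totals as the observed table.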

The Chi-squared value can then be calculated as follows:

 
O     E     O-E   (O-E)^2   ((O-E)^2)/E
6     3      3     9         3.0
6     9     -3     9         1.0
2     5     -3     9         1.8
18    15     3     9         0.6
                   Total     6.4
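
The working above can be reproduced column by column. The following Python sketch, with variable names chosen purely for illustration, builds one row per cell and then sums the last column.

```python
# Observed and expected values, one entry per cell, in the order used above.
observed = [6, 6, 2, 18]
expected = [3, 9, 5, 15]

rows = []
for o, e in zip(observed, expected):
    diff = o - e
    # Columns: O, E, O-E, (O-E)^2, ((O-E)^2)/E
    rows.append((o, e, diff, diff ** 2, diff ** 2 / e))

# The Chi-squared value is the sum of the final column.
chi2 = sum(row[-1] for row in rows)

for o, e, diff, sq, contrib in rows:
    print(f"{o:>3} {e:>3} {diff:>4} {sq:>4} {contrib:>6.1f}")
print(f"chi2 = {chi2:.1f}")
```

Note that every cell of a 2x2 table has the same squared difference (here 9), so the cells with the smallest expected values contribute most to the total.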

To determine whether this Chi-squared value is significant or not we must know a critical value for Chi-squared. These are usually published in tables with the significance level (expressed as a percentage) across the columns and the degrees of freedom (DF) down the rows. DF is calculated as follows:

DF = (num rows in O or E table-1) *
	(num columns in O or E table-1)

In the case of classification rule mining this is always going to be:

DF = (2-1)*(2-1) = 1

Some critical values for DF=1 and a range of significance levels are given below. Usually a significance level of 5% is chosen, in which case the Chi-squared value of 6.4 for the above rule exceeds the critical value of 3.8415 and can be said to be significant (interesting or surprising).

        10%      5%       2.5%     1%       0.5%
DF=1    2.7055   3.8415   5.0239   6.6349   7.8794
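
The comparison against a critical value can be sketched as a simple table lookup. In the Python below the critical values are copied from the DF=1 row above; the names CRITICAL_DF1 and is_significant are assumptions made for illustration, and only the five tabulated levels are supported.

```python
# Critical values for DF = 1, copied from the table above
# (significance level as a fraction -> critical Chi-squared value).
CRITICAL_DF1 = {
    0.10: 2.7055,
    0.05: 3.8415,
    0.025: 5.0239,
    0.01: 6.6349,
    0.005: 7.8794,
}


def is_significant(chi2, level=0.05):
    """True if chi2 exceeds the DF = 1 critical value at the given level."""
    if level not in CRITICAL_DF1:
        raise ValueError(f"no critical value tabulated for level {level}")
    return chi2 > CRITICAL_DF1[level]
```

For the example rule, is_significant(6.4) holds at the 5% level, whereas at the stricter 1% level the value 6.4 falls just below the critical value of 6.6349.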



Created and maintained by Frans Coenen. Last updated 09 February 2001.