1. OVERVIEW
Chi-squared testing is used to determine whether two variables are independent of one another. In Chi-squared testing we compare a set of observed values (O) against a set of expected values (E), that is, the values that would be expected if there were no association between the variables. We calculate a value, chi2, using the formula:
chi2 = sum of ((O - E)^2 / E)
If the result is above a given critical threshold value then we can say that there is a relationship between the variables; otherwise we cannot conclude that one exists.
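The formula for chi2 can be sketched in Python as follows. This is a minimal illustration rather than part of the original text: the function name `chi_squared` and the use of flat lists for O and E are assumptions made for the example, and the input values are hypothetical.

```python
def chi_squared(observed, expected):
    """Return the sum of (O - E)^2 / E over paired observed/expected values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical values, chosen only to exercise the function.
print(chi_squared([10, 20], [15, 15]))
```

The larger the gap between each observed value and its expected value, the larger the result; identical lists give 0.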
Given a classification rule, we can determine whether the rule is surprising (i.e. unexpected) by testing whether there is some special relationship between the attributes and the classifier, or whether the rule is simply one that we might expect if there were no association between them.
2. EXAMPLE
Given a rule A -> c such that:
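    support(A)            = 8
    support(c)            = 12
    support(A + c)        = 6
    N (number of records) = 32

we can produce a contingency table of observed values (O) as follows:

                 c    not c    Total
    A            6        2        8
    not A        6       18       24
    Total       12       20       32

A contingency table of this kind is a generalisation of a confusion matrix, and some authors refer to it by that name. Each expected value (E) is then calculated as (row total * column total) / N:

                 c    not c    Total
    A            3        5        8
    not A        9       15       24
    Total       12       20       32

The Chi-squared value can then be calculated as follows:

    chi2 = (6-3)^2/3 + (2-5)^2/5 + (6-9)^2/9 + (18-15)^2/15
         = 3 + 1.8 + 1 + 0.6
         = 6.4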
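The arithmetic of this example can be checked with a short Python script. It is a sketch under stated assumptions: the helper names (`contingency_from_supports`, `expected_table`, `chi_squared`) and the nested-list table layout are illustrative choices, not taken from the original text. Only the support counts and N come from the example.

```python
def contingency_from_supports(sup_a, sup_c, sup_ac, n):
    """Build the 2x2 observed table [[A and c, A and not c],
    [not A and c, not A and not c]] from support counts."""
    return [
        [sup_ac, sup_a - sup_ac],
        [sup_c - sup_ac, n - sup_a - sup_c + sup_ac],
    ]


def expected_table(observed):
    """Expected cell value = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over every cell of the two tables."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


# Values from the example: support(A)=8, support(c)=12, support(A+c)=6, N=32.
observed = contingency_from_supports(8, 12, 6, 32)
expected = expected_table(observed)
chi2 = chi_squared(observed, expected)

print(observed)        # [[6, 2], [6, 18]]
print(expected)        # [[3.0, 5.0], [9.0, 15.0]]
print(round(chi2, 2))  # 6.4
```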
To determine whether this Chi-squared value is significant or not we must know a critical value for Chi-squared. These are usually published in tables with significance level along the X-axis (expressed as a percentage) and the degrees of freedom (DF) along the Y-axis. DF is calculated as follows:

    DF = (num rows in O or E table - 1) * (num columns in O or E table - 1)

In the case of classification rule mining this is always going to be:

    DF = (2-1) * (2-1) = 1

Some critical values for DF = 1 and a range of significance levels are given below.

    Significance level    10%      5%       1%
    Critical value        2.706    3.841    6.635

Usually a significance level of 5% is chosen, in which case the Chi-squared value of 6.4 for the above rule exceeds the critical value of 3.841 and can be said to be significant (interesting or surprising).
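The significance check can be sketched in Python using only the standard library. The names `CRITICAL_DF1`, `degrees_of_freedom`, `is_significant` and `p_value_df1` are assumptions made for this illustration. The critical values are the standard published ones for DF = 1, and the p-value helper relies on the fact that, for DF = 1 only, the upper-tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math

# Standard chi-squared critical values for DF = 1,
# keyed by significance level (0.05 means 5%).
CRITICAL_DF1 = {0.10: 2.706, 0.05: 3.841, 0.01: 6.635}


def degrees_of_freedom(num_rows, num_cols):
    """DF = (rows - 1) * (columns - 1) of the O (or E) table."""
    return (num_rows - 1) * (num_cols - 1)


def is_significant(chi2, level=0.05):
    """True if chi2 exceeds the DF = 1 critical value for the given level."""
    return chi2 > CRITICAL_DF1[level]


def p_value_df1(chi2):
    """Upper-tail probability P(X > chi2); valid for DF = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))


df = degrees_of_freedom(2, 2)
print(df)                         # 1
print(is_significant(6.4, 0.05))  # True
print(is_significant(6.4, 0.01))  # False
print(p_value_df1(6.4))
```

A lookup table mirrors the published-table approach described above; the p-value helper is an optional alternative that avoids choosing a level in advance.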
Created and maintained by Frans Coenen. Last updated 09 February 2001.