Work on data mining, and more generally on Knowledge Discovery in Data (KDD), has been ongoing within the Department of Computer Science at The University of Liverpool since 1997. During this time the work has encompassed many aspects of data mining and has been conducted by many different groupings of people. At times the people involved have identified themselves as the LUCS-KDD research group (or team), and some of the published work refers to this group. Officially the Department's work on data mining is encompassed by the Agents Research Group, although not all of those conducting the work have been members of the Agents group (the Department currently has four research groups).
KDD is concerned with the discovery of hitherto unrecognised and "interesting" information in (usually large) data repositories. Within this process, the term Data Mining refers to the actual knowledge discovery step, as opposed to, say, data preprocessing or the post-processing of results.
In the rest of this WWW page the Department's work on data mining is described in terms of three overlapping chronological phases.
Phase 1 (1997-2005) --- Initial work
The original "team" (Frans Coenen, Paul Leng and Graham Goulbourne) was formed in September 1997 to work on the KDD for FM (Facilities Management) project funded by Royal-Sun Alliance Insurance. The objective of this work was to analyse FM databases, which the project team achieved using Association Rule Mining (ARM) technology. An Association Rule (AR) is a pattern of the form A->B, where A and B are disjoint sets of binary-valued attributes, interpreted as "if the attribute set A is present in a database record, then it is likely that the attribute set B will also be present". In its fundamental form ARM involves the discovery, in a tabular database, of all ARs that satisfy certain predefined threshold requirements: essentially, that the relationships they express occur more often than would occur randomly.
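As an illustration of the "threshold requirements" mentioned above, the following minimal sketch computes the two measures most commonly used for this purpose in the ARM literature, support and confidence, for a rule A->B. The page itself does not name the thresholds used in the project, so the choice of these two measures, the function names and the toy records are all assumptions made for the example.

```python
from typing import FrozenSet, Sequence

Record = FrozenSet[str]  # a record is the set of attributes present in it


def support(itemset: Record, records: Sequence[Record]) -> float:
    """Fraction of records that contain every attribute in `itemset`."""
    if not records:
        return 0.0
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(antecedent: Record, consequent: Record,
               records: Sequence[Record]) -> float:
    """Support of A and B together, divided by the support of A alone."""
    supp_a = support(antecedent, records)
    if supp_a == 0:
        return 0.0
    return support(antecedent | consequent, records) / supp_a


# Hypothetical toy data: four records over the attributes a, b and c.
records = [frozenset(s) for s in ({"a", "b", "c"}, {"a", "b"},
                                  {"a", "c"}, {"b", "c"})]
A, B = frozenset({"a"}), frozenset({"b"})
rule_support = support(A | B, records)
rule_confidence = confidence(A, B, records)
```

A rule would then be reported only if both values meet user-chosen minimum thresholds.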
ARM is a computationally demanding task because of the likely magnitude of the data involved and the size of the search space, which is exponential in the number of possible attribute combinations. For this reason, the search for effective algorithms and efficient implementation strategies is an active area of research. The team developed a number of algorithms (Apriori-T and TFP) that involved preprocessing of the data and set-enumeration tree structures (the P-tree and the T-tree) to order the search. Experimental results have shown that this method compared very favourably with the best published algorithms at the time, and TFP still outperforms the majority of its competitor algorithms. Graham Goulbourne was awarded a PhD for his contribution to the KDD in FM project in June 2002.
The work on fast ARM resulting from the KDD in FM project was extended in a number of directions on completion of the project. This subsequent work investigated: (i) strategies for partitioning data for more effective very large scale ARM, (ii) methods for parallel and distributed ARM and (iii) Classification Association Rule Mining (CARM).
Work on very large scale ARM was undertaken by Shakil Ahmed, who joined the team in September 2000. The work resulted in a number of data partitioning ideas and culminated in Shakil being awarded a PhD in September 2005. The work on distributed and parallel ARM was undertaken by Joy Alatta, Riming Zhang, Steve Phelps and Aris Pagourtzis in 2004/2005, with additional support from Dr Michele Zito and Wojciech Rytter. Additional work on CARM was undertaken by Lu Zhang (then engaged upon the Stoves project). The CARM initiative resulted in a number of algorithms, of which the TFPC algorithm is the most significant. TFPC is of particular note as it is extremely efficient while maintaining the accuracy of the generated classifier.
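To make the size of the search space concrete: with n binary attributes there are 2^n - 1 non-empty attribute sets to consider. The sketch below is a generic, textbook-style level-wise (Apriori-like) search that prunes this space using the fact that every subset of a frequent set must itself be frequent. It is not the team's Apriori-T or TFP algorithm and uses neither the P-tree nor the T-tree; the function name and the toy data are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence

Record = FrozenSet[str]


def frequent_itemsets(records: Sequence[Record],
                      min_support: float) -> Dict[Record, float]:
    """Return every attribute set whose support is at least `min_support`."""
    n = len(records)

    def supp(itemset: Record) -> float:
        return sum(1 for r in records if itemset <= r) / n

    # Level 1: single attributes that meet the threshold.
    items = sorted({i for r in records for i in r})
    level = {frozenset([i]): supp(frozenset([i])) for i in items}
    level = {s: v for s, v in level.items() if v >= min_support}

    frequent: Dict[Record, float] = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        # Count step: one pass over the data for the surviving candidates.
        level = {c: supp(c) for c in candidates}
        level = {c: v for c, v in level.items() if v >= min_support}
    return frequent


# Hypothetical toy data: four records over the attributes a, b and c.
records = [frozenset(s) for s in ({"a", "b", "c"}, {"a", "b"},
                                  {"a", "c"}, {"b", "c"})]
result = frequent_itemsets(records, min_support=0.5)
```

On this toy data the three single attributes and the three pairs are frequent, while the full set of three attributes falls below the threshold.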
Phase 2 (2003-2009) --- Extension of Initial Research Results
The second phase of data mining work undertaken within the Department (partly overlapping with Phase 1) was concerned with extending the application of ARM and CARM, especially the TFP and TFPC algorithms, to wider application areas. These "wider" areas included (i) text mining, (ii) Multi-Agent Data Mining (MADM), (iii) ARM supported argumentation and (iv) utility, weighted and fuzzy ARM. Much of this work is still in progress. Phase 2 of the group's work also coincided with Rob Sanderson joining the "team".
Work on text mining was initially undertaken by Justin Wang (commencing in September 2003). The work was first directed at the pre-processing of document sets for text mining, i.e. key word and key phrase identification. The novel aspect of the work was that the keyword identification was done independently of the language of the input document sets. Indeed the statistical approaches that were proposed were tested on both English and Chinese document sets. Justin Wang was awarded a PhD for his work on text mining in June 2008.
Work on MADM commenced in September 2006 when Neena Madan and Kamal Ali Albashiri joined the group. Neena Madan (September 2006 to May 2007) investigated incremental ARM (I-ARM) techniques in the context of data mining, an application that was well suited to MADM. Kamal Ali Albashiri worked on the communication protocols and architectures required to support MADM and developed a generic framework for MADM called EMADS (Extendable Multi-Agent Data Mining System). Extendibility was seen as an important aspect of achieving generic MADM and was incorporated in EMADS as a system of wrappers. The operation of EMADS is currently (see Phase 3) being made more rigorous, by including a formal ontology etc., by Santhana Chaimontree.
ARM supported argumentation is a novel application area of ARM technology investigated by Maya Wardeh (commencing in September 2006) with the support of Trevor Bench-Capon. Argumentation refers to a field of Multi-Agent Systems (MAS) research directed at the computer automation of negotiation (argumentation). Typically there are a number of players, each of whom wants to convince the other players of their viewpoint. The concept advocated by the research is that the dynamic data mining of arguments (ARs) has advantages over the more standard Knowledge Based (KB) approach.
Work on utility, fuzzy and weighted ARM is being carried out in collaboration with Maybin Muyeba (currently at Manchester Metropolitan University), David Reid (Liverpool Hope University) and Muhammad Sulaiman Khan (Liverpool Hope University). This work commenced in January 2007 and seeks to increase the current applicability of ARM.
Mention should also be made here (in the context of Phase 2) of Christian Setzkorn who worked on genetic algorithms to address classification problems and was awarded a well deserved PhD in June 2005.
Phase 3 (2006 onwards) --- New initiatives
Work in phase 3 seeks to widen the application of the Department's data mining expertise into various application areas. Of note is work on: (i) image mining, (ii) graph mining, (iii) trend mining and (iv) web mining.
Work on image mining was originally started by Shady Shidfer (in 2006) and has been extended in a number of directions, particularly with respect to medical image mining. Two medical image mining initiatives are currently in progress. The first is directed at the classification of MRI image data and is being undertaken by Ashraf El Sayed (commencing in November 2007) together with Martha van der Hoeke (Division of Health Statistics) and Vanessa Slumming. The MRI image mining work brings together a number of techniques: image segmentation (as promoted by the image analysis community), the representation of images using quad-trees, feature reduction, graph mining and classification. The work is ongoing and has produced good results to date. The second image mining initiative is looking at retina image analysis in the context of Age Related Macular Degeneration (AMD) and is being undertaken by Hanafi Hijazi together with Yalin Zhang (School of Clinical Sciences). The work commenced in November 2008 and is currently focussed on a histogram representation of retina images and Dynamic Time Warping techniques for image classification.
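For readers unfamiliar with Dynamic Time Warping (DTW), the sketch below shows the classic dynamic programming formulation applied to two histograms treated as numeric sequences, followed by a one-nearest-neighbour classifier built on that distance. The page does not describe how the retina work actually uses DTW, so the absolute-difference cost, the nearest-neighbour step, the function names and the toy histograms are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Classic DTW distance using absolute difference as the local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # d[i][j] holds the cheapest alignment cost of x[:i] with y[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # repeat y[j-1]
                                 d[i][j - 1],      # repeat x[i-1]
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]


def classify_1nn(query: Sequence[float],
                 labelled: List[Tuple[Sequence[float], str]]) -> str:
    """Return the label of the labelled histogram closest to `query`."""
    return min(labelled, key=lambda pair: dtw_distance(query, pair[0]))[1]


# Hypothetical toy histograms with made-up class labels.
training = [([1, 2, 3], "class_x"), ([9, 9, 9], "class_y")]
predicted = classify_1nn([1, 2, 2, 3], training)
```

Because DTW allows one sequence to be locally stretched against the other, the query [1, 2, 2, 3] aligns with [1, 2, 3] at zero cost even though the two differ in length.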
Graph mining is currently being pursued by Geof Jiang (who joined the Department in April 2007), who is investigating weighted graph mining approaches and their application (including their application to graph represented images and document sets). The work on graph mining is significantly broadening the research team's expertise. Michele Zito is also contributing to this work.
A recent innovation is work on various forms of trend mining, focussing on the use of the jumping/emerging pattern concept. There are a number of strands to this work. Vassiliki Somaraki (starting in October 2008) is investigating trend mining with respect to patient longitudinal data (with the support of Simon Harding). Puteri Nohuddin (starting in December 2008), with the support of Rob Christley and Christian Setzkorn (Faculty of Veterinary Sciences), is conducting similar work in the context of the UK cattle movement DB. A third trend mining investigation is underway using classic customer data sets as part of the Transglobal KTP project, which is being pursued by Reshma Patel together with Russel Martin (University of Liverpool) and Lawson Archer (Transglobal Express).
Web mining has been a data mining application area for some time and can be divided into two strands: (i) WWW usage mining and (ii) WWW content mining (the latter is very similar to text mining but with the inclusion of additional information such as URL data). Activity at Liverpool is looking at the automated identification of WWW site boundaries using clustering techniques. This is being undertaken by Ayesh Alshukri (supported by Rob Sanderson and Michele Zito). A second strand of this work is directed at WWW usage mining in the context of Web Based Application (design), which is being undertaken by Yogesh Patel as part of the Deeside KTP project. Further support for this latter project is being provided by Katie Atkinson (University of Liverpool) and Shane Williams (Deeside Insurance).
In Phase 3 the work on text mining was progressed further by Stephanie Chua (starting in October 2008), who is looking at rule induction mechanisms for text mining but with the novel "twist" of dynamically refining the rules generated to date as the induction process progresses. The operation of EMADS (see Phase 2) is currently being made more rigorous by including a formal ontology etc. This work is being undertaken by Santhana Chaimontree, commencing in September 2008, and is supported by Katie Atkinson.
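The jumping/emerging pattern concept can be illustrated with a short sketch. In the standard formulation, an emerging pattern is an attribute set whose support grows by at least some chosen factor from one data set to another, and a jumping emerging pattern is one that is absent from the first data set but present in the second. Treating the two data sets as an earlier and a later time slice, as done here, is an assumption about how the idea carries over to trend mining; the function names and toy records are likewise hypothetical.

```python
from typing import FrozenSet, Sequence

Record = FrozenSet[str]


def support(itemset: Record, records: Sequence[Record]) -> float:
    """Fraction of records that contain every attribute in `itemset`."""
    if not records:
        return 0.0
    return sum(1 for r in records if itemset <= r) / len(records)


def growth_rate(itemset: Record, earlier: Sequence[Record],
                later: Sequence[Record]) -> float:
    """Ratio of later support to earlier support (infinite for a 'jump')."""
    s1, s2 = support(itemset, earlier), support(itemset, later)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1


def is_emerging(itemset: Record, earlier: Sequence[Record],
                later: Sequence[Record], min_growth: float) -> bool:
    """True if support grew by at least the factor `min_growth`."""
    return growth_rate(itemset, earlier, later) >= min_growth


def is_jumping(itemset: Record, earlier: Sequence[Record],
               later: Sequence[Record]) -> bool:
    """True if the itemset is absent earlier but present later."""
    return growth_rate(itemset, earlier, later) == float("inf")


# Hypothetical time slices of the same kind of records.
earlier = [frozenset(s) for s in ({"a"}, {"a", "b"}, {"c"})]
later = [frozenset(s) for s in ({"a", "b"}, {"a", "b"}, {"c", "d"})]
ab, cd = frozenset({"a", "b"}), frozenset({"c", "d"})
```

Here the set {a, b} doubles its support between the two slices, while {c, d} appears only in the later one and is therefore a jumping pattern.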
Future plans are directed at developing various ideas concerned with data stream mining (in collaboration with Paul Watry), questionnaire mining and Google map feature mining.
It has always been the policy of the LUCS-KDD research team to make a substantial amount of the software developed, as part of the on-going programme of KDD research work, publicly available free of charge (although we would appreciate appropriate acknowledgement) for non-commercial usage. A number of implementations of KDD tools and techniques created by the project team are publicly available here.
The membership of the group has always been very dynamic. PhD students and RAs have joined and left on completion of their programmes of work (many to take up academic posts at other institutions). Members of the Computer Science Department's staff, members of staff within other departments and other universities, and industrial collaborators, have been involved in various initiatives. Current active members:
PhDs and RAs, many of whom are now academics at other institutions:
A selection of technical reports, reviews and notes, prepared by the research team and relating to KDD in general and association rule mining in particular, are available for local access only.
Created and maintained by Frans Coenen. Last updated 14 May 2009