Browsing by Author "Gao, Jie"

Item Open Access
A cooperative distributed data mining model and its application to medical data on diabetes (2004)
Gao, Jie; Denzinger, Jörg
We present CoLe, a cooperative distributed system model for mining knowledge from heterogeneous data. CoLe allows for the cooperation of different learning algorithms and the combination of the mined knowledge into knowledge structures no individual learner can produce. CoLe organizes the work in rounds so that knowledge discovered by one learner can help others in the next round. We implemented a system based on CoLe for mining diabetes data, including a genetic algorithm for learning event sequences, improvements to the PART algorithm for our problem and combination methods to produce hybrid rules containing conjunctive and sequence conditions. In our experiments, the CoLe-based system outperformed the individual learners, with better rules and more rules of a certain quality. Our improvements to the learners also proved useful. From the medical perspective, our system confirmed hypertension has a tight relation to diabetes, and it also suggested connections new to medical doctors.

Item Open Access
Agent-based cooperative heterogeneous data mining (2012)
Gao, Jie; Denzinger, Jörg
This thesis presents an agent-based cooperative data mining model named CoLe2. CoLe2 is targeted at performing data mining on large, heterogeneous data sets. It employs multiple different types of data mining algorithms, enables cooperation among these algorithms, and produces combined results in the form of rules. CoLe2 is a multi-agent system with three types of agents that have the different roles of running data mining algorithms, performing combination of mining results, and driving the entire CoLe2 system work flow with knowledge-based strategies, respectively. The system has a work flow with two levels of loops. The outer loop performs data selection, mining algorithm selection and expectation adjustment strategies.
The inner loop performs data mining execution and result combination, with additional knowledge-based strategies implemented in the agents. The agents exchange useful information during the running of the work flow to help each other. A prototype system of the CoLe2 model is described. This prototype contains four different data mining algorithms (a classification algorithm, a sequence mining algorithm, an association rules mining algorithm and a descriptive mining algorithm), two combination strategies and instantiations of the knowledge-based strategies. The strategy instantiations include data selection based on a clustering algorithm, an asynchronous work flow for better turnaround time, relevance factor calculation, fuzzy condition matching, prediction histogram based rule similarity and rule grouping. Experiments have been performed with two data sets: a medium-sized data set of billing data from Calgary Health Region, and a large data set from the Alberta Kidney Disease Network. The experimental results show advantages of CoLe2 over individual data mining algorithms in terms of efficiency and result quality, as well as advantages over the CoLe model with only one level of work flow. Specialized experiments also prove the effectiveness of individual knowledge-based strategies.

Item Open Access
CoLe: A Cooperative Distributed Data Mining Model (2005-03-08)
Gao, Jie; Denzinger, Jorg; James, Robert C.
We present CoLe, a cooperative, distributed model for mining knowledge from heterogeneous data. CoLe allows for the cooperation of different learning algorithms and the combination of the mined knowledge into knowledge structures that no individual learner can produce. CoLe organizes the work in rounds so that knowledge discovered by one learner can help others in the next round.
We implemented a CoLe-based system for mining diabetes data, including a genetic algorithm for learning event sequences, improvements to the PART algorithm for our problem and combination methods to produce hybrid rules containing conjunctive and sequence conditions. In our experiments, the CoLe-based system outperformed the individual learners, with better rules and more rules of a certain quality. Our improvements to learners also showed the ability to find useful rules. From the medical perspective, our system confirmed hypertension has a tight relation to diabetes, and it also suggested connections new to medical doctors.

Item Open Access
Using Learning of Behavior Rules to Mine Medical Data for Sequence Rules (2004-03-02)
Denzinger, Jorg; Gao, Jie
In fields like medical care the temporal relations in the records (transactions) are of great help for identifying a particular group of cases. Thus there is some need for sequence rule learning in the classification problems in these fields. In this paper, a genetic algorithm for sequence rule learning is presented based on concepts from learning behavior of agents. The algorithm employs a Michigan-like approach to evolve a group of sequence rules, and extracts good ones into the result sequence rule set from time to time. It contains a novel quality-based intelligent genetic operator, and many adaptive enhancements to make implicit use of data-set-specific knowledge. The algorithm is evaluated on a real-world medical data set from the PKDD 99 Challenge.
The results indicate that the algorithm can get satisfactory sequence rule sets from the sparse and noisy data set.

Item Open Access
Utility of Knowledge Discovered from Sanitized Data (2008-09-30)
Sramka, Michal; Safavi-Naini, Reihaneh; Denzinger, Jorg; Askari, Mina; Gao, Jie
While much attention has been paid to data sanitization methods with the aim of protecting users’ privacy, far less emphasis has been placed on the usefulness of the sanitized data from the viewpoint of knowledge discovery systems. We consider this question and ask whether sanitized data can be used to obtain knowledge that is not defined at the time of the sanitization. We propose a utility function for knowledge discovery algorithms, which quantifies the value of the knowledge from the perspective of users of the knowledge. We then use this utility function to evaluate the usefulness of the extracted knowledge when knowledge building is performed over the original data, and compare it to the case when knowledge building is performed over the sanitized data. Our experiments use an existing cooperative learning model of knowledge discovery and medical data, anonymized and perturbed using two widely known sanitization techniques, called ε-differential privacy and k-anonymity. Our experimental results show that although the utility of sanitized data can be drastically reduced and in some cases completely lost, there are cases where the utility can be preserved. This confirms our strategy to look at triples consisting of a utility function, a sanitization mechanism, and a knowledge discovery algorithm that are useful in practice. We categorize a few instances of such triples based on usefulness obtained from experiments over a single database of medical records. We discuss our results and show directions for future work.