Browsing by Author "Alhajj, Reda"
Now showing 1 - 20 of 79
Item Open Access A bounded and adaptive memory based approach to mine frequent patterns from huge databases (2011) Adnan, Muhaimenul; Alhajj, Reda
Item Open Access A framework for effective web log mining and online prediction (2010) Guerbas, Abdelghani; Alhajj, Reda
Item Open Access A Graph Based Approach for Making Recommendations Based on Multiple Data Sources (2015-05-27) Dhaliwal, Sukhpreet; Alhajj, Reda; Rokne, Jon
A recommendation system is an information filtering system that predicts customer preferences. Customer preferences are extracted by analyzing the behaviour patterns of customers from multiple data sources. Graph-based models play an important role in recommendation systems to extract the customer preferences from multiple data sources. However, graph-based models have rarely been used in traditional recommendation systems. The main objective of this thesis is to use a graph-based recommender system that uses multiple data sources. A graph-based hybrid recommender model is developed to integrate content-based, collaborative filtering and association rule mining techniques. Moreover, the PageRank algorithm is used to produce a ranked list of recommendations. Our analysis on a retail store dataset shows the impact of using multiple data sources on the accuracy of a recommender system while handling the sparsity problem. Using demographic information about customers remedies the cold start problem. Grouping the products by product type produced better results and also showed the impact of using different levels of the product taxonomy. Additionally, combining content-based, collaborative filtering and association rule mining also improved the results.
Moreover, indirect connections improve the coverage of our recommender system.
Item Open Access A Heuristic Stock Portfolio Optimization Approach Based on Data Mining Techniques (2013-03-11) Koochakzadeh, Negar; Alhajj, Reda
Portfolio optimization is the process of making investment decisions on holding a set of financial assets to meet various criteria. The variety of investment assets around the world makes this multi-faceted decision problem very complicated. Econometric and statistical models as well as machine learning and data mining techniques have been used by many researchers and analysts to propose heuristic solutions for portfolio optimization. However, a literature review shows that the existing models are still not practical, as they do not always perform better than even the naïve strategy of investing in all available assets in the market. The methodology proposed in this thesis is an alternative heuristic solution to help investors make stock investment decisions through a semi-automated process. The proposed solution is based on the fact that the investment decision cannot be fully automated, because investors’ preferences, which are the key factors in making investment decisions, vary among different people. For this purpose, a semi-automated framework called SMPOpt (Stock Market Portfolio Optimizer) has been designed and implemented. In the proposed framework, the goal is to learn from the historical fundamental analysis of companies to discover the optimum portfolio by considering investors’ preferences. The portfolio optimization problem is formulated and broken down into steps so that data mining techniques such as clustering, ranking, and social network analysis can be applied. Some of these techniques are customized based on the temporal behaviour of financial datasets. For instance, the ranking algorithm based on Support Vector Machine (SVMRank) is modified and a new algorithm called Time-Series SVMRank is proposed.
A comprehensive experimental study has been conducted using real stock exchange market datasets from recent decades to evaluate the proposed portfolio optimization solution. The obtained results confirmed the strength of the proposed methodology.
Item Open Access A self-organizing multi-agent system for adaptive continuous unsupervised learning in complex uncertain environments (2008) Kiselev, Igor; Alhajj, Reda
Item Open Access Alternative approaches for producing and ranking alternative clustering (2006) Ozyer, Tansel; Alhajj, Reda
Item Open Access Biological Network-based Approaches to The Functional Analysis of miRNAs in Prostate Cancer (2013-09-25) Alshalalfa, Mohammed; Alhajj, Reda
The cell is a highly organized system of interacting molecules that operate together in a complex and efficient manner to achieve biological functions and cellular phenotypes. Traditional biology studied these components one at a time, yielding limited insights about the way a cell functions. It is now apparent that the most effective way to understand how the cell works is to unravel how these different components work together. Recent advances in biological research have led to an explosive growth in scientific data to study and characterize the function of the different components of the cell. Thanks to high-throughput techniques, biological information of rapidly growing size and variety is being generated to reveal the internal complexity of cells. This has led to a rapid increase in the number of computational techniques developed to mine the data and reveal functional understandings of the cell. Deciphering the molecular interactions among the cellular molecules gives a more comprehensive view of the cellular function, and integrating heterogeneous interaction networks and expression data reveals a system-level understanding of the cell behavior.
This thesis focuses on integrating multiple heterogeneous biological networks, in particular protein networks and miRNA-target interactions, to facilitate miRNA research. This integration layer between miRNAs and protein networks helps to study the propagation of miRNAs’ influence through the biological networks of the targets. This thesis provides a thorough review of the cross-talk between miRNAs and biological networks, particularly protein networks. The role of miRNAs as part of the cellular system and their influence on functional protein modules is characterized in prostate cancer progression. In this thesis, different approaches are proposed to analyze the integration layer and provide potential applications to the genomic studies of miRNAs. The first approach predicts miRNAs with high influence on protein networks and assesses their prognostic significance. The second approach predicts protein complexes that are influenced by miRNAs during prostate cancer progression. The third approach characterizes the modulation effect of genes that encode protein partners of the protein encoded by miRNA targets. The fourth approach uses protein networks to identify miRNAs enriched in gene lists. The proposed methods reveal that integrating miRNA-target and protein networks provides a new layer of biological information that helps reveal miRNA-target modules with potential function and uncover principles governing miRNA-mediated regulation of targets in biological networks. The results suggest that the proposed methods are promising for revealing miRNA-mediated regulation, in the context of protein networks, involved in prostate cancer progression. This thesis shows that integrating protein networks and miRNA-target networks is a valuable source of knowledge that helps researchers understand how miRNAs exert their function on the cellular system.
This facilitates miRNA genomic research to identify miRNAs with strong influence on the proteins regulating the cell function, and thus gain better characterization of their role in disease progression and possible utility for therapeutic purposes.
Item Open Access BreCaHAD: a dataset for breast cancer histopathological annotation and diagnosis (2019-02-12) Aksac, Alper; Demetrick, Douglas J; Ozyer, Tansel; Alhajj, Reda
Abstract. Objectives: Histopathological tissue analysis by a pathologist determines the diagnosis and prognosis of most tumors, such as breast cancer. To estimate the aggressiveness of cancer, a pathologist evaluates the microscopic appearance of a biopsied tissue sample based on morphological features which have been correlated with patient outcome. Data description: This paper introduces a dataset of 162 breast cancer histopathology images, namely the breast cancer histopathological annotation and diagnosis dataset (BreCaHAD), which allows researchers to optimize and evaluate the usefulness of their proposed methods. The dataset includes various malignant cases. The task associated with this dataset is to automatically classify histological structures in these hematoxylin and eosin (H&E) stained images into six classes, namely mitosis, apoptosis, tumor nuclei, non-tumor nuclei, tubule, and non-tubule.
By providing this dataset to the biomedical imaging community, we hope to encourage researchers in computer vision, machine learning and medical fields to contribute and develop methods/tools for automatic detection and diagnosis of cancerous regions in breast cancer histology images.
Item Open Access CACTUS: cancer image annotating, calibrating, testing, understanding and sharing in breast cancer histopathology (2020-01-06) Aksac, Alper; Ozyer, Tansel; Demetrick, Douglas J; Alhajj, Reda
Abstract. Objective: Develop CACTUS (cancer image annotating, calibrating, testing, understanding and sharing) as a novel web application for image archiving, annotation, grading, distribution, networking and evaluation. This helps pathologists to avoid unintended mistakes, which supports quality assurance, teaching and evaluation in anatomical pathology. The effectiveness of the tool has been demonstrated by assessing pathologists’ performance in the grading of breast carcinoma and by comparing inter/intra-observer assessment of grading criteria amongst pathologists reviewing digital breast cancer images. Reproducibility has been assessed by inter-observer (kappa statistics) and intra-observer (intraclass correlation coefficient) concordance rates. Results: CACTUS has been evaluated using a surgical pathology application: the assessment of breast cancer grade. We used CACTUS to present standardized images to four pathologists of differing experience. They were asked to evaluate all images to determine their assessment of the Nottingham grade of a series of breast carcinoma cases. For each image, they were asked for their overall grade impression. CACTUS helps and guides pathologists to improve disease diagnosis with higher confidence and thereby reduces their workload and bias.
CACTUS can be useful both for disseminating anatomical pathology images for teaching and for evaluating agreement amongst pathologists or against a gold standard for evaluation or quality assurance.
Item Open Access Cancer biomarker extraction from gene expression microarray data (2008) Alshalalfa, Mohammed; Alhajj, Reda
Bioinformatics is a new field of science mainly integrating computer science, mathematics, statistics and biology, where the aim is to discover knowledge hidden within biological data. One of the most widely investigated types of biological data is gene expression microarray data. Thanks to microarray technology, which can accommodate the whole genome, the global gene expression patterns in different tissues/samples can be profiled in a few days, unlike traditional methods, which may take months. However, analyzing microarray data is challenging, as the number of features (genes) is very large relative to the number of samples. Fortunately, microarrays have been successfully used to study gene expression data; this has allowed researchers to investigate different diseases, including cancer. In other words, using microarrays in cancer diagnosis has been shown to be very efficient and reliable, but the large number of genes makes the data noisy and difficult to deal with. Consequently, identifying relevant genes has received considerable attention. In this thesis, we combine biological knowledge with machine learning techniques to propose three methods for extracting the most informative genes for cancer classification. The first method is based on double clustering; we filter the data initially with a statistical test and then cluster the data iteratively to get the best number of clusters. The genes closest to the centroids of the resulting clusters were shown to have high potential to be significant features for sample classification. These genes (one per centroid) are used as input for building a classification model.
The second method is based on an iterative t-test applied in a way that eliminates noise from the data. The third method is a hybrid approach which combines statistical tests with entropy-based tests. This method uses the t-test and Singular Value Decomposition (SVD) based entropy. It proved effective, as it considers the feature itself and its effect on the data entropy. This approach is the first to combine entropy and statistical significance for gene ranking. We have also developed an SVD-based gene extraction method for multi-class data; it is only introduced at a high level in this thesis, and details are left as future work. The reported test results demonstrate the applicability and effectiveness of the three proposed approaches.
Index Terms: Classification, clustering, t-test, singular value decomposition, support vector machine, microarray data, gene expression data, over-expression, under-expression.
Item Open Access CARSVM: classification by integrating class association rules and support vector machine (2006) Keanmehr, Keivan; Alhajj, Reda
Item Open Access ClusTex: using clustering techniques for information extraction from HTML pages containing semi-structured data (2005) Ashraf, Fatima; Alhajj, Reda
Item Open Access Cocalerex: an engine for converting catalog-based and legacy relational databases into XML (2004) Wang, Chunyan; Alhajj, Reda
Item Open Access Community aware personalized search for the web of data and services (2011) Shafiq, M. Omair; Alhajj, Reda
Item Open Access Community Structure, Inference and Network-Based Markers (2014-07-10) Gao, Shang; Alhajj, Reda
At the core of systems biology is the belief that molecules within the cell act collaboratively in an organized manner. Researchers study these interactions and mainly concentrate on identifying malfunctioning molecules as potential disease biomarkers.
Thus, networks have become an important means to represent biological systems, and network approaches have shown substantial promise due to the simplicity of data representation and the associated rich analytical apparatus. Generally speaking, the workflow of a computational systems biology study involves: (1) investigating certain elements of biological networks and their interactions, depending on the purpose of the study; and (2) collecting experimental high-throughput and genome-wide data and integrating computational methods to analyze the data and validate findings. In this thesis, we frame the investigations by first asking a systems biology question and then providing computational means to answer it. The thesis consists of three major interrelated components, as the title suggests. We first study the network structure through a novel strategy of bridging social and biological networks, based on our argument that there exists a strong analogy between humans and molecules. As social network analysis is gaining popularity in modeling real-world problems, applying social network concepts and notions to biological data is still one of the most attractive research problems to be addressed. We design computational means to find community structures and design efficient algorithms to dynamically analyze gene boundaries using geometric convexity. Our approach contributes to the new branch of applying social network mechanisms in biological data analysis, leading to new data mining strategies implied by witnessing social behaviors in gene expression analysis. Further into the topology study of biological networks, we investigate the relationship between the multi-scalability of community structures of metabolic networks and the distributional effect of network motifs, i.e., the inference problem.
We observe several patterns through studying three organisms, including the effect of directionality of networks, homogeneity of motif-enriched communities, and motif type-specific distributions across scales. We also provide methods to quantify motif influence under the community context. Overall, our work suggests that the theoretic evolvability of modularity tightly correlates with motif distributional effect and vice versa. In this regard, we design computational tools to analyze community structure of very large networks of arbitrary types. The Multi-scale Community Finder (MCF) is the first tool in this area. Finally we arrive at the question of how to design efficient bio-markers for complex diseases, e.g., cancer. First, it is important to understand the complexity of cancer. We believe that to understand individualized gene behavior across patients, relational status of genes needs to be considered because complex disease phenotype is often caused by cascaded failures of genetic interactions in cancer cells. We implement a framework to quantify the molecular heterogeneity of tumors from gene-gene relational perspective using co-expression networks and interactome data. Next, we present a method to reverse engineer integrative gene networks. The main advantage of our method is the integration of different quantitative and qualitative data sets in order to reconstruct a multiplex network, without necessarily imposing data constraints, such as each genomic datum needs to have the same number of entities. Another advantage of our method is that from the integrated networks, predictions can be made by propagating beliefs from seed nodes representing known knowledge. Thus, we combine data integration and network-based prediction into a single framework. We demonstrate our method through case studies using breast cancer data. Our approaches present promising results and new ways of thinking and mining complex genomic datasets. 
Overall, this thesis presents a comprehensive study of biological networks and the novel application of computational means to address the biomarker detection problem in the era of big genomic data. Finally, it is important to highlight that our study considers the challenges due to data heterogeneity and the diversity of the sources producing the data.
Item Open Access Coordinate MicroRNA-Mediated Regulation of Protein Complexes in Prostate Cancer (Public Library of Science, 2013-12-31) Alshalalfa, Mohammed; Bader, Gary D.; Bismar, Tarek A.; Alhajj, Reda
Item Open Access Correction to: Realizing drug repositioning by adapting a recommendation system to handle the process (2018-07-02) Ozsoy, Makbule G; Özyer, Tansel; Polat, Faruk; Alhajj, Reda
Following publication of the original article [1], the authors reported that there was an error in the spelling of the name of one of the authors.
Item Open Access COVID-19 pandemic spread against countries’ non-pharmaceutical interventions responses: a data-mining driven comparative study (2021-09-01) Xylogiannopoulos, Konstantinos F.; Karampelas, Panagiotis; Alhajj, Reda
Abstract. Background: The first half of 2020 was marked by the COVID-19 pandemic, which affected almost every aspect of daily life worldwide, from the societal to the economic. To prevent the spread of COVID-19, countries implemented diverse policies regarding Non-Pharmaceutical Intervention (NPI) measures. This is because in the first stage countries had limited knowledge about the virus and its contagiousness. Also, there was no effective medication or vaccine. This paper studies the effectiveness of the implemented policies and measures against the deaths attributed to the virus between January and May 2020. Methods: Data from the European Centre for Disease Prevention and Control regarding the identified cases and deaths of COVID-19 from 48 countries have been used.
Additionally, data concerning the NPI-related policies implemented by the 48 countries and the capacity of their health care systems were collected manually from their national gazettes and official institutes. Data mining, time series analysis, pattern detection, machine learning, clustering methods and visual analytics techniques have been applied to analyze the collected data and discover possible relationships between the implemented NPIs and COVID-19 spread and mortality. Further, we recorded and analyzed the responses of the countries to the COVID-19 pandemic, mainly in urban areas, which are over-populated and where COVID-19 therefore has the potential to spread more easily. Results: The data mining and clustering analysis of the collected data showed that implementing the NPI measures before the first death seems to be very effective in controlling the spread of the disease. In other words, delaying the implementation of the NPI measures until after the first death has little practical effect on limiting the spread of the disease. The success of the NPI measures further depends on the way each government monitored their application. Countries with stricter policing of the measures seem to be more effective in controlling the transmission of the disease. Conclusions: The comparative data mining study provides insights regarding the correlation between the early implementation of the NPI measures and controlling COVID-19 contagiousness and mortality. We report a number of observations that could be very helpful to decision makers or epidemiologists regarding the rapid implementation and monitoring of NPI measures in case of a future wave of COVID-19 or when dealing with other unknown infectious pandemics. Nevertheless, after the first wave of COVID-19, most countries decided to lift the restrictions and return to normal.
This has resulted in a severe second wave in some countries, a situation which requires re-evaluating the whole process and drawing lessons for the future.
Item Open Access Data driven network construction and analysis extending the functionality of netdriller (2012) Sarraf Shirazi, Atieh; Alhajj, Reda
Social network analysis emerged as a technique in sociology. However, it has become increasingly interesting to researchers in other fields. The flexibility and scalability supported by new technology have encouraged the extension of social network technology to handle new applications. A social network is defined as a set of nodes and a set of links connecting them. Social network analysis is the task of analyzing a social network with the purpose of gaining information about the network, such as patterns of connection or important nodes. However, there are many applications where only raw data is available. Usually, the data sets contain data objects with their sets of features. In this work, we propose an approach to construct a social network from a raw data set. The approach is based on the assumption that if two objects are similar, there is a higher probability that they are placed in the same cluster in different clustering solutions. Based on this assumption, we use a multi-objective genetic algorithm approach to find different solutions for partitioning the data objects. The actors of the final social network are the data objects, and the weight of the link between two objects is the ratio of partitioning solutions in which the objects are placed in the same cluster. This work is implemented in NetDriller, a powerful social network analysis tool developed by the Data Mining group at the University of Calgary.
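The link definition in the abstract above (the ratio of partitioning solutions in which two objects share a cluster) can be illustrated with a minimal Python sketch. It assumes the clustering solutions are already available as lists of cluster labels; the function name `co_association_network` and the example partitions are hypothetical, and the multi-objective genetic algorithm that the thesis uses to generate the solutions is not reproduced here.

```python
from itertools import combinations


def co_association_network(partitions):
    """Build weighted links from several clustering solutions.

    partitions: list of label lists; partitions[k][i] is the cluster label
    given to object i by solution k. Returns {(i, j): weight}, where weight
    is the fraction of solutions that place objects i and j together.
    """
    if not partitions:
        return {}
    n = len(partitions[0])
    if any(len(labels) != n for labels in partitions):
        raise ValueError("every partition must label the same objects")
    edges = {}
    for i, j in combinations(range(n), 2):
        together = sum(1 for labels in partitions if labels[i] == labels[j])
        if together:  # objects never clustered together get no link
            edges[(i, j)] = together / len(partitions)
    return edges


# Hypothetical input: three clustering solutions over four data objects.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
network = co_association_network(partitions)
```

In this toy input, objects 0 and 1 share a cluster in two of the three solutions, so their link weight is 2/3, while objects 0 and 3 are never grouped together and receive no link.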
We show the validity of our approach by evaluating both the intermediate clustering results and the constructed social network in a case study on the stock market.
Item Open Access Data Structures, Algorithms and Applications for Big Data Analytics: Single, Multiple and All Repeated Patterns Detection in Discrete Sequences (2017) Xylogiannopoulos, Konstantinos; Alhajj, Reda; Rokne, Jon; Pardalos, Panayote; Kawash, Jalal; Helaoui, Mohamed
The research in this thesis focuses on the detection of single, multiple and all repeated patterns in sequences. Many algorithms exist for single pattern detection that take an input argument (i.e., the pattern to be detected) and produce as outcome the position(s) where the pattern exists. However, to the best of my knowledge, there is nothing in the literature related to all repeated patterns detection, i.e., the detection of every pattern that occurs at least twice in one or more sequences. This is a very important problem in science because the outcome can be used for various practical applications, e.g., forecasting purposes in weather analysis or finance by detecting patterns having periodicity. The main problem of detecting all repeated patterns is that all data structures used in computer science are incapable of scaling well for such purposes due to their space and time complexity. To analyze sequences of megabytes in size, the space required to construct the data structure and execute the algorithm can be of terabyte magnitude. In order to overcome such problems, my research has focused on simultaneous optimization of space and time complexity by introducing a new data structure (LERP-RSA), while the mathematical foundation that guarantees its correctness and validity has also been built and proved. A unique, innovative algorithm (ARPaD), which takes advantage of the exceptional characteristics of the introduced data structure and allows big data mining with space and time optimization, has also been created.
Additionally, algorithms for single (SPaD) and multiple (MPaD) pattern detection have been created, based on the LERP-RSA, which outperform any other known algorithm for pattern detection in terms of efficiency and usage of minimal resources. The combination of the innovative data structure and algorithm permits the analysis of any sequence of enormous size, greater than a trillion characters, in realistic time using conventional hardware. Moreover, several methodologies and applications have been developed to provide solutions for many important problems in diverse scientific and commercial fields such as Finance, Event and Time Series, Bioinformatics, Marketing, Business, Clickstream Analysis, Data stream Analysis, Image Analysis, Network Security and Mathematics.
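To make the "all repeated patterns" problem from the abstract above concrete, here is a deliberately naive Python baseline that reports every substring occurring at least twice in a single sequence. It is not LERP-RSA or ARPaD, whose details are not given in this listing; the function name `repeated_patterns` and the sample string are hypothetical. The sketch enumerates substrings by length, so its memory use grows quickly with sequence size, which is the kind of scaling problem the thesis says it addresses.

```python
from collections import defaultdict


def repeated_patterns(sequence, max_length=None):
    """Map every substring occurring at least twice to its start positions.

    Naive baseline: substrings are grouped by length, and overlapping
    occurrences are counted.
    """
    n = len(sequence)
    if max_length is None:
        max_length = n - 1
    found = {}
    for length in range(1, max_length + 1):
        positions = defaultdict(list)
        for start in range(n - length + 1):
            positions[sequence[start:start + length]].append(start)
        repeats = {p: pos for p, pos in positions.items() if len(pos) >= 2}
        if not repeats:
            # A repeated pattern of length L + 1 has a repeated prefix of
            # length L, so no longer pattern can repeat either.
            break
        found.update(repeats)
    return found


result = repeated_patterns("abcabca")
```

For the sample string, the longest repeated pattern is `abca`, found at positions 0 and 3 (the two occurrences overlap).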