MIH1005 | Data Mining |
Teaching Staff in Charge |
Lect. GOG Anca-Mirela, Ph.D., ancacs.ubbcluj.ro |
Aims |
The extensive development of data mining was driven by the wide availability of huge amounts of data, coupled with ever-increasing computational power that allows these data sets to be analyzed. Adequately explored and analyzed, these data can be turned into useful information and knowledge in different areas and for various applications: decision making, process control, production control, business management, market analysis, science exploration, information management, query processing, etc.
This course presents recent developments in the domain of knowledge discovery in databases (KDD), with a focus on an essential step in the KDD process: the data mining step. Other topics related to data mining and relevant to the KDD process are also presented: data warehouses, OLAP, and data preprocessing. The course introduces data mining concepts, methods, and techniques from a database perspective. The focus is on different data mining problems (tasks) and their corresponding solutions. Students will learn various data analysis techniques and will apply them to solve data mining problems using specialized software systems and tools. The course aims to present data mining both as a strong application field and as a significant database research domain. |
Content |
1. Introduction
   Data mining - what it is, the factors that favoured the development of this domain, data mining and the KDD (Knowledge Discovery in Databases) process
   Types of data explored in data mining
   Data mining functionalities
   Patterns and interesting patterns
   Data mining from a database perspective
2. Data warehouses and OLAP technology - overview
   What are data warehouses
   A multidimensional data model
   Data warehouse architecture
   Data warehouse implementation
   From data warehouses to data mining
3. Concept description - characterization and comparison
   Definitions
   Data generalization and summarization-based characterization
   Analytical characterization: attribute relevance analysis
   Class comparison: discriminating between classes
   Descriptive statistical measures in large databases
4. Data preprocessing
   Data cleaning
   Data transformation and integration
   Data reduction
   Discretization and concept hierarchy generation
5. Mining association rules (association analysis)
   Problem definition
   Algorithms for mining single-dimensional boolean association rules from transaction databases - Apriori, FP-Growth
   Algorithms for mining multi-level association rules, multi-dimensional association rules, association rules with constraints
   Correlation analysis
   ODM and association analysis in ODM
6. Classification and prediction
   Problem definition
   Classification using decision tree induction
   Bayes classification
   Other classification methods
   Prediction - linear regression
   Classifier accuracy
   ODM and classification in ODM
7. Clustering (cluster analysis)
   Problem definition
   Types of data in cluster analysis
   Classification of clustering methods
   Classes of clustering methods: partitioning, hierarchical, density-based, grid-based, model-based
   Outlier detection
   ODM and cluster analysis in ODM
8. Data mining standards and software - ODM, Microsoft OLE DB
9. Applications and trends in data mining
   Applications: telecommunications, financial data analysis, biological data analysis, etc.
   Data mining in statistical, audio, video databases
   Data mining, data security and privacy |
References |
1. Han, J., Kamber, M., Data Mining: Concepts and Techniques, 1st Edition, Morgan Kaufmann, 2000.
2. ODM (Oracle Data Mining) Documentation (electronic format).
3. P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.
4. P. Adriaans, D. Zantinge, Data Mining, Addison-Wesley, 1996.
5. Conference and journal papers (provided by the instructor).
6. Weka system and documentation (http://www.cs.waikato.ac.nz/ml/weka/). Weka is a suite of machine learning / data mining software. It contains Java implementations of various mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is free open-source software under the GNU General Public License (GPL). |
Assessment |
The course ends with a written exam (grade E). During the semester, students prepare and present a theoretical report (grade R) and complete several practical (lab) projects (grade P), which consist of implementing data mining algorithms (association analysis, classification, cluster analysis) and performing data analysis using specialized software tools. The final grade is a weighted mean of these three grades: Final Grade = 40%E + 25%R + 35%P. Students who show considerable research ability, through involvement in project development and publication of research results, will be granted an additional 10% on the final grade. To pass, the final grade must be at least 5. |
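As an illustration of the weighting above, here is a minimal sketch in Python. The function names `final_grade` and `passes` are hypothetical, chosen only for this example. The research bonus is deliberately left out, because the text does not specify how the additional 10% is applied.

```python
# Weights taken from the formula: Final Grade = 40%E + 25%R + 35%P.
WEIGHT_EXAM = 0.40     # written exam (E)
WEIGHT_REPORT = 0.25   # theoretical report (R)
WEIGHT_PROJECTS = 0.35 # practical lab projects (P)

PASS_THRESHOLD = 5.0   # the final grade has to be at least 5


def final_grade(exam: float, report: float, projects: float) -> float:
    """Return the weighted mean of the three component grades."""
    return (WEIGHT_EXAM * exam
            + WEIGHT_REPORT * report
            + WEIGHT_PROJECTS * projects)


def passes(exam: float, report: float, projects: float) -> bool:
    """Return True when the weighted final grade reaches the pass threshold."""
    return final_grade(exam, report, projects) >= PASS_THRESHOLD


if __name__ == "__main__":
    # Example: E = 8, R = 6, P = 9
    print(round(final_grade(8, 6, 9), 2))
    print(passes(8, 6, 9))
```

For example, with E = 8, R = 6 and P = 9 the weighted mean is 3.2 + 1.5 + 3.15 = 7.85, which is above the pass threshold of 5.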