Topic outline
Module Overview
Module Lecturer
Hello everyone, my name is Dr. Murugananthan (Dr. Muru for short!) and I am the lecturer responsible for Data Management. Please feel free to contact me via MS Teams or email. I am also available for one-to-one consultation (please refer to the iConsult system). Should you have any queries, please reach out. Best of luck with the module!
Module Synopsis
This module gives the learner an overview of the importance of data in the growing field of data science and analytics. The learner will study established methods and technologies and also investigate new and emerging ones. Emphasis is placed on data mining models in the context of organizational data, the various data types and exploratory data analysis, data preprocessing measures and techniques, data warehousing, and data governance. This module corresponds to CT051-3-M-DM; please refer to the non-ODL MD if any changes to the module are needed.
Course Learning Outcomes
CLO1: Evaluate the various data types, data storage systems and associated techniques for indexing and retrieving data. (C6, PLO7)
CLO2: Design feature engineering techniques to transform transactional data into meaningful inputs in order to create a predictive model. (A5, PLO6)
CLO3: Propose a suitable approach to designing a data warehouse to store and process large datasets. (A3, PLO5)
Module Introduction
Welcome to our first class. We will discuss the following matters.
• Module overview
• Assessment requirements
• Teaching strategies
Please raise any queries about how the module will be covered as well as the nature of the assessments.
Organisation Data Preparation
I hope you are all excited to get started with our first topic. The learning outcomes for this topic are as follows:
• List and define various sources of data
• Explain the fundamental differences between databases, data warehouses, and datasets
• Explain some of the ethical dilemmas associated with data mining and outline possible solutions
Explain the pros and cons of using regression in supervised data mining.
Data Types
The learning outcomes for this topic are as follows:
• Define what a dataset is
• Explain the different types of variables
• Describe six basic ways to identify variables
Discuss the key differences between moderating variables and mediating variables with examples.
Data Preprocessing - Part 1 and 2
I hope you are enjoying the material. Data preprocessing is covered in two parts. The first part addresses the need for data preparation, discusses the multidimensional view of data quality, and explains the major tasks in data preprocessing, especially data cleaning. In the second part, we will delve into data integration and data transformation.
Explain, with an example, the impact of mean normalization. Discuss the consequences for data analysis if we choose not to apply mean normalization.
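As a starting point for the mean normalization discussion, the following is a minimal Python sketch. It assumes the common definition x' = (x - mean) / (max - min); the feature names and values are hypothetical and are not taken from the module materials.

```python
def mean_normalize(values):
    """Rescale values with x' = (x - mean) / (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # All values are identical, so there is nothing to rescale.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


# Hypothetical features measured on very different scales.
incomes = [30000, 45000, 60000, 90000]
ages = [22, 35, 41, 58]

norm_incomes = mean_normalize(incomes)
norm_ages = mean_normalize(ages)

# After normalization both features are centred on 0 and span a range of 1.
print(norm_incomes)
print(norm_ages)
```

Without this step, a feature such as income would dominate any distance-based or gradient-based analysis simply because its raw values are thousands of times larger than those of age.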
Exploratory Data Analysis
Now that we have discussed the preprocessing of data, we are ready for data analysis. There are two components to our approach: descriptive statistics and graphical illustrations. For descriptive statistics, we will explore data analysis on categorical data and continuous data. For graphical illustrations, we will discuss ways to represent categorical data and continuous data graphically.
- Provide an example of categorical data where using a histogram may not be the best approach to explore the data.
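The split between categorical and continuous data can be sketched with Python's standard library. This is only an illustration under assumed data: the payment methods and order values below are hypothetical. Categorical values are summarised by frequency counts, which is what a bar chart displays, while continuous values are summarised by measures such as the mean, median and standard deviation.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical categorical variable: no numeric order, so no meaningful bins.
payment_methods = ["card", "cash", "card", "e-wallet", "card", "cash"]

# Hypothetical continuous variable: numeric, so binning and averaging make sense.
order_values = [12.5, 40.0, 18.0, 22.5, 95.0, 16.0]

# Categorical data: a frequency table of how often each category occurs.
category_counts = Counter(payment_methods)
most_common_method, most_common_count = category_counts.most_common(1)[0]

# Continuous data: measures of centre and spread.
summary = {
    "mean": mean(order_values),
    "median": median(order_values),
    "stdev": stdev(order_values),
}

print(category_counts)
print(summary)
```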
Data Warehouse
I hope you are enjoying the material as much as I do. For this topic on data warehousing, we will cover the following:
1. Nature of data warehouse and OLAP concepts
2. Properties of a data warehouse architecture and schemas
3. Concept of OLAP
Describe a practical scenario where an enterprise warehouse approach is more suitable than a data mart approach for a warehouse model.
Hadoop
With data warehousing under our belt, we explore Hadoop as a means to process big data at reasonable cost and in reasonable time. In this topic, we will discuss the following sub-topics:
- Hadoop Framework
- Hadoop’s Architecture
- Hadoop in the Wild
- Data warehouse to Hadoop
- Discuss the pros and cons of using Hadoop in practical scenarios.
Hive
Hi, we continue to explore the Hadoop environment in this topic. The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that support the Hadoop modules. We will study in depth the use of Hive to write SQL-style scripts that perform operations on data.
- Discuss how we may mitigate the cons of using Hive to analyse big data.
Data Security and Governance
Congratulations on coming this far. In our last topic, we are interested in the following questions:
- How do we define Data Governance and its relationship to IT Governance?
- What are some of the key pillars of a Data Governance Program?
- What challenges does a Data Governance Program face early on?
- How can Data Governance and Internal Audit collaborate or leverage each other?
Explain, with some practical scenarios, the consequences of not executing data governance with due diligence.