CS447 Introduction to Data Science

Antalya Science University

Course Name: Introduction to Data Science

Course Code: CS 447

Language of Course: English

Credit: 3

Course Coordinator / Instructor: Şadi Evren ŞEKER

Contact: intrds@sadievrenseker.com

Schedule: Tuesday 15.00 – 18.00

Course Description: This is an introductory course in data science, with a focus on machine learning, artificial intelligence and big data.

The course takes a top-down approach to data science projects. We start with data science project management techniques and follow the CRISP-DM methodology, covering the steps below:

Business Understanding: We cover the types of problems and business processes found in real life.

Data Understanding: We cover data types and data problems. We also visualize data in order to explore it.

Data Preprocessing: We cover the classical problems with data and how to handle noisy or dirty data and missing values. We cover row and column filtering, and data integration with concatenation and joins. We also cover data transformations such as discretization, normalization and pivoting.

Machine Learning: We cover classification algorithms such as Naive Bayes, Decision Trees, Logistic Regression and K-NN. We also cover prediction / regression algorithms such as linear regression, polynomial regression and decision tree regression. We also cover unsupervised learning problems: clustering with k-means or hierarchical clustering, and association rule learning with the Apriori algorithm. Finally, we cover ensemble techniques in KNIME and Python on big data platforms.

Evaluation: In the final step of data science, we study the metrics of success: Confusion Matrix, Precision, Recall, Sensitivity and Specificity for classification; purity and Rand index for clustering; and RMSE, RMAE, MSE and MAE for regression / prediction problems, with KNIME and Python on big data platforms.

Course Objective and Learning Outcomes:

1. Understanding of real life cases about data

2. Understanding of real life data related problems

3. Understanding of data analysis methodologies

4. Understanding of some basic data operations such as preprocessing, transformation and manipulation

5. Understanding of new technologies such as big data, NoSQL and cloud computing

6. Ability to use some trending software in the industry

7. Introduction to data related problems and their applications

Tools:

List of course software:

· Excel,

· KNIME,

· Python programming with NumPy, pandas, scikit-learn, statsmodels or Dask

This course is hands-on in all of its steps, so attending with a laptop computer is necessary. The software listed above will be provided during the course, and the list is subject to updates.

Grading

Reading, Attendance and Discussions: 30%

Homework: 30%

Project: 40%

Course Content:

Week 1 (Feb 19): Introduction to Data, Problems and Real World Examples
Some useful information:
DIKW Pyramid: DIKW pyramid – Wikipedia
CRISP-DM: Cross-industry standard process for data mining – Wikipedia
Slides from first week: week1

Week 2 (Feb 26): Introduction to Descriptive Analytics
Repeating the first week for the majority of the class and starting the concept of end-to-end data science projects. Weight and Height sample project and data set for the KNIME workflow. Brief introduction to algorithms: K-NN, Naive Bayes, Decision Trees, Linear Regression

Week 3 (Mar 5): Introduction to Data Manipulation
Concept of data and types of data: Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio).
Basic data manipulation techniques with KNIME:
1. Row Filter and Concept of Missing Values
2. Column Filter
3. Advanced Filters
4. Concatenate
5. Join
6. Group By, Aggregation
7. Formulas, String Replace
8. String Manipulation
9. Discrete, Quantized Data, Binning
10. Normalization
11. Splitting and Merging
12. Type Conversion (Numeric, String)

Week 4 (Mar 12): Introduction to Python Programming for Data Science and an end-to-end Python application for data science
Brief review of Python programming
Introduction to data manipulation libraries: NumPy and pandas
Introduction to the scikit-learn library and a sample classification

You can install Anaconda and Spyder from the link below:

We also covered the following topics during the class:

Data loading from an external source using the pandas library (with the read_excel or read_csv functions)
DataFrame slicing and dicing (using the iloc indexer with slices and lists of positions)
Column filtering (with copying into a new data frame)
Row filtering (with copying into a new data frame)
Advanced row filtering (such as selecting the people whose height is an even number)
Column-wise or row-wise formulas (we calculated the BMI for everybody)
Quantization (discretization or binning), where we applied condition-based binning
Min-max normalization (we used MinMaxScaler from the scikit-learn library)
Group-by operation (we used the groupby method from the pandas library)
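
The operations listed above can be sketched in a few lines of pandas. This is only an illustrative sketch, not the code from the class: the five-person data frame, the column names and the BMI cut-offs (18.5 and 25) are assumptions made up for the example, and the data is built in memory instead of being loaded with read_csv or read_excel.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sample data (an assumption); in class the data came from a file.
df = pd.DataFrame({
    "name": ["Ali", "Ayse", "Can", "Deniz", "Ece"],
    "gender": ["m", "f", "m", "f", "f"],
    "height": [180, 165, 173, 158, 170],        # centimetres
    "weight": [82.0, 54.0, 95.0, 50.0, 61.0],   # kilograms
})

# Slicing with iloc: first three rows, columns at positions 2 and 3.
subset = df.iloc[0:3, [2, 3]]

# Column filtering, copied into a new data frame.
body = df[["height", "weight"]].copy()

# Row filtering: people taller than 165 cm.
tall = df[df["height"] > 165].copy()

# Advanced row filtering: people whose height is an even number.
even_height = df[df["height"] % 2 == 0].copy()

# Column-wise formula: BMI = weight (kg) / height (m) squared.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Condition-based binning of the BMI values (cut-offs are illustrative).
def bmi_class(bmi):
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    return "overweight"

df["bmi_class"] = df["bmi"].apply(bmi_class)

# Min-max normalization: rescale each column to the range [0, 1].
scaled = MinMaxScaler().fit_transform(df[["height", "weight"]])
df["height_norm"] = scaled[:, 0]
df["weight_norm"] = scaled[:, 1]

# Group by: mean height and weight per gender.
mean_by_gender = df.groupby("gender")[["height", "weight"]].mean()

print(df)
print(mean_by_gender)
```

Note that each filter is copied with .copy(), so later changes to the filtered frame do not touch the original data frame.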

Click here to download the codes from the class

For further information, I strongly suggest reading the documentation below:

Pandas Library : https://pandas.pydata.org/pandas-docs/stable/
Numpy Library : http://www.numpy.org
SK Learn Library : https://scikit-learn.org/stable/
Pandas Data Frame (This is the main topic we have covered this week): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

Week 5 (Mar 19): Classification Algorithms
Concepts of classification algorithms, implementing the algorithms in KNIME and coding them in Python. Algorithms covered are:
K-NN
Naive Bayes
Decision Tree
Logistic Regression
Support Vector Machines
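
As a rough sketch of how the five algorithms above can be compared in scikit-learn, the example below trains each of them on the same split and reports the test accuracy. It is not the code from the class: the Iris data set, the split ratio and the hyperparameters (for example n_neighbors=5) are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Iris ships with scikit-learn, so no download is needed (data set is an assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(kernel="rbf"),
}

# Train every model on the same data and measure accuracy on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_s))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scaling matters mostly for the distance-based and margin-based models (K-NN, SVM); the decision tree is not affected by it.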

2nd Python Code of the course for the classifications

Knime Workflow for the classification algorithms

Week 6 (Mar 26): Regression Algorithms
Concepts of prediction algorithms, implementing the algorithms in KNIME and coding them in Python. Algorithms covered are:
Linear Regression
Polynomial Regression
Support Vector Regressor
Regression Trees and Decision Tree Regressor
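
The sketch below fits the four regressors above to the same data and compares their R2 scores on a held-out part. It does not use the BIST 100 data from the class: the synthetic quadratic data, the noise level and the hyperparameters are assumptions, chosen so that the difference between a linear and a polynomial fit is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): y = 0.5 x^2 - x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 120)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.2, size=x.size)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    # Polynomial regression = polynomial features followed by a linear model.
    "polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "support vector regressor": SVR(kernel="rbf", C=10.0),
    "decision tree regressor": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Fit each model and score it on the held-out data.
r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2[name] = r2_score(y_test, model.predict(X_test))

for name, score in r2.items():
    print(f"{name}: R2 = {score:.3f}")
```

Because the data is quadratic, the straight line underfits, while the degree-2 model matches the shape of the data.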

Python code for the Regression

Knime Workflow and the BIST 100 data set for the Regression Algorithms

The data set was obtained from: finance.yahoo.com

Week 7 (Apr 2): Clustering Algorithms
Concepts of clustering algorithms, implementing the algorithms in KNIME and coding them in Python. Algorithms covered are:
K-Means
DBSCAN
Hierarchical Clustering
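
A minimal sketch of the three clustering algorithms above in scikit-learn follows. The data is an assumption: three well-separated synthetic blobs, so that every algorithm should recover the same grouping. The parameter values (eps=1.0, min_samples=5, Ward linkage) are also illustrative choices and would need tuning on real data.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three synthetic, well-separated groups of points (an assumption).
X, true_labels = make_blobs(
    n_samples=150,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)

# K-Means and hierarchical clustering need the number of clusters up front;
# DBSCAN instead needs a neighbourhood radius (eps) and a density threshold.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dbscan = DBSCAN(eps=1.0, min_samples=5).fit(X)
hierarchical = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

labelings = {
    "K-Means": kmeans.labels_,
    "DBSCAN": dbscan.labels_,
    "Hierarchical": hierarchical.labels_,
}

cluster_counts = {}
agreement = {}
for name, labels in labelings.items():
    # DBSCAN marks noise points with the label -1; do not count it as a cluster.
    cluster_counts[name] = len(set(labels) - {-1})
    # Adjusted Rand index: agreement with the known groups (1.0 = identical).
    agreement[name] = adjusted_rand_score(true_labels, labels)
    print(f"{name}: {cluster_counts[name]} clusters, ARI = {agreement[name]:.3f}")

# Silhouette needs no ground truth, so it also works on real, unlabeled data.
kmeans_silhouette = silhouette_score(X, kmeans.labels_)
print(f"K-Means silhouette: {kmeans_silhouette:.3f}")
```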

Knime Workflow

Python Code

Week 8 (Apr 9): Association Rule Mining
Concepts of association rule mining (ARM) and association rule learning (ARL) algorithms, implementing the algorithms in KNIME and coding them in Python. Algorithms covered are:
Apriori Algorithm
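
To make the idea concrete, here is a small from-scratch sketch of Apriori in plain Python. It is not the library used in class: the function names apriori and association_rules, the five shopping baskets and the thresholds are all made up for illustration. The algorithm grows frequent itemsets level by level and prunes any candidate that has an infrequent subset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules

# Made-up shopping baskets (an assumption, not the homework data).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, sup, conf in rules:
    print(set(antecedent), "->", set(consequent), f"support={sup:.2f} confidence={conf:.2f}")
```

With these baskets, the only rule that reaches 80% confidence is beer -> diapers: every basket that contains beer also contains diapers.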

Click here to download the Apriori library for the Python code

Click for Python code

Click for KNIME workflow

Homework: Link for Kaggle, Instacart

Week 9 (Apr 16): Concept of Error and Evaluation Techniques
n-Fold Cross Validation, LOO (leave-one-out), Split Validation
RMSE, MAE, R2 values for regression
Rand index, Silhouette, WCSS for clustering algorithms
Accuracy, Recall, Precision, F-Score, F1-Score etc. for classification algorithms
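
The validation schemes and metrics above can be tried out with scikit-learn as in the sketch below. The Iris data, the K-NN model and the small hand-written label and value lists are assumptions picked so the numbers can be checked by hand; this is not the code from the class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import (f1_score, mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score)
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Split validation: one train part, one test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
split_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# n-fold cross validation (here n = 10): every sample is tested exactly once.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)
)

# Leave-one-out (LOO): as many folds as samples, each test set has one sample.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(f"split: {split_accuracy:.3f}")
print(f"10-fold mean: {kfold_scores.mean():.3f}")
print(f"LOO mean: {loo_scores.mean():.3f}")

# Classification metrics on hand-written labels: TP=3, FP=1, FN=1, TN=3.
c_true = [1, 1, 1, 0, 0, 0, 0, 1]
c_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision = precision_score(c_true, c_pred)  # TP / (TP + FP)
recall = recall_score(c_true, c_pred)        # TP / (TP + FN)
f1 = f1_score(c_true, c_pred)                # harmonic mean of the two

# Regression metrics on hand-written values.
r_true = np.array([3.0, 5.0, 7.0, 9.0])
r_pred = np.array([2.5, 5.0, 8.0, 9.5])
mae = mean_absolute_error(r_true, r_pred)
rmse = np.sqrt(mean_squared_error(r_true, r_pred))
r2 = r2_score(r_true, r_pred)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```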

We also had an introduction to dimensionality reduction with PCA (principal component analysis) and to neural networks with MLP (multilayer perceptron).

Please don’t forget to install Keras for next week.

Week 10 (Apr 23): Collective Learning

This content was moved to the previous week because of the holiday.

Week 11 (Apr 30): Collective Learning and Consensus Learning and Clustering Algorithms: Ensemble Learning, Bagging, Boosting Techniques, Random Forest, GBM, XGBoost, LightGBM
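
As a rough illustration of bagging and boosting, the sketch below compares a single decision tree with three ensembles built from trees, using only scikit-learn. The breast cancer data set and the hyperparameters are assumptions made for the example. XGBoost and LightGBM are separate packages and are left out here so that the example runs without extra installation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A binary classification data set bundled with scikit-learn (an assumption).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    # Baseline: one tree, no ensemble.
    "single decision tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each trained on a bootstrap sample, then a vote.
    "bagging (50 trees)": BaggingClassifier(n_estimators=50, random_state=0),
    # Random forest: bagging plus a random subset of features at each split.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Boosting: trees are added one by one, each correcting the previous errors.
    "gradient boosting (GBM)": GradientBoostingClassifier(
        n_estimators=100, random_state=0
    ),
}

# Train each model and record its accuracy on the held-out data.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)

for name, score in accuracy.items():
    print(f"{name}: {score:.3f}")
```

The ensembles usually score at least as well as the single tree on data like this, but the exact numbers depend on the split and the random seed.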

Some links useful for the class:

Understanding the Boosting with a simple Decision tree: https://towardsdatascience.com/boosting-algorithm-gbm-97737c63daa3
Simplified version of GBM coding and visualization: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
Kaggle Entry for the same GBM story (Also holds the scratch codes of DecisionTree class): https://www.kaggle.com/grroverpr/gradient-boosting-simplified/
If you are curious about the splitting point and the std_agg or var_split functions : https://towardsdatascience.com/random-forests-and-decision-trees-from-scratch-in-python-3e4fa5ae4249

Readings and resources:

XGBoost Algorithm : https://xgboost.ai
The very early resource for the XGBoost: xgboost.readthedocs.io

Python Codes from the class :

Gradient Boosting:

XGBoost (to run the code, install XGBoost from the command prompt):

conda install -c conda-forge xgboost

Install XGBoost extension for Knime

Week 12 (May 7): Project Presentations First Group.
Presentations will be picked randomly during the class, and anyone who is absent will be considered not to have presented.
Project Deliveries (until May 6): Project Presentation, Project Report (explaining your project, your approach and methodologies, the difficulties you faced, the solutions you found, the results you achieved and links to your data sources), KNIME workflows (in .knwf format) and Python code (in .py format). Please put all these files in a single .zip or .rar archive and do not put more than 4 files in your archive.

Week 13 (May 14): Project Presentations Second Group

If you have missed the project presentations in the first week, please contact me for further details.
