# SIT384 Cyber security analytics

Pass Task 8.1P: PCA dimensionality reduction

Task description:

PCA (Principal Component Analysis) is a dimensionality reduction technique that projects data into a lower-dimensional space. It can be used to reduce high-dimensional data to 2 or 3 dimensions so that we can visualize and hopefully understand the data better.
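To make the idea of "projecting into a lower dimensional space" concrete, here is a minimal NumPy sketch of the computation PCA performs. The toy data, the variable names and the use of a singular value decomposition are assumptions made for illustration; the task itself uses scikit-learn's PCA class rather than this hand-written version.

```python
import numpy as np

# Hypothetical toy data: 100 samples, 5 features, driven by 2 underlying factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))        # the two hidden factors
mixing = rng.normal(size=(2, 5))          # how the factors spread over 5 features
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# Step 1: centre each feature at zero.
X_centered = X - X.mean(axis=0)

# Step 2: the right singular vectors of the centred data are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                       # top 2 directions, shape (2, 5)

# Step 3: project the data onto those directions.
X_2d = X_centered @ components.T          # shape (100, 2)

# Fraction of the total variance kept by the 2 components.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(X.shape, "->", X_2d.shape)
print(f"variance retained: {explained:.3f}")
```

Because the toy data was built from two factors plus a little noise, two components retain almost all of the variance. Real datasets usually lose more information in the projection.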

In this task, you use PCA to reduce the dimensionality of a given dataset and visualize the data.

You are given:

• Breast cancer dataset, which can be retrieved with:

    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()

  Detailed info is available at: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

• PCA(n_components=2)

• 3D plot settings (please refer to prac7 for 3D plot examples):

    from mpl_toolkits.mplot3d import Axes3D

    fig = plt.figure(figsize=(10, 8))
    cmap = plt.cm.get_cmap("Spectral")
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=10, azim=10)
    ax.scatter(x, y, z, c=cancer.target, cmap=cmap)

• Other settings of your choice
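Before scaling or transforming anything, it can help to check what the loader returns. The following is a small sketch that only inspects the dataset; the choice of which attributes to print is an assumption, not part of the task.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# data is the feature matrix, target holds one class label per sample.
print("data shape:  ", cancer.data.shape)
print("target shape:", cancer.target.shape)
print("classes:     ", cancer.target_names)
print("first three features:", cancer.feature_names[:3])
```

The number of columns reported for `cancer.data` is the dimensionality that PCA will reduce.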

You are asked to:

• use StandardScaler() to fit and transform cancer.data

• apply PCA(n_components=2) to fit and transform the scaled cancer.data

• print the scaled dataset shape and the PCA-transformed dataset shape for comparison

• create a 2D plot with the first principal component on the x axis and the second principal component on the y axis

• set a proper xlabel and ylabel for the 2D plot

• print the PCA component shape and component values

• create a 3D plot with the first 3 features (as x, y and z) of the scaled cancer.data

• create a 3D plot with the first principal component on the x axis and the second principal component on the y axis, with no value for the z axis

• set a proper title for each of the two 3D plots
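The scaling, PCA and 2D plotting steps listed above could be sketched roughly as follows. This is one possible arrangement, not a reference solution: the variable names, marker size, the `Agg` backend and the output file name `pca_2d.png` are all assumptions chosen so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open plot windows
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Standardise every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(cancer.data)

# Fit PCA on the scaled data and keep the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Compare the shapes before and after the reduction.
print("Scaled data shape:", X_scaled.shape)
print("PCA data shape:   ", X_pca.shape)

# Each row of components_ is one principal direction in the original feature space.
print("PCA components shape:", pca.components_.shape)
print("PCA components:\n", pca.components_)

# 2D scatter plot of the two principal components, coloured by class label.
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target, cmap="Spectral", s=15)
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
fig.savefig("pca_2d.png")  # hypothetical output file name
```

Scaling first matters because PCA is driven by variance: without it, features measured on large numeric scales would dominate the components.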

The sample output shown in the following figures is for demonstration purposes only. Yours might differ from it.
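The two 3D plots requested in the task list could be sketched as below. Note the swap: this sketch creates the 3D axes with `fig.add_subplot(projection="3d")` and sets the viewing angle with `view_init`, instead of the `Axes3D(fig, rect=...)` constructor given in the settings, because the constructor form may not behave the same on every Matplotlib version. The side-by-side layout, titles, axis labels and file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open plot windows
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

fig = plt.figure(figsize=(16, 7))

# Left: the first three scaled features as x, y and z.
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.view_init(elev=10, azim=10)
ax1.scatter(X_scaled[:, 0], X_scaled[:, 1], X_scaled[:, 2],
            c=cancer.target, cmap="Spectral")
ax1.set_xlabel(cancer.feature_names[0])
ax1.set_ylabel(cancer.feature_names[1])
ax1.set_zlabel(cancer.feature_names[2])
ax1.set_title("First three features of scaled data")

# Right: the two principal components as x and y.
# No z values are passed, so every point is drawn at the default z = 0.
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.view_init(elev=10, azim=10)
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target, cmap="Spectral")
ax2.set_xlabel("First principal component")
ax2.set_ylabel("Second principal component")
ax2.set_title("First two principal components")

fig.savefig("pca_3d.png")  # hypothetical output file name
```

With no z values, the PCA points all lie in a single plane, which makes the reduction from three plotted dimensions to two directly visible.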

Submission:

Submit the following files to OnTrack:

1. Your program source code (e.g. task8_1.py)

2. A screenshot of your program running

Check the following things before submitting:

1. Add proper comments to your code

Pass Task 7.1P: K-Means and Hierarchical Clustering

Task description:

In machine learning, clustering is used to analyze and group data that has no pre-labelled classes, or no class attribute at all. K-Means clustering and hierarchical clustering are both unsupervised learning algorithms.

K-Means divides objects into clusters, where each cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Each object is in exactly one cluster, not several.
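As an illustration of how such a partition can be computed, here is a minimal NumPy sketch of Lloyd's iteration, the procedure usually meant by "K-Means". The toy data, the hand-picked starting centers and the fixed number of iterations are assumptions; scikit-learn's KMeans adds k-means++ initialization, multiple restarts and convergence checks on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: three tight, well-separated 2D groups of 20 points each.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in true_centers])

# Start from one point taken from each group (indices picked by hand for this data).
centers = X[[0, 20, 40]].copy()

for _ in range(10):
    # Assignment step: each point goes to its single nearest center.
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each center to the mean of the points assigned to it.
    centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# Every point carries exactly one label, so the cluster sizes sum to len(X).
print("cluster sizes:", np.bincount(labels))
print("centers:\n", centers)
```

The `argmin` in the assignment step is what guarantees that each object lands in exactly one cluster.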

In hierarchical clustering, clusters form a tree-like structure with parent-child relationships. In the agglomerative variant, the two most similar clusters are merged, and merging continues until all objects are in the same cluster.
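The merge sequence can be inspected directly with SciPy's `linkage` function. The sketch below uses a made-up set of five one-dimensional points and average linkage; both choices are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: five 1D points, two close pairs and one outlier.
X = np.array([[1.0], [1.2], [5.0], [5.1], [20.0]])

# Z records the merges in order; n points always need n - 1 merges.
Z = linkage(X, method="average")

# Each row is one merge: (cluster i, cluster j, distance between them, new cluster size).
# Indices 0..4 are the original points; 5 and above are clusters created by earlier merges.
for i, j, dist, size in Z:
    print(f"merge {int(i)} + {int(j)} at distance {dist:.2f} -> size {int(size)}")
```

The last row always produces a cluster containing every point, which is the root of the tree.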

In this task, you use the K-Means and Agglomerative Hierarchical algorithms to cluster a synthetic dataset and compare their results.

You are given:

• np.random.seed(0)

• make_blobs function with input:

o n_samples: 200

o centers: [3, 2], [6, 4], [10, 5]

o cluster_std: 0.9

• KMeans() function with setting: init = "k-means++", n_clusters = 3, n_init = 12

• AgglomerativeClustering() function with setting: n_clusters = 3, linkage = average

• Other settings of your choice
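With the settings above, generating and plotting the synthetic dataset could look roughly like this. The plot title, marker style, `Agg` backend and file name `blobs.png` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open plot windows
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

# Seed NumPy's global generator, which make_blobs uses when no random_state is given.
np.random.seed(0)

# 200 points drawn around three centers, each with standard deviation 0.9.
X, y = make_blobs(n_samples=200, centers=[[3, 2], [6, 4], [10, 5]], cluster_std=0.9)
print("X shape:", X.shape)
print("y shape:", y.shape)

# Plot the raw points without colouring, since clustering ignores the true labels.
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], marker=".")
ax.set_title("Generated dataset")  # hypothetical title
fig.savefig("blobs.png")           # hypothetical output file name
```

The returned `y` holds the blob each point was drawn from. It is not used for clustering, but it can serve as a reference when judging the results.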

You are asked to:

• plot your created dataset

• plot the two clustering models for your created dataset

• set the K-Means plot with title “KMeans”

• set the Agglomerative Hierarchical plot with title “Agglomerative Hierarchical”

• calculate the distance matrix for agglomerative clustering from the input feature matrix (linkage = complete)

• display the dendrogram
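Fitting and plotting the two clustering models from the task list might be sketched as follows. The side-by-side layout, colour map, center markers and file name are assumptions, and the adjusted Rand index is an optional extra for comparing the two labelings that the task does not ask for.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open plot windows
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

np.random.seed(0)
X, _ = make_blobs(n_samples=200, centers=[[3, 2], [6, 4], [10, 5]], cluster_std=0.9)

# Fit both models with the settings given in the task.
kmeans = KMeans(init="k-means++", n_clusters=3, n_init=12).fit(X)
agglom = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)

fig, (ax_km, ax_ag) = plt.subplots(1, 2, figsize=(14, 6))

# K-Means: points coloured by cluster, with the fitted centers marked.
ax_km.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="Spectral", s=15)
ax_km.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
              c="black", marker="x", s=80)
ax_km.set_title("KMeans")

# Agglomerative: same points coloured by the hierarchical cluster labels.
ax_ag.scatter(X[:, 0], X[:, 1], c=agglom.labels_, cmap="Spectral", s=15)
ax_ag.set_title("Agglomerative Hierarchical")

fig.savefig("clustering.png")  # hypothetical output file name

# Cluster numbering is arbitrary, so compare the groupings rather than the raw labels.
print("agreement (ARI):", adjusted_rand_score(kmeans.labels_, agglom.labels_))
```

The colours in the two panels need not match even where the groupings agree, because each model numbers its clusters independently.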

The sample output shown in the following figure is for demonstration purposes only. Yours might differ from it.
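For the distance matrix and dendrogram, one possible approach uses SciPy. This sketch computes pairwise Euclidean distances with `pdist`, expands them to a square matrix with `squareform` for inspection, and passes the condensed form to `linkage` with complete linkage. The Euclidean metric, the hidden leaf labels, the title and the file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open plot windows
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs

np.random.seed(0)
X, _ = make_blobs(n_samples=200, centers=[[3, 2], [6, 4], [10, 5]], cluster_std=0.9)

# Pairwise Euclidean distances in condensed form (one entry per unordered pair).
condensed = pdist(X, metric="euclidean")

# Square, symmetric n x n distance matrix, convenient for printing and inspection.
dist_matrix = squareform(condensed)
print("distance matrix shape:", dist_matrix.shape)

# Complete linkage: the distance between two clusters is that of their farthest pair.
Z = linkage(condensed, method="complete")

# Draw the merge tree; leaf labels are hidden because 200 of them would overlap.
fig, ax = plt.subplots(figsize=(12, 6))
dendrogram(Z, ax=ax, no_labels=True)
ax.set_title("Dendrogram (complete linkage)")  # hypothetical title
ax.set_ylabel("Merge distance")
fig.savefig("dendrogram.png")                  # hypothetical output file name
```

Passing the condensed distances to `linkage` tells SciPy to treat the input as distances. A square matrix passed directly would be interpreted as a set of observations instead.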

Submission:

Submit the following files to OnTrack:

1. Your program source code (e.g. task7_1.py)

2. A screenshot of your program running

Check the following things before submitting: