Dimensionality Reduction – (PCA)
Variation in a dataset is the information it carries, and this is what PCA exploits. In simple terms, principal component analysis (PCA) is a process that emphasises the variation in a dataset and brings out its strong patterns.
We can summarise the whole concept in three points:
- We reduce the dimensions of the data by finding a new set of variables that is smaller than the existing set of variables
- We retain the maximum information in the process
- Data compression and classification are the main use cases
Reducing the dimension of the feature space is called dimensionality reduction, and it can be carried out by one of the following processes:
A. Feature elimination
B. Feature extraction
Feature elimination is a process in which we analyse the existing data and remove certain features to reduce the feature space. The data becomes simpler and the interpretability of the variables is maintained. A major disadvantage is that we gain no information from the dropped variables, i.e. the eliminated variables can no longer contribute whatever benefit they might have provided to the predictive model.
Feature extraction is a slightly different process. Here we create new independent variables by applying a certain process to the existing feature variables. Suppose we have 10 independent variables and need to work with only 2; we then generate those 2 new variables through certain mathematical processes. We follow a specific procedure that results in a new, reduced set of variables carrying most of the information from the older set. We rank the new variables by how well they can predict the dependent variable. We then drop the least important ones and keep the most important variables, which carry most of the information from the existing features.
Principal component analysis is a feature extraction process in which we combine the input variables in a specific way, drop the least important combinations, and retain the most valuable parts, which hold the maximum information about the data.
Why Principal Component Analysis (PCA)?
PCA plays a very important role in data preprocessing. High-dimensional data brings many challenges, grouped under the ‘curse of dimensionality’. PCA lets you analyse such data and extract meaningful, important features in fewer dimensions while retaining the required information. A major problem such as overfitting in a machine learning model can often be reduced by cutting down the number of independent variables. Apart from this, if you need to visualise your data on x and y coordinates but have a large number of variables, PCA is again a very useful technique for scaling down your feature variables so that you can visualise their spread in a two- or three-dimensional coordinate system.
Steps to calculate principal components
- Collect the high-dimensional, correlated data points.
- Centre the points, i.e. transform the data so that each feature has mean 0 (standardising the data additionally rescales each feature to unit variance)
- Compute the covariance matrix to find the directions of maximum variance
- Compute the eigenvalues and eigenvectors
- Select m < d eigenvectors with the highest eigenvalues
- Project the data points onto those eigenvectors
- Generate uncorrelated, low-dimensional data
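The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function name `pca_covariance` and the synthetic data are assumptions made for the example, and `numpy.linalg.eigh` is used here because a covariance matrix is symmetric.

```python
import numpy as np

def pca_covariance(X, m):
    """Project X (n_samples x d) onto its top-m principal components.

    Hypothetical helper written for this illustration only.
    """
    X = np.asarray(X, dtype=float)
    # Centre each column so that every feature has mean 0
    centred = X - X.mean(axis=0)
    # d x d covariance matrix of the centred features
    cov = np.cov(centred, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    values, vectors = np.linalg.eigh(cov)
    # Keep the m eigenvectors with the largest eigenvalues
    order = np.argsort(values)[::-1][:m]
    values, vectors = values[order], vectors[:, order]
    # Project the centred data onto the selected eigenvectors
    projected = centred @ vectors
    return projected, values, vectors

# Synthetic, correlated 3-dimensional data (made up for the example)
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

projected, values, vectors = pca_covariance(X, m=2)
print(projected.shape)
# The covariance of the projected data is diagonal: the new variables are uncorrelated
print(np.round(np.cov(projected, rowvar=False), 6))
```

The diagonal of the printed matrix holds the two selected eigenvalues, while the off-diagonal entries are zero up to rounding.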
"Pure mathematics is, in its way, the poetry of logical ideas." - Albert Einstein
Let’s begin by assuming two-dimensional data, with features distributed along the X1 and X2 coordinates. To reduce the features we follow these steps:
1. Generate a new coordinate e1 chosen so that the data points have maximum variance when projected onto it.
2. Generate e2 orthogonal to e1 and again project all the data points onto it.
3. Compare the variances and select the axis with the maximum variance; the values projected onto it form the new principal component.
In this case e1 and e2 are the two principal components. e1 has the maximum variance: the data points project onto well-separated positions along e1, whereas along e2 the projected points lie closer together, and the two data points shown as red dots project onto almost the same position on e2.
High variance means that the maximum information is retained after reducing the dimension from X1 and X2 to e1. So e1 is taken as the first principal component.
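To make the idea of "the axis with maximum variance" concrete, here is a small sketch that tries many candidate directions for e1, keeps the one with the largest projected variance, and then builds e2 orthogonal to it. The data is synthetic and the brute-force sweep over angles is an assumption chosen for clarity; it is not how PCA is computed in practice.

```python
import numpy as np

# Synthetic, correlated two-dimensional data (made up for the example)
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)  # centre the data

# Candidate unit vectors at angles from 0 to 180 degrees
angles = np.linspace(0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Variance of the data projected onto each candidate direction
variances = (X @ directions.T).var(axis=0, ddof=1)

# e1 is the direction with maximum projected variance, e2 is orthogonal to it
e1 = directions[np.argmax(variances)]
e2 = np.array([-e1[1], e1[0]])

var_e1 = (X @ e1).var(ddof=1)
var_e2 = (X @ e2).var(ddof=1)
print("variance along e1:", var_e1)
print("variance along e2:", var_e2)
```

Because e1 and e2 are orthogonal unit vectors, the two variances add up to the total variance of the data, so whatever e1 captures is exactly what e2 does not.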
We start by defining the set of principal components under the following assumption.
Suppose we have d-dimensional data. Then:
a. Find the direction of greatest variability in the data; that is the first principal component
b. Then find the direction perpendicular to the previous one as the second principal component, and so on up to d (the actual dimension)
c. We keep m dimensions, with m < d, that carry the maximum variability and information.
PCA is applied to an n × m matrix A and results in a matrix B, the projection of A onto the directions of maximum variance within the data.
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}, \qquad B = \mathrm{PCA}(A)$$
We start by calculating the mean of each column.
$$M = \operatorname{mean}(A), \qquad m_{11} = \frac{a_{11} + a_{21} + a_{31}}{3}, \qquad m_{12} = \frac{a_{12} + a_{22} + a_{32}}{3}$$
We centre the values in each column by subtracting the column mean.
Next we calculate the covariance matrix of the centred matrix C. Covariance measures the amount and direction in which two columns change together. It is a generalised, unnormalised version of correlation across multiple columns. The covariance matrix holds the covariance score of every column with every other column, including itself.
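As a small check of this description, the sketch below computes the covariance matrix by hand for a 3×2 matrix and compares it with `numpy.cov`. It also divides by the standard deviations to show the link to correlation. The matrix values and variable names are chosen just for the illustration.

```python
import numpy as np

A = np.array([[6, 5], [2, 3], [1, 4]], dtype=float)
n = A.shape[0]

# Centre the columns
C = A - A.mean(axis=0)

# Covariance of every column with every other column (unbiased, divides by n - 1)
V_manual = C.T @ C / (n - 1)
V_numpy = np.cov(A, rowvar=False)
print(V_manual)  # [[7. 2.] [2. 1.]]

# Normalising by the standard deviations turns covariance into correlation
std = np.sqrt(np.diag(V_manual))
corr = V_manual / np.outer(std, std)
print(corr)
```

The diagonal of the covariance matrix holds the variance of each column, which is why each column's covariance "with itself" appears there.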
Finally, we compute the eigendecomposition of the covariance matrix V, which yields a list of eigenvalues and eigenvectors.
C = A - M
V = cov(C)
values, vectors = eig(V)
An eigenvector is a direction in the coordinate space defined by a matrix that does not change direction when the matrix transformation is applied; it is only scaled.
An eigenvalue is the scalar that, multiplied by the eigenvector, gives the same result as multiplying the eigenvector by the matrix.
Suppose A is a matrix and v is an eigenvector; then
Av = λv is the relation between the eigenvector ‘v’ and its eigenvalue ‘λ’.
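The relation Av = λv is easy to verify numerically. The following sketch uses a small symmetric 2×2 matrix (picked for the illustration) and checks the relation for each eigenpair returned by `numpy.linalg.eig`.

```python
import numpy as np

# A small symmetric matrix, used here only as an example
V = np.array([[7.0, 2.0], [2.0, 1.0]])

# Column i of `vectors` is the eigenvector belonging to values[i]
values, vectors = np.linalg.eig(V)

checks = []
for i in range(len(values)):
    v = vectors[:, i]
    lam = values[i]
    # Multiplying by the matrix only rescales v by its eigenvalue
    checks.append(np.allclose(V @ v, lam * v))

print(values)
print(checks)
```

Note that the eigenvectors are returned as columns, not rows, which is a common source of confusion when indexing them.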
The eigenvectors are sorted by their eigenvalues in descending order to rank the components. Eigenvalues close to zero represent components that can be discarded. Ideally we select the k eigenvectors, or principal components, that have the k largest eigenvalues.
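One possible way to do this ranking and pick k is sketched below. The 95% threshold and the variable names are assumptions for the illustration; the right threshold depends on the problem.

```python
import numpy as np

# Example covariance matrix (symmetric), chosen for the illustration
V = np.array([[7.0, 2.0], [2.0, 1.0]])

# eigh returns eigenvalues in ascending order, so we reorder them
values, vectors = np.linalg.eigh(V)
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# Share of the total variance explained by each component
ratio = values / values.sum()
cumulative = np.cumsum(ratio)

# Smallest k whose components together explain at least 95% of the variance
threshold = 0.95  # assumed cut-off, not a universal rule
k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(values))

B = vectors[:, :k]  # the selected principal components, one per column
print(ratio)
print(k)
```

Here the first component alone explains roughly 95% of the variance, so k comes out as 1.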
B = select(values, vectors)
Once chosen, the data can be projected into the subspace via matrix multiplication:
P = B^T . A
where A is the original data to be projected, B^T is the transpose of the chosen principal components, and P is the projection of A. In the example below, it is the centred data C that gets projected.
This is the covariance method of PCA.
The example below uses a 3×2 matrix. The matrix is centred first, then the covariance of the centred data is calculated. The eigenvalues and eigenvectors are then computed as the principal components and used to project the data.
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[6, 5], [2, 3], [1, 4]])
print(A)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
#output
[[6 5]
 [2 3]
 [1 4]]
[3. 4.]
[[ 3.  1.]
 [-1. -1.]
 [-2.  0.]]
[[7. 2.]
 [2. 1.]]
[[ 0.95709203 -0.28978415]
 [ 0.28978415  0.95709203]]
[7.60555128 0.39444872]
[[ 3.16106023  0.08773958]
 [-1.24687618 -0.66730788]
 [-1.91418405  0.5795683 ]]
We can see that only the first eigenvector is needed to capture most of the information in the data. So the 3×2 matrix can be converted into a 3×1 matrix.
P holds the scalar projections of the centred data onto the eigenvectors, which are called the principal components. The first column is PC1; it loses the least information because its associated eigenvalue is much larger than the other.
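One way to see how little is lost is to keep only PC1, map the 3×1 result back to two dimensions, and measure the error. The sketch below does this for the same 3×2 matrix; the reconstruction step and the variable names are additions for illustration only.

```python
import numpy as np

A = np.array([[6, 5], [2, 3], [1, 4]], dtype=float)
M = A.mean(axis=0)
C = A - M

# Eigendecomposition of the covariance matrix (eigh: symmetric input)
values, vectors = np.linalg.eigh(np.cov(C, rowvar=False))
pc1 = vectors[:, np.argmax(values)]  # direction with the largest eigenvalue

# 3x1 representation: one score per data point
scores = C @ pc1

# Map the scores back to the original 2-dimensional space
approx = np.outer(scores, pc1) + M

# Total squared error of the approximation
sq_error = np.sum((A - approx) ** 2)
print(approx)
print(sq_error)
```

The squared error equals (n - 1) times the discarded eigenvalue, so a small second eigenvalue means that very little of the data is lost.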
Using Scikit-Learn for PCA
We can calculate PCA using the PCA() class of the scikit-learn library. The benefit is that once the projection has been calculated, it can be applied to new data repeatedly. The number of components is passed as a parameter when creating the class instance.
The class is fit to the dataset using the fit() function, and the original or any other dataset can be projected into the subspace with the chosen number of dimensions by calling the transform() function. After the data is fit, the explained_variance_ and components_ attributes of the PCA instance give the eigenvalues and principal components.
We will use the same data matrix for PCA with scikit-learn in the following example.
from numpy import array
from sklearn.decomposition import PCA
# define a matrix
A = array([[6, 5], [2, 3], [1, 4]])
print(A)
# create the PCA instance
pca = PCA(2)
# fit on data
pca.fit(A)
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)
# transform data
B = pca.transform(A)
print(B)
#Output
[[6 5]
 [2 3]
 [1 4]]
[[ 0.95709203  0.28978415]
 [ 0.28978415 -0.95709203]]
[7.60555128 0.39444872]
[[ 3.16106023 -0.08773958]
 [-1.24687618  0.66730788]
 [-1.91418405 -0.5795683 ]]
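The second component here has the opposite sign to the one from the covariance method. That is expected, because an eigenvector is only defined up to its sign. The sketch below checks that both approaches agree up to sign, and then applies the fitted PCA to new points. The `new_points` values are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[6, 5], [2, 3], [1, 4]], dtype=float)

# Covariance method, eigenvalues sorted in descending order
C = A - A.mean(axis=0)
values, vectors = np.linalg.eigh(np.cov(C, rowvar=False))
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# scikit-learn stores one component per row of components_
pca = PCA(n_components=2).fit(A)

# |dot product| of 1 means the same direction, possibly with flipped sign
alignment = np.abs(np.sum(pca.components_ * vectors.T, axis=1))
same_variances = np.allclose(pca.explained_variance_, values)
print(alignment, same_variances)

# Reuse the fitted projection on new data (hypothetical points)
new_points = np.array([[3.0, 4.0], [5.0, 5.0]])
projected = pca.transform(new_points)

# transform() is equivalent to centring with the training mean and projecting
manual = (new_points - pca.mean_) @ pca.components_.T
print(projected)
```

The first new point is the mean of the training data, so it lands at the origin of the projected space.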
Now we can apply PCA to a bigger dataset. We will use the Breast Cancer dataset, which has 30 features and two class labels, i.e. two types of tumour: ‘malignant’ and ‘benign’.
from sklearn.datasets import load_breast_cancer
df = load_breast_cancer()
print(df.keys())
#output
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
X = df['data']
Y = df['target']
print(df['target_names'])
print(df['feature_names'])
print(len(df['feature_names']))
#Output
['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
30
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
from sklearn.decomposition import PCA
# create the PCA instance
pca = PCA(2)
pca.fit(xtrain)
print(pca.n_components_)
xtr = pca.transform(xtrain)
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)
#Output
2
#Eigenvalues
array([412335.99291139, 7757.33074231])
array([0.97981073, 0.01843331])
It shows that the first eigenvalue is much larger, so the first principal component carries most of the information in the data. Its explained variance ratio shows that it holds about 98% of the variance of the existing data.
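A ratio this high is partly an effect of scale: the features of this dataset are measured in very different units, so the features with large numeric values (such as the area measurements) dominate the variance. As a hedged sketch, the code below standardises the features with `StandardScaler` before PCA and compares the explained variance ratios. The fixed `random_state` and the 95% target are assumptions added for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# random_state fixed only to make the example repeatable
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# PCA on the raw features, as in the text
raw_ratio = PCA(n_components=2).fit(xtrain).explained_variance_ratio_

# PCA on standardised features (mean 0, unit variance per feature)
scaler = StandardScaler().fit(xtrain)
pca_full = PCA().fit(scaler.transform(xtrain))
scaled_ratio = pca_full.explained_variance_ratio_

# Number of components needed to explain 95% of the standardised variance
cumulative = np.cumsum(scaled_ratio)
k_95 = int(np.argmax(cumulative >= 0.95)) + 1

print("raw data, first two ratios:   ", raw_ratio)
print("scaled data, first two ratios:", scaled_ratio[:2])
print("components for 95% variance:  ", k_95)
```

After standardising, the variance is spread over more components, so two components keep a smaller share of it. Whether to standardise depends on whether the original units are meaningful for the task.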
from sklearn.neighbors import KNeighborsClassifier
# project the test set with the PCA that was fitted on the training set
xts = pca.transform(xtest)
kmodel = KNeighborsClassifier()
kmodel.fit(xtr, ytrain)
print(kmodel.score(xtr, ytrain))
print(kmodel.score(xts, ytest))
#Training Accuracy
0.9413145539906104
#Testing Accuracy
0.9300699300699301
It shows the training and testing accuracy when the model is trained on the dimensions reduced by PCA.
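The same workflow can also be written as a scikit-learn pipeline, which keeps the PCA and the classifier together so that the test data is always transformed with the projection learned from the training data. This is just one possible arrangement, and the fixed `random_state` is an assumption added to make the run repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# PCA to 2 components followed by k-nearest neighbours
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier())
model.fit(xtrain, ytrain)  # fits PCA on xtrain, then KNN on the reduced data

train_score = model.score(xtrain, ytrain)
test_score = model.score(xtest, ytest)  # xtest is projected automatically
print(train_score, test_score)
```

The exact accuracies depend on the train/test split, so they will not match the figures above digit for digit.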
import matplotlib.pyplot as plt
# training data in the reduced space, coloured by the true labels
plt.scatter(xtr[:, 0], xtr[:, 1], c=ytrain)
plt.show()
# test data in the reduced space, coloured by the true labels
xts = pca.transform(xtest)
plt.scatter(xts[:, 0], xts[:, 1], c=ytest)
plt.show()
# test data in the reduced space, coloured by the predicted labels
ypred_test = kmodel.predict(xts)
plt.scatter(xts[:, 0], xts[:, 1], c=ypred_test)
plt.show()
From the data plots, coloured by the given labels, we can conclude that after reducing the dimensions from 30 to 2 using PCA, a major part of the information is retained for training our model.