Sunday, 23 September 2018

Machine Learning - Data Processing

Chapter 2 - Data Preprocessing: a crucial part of machine learning.

1.       Dataset: Get the dataset that we will use for preprocessing. Suppose you have a dataset like the one below.

It is stored in Data.csv.

In a machine learning model we first need to identify the dependent and independent variables.

In the example below, the first three columns (Name, Age, Salary) are the independent variables and the Purchased column is the dependent variable.

 Name    Age    Salary    Purchased
 Bon     44     72000     No
 Ram     27     48000     Yes
 Sohan   30     54000     No
 eric    38     61000     No
 Mat     40               Yes
 Denie   35     58000     Yes
 Andie          52000     No
 Rus     48     79000     Yes
 Mak     50     83000     No
 Mark    37     67000     Yes
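If you want to follow along, the sample table above can be recreated and saved as Data.csv with a short pandas sketch (this snippet is not part of the original tutorial; it just reproduces the data, using np.nan for the two missing entries: Mat's salary and Andie's age).

```python
import numpy as np
import pandas as pd

# Recreate the sample dataset; np.nan marks the missing values.
data = {
    "Name": ["Bon", "Ram", "Sohan", "eric", "Mat",
             "Denie", "Andie", "Rus", "Mak", "Mark"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
}
df = pd.DataFrame(data)
df.to_csv("Data.csv", index=False)
print(df.shape)  # (10, 4)
```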

2.       Importing Libraries:

So let's start with Python. We are using Spyder, which you can get from the Anaconda distribution.

Now we import the following three libraries:

 #import libraries
 import numpy as np
 import matplotlib.pyplot as py
 import pandas as pd

NumPy, imported under the shortcut np, is generally used for mathematical operations.

Matplotlib is the library whose sub-library pyplot we import under the shortcut py; it is used to draw charts.

Pandas is used to import the dataset; it is the best library for loading and working with datasets.

Now select the code and press Ctrl + Enter; you can see the output in the console.
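A tiny sketch of what each of the three libraries is for (illustrative only; the tutorial itself only imports them at this point):

```python
import numpy as np
import matplotlib.pyplot as py   # the tutorial aliases pyplot as "py"
import pandas as pd

# numpy: mathematical operations on arrays
a = np.array([1, 2, 3])
print(a.mean())                  # 2.0

# pandas: tabular data
df = pd.DataFrame({"x": [1, 2, 3]})
print(df.shape)                  # (3, 1)

# matplotlib: drawing charts (saved to a file here, since there
# may be no interactive window)
py.plot(a, a ** 2)
py.savefig("squares.png")
```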

3.       Importing the Dataset:

Before importing the dataset you need to set the working directory in Spyder. In the right-side window there is a File Explorer tab; browse to the folder you want to work in, save your script there, and then run it with the Run File (F5) command.

Now we need to import the Data.csv file described in step 1.

 # import the dataset
 dataset = pd.read_csv('Data.csv')

Now run that line using Ctrl + Enter. You can then open the Variable Explorer tab and inspect the imported dataset.
Note that in Python the index starts from 0, as you can see in the output.

Now we need to separate the matrix of features from the dependent variable.
We create the matrix of features X from the three independent variables, and the vector Y for the dependent variable.

 X = dataset.iloc[:,:-1].values
 Y = dataset.iloc[:,3].values

[:, :-1] – the first index selects rows and the second selects columns. The first : takes all the rows.
:-1 takes the columns from position 0 up to, but not including, the last one.
Run that line using Ctrl + Enter and check the value of X; it contains all the independent variables.

[:, 3] – this takes all the rows and the column at index 3. The order is [rows, columns].
Now press Ctrl + Enter and check the variable Y.
It contains all the dependent variable values.
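The [rows, columns] indexing above can be sketched on a two-row toy frame (column names taken from the sample dataset):

```python
import pandas as pd

# Toy frame: iloc indexing is [rows, columns], zero-based.
df = pd.DataFrame({"Name": ["Bon", "Ram"],
                   "Age": [44, 27],
                   "Salary": [72000, 48000],
                   "Purchased": ["No", "Yes"]})

X = df.iloc[:, :-1].values   # all rows, every column except the last
Y = df.iloc[:, 3].values     # all rows, the column at index 3 (Purchased)

print(X.shape)   # (2, 3)
print(Y)         # ['No' 'Yes']
```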

4.       Missing Data:
The first problem with data is missing values, which happens quite often in real life. So how do we deal with it? If you look at the dataset above, two values are missing: Mat's Salary and Andie's Age.
One option is to remove those rows, but that is not a good choice because you lose data. A better idea is to replace each missing value with the mean of its column.

We will use a library to fill in the missing data rather than writing our own mean-imputation method.
We will use the Imputer class from scikit-learn's preprocessing sub-library to do it.

 # Fill the missing data with the mean of the column values
 from sklearn.preprocessing import Imputer
 imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
 imputer = imputer.fit(X[:, 1:3])
 X[:, 1:3] = imputer.transform(X[:, 1:3])

Run that code using Ctrl + Enter and check the value of X.

You can see that Mat's and Andie's missing values have been filled in with the column means.
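Note that in scikit-learn 0.20 and later, the Imputer class was replaced by SimpleImputer from sklearn.impute. A minimal sketch of the same mean-fill on a toy Age/Salary array (not part of the original post):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Age/Salary matrix with one missing value in each column.
X = np.array([[44.0, 72000.0],
              [40.0, np.nan],     # a missing salary, like Mat's
              [np.nan, 52000.0]]) # a missing age, like Andie's

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)

print(X)
# Missing age -> mean(44, 40) = 42.0
# Missing salary -> mean(72000, 52000) = 62000.0
```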

5.       Train/Test Split
A machine learning model learns from whatever data it is given. So we split the dataset into two parts: a training set and a test set.
Here we import train_test_split from sklearn.model_selection.
 # split the dataset into train and test
 from sklearn.model_selection import train_test_split
 X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)

Here test_size = 0.30 means 30% of the total data is used as test data.
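As a quick sanity check of the proportions, a 30% test split of 10 samples gives 3 test rows and 7 training rows (this toy array is illustrative, not the tutorial's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
Y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42)

print(len(X_train), len(X_test))   # 7 3
```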

Full code, along with categorical encoding and feature scaling:

 # -*- coding: utf-8 -*-
 """
 Spyder Editor

 This is a temporary script file.
 """

 #import libraries
 import numpy as np
 import matplotlib.pyplot as py
 import pandas as pd

 # import the dataset
 dataset = pd.read_csv('Data.csv')
 X = dataset.iloc[:,:-1].values
 Y = dataset.iloc[:,3].values

 # Fill the missing data with the mean of the column values
 from sklearn.preprocessing import Imputer
 imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
 imputer = imputer.fit(X[:, 1:3])
 X[:, 1:3] = imputer.transform(X[:, 1:3])
 print(X)

 # Encode categorical data
 from sklearn.preprocessing import LabelEncoder, OneHotEncoder
 # label encoder
 labelencoder_X = LabelEncoder()
 X[:,0] = labelencoder_X.fit_transform(X[:,0])
 labelencoder_y = LabelEncoder()
 Y = labelencoder_y.fit_transform(Y)
 # one hot encoder (column 0 holds the categorical Name values)
 onehotencoder = OneHotEncoder(categorical_features = [0])
 X = onehotencoder.fit_transform(X).toarray()

 # split the data into train and test data
 from sklearn.model_selection import train_test_split
 X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)

 # feature scaling: fit the scaler on the training set only,
 # then apply the same scaling to the test set
 from sklearn.preprocessing import StandardScaler
 sc_X = StandardScaler()
 X_train = sc_X.fit_transform(X_train)
 X_test = sc_X.transform(X_test)