
Data Preprocessing

This dataset contains 4 columns and 10 observations (rows). We have to distinguish between dependent and independent variables: here Country, Age, and Salary are independent, and Purchased is dependent. We try to predict the dependent values based on the independent variables, i.e. whether the customer has purchased or not based on their country, age, and salary!

Get the datasets here: SuperDataScience

Unzip the files and move them inside the "Part XX - Title" folder, depending on your needs.
Among those files you have the dataset (CSV, Excel, or something similar) plus the R and Python code!


IMPORTING LIBRARIES

Create a new file in Spyder called data_preprocessing_template.py. Libraries are tools you use to do the jobs you need; a minimal sketch of the usual imports follows.
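These are the three libraries this tutorial leans on (the aliases np, plt, and pd are the usual conventions and are assumed by the snippets below):

Python
import numpy as np                 # mathematical tools, arrays
import matplotlib.pyplot as plt    # charts and plots
import pandas as pd                # importing and managing datasets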

IMPORT THE DATASET

Inside the same file we first set our working directory: on Mac, in the File explorer tab, click the button that sets the current folder as the working directory.

Or just save your Python file in the folder where your dataset is saved, and press F5 to run the code (even if it is still empty). Importing the dataset:

Python
dataset = pd.read_csv('Data.csv')

Select the line and press Ctrl+Enter to run it. To check that it was imported correctly, go to the Variable explorer tab (upper right) and double-click on your dataset; this will open the dataset.

In Python, observations (rows) start from 0!

You can format float numbers by clicking on Format and setting it to %.0f instead of scientific notation. Now that we have our dataset, we need to distinguish between the matrix of features (independent variables) and the dependent variable vector.

We create the matrix of features and fill it with the first three columns:

Python
X = dataset.iloc[:, :-1].values

# :       all the rows
# :-1     all the columns except the last one
# .values takes the values as a NumPy array

Select and press Ctrl+Enter; now if you type X in the console, it will print your matrix of features. You'll see later why we build it this way in Python.

We create the dependent variable vector and fill it with the fourth column only:

Python
y = dataset.iloc[:, 3].values

Here we type the column index (3 for the Purchased column). Press Ctrl+Enter, then type y in the console to double-check.

In the Python part of the next tutorial, ‘NaN’ has to be replaced by ‘np.nan’

MISSING DATA

This is the case where you have some missing data in your dataset.

If we open Data.csv, there are two missing values. To handle this problem, you could:

  • remove the lines with missing data (this could be dangerous if they contain crucial information);
  • fill each missing value with the mean of its column (usually the better option).

If you can't see the full array, just run np.set_printoptions(threshold = np.inf). (Older tutorials pass np.nan here, but recent NumPy versions reject a NaN threshold.)

We use a library that will do this job for us (taking care of the missing data):

Python
from sklearn.impute import SimpleImputer

The scikit-learn library contains tools to build ML models. From it we take the impute module, which contains classes/methods to preprocess datasets; its SimpleImputer class allows us to take care of missing data.

Create an instance of this class. After typing the class name, press Ctrl+I to see what kind of parameters it expects:

Python
imputer_settings = SimpleImputer(missing_values = np.nan, strategy = 'mean')

We type np.nan because that is how missing values appear inside the dataset. To feed the imputer with the dataset we call .fit(X[:, 1:3]). In X[rows, cols], ":" means all rows and "1:3" means columns 1 to 2 (3 excluded): we pass in only the columns where data is missing.

Python
imputer_settings = imputer_settings.fit(X[:, 1:3])

.fit() computes the mean of each column; print(imputer_settings.statistics_) prints these fitted means.

Replace the missing data with the means of their columns (transform method):

Python
X[:, 1:3] = imputer_settings.transform(X[:, 1:3])

.transform() applies the fitted values to the respective columns!
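Side note: fit and transform can also be collapsed into a single call; fit_transform is part of scikit-learn's standard transformer API and gives the same result as the two-step version above:

Python
imputer_settings = SimpleImputer(missing_values = np.nan, strategy = 'mean')
X[:, 1:3] = imputer_settings.fit_transform(X[:, 1:3])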

CATEGORICAL DATA

Country and Purchased are categorical variables (they contain categories: France, Spain, ... and Yes, No).

ML models are based on math equations, so keeping text categories would be a problem; we label-encode them into numbers:

Python
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
# label-encode only the Country column:
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Doing this encodes France=0, Germany=1, Spain=2... and that causes a problem: we must prevent the ML equations from thinking that Spain is greater than Germany, when there is no ordinal relation between the countries. The fix is to split the column into dummy variables, one per country (one-hot encoding), as sketched below.
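A sketch with the current scikit-learn API (ColumnTransformer is how recent versions select the column to encode; the variable names ct and labelencoder_y are mine):

Python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# one-hot encode column 0 (Country) into dummy variables;
# 'passthrough' keeps the remaining columns (Age, Salary) untouched
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])],
                       remainder = 'passthrough')
X = np.array(ct.fit_transform(X))

# the dependent variable (Yes/No) has only two categories,
# so a plain label encoding is enough here
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)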
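SPLITTING THE DATASET INTO THE TRAINING SET AND THE TEST SET

We build the model on the training set and evaluate it on the test set. A minimal sketch (train_test_split is the current scikit-learn helper; the random_state value is arbitrary, any fixed number makes the split reproducible):

Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

test_size = 0.2 -> 20% of the dataset goes to the test set, and the remaining 0.8 -> 80% goes to the training set. random_state is set so we get the same result after every run. After the split we have: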

  • Training set: contains 8 observations each of X_train (matrix of features) and y_train (dependent variable)
  • Test set: contains 2 observations each of X_test (matrix of features) and y_test (dependent variables to be predicted)

Now our ML model should be able to predict the dependent variables in the test set. The better it learns the correlations in the training set, the better it will predict. The model must not learn the training set by heart but understand it, so that it still works when you change the dataset; learning it by heart is called OVERFITTING (see the regression section), and regularization techniques help avoid it.

FEATURE SCALING

Open the dataset: Age and Salary are not on the same scale. Age goes from 27 to 50, while Salary goes from 40k to 90k. Because of this, some ML models will run into issues, since they are based on the Euclidean distance between two points P1 and P2:

sqrt[ (x2-x1)^2 + (y2-y1)^2 ] 
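To see the domination concretely (the numbers are picked from the ranges above, purely for illustration): take P1 = (27, 48000) and P2 = (48, 80000). Then (x2-x1)^2 = 21^2 = 441, while (y2-y1)^2 = 32000^2 = 1,024,000,000, so age contributes essentially nothing to the distance.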

Since Salary has a much wider range, it would dominate the Age values, so we scale our features:

Python
# STANDARDIZATION
#
#   x_stand = (x - mean(x)) / standard_deviation(x)
#
# for each observation of a feature you subtract the mean of all
# the values of that feature and divide by its standard deviation
#
# NORMALIZATION
#
#   x_norm = (x - min(x)) / (max(x) - min(x))
#
# you subtract the minimum of all the feature values from your
# observation and divide by the difference between the maximum
# and the minimum of the feature values
#
# with this we put our variables in the same range, so no variable
# dominates the other!
#
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# when you apply your StandardScaler object (sc_X) to your training set,
# you have to fit the object on the training set
# and then transform it (for the test set we will only transform)
X_train = sc_X.fit_transform(X_train)
#
# we don't need to fit the sc_X object to the test set because it's
# already fitted to the training set
X_test = sc_X.transform(X_test)

It is important to fit sc_X on X_train so that X_test is scaled on the same basis! Now there are two questions we can ask ourselves:

1. Do we need to fit and transform the dummy variables? It depends on the context, and on how much interpretability you want to keep in your model: everything will be on the same scale, but we may lose track of which observation belongs to which country!

It won't break your model, but you as a user will lose the interpretation of those values.

We do it here at the end because we don't need that interpretation for now, but keep in mind that usually you do!

NOTE: even models that aren't based on Euclidean distances benefit from feature scaling, because the algorithm will converge much faster! Case in point: decision trees are not based on Euclidean distances, yet we still feature scale, otherwise they can run for a very long time!

2. Do we need to feature scale y_train and y_test? In our case the answer is no, because this is a classification problem with a categorical dependent variable. But we will see that for regression, when the dependent variable takes a huge range of values, we will need to apply feature scaling to the dependent variable as well. Now we have an accurate and fast-converging model!

DATA PRE-PROCESSING TEMPLATE

Here we create a file in which we just import the libraries and the dataset, but we don't handle missing data or categorical data. We add feature scaling only as a comment, because some Python/R libraries require us to feature scale our data ourselves, but most of them will do it for us.
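Putting the pieces of this page together, a minimal template could look like the sketch below (it assumes the Data.csv layout used above, with the dependent variable in column 3):

Python
# Data Preprocessing Template
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# splitting the dataset into the training set and the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# feature scaling (left commented: uncomment when a library
# does not scale the data for you)
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# X_train = sc_X.fit_transform(X_train)
# X_test = sc_X.transform(X_test)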

SUMMARY:

data science, missing data, categorical data, feature scaling