Basic Regression: Predicting Fuel Efficiency in TensorFlow & Python

Requirements:

Make sure to first install these packages in your virtual environment:

# Use seaborn for pairplot
pip install -q seaborn

# Use some functions from tensorflow_docs
pip install -q git+https://github.com/tensorflow/docs

If you run into problems (e.g. SSL errors or timeouts), try:

pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip setuptools

pip install <package_name>

Needed Libraries

Python
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

The Dataset: Auto MPG

"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes."

(Quinlan, 1993), UCI Machine Learning Repository

A list of datasets for machine learning: https://archive.ics.uci.edu/ml/

Discrete vs Continuous:

An example may help: suppose a table in your database has a column that stores the temperature of the day, or of a furnace. The values in that column come from a continuous domain of temperature values. If instead the table has a column named gender, that column is discrete, in the sense that only two or maybe three values make up its domain.

(adapted from Stack Overflow)
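
To make this concrete, here is a minimal sketch (the columns are hypothetical, not from the Auto MPG data) contrasting a continuous column with a discrete one:

Python
import pandas as pd

# 'temperature' is continuous: any real value within a range can occur.
# 'gender' is discrete: only a handful of values make up its domain.
df = pd.DataFrame({
    "temperature": [21.4, 19.8, 23.1, 20.6],
    "gender": ["M", "F", "F", "M"],
})
print(df["temperature"].nunique())  # 4: every reading is distinct
print(df["gender"].nunique())       # 2

Download the dataset: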
Python
dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path # print path to file

keras.utils.get_file(fname, origin) Downloads a file from a URL if it is not already in the cache (by default under ~/.keras/datasets) and returns the path to the cached file.

Import it using pandas

Python
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset.tail()

From pandas.read_csv():
na_values
: Additional strings to recognize as NaN. When it encounters "?" it will replace it with NaN.
comment
: Character that marks the rest of a line as a comment; here "\t" and anything after it won't be parsed.
sep
: Delimiter to use.
skipinitialspace
: Skips spaces after delimiter.

At this point the dataset will contain unknown values:

Python
dataset.isna().sum()

pandas.DataFrame.sum() Return the sum of the values over the requested axis. Here, isna() yields a boolean DataFrame and sum() counts the True values (the missing entries) per column, since True is treated as 1.

pandas.DataFrame.isna() Detect missing values. Returns a boolean same-sized object indicating whether the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Drop those rows to clean the data:

Python
dataset = dataset.dropna() # drop NaN values

The "Origin" column is really categorical, not numeric. So convert that to a one-hot:

Python
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

pandas.Series.map() Map values of Series according to input correspondence. Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
arg : function, collections.abc.Mapping subclass or Series Mapping correspondence.
na_action : {None, ‘ignore’}, default None If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.

pandas.get_dummies() Convert categorical variable into dummy/indicator variables.
prefix : String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep : If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

pandas.DataFrame.tail() Return the last n rows. This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows. For negative values of n , this function returns all rows except the first n rows, equivalent to df[n:].
n : (int, default = 5)
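
As a quick sanity check (a sketch), the one-hot step should have replaced the Origin column with three indicator columns:

Python
# 'Origin' is gone; 'USA', 'Europe' and 'Japan' indicator columns replace it
print('Origin' in dataset.columns)                        # False
print([c for c in ('USA', 'Europe', 'Japan') if c in dataset.columns])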

Split the dataset into Training & Testing sets

Python
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

pandas.DataFrame.sample Return a random sample of items from an axis of object. You can use random_state for reproducibility.

pandas.DataFrame.drop Drop specified labels from rows or columns. Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
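
Because the test set is built by dropping the training indices, the two sets should be disjoint and together cover the whole dataset; a quick check (a sketch):

Python
# The 80/20 split should partition the data with no overlap
assert len(train_dataset) + len(test_dataset) == len(dataset)
assert train_dataset.index.intersection(test_dataset.index).empty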

Inspect the data:

Python
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
Figure: pairplot of the joint distributions of MPG, Cylinders, Displacement and Weight

seaborn.pairplot(data, diag_kind, hue) :

data : DataFrame
Tidy (long-form) dataframe where each column is a variable and each row is an observation.

diag_kind : {‘auto’, ‘hist’, ‘kde’, None}, optional
Kind of plot for the diagonal subplots. The default depends on whether hue is used or not.

hue : string (variable name), optional
Variable in data to map plot aspects to different colors.

Look at the overall statistics:

Python
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats

pandas.DataFrame.describe Generate descriptive statistics.
pandas.DataFrame.transpose Transpose index and columns. Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T is an accessor to the method transpose().

Split features from labels

Python
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

pandas.DataFrame.pop(label) Return the item and drop it from the frame; label is the column to pop. Raises KeyError if not found.

Normalize the data

We need to scale the data: notice in the statistics above how different the ranges of the features are. Features on very different scales make training slower and more sensitive to the choice of learning rate.

Python
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
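
Note that the test data is deliberately normalized with the training statistics: the model must see test inputs projected into the same distribution it was trained on. As a quick sanity check (a sketch), each normalized training column should end up with mean close to 0 and standard deviation close to 1:

Python
# After normalization the training features should be roughly standardized
print(normed_train_data.describe().loc[['mean', 'std']].round(2))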

Build the Model

Python
def build_model():
  model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()
model.summary()

Try the model by making some predictions on a batch of 10 examples:

Python
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result
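
Even untrained, the model should produce an output of the expected shape: one prediction per example (a sketch of the check):

Python
# Dense(1) output layer -> one value per input row
assert example_result.shape == (10, 1)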

Train the model for 1000 epochs on the normalized data:

Python
EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])

Visualize the stats stored in the history object:

Python
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Plot the progress:

Python
plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)
plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 10])
plt.ylabel('MAE [MPG]')
Figure: training and validation MAE over epochs (MPG)

Python
plotter.plot({'Basic': history}, metric = "mse")
plt.ylim([0, 20])
plt.ylabel('MSE [MPG^2]')

Figure: training and validation MSE over epochs (MPG^2)

This graph shows the validation error degrading after about 100 epochs. Let's automatically stop training when the validation score stops improving.

We'll use the EarlyStopping callback, which tests a training condition after every epoch: training stops once a monitored metric has stopped improving.

Python
model = build_model()

# The patience parameter is the number of epochs to wait for improvement before stopping
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(normed_train_data, train_labels, 
                    epochs=EPOCHS, validation_split = 0.2, verbose=0, 
                    callbacks=[early_stop, tfdocs.modeling.EpochDots()])
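
A common refinement (assuming a reasonably recent tf.keras) is to also roll the model back to its best-performing weights when training stops, via the restore_best_weights flag:

Python
# Sketch: keep the weights from the epoch with the best val_loss,
# not the weights from the final (possibly worse) epoch
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                           restore_best_weights=True)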

Check validation error again (we're looking for improvements):

Python
plotter.plot({'Early Stopping': early_history}, metric = "mae")
plt.ylim([0, 10])
plt.ylabel('MAE [MPG]')
# You can see improvements
Figure: using early stopping improves the model's quality

Let's evaluate the model on the test set; evaluate returns the loss, the mean absolute error (MAE) and the mean squared error (MSE), and we print the MAE:

Python
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)
print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))

Make predictions

Python
test_predictions = model.predict(normed_test_data).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
lims = [0, 50]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)
Figure: predicted vs. true MPG values on the test set

Take a look at the model's error distribution:

Python
error = test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")
Figure: distribution of prediction errors
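
To summarize the error distribution numerically (a sketch):

Python
# A mean error near 0 means the model is not systematically biased
print("Mean error: {:5.2f} MPG".format(error.mean()))
print("Std of error: {:5.2f} MPG".format(error.std()))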

Pro tips:

- If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting.

- Early stopping is a useful technique to prevent overfitting.
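
To illustrate the first tip, a smaller variant of build_model (a sketch, not from the original tutorial) could shrink the hidden layers:

Python
def build_small_model():
  # Fewer and narrower hidden layers reduce capacity, and with it the
  # risk of overfitting a small training set
  model = keras.Sequential([
    layers.Dense(16, activation='relu',
                 input_shape=[len(train_dataset.keys())]),
    layers.Dense(1)
  ])
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.RMSprop(0.001),
                metrics=['mae', 'mse'])
  return model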
Python

#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
Tags: python, regression, tensorflow, machine learning, ml