
Text Classification: Classify Positive/Negative Movie Reviews with TensorFlow Hub (Transfer Learning)


YouTube video here.

This tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and Keras.

What is Transfer Learning?

Transfer Learning: applying previously learned skills or information to a new or similar context.

Transfer learning can:

  • Train a model with a smaller dataset
  • Speed up training
  • Improve generalization (the ability of a model to be effective on new, unseen data)

What is TensorFlow Hub?

TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models. "A module is a self-contained piece of a TensorFlow graph (https://www.tensorflow.org/api_docs/python/tf/Graph), along with its weights and assets, that can be reused across different tasks in a process known as transfer learning." (tensorflow.org)

The problem

Classify 25,000 of 50,000 movie reviews from the Internet Movie Database (IMDB) as positive or negative.

Needed Libraries

In your virtual environment, make sure to install the required packages, then import them:

Python
!pip install -q tensorflow-hub
!pip install -q tfds-nightly

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly()) # default: enabled
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Our Dataset

The IMDB (Internet Movie Database) dataset contains the text of 50,000 movie reviews:

  • Training set: 25,000 reviews
  • Testing set: 25,000 reviews

The training and testing sets contain an equal number of positive and negative reviews (they are balanced).

Python
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
  • split: tfds.Split or str, which split of the data to load. If None, a dict with all splits is returned (typically tfds.Split.TRAIN and tfds.Split.TEST; see the sketch after this list). Datasets are typically split into different subsets to be used at various stages of training and evaluation:
      • TRAIN: the training data.
      • VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
      • TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration, as you may overfit to it.
  • as_supervised: bool. If True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False (the default), the returned tf.data.Dataset will have a dictionary with all the features.
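As a quick sketch of the split=None behavior (the exact split names here are our assumption about imdb_reviews, not something stated above):

Python
# A minimal sketch: loading without `split` returns a dict keyed by split name.
all_splits = tfds.load(name="imdb_reviews")
print(all_splits.keys())  # for imdb_reviews: 'train', 'test', 'unsupervised'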

Print the first 2 reviews:

Python
train_examples_batch, train_labels_batch = next(iter(train_data.batch(2)))
train_examples_batch
Output from previous code

Building The Model

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors.

Word Embedding: words or phrases from the vocabulary are mapped to vectors of real numbers. (Wikipedia)

For example:
"good","wow","brilliant" -> 0.212, 0.513, 0.9
"bad", "eew", "what's wrong with you people" -> -0.2453, -0.5121, -1

We can use a pre-trained text embedding (called google/tf2-preview/gnews-swivel-20dim/1) as the first layer, which has three advantages:

  • We benefit from transfer learning: a model previously trained to produce embedding vectors is reused
  • We don't have to preprocess the text (the imported model takes care of that)
  • The embedding has a fixed size, so it's simpler to process

Let's try our pre-trained text embedding model. hub.KerasLayer() wraps a SavedModel (or a legacy hub.Module) as a Keras layer.

Note that no matter the length of the input text, the output shape of the embeddings is (num_examples, embedding_dimension).

Python
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
# take first 2 examples, compute embedding vectors (see picture)
hub_layer(train_examples_batch[:2])
Output from previous code: the calculated embeddings on the first two reviews

Now that we know this layer works, let's use it to build our full neural network!

Python
model = tf.keras.Sequential()
# we add our pre-trained model as first Layer!
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

There is only one neuron in the output layer because the two classes are mutually exclusive: if a review is "positive", it is not "negative", so a single score is enough.
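As an illustrative sketch (not part of the tutorial's training flow), the raw logit from that single neuron can be squashed into a probability with a sigmoid; at this point the Dense layers are untrained, so the values are meaningless:

Python
# The single output neuron produces a raw logit; a sigmoid turns it
# into a probability in [0, 1].
logits = model(train_examples_batch)
print(tf.sigmoid(logits))  # near 1 = positive, near 0 = negative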

Compile The Model

Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), we'll use the binary_crossentropy loss function.

Python
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
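A quick sketch of what from_logits=True means: the loss applies the sigmoid internally, so we can feed it raw logits directly:

Python
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
# A logit of 2.0 for a true label of 1 gives a small loss:
print(bce([[1.0]], [[2.0]]).numpy())  # ~ -log(sigmoid(2.0)) ~ 0.127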

Train Your Model

Train for 20 epochs in mini-batches of 512 samples. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:

Python
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

In the console you might see:

"tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence" (issue #31509)

This message is harmless; just move on once the fitting phase has completed!

Two tf.data.Dataset methods used above deserve a note:

  • batch(batch_size): the components of the resulting element will have an additional outer dimension, batch_size (or N % batch_size for the last element, if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on the batches having the same outer dimension, set drop_remainder=True to prevent the smaller batch from being produced.
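A minimal sketch of this behavior on a toy range dataset (not our review data):

Python
# 10 elements batched in threes: the last batch is smaller...
print([b.numpy() for b in tf.data.Dataset.range(10).batch(3)])
# ...unless drop_remainder=True, which drops the partial batch.
print([b.numpy() for b in tf.data.Dataset.range(10).batch(3, drop_remainder=True)])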

  • shuffle(buffer_size): fills a buffer with buffer_size elements (the number of elements from this dataset from which the new dataset will sample), then randomly samples elements from this buffer, replacing the selected elements with new ones. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
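Again a toy sketch: with a buffer smaller than the dataset the shuffle is only partial; with buffer_size >= dataset size it is a perfect shuffle:

Python
ds = tf.data.Dataset.range(100)
# buffer_size=5: each element can only move a few positions (partial shuffle).
print(list(ds.shuffle(5).take(10).as_numpy_iterator()))
# buffer_size=100 (>= dataset size): a full shuffle.
print(list(ds.shuffle(100).take(10).as_numpy_iterator()))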

Evaluate Your Model

Let's see how the model performs. Two values will be returned: the loss (a number representing our error; lower values are better) and the accuracy.

Python
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
  • zip(*iterables) aggregates elements from each of the iterables and returns an iterator of tuples:
Python
model.metrics_names #['loss', 'accuracy']
results #[0.31, 0.85]
list(zip(model.metrics_names, results)) #[ ('loss', 0.31), ('accuracy', 0.85) ]
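Finally, a hedged sketch of using the trained model on new, unseen reviews (the sample strings here are made up for illustration); since the hub layer handles raw text, we can pass strings directly:

Python
# Hypothetical reviews, not taken from the dataset.
samples = tf.constant(["A wonderful, heartfelt film with brilliant acting.",
                       "A total waste of two hours."])
logits = model.predict(samples)
print(tf.sigmoid(logits).numpy())  # close to 1 = positive, close to 0 = negative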
Python
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.