Sentiment analysis with sparse feature matrices

Predict sentiment on the IMDB movie review dataset using Calf. This example shows that Calf can efficiently fit and predict using the sparse feature matrices produced by TfidfVectorizer. For this small example, we select 400 movie reviews out of 25000: 200 for training and 200 for testing. Even with this limited number of samples, the bag-of-words model expands the number of feature columns to 9576. Calf learns each set it is trained on, reaching an AUC of 0.91 or higher when predicting the same data. As expected, Calf needs more examples to show skill at predicting unseen reviews.

Author: Rolf Carlson, Carlson Research LLC, hrolfrc@gmail.com

License: 3-clause BSD
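
To make the point about feature expansion concrete, here is a minimal pure-Python sketch of why a bag-of-words matrix is sparse. The three reviews are made up for illustration and are not taken from IMDB. The tokenizer is a simplified stand-in that uses the same regular expression as the default token pattern of TfidfVectorizer, and the code stores raw counts rather than TF-IDF weights.

```python
# Toy illustration (not the IMDB data): a bag-of-words matrix is mostly zeros.
import re

reviews = [
    "a wonderful, moving film",
    "dull plot and wooden acting",
    "wonderful acting, dull ending",
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokens = [tokenize(r) for r in reviews]

# One column per distinct word in the whole corpus.
vocabulary = sorted({t for doc in tokens for t in doc})
column = {word: j for j, word in enumerate(vocabulary)}

# Store only the nonzero counts of each row, as a sparse format would.
rows = []
for doc in tokens:
    counts = {}
    for t in doc:
        counts[column[t]] = counts.get(column[t], 0) + 1
    rows.append(counts)

n_rows, n_cols = len(rows), len(vocabulary)
nonzeros = sum(len(r) for r in rows)
density = nonzeros / (n_rows * n_cols)
print((n_rows, n_cols), nonzeros, round(density, 3))
```

Each review uses only a few of the columns, so the fraction of nonzero entries shrinks as the vocabulary grows. That is why a sparse representation is the natural format for the 9576-column matrix in this example.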

Imports

[1]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from calfcv import Calf

Load the sentiment data

[2]:
im_train = load_files('../../data/imdb/train/', shuffle=False)
im_test = load_files('../../data/imdb/test/', shuffle=False)

Create the train and test datasets

[3]:
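# Vectorize train and test together so both share one vocabulary and column order.
corpus = im_train.data + im_test.data
y = list(im_train.target) + list(im_test.target)
y_train, y_test = list(im_train.target), list(im_test.target)
X = TfidfVectorizer().fit_transform(corpus)

# The first len(im_train.data) rows are the training reviews; the rest are the test reviews.
X_train = X[:len(im_train.data), :]
X_test = X[len(im_train.data):, :]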

print(len(y_train), X_train.shape)
assert len(y_train) == X_train.shape[0], "y_train wrong shape"
print(len(y_test), X_test.shape)
assert len(y_test) == X_test.shape[0], "y_test wrong shape"
200 (200, 9576)
200 (200, 9576)
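
Because the vectorizer above is fitted on the combined corpus, the 9576 columns include words that occur only in the test reviews. The following pure-Python sketch, on a made-up mini corpus with a hypothetical `vocabulary` helper, shows how the vocabulary changes when the test documents are left out of the fit. It illustrates the idea only and does not reproduce what TfidfVectorizer computes.

```python
# Made-up mini corpus, for illustration only.
train_docs = ["great film great cast", "boring film"]
test_docs = ["great soundtrack", "boring script weak cast"]

def vocabulary(docs):
    # Set of distinct whitespace-separated tokens.
    return {token for doc in docs for token in doc.split()}

combined_vocab = vocabulary(train_docs + test_docs)
train_vocab = vocabulary(train_docs)

# Columns that exist only because the test documents were part of the fit.
test_only = sorted(combined_vocab - train_vocab)
print(len(combined_vocab), len(train_vocab), test_only)
```

With scikit-learn, a common way to keep test vocabulary out of the features is to call `fit_transform` on the training documents and `transform` on the test documents. The column count and the AUC values would then differ from the ones shown in this example.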

Make predictions

Predict the corpus

Fitting Calf on the whole corpus and then predicting the same corpus should give an AUC near 1.

[4]:
clf = Calf().fit(X, y)
y_pred = clf.predict(X)
roc_auc_score(y, y_pred)
[4]:
0.9425

The classifier fitted on the whole corpus also predicts the 200 training reviews with a high AUC, since those reviews are part of the data it learned.

[5]:
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)
[5]:
0.9349999999999999

Having learned the whole corpus, Calf also shows skill at predicting the testing subset.

[6]:
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
[6]:
0.945

Training on the training data and then predicting the same data results in a high AUC, even with a small number of samples.

[7]:
clf = Calf().fit(X_train, y_train)
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)
[7]:
0.9099999999999999

Similarly, Calf learns the testing data when it is trained on it.

[8]:
clf = Calf().fit(X_test, y_test)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
[8]:
0.93

Learn on training data and predict the testing data

With only 200 training reviews, Calf does not have enough training data to show skill at predicting the testing set.

[9]:
clf = Calf().fit(X_train, y_train)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
[9]:
0.61

Predicting the probability of the sentiment class also requires additional training data.

[10]:
y_pred = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred)
[10]:
0.6234500000000001
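
The two AUC values above (0.61 from `predict`, about 0.62 from `predict_proba`) differ because the first is computed from hard 0/1 labels and the second from continuous scores. The sketch below uses a hand-written `auc` helper based on the pairwise ranking definition of AUC, with made-up labels and scores; it is not the scikit-learn implementation. It shows that with hard labels the AUC reduces to the mean of the true positive rate and the true negative rate, while scores use the full ranking.

```python
def auc(y_true, scores):
    # Probability that a random positive outranks a random negative,
    # counting ties as one half (the Mann-Whitney formulation of AUC).
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]
labels = [int(p >= 0.5) for p in proba]  # hard 0/1 predictions at a 0.5 threshold

auc_scores = auc(y_true, proba)
auc_labels = auc(y_true, labels)

# With hard labels, AUC equals the mean of the true positive and true negative rates.
n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
tpr = sum(l == 1 for y, l in zip(y_true, labels) if y == 1) / n_pos
tnr = sum(l == 0 for y, l in zip(y_true, labels) if y == 0) / n_neg
print(auc_scores, auc_labels, (tpr + tnr) / 2)
```

On a balanced test set such as the 200 reviews used here, the mean of the two rates is the plain accuracy, so an AUC of 0.61 from hard labels can be read as roughly 61% of test reviews classified correctly.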