Sentiment analysis with sparse feature matrices
Predict sentiment on the IMDB movie review dataset using Calf. This example shows that Calf can efficiently fit and predict using the sparse feature matrices produced by TfidfVectorizer. For this small example, we select 400 movie reviews out of 25000: 200 for training and 200 for testing. Even with this limited number of samples, the bag-of-words model expands the number of feature columns to 9576. Calf learns the training set well. As expected, Calf needs more examples to show skill at predicting unseen reviews.
Author: Rolf Carlson, Carlson Research LLC, hrolfrc@gmail.com
License: 3-clause BSD
Imports
[1]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from calfcv import Calf
Import the sentiment data
[2]:
im_train = load_files('../../data/imdb/train/', shuffle=False)
im_test = load_files('../../data/imdb/test/', shuffle=False)
Create the train and test datasets
[3]:
# Vectorize the train and test reviews together so both share one vocabulary.
corpus = im_train.data + im_test.data
y = list(im_train.target) + list(im_test.target)
y_train, y_test = list(im_train.target), list(im_test.target)
X = TfidfVectorizer().fit_transform(corpus)
# Split the sparse matrix back into its train and test rows.
n_train = len(im_train.data)
X_train = X[:n_train, :]
X_test = X[n_train:, :]
# Check that the labels and the feature rows line up.
print(len(y_train), X_train.shape)
assert len(y_train) == X_train.shape[0], "y_train wrong shape"
print(len(y_test), X_test.shape)
assert len(y_test) == X_test.shape[0], "y_test wrong shape"
200 (200, 9576)
200 (200, 9576)
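The cell above fits the vectorizer on the training and testing text together, so the testing reviews contribute to the vocabulary and the IDF weights. A common alternative, sketched here on a toy corpus, is to fit the vectorizer on the training text only and reuse it for the testing text. This is not the author's method; the toy documents and the names `train_docs`, `test_docs`, `X_train_toy` and `X_test_toy` are assumptions made for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews standing in for the IMDB text (hypothetical data).
train_docs = ["a great film", "a dull film"]
test_docs = ["great acting"]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary; words unseen during fitting are ignored.
X_test_toy = vectorizer.transform(test_docs)

# Both results are SciPy sparse matrices with the same number of columns.
print(issparse(X_train_toy), X_train_toy.shape, X_test_toy.shape)
```

Both matrices share the column layout learned from the training text, so an estimator fitted on one can predict on the other.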
Make predictions
Predict the corpus
Because Calf is fitted and evaluated on the same corpus, the AUC should be near 1.
[4]:
clf = Calf().fit(X, y)
y_pred = clf.predict(X)
roc_auc_score(y, y_pred)
[4]:
0.9425
Having learned the whole corpus, Calf also predicts the 200 training reviews well. This is expected, because those reviews were part of the data used for fitting.
[5]:
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)
[5]:
0.9349999999999999
The same corpus-fitted model also scores well on the testing reviews, which it has likewise seen during fitting.
[6]:
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
[6]:
0.945
Training and predicting the training data results in a high AUC, even for a small number of samples.
[7]:
clf = Calf().fit(X_train, y_train)
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)
[7]:
0.9099999999999999
Similarly, Calf learns the testing data.
[8]:
clf = Calf().fit(X_test, y_test)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
[8]:
0.93
Learn on training data and predict the testing data
Calf does not have enough training data to show skill at predicting the testing set.
[9]:
clf = Calf().fit(X_train, y_train)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
[9]:
0.61
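The gap between the AUC on data the model has seen and the AUC on held-out data can be measured in one step. The helper below is a minimal sketch: `train_test_auc` is a hypothetical name, the data is synthetic rather than the IMDB reviews, and `LogisticRegression` is used only as a stand-in estimator so that the block runs without `calfcv`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def train_test_auc(estimator, X_train, y_train, X_test, y_test):
    """Fit on the training split and return (train AUC, test AUC)."""
    estimator.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, estimator.predict(X_train))
    auc_test = roc_auc_score(y_test, estimator.predict(X_test))
    return auc_train, auc_test


# Synthetic stand-in data, not the IMDB reviews.
X_demo, y_demo = make_classification(n_samples=400, n_features=50, random_state=0)

# Stand-in estimator; any object with fit and predict should work here.
auc_train, auc_test = train_test_auc(
    LogisticRegression(max_iter=1000),
    X_demo[:200], y_demo[:200], X_demo[200:], y_demo[200:],
)
print(f"train AUC {auc_train:.3f}, test AUC {auc_test:.3f}")
```

With the notebook's variables, the equivalent call would be `train_test_auc(Calf(), X_train, y_train, X_test, y_test)`, since Calf is used above through the same `fit` and `predict` methods.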
Predicting the probability of the sentiment class also requires additional training data.
[10]:
y_pred = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred)
[10]:
0.6234500000000001
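The last two cells score different things: the output of `predict` in cell [9] and the class-1 column of `predict_proba` in cell [10]. Assuming `predict` returns hard class labels, the AUC in [9] reflects a single decision threshold, while the AUC in [10] reflects how well the probabilities rank positive reviews above negative ones. The sketch below illustrates the difference on made-up numbers; `pairwise_auc` is a hypothetical helper written for this illustration, not part of scikit-learn or calfcv.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.4, 0.9]             # hypothetical class-1 probabilities
labels = [int(p >= 0.5) for p in proba]  # hard labels at a 0.5 threshold

auc_proba = pairwise_auc(y_true, proba)    # every positive outranks every negative
auc_labels = pairwise_auc(y_true, labels)  # thresholding creates ties
print(auc_proba, auc_labels)
```

In this toy case the probabilities rank every pair correctly, while the thresholded labels tie one positive with both negatives and lose part of the score. The two notebook values need not differ in the same direction; the point is only that they measure different outputs.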