The N2 dataset

The N2 dataset provides features that are hypothesized to be informative for a progression to psychosis. This notebook provides an overview of N2.

[68]:
# Author: Rolf Carlson, Carlson Research LLC, <hrolfrc@gmail.com>
# License: 3-clause BSD
[69]:
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split

Get the input file path from the calf project

[70]:
input_file_path = "../../../data/n2.csv"

Read the input file into a DataFrame

[71]:
df = pandas.read_csv(input_file_path, header=0, sep=",")
df.head()
[71]:
ctrl/case ADIPOQ SERPINA3 AMBP A2M ACE AGT APOA1 APOA2 APOA4 ... CALCA IL6 LTA CSF3 PGF GCG_0001 IL1B TGFB3 FGF2 MDA-LDL
0 0 1.1538 -1.008 0.4650 -0.6181 -0.9350 1.7169 0.974 1.7821 -0.2580 ... -0.3688 0.8739 -0.2390 2.8335 0.4469 0.101 0.1688 -0.1861 1.9591 -0.0720
1 0 -0.7661 -1.039 1.2479 0.2220 -0.7140 2.6709 -0.275 0.1680 0.9759 ... -0.3688 1.5408 3.5482 -0.6669 -0.7770 1.015 0.1688 0.2689 -0.3498 -0.5491
2 0 -0.2721 -0.766 -0.7480 -1.0371 0.0459 -0.2940 0.046 0.9320 -0.5050 ... 0.2562 0.2060 -0.8099 -0.6669 -0.7770 0.372 -0.5152 -0.1861 -0.3498 -0.5491
3 0 -0.8201 -1.281 0.4650 -0.1980 0.8059 0.8980 -0.954 -0.4270 -0.5050 ... -0.3688 -0.5718 -0.8099 -0.6669 -0.5050 -0.543 0.1688 0.2689 -0.5978 -0.5491
4 0 0.0019 -1.188 -0.7090 0.6421 0.2039 -0.3500 -0.275 0.5070 0.2349 ... 0.0612 1.8737 1.0388 -0.6669 0.4469 0.575 -0.2332 -0.4131 0.4472 -0.5491

5 rows × 136 columns

Remove the outcome column to get the independent variables

[72]:
X = df.loc[:, df.columns != 'ctrl/case']
X.head()
[72]:
ADIPOQ SERPINA3 AMBP A2M ACE AGT APOA1 APOA2 APOA4 APOH ... CALCA IL6 LTA CSF3 PGF GCG_0001 IL1B TGFB3 FGF2 MDA-LDL
0 1.1538 -1.008 0.4650 -0.6181 -0.9350 1.7169 0.974 1.7821 -0.2580 2.6529 ... -0.3688 0.8739 -0.2390 2.8335 0.4469 0.101 0.1688 -0.1861 1.9591 -0.0720
1 -0.7661 -1.039 1.2479 0.2220 -0.7140 2.6709 -0.275 0.1680 0.9759 0.4610 ... -0.3688 1.5408 3.5482 -0.6669 -0.7770 1.015 0.1688 0.2689 -0.3498 -0.5491
2 -0.2721 -0.766 -0.7480 -1.0371 0.0459 -0.2940 0.046 0.9320 -0.5050 -0.0990 ... 0.2562 0.2060 -0.8099 -0.6669 -0.7770 0.372 -0.5152 -0.1861 -0.3498 -0.5491
3 -0.8201 -1.281 0.4650 -0.1980 0.8059 0.8980 -0.954 -0.4270 -0.5050 1.3320 ... -0.3688 -0.5718 -0.8099 -0.6669 -0.5050 -0.543 0.1688 0.2689 -0.5978 -0.5491
4 0.0019 -1.188 -0.7090 0.6421 0.2039 -0.3500 -0.275 0.5070 0.2349 -0.6120 ... 0.0612 1.8737 1.0388 -0.6669 0.4469 0.575 -0.2332 -0.4131 0.4472 -0.5491

5 rows × 135 columns

[73]:
# computing number of rows
rows = len(X.axes[0])

# computing number of columns
cols = len(X.axes[1])

print("Number of Rows (data points): ", rows)
print("Number of Columns (features or variables): ", cols)
Number of Rows (data points):  72
Number of Columns (features or variables):  135
[74]:
Y = df['ctrl/case']

Y represents whether the individuals became psychotic (1) or not (0). Y is a Pandas series.

[75]:
Y.head()
[75]:
0    0
1    0
2    0
3    0
4    0
Name: ctrl/case, dtype: int64
[76]:
Y.describe()
[76]:
count    72.000000
mean      0.444444
std       0.500391
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000
Name: ctrl/case, dtype: float64

The individuals who did not progress to psychosis are labeled non_psychotic.

[77]:
non_psychotic = Y[Y == 0]
non_psychotic.head()
[77]:
0    0
1    0
2    0
3    0
4    0
Name: ctrl/case, dtype: int64

The individuals who progressed to psychosis are labeled pre_psychotic.

[78]:
pre_psychotic = Y[Y == 1]
[79]:
pre_psychotic.head()
[79]:
40    1
41    1
42    1
43    1
44    1
Name: ctrl/case, dtype: int64
[80]:
Y_names = Y.replace({0: 'non_psychotic', 1: 'pre_psychotic'})
Y_names
[80]:
0     non_psychotic
1     non_psychotic
2     non_psychotic
3     non_psychotic
4     non_psychotic
          ...
67    pre_psychotic
68    pre_psychotic
69    pre_psychotic
70    pre_psychotic
71    pre_psychotic
Name: ctrl/case, Length: 72, dtype: object

Fit N2

[81]:
lr = LogisticRegression().fit(X, Y)
[82]:
importance = lr.coef_.tolist()[0]
# Find the 5 most important coefficients and features
# Operate on the coefficients and features as tuples
# maintain the association as they are sorted and selected.
tup = list(zip(importance, X.columns))
tup = sorted(tup,  key=lambda c: abs(c[0]), reverse=True)[0:10]
tup = [c for c in tup if not math.isclose(c[0], 0)]

s, f = list(zip(*tup))

fig, ax = plt.subplots()
ax.bar(height=s, x=f)

plt.xticks(rotation=50)
plt.title("Feature importances via coefficients")
plt.show()
../../_images/notebooks_n2_n2_describe_22_0.png

Most important coefficients and features written as a sum

[83]:
i, j = tup[0]
s = str(round(i, 3)) + " " +str(j)
for i, j in tup[1:]:
    if i < 0:
        sgn = ' '
    else:
        sgn = ' + '
    s = s + sgn + str(round(i, 3)) + " " +str(j)
print(s)
0.805 MMP7 + 0.75 TSHB + 0.614 MDA-LDL -0.602 CXCL10 -0.558 ADIPOQ + 0.502 IL1B -0.469 MMP1 + 0.456 FTL + 0.434 EGFR + 0.431 CXCL5

The ROC AUC score

[84]:
y_score = lr.predict(X)
# Using all the data should result in an AUC near 1
roc_auc_score(Y, y_score)
[84]:
1.0

ROC curve with an AUC = 1

[85]:
RocCurveDisplay.from_predictions(
    Y,
    y_score,
    name="Conversion to Psychosis",
    color="darkorange",
)
plt.plot([0, 1], [0, 1], "k--", label="chance level (AUC = 0.5)")
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves")
plt.legend()
plt.show()
../../_images/notebooks_n2_n2_describe_28_0.png

The classifier does not learn from the training data.

[86]:
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), Y.copy(),
    random_state=42
)
[87]:
y_score =  LogisticRegression().fit(X_train, y_train).predict(X_test)
# Using all the data should result in an AUC near 1
roc_auc_score(y_test, y_score)
[87]:
0.48701298701298695