The N2 dataset¶

The N2 dataset provides features that are hypothesized to be informative for a progression to psychosis. This notebook provides an overview of N2.

[68]:

# Author: Rolf Carlson, Carlson Research LLC, <hrolfrc@gmail.com>
# License: 3-clause BSD

[69]:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split

Get the input file path from the calf project

[70]:

input_file_path = "../../../data/n2.csv"

Read the input file into a DataFrame

[71]:

df = pandas.read_csv(input_file_path, header=0, sep=",")
df.head()

[71]:

	ADIPOQ	SERPINA3	AMBP	A2M	ACE	AGT	APOA1	APOA2	APOA4	...	CALCA	IL6	LTA	CSF3	PGF	GCG_0001	IL1B	TGFB3	FGF2	MDA-LDL
0	1.1538	-1.008	0.4650	-0.6181	-0.9350	1.7169	0.974	1.7821	-0.2580	...	-0.3688	0.8739	-0.2390	2.8335	0.4469	0.101	0.1688	-0.1861	1.9591	-0.0720
1	-0.7661	-1.039	1.2479	0.2220	-0.7140	2.6709	-0.275	0.1680	0.9759	...	-0.3688	1.5408	3.5482	-0.6669	-0.7770	1.015	0.1688	0.2689	-0.3498	-0.5491
2	-0.2721	-0.766	-0.7480	-1.0371	0.0459	-0.2940	0.046	0.9320	-0.5050	...	0.2562	0.2060	-0.8099	-0.6669	-0.7770	0.372	-0.5152	-0.1861	-0.3498	-0.5491
3	-0.8201	-1.281	0.4650	-0.1980	0.8059	0.8980	-0.954	-0.4270	-0.5050	...	-0.3688	-0.5718	-0.8099	-0.6669	-0.5050	-0.543	0.1688	0.2689	-0.5978	-0.5491
4	0.0019	-1.188	-0.7090	0.6421	0.2039	-0.3500	-0.275	0.5070	0.2349	...	0.0612	1.8737	1.0388	-0.6669	0.4469	0.575	-0.2332	-0.4131	0.4472	-0.5491

5 rows × 136 columns

Remove the outcome column to get the independent variables

[72]:

X = df.loc[:, df.columns != 'ctrl/case']
X.head()

[72]:

	ADIPOQ	SERPINA3	AMBP	A2M	ACE	AGT	APOA1	APOA2	APOA4	APOH	...	CALCA	IL6	LTA	CSF3	PGF	GCG_0001	IL1B	TGFB3	FGF2	MDA-LDL
0	1.1538	-1.008	0.4650	-0.6181	-0.9350	1.7169	0.974	1.7821	-0.2580	2.6529	...	-0.3688	0.8739	-0.2390	2.8335	0.4469	0.101	0.1688	-0.1861	1.9591	-0.0720
1	-0.7661	-1.039	1.2479	0.2220	-0.7140	2.6709	-0.275	0.1680	0.9759	0.4610	...	-0.3688	1.5408	3.5482	-0.6669	-0.7770	1.015	0.1688	0.2689	-0.3498	-0.5491
2	-0.2721	-0.766	-0.7480	-1.0371	0.0459	-0.2940	0.046	0.9320	-0.5050	-0.0990	...	0.2562	0.2060	-0.8099	-0.6669	-0.7770	0.372	-0.5152	-0.1861	-0.3498	-0.5491
3	-0.8201	-1.281	0.4650	-0.1980	0.8059	0.8980	-0.954	-0.4270	-0.5050	1.3320	...	-0.3688	-0.5718	-0.8099	-0.6669	-0.5050	-0.543	0.1688	0.2689	-0.5978	-0.5491
4	0.0019	-1.188	-0.7090	0.6421	0.2039	-0.3500	-0.275	0.5070	0.2349	-0.6120	...	0.0612	1.8737	1.0388	-0.6669	0.4469	0.575	-0.2332	-0.4131	0.4472	-0.5491

5 rows × 135 columns

[73]:

# computing number of rows
rows = len(X.axes[0])

# computing number of columns
cols = len(X.axes[1])

print("Number of Rows (data points): ", rows)
print("Number of Columns (features or variables): ", cols)

Number of Rows (data points):  72
Number of Columns (features or variables):  135

[74]:

Y = df['ctrl/case']

Y represents whether the individuals became psychotic (1) or not (0). Y is a Pandas series.

[75]:

Y.head()

[75]:

0    0
1    0
2    0
3    0
4    0
Name: ctrl/case, dtype: int64

[76]:

Y.describe()

[76]:

count    72.000000
mean      0.444444
std       0.500391
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000
Name: ctrl/case, dtype: float64

The individuals who did not progress to psychosis are labeled non_psychotic.

[77]:

non_psychotic = Y[Y == 0]
non_psychotic.head()

[77]:

0    0
1    0
2    0
3    0
4    0
Name: ctrl/case, dtype: int64

The individuals who progressed to psychosis are labeled pre_psychotic.

[78]:

pre_psychotic = Y[Y == 1]

[79]:

pre_psychotic.head()

[79]:

40    1
41    1
42    1
43    1
44    1
Name: ctrl/case, dtype: int64

[80]:

Y_names = Y.replace({0: 'non_psychotic', 1: 'pre_psychotic'})
Y_names

[80]:

0     non_psychotic
1     non_psychotic
2     non_psychotic
3     non_psychotic
4     non_psychotic
          ...
67    pre_psychotic
68    pre_psychotic
69    pre_psychotic
70    pre_psychotic
71    pre_psychotic
Name: ctrl/case, Length: 72, dtype: object

Fit N2¶

[81]:

lr = LogisticRegression().fit(X, Y)

[82]:

importance = lr.coef_.tolist()[0]
# Find the 5 most important coefficients and features
# Operate on the coefficients and features as tuples
# maintain the association as they are sorted and selected.
tup = list(zip(importance, X.columns))
tup = sorted(tup,  key=lambda c: abs(c[0]), reverse=True)[0:10]
tup = [c for c in tup if not math.isclose(c[0], 0)]

s, f = list(zip(*tup))

fig, ax = plt.subplots()
ax.bar(height=s, x=f)

plt.xticks(rotation=50)
plt.title("Feature importances via coefficients")
plt.show()

../../_images/notebooks_n2_n2_describe_22_0.png

Most important coefficients and features written as a sum¶

[83]:

i, j = tup[0]
s = str(round(i, 3)) + " " +str(j)
for i, j in tup[1:]:
    if i < 0:
        sgn = ' '
    else:
        sgn = ' + '
    s = s + sgn + str(round(i, 3)) + " " +str(j)
print(s)

0.805 MMP7 + 0.75 TSHB + 0.614 MDA-LDL -0.602 CXCL10 -0.558 ADIPOQ + 0.502 IL1B -0.469 MMP1 + 0.456 FTL + 0.434 EGFR + 0.431 CXCL5

The ROC AUC score¶

[84]:

y_score = lr.predict(X)
# Using all the data should result in an AUC near 1
roc_auc_score(Y, y_score)

[84]:

1.0

ROC curve with an AUC = 1¶

[85]:

RocCurveDisplay.from_predictions(
    Y,
    y_score,
    name="Conversion to Psychosis",
    color="darkorange",
)
plt.plot([0, 1], [0, 1], "k--", label="chance level (AUC = 0.5)")
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves")
plt.legend()
plt.show()

../../_images/notebooks_n2_n2_describe_28_0.png

The classifier does not learn from the training data.¶

[86]:

X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), Y.copy(),
    random_state=42
)

[87]:

y_score =  LogisticRegression().fit(X_train, y_train).predict(X_test)
# Using all the data should result in an AUC near 1
roc_auc_score(y_test, y_score)

[87]:

0.48701298701298695