The N2 dataset

The N2 dataset provides features that are hypothesized to be informative for predicting progression to psychosis. This notebook gives an overview of N2.

[ ]:

# Author: Rolf Carlson, Carlson Research LLC, <hrolfrc@gmail.com>
# License: 3-clause BSD

[13]:

import pandas

Set the path to the input file in the calf project

[14]:

input_file_path = "../../../data/n2.csv"

Read the input file into a DataFrame

[15]:

df = pandas.read_csv(input_file_path, header=0, sep=",")
df.head()

[15]:

	ADIPOQ	SERPINA3	AMBP	A2M	ACE	AGT	APOA1	APOA2	APOA4	...	CALCA	IL6	LTA	CSF3	PGF	GCG_0001	IL1B	TGFB3	FGF2	MDA-LDL
0	1.1538	-1.008	0.4650	-0.6181	-0.9350	1.7169	0.974	1.7821	-0.2580	...	-0.3688	0.8739	-0.2390	2.8335	0.4469	0.101	0.1688	-0.1861	1.9591	-0.0720
1	-0.7661	-1.039	1.2479	0.2220	-0.7140	2.6709	-0.275	0.1680	0.9759	...	-0.3688	1.5408	3.5482	-0.6669	-0.7770	1.015	0.1688	0.2689	-0.3498	-0.5491
2	-0.2721	-0.766	-0.7480	-1.0371	0.0459	-0.2940	0.046	0.9320	-0.5050	...	0.2562	0.2060	-0.8099	-0.6669	-0.7770	0.372	-0.5152	-0.1861	-0.3498	-0.5491
3	-0.8201	-1.281	0.4650	-0.1980	0.8059	0.8980	-0.954	-0.4270	-0.5050	...	-0.3688	-0.5718	-0.8099	-0.6669	-0.5050	-0.543	0.1688	0.2689	-0.5978	-0.5491
4	0.0019	-1.188	-0.7090	0.6421	0.2039	-0.3500	-0.275	0.5070	0.2349	...	0.0612	1.8737	1.0388	-0.6669	0.4469	0.575	-0.2332	-0.4131	0.4472	-0.5491

5 rows × 136 columns

Remove the outcome column to get the independent variables

[16]:

X = df.loc[:, df.columns != 'ctrl/case']
X.head()

[16]:

	ADIPOQ	SERPINA3	AMBP	A2M	ACE	AGT	APOA1	APOA2	APOA4	APOH	...	CALCA	IL6	LTA	CSF3	PGF	GCG_0001	IL1B	TGFB3	FGF2	MDA-LDL
0	1.1538	-1.008	0.4650	-0.6181	-0.9350	1.7169	0.974	1.7821	-0.2580	2.6529	...	-0.3688	0.8739	-0.2390	2.8335	0.4469	0.101	0.1688	-0.1861	1.9591	-0.0720
1	-0.7661	-1.039	1.2479	0.2220	-0.7140	2.6709	-0.275	0.1680	0.9759	0.4610	...	-0.3688	1.5408	3.5482	-0.6669	-0.7770	1.015	0.1688	0.2689	-0.3498	-0.5491
2	-0.2721	-0.766	-0.7480	-1.0371	0.0459	-0.2940	0.046	0.9320	-0.5050	-0.0990	...	0.2562	0.2060	-0.8099	-0.6669	-0.7770	0.372	-0.5152	-0.1861	-0.3498	-0.5491
3	-0.8201	-1.281	0.4650	-0.1980	0.8059	0.8980	-0.954	-0.4270	-0.5050	1.3320	...	-0.3688	-0.5718	-0.8099	-0.6669	-0.5050	-0.543	0.1688	0.2689	-0.5978	-0.5491
4	0.0019	-1.188	-0.7090	0.6421	0.2039	-0.3500	-0.275	0.5070	0.2349	-0.6120	...	0.0612	1.8737	1.0388	-0.6669	0.4469	0.575	-0.2332	-0.4131	0.4472	-0.5491

5 rows × 135 columns
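
The same separation can also be written with `DataFrame.drop`. The sketch below is only an illustration: `toy`, `toy_X` and `toy_X_mask` are hypothetical names, and the small frame holds made-up values rather than the N2 data. It checks that dropping the outcome column gives the same result as the boolean-mask selection used above.

```python
import pandas

# Toy stand-in for the N2 frame: two feature columns plus the outcome column.
# The values are illustrative only, not taken from n2.csv.
toy = pandas.DataFrame({
    "ADIPOQ": [1.15, -0.77, -0.27],
    "SERPINA3": [-1.01, -1.04, -0.77],
    "ctrl/case": [0, 0, 1],
})

# drop returns a new frame without the outcome column; toy itself is unchanged.
toy_X = toy.drop(columns="ctrl/case")

# The boolean-mask selection keeps every column whose name is not 'ctrl/case'.
toy_X_mask = toy.loc[:, toy.columns != "ctrl/case"]

print(list(toy_X.columns))
print(toy_X.equals(toy_X_mask))
```

Either form leaves the original frame intact, so the outcome column remains available for building Y.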

[17]:

# number of rows (data points) and columns (features)
rows, cols = X.shape

print("Number of Rows (data points): ", rows)
print("Number of Columns (features or variables): ", cols)

Number of Rows (data points):  72
Number of Columns (features or variables):  135

[18]:

Y = df['ctrl/case']

Y records whether each individual progressed to psychosis (1) or not (0). Y is a pandas Series.

[19]:

Y.head()

[19]:

0    0
1    0
2    0
3    0
4    0
Name: ctrl/case, dtype: int64

[20]:

Y.describe()

[20]:

count    72.000000
mean      0.444444
std       0.500391
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000
Name: ctrl/case, dtype: float64
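
Because Y takes only the values 0 and 1, its mean is the fraction of individuals who progressed to psychosis: 0.444444 × 72 = 32 cases, which leaves 40 controls. A minimal sketch of reading the class balance directly with `value_counts` follows. The series `y` is a hypothetical stand-in built from those counts, not loaded from n2.csv, and its ordering (controls first) is an assumption.

```python
import pandas

# Stand-in for Y: 40 zeros followed by 32 ones. The counts follow from the
# summary above (count 72, mean 0.444444, i.e. 32/72); the ordering is assumed.
y = pandas.Series([0] * 40 + [1] * 32, name="ctrl/case")

# Count how many individuals fall in each class, ordered by label.
counts = y.value_counts().sort_index()
print(counts)

# For a 0/1 outcome, the mean equals the proportion of cases.
print(round(y.mean(), 6))
```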

The individuals who did not progress to psychosis are labeled non_psychotic.

[21]:

non_psychotic = Y[Y == 0]
non_psychotic.head()

[21]:

0    0
1    0
2    0
3    0
4    0
Name: ctrl/case, dtype: int64

The individuals who progressed to psychosis are labeled pre_psychotic.

[22]:

pre_psychotic = Y[Y == 1]

[23]:

pre_psychotic.head()

[23]:

40    1
41    1
42    1
43    1
44    1
Name: ctrl/case, dtype: int64

[24]:

Y_names = Y.replace({0: 'non_psychotic', 1: 'pre_psychotic'})
Y_names

[24]:

0     non_psychotic
1     non_psychotic
2     non_psychotic
3     non_psychotic
4     non_psychotic
          ...
67    pre_psychotic
68    pre_psychotic
69    pre_psychotic
70    pre_psychotic
71    pre_psychotic
Name: ctrl/case, Length: 72, dtype: object
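
The same boolean masks can select feature rows, since X and Y share an index. The following is a sketch under stated assumptions: `toy_X`, `toy_Y`, `toy_names`, `X_non`, `X_pre` and `group_means` are hypothetical names, and the values are made up rather than taken from N2. It splits the features by outcome and summarizes each group by its readable label.

```python
import pandas

# Toy stand-ins with a shared default index (illustrative values, not N2 data).
toy_X = pandas.DataFrame({
    "ADIPOQ": [1.15, -0.77, -0.27, 0.0],
    "IL6": [0.87, 1.54, 0.21, 1.87],
})
toy_Y = pandas.Series([0, 0, 1, 1], name="ctrl/case")

# Boolean masks built from the outcome pick out the matching feature rows.
X_non = toy_X[toy_Y == 0]
X_pre = toy_X[toy_Y == 1]

# Map the 0/1 codes to readable labels, then average each feature per group.
toy_names = toy_Y.map({0: "non_psychotic", 1: "pre_psychotic"})
group_means = toy_X.groupby(toy_names).mean()
print(group_means)
```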