The AUC from example 1 of the Calf paper
In example 1 of the Calf paper [1], calfpy yields an AUC of 0.875. In this notebook, the calfcv package yields AUCs of 0.78 with Calf and 0.82 with CalfCV, both computed on the data used for fitting.
[1]:
# Author: Rolf Carlson, Carlson Research LLC, <hrolfrc@gmail.com>
# License: 3-clause BSD
Get the data
[2]:
import pandas as pd
from sklearn.metrics import roc_auc_score
from calfcv import CalfCV, Calf
[3]:
input_file = "../../../data/n2.csv"
df = pd.read_csv(input_file, header=0, sep=",")
# The input data is everything except the first column, 'ctrl/case'
X = df.loc[:, df.columns != 'ctrl/case']
# The outcomes (diagnoses) are in the first column, 'ctrl/case'
Y = df['ctrl/case']
# The header row is the feature set
features = list(X.columns)
# Label the outcomes
Y_names = Y.replace({0: 'non_psychotic', 1: 'pre_psychotic'})
# glmnet requires float64
x = X.to_numpy(dtype='float64')
y = Y.to_numpy(dtype='float64')
Data overview
Here we look at the feature names, the total number of values, the shape of the data (samples by features), and the class balance.
[4]:
features[0:5]
[4]:
['ADIPOQ', 'SERPINA3', 'AMBP', 'A2M', 'ACE']
[5]:
x.size
[5]:
9720
[6]:
x.shape
[6]:
(72, 135)
[7]:
print(list(Y).count(1), list(Y).count(0))
32 40
[8]:
len(y)
[8]:
72
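As a quick arithmetic check on the outputs above, the element count should equal rows times columns, and the class counts should sum to the number of samples. The sketch below hard-codes the values printed by the notebook; the variable names are illustrative only, and it assumes that label 1 marks the cases.

```python
# Values copied from the notebook outputs above.
n_samples, n_features = 72, 135   # x.shape
n_elements = 9720                 # x.size
n_cases, n_controls = 32, 40      # counts of 1 and 0 in Y (assumes 1 = case)

# x.size is the product of the two dimensions.
assert n_samples * n_features == n_elements

# The class counts account for every sample.
assert n_cases + n_controls == n_samples

# Fraction of samples labelled 1.
prevalence = n_cases / n_samples
print(f"prevalence of cases: {prevalence:.3f}")
```

The classes are mildly imbalanced, which is one reason to report AUC rather than plain accuracy.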
Predict diagnoses
[9]:
y_pred = Calf().fit(x, y).predict_proba(x)
roc_auc_score(y, y_pred[:, 1])
[9]:
0.78359375
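The AUC returned by `roc_auc_score` can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal pure-Python sketch of that definition on made-up toy scores; `pairwise_auc` is a hypothetical helper written for illustration, not part of calfcv or scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, scores_toy)
print(auc_toy)
```

Because only the ordering of the scores matters, any strictly increasing transformation of the class-1 probabilities gives the same AUC.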
The class probabilities predicted by Calf
[10]:
y_pred
[10]:
array([[0.73105858, 0.26894142],
[0.65967676, 0.34032324],
[0.67352505, 0.32647495],
[0.47684355, 0.52315645],
[0.49403688, 0.50596312],
[0.63393725, 0.36606275],
[0.55006102, 0.44993898],
[0.66281609, 0.33718391],
[0.62506968, 0.37493032],
[0.59501702, 0.40498298],
[0.71191589, 0.28808411],
[0.55821739, 0.44178261],
[0.52240219, 0.47759781],
[0.71537548, 0.28462452],
[0.6175692 , 0.3824308 ],
[0.63651384, 0.36348616],
[0.62584695, 0.37415305],
[0.55156817, 0.44843183],
[0.60770431, 0.39229569],
[0.64037289, 0.35962711],
[0.44494103, 0.55505897],
[0.71615345, 0.28384655],
[0.39693984, 0.60306016],
[0.52301264, 0.47698736],
[0.45921654, 0.54078346],
[0.4121336 , 0.5878664 ],
[0.62750285, 0.37249715],
[0.33856961, 0.66143039],
[0.43123335, 0.56876665],
[0.59092639, 0.40907361],
[0.59729627, 0.40270373],
[0.44895471, 0.55104529],
[0.41576818, 0.58423182],
[0.49810242, 0.50189758],
[0.55482325, 0.44517675],
[0.44060927, 0.55939073],
[0.5029 , 0.4971 ],
[0.58781519, 0.41218481],
[0.40889401, 0.59110599],
[0.54580572, 0.45419428],
[0.26894142, 0.73105858],
[0.42625083, 0.57374917],
[0.36021738, 0.63978262],
[0.52808518, 0.47191482],
[0.35790598, 0.64209402],
[0.44528908, 0.55471092],
[0.49834102, 0.50165898],
[0.47800375, 0.52199625],
[0.71257885, 0.28742115],
[0.59224187, 0.40775813],
[0.3884741 , 0.6115259 ],
[0.50976535, 0.49023465],
[0.51407848, 0.48592152],
[0.46116416, 0.53883584],
[0.47545834, 0.52454166],
[0.56882478, 0.43117522],
[0.38211921, 0.61788079],
[0.43890178, 0.56109822],
[0.5723152 , 0.4276848 ],
[0.39136154, 0.60863846],
[0.41389145, 0.58610855],
[0.36282398, 0.63717602],
[0.40762356, 0.59237644],
[0.36091886, 0.63908114],
[0.36917046, 0.63082954],
[0.4016057 , 0.5983943 ],
[0.45714141, 0.54285859],
[0.38248224, 0.61751776],
[0.54324483, 0.45675517],
[0.43954232, 0.56045768],
[0.43985635, 0.56014365],
[0.5057918 , 0.4942082 ]])
[11]:
y_pred = CalfCV().fit(x, y).predict_proba(x)
roc_auc_score(y, y_pred[:, 1])
[11]:
0.8242187500000001
The class probabilities predicted by CalfCV
[12]:
y_pred
[12]:
array([[0.57255396, 0.42744604],
[0.57190911, 0.42809089],
[0.49905191, 0.50094809],
[0.37396104, 0.62603896],
[0.43673843, 0.56326157],
[0.42049986, 0.57950014],
[0.58727104, 0.41272896],
[0.73105858, 0.26894142],
[0.55325599, 0.44674401],
[0.53275445, 0.46724555],
[0.60855296, 0.39144704],
[0.71632012, 0.28367988],
[0.47160497, 0.52839503],
[0.6000901 , 0.3999099 ],
[0.5182401 , 0.4817599 ],
[0.61569573, 0.38430427],
[0.44520824, 0.55479176],
[0.61067857, 0.38932143],
[0.43740406, 0.56259594],
[0.50806646, 0.49193354],
[0.45940988, 0.54059012],
[0.5844512 , 0.4155488 ],
[0.6167177 , 0.3832823 ],
[0.45002242, 0.54997758],
[0.48864786, 0.51135214],
[0.51369577, 0.48630423],
[0.37743322, 0.62256678],
[0.45003696, 0.54996304],
[0.61891498, 0.38108502],
[0.5935698 , 0.4064302 ],
[0.45187866, 0.54812134],
[0.57869656, 0.42130344],
[0.59731614, 0.40268386],
[0.54665274, 0.45334726],
[0.68688645, 0.31311355],
[0.44670906, 0.55329094],
[0.4693245 , 0.5306755 ],
[0.38162746, 0.61837254],
[0.37643876, 0.62356124],
[0.57096094, 0.42903906],
[0.51437333, 0.48562667],
[0.38786473, 0.61213527],
[0.46310646, 0.53689354],
[0.39080126, 0.60919874],
[0.43044281, 0.56955719],
[0.39303232, 0.60696768],
[0.42280034, 0.57719966],
[0.51340885, 0.48659115],
[0.26894142, 0.73105858],
[0.40361923, 0.59638077],
[0.36135889, 0.63864111],
[0.39733642, 0.60266358],
[0.33561026, 0.66438974],
[0.3574579 , 0.6425421 ],
[0.52855035, 0.47144965],
[0.44461878, 0.55538122],
[0.42918205, 0.57081795],
[0.36705095, 0.63294905],
[0.45142257, 0.54857743],
[0.34602883, 0.65397117],
[0.39609809, 0.60390191],
[0.45418881, 0.54581119],
[0.29541473, 0.70458527],
[0.53079061, 0.46920939],
[0.45560206, 0.54439794],
[0.36134196, 0.63865804],
[0.34390344, 0.65609656],
[0.59190982, 0.40809018],
[0.36853881, 0.63146119],
[0.41398225, 0.58601775],
[0.4007513 , 0.5992487 ],
[0.49950364, 0.50049636]])
[13]:
y_pred = Calf().fit(x, y).predict(x)
roc_auc_score(y, y_pred)
[13]:
0.696875
The classes predicted by Calf
[14]:
y_pred
[14]:
array([0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 1., 1., 1.,
0., 1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 0., 1.,
0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
0., 1., 1., 0.])
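When `roc_auc_score` is given hard 0/1 labels rather than probabilities, the ROC curve has a single interior point and the AUC reduces to the mean of sensitivity and specificity (balanced accuracy). The sketch below recomputes that quantity from the Calf labels printed above. It assumes the first 40 rows of `n2.csv` are controls and the last 32 are cases; the notebook does not state the row order, so treat this as an assumption, although it is consistent with the reported class counts.

```python
# Calf class predictions, copied from the output above (72 values).
calf_labels = [
    0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,
    0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
    0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 1, 1, 0,
]

# ASSUMPTION: rows are ordered as 40 controls (0) followed by 32 cases (1).
y_assumed = [0] * 40 + [1] * 32

# Count correctly predicted cases and controls.
tp = sum(1 for t, p in zip(y_assumed, calf_labels) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_assumed, calf_labels) if t == 0 and p == 0)

sensitivity = tp / y_assumed.count(1)  # true positive rate
specificity = tn / y_assumed.count(0)  # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2
print(sensitivity, specificity, balanced_accuracy)
```

Under that assumed row order, the balanced accuracy agrees with the 0.696875 returned by `roc_auc_score(y, y_pred)` above. The probability-based AUC is higher because it uses the full ranking of the samples rather than a single threshold.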
[15]:
y_pred = CalfCV().fit(x, y).predict(x)
roc_auc_score(y, y_pred)
[15]:
0.709375
The classes predicted by CalfCV
[16]:
y_pred
[16]:
array([0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
0., 1., 0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 0., 0.,
0., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0.,
1., 1., 1., 1.])
References:
[1] Jeffries, C.D., Ford, J.R., Tilson, J.L. et al. A greedy regression algorithm with coarse weights offers novel advantages. Sci Rep 12, 5440 (2022). https://doi.org/10.1038/s41598-022-09415-2