In a previous post, a setup using an F# script to perform machine learning with Microsoft.ML was described. A very important aspect of a successful model is picking the right features to predict the label, i.e. the exposures that are associated with the outcome. In this post, a simple F# feature selection algorithm is described that automatically figures out which features result in the ‘best’ model.
As in the previous post, a data set from a PICU (pediatric intensive care unit) is used, containing PICU admissions and the outcome, survival. From the literature and from prognostic scoring models, a list of features can be selected that may predict mortality.
let features =
    [
        "AdmissionYear"
        "Age"
        "Elective"
        "Ventilated"
        "Oxygen"
        "SystolicBloodPressure"
        "BaseExcess"
        "NoRecovery"
        "NonNeuroScore"
        "NeuroScore"
        "LowRisk"
        "HighRisk"
        "VeryHighRisk"
        "Cancer"
        "PIM3Score"
    ]
However, it is not clear which combination of the above features generates the ‘best’ model in terms of accuracy and area under the ROC curve. Candidate combinations of the 15 features can be generated using this function:
module List =

    // Create a selection of combinations of xs:
    // each x is joined with a growing prefix of the
    // remaining elements; every combination is sorted
    // and duplicates are removed. Not every possible
    // subset of xs is generated.
    // For example ["a"; "b"; "c"] will return:
    // [["a"]; ["a"; "b"]; ["b"]; ["c"]; ["a"; "c"]]
    let combinations xs =
        [
            for x in xs do
                yield!
                    xs
                    |> List.fold (fun (i, acc) _ ->
                        i + 1,
                        xs
                        |> List.filter ((<>) x)
                        |> List.append [ x ]
                        |> List.take i
                        |> List.sort
                        |> List.singleton
                        |> List.append acc
                    ) (0, [[]])
                    |> snd
                    |> List.filter (List.isEmpty >> not)
        ]
        |> List.distinct
The above code can be used like this:
features
|> List.combinations
// Results in:
//val it : string list list =
// [["AdmissionYear"]; ["AdmissionYear"; "Age"];
// ["AdmissionYear"; "Age"; "Elective"];
// ["AdmissionYear"; "Age"; "Elective"; "Ventilated"];
// ["AdmissionYear"; "Age"; "Elective"; "Oxygen"; "Ventilated"];
// ["AdmissionYear"; "Age"; "Elective"; "Oxygen"; "SystolicBloodPressure";
// "Ventilated"];
// etc...
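For these 15 features, the function yields 119 unique combinations. This is a small sample of the 2^15 - 1 = 32,767 non-empty feature subsets that exist in total, which keeps the number of models to train manageable.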
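For comparison, an exhaustive search would have to train a model for every non-empty subset of the features. The following is a minimal sketch of such an enumeration, only meant to make the size of the full search space concrete. The name `allSubsets` is a hypothetical helper and not part of the original script.

```fsharp
// Hypothetical helper, not part of the original post:
// enumerates every non-empty subset of xs, keeping the input order
// inside each subset.
let allSubsets xs =
    List.foldBack (fun x acc ->
        // every subset found so far appears once without x and once with x
        acc @ (acc |> List.map (fun s -> x :: s))
    ) xs [ [] ]
    |> List.filter (List.isEmpty >> not)

// Three elements give 2^3 - 1 = 7 non-empty subsets
let smallCount = [ "a"; "b"; "c" ] |> allSubsets |> List.length

// Fifteen elements give 2^15 - 1 = 32767 non-empty subsets
let fullCount = [ 1 .. 15 ] |> allSubsets |> List.length
```

Training and evaluating a model for each of those subsets would take considerably longer than for the 119 candidates produced by `List.combinations`, so the reduced candidate list trades completeness for speed.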
You can then iterate over this list and, for each feature list, calculate the metrics of the trained model. The code below does this, while keeping track of the best performing set of features:
let analyze features =
    features
    // Calculate the model, metrics
    // will be printed
    |> calculate trainData testData
    |> fun m ->
        m |> printCalibratedMetrics
        printfn ""
        printfn ""
        printfn "%s" (m.ConfusionMatrix.GetFormattedConfusionTable())
        m

let pickFeatures features =
    features
    |> List.combinations
    |> List.fold (fun (m' : {| acc : float; roc : float |}, fs') fs ->
        // report the best feature set found so far
        if fs' |> List.isEmpty |> not then
            printfn ""
            printfn "*\t Best fit so far with %s" (fs' |> String.concat ", ")
            printfn "*\t Accuracy: %0.3f" m'.acc
            printfn "*\t Area Under ROC Curve: %0.3f" m'.roc
            printfn "------------------------------------"
            printfn ""
        fs
        |> String.concat ", "
        |> printfn "=== ANALYZING %s ==="
        let m = fs |> analyze
        // calc combined perf metrics
        let a = m.Accuracy + m.AreaUnderRocCurve
        let a' = m'.acc + m'.roc
        // compare new perf with old perf
        if a > a' then
            // update the better perf metric using the new features set
            ({| acc = m.Accuracy; roc = m.AreaUnderRocCurve |}, fs)
        else (m', fs')
    ) ({| acc = 0.; roc = 0.|}, [])
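The selection rule itself can be tried out separately from the model training. Below is a minimal sketch of the same fold, assuming a hypothetical `Metrics` record and a made-up `fakeScore` function that stand in for the real `analyze` call; neither is part of the original script, and the scores are invented purely for illustration.

```fsharp
// Hypothetical stand-in for the metrics of a trained model.
type Metrics = { Accuracy : float; AreaUnderRocCurve : float }

// Same selection rule as in pickFeatures: keep the candidate with the
// highest sum of accuracy and area under the ROC curve.
let pickBest (score : string list -> Metrics) candidates =
    candidates
    |> List.fold (fun (best : Metrics, bestFs) fs ->
        let m = score fs
        let a = m.Accuracy + m.AreaUnderRocCurve
        let a' = best.Accuracy + best.AreaUnderRocCurve
        if a > a' then (m, fs) else (best, bestFs)
    ) ({ Accuracy = 0.; AreaUnderRocCurve = 0. }, [])

// Made-up scorer, for illustration only: rewards lists that contain
// "Age" and slightly penalises longer lists.
let fakeScore (fs : string list) =
    let bonus = if fs |> List.contains "Age" then 0.2 else 0.
    let penalty = 0.01 * float fs.Length
    { Accuracy = 0.6 + bonus - penalty
      AreaUnderRocCurve = 0.7 + bonus - penalty }

// The shortest list containing "Age" has the highest combined score
let best =
    [ [ "Age" ]; [ "Age"; "Elective" ]; [ "Elective" ] ]
    |> pickBest fakeScore
    |> snd
```

Because the comparison is a strict `>`, a later candidate that only ties with the current best does not replace it. Summing accuracy and area under the ROC curve also weighs both metrics equally; a different weighting would only require changing the two sums.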
The end result, in this case, was the following set of features:
// Calculate the model, metrics
// will be printed
[
    "AdmissionYear"
    "Age"
    "Elective"
    "PIM3Score"
    "Ventilated"
]
|> analyze
//* Metrics for train and test data
//*-----------------------------------------------------------
//* Model trained with 744 records
//* Containing 372 deaths
//* Model tested with 2851 records
//* Containing 93 deaths
//* Metrics for binary classification model
//*-----------------------------------------------------------
//* Accuracy: 0.803
//* Area Under Roc Curve: 0.889
//* Area Under PrecisionRecall Curve: 0.358
//* F1 Score: 0.206
//* LogLoss: 0.901
//* LogLoss Reduction: -3.347
//* Positive Precision: 0.119
//* Positive Recall: 0.785
//* Negative Precision: 0.991
//* Negative Recall: 0.803
Of course, there are tools that accomplish the same, like AutoML. However, with a few simple functions you can do the same and have far more control and insight.