{"id":195,"date":"2020-11-16T09:07:52","date_gmt":"2020-11-16T08:07:52","guid":{"rendered":"https:\/\/informedica.nl\/?p=195"},"modified":"2020-11-16T09:10:41","modified_gmt":"2020-11-16T08:10:41","slug":"machine-learning-feature-selection","status":"publish","type":"post","link":"https:\/\/informedica.nl\/?p=195","title":{"rendered":"Machine Learning Feature Selection"},"content":{"rendered":"\n<p>In a previous post, a setup using an F# script to perform machine learning with Microsoft.ML is described. A very import aspect for a successful model is picking the right features that will predict the label, i.e. the exposure that is associated with outcome. In this post a simple F# feature selection algorithm is described that automatically figures out which features result in the &#8216;best&#8217; model.<\/p>\n\n\n<p><!--more--><\/p>\n<p>\u00a0<\/p>\n\n\n<p>As in the previous blog, a data set from a PICU is used containing PICU admissions and the outcome, survival. From literature and prognostic scoring models a list of features can be selected that will predict mortality.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"fsharp\" class=\"language-fsharp\">let features =\r\n    [ \r\n        \"AdmissionYear\"\r\n        \"Age\"\r\n        \"Elective\"\r\n        \"Ventilated\"\r\n        \"Oxygen\"\r\n        \"SystolicBloodPressure\"\r\n        \"BaseExcess\"\r\n        \"NoRecovery\"\r\n        \"NonNeuroScore\"\r\n        \"NeuroScore\"\r\n        \"LowRisk\"\r\n        \"HighRisk\"\r\n        \"VeryHighRisk\"\r\n        \"Cancer\"\r\n        \"PIM3Score\"\r\n    ]\r<\/code><\/pre>\n\n\n\n<p>However which combination of the above features generates the &#8216;best model&#8217;, in terms of accuracy and area under the ROC curve is not clear. 
The combinations of these 15 features can be generated using the following function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"fsharp\" class=\"language-fsharp\">module List =\r\n\r\n    \/\/ Create combinations of the elements of xs\r\n    \/\/ and return the list of combinations.\r\n    \/\/ Note: this does not generate every possible subset.\r\n    \/\/ For example [\"a\"; \"b\"; \"c\"] will return:\r\n    \/\/ [[\"a\"]; [\"a\"; \"b\"]; [\"b\"]; [\"c\"]; [\"a\"; \"c\"]]\r\n    let combinations xs =\r\n        [\r\n            for x in xs do\r\n                yield!\r\n                    xs \r\n                    |> List.fold (fun (i, acc) _ ->\r\n                        i + 1,\r\n                        xs \r\n                        |> List.filter ((&lt;>) x)\r\n                        |> List.append [ x ]\r\n                        |> List.take i\r\n                        |> List.sort\r\n                        |> List.singleton \r\n                        |> List.append acc\r\n                    ) (0, [[]])\r\n                    |> snd\r\n                    |> List.filter (List.isEmpty >> not)\r\n        ]\r\n        |> List.distinct\r<\/code><\/pre>\n\n\n\n<p>The above code can be used like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"fsharp\" class=\"language-fsharp\">features\r\n|> List.combinations\r\n\/\/ Results in:\r\n\/\/val it : string list list =\r\n\/\/  [[\"AdmissionYear\"]; [\"AdmissionYear\"; \"Age\"];\r\n\/\/   [\"AdmissionYear\"; \"Age\"; \"Elective\"];\r\n\/\/   [\"AdmissionYear\"; \"Age\"; \"Elective\"; \"Ventilated\"];\r\n\/\/   [\"AdmissionYear\"; \"Age\"; \"Elective\"; \"Oxygen\"; \"Ventilated\"];\r\n\/\/   [\"AdmissionYear\"; \"Age\"; \"Elective\"; \"Oxygen\"; \"SystolicBloodPressure\";\r\n\/\/    \"Ventilated\"];\r\n\/\/ etc...<\/code><\/pre>\n\n\n\n<p>The total number of unique feature combinations generated this way is 119; the function deliberately explores only a small part of all possible subsets. <\/p>\n\n\n\n<p>You can then iterate over this list and calculate the metrics of the trained model for each feature list. The code below does this, while keeping track of the best performing set of features:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"fsharp\" class=\"language-fsharp\">\r\nlet analyze features =\r\n    features\r\n    \/\/ Calculate the model, metrics\r\n    \/\/ will be printed\r\n    |> calculate trainData testData\r\n    |> fun m -> \r\n        m |> printCalibratedMetrics\r\n        printfn \"\"\r\n        printfn \"\"\r\n        printfn \"%s\" (m.ConfusionMatrix.GetFormattedConfusionTable())\r\n        \r\n        m\r\n\r\n\r\nlet pickFeatures features =\r\n    features\r\n    |> List.combinations\r\n    |> List.fold (fun (m' : {| acc : float; roc : float |}, fs') fs ->\r\n        if fs' |> List.isEmpty |> not then\r\n            printfn \"\"\r\n            printfn \"*\\t Best fit so far with %s\" (fs' |> String.concat \", \")\r\n            printfn \"*\\t Accuracy: %0.3f\" m'.acc \r\n            printfn \"*\\t Area Under ROC Curve: %0.3f\" m'.roc \r\n            printfn \"------------------------------------\"\r\n            printfn \"\"\r\n\r\n        fs\r\n        |> String.concat \", \" \r\n        |> printfn \"=== ANALYZING %s ===\"\r\n\r\n        let m = fs |> analyze\r\n        \/\/ calculate the combined performance metric\r\n        let a = m.Accuracy + m.AreaUnderRocCurve\r\n        let a' = m'.acc + m'.roc \r\n        \/\/ compare the new performance with the best so far\r\n        if a > a' then \r\n            \/\/ keep the new feature set and its metrics\r\n            ({| acc = m.Accuracy; roc = m.AreaUnderRocCurve |}, fs)\r\n        else (m', fs')\r\n    ) ({| acc = 0.; 
roc = 0.|}, [])\r<\/code><\/pre>\n\n\n\n<p>The end result, in this case, was the following set of features:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"fsharp\" class=\"language-fsharp\">\/\/ Calculate the model, metrics\r\n\/\/ will be printed\r\n[   \r\n    \"AdmissionYear\"\r\n    \"Age\"\r\n    \"Elective\"\r\n    \"PIM3Score\"\r\n    \"Ventilated\"\r\n]\r\n|> analyze \r\n\/\/*\tMetrics for train and test data      \r\n\/\/*-----------------------------------------------------------\r\n\/\/*\tModel trained with 744 records\r\n\/\/*\tContaining 372 deaths\r\n\/\/*\tModel tested with 2851 records\r\n\/\/*\tContaining 93 deaths\r\n\r\n\/\/*\tMetrics for binary classification model      \r\n\/\/*-----------------------------------------------------------\r\n\/\/*\tAccuracy: 0.803\r\n\/\/*\tArea Under Roc Curve: 0.889\r\n\/\/*\tArea Under PrecisionRecall Curve: 0.358\r\n\/\/*\tF1 Score: 0.206\r\n\/\/*\tLogLoss: 0.901\r\n\/\/*\tLogLoss Reduction: -3.347\r\n\/\/*\tPositive Precision: 0.119\r\n\/\/*\tPositive Recall: 0.785\r\n\/\/*\tNegative Precision: 0.991\r\n\/\/*\tNegative Recall: 0.803\r<\/code><\/pre>\n\n\n\n<p>Of course, there are tools that accomplish the same, like AutoML. However, with a few simple functions you can achieve the same result and gain far more control and insight. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>In a previous post, a setup using an F# script to perform machine learning with Microsoft.ML is described. A very important aspect of a successful model is picking the right features to predict the label, i.e. the exposures that are associated with the outcome. 
In this post a simple F# feature selection algorithm is described &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/informedica.nl\/?p=195\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Machine Learning Feature Selection&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-195","post","type-post","status-publish","format-standard","hentry","category-general"],"_links":{"self":[{"href":"https:\/\/informedica.nl\/index.php?rest_route=\/wp\/v2\/posts\/195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/informedica.nl\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/informedica.nl\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/informedica.nl\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/informedica.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=195"}],"version-history":[{"count":4,"href":"https:\/\/informedica.nl\/index.php?rest_route=\/wp\/v2\/posts\/195\/revisions"}],"predecessor-version":[{"id":199,"href":"https:\/\/informedica.nl\/index.php?rest_route=\/wp\/v2\/posts\/195\/revisions\/199"}],"wp:attachment":[{"href":"https:\/\/informedica.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/informedica.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/informedica.nl\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}