First Microsoft.ML steps!

Machine learning is the new kid on the block and has become accessible with the advent of the Microsoft.ML library. In medicine, machine learning hasn’t been used that much; most scientific epidemiological research still relies on established statistical analysis. The core principle, however, is the same: known exposure is used to predict outcome. In Pediatric Intensive Care, prediction of mortality is used to benchmark and monitor performance, and in the Netherlands data is gathered on a national basis for this purpose. This blog will discuss a machine learning setup to analyze a data set of 13,793 PICU admissions.

F# is used to extract the data and feed it to the ML algorithms. The code is written in a regular F# script file, which allows running and testing the code, or parts of it, in F# Interactive. This workflow is known as a REPL: Read, Evaluate, Print, Loop, and it is an extremely efficient way to write code.

The first step is to setup the infrastructure:

  • create a local dotnet tool manifest : dotnet new tool-manifest
  • install paket for package management : dotnet tool install paket
  • initiate paket: dotnet paket init
  • add Microsoft.ML : dotnet paket add Microsoft.ML
  • generate load scripts : dotnet paket generate-load-scripts

The generate-load-scripts command, no surprise, generates load scripts that can be used in a script file to get access to the needed libraries. There is a caveat, however: the script is not compiled but interpreted, so a required native runtime library, CpuMathNative, is not copied to the folder containing the managed microsoft.ml.cpumath assembly. This has to be done manually in order to get things up and running.
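
A few lines of F# script can do that copying. Note that the exact source and destination paths below are assumptions (Windows, a local paket packages folder); they depend on the platform, the package version and where paket stores its packages.

// Copy the native CpuMath library next to the managed assembly so the
// script can find it. The paths are assumptions; adjust them to the
// actual locations on your machine.
open System.IO

let cpuMathDir = Path.Combine(__SOURCE_DIRECTORY__, "packages", "Microsoft.ML.CpuMath")
let nativeDll  = Path.Combine(cpuMathDir, "runtimes", "win-x64", "native", "CpuMathNative.dll")
let targetDir  = Path.Combine(cpuMathDir, "lib", "netstandard2.0")

File.Copy(nativeDll, Path.Combine(targetDir, Path.GetFileName nativeDll), true)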

The second step is to create an F# script file and open it in Visual Studio Code or Visual Studio. The script file starts with the load scripts, after which the Microsoft.ML library can be opened.


// Load all dependencies
#load "./.paket/load/netstandard2.0/main.group.fsx"


open System 
open System.IO

open Microsoft.ML
open Microsoft.ML.Data


// Make sure that code uses the current source directory
Environment.CurrentDirectory <- __SOURCE_DIRECTORY__

The third step is to load the data, in this case from a tab-delimited text file.


// Randomly order an array by pairing each
// element with a guid, sorting on the guid
// and returning the elements in that order
let randomOrder xs =
    xs
    |> Array.map (fun x -> Guid.NewGuid (), x)
    |> Array.sortBy fst
    |> Array.map snd


// Get the data from the file
// put this in an array of arrays
// i.e. a table structure. Also make
// sure that data is in random order
let source =
    File.ReadAllLines "Scores.txt"
    |> fun xs -> 
        xs 
        |> Array.skip 1    
        |> randomOrder
        |> Array.append (xs |> Array.take 1)
    |> Array.map (fun r -> r.Split('\t'))

// get table value from a row r with 
// column name c
let getRowColumn c (r : string[]) =
    let i =
        source
        |> Array.head
        |> Array.findIndex ((=) c)
    r.[i]
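
For example, in F# Interactive a single value can be read from the first data row like this (the result of course depends on the contents of Scores.txt):

// Read one value: the age (in days, as a string) of the first admission
source
|> Array.item 1
|> getRowColumn "Age(days)"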

F# excels at this kind of programming problem. There is also a dedicated CSV type provider, but reading the file and getting the data out is trivial either way, and writing it by hand makes it easy to add features such as randomizing the data rows.
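
For comparison, a minimal sketch of what the same read might look like with the FSharp.Data CSV type provider (assuming the FSharp.Data package were added; it is not used in the rest of this post):

// Sketch only: infer the columns of the tab separated file at compile time.
// Assumes the FSharp.Data package has been added via paket.
open FSharp.Data

type Scores = CsvProvider<"Scores.txt", Separators="\t">

let scoreRows = (Scores.Load "Scores.txt").Rows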

// The type to hold the data
// with the features and the
// label, -> Death
[<CLIMutable>]
type Data =
    {
        Age : Single
        Elective : Single
        SystolicBloodPressure : Single
        Ventilated : Single
        Oxygen : Single
        NoRecovery : Single
        NonNeuroScore : Single
        NeuroScore : Single
        LowRisk : Single
        HighRisk : Single
        VeryHighRisk : Single
        Cancer : Single
        PIM3Score : Single
        Death : bool
    }

From a medical epidemiological view we look at data in terms of exposure and outcome.

Epidemiology is The Study of Diseases in Populations

In ML land, exposures are called features and the outcome is called the label. So, in the data record above the top fields hold the features, and the last field, Death, is in fact the label, i.e. the outcome.

Also, note that you can use a regular F# record to hold the data. You do need to add the CLIMutable attribute so that the record gets a parameterless constructor and property setters. You do not need to add the Column attributes you often see in ML code. Transforming the columns to the appropriate data types, however, is something that can be achieved really easily in F#.

To create the records the following utility functions are used:

// Low-risk diagnosis:
let pimLowRisk =
    [ "Asthma"
      "Bronchiolitis"
      "Croup"
      "ObstructiveSleepApnea"
      "DiabeticKetoacidosis"
      "SeizureDisorder" ]
// High-risk diagnosis:
let pimHighRisk =
    [ "CerebralHemorrhage"
      "CardiomyopathyOrMyocarditis"
      "HIVPositive  "
      "HypoplasticLeftHeartSyndrome"
      "NeurodegenerativeDisorder"
      "NecrotizingEnterocolitis" ]
// Very high-risk diagnosis:
let pimVeryHighRisk =
    [ "CardiacArrestInHospital"
      "CardiacArrestPreHospital"
      "SevereCombinedImmuneDeficiency"
      "LeukemiaorLymphoma"
      "BoneMarrowTransplant"
      "LiverFailure" ]

// Map a specific diagnosis to
// either a low, high or very high risk.
let mapRiskDiagnosis xs x = 
    if xs |> List.exists ((=) x) then 1. else 0.
    |> single
let mapLowRisk = mapRiskDiagnosis pimLowRisk
let mapHighRisk = mapRiskDiagnosis pimHighRisk
let mapVeryHighRisk = mapRiskDiagnosis pimVeryHighRisk

// Map a string to a value with type Single
let parseSingleWithDefault d (s : string) =
    s 
    |> Single.TryParse
    |> function
    | true, x -> x |> single
    | _ -> d

// Map a string to a boolean flag encoded as a Single (1.0f / 0.0f)
let mapBoolean s2 s1 = 
    if s1 = s2 then 1 else 0 
    |> single
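
// A few quick checks in F# Interactive show how these helpers behave
// (expected results, assuming the definitions above are loaded):
// parseSingleWithDefault 0.f "120"    -> 120.0f
// parseSingleWithDefault 0.f "n/a"    -> 0.0f
// mapBoolean "Elective" "Elective"    -> 1.0f
// mapLowRisk "Asthma"                 -> 1.0f
// mapHighRisk "Asthma"                -> 0.0f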

// Create an array of Data
let data = 
    source
    |> Array.filter ((getRowColumn "Age(days)") >> ((=) "") >> not)
    |> Array.skip 1
    |> Array.map (fun r ->

        {
            Age =
                r
                |> getRowColumn "Age(days)"
                |> fun x ->
                    try
                        x |> single
                    with _ -> sprintf "cannot parse %s" x |> failwith
            Elective =
                r
                |> getRowColumn "Urgency" 
                |> mapBoolean "Elective"
            SystolicBloodPressure = 
                r
                |> getRowColumn "SystolicBP"
                |> parseSingleWithDefault 120.f
            Ventilated =
                r
                |> getRowColumn "Ventilated"
                |> mapBoolean "True"
            Oxygen =
                let o =
                    r
                    |> getRowColumn "PaO2"
                    |> parseSingleWithDefault 0.f
                let f = 
                    r
                    |> getRowColumn "FiO2"
                    |> parseSingleWithDefault 1.f
                if o > 0.f then o / f else 0.23f
            NoRecovery =
                r
                |> getRowColumn "Recovery"
                |> mapBoolean "NoRecovery" 
            NonNeuroScore =
                r
                |> getRowColumn "PRISM3Score"
                |> parseSingleWithDefault 0.f
            NeuroScore =
                r
                |> getRowColumn "PRISM3Neuro"
                |> parseSingleWithDefault 0.f
            LowRisk =
                r
                |> getRowColumn "RiskDiagnoses"
                |> mapLowRisk
            HighRisk =
                r
                |> getRowColumn "RiskDiagnoses"
                |> mapHighRisk
            VeryHighRisk =
                r
                |> getRowColumn "RiskDiagnoses"
                |> mapVeryHighRisk
            Cancer =
                r
                |> getRowColumn "Cancer"
                |> mapBoolean "True"
            PIM3Score =
                r 
                |> getRowColumn "PIM3Score"
                |> parseSingleWithDefault 0.f
            Death =
                r
                |> getRowColumn "Status" 
                |> fun x -> x = "Death"
        }
    )
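
With the data array in place, a few FSI one-liners give a quick feel for the set (the actual numbers obviously depend on the contents of the data file):

// Inspect the resulting data set in F# Interactive
data |> Array.length                                     // total number of admissions
data |> Array.filter (fun d -> d.Death) |> Array.length  // number of deaths (the cases)
data |> Array.averageBy (fun d -> d.PIM3Score)           // mean PIM3 score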

In this case we want a binary prediction, whether the patient will survive a PICU admission or not.

The next step is to divide the data set in a training set and a test set.

// Divide the data in a training set
// and a test set, making sure that
// the training set is balanced, i.e.
// contains an equal number of deaths
// and survivors. Also, the test set
// will not contain any records that
// were included in the training data.
let trainData, testData =
    // get all the cases in the dataset
    let cases =
        data
        |> Array.filter (fun d -> d.Death)
    // calculate the case incidence
    let incidence =
        cases |> Array.length |> float
        |> fun x -> x / (data |> Array.length |> float)
    // create a training set with 80% of cases and
    // keep track of selected cases in selected
    let selected, trainData =
        let selected =
            cases
            |> Array.take (0.8 * (cases |> Array.length |> float) |> int)
        selected,
        data
        |> Array.filter (fun d -> d.Death |> not)
        |> Array.take (selected |> Array.length)
        |> Array.append selected
    // pick the not-selected cases
    let notSelected =
        data
        |> Array.filter (fun x -> x.Death)
        |> Array.filter (fun x -> selected |> Array.exists ((=) x) |> not)

    trainData,
    // take a random sample for the test data
    // making sure that it has the right incidence
    data
    |> randomOrder
    |> Array.filter (fun x ->
        x.Death |> not &&
        trainData |> Array.exists ((=) x) |> not
    )
    |> Array.take (1. / incidence * (notSelected |> Array.length |> float) |> int)
    |> Array.append notSelected

There is also an ML.NET method to split the data; however, it will not create a balanced training set, i.e. one that contains equal numbers of cases and controls. The above code will do that, while keeping the test data set completely independent of the training data set and at roughly the same case incidence as the whole data set.
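
For reference, a minimal sketch of the built-in ML.NET split, which is a plain random split by fraction with no balancing of deaths against survivors:

// Sketch of the built-in split: a random 80/20 split, no balancing
let context = MLContext()
let allView = context.Data.LoadFromEnumerable data
let split   = context.Data.TrainTestSplit(allView, testFraction = 0.2)
// split.TrainSet and split.TestSet are IDataViews, not Data arrays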

The ML library can generate a metrics object that can be used to print out the model metrics. The helpers below print the composition of the training and test data and the metrics themselves.

let printDataMetrics (trainData : Data seq) (testData : Data seq) =
    printfn "*       Metrics for train and test data      " 
    printfn "*-----------------------------------------------------------"
    printfn "*       Model trained with %i records" (trainData |> Seq.length)
    printfn "*       Containing %i deaths" (trainData |> Seq.filter (fun d -> d.Death) |> Seq.length)
    printfn "*       Model tested with %i records" (testData |> Seq.length)
    printfn "*       Containing %i deaths" (testData |> Seq.filter (fun d -> d.Death) |> Seq.length)
    printfn ""
    

let printCalibratedMetrics (metrics : CalibratedBinaryClassificationMetrics) =
    printfn "*       Metrics for binary classification model      " 
    printfn "*-----------------------------------------------------------"
    printfn "*       Accuracy: %.3f" metrics.Accuracy
    printfn "*       Area Under Roc Curve: %.3f" metrics.AreaUnderRocCurve
    printfn "*       Area Under PrecisionRecall Curve: %.3f" metrics.AreaUnderPrecisionRecallCurve
    printfn "*       F1 Score: %.3f" metrics.F1Score
    printfn "*       LogLoss: %.3f" metrics.LogLoss
    printfn "*       LogLoss Reduction: %.3f" metrics.LogLossReduction
    printfn "*       Positive Precision: %.3f" metrics.PositivePrecision
    printfn "*       Positive Recall: %.3f" metrics.PositiveRecall
    printfn "*       Negative Precision: %.3f" metrics.NegativePrecision
    printfn "*       Negative Recall: %.3f" metrics.NegativeRecall

The actual calculation is relatively simple:

// Calculate the model using the training data,
// and the test data for the metrics. Include the features
// (Data field names) that have to be included in the model.
let calculate trainData testData features =
    let context = MLContext()

    let trainView = context.Data.LoadFromEnumerable trainData
    let testView = context.Data.LoadFromEnumerable testData
    
    let pipeline =
        let features = features |> Seq.toArray
        EstimatorChain()
            .Append(context.Transforms.Concatenate("Features", features))
            .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression("Death", "Features"))

    let trained = pipeline.Fit(trainView)

    let predicted = trained.Transform(testView)

    let metrics = 
        //context.BinaryClassification.EvaluateNonCalibrated(data=predicted, labelColumnName="Death", scoreColumnName="Score")
        context.BinaryClassification.Evaluate(data=predicted, labelColumnName="Death", scoreColumnName="Score")


    printDataMetrics trainData testData
    metrics

The above function takes a training data set, a test data set and a list of features (a list of field names from the Data record). From these a metrics object is calculated to assess the performance of the generated model.

This code can be used directly from the script file like this:


// analyze a features set
let analyze features =
    features
    // Calculate the model, metrics
    // will be printed
    |> calculate trainData testData
    |> fun m -> 
        m |> printCalibratedMetrics
        printfn ""
        printfn ""
        printfn "%s" (m.ConfusionMatrix.GetFormattedConfusionTable())
        m

// Analyze a set of features,
// the metrics will be printed
[
    "Age"
    "Elective"
    "PIM3Score"
    "Ventilated"
]
|> analyze 
|> ignore

This will print out the following metrics (note that the training data set is balanced, while the test data set reflects the actual incidence):

*	Metrics for train and test data      
*-----------------------------------------------------------
*	Model trained with 744 records
*	Containing 372 deaths
*	Model tested with 2851 records
*	Containing 93 deaths

*	Metrics for binary classification model      
*-----------------------------------------------------------
*	Accuracy: 0.805
*	Area Under Roc Curve: 0.834
*	Area Under PrecisionRecall Curve: 0.252
*	F1 Score: 0.194
*	LogLoss: 0.906
*	LogLoss Reduction: -3.368
*	Positive Precision: 0.112
*	Positive Recall: 0.720
*	Negative Precision: 0.988
*	Negative Recall: 0.807


TEST POSITIVE RATIO:	0.0326 (93.0/(93.0+2758.0))
Confusion table
          ||======================
PREDICTED || positive | negative | Recall
TRUTH     ||======================
 positive ||       67 |       26 | 0.7204
 negative ||      531 |    2,227 | 0.8075
          ||======================
Precision ||   0.1120 |   0.9885 |

These metrics and the confusion table are really clearly described in this blog.
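
As a quick sanity check, the headline numbers can be recomputed from the confusion table above:

// Recompute accuracy, precision, recall and F1 from the confusion table
let tp, fn, fp, tn = 67.0, 26.0, 531.0, 2227.0
let precision = tp / (tp + fp)                                   // 0.112
let recall    = tp / (tp + fn)                                   // 0.720
let f1        = 2. * precision * recall / (precision + recall)   // 0.194
let accuracy  = (tp + tn) / (tp + fn + fp + tn)                  // 0.805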
