Using F# as an ETL tool

The ability to Extract, Transform and Load data to a format that enables data analysis and machine learning is essential to make use of the vast amount of observational data that is nowadays available. F# can be a very efficient tool to achieve these goals.

The general model to represent the extraction is as follows:

/// A function to summarize a list of values
/// as one value
type Collaps = Value seq -> Value

/// A conversion function for a value
type Convert = Value -> Value

/// An observation describes wich signals to
/// that describe that observation (Sources)
/// along with a collaps function that summarizes
/// the obtained values to one value.
/// Each source value also can be converted to
/// a different or adjusted value.
type Observation =
    { Name : Name // Art bloodgas 
      Sources : Source list
      Collaps : Collaps }

and Source =
    { Name : Name
      Id : ParameterId
      Convert : Convert }

/// A list of observations to retrieve
type Observations = Observation seq

Central is the Observation type. An observation is anything that can be observed, obviously, and is identified by a name, a list of sources and the ability to summarize the source list to a single value.

For example, when we want to observe the urinary output of a patient, this can be stored in a number of parameters like:

Spontaneous diuresis
Catheter measured diuresis
Diaper weight measured diuresis
etc..

So with these list of sources the observed diuresis can be summurazed by adding all the entries in the list of sources.

Another example is when temperature is measured at different locations and is stored using different parameters. A possible collapse function could use the list of sources as a hierarchical ordered list that returns the first entry that is not empty.

For each source a convert function can be used to either:

To convert the value to a different value, for example to match the other possible values in the list of sources or
To apply a filter to a value to filter out values that cannot be true measurements or observations

The end result is a list of observations that describe where to retrieve those observations, how to summarize the sources for the observations and convert and/or filter the observations values.

Next a dataset is created that contains the retrieved data.

/// The resulting dataset with colums
/// and rows of data. Each row has a
/// date time.
type DataSet =
    { Columns : Name seq
      Data : PatientId * DateTime * DataRow seq }

and DataRow = Value seq

The names contain the observation names, which are column names, and each row is uniquely identified by a patient a date time. The row itself is a list of values that map to the column names. Thus it represents a flat table containing all the patient data.

The final step is to anonymize the dataset by replacing patient id’s with random identifiers and replacing the actual date time with a time since the first observation for that patient. In the process the birthdate for a patient can be replaced by the age at the time of an observation. This leaves no traceable patient data.

A prototypical system has been created for the MetaVision PDMS from iMDSoft.

Leave a Reply Cancel reply