data-frame
(require data-frame) | package: data-frame |
A data frame is a data structure used to hold data in tables with rows and columns. It is meant for convenient access and manipulation of relatively large data sets (however, these data sets must fit into the process memory). The package also provides functions for loading data into data frames and saving it out again, as well as several utilities and helpers for statistical calculations, plotting and curve fitting.
1 Rationale
Consider an example: during a sport activity, a sport watch will record data at periodic intervals, usually once every second. The data recorded might be time stamp, latitude, longitude, heart rate, distance, speed, cadence, power, etc. A high end sports watch will record up to 40 such measurements every second.
A simple approach for representing this data is to define a "DataPoint" structure containing a member for each possible value, and to represent the entire activity as a vector of data points. This approach has several problems:
If a structure is used, it will need to have up to 40 or so members, but most of the time they would be empty, wasting memory. Since we never know what data might be collected (this depends on the number and types of sensors that are active), we cannot save much by defining sub-types, like a RunDataPoint or a BikeDataPoint.
Operations on the data are done for one or only a few measurements at a time. For example, to find the average heart rate, one needs to traverse all the data points and look at the "hr" member of each structure. If the structure is big, every reference to the "hr" member will be a memory cache miss.
A data frame object addresses the problems above by storing measurements for the same parameter together. Essentially, all heart rate measurements are stored together in a vector, and all cadence measurements are stored together in a different vector. All these vectors have the same number of elements (the number of data points) and the same position in each such vector represents data at a certain point in time. This data organization has some advantages:
Memory is only used for data that actually exists. For example, if no power data is recorded, there will be no power data series in the data frame.
Operations on the data have efficient memory access. Calculating the average heart rate involves just referencing elements in a contiguous vector.
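The column-oriented layout described above can be sketched in plain Racket. This is only an illustration of the idea, not the package's internal representation: the `columns` hash and the `column-average` helper are hypothetical names introduced for this example.

```racket
#lang racket

;; Column-oriented layout: one vector per recorded measurement.
;; No power data was recorded, so there is simply no "pwr" entry.
;; (Illustrative only; this is not how the data-frame package stores data.)
(define columns
  (hash "timestamp" (vector 0 1 2)
        "hr"        (vector 120 122 127)
        "cad"       (vector 80 82 81)))

;; Average of one column: a single pass over one vector.
(define (column-average cols name)
  (define v (hash-ref cols name))
  (/ (for/sum ([x (in-vector v)]) x)
     (vector-length v)))

(column-average columns "hr") ; average heart rate over the three samples
```

Only the "hr" vector is touched when computing the average, which is the memory access pattern the rationale argues for.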
2 Creating data frames
The functions below allow constructing new data frames. They are mainly intended for writing functions that load data into data frames from different sources.
procedure
(make-series name #:data data #:cmpfn cmpfn #:na na #:contract contractfn) → series? name : string? data : vector? cmpfn : (or/c #f (-> any/c any/c boolean?)) na : any/c contractfn : (-> any/c boolean?)
cmpfn specifies an ordering function to use. If present, values can be looked up in this series using df-index-of and df-lookup. The data must be ordered according to this function.
na specifies the "not available" value for this series; by default it is #f.
contractfn is a contract function. If present, all values in the data series, except NA values, must satisfy this contract.
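The two requirements above (data ordered according to cmpfn, and non-NA values satisfying contractfn) can be expressed as simple checks. The sketch below uses plain vectors and hypothetical helper names (`sorted-by?`, `satisfies-contract?`); it also assumes that equal adjacent values count as sorted, which the text above does not specify.

```racket
#lang racket

;; #t when no adjacent pair in `data` is out of order according to `cmpfn`.
;; Assumption: equal neighbours are accepted as sorted.
(define (sorted-by? data cmpfn)
  (for/and ([i (in-range 1 (vector-length data))])
    (not (cmpfn (vector-ref data i) (vector-ref data (sub1 i))))))

;; #t when every value that is not the NA value satisfies `contractfn`.
(define (satisfies-contract? data contractfn na)
  (for/and ([v (in-vector data)])
    (or (equal? v na) (contractfn v))))

(sorted-by? (vector 1 2 2 5) <)                  ; ordered data
(satisfies-contract? (vector 60 #f 62) real? #f) ; #f is the NA value here
```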
procedure
(df-add-series df series) → any/c
df : data-frame? series : series?
procedure
(df-add-derived df name base-series value-fn) → any/c
df : data-frame? name : string? base-series : (listof string?) value-fn : mapfn/c
If a series named name already exists in the data frame, it will be replaced.
procedure
(df-add-lazy df name base-series value-fn) → any/c
df : data-frame? name : string? base-series : (listof string?) value-fn : mapfn/c
procedure
(df-set-sorted df name cmpfn) → any/c
df : data-frame? name : string? cmpfn : (or/c #f (-> any/c any/c boolean?))
procedure
(df-set-contract df name contractfn) → any/c
df : data-frame? name : string? contractfn : (or/c #f (-> any/c boolean?))
3 Loading data into data-frames and saving it out again
The functions below construct data frames by loading data from different sources, and save data frames back out.
procedure
db : connection? query : (or/c string? virtual-statement?) param : any/c
procedure
(df-read/csv input [ #:headers? headers? #:na na #:quoted-numbers? quoted-numbers?]) → data-frame? input : (or/c path-string? input-port?) headers? : boolean? = #t na : (or/c string? (-> string? boolean?)) = "" quoted-numbers? : boolean? = #f
na is the value in the CSV file that represents the "not available" value in the data frame. Strings equal? to this value will be replaced by #f. Alternatively, this can be a function which tests a string and returns #t if the string represents an NA value.
When quoted-numbers? is #t, all quoted values in the CSV file will be converted to numbers, if possible. E.g. a value like "123" will be converted to the number 123 if quoted-numbers? is #t, but will remain the string "123" if the parameter is #f.
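The handling of na and quoted-numbers? can be illustrated with a small per-cell conversion function. This is a sketch under assumptions, not the package's CSV reader: `convert-cell` and its `quoted?` argument are hypothetical, and it assumes that unquoted cells which look like numbers are always converted.

```racket
#lang racket

;; Convert the text of one CSV cell into a data frame value.
;; `quoted?` says whether the cell was enclosed in double quotes in the file.
;; `na` is either a string or a predicate on strings, as described above.
(define (convert-cell text quoted? na quoted-numbers?)
  (define na?
    (if (procedure? na) (na text) (equal? text na)))
  (cond
    [na? #f]                                     ; NA cells become #f
    [(and quoted? (not quoted-numbers?)) text]   ; quoted cells stay strings
    [else (or (string->number text) text)]))     ; numbers when possible

(convert-cell "123" #t "" #t)
(convert-cell "123" #t "" #f)
```

The first call yields the number 123 and the second the string "123", matching the behaviour described for quoted-numbers?.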
procedure
(df-write/csv df output #:start start #:stop stop series ...) → any/c df : data-frame? output : (or/c path-string? output-port?) start : exact-nonnegative-integer? stop : exact-nonnegative-integer? series : string?
start and stop denote the beginning and end rows to be written out. By default, all rows are written out.
4 Inspecting and extracting data
procedure
(df-describe df) → any/c
df : data-frame?
procedure
(df-get-property df key [default]) → any/c
df : data-frame? key : symbol? default : any/c = (lambda () #f)
procedure
(df-row-count df) → exact-nonnegative-integer?
df : data-frame?
procedure
(df-select df series [ #:filter filter #:start start #:stop stop]) → vector? df : data-frame? series : string? filter : (or/c #f (-> any/c any/c)) = #f start : index/c = 0 stop : index/c = (df-row-count df)
start and stop indicate the first row and one past the last row to be selected, so rows from start up to, but not including, stop are considered. filter, when present, will filter the values selected: only values for which the function returns #t will be added to the resulting vector.
If there is no filter specified, the resulting vector will have (- stop start) elements. If there is a filter, the number of elements depends on how many are filtered out by this function.
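The start, stop and filter semantics can be sketched on a plain vector. The `select-values` function below is a hypothetical stand-in that mirrors the keyword arguments of df-select; it is not the package's implementation.

```racket
#lang racket

;; Select values from positions start (inclusive) to stop (exclusive),
;; keeping only those accepted by the optional filter.
(define (select-values data
                       #:filter [keep? #f]
                       #:start [start 0]
                       #:stop [stop (vector-length data)])
  (for/vector ([v (in-vector data start stop)]
               #:when (or (not keep?) (keep? v)))
    v))

(select-values (vector 1 2 3 4 5) #:start 1 #:stop 4)                 ; 3 elements
(select-values (vector 1 2 3 4 5) #:start 1 #:stop 4 #:filter even?)  ; fewer
```

Without a filter the result has (- stop start) elements; with one, the length depends on how many values pass.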
procedure
(df-select* df #:filter filter [ #:start start #:stop stop] series ...) → vector? df : data-frame? filter : (or/c #f (-> any/c any/c)) start : index/c = 0 stop : index/c = (df-row-count df) series : string?
start and stop indicate the first row and one past the last row to be selected, so rows from start up to, but not including, stop are considered. filter, when present, will filter the values selected: only values for which the function returns #t will be added to the resulting vector.
If there is no filter specified, the resulting vector will have (- stop start) elements. If there is a filter, the number of elements depends on how many are filtered out by this function.
procedure
(in-data-frame df [ #:start start #:stop stop] series ...) → sequence? df : data-frame? start : index/c = 0 stop : index/c = (df-row-count df) series : string?
This is intended to be used in for and related constructs to iterate over elements in the data frame:
(for (([lat lon] (in-data-frame df "lat" "lon")))
  (printf "lat = ~a, lon = ~a~%" lat lon))
procedure
(in-data-frame/list df [ #:start start #:stop stop] series ...) → sequence? df : data-frame? start : index/c = 0 stop : index/c = (df-row-count df) series : string?
(for ((coord (in-data-frame/list df "lat" "lon")))
  (match-define (list lat lon) coord)
  (printf "lat = ~a, lon = ~a~%" lat lon))
procedure
df : data-frame? series : string? value : any/c
procedure
(df-index-of* df series value ...) → (listof index/c)
df : data-frame? series : string? value : any/c
The series must be sorted (see df-set-sorted), otherwise the calls will raise an error.
The value need not be present in the series. In that case, the returned index is the position of the first element which comes after the value, according to the sort function. This is the position where value could be inserted while keeping the series sorted. A value of 0 is returned if value is less than or equal to the first value of the series, and a value of (df-row-count df) is returned if value is greater than all the values in the series.
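The index semantics described above match a "lower bound" binary search. The following sketch works on a plain sorted vector; `lower-bound` is a hypothetical helper, and whether the package uses this exact algorithm is an assumption.

```racket
#lang racket

;; Return the first position whose element is not less than `value`,
;; i.e. the position where `value` could be inserted keeping `data` sorted.
(define (lower-bound data value less-than?)
  (let loop ([lo 0] [hi (vector-length data)])
    (if (>= lo hi)
        lo
        (let ([mid (quotient (+ lo hi) 2)])
          (if (less-than? (vector-ref data mid) value)
              (loop (add1 mid) hi)   ; answer is to the right of mid
              (loop lo mid))))))     ; answer is mid or to its left

(define timestamps (vector 10 20 30 40))
(lower-bound timestamps 25 <) ; value not present: insertion position
(lower-bound timestamps 5 <)  ; smaller than everything: 0
(lower-bound timestamps 50 <) ; greater than everything: the vector length
```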
procedure
(df-ref df index series) → any/c
df : data-frame? index : index/c series : string?
procedure
(df-ref* df index series ...) → vector?
df : data-frame? index : index/c series : string?
procedure
(df-set! df index value series) → any/c
df : data-frame? index : index/c value : any/c series : string?
procedure
(df-lookup df base-series series value) → any/c
df : data-frame? base-series : string? series : (or/c string? (listof string?)) value : any/c
procedure
(df-lookup* df base-series series value ...) → list?
df : data-frame? base-series : string? series : (or/c string? (listof string?)) value : any/c
df-lookup* allows looking up multiple values and will return a list of the corresponding values.
These functions combine df-index-of and df-ref into a single function.
procedure
(df-lookup/interpolated df base-series series value [ #:interpolate interpolate]) → any/c df : data-frame? base-series : string? series : (or/c string? (listof string?)) value : any/c interpolate : (-> real? any/c any/c any/c) = (lambda (t v1 v2) (+ (* t v1) (* (- 1 t) v2)))
An interpolation function can be specified, if the default one is not sufficient. This function is called once for each series in the result (i.e. it interpolates values one by one).
procedure
(df-map df series fn [ #:start start #:stop stop]) → vector? df : data-frame? series : (or/c string? (listof string?)) fn : mapfn/c start : index/c = 0 stop : index/c = (df-row-count df)
fn is a function of either one or two arguments. If fn is a function of one argument, it is called with the values from all series as a single vector. If fn is a function of two arguments, it is called with the previous and current set of values, as vectors (this allows calculating "delta" values), i.e. fn is invoked as (fn prev current). In this case, it will be invoked as (fn #f current) for the first element of the iteration.
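The two-argument calling convention can be sketched without the package. Below, `map-with-previous` is a hypothetical helper operating on a vector of rows (each row a vector of timestamp and distance); it mimics the (fn prev current) protocol to compute a speed from consecutive samples.

```racket
#lang racket

;; Call `fn` as (fn prev current) for each row; prev is #f for the first row.
(define (map-with-previous rows fn)
  (for/vector #:length (vector-length rows)
              ([i (in-range (vector-length rows))])
    (fn (and (> i 0) (vector-ref rows (sub1 i)))
        (vector-ref rows i))))

;; Each row is #(timestamp distance).
(define rows (vector (vector 0 0) (vector 1 3) (vector 3 9)))

;; Speed as delta distance over delta time; no value for the first row.
(define (speed prev current)
  (and prev
       (/ (- (vector-ref current 1) (vector-ref prev 1))
          (- (vector-ref current 0) (vector-ref prev 0)))))

(map-with-previous rows speed)
```

The first element of the result is #f because there is no previous row to compute a delta from.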
procedure
(df-for-each df series fn [ #:start start #:stop stop]) → void df : data-frame? series : (or/c string? (listof string?)) fn : mapfn/c start : index/c = 0 stop : index/c = (df-row-count df)
procedure
(df-fold df series init-value fn [ #:start start #:stop stop]) → any/c df : data-frame? series : (or/c string? (listof string?)) init-value : any/c fn : foldfn/c start : index/c = 0 stop : index/c = (df-row-count df)
fn is a function of either two or three arguments. If fn is a function of two arguments, it is called with the fold value plus the values from all series, passed in as a single vector. If fn is a function of three arguments, it is called with the fold value plus the previous and current set of values, as vectors (this allows calculating "delta" values), i.e. fn is invoked as (fn val prev current). In this case, it will be invoked as (fn init-val #f current) for the first element of the iteration.
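The three-argument fold protocol can be sketched in the same way. `fold-with-previous` and `total-ascent` are hypothetical helpers working on a plain vector of altitude values; they only illustrate the (fn val prev current) calling convention.

```racket
#lang racket

;; Fold over `data`, calling `fn` as (fn acc prev current);
;; prev is #f for the first element.
(define (fold-with-previous data init-value fn)
  (for/fold ([acc init-value])
            ([i (in-range (vector-length data))])
    (fn acc
        (and (> i 0) (vector-ref data (sub1 i)))
        (vector-ref data i))))

;; Total ascent: add up only the positive altitude changes.
(define (total-ascent altitudes)
  (fold-with-previous
   altitudes 0
   (lambda (acc prev current)
     (if (and prev (> current prev))
         (+ acc (- current prev))
         acc))))

(total-ascent (vector 100 105 103 110))
```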
procedure
(df-count-na df series) → exact-nonnegative-integer?
df : data-frame? series : string?
(df-shallow-copy (-> data-frame? data-frame?)) (valid-only (-> any/c boolean?)))
5 Statistics
The following functions allow calculating statistics on data frame series. They build on top of the math/statistics module.
procedure
(df-set-default-weight-series df series) → any/c
df : data-frame? series : (or/c #f string?)
procedure
df : data-frame?
A weight series needs to be used when samples in the data frame don’t have equal weight. For example, if a parameter (e.g. heart rate) is recorded at variable intervals, simply averaging the values will not produce an accurate average. If a timer series is also present, it can be used as a weight series to produce a better average.
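The effect of a weight series can be seen in a small worked example. The `weighted-mean` helper below is hypothetical and uses plain vectors; the weights stand for the duration, in seconds, that each heart rate sample covers.

```racket
#lang racket

;; Mean of `values` where each value counts in proportion to its weight.
(define (weighted-mean values weights)
  (/ (for/sum ([v (in-vector values)] [w (in-vector weights)]) (* v w))
     (for/sum ([w (in-vector weights)]) w)))

;; 9 seconds at 100 bpm followed by 1 second at 160 bpm.
(define hr        (vector 100 160))
(define durations (vector 9 1))

(weighted-mean hr durations)    ; weighted by duration
(weighted-mean hr (vector 1 1)) ; equal weights: the plain average
```

The plain average of the two samples is 130, while weighting by duration gives 106, which better reflects that most of the time was spent at 100 bpm.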
procedure
(df-statistics df series [ #:weight-series weight-series #:start start #:stop stop]) → (or/c #f statistics?) df : data-frame? series : string? weight-series : string? = (df-get-default-weight-series df) start : exact-nonnegative-integer? = 0 stop : exact-nonnegative-integer? = (df-row-count df)
procedure
(df-quantile df series [ #:weight-series weight-series #:less-than less-than] qvalue ...) → (or/c #f (listof real?)) df : data-frame? series : string? weight-series : string? = (df-get-default-weight-series df) less-than : (-> any/c any/c boolean?) = < qvalue : (between/c 0 1)
6 Least Squares Fitting
struct
(struct least-squares-fit (type coefficients residual fn) #:extra-constructor-name make-least-squares-fit) type : (or/c 'linear 'polynomial 'power 'exponential 'logarithmic) coefficients : (listof real?) residual : (or/c #f real?) fn : (-> real? real?)
procedure
(df-least-squares-fit df xseries yseries [ #:start start #:stop stop #:mode mode #:polynomial-degree degree #:residual? residual? #:annealing? annealing? #:annealing-iterations iterations]) → least-squares-fit?
df : data-frame? xseries : string? yseries : string? start : exact-nonnegative-integer? = 0 stop : exact-nonnegative-integer? = (df-row-count df) mode : (or/c 'linear 'polynomial 'poly 'power 'exponential 'exp 'logarithmic 'log) = 'linear degree : exact-nonnegative-integer? = 2 residual? : boolean? = #f annealing? : boolean? = #f iterations : exact-nonnegative-integer? = 500
start and stop specify the start and end position in the series, by default all values are considered for the fit.
mode determines the type of the function being fitted and can have one of the following values:
'linear – a function Y = a * X + b is fitted, where a and b are the fitted coefficients; this is equivalent to fitting a 'polynomial of degree 1 (see below).
'polynomial or 'poly – a polynomial Y = a0 + a1 * X + a2 * X^2 + ... is fitted. The degree of the polynomial is specified by the degree parameter; by default this is 2.
'exponential or 'exp – a function of type Y = a * e ^ (b * X) + c is fitted. Note that this fit is not very good, and annealing needs to be used to improve it (see below).
'logarithmic or 'log – a function of type Y = a + b * ln(X) is fitted. This will only return a "real" fit function (as opposed to an imaginary one) if all values in YSERIES are positive.
'power – a function of type Y = a * X ^ b is fitted. This will only return a "real" fit function (as opposed to an imaginary one) if all values in YSERIES are positive. Note that this fit is not very good, and annealing needs to be used to improve it (see below).
residual?, when #t, indicates that the residual value is also returned in the least-squares-fit structure. Setting it to #f will avoid some unnecessary computations.
annealing?, when #t, indicates that the fit coefficients should be further refined using the annealing function. This is only used for 'exponential or 'power fit functions, as these do not produce "best fit" coefficients (I don’t know why; I am not a mathematician, I only used the formulas). Using annealing will significantly improve the fit for these functions, but will still not determine the best one. Note that the annealing algorithm is probabilistic, so applying it a second time on the same arguments will produce a slightly different result.
iterations represents the number of annealing iterations; see the #:iterations parameter of the annealing function.
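For the 'linear mode, the fitted coefficients of Y = a * X + b have a well-known closed form. The sketch below computes them for plain vectors; `linear-fit` is a hypothetical helper written for illustration, and it is not claimed to be how df-least-squares-fit is implemented.

```racket
#lang racket

;; Ordinary least squares fit of Y = a * X + b; returns (values a b).
;; Assumes at least two samples with distinct X values.
(define (linear-fit xs ys)
  (define n   (vector-length xs))
  (define sx  (for/sum ([x (in-vector xs)]) x))
  (define sy  (for/sum ([y (in-vector ys)]) y))
  (define sxx (for/sum ([x (in-vector xs)]) (* x x)))
  (define sxy (for/sum ([x (in-vector xs)] [y (in-vector ys)]) (* x y)))
  (define a (/ (- (* n sxy) (* sx sy))
               (- (* n sxx) (* sx sx))))
  (define b (/ (- sy (* a sx)) n))
  (values a b))

;; Samples lying exactly on Y = 2 * X + 1.
(define-values (slope intercept)
  (linear-fit (vector 0 1 2 3) (vector 1 3 5 7)))
```

For samples that lie exactly on a line, as here, the fit recovers the slope and intercept of that line.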
7 Histograms and histogram plots
procedure
(df-histogram df series [ #:weight-series weight-series #:bucket-width bucket-width #:trim-outliers trim-outliers #:include-zeroes? include-zeroes? #:as-percentage? as-percentage?]) → (or/c #f histogram/c) df : data-frame? series : string?
weight-series : (or/c #f string?) = (df-get-default-weight-series df) bucket-width : real? = 1 trim-outliers : (or/c #f (between/c 0 1)) = #f include-zeroes? : boolean? = #t as-percentage? : boolean? = #f
weight-series specifies the series to be used for weighting the samples. By default, it uses the 'weight property stored in the data-frame, see df-set-default-weight-series. Use #f for no weighting; in this case, each sample will have a weight of 1.
bucket-width specifies the width of each histogram slot. Samples in the data series are grouped together into slots, which are from 0 to bucket-width, then from bucket-width to (* 2 bucket-width) and so on. The bucket-width value can be less than 1.0.
trim-outliers specifies that slots containing less than the specified percentage of values should be removed from both ends of the histogram. When #f, no slots are trimmed.
include-zeroes? specifies whether samples with a slot of 0 are included in the histogram or not. Note that slot 0 contains samples from 0 to bucket-width.
as-percentage? determines if the data in the histogram represents a percentage (all ranks add up to 100) or it is the rank of each slot.
In the resulting histogram, samples that are numbers or strings will be sorted. In addition, if the samples are numbers, empty slots will be created so that the buckets are also consecutive.
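The grouping of samples into slots by bucket-width can be sketched as follows. `bucket-counts` is a hypothetical helper on a plain vector; it ignores weighting and outlier trimming, and it assumes that a sample falling exactly on a slot boundary belongs to the higher slot.

```racket
#lang racket

;; Count samples per slot; slot k covers values from (* k bucket-width)
;; up to, but not including, (* (+ k 1) bucket-width).
(define (bucket-counts samples bucket-width)
  (for/fold ([counts (hash)])
            ([s (in-vector samples)])
    (define slot (exact-floor (/ s bucket-width)))
    (hash-update counts slot add1 0)))

(bucket-counts (vector 1 4 5 9 12) 5)
```

With a bucket width of 5, the samples 1 and 4 land in slot 0, the samples 5 and 9 in slot 1, and 12 in slot 2.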
procedure
(histogram-renderer histogram [ #:color color #:skip skip #:x-min x-min #:label label #:blank-some-labels blank-some-labels? #:x-value-formatter formatter]) → (treeof renderer2d?) histogram : histogram/c color : any/c = #f skip : real? = (discrete-histogram-skip) x-min : real? = 0 label : string? = #f blank-some-labels? : boolean? = #t formatter : (or/c #f (-> number? string?)) = #f
color determines the color of the histogram bars.
label specifies the label to use for this plot renderer.
skip and x-min are used to plot dual histograms, see histogram-renderer/dual.
All the above arguments are passed directly to discrete-histogram.
blank-some-labels? controls whether some of the labels are blanked out if the plot contains too many values; this can produce a nicer looking plot.
formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatting. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".
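A formatter for the running pace example might look like the sketch below. `pace->string` is a hypothetical name; it assumes the histogram values are a pace expressed in seconds.

```racket
#lang racket

;; Format a pace given in seconds as "M:SS".
(define (pace->string seconds)
  (define-values (minutes secs)
    (quotient/remainder (exact-round seconds) 60))
  (format "~a:~a" minutes (~r secs #:min-width 2 #:pad-string "0")))

(pace->string 300) ; the label "5:00"
(pace->string 275) ; the label "4:35"
```

A function of this shape satisfies the (-> number? string?) contract and could be passed as the #:x-value-formatter argument.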
procedure
(histogram-renderer/dual combined-histogram label1 label2 [ #:color1 color1 #:color2 color2 #:x-value-formatter formatter]) → (treeof renderer2d?) combined-histogram : combined-histogram/c label1 : string? label2 : string? color1 : any/c = #f color2 : any/c = #f formatter : (or/c #f (-> number? string?)) = #f
label1 and color1 represent the label and colors for the first histogram, label2 and color2 represent the label and colors to use for the second histogram.
formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatting. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".
procedure
(histogram-renderer/factors histogram factor-fn factor-colors [ #:x-value-formatter formatter]) → (treeof renderer2d?) histogram : histogram/c factor-fn : (-> real? symbol?) factor-colors : (listof (cons/c symbol? color/c)) formatter : (or/c #f (-> number? string?)) = #f
formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatting. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".
8 GPX Files
(require data-frame/gpx) | package: data-frame |
This module provides functions for reading and writing data frames using the GPS Exchange Format (GPX).
procedure
input : (or/c path-string? input-port?)
"lat" and "lon" series representing the latitude and longitude of each point
"timestamp" series representing the UTC timestamp in seconds for each point. The series will also be marked as sorted, if it is actually sorted.
"dst" representing a distance from the start. If distance data is not present in the GPX file, this series will be calculated from the GPX coordinates. The series will be marked as sorted, if it is actually sorted.
"hr" representing heart rate measurements
"cad" representing cadence measurements
"pwr" representing power measurements, in watts
"spd" representing the speed
The data frame will also have the following properties:
a 'name property containing the name of the track segment, if this is present in the GPX file.
a 'waypoints property containing a list of waypoints, if the GPX track has any. Each waypoint is represented as a list of TIMESTAMP, LAT, LON, ELEVATION and NAME.
a 'laps property containing a list of timestamps corresponding to each waypoint in the waypoint list. The laps property cannot be constructed correctly if the waypoints are missing a timestamp property.
All the track segments in the GPX file will be concatenated.
procedure
(df-write/gpx df output #:name name) → any/c
df : data-frame? output : (or/c path-string? output-port?) name : (or/c #f string?)
The entire GPS track is exported as a single track segment.
The laps property, if present, is assumed to contain a list of timestamps and the positions corresponding to these timestamps are exported as way points.
The name of the segment can be specified as the name parameter. If this is #f, the 'name property in the data frame is consulted; if that one is missing, a default track name is used.
9 TCX Files
(require data-frame/tcx) | package: data-frame |
This module provides functions for reading Training Center XML (TCX) files into data frames.
procedure
input : (or/c path-string? input-port?)
"lat" and "lon" series representing the latitude and longitude of each point
"timestamp" series representing the UTC timestamp in seconds for each point. The series will also be marked as sorted, if it is actually sorted.
"dst" representing a distance from the start. If distance data is not present in the file, this series will be calculated from the coordinates. The series will be marked as sorted, if it is actually sorted.
"hr" representing heart rate measurements
"cad" representing cadence measurements
"pwr" representing power measurements, in watts
"spd" representing the speed
The data frame may also have the following properties (if they are present in the TCX document):
'unit-id the serial number of the device which recorded the activity.
'product-id the product id for the device that recorded the activity (identifies the device type).
'sport the sport for the activity. This is a free form string, but the TCX format usually uses the strings "Running" for running activities and "Biking" for biking activities.
a 'laps property containing a list of timestamps corresponding to the start of each lap in the activity.
procedure
(df-read/tcx/multiple input) → (listof data-frame?)
input : (or/c path-string? input-port?)