1 Module rml/data
(require rml/data)  package: rml-core
This module deals with two opaque structure types, data-set and data-set-field. These are not available to clients directly, although certain accessors are exported by this module. Conceptually, a data-set is a table of data whose columns represent fields of two kinds: features, which describe properties of an instance, and classifiers (or labels), which are used to train and match instances.
> (require rml/data)
> (define dataset
    (load-data-set "test/iris_training_data.csv"
                   'csv
                   (list (make-feature "sepal-length" #:index 0)
                         (make-feature "sepal-width" #:index 1)
                         (make-feature "petal-length" #:index 2)
                         (make-feature "petal-width" #:index 3)
                         (make-classifier "classification" #:index 4))))
> (displayln (data-set? dataset))
#t
> (displayln (features dataset))
(sepal-length sepal-width petal-length petal-width)
> (displayln (classifiers dataset))
(classification)
> (displayln (partition-count dataset))
1
> (displayln (data-count dataset))
135
> (displayln (classifier-product dataset))
(Iris-versicolor Iris-setosa Iris-virginica)
In this code block a training data set is loaded and the columns within the CSV data are described.
1.1 Types and Predicates
predicate
(data-set-field? a) → boolean?
a : any
predicate
(partition-id? a) → boolean?
a : any
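As an illustrative sketch (the return values shown are assumed, not captured from a live session), the predicates simply test whether a value is one of the module's field structures or a valid partition identifier:

> (displayln (data-set-field? (make-feature "sepal-length")))  ; a constructed field
#t
> (displayln (data-set-field? "sepal-length"))                 ; a plain string is not a field
#f
> (displayln (partition-id? default-partition))                ; a documented partition identifier
#t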
1.2 Construction
procedure
(load-data-set file-name format fields) → data-set?
file-name : string?
format : symbol?
fields : (listof data-set-field?)
value
supported-formats : (listof symbol?)
constructor
(make-feature name [#:index index]) → data-set-field?
name : string?
index : integer? = 0
constructor
(make-classifier name [#:index index]) → data-set-field?
name : string?
index : integer? = 0
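A sketch of putting the constructors together; the file path is the one used in the introduction, and the field indices simply mirror the CSV column order:

> (displayln (if (member 'csv supported-formats) 'csv-supported 'csv-unsupported))
csv-supported
> (define iris-fields
    (list (make-feature "sepal-length" #:index 0)
          (make-feature "sepal-width" #:index 1)
          (make-feature "petal-length" #:index 2)
          (make-feature "petal-width" #:index 3)
          (make-classifier "classification" #:index 4)))
> (define dataset
    (load-data-set "test/iris_training_data.csv" 'csv iris-fields))
> (displayln (data-set? dataset))
#t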
1.3 Accessors
accessor
(classifiers dataset) → (listof string?)
dataset : data-set?
accessor
(classifier-product dataset) → (listof string?)
dataset : data-set?
accessor
(data-count dataset) → exact-nonnegative-integer?
dataset : data-set?
accessor
(feature-vector dataset partition-id feature-name) → (vectorof number?)
dataset : data-set?
partition-id : exact-nonnegative-integer?
feature-name : string?
accessor
(partition-count dataset) → exact-nonnegative-integer?
dataset : data-set?
accessor
(partition dataset partition-id) → (vectorof vector?)
dataset : data-set?
partition-id : exact-nonnegative-integer?
value
default-partition : exact-nonnegative-integer?
value
test-partition : exact-nonnegative-integer?
value
training-partition : exact-nonnegative-integer?
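As a sketch, the accessors above can be combined to pull a single feature column out of a partition. This assumes the dataset from the introduction, where the single default partition holds every row, so the vector length equals the data count:

> (define sepal-lengths
    (feature-vector dataset default-partition "sepal-length"))
> (displayln (vector-length sepal-lengths))        ; one value per row in the partition
135
> (displayln (= (vector-length sepal-lengths) (data-count dataset)))
#t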
1.4 Transformations
The following procedures perform transformations on one or more data-set structures and return a new data-set. These are typically concerned with partitioning a data set or optimizing the feature vectors.
procedure
(partition-equally partition-count [entropy-features]) → data-set?
partition-count : exact-positive-integer?
entropy-features : (listof string?) = '()
procedure
(partition-for-test test-percentage [entropy-features]) → data-set?
test-percentage : (real-in 1.0 50.0)
entropy-features : (listof string?) = '()
If specified, the entropy-features list denotes the names of features, or classifiers, that should be randomly spread across partitions.
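The signatures above do not show a data-set argument, but since these procedures transform an existing data-set, the sketch below assumes the data-set is passed as the first argument. It splits the introduction's dataset roughly 70/30, spreading the classifier values randomly across the two partitions:

> (define partitioned
    ; assumes the data-set is the first argument (see note above)
    (partition-for-test dataset 30.0 (list "classification")))
> (displayln (partition-count partitioned))        ; a test and a training partition
2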
parameter
(minimum-partition-data-total) → exact-positive-integer?
(minimum-partition-data-total partition-data-count) → void?
partition-data-count : exact-positive-integer?
= 100
parameter
(minimum-partition-data) → exact-positive-integer?
(minimum-partition-data partition-data-count) → void?
partition-data-count : exact-positive-integer?
= 100
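Both values are ordinary Racket parameters, so they can be read directly or adjusted temporarily with parameterize; a minimal sketch:

> (displayln (minimum-partition-data-total))       ; the documented default
100
> (parameterize ([minimum-partition-data-total 50]
                 [minimum-partition-data 10])
    ; within this dynamic extent the lower thresholds apply
    (displayln (minimum-partition-data-total)))
50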
1.5 Snapshots
Loading and manipulating data sets from source files may not always be efficient, so the parsed in-memory format can be saved and loaded externally. These saved forms are termed snapshots; they are serialized forms of the data-set structure.
io
(write-snapshot dataset out) → void?
dataset : data-set? out : output-port?
io
(read-snapshot dataset in) → data-set?
dataset : data-set? in : input-port?
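A sketch of a snapshot round trip, following the signatures above; the file name is purely illustrative and the dataset is the one loaded in the introduction:

> (call-with-output-file "iris.snapshot"
    (lambda (out) (write-snapshot dataset out))    ; serialize the in-memory data-set
    #:exists 'replace)
> (define restored
    (call-with-input-file "iris.snapshot"
      (lambda (in) (read-snapshot dataset in))))   ; read it back as a data-set
> (displayln (data-set? restored))
#t
> (displayln (= (data-count restored) (data-count dataset)))
#t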