4 Module rml/statistics
(require rml/statistics) | package: rml-core
This module computes summary statistics over the values of each feature in a data set. It assumes that features are numeric and uses the math/statistics module for the actual calculations.
> (require rml/data)
> (define iris-data-set
    (load-data-set "test/iris_training_data.csv"
                   'csv
                   (list
                     (make-feature "sepal-length" #:index 0)
                     (make-feature "sepal-width" #:index 1)
                     (make-feature "petal-length" #:index 2)
                     (make-feature "petal-width" #:index 3)
                     (make-classifier "classification" #:index 4))))
> (define stats (compute-statistics iris-data-set))
> stats
'#hash(("petal-width" . #<future>)
("petal-length" . #<future>)
("sepal-length" . #<future>)
("sepal-width" . #<future>))
> (feature-statistics stats "sepal-length")
(statistics 4.4 7.9 135.0 ...)
> (standardize-statistics iris-data-set stats)
#<data-set>
predicate
(statistics-hash? a) → boolean?
a : any?
procedure
(compute-statistics dataset [feature-names]) → statistics-hash?
dataset : data-set?
feature-names : (or/c #f (listof string?)) = #f
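Computes statistics for each feature in dataset, optionally restricted to the features named in feature-names. The per-feature computations are performed concurrently. The result is a hash from feature names (strings) to statistics structures; a value may instead be a future if its computation has not yet completed. The feature-statistics accessor always returns a statistics structure.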
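The future-or-value behavior described above can be illustrated with a small sketch. It is not the rml-core implementation: `columns` and `feature-statistics-sketch` are hypothetical names, and the sketch assumes each feature's values are available as a plain list. It relies on `future` and `touch` from racket/future and on `update-statistics*` from math/statistics.

```racket
#lang racket
(require racket/future math/statistics)

;; Hypothetical stand-in for a data set: feature name -> list of values.
(define columns
  (hash "sepal-length" '(5.1 4.9 4.7 4.6)
        "sepal-width"  '(3.5 3.0 3.2 3.1)))

;; Start one future per feature; each one folds its column into a
;; statistics struct.
(define stats-hash
  (for/hash ([(name xs) (in-hash columns)])
    (values name
            (future (lambda () (update-statistics* empty-statistics xs))))))

;; Hypothetical accessor: touch a pending future so that callers always
;; receive a statistics struct, never a future.
(define (feature-statistics-sketch h name)
  (define v (hash-ref h name))
  (if (future? v) (touch v) v))

(define sepal-length (feature-statistics-sketch stats-hash "sepal-length"))
(statistics-min sepal-length)
(statistics-max sepal-length)
(statistics-count sepal-length)
```

Whether the work actually runs in parallel depends on the operations involved; `touch` yields the finished result either way, which is why an accessor of this shape can hide the futures from callers.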
accessor
(feature-statistics stats-hash feature-name) → statistics?
stats-hash : statistics-hash?
feature-name : string?
transform
(standardize-statistics dataset statistics-hash) → data-set?
dataset : data-set?
statistics-hash : statistics-hash?
From Scholarpedia:
… removes scale effects caused by use of features with different measurement scales. For example, if one feature is based on patient weight in units of kg and another feature is based on blood protein values in units of ng/dL in the range [-3,3], then patient weight will have a much greater influence on the distance between samples and may bias the performance of the classifier. Standardization transforms raw feature values into z-scores using the mean and standard deviation of a feature values over all input samples
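As a small illustration of the z-score transformation in the quotation, the sketch below standardizes one list of values as z = (x - mean) / stddev. `standardize-column` is a hypothetical helper, not part of rml/statistics. It assumes `mean` and `stddev` from math/statistics with their default settings (no bias correction) and a column that is not constant, so the standard deviation is nonzero.

```racket
#lang racket
(require math/statistics)

;; Hypothetical helper: convert raw feature values to z-scores.
(define (standardize-column xs)
  (define mu (mean xs))        ; mean of the column
  (define sigma (stddev xs))   ; standard deviation, no bias correction
  (for/list ([x (in-list xs)])
    (/ (- x mu) sigma)))

(define raw '(2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0))
(define z (standardize-column raw))
z
```

After the transformation the values have mean 0 and standard deviation 1, so features measured on different scales contribute comparably to distance calculations.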