foldl-statistics-0.1.5.0: Statistical functions from the statistics package implemented as Folds.

Copyright(c) 2011 Bryan O'Sullivan 2016 National ICT Australia 2018 CSIRO
LicenseBSD3
MaintainerAlex.Mason@data61.csiro.au
Stabilityexperimental
Portabilityportable
Safe HaskellNone
LanguageHaskell2010

Control.Foldl.Statistics

Contents

Description

 

Synopsis

Introduction

Statistical functions from the Statistics.Sample module of the statistics package by Bryan O'Sullivan, implemented as Folds from the foldl package.

This allows many statistics to be computed concurrently with at most two passes over the data, usually by computing the mean first, and passing it to further Folds.

range :: Fold Double Double Source #

The difference between the largest and smallest elements of a sample.

sum' :: Fold Double Double Source #

A numerically stable sum using Kahan-Babuška-Neumaier summation from Numeric.Sum

histogram :: Ord a => Fold a (Map a Int) Source #

Create a histogram of each value of type a. Useful for folding over categorical values, for example, a CSV where you have a data type for a selection of categories.

It should not be used for continuous values which would lead to a high number of keys. One way to avoid this is to use the Profunctor instance for Fold to break your values into categories. For an example of doing this, see ordersOfMagnitude.

histogram' :: (Hashable a, Eq a) => Fold a (HashMap a Int) Source #

Like histogram, but for use when hashmaps would be more efficient for the particular type a.

ordersOfMagnitude :: Fold Double (Map Double Int) Source #

Provides a histogram of the orders of magnitude of the values in a series. Negative values are placed in the 0.0 category due to the behaviour of logBase. it may be useful to use lmap abs on this Fold to get a histogram of the absolute magnitudes.

Statistics of location

mean :: Fold Double Double Source #

Arithmetic mean. This uses Kahan-Babuška-Neumaier summation, so is more accurate than welfordMean unless the input values are very large.

Since foldl-1.2.2, Foldl exports a mean function, so you will have to hide one.

welfordMean :: Fold Double Double Source #

Arithmetic mean. This uses Welford's algorithm to provide numerical stability, using a single pass over the sample data.

Compared to mean, this loses a surprising amount of precision unless the inputs are very large.

meanWeighted :: Fold (Double, Double) Double Source #

Arithmetic mean for weighted sample. It uses a single-pass algorithm analogous to the one used by welfordMean.

geometricMean :: Fold Double Double Source #

Geometric mean of a sample containing no negative values.

Statistics of dispersion

The variance—and hence the standard deviation—of a sample of fewer than two elements are both defined to be zero.

Many of these Folds take the mean as an argument for constructing the variance, and as such require two passes over the data.

Functions over central moments

centralMoment :: Int -> Double -> Fold Double Double Source #

Compute the kth central moment of a sample. The central moment is also known as the moment about the mean.

This function requires the mean of the data to compute the central moment.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

centralMoments :: Int -> Int -> Double -> Fold Double (Double, Double) Source #

Compute the kth and jth central moments of a sample.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

centralMoments' :: Int -> Int -> Double -> Fold Double (Double, Double) Source #

Compute the kth and jth central moments of a sample.

This fold requires the mean of the data to be known.

This variation of centralMoments uses Kahan-Babuška-Neumaier summation to attempt to improve the accuracy of results, which may make computation slower.

skewness :: Double -> Fold Double Double Source #

Compute the skewness of a sample. This is a measure of the asymmetry of its distribution.

A sample with negative skew is said to be left-skewed. Most of its mass is on the right of the distribution, with the tail on the left.

skewness $ U.to [1,100,101,102,103]
==> -1.497681449918257

A sample with positive skew is said to be right-skewed.

skewness $ U.to [1,2,3,4,100]
==> 1.4975367033335198

A sample's skewness is not defined if its variance is zero.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

kurtosis :: Double -> Fold Double Double Source #

Compute the excess kurtosis of a sample. This is a measure of the "peakedness" of its distribution. A high kurtosis indicates that more of the sample's variance is due to infrequent severe deviations, rather than more frequent modest deviations.

A sample's excess kurtosis is not defined if its variance is zero.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

Functions requiring the mean to be known (numerically robust)

These functions use the compensated summation algorithm of Chan et al. for numerical robustness, but require two passes over the sample data as a result.

variance :: Double -> Fold Double Double Source #

Maximum likelihood estimate of a sample's variance. Also known as the population variance, where the denominator is n.

varianceUnbiased :: Double -> Fold Double Double Source #

Unbiased estimate of a sample's variance. Also known as the sample variance, where the denominator is n-1.

stdDev :: Double -> Fold Double Double Source #

Standard deviation. This is simply the square root of the unbiased estimate of the variance.

varianceWeighted :: Double -> Fold (Double, Double) Double Source #

Weighted variance. This is biased estimation. Requires the weighted mean of the input data.

Single-pass functions (faster, less safe)

The functions prefixed with the name fast below perform a single pass over the sample data using Knuth's algorithm. They usually work well, but see below for caveats. These functions are subject to fusion and do not require the mean to be passed.

Note: in cases where most sample data is close to the sample's mean, Knuth's algorithm gives inaccurate results due to catastrophic cancellation.

fastVariance :: Fold Double Double Source #

Maximum likelihood estimate of a sample's variance.

fastVarianceUnbiased :: Fold Double Double Source #

Maximum likelihood estimate of a sample's variance.

fastStdDev :: Fold Double Double Source #

Standard deviation. This is simply the square root of the maximum likelihood estimate of the variance.

fastLMVSK :: Fold Double LMVSK Source #

Efficiently compute the length, mean, variance, skewness and kurtosis with a single pass.

Since: 0.1.1.0

fastLMVSKu :: Fold Double LMVSK Source #

Efficiently compute the length, mean, unbiased variance, skewness and kurtosis with a single pass.

Since: 0.1.3.0

data LMVSK Source #

When returned by fastLMVSK, contains the count, mean, variance, skewness and kurtosis of a series of samples.

Since: 0.1.1.0

Instances

Eq LMVSK Source # 

Methods

(==) :: LMVSK -> LMVSK -> Bool #

(/=) :: LMVSK -> LMVSK -> Bool #

Show LMVSK Source # 

Methods

showsPrec :: Int -> LMVSK -> ShowS #

show :: LMVSK -> String #

showList :: [LMVSK] -> ShowS #

foldLMVSKState :: Fold Double LMVSKState Source #

Performs the heavy lifting of fastLMVSK. This is exposed because the internal LMVSKState is monoidal, allowing you to run these statistics in parallel over datasets which are split and then combine the results.

Since: 0.1.2.0

getLMVSK :: LMVSKState -> LMVSK Source #

Returns the stats which have been computed in a LMVSKState.

Since: 0.1.2.0

getLMVSKu :: LMVSKState -> LMVSK Source #

Returns the stats which have been computed in a LMVSKState, with the unbiased variance.

Since: 0.1.2.0

Linear Regression

fastLinearReg :: Fold (Double, Double) LinRegResult Source #

Computes the slope, (Y) intercept and correlation of (x,y) pairs, as well as the LMVSK stats for both the x and y series.

>>> F.fold fastLinearReg $ map (\x -> (x,3*x+7)) [1..100]
LinRegResult
  {lrrSlope = 3.0
  , lrrIntercept = 7.0
  , lrrCorrelation = 100.0
  , lrrXStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 50.5
      , lmvskVariance = 833.25
      , lmvskSkewness = 0.0
      , lmvskKurtosis = -1.2002400240024003}
  , lrrYStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 158.5
      , lmvskVariance = 7499.25
      , lmvskSkewness = 0.0
      , lmvskKurtosis = -1.2002400240024003}
  }
>>> F.fold fastLinearReg $ map (\x -> (x,0.005*x*x+3*x+7)) [1..100]
LinRegResult
  {lrrSlope = 3.5049999999999994
  , lrrIntercept = -1.5849999999999795
  , lrrCorrelation = 99.93226275740273
  , lrrXStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 50.5
      , lmvskVariance = 833.25
      , lmvskSkewness = 0.0
      , lmvskKurtosis = -1.2002400240024003}
  , lrrYStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 175.4175
      , lmvskVariance = 10250.37902625
      , lmvskSkewness = 9.862971188165422e-2
      , lmvskKurtosis = -1.1923628437011482}
  }

Since: 0.1.1.0

foldLinRegState :: Fold (Double, Double) LinRegState Source #

Performs the heavy lifting for fastLinReg. Exposed because LinRegState is a Monoid, allowing statistics to be computed on datasets in parallel and combined afterwards.

Since: 0.1.4.0

getLinRegResult :: LinRegState -> LinRegResult Source #

Produces the slope, Y intercept, correlation and LMVSK stats from a LinRegState.

Since: 0.1.4.0

data LinRegResult Source #

When returned by fastLinearReg, contains the count, slope, intercept and correlation of combining (x,y) pairs.

Since: 0.1.1.0

lrrCount :: LinRegResult -> Int Source #

The number of elements which make up this LinRegResult Since: 0.1.4.1

correlation :: (Double, Double) -> (Double, Double) -> Fold (Double, Double) Double Source #

Given the mean and standard deviation of two distributions, computes the correlation between them, given the means and standard deviation of the x and y series. The results may be more accurate than those returned by fastLinearReg

References