Copyright | (c) 2011 Bryan O'Sullivan, 2016 National ICT Australia, 2018 CSIRO |
---|---|
License | BSD3 |
Maintainer | Alex.Mason@data61.csiro.au |
Stability | experimental |
Portability | portable |
Safe Haskell | None |
Language | Haskell2010 |
- range :: Fold Double Double
- sum' :: Fold Double Double
- histogram :: Ord a => Fold a (Map a Int)
- histogram' :: (Hashable a, Eq a) => Fold a (HashMap a Int)
- ordersOfMagnitude :: Fold Double (Map Double Int)
- mean :: Fold Double Double
- welfordMean :: Fold Double Double
- meanWeighted :: Fold (Double, Double) Double
- harmonicMean :: Fold Double Double
- geometricMean :: Fold Double Double
- centralMoment :: Int -> Double -> Fold Double Double
- centralMoments :: Int -> Int -> Double -> Fold Double (Double, Double)
- centralMoments' :: Int -> Int -> Double -> Fold Double (Double, Double)
- skewness :: Double -> Fold Double Double
- kurtosis :: Double -> Fold Double Double
- variance :: Double -> Fold Double Double
- varianceUnbiased :: Double -> Fold Double Double
- stdDev :: Double -> Fold Double Double
- varianceWeighted :: Double -> Fold (Double, Double) Double
- fastVariance :: Fold Double Double
- fastVarianceUnbiased :: Fold Double Double
- fastStdDev :: Fold Double Double
- fastLMVSK :: Fold Double LMVSK
- fastLMVSKu :: Fold Double LMVSK
- data LMVSK = LMVSK {
- lmvskCount :: !Int
- lmvskMean :: !Double
- lmvskVariance :: !Double
- lmvskSkewness :: !Double
- lmvskKurtosis :: !Double
- data LMVSKState
- foldLMVSKState :: Fold Double LMVSKState
- getLMVSK :: LMVSKState -> LMVSK
- getLMVSKu :: LMVSKState -> LMVSK
- fastLinearReg :: Fold (Double, Double) LinRegResult
- foldLinRegState :: Fold (Double, Double) LinRegState
- getLinRegResult :: LinRegState -> LinRegResult
- data LinRegResult = LinRegResult {
- lrrSlope :: !Double
- lrrIntercept :: !Double
- lrrCorrelation :: !Double
- lrrXStats :: !LMVSK
- lrrYStats :: !LMVSK
- data LinRegState
- lrrCount :: LinRegResult -> Int
- correlation :: (Double, Double) -> (Double, Double) -> Fold (Double, Double) Double
- module Control.Foldl
Introduction
Statistical functions from the Statistics.Sample module of the statistics package by Bryan O'Sullivan, implemented as Folds from the foldl package.
This allows many statistics to be computed concurrently with at most two passes over the data, usually by computing the mean first and passing it to further Folds.
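The two-pass pattern described above might look like the following minimal sketch. It assumes this module is importable as Control.Foldl.Statistics (the module name is not stated on this page) and uses the Applicative instance of Fold to run several statistics in the second pass; the names `sample`, `sampleMean` and `secondPass` are illustrative only.

```haskell
import qualified Control.Foldl as F
-- Assumption: the documented module is Control.Foldl.Statistics.
import Control.Foldl.Statistics (kurtosis, mean, skewness, variance)

-- Illustrative sample data.
sample :: [Double]
sample = [1, 2, 3, 4, 100]

-- First pass: compute the mean on its own.
sampleMean :: Double
sampleMean = F.fold mean sample

-- Second pass: variance, skewness and kurtosis are combined
-- applicatively, so the data is traversed only once more.
secondPass :: (Double, Double, Double)
secondPass =
  F.fold
    ((,,) <$> variance sampleMean <*> skewness sampleMean <*> kurtosis sampleMean)
    sample

main :: IO ()
main = print (sampleMean, secondPass)
```

Importing Control.Foldl qualified, as here, also sidesteps any clash between this module's mean and the one exported by Control.Foldl.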
range :: Fold Double Double
The difference between the largest and smallest elements of a sample.
sum' :: Fold Double Double
A numerically stable sum using Kahan-Babuška-Neumaier summation from Numeric.Sum.
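For readers unfamiliar with the technique, here is a standalone sketch of Kahan-Babuška-Neumaier summation using only base. It is not the Numeric.Sum implementation; the names `KBN`, `kbnAdd` and `kbnSum` are made up for this illustration.

```haskell
import Data.List (foldl')

-- | Running sum together with a compensation term that collects the
-- low-order bits lost by each floating-point addition.
data KBN = KBN !Double !Double

-- | Add one value, recording the rounding error in the compensation term.
kbnAdd :: KBN -> Double -> KBN
kbnAdd (KBN s c) x
  | abs s >= abs x = KBN s' (c + ((s - s') + x)) -- low bits of x were lost
  | otherwise      = KBN s' (c + ((x - s') + s)) -- low bits of s were lost
  where
    s' = s + x

-- | Sum a list, applying the compensation once at the end.
kbnSum :: [Double] -> Double
kbnSum xs = case foldl' kbnAdd (KBN 0 0) xs of
  KBN s c -> s + c

main :: IO ()
main = do
  let xs = [1, 1e100, 1, -1e100]
  print (sum xs)    -- naive summation loses both 1s: prints 0.0
  print (kbnSum xs) -- compensated summation: prints 2.0
```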
histogram :: Ord a => Fold a (Map a Int)
Create a histogram of each value of type a. Useful for folding over categorical values, for example, a CSV where you have a data type for a selection of categories.
It should not be used for continuous values, which would lead to a high number of keys. One way to avoid this is to use the Profunctor instance for Fold to break your values into categories. For an example of doing this, see ordersOfMagnitude.
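As a sketch of the bucketing idea, the following uses lmap from Data.Profunctor (in the profunctors package) to map continuous values onto a small set of keys before they reach histogram. The module name Control.Foldl.Statistics and the helper names `bucket` and `bucketed` are assumptions for illustration.

```haskell
import qualified Control.Foldl as F
-- Assumption: the documented module is Control.Foldl.Statistics.
import Control.Foldl.Statistics (histogram)
import Data.Map (Map)
import Data.Profunctor (lmap)

-- | Hypothetical bucketing function: round each value down to a
-- multiple of 10, so the histogram has one key per decade.
bucket :: Double -> Int
bucket x = 10 * floor (x / 10)

-- | Histogram over buckets rather than over the raw values.
bucketed :: F.Fold Double (Map Int Int)
bucketed = lmap bucket histogram

main :: IO ()
main = print (F.fold bucketed [1, 5, 12, 18, 19, 42])
```

F.premap from Control.Foldl should serve the same purpose as lmap here if you prefer not to depend on profunctors directly.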
histogram' :: (Hashable a, Eq a) => Fold a (HashMap a Int)
Like histogram, but for use when hashmaps would be more efficient for the particular type a.
ordersOfMagnitude :: Fold Double (Map Double Int)
Provides a histogram of the orders of magnitude of the values in a series. Negative values are placed in the 0.0 category due to the behaviour of logBase. It may be useful to use lmap abs on this Fold to get a histogram of the absolute magnitudes.
Statistics of location
mean :: Fold Double Double
Arithmetic mean. This uses Kahan-Babuška-Neumaier summation, so it is more accurate than welfordMean unless the input values are very large.
Since foldl-1.2.2, Control.Foldl also exports a mean function, so you will have to hide one of the two.
welfordMean :: Fold Double Double
Arithmetic mean. This uses Welford's algorithm to provide numerical stability, using a single pass over the sample data.
Compared to mean, this loses a surprising amount of precision unless the inputs are very large.
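The core of Welford's running-mean update can be sketched with base alone. This is an illustration of the technique rather than the library's code; `Running`, `step` and `runningMean` are hypothetical names.

```haskell
import Data.List (foldl')

-- | Number of elements seen so far and the current running mean.
data Running = Running !Int !Double

-- | Welford's update: move the mean towards the new value by 1/n of
-- the difference, avoiding a large intermediate sum.
step :: Running -> Double -> Running
step (Running n m) x = Running n' (m + (x - m) / fromIntegral n')
  where
    n' = n + 1

-- | Single-pass mean; returns 0 for an empty list in this sketch.
runningMean :: [Double] -> Double
runningMean xs = case foldl' step (Running 0 0) xs of
  Running _ m -> m

main :: IO ()
main = print (runningMean [1, 2, 3, 4]) -- prints 2.5
```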
meanWeighted :: Fold (Double, Double) Double
Arithmetic mean for a weighted sample. It uses a single-pass algorithm analogous to the one used by welfordMean.
geometricMean :: Fold Double Double
Geometric mean of a sample containing no negative values.
Statistics of dispersion
The variance, and hence the standard deviation, of a sample of fewer than two elements is defined to be zero.
Many of these Folds take the mean as an argument for constructing the variance, and as such require two passes over the data.
Functions over central moments
centralMoment :: Int -> Double -> Fold Double Double
Compute the kth central moment of a sample. The central moment is also known as the moment about the mean.
This function requires the mean of the data to compute the central moment.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
centralMoments :: Int -> Int -> Double -> Fold Double (Double, Double)
Compute the kth and jth central moments of a sample.
This fold requires the mean of the data to be known.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
centralMoments' :: Int -> Int -> Double -> Fold Double (Double, Double)
Compute the kth and jth central moments of a sample.
This fold requires the mean of the data to be known.
This variation of centralMoments uses Kahan-Babuška-Neumaier summation to attempt to improve the accuracy of results, which may make computation slower.
skewness :: Double -> Fold Double Double
Compute the skewness of a sample. This is a measure of the asymmetry of its distribution.
A sample with negative skew is said to be left-skewed. Most of its mass is on the right of the distribution, with the tail on the left.
skewness $ U.to [1,100,101,102,103] ==> -1.497681449918257
A sample with positive skew is said to be right-skewed.
skewness $ U.to [1,2,3,4,100] ==> 1.4975367033335198
A sample's skewness is not defined if its variance is zero.
This fold requires the mean of the data to be known.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
kurtosis :: Double -> Fold Double Double
Compute the excess kurtosis of a sample. This is a measure of the "peakedness" of its distribution. A high kurtosis indicates that more of the sample's variance is due to infrequent severe deviations, rather than more frequent modest deviations.
A sample's excess kurtosis is not defined if its variance is zero.
This fold requires the mean of the data to be known.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
Functions requiring the mean to be known (numerically robust)
These functions use the compensated summation algorithm of Chan et al. for numerical robustness, but require two passes over the sample data as a result.
variance :: Double -> Fold Double Double
Maximum likelihood estimate of a sample's variance. Also known as the population variance, where the denominator is n.
varianceUnbiased :: Double -> Fold Double Double
Unbiased estimate of a sample's variance. Also known as the sample variance, where the denominator is n-1.
stdDev :: Double -> Fold Double Double
Standard deviation. This is simply the square root of the unbiased estimate of the variance.
varianceWeighted :: Double -> Fold (Double, Double) Double
Weighted variance. This is a biased estimate. Requires the weighted mean of the input data.
Single-pass functions (faster, less safe)
The functions prefixed with the name fast below perform a single pass over the sample data using Knuth's algorithm. They usually work well, but see below for caveats. These functions are subject to fusion and do not require the mean to be passed.
Note: in cases where most sample data is close to the sample's mean, Knuth's algorithm gives inaccurate results due to catastrophic cancellation.
fastVarianceUnbiased :: Fold Double Double
Unbiased estimate of a sample's variance.
fastStdDev :: Fold Double Double
Standard deviation. This is simply the square root of the maximum likelihood estimate of the variance.
fastLMVSK :: Fold Double LMVSK
Efficiently compute the length, mean, variance, skewness and kurtosis with a single pass.
Since: 0.1.1.0
fastLMVSKu :: Fold Double LMVSK
Efficiently compute the length, mean, unbiased variance, skewness and kurtosis with a single pass.
Since: 0.1.3.0
data LMVSK
When returned by fastLMVSK, contains the count, mean, variance, skewness and kurtosis of a series of samples.
Since: 0.1.1.0
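LMVSK
- lmvskCount :: !Int
- lmvskMean :: !Double
- lmvskVariance :: !Double
- lmvskSkewness :: !Double
- lmvskKurtosis :: !Double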
foldLMVSKState :: Fold Double LMVSKState
Performs the heavy lifting of fastLMVSK. This is exposed because the internal LMVSKState is monoidal, allowing you to run these statistics in parallel over datasets which are split and then combine the results.
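A rough sketch of the split-and-combine pattern follows. It assumes the module is importable as Control.Foldl.Statistics and relies on the Monoid instance for LMVSKState mentioned above; `combinedStats` is a hypothetical helper. The two chunks are folded sequentially here, but each fold could equally run on its own thread or machine before the states are merged.

```haskell
import qualified Control.Foldl as F
-- Assumption: the documented module is Control.Foldl.Statistics.
import Control.Foldl.Statistics (LMVSK (..), foldLMVSKState, getLMVSK)

-- | Fold each chunk to an intermediate state, merge the states with
-- mappend, then extract the final statistics once.
combinedStats :: [Double] -> [Double] -> LMVSK
combinedStats chunkA chunkB =
  getLMVSK (F.fold foldLMVSKState chunkA `mappend` F.fold foldLMVSKState chunkB)

main :: IO ()
main = print (combinedStats [1 .. 50] [51 .. 100])
```

Use getLMVSKu in place of getLMVSK if the unbiased variance is wanted.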
Since: 0.1.2.0
getLMVSK :: LMVSKState -> LMVSK
Returns the stats which have been computed in a LMVSKState.
Since: 0.1.2.0
getLMVSKu :: LMVSKState -> LMVSK
Returns the stats which have been computed in a LMVSKState, with the unbiased variance.
Since: 0.1.2.0
Linear Regression
fastLinearReg :: Fold (Double, Double) LinRegResult
Computes the slope, (Y) intercept and correlation of (x,y) pairs, as well as the LMVSK stats for both the x and y series.
>>> F.fold fastLinearReg $ map (\x -> (x,3*x+7)) [1..100]
LinRegResult {lrrSlope = 3.0, lrrIntercept = 7.0, lrrCorrelation = 100.0, lrrXStats = LMVSK {lmvskCount = 100, lmvskMean = 50.5, lmvskVariance = 833.25, lmvskSkewness = 0.0, lmvskKurtosis = -1.2002400240024003}, lrrYStats = LMVSK {lmvskCount = 100, lmvskMean = 158.5, lmvskVariance = 7499.25, lmvskSkewness = 0.0, lmvskKurtosis = -1.2002400240024003}}
>>> F.fold fastLinearReg $ map (\x -> (x,0.005*x*x+3*x+7)) [1..100]
LinRegResult {lrrSlope = 3.5049999999999994, lrrIntercept = -1.5849999999999795, lrrCorrelation = 99.93226275740273, lrrXStats = LMVSK {lmvskCount = 100, lmvskMean = 50.5, lmvskVariance = 833.25, lmvskSkewness = 0.0, lmvskKurtosis = -1.2002400240024003}, lrrYStats = LMVSK {lmvskCount = 100, lmvskMean = 175.4175, lmvskVariance = 10250.37902625, lmvskSkewness = 9.862971188165422e-2, lmvskKurtosis = -1.1923628437011482}}
Since: 0.1.1.0
foldLinRegState :: Fold (Double, Double) LinRegState
Performs the heavy lifting for fastLinearReg. Exposed because LinRegState is a Monoid, allowing statistics to be computed on datasets in parallel and combined afterwards.
Since: 0.1.4.0
getLinRegResult :: LinRegState -> LinRegResult
Produces the slope, Y intercept, correlation and LMVSK stats from a LinRegState.
Since: 0.1.4.0
data LinRegResult
When returned by fastLinearReg, contains the count, slope, intercept and correlation of the combined (x,y) pairs.
Since: 0.1.1.0
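LinRegResult
- lrrSlope :: !Double
- lrrIntercept :: !Double
- lrrCorrelation :: !Double
- lrrXStats :: !LMVSK
- lrrYStats :: !LMVSK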
data LinRegState
The monoidal state used to compute linear regression; see fastLinearReg.
Since: 0.1.4.0
lrrCount :: LinRegResult -> Int
The number of elements which make up this LinRegResult.
Since: 0.1.4.1
correlation :: (Double, Double) -> (Double, Double) -> Fold (Double, Double) Double
Given the means and standard deviations of the x and y series, computes the correlation between them. The results may be more accurate than those returned by fastLinearReg.
References
- Chan, T. F.; Golub, G.H.; LeVeque, R.J. (1979) Updating formulae and a pairwise algorithm for computing sample variances. Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University. ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
- Knuth, D.E. (1998) The art of computer programming, volume 2: seminumerical algorithms, 3rd ed., p. 232.
- Welford, B.P. (1962) Note on a method for calculating corrected sums of squares and products. Technometrics 4(3):419–420. http://www.jstor.org/stable/1266577
- West, D.H.D. (1979) Updating mean and variance estimates: an improved method. Communications of the ACM 22(9):532–535. http://doi.acm.org/10.1145/359146.359153
- Cook, J.D. Computing skewness and kurtosis in one pass. http://www.johndcook.com/blog/skewness_kurtosis/
module Control.Foldl