dstats.regress
A module for performing linear regression. This module has an unusual
interface, as it is range-based instead of matrix-based. Values for
independent variables are provided as either a tuple or a range of ranges.
This means that one can use, for example, map to fit higher-order models and
lazily evaluate certain values. (For details, see examples below.)
Author:
David Simcha
- struct
PowMap
(ExpType,T) if (isForwardRange!(T));
- PowMap!(ExpType,T)
powMap
(ExpType, T)(T range, ExpType exponent);
- Maps a forward range to a power determined at runtime. ExpType is the type
of the exponent. Using an int is faster than using a double, but obviously
less flexible.
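The idea of mapping a range to a power chosen at run time can be sketched with Phobos alone. This is a minimal illustration, not the PowMap implementation: the name lazyPow is hypothetical, and the sketch assumes std.algorithm.map plus std.math.pow are an acceptable stand-in.

```d
import std.algorithm : equal, map;
import std.math : pow;
import std.range : isForwardRange;

/// Hypothetical stand-in for powMap: lazily raises each element of
/// `range` to `exponent`. Nothing is computed until the result is iterated.
auto lazyPow(ExpType, R)(R range, ExpType exponent)
if (isForwardRange!R)
{
    // The lambda captures `exponent`, so the power is a run-time value.
    return range.map!(x => pow(cast(double) x, exponent));
}

unittest
{
    double[] xs = [1.0, 2.0, 3.0];
    assert(lazyPow(xs, 2).equal([1.0, 4.0, 9.0]));
}
```

With an int exponent the call resolves to the integer-power overload of pow, and with a double exponent to the floating-point one.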
- struct
RegressRes
;
- Struct that holds the results of a linear regression. It's a plain old
data struct.
- double[]
betas
;
- The coefficients, one for each range in X. These will be in the order
that the X ranges were passed in.
- double[]
stdErr
;
- The standard errors of the beta coefficients, in the same order as the
X ranges passed in.
- double[]
lowerBound
;
- The lower confidence bounds of the beta terms, at the confidence level
specified. (Default 0.95).
- double[]
upperBound
;
- The upper confidence bounds of the beta terms, at the confidence level
specified. (Default 0.95).
- double[]
p
;
- The P-value for the alternative that the corresponding beta value is
different from zero against the null that it is equal to zero.
- double
R2
;
- The coefficient of determination.
- double
adjustedR2
;
- The adjusted coefficient of determination.
- double
residualError
;
- The root mean square of the residuals.
- double
overallP
;
- The P-value for the model as a whole, based on an F-statistic. The
null here is that the model has no predictive value; the alternative
is that it does.
- string
toString
();
- Print out the results in the default format.
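For orientation, the two goodness-of-fit fields can be related with the textbook definitions. The sketch below assumes those standard formulas (R2 = 1 - SS_res / SS_tot, and the usual degrees-of-freedom correction with nPredictors excluding the intercept). The documentation does not state which formulas the module uses, and the helper names are hypothetical.

```d
import std.algorithm : map, sum;
import std.range : zip;

/// Textbook coefficient of determination: 1 - SS_res / SS_tot.
/// `fitted` holds the model's predictions for each element of `y`.
double rSquared(const(double)[] y, const(double)[] fitted)
{
    immutable mean = y.sum / y.length;
    // Sum of squared residuals.
    immutable ssRes = zip(y, fitted)
        .map!(t => (t[0] - t[1]) * (t[0] - t[1])).sum;
    // Total sum of squares around the mean of y.
    immutable ssTot = y.map!(v => (v - mean) * (v - mean)).sum;
    return 1.0 - ssRes / ssTot;
}

/// Textbook adjusted R^2 for n observations and nPredictors
/// non-intercept terms (an assumed convention).
double adjustedRSquared(double r2, size_t n, size_t nPredictors)
{
    return 1.0 - (1.0 - r2) * (n - 1.0) / (n - nPredictors - 1.0);
}
```

The adjustment penalizes extra predictors, so adjustedRSquared is never larger than the unadjusted value when at least one predictor is present.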
- struct
PolyFitRes
(T);
- Struct returned by polyFit.
- T
X
;
- The array of PowMap ranges created by polyFit.
- RegressRes
regressRes
;
- The rest of the results. This is alias this'd.
- struct
Residuals
(F,U,T...);
- Forward Range for holding the residuals from a regression analysis.
- Residuals!(F,U,T)
residuals
(F, U, T...)(F[] betas, U Y, T X);
- Given the beta coefficients from a linear regression, and X and Y values,
returns a range that lazily computes the residuals.
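What "lazily computes the residuals" means can be sketched for the simplest case of an intercept plus one predictor. The helper name lazyResiduals and its fixed two-coefficient layout are assumptions for illustration; the real function accepts any number of X ranges.

```d
import std.algorithm : equal, map;
import std.range : zip;

/// Hypothetical helper: residual_i = y_i - (betas[0] + betas[1] * x_i).
/// The result is a lazy range; each residual is computed on access.
auto lazyResiduals(const(double)[] betas, const(double)[] y,
                   const(double)[] x)
{
    return zip(y, x).map!(t => t[0] - (betas[0] + betas[1] * t[1]));
}

unittest
{
    // Model y = 1 + 2x; only the last point is off the line.
    auto r = lazyResiduals([1.0, 2.0], [3.0, 5.0, 8.0], [1.0, 2.0, 3.0]);
    assert(r.equal([0.0, 0.0, 1.0]));
}
```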
- double[]
linearRegressBeta
(U, T...)(U Y, T XIn);
- Perform a linear regression and return just the beta values. The advantages
to just returning the beta values are that it's faster and that each range
needs to be iterated over only once, and thus can be just an input range.
The beta values are returned such that the smallest index corresponds to
the leftmost element of X. X can be either a tuple or a range of input
ranges. Y must be an input range.
Notes:
The X ranges are traversed in lockstep, but the traversal is stopped
at the end of the shortest one. Therefore, using infinite ranges is safe.
For example, using repeat(1) to get an intercept term works.
Examples:
int[] nBeers = [8,6,7,5,3,0,9];
int[] nCoffees = [3,6,2,4,3,6,8];
int[] musicVolume = [3,1,4,1,5,9,2];
int[] programmingSkill = [2,7,1,8,2,8,1];
double[] betas = linearRegressBeta(programmingSkill, repeat(1), nBeers, nCoffees,
musicVolume, map!"a * a"(musicVolume));
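The stopping rule described in the notes matches the default behavior of std.range.zip, which makes it easy to illustrate with Phobos alone. This sketch does not show how linearRegressBeta is implemented; lockstepRows is a hypothetical helper that only counts the rows a lockstep traversal would visit.

```d
import std.range : repeat, walkLength, zip;

/// Counts the rows visited when an infinite intercept column is
/// traversed in lockstep with a finite data column.
size_t lockstepRows(const(int)[] column)
{
    // zip defaults to StoppingPolicy.shortest, so the infinite
    // repeat(1) does not prolong the traversal.
    return zip(repeat(1), column).walkLength;
}

unittest
{
    int[] nBeers = [8, 6, 7, 5, 3, 0, 9];
    assert(lockstepRows(nBeers) == 7);
}
```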
- double[]
linearRegressBetaBuf
(U, T...)(double[] buf, U Y, T XIn);
- Same as linearRegressBeta, but allows the user to specify a buffer for
the beta terms. If the buffer is too short, a new one is allocated.
Otherwise, the results are returned in the user-provided buffer.
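The buffer rule can be sketched as a small helper. The name fitInto is hypothetical, and trimming the returned slice to exactly the needed length is an assumption; the documentation only says which buffer the results end up in.

```d
/// Hypothetical helper: reuse `buf` when it is long enough,
/// otherwise allocate a fresh array of the required length.
double[] fitInto(double[] buf, size_t needed)
{
    if (buf.length < needed)
        return new double[needed]; // too short: allocate a new buffer
    return buf[0 .. needed];       // long enough: slice the caller's memory
}

unittest
{
    auto big = new double[5];
    assert(fitInto(big, 3).ptr is big.ptr);   // reused
    auto small = new double[1];
    assert(fitInto(small, 3).ptr !is small.ptr); // reallocated
}
```

Callers should therefore always use the returned slice rather than assuming the results landed in the buffer they passed in.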
- RegressRes
linearRegress
(U, TC...)(U Y, TC input);
- Perform a linear regression as in linearRegressBeta, but return a
RegressRes containing additional statistics for inference. If the last element
of input is a real, it specifies the confidence level of the intervals to
be calculated. Otherwise, the default of 0.95 is used. The rest of input
should be the elements of X.
Because this function computes these additional statistics, each range
must be traversed twice. This means:
1. They have to be forward ranges, not input ranges.
2. If you have a large amount of data and you're mapping it to some
expensive function, you may want to do this eagerly instead of lazily.
Notes:
The X ranges are traversed in lockstep, but the traversal is stopped
at the end of the shortest one. Therefore, using infinite ranges is safe.
For example, using repeat(1) to get an intercept term works.
BUGS:
The statistical tests performed in this function assume that an
intercept term is included in your regression model. If no intercept term
is included, the P-values, confidence intervals and adjusted R^2 values
calculated by this function will be wrong.
Examples:
int[] nBeers = [8,6,7,5,3,0,9];
int[] nCoffees = [3,6,2,4,3,6,8];
int[] musicVolume = [3,1,4,1,5,9,2];
int[] programmingSkill = [2,7,1,8,2,8,1];
// Using default confidence interval:
auto results = linearRegress(programmingSkill, repeat(1), nBeers, nCoffees,
musicVolume, map!"a * a"(musicVolume));
// Using user-specified confidence interval (a distinct variable name avoids
// redeclaring results in the same scope):
auto results2 = linearRegress(programmingSkill, repeat(1), nBeers, nCoffees,
musicVolume, map!"a * a"(musicVolume), 0.8675309);
- double[]
polyFitBeta
(T, U)(U Y, T X, uint N);
- Convenience function that takes a forward range X and a forward range Y,
creates an array of PowMap structs for integer powers from 0 through N,
and calls linearRegressBeta.
Returns:
An array of doubles. The index of each element corresponds to
the exponent. For example, the X^2 term will have an index of 2.
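The correspondence between index and exponent follows from how the regressors are laid out: column k holds X raised to the power k. A minimal eager sketch of that layout is below. The helper powerColumns is hypothetical; the library itself builds lazy PowMap ranges rather than arrays.

```d
import std.math : pow;

/// Hypothetical helper: returns n + 1 columns, where column k holds
/// x^k for every element of x (k = 0 gives the intercept column).
double[][] powerColumns(const(double)[] x, uint n)
{
    auto cols = new double[][](n + 1, x.length);
    foreach (k; 0 .. n + 1)
    {
        foreach (i; 0 .. x.length)
        {
            double v = x[i];
            cols[k][i] = pow(v, k);
        }
    }
    return cols;
}

unittest
{
    auto cols = powerColumns([1.0, 2.0, 3.0], 2);
    assert(cols[0] == [1.0, 1.0, 1.0]); // X^0: intercept
    assert(cols[2] == [1.0, 4.0, 9.0]); // X^2: index 2
}
```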
- double[]
polyFitBetaBuf
(T, U)(double[] buf, U Y, T X, uint N);
- Same as polyFitBeta, but allows the caller to provide an explicit buffer
to return the coefficients in. If it's too short, a new one will be
allocated. Otherwise, results will be returned in the user-provided buffer.
- PolyFitRes!(PowMap!(uint,T)[])
polyFit
(T, U)(U Y, T X, uint N, double confInt = 0.95);
- Convenience function that takes a forward range X and a forward range Y,
creates an array of PowMap structs for integer powers 0 through N,
and calls linearRegress.
Returns:
A PolyFitRes containing the array of PowMap structs created and
a RegressRes. The PolyFitRes is alias this'd to the RegressRes.
- double[]
logisticRegressBeta
(T, U...)(T yIn, U xIn);
- Computes a logistic regression using a maximum likelihood estimator
and returns the beta coefficients. This is a generalized linear model with
the logit link, whose inverse is f(XB) = 1 / (1 + exp(-XB)). This is generally
used to model the probability that a binary Y variable is 1 given a set of
X variables.
For the purpose of this function, Y variables are interpreted as Booleans,
regardless of their type. X may be either a range of ranges or a tuple of
ranges. However, note that unlike in linearRegress, they are copied to an
array if they are not random access ranges. Note that each value is accessed
several times, so if your range is a map to something expensive, you may
want to evaluate it eagerly.
Also note that, as in linearRegress, repeat(1) can be used for the intercept
term.
Returns:
The beta coefficients for the regression model.
TODO:
Add hypothesis testing and generalize to a parametrizable
generalized linear model function.
References:
http://en.wikipedia.org/wiki/Logistic_regression
http://socserv.mcmaster.ca/jfox/Courses/UCLA/logistic-regression-notes.pdf
- pure nothrow double
inverseLogit
(double xb);
- The inverse logit function used in logistic regression.
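A minimal sketch of the function, assuming it is the standard logistic function 1 / (1 + exp(-xb)); the documentation does not show the body. The name logistic is hypothetical and is used to avoid implying this is the library's own code.

```d
import std.math : exp;

/// Standard logistic function: maps any real xb to a value in (0, 1),
/// read as the modelled probability that Y is 1.
double logistic(double xb) pure nothrow
{
    return 1.0 / (1.0 + exp(-xb));
}

unittest
{
    assert(logistic(0.0) == 0.5);  // even odds at xb = 0
    assert(logistic(10.0) > 0.999);
    assert(logistic(-10.0) < 0.001);
}
```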