dstats.regress

A module for performing linear regression. This module has an unusual interface: it is range-based rather than matrix-based. Values for the independent variables are provided as either a tuple or a range of ranges. This means that one can use, for example, map to fit higher-order models and to evaluate certain values lazily. (For details, see the examples below.)

Author:
David Simcha

struct PowMap (ExpType,T) if (isForwardRange!(T));


PowMap!(ExpType,T) powMap (ExpType, T)(T range, ExpType exponent);
Maps a forward range to a power determined at runtime. ExpType is the type of the exponent. Using an int is faster than using a double, but obviously less flexible.
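As a rough illustration of the idea, and not PowMap's actual implementation, raising every element of a range to an exponent chosen at runtime can be sketched with std.algorithm.map and std.math.pow. The helper name powSketch is hypothetical.

```d
import std.algorithm : equal, map;
import std.math : abs, pow;

// Hypothetical stand-in for powMap: lazily raises each element of `range`
// to `exponent`, which is an ordinary runtime value.
auto powSketch(R, E)(R range, E exponent)
{
    return range.map!(a => pow(a, exponent));
}

void main()
{
    double[] xs = [1.0, 2.0, 3.0];

    // Integer exponent: 1^3, 2^3, 3^3.
    assert(equal(powSketch(xs, 3), [1.0, 8.0, 27.0]));

    // Floating point exponent: more flexible, e.g. square roots.
    auto roots = powSketch([4.0, 9.0], 0.5);
    assert(abs(roots.front - 2.0) < 1e-9);
}
```

The integer-exponent call uses the integer overload of pow, which is the kind of saving the note about int versus double refers to.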

struct RegressRes ;
Struct that holds the results of a linear regression. It's a plain old data struct.

double[] betas ;
The coefficients, one for each range in X. These will be in the order that the X ranges were passed in.

double[] stdErr ;
The standard errors of the beta coefficients, in the same order as the X ranges that were passed in.

double[] lowerBound ;
The lower confidence bounds of the beta terms, at the confidence level specified (default 0.95).

double[] upperBound ;
The upper confidence bounds of the beta terms, at the confidence level specified (default 0.95).

double[] p ;
The P-value for testing the null hypothesis that the corresponding beta value is equal to zero against the alternative that it is different from zero.

double R2 ;
The coefficient of determination.

double adjustedR2 ;
The adjusted coefficient of determination.

double residualError ;
The root mean square of the residuals.

double overallP ;
The P-value for the model as a whole, based on an F-statistic. The null here is that the model has no predictive value; the alternative is that it does.

string toString ();
Returns the results as a string in the default format.

struct PolyFitRes (T);
Struct returned by polyFit.

T X ;
The array of PowMap ranges created by polyFit.

RegressRes regressRes ;
The rest of the results. PolyFitRes is alias this'd to this member, so its fields can be accessed directly on the PolyFitRes.

struct Residuals (F,U,T...);
Forward Range for holding the residuals from a regression analysis.

Residuals!(F,U,T) residuals (F, U, T...)(F[] betas, U Y, T X);
Given the beta coefficients from a linear regression, and X and Y values, returns a range that lazily computes the residuals.
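The lazy computation can be pictured with a minimal sketch built only on Phobos. It is not the library's Residuals struct. The helper name residualsSketch is hypothetical, the model is fixed to an intercept plus one predictor, and the sign convention (observed minus fitted) is an assumption.

```d
import std.algorithm : equal, map;
import std.range : repeat, zip;

// Hypothetical helper for a model with an intercept and one predictor.
// Assumed convention: residual = observed Y minus fitted value.
auto residualsSketch(const double[] betas, double[] y, double[] x)
{
    // repeat(1.0) is infinite; zip stops at the end of the shortest range.
    return zip(y, repeat(1.0), x)
        .map!(t => t[0] - (betas[0] * t[1] + betas[1] * t[2]));
}

void main()
{
    double[] betas = [1.0, 2.0];   // made-up intercept and slope
    double[] x = [0.0, 1.0, 2.0];
    double[] y = [1.5, 3.0, 4.0];

    // Nothing is computed until the range is consumed.
    auto r = residualsSketch(betas, y, x);
    assert(equal(r, [0.5, 0.0, -1.0]));
}
```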

double[] linearRegressBeta (U, T...)(U Y, T XIn);
Perform a linear regression and return just the beta values. The advantages to just returning the beta values are that it's faster and that each range needs to be iterated over only once, and thus can be just an input range. The beta values are returned such that the smallest index corresponds to the leftmost element of X. X can be either a tuple or a range of input ranges. Y must be an input range.

Notes:
The X ranges are traversed in lockstep, but the traversal is stopped at the end of the shortest one. Therefore, using infinite ranges is safe. For example, using repeat(1) to get an intercept term works.

Examples:
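 import std.algorithm : map;
 import std.range : repeat;
 import dstats.regress;

 int[] nBeers = [8,6,7,5,3,0,9];
 int[] nCoffees = [3,6,2,4,3,6,8];
 int[] musicVolume = [3,1,4,1,5,9,2];
 int[] programmingSkill = [2,7,1,8,2,8,1];
 // repeat(1) supplies the intercept term; the last range adds a quadratic term.
 double[] betas = linearRegressBeta(programmingSkill, repeat(1), nBeers, nCoffees,
     musicVolume, map!"a * a"(musicVolume));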


double[] linearRegressBetaBuf (U, T...)(double[] buf, U Y, T XIn);
Same as linearRegressBeta, but allows the user to specify a buffer for the beta terms. If the buffer is too short, a new one is allocated. Otherwise, the results are returned in the user-provided buffer.
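The buffer contract described above can be sketched as follows. This is not the library's code: pickBuffer is a hypothetical helper, and slicing the caller's buffer down to the needed length is an assumption about how the result is returned.

```d
// Hypothetical helper mirroring the documented buffer contract.
double[] pickBuffer(double[] buf, size_t needed)
{
    if (buf.length < needed)
        return new double[needed];   // too short: allocate a fresh array
    return buf[0 .. needed];         // long enough: reuse the caller's memory
}

void main()
{
    auto scratch = new double[8];

    // Large enough: the result aliases the caller's buffer.
    auto reused = pickBuffer(scratch, 5);
    assert(reused.ptr is scratch.ptr && reused.length == 5);

    // Too short: the result is a new allocation.
    auto fresh = pickBuffer(scratch, 16);
    assert(fresh.ptr !is scratch.ptr && fresh.length == 16);
}
```

Because a new array may be returned, callers should always use the returned slice rather than assuming the results landed in the buffer they passed in.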

RegressRes linearRegress (U, TC...)(U Y, TC input);
Perform a linear regression as in linearRegressBeta, but return a RegressRes containing the statistics needed for inference. If the last element of input is a floating point number, it is used as the confidence level for the confidence intervals. Otherwise, the default of 0.95 is used. The rest of input should be the elements of X.

Because this function computes several statistics used for inference, each range must be traversed twice. This means:

1. They have to be forward ranges, not input ranges.

2. If you have a large amount of data and you're mapping it to some expensive function, you may want to do this eagerly instead of lazily.
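The cost of traversing a lazy map twice can be seen in a small Phobos-only sketch. The function expensiveSquare and the call counter are hypothetical stand-ins for a costly mapping; no dstats call is made here.

```d
import std.algorithm : map;
import std.array : array;

int calls;  // counts evaluations of the "expensive" function

// Hypothetical stand-in for a costly transformation.
int expensiveSquare(int a)
{
    ++calls;
    return a * a;
}

void main()
{
    int[] musicVolume = [3, 1, 4, 1, 5, 9, 2];

    // Lazy: every traversal re-evaluates the function for every element.
    auto lazySquares = musicVolume.map!expensiveSquare;
    calls = 0;
    foreach (v; lazySquares) {}
    foreach (v; lazySquares) {}
    assert(calls == 2 * musicVolume.length);

    // Eager: evaluate once into an array, then traverse as often as needed.
    calls = 0;
    int[] eagerSquares = musicVolume.map!expensiveSquare.array;
    foreach (v; eagerSquares) {}
    foreach (v; eagerSquares) {}
    assert(calls == musicVolume.length);
}
```

Passing the eagerly evaluated array as an X range trades memory for avoiding the second round of evaluations.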

Notes:
The X ranges are traversed in lockstep, but the traversal is stopped at the end of the shortest one. Therefore, using infinite ranges is safe. For example, using repeat(1) to get an intercept term works.

BUGS:
The statistical tests performed in this function assume that an intercept term is included in your regression model. If no intercept term is included, the P-values, confidence intervals and adjusted R^2 values calculated by this function will be wrong.

Examples:
 import std.algorithm : map;
 import std.range : repeat;
 import dstats.regress;

 int[] nBeers = [8,6,7,5,3,0,9];
 int[] nCoffees = [3,6,2,4,3,6,8];
 int[] musicVolume = [3,1,4,1,5,9,2];
 int[] programmingSkill = [2,7,1,8,2,8,1];

 // Using the default confidence level (0.95):
 auto results = linearRegress(programmingSkill, repeat(1), nBeers, nCoffees,
     musicVolume, map!"a * a"(musicVolume));

 // Using a user-specified confidence level, passed as the last argument:
 auto results2 = linearRegress(programmingSkill, repeat(1), nBeers, nCoffees,
     musicVolume, map!"a * a"(musicVolume), 0.8675309);


double[] polyFitBeta (T, U)(U Y, T X, uint N);
Convenience function that takes a forward range X and a forward range Y, creates an array of PowMap structs for integer powers from 0 through N, and calls linearRegressBeta.

Returns:
An array of doubles. The index of each element corresponds to the exponent. For example, the X^2 term will have an index of 2.
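To make the index-to-exponent correspondence concrete, a coefficient array of this shape can be evaluated at a point as sketched here. The helper evalPoly is hypothetical and not part of the module, and the coefficients are made up rather than produced by a fit.

```d
// Hypothetical helper: evaluates betas[0] + betas[1]*x + betas[2]*x^2 + ...
double evalPoly(const double[] betas, double x)
{
    double result = 0.0;
    // Horner's rule, walking from the highest exponent down to 0.
    foreach_reverse (b; betas)
        result = result * x + b;
    return result;
}

void main()
{
    // Made-up coefficients for 1 + 2x + 3x^2.
    double[] betas = [1.0, 2.0, 3.0];
    assert(evalPoly(betas, 0.0) == 1.0);   // only the X^0 term remains
    assert(evalPoly(betas, 2.0) == 17.0);  // 1 + 2*2 + 3*4
}
```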

double[] polyFitBetaBuf (T, U)(double[] buf, U Y, T X, uint N);
Same as polyFitBeta, but allows the caller to provide an explicit buffer to return the coefficients in. If it's too short, a new one will be allocated. Otherwise, results will be returned in the user-provided buffer.

PolyFitRes!(PowMap!(uint,T)[]) polyFit (T, U)(U Y, T X, uint N, double confInt = 0.95);
Convenience function that takes a forward range X and a forward range Y, creates an array of PowMap structs for integer powers 0 through N, and calls linearRegress.

Returns:
A PolyFitRes containing the array of PowMap structs created and a RegressRes. The PolyFitRes is alias this'd to the RegressRes.

double[] logisticRegressBeta (T, U...)(T yIn, U xIn);
Computes a logistic regression using a maximum likelihood estimator and returns the beta coefficients. This is a generalized linear model with the inverse link function f(XB) = 1 / (1 + exp(-XB)). This is generally used to model the probability that a binary Y variable is 1 given a set of X variables.

For the purpose of this function, Y variables are interpreted as Booleans, regardless of their type. X may be either a range of ranges or a tuple of ranges. Unlike in linearRegress, however, the X ranges are copied to an array if they are not random access ranges. Each value is accessed several times, so if your range is a map to something expensive, you may want to evaluate it eagerly.

Also note that, as in linearRegress, repeat(1) can be used for the intercept term.

Returns:
The beta coefficients for the regression model.

TODO:
Add hypothesis testing stuff and generalize to a parametrizable generalized linear model function.

References:


http://en.wikipedia.org/wiki/Logistic_regression

http://socserv.mcmaster.ca/jfox/Courses/UCLA/logistic-regression-notes.pdf

pure nothrow double inverseLogit (double xb);
The inverse logit function used in logistic regression.
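For reference, the standard inverse logit maps any real number into the interval (0, 1). The sketch below assumes that standard definition, 1 / (1 + exp(-xb)), and is a hypothetical re-implementation named inverseLogitSketch. The library's own sign convention should be checked against its source.

```d
import std.math : exp;

// Hypothetical re-implementation, assuming the standard definition.
double inverseLogitSketch(double xb) pure nothrow
{
    return 1.0 / (1.0 + exp(-xb));
}

void main()
{
    // A linear predictor of zero corresponds to a probability of one half.
    assert(inverseLogitSketch(0.0) == 0.5);

    // Large positive or negative predictors approach 1 or 0.
    assert(inverseLogitSketch(10.0) > 0.99);
    assert(inverseLogitSketch(-10.0) < 0.01);

    // Symmetry: f(x) + f(-x) is 1 up to rounding.
    double s = inverseLogitSketch(1.3) + inverseLogitSketch(-1.3);
    assert(s > 0.999999 && s < 1.000001);
}
```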

Page was generated on Sun Aug 22 21:51:11 2010