dstats.regress

A module for performing linear regression. This module has an unusual interface: it is range-based rather than matrix-based. Values for independent variables are provided as either a tuple of ranges or a range of ranges. This means that one can use, for example, map to fit high-order models and to lazily evaluate certain values. (For details, see the examples below.)

Author:
David Simcha

struct PowMap (ExpType,T) if (isForwardRange!(T));


PowMap!(ExpType,T) powMap (ExpType, T)(T range, ExpType exponent);
Maps a forward range to a power determined at runtime. ExpType is the type of the exponent. Using an int is faster than using a double, but obviously less flexible.
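The lazy mapping that powMap performs can be sketched in Python with a generator (an illustration of the semantics, not the D implementation; the name pow_map is hypothetical):

```python
def pow_map(values, exponent):
    """Lazily yield each element raised to a power chosen at runtime."""
    for v in values:
        yield v ** exponent

squares = pow_map([1, 2, 3], 2)
list(squares)  # -> [1, 4, 9], computed on demand
```

As in the D version, an integer exponent is cheaper to apply than a fractional one, but both work.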

struct RegressRes ;
Struct that holds the results of a linear regression. It's a plain old data struct.

double[] betas ;
The coefficients, one for each range in X. These will be in the order that the X ranges were passed in.

double[] stdErr ;
The standard error terms of the X ranges passed in.

double[] lowerBound ;
The lower confidence bounds of the beta terms, at the specified confidence level (default 0.95).

double[] upperBound ;
The upper confidence bounds of the beta terms, at the specified confidence level (default 0.95).

double[] p ;
The P-value for the alternative that the corresponding beta value is different from zero against the null that it is equal to zero.

double R2 ;
The coefficient of determination.

double adjustedR2 ;
The adjusted coefficient of determination.

double residualError ;
The root mean square of the residuals.

double overallP ;
The P-value for the model as a whole. Based on an F-statistic. The null here is that the model has no predictive value, the alternative is that it does.

string toString ();
Returns a string representation of the results in the default format.

struct Residuals (F,U,T...);
Forward Range for holding the residuals from a regression analysis.

Residuals!(F,U,T) residuals (F, U, T...)(F[] betas, U Y, T X);
Given the beta coefficients from a linear regression, and X and Y values, returns a range that lazily computes the residuals.
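The computation each element performs is simply y minus the fitted value, i.e. residual_i = y_i - sum_j(beta_j * x_ij). A pure-Python sketch of that lazy computation (names are illustrative, not the library's API):

```python
def residuals(betas, ys, *xs):
    """Lazily yield y - X*beta for each observation."""
    for y, row in zip(ys, zip(*xs)):
        yield y - sum(b * x for b, x in zip(betas, row))

# y = 1 + 2*x fits this data exactly, so every residual is zero:
intercept = [1, 1, 1]
x = [1, 2, 3]
y = [3, 5, 7]
list(residuals([1.0, 2.0], y, intercept, x))  # -> [0.0, 0.0, 0.0]
```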

double[] linearRegressBeta (U, T...)(U Y, T XIn);
Perform a linear regression and return just the beta values. The advantages to just returning the beta values are that it's faster and that each range needs to be iterated over only once, and thus can be just an input range. The beta values are returned such that the smallest index corresponds to the leftmost element of X. X can be either a tuple or a range of input ranges. Y must be an input range.

If, after all X variables are passed in, a numeric type is passed as the last parameter, this is treated as a ridge parameter and ridge regression is performed. Ridge regression is a form of regression that penalizes the L2 norm of the beta vector and therefore results in more parsimonious models. However, it makes statistical inference such as that supported by linearRegress() difficult to impossible. Therefore, linearRegress() doesn't support ridges.

If no ridge parameter is passed, or equivalently if the ridge parameter is zero, then ordinary least squares regression is performed.
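For intuition about how the ridge parameter changes the fit: in the single-predictor, no-intercept case the closed forms are simple. OLS gives beta = sum(x*y) / sum(x^2), and ridge shrinks it to beta = sum(x*y) / (sum(x^2) + lambda). A sketch of those formulas (not the library's algorithm, which handles the general multivariate case):

```python
def ols_beta_1d(xs, ys):
    # Ordinary least squares slope: single predictor, no intercept.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_beta_1d(xs, ys, lam):
    # The L2 penalty adds lam to the Gram term, shrinking beta toward zero.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
ols_beta_1d(x, y)          # 2.0
ridge_beta_1d(x, y, 0.0)   # 2.0: a zero ridge parameter reduces to OLS
ridge_beta_1d(x, y, 14.0)  # 1.0: the penalty shrinks the coefficient
```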

Notes:
The X ranges are traversed in lockstep, but the traversal is stopped at the end of the shortest one. Therefore, using infinite ranges is safe. For example, using repeat(1) to get an intercept term works.
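The same lockstep/stop-at-shortest semantics can be seen with Python's zip and itertools.repeat, which is the analogue of using repeat(1) for an intercept column:

```python
from itertools import repeat

# zip stops at the end of the shortest input, so an infinite
# intercept column is safe:
rows = list(zip(repeat(1), [8, 6, 7], [3, 6, 2]))
rows  # -> [(1, 8, 3), (1, 6, 6), (1, 7, 2)]
```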

References:


http://www.mathworks.com/help/toolbox/stats/ridge.html

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0 (This is the citation for the MASS R package.)

Examples:
int[] nBeers = [8,6,7,5,3,0,9];
int[] nCoffees = [3,6,2,4,3,6,8];
int[] musicVolume = [3,1,4,1,5,9,2];
int[] programmingSkill = [2,7,1,8,2,8,1];
double[] betas = linearRegressBeta(programmingSkill, repeat(1), nBeers, nCoffees,
    musicVolume, map!"a * a"(musicVolume));

// Now throw in a ridge parameter of 2.5.
double[] ridgeBetas = linearRegressBeta(programmingSkill, repeat(1), nBeers,
    nCoffees, musicVolume, map!"a * a"(musicVolume), 2.5);


double[] linearRegressBetaBuf (U, TRidge...)(double[] buf, U Y, TRidge XRidge);
Same as linearRegressBeta, but allows the user to specify a buffer for the beta terms. If the buffer is too short, a new one is allocated. Otherwise, the results are returned in the user-provided buffer.

RegressRes linearRegress (U, TC...)(U Y, TC input);
Perform a linear regression as in linearRegressBeta, but return a RegressRes with useful stuff for statistical inference. If the last element of input is a real, this is used to specify the confidence intervals to be calculated. Otherwise, the default of 0.95 is used. The rest of input should be the elements of X.

When using this function, which provides several statistics useful for inference, each range must be traversed twice. This means:

1. They have to be forward ranges, not input ranges.

2. If you have a large amount of data and you're mapping it to some expensive function, you may want to do this eagerly instead of lazily.

Notes:
The X ranges are traversed in lockstep, but the traversal is stopped at the end of the shortest one. Therefore, using infinite ranges is safe. For example, using repeat(1) to get an intercept term works.

If the confidence interval specified is exactly 0, this is treated as a special case and confidence interval calculation is skipped. This can speed things up significantly and can therefore be useful in Monte Carlo and possibly data mining contexts.

BUGS:
The statistical tests performed in this function assume that an intercept term is included in your regression model. If no intercept term is included, the P-values, confidence intervals and adjusted R^2 values calculated by this function will be wrong.

Examples:
int[] nBeers = [8,6,7,5,3,0,9];
int[] nCoffees = [3,6,2,4,3,6,8];
int[] musicVolume = [3,1,4,1,5,9,2];
int[] programmingSkill = [2,7,1,8,2,8,1];

// Using default confidence interval:
auto results = linearRegress(programmingSkill, repeat(1), nBeers, nCoffees,
    musicVolume, map!"a * a"(musicVolume));

// Using user-specified confidence interval:
auto results2 = linearRegress(programmingSkill, repeat(1), nBeers, nCoffees,
    musicVolume, map!"a * a"(musicVolume), 0.8675309);


struct PolyFitRes (T);
Struct returned by polyFit.

T X ;
The array of PowMap ranges created by polyFit.

RegressRes regressRes ;
The rest of the results. This is alias this'd.

double[] polyFitBeta (T, U)(U Y, T X, uint N, double ridge = 0);
Convenience function that takes a forward range X and a forward range Y, creates an array of PowMap structs for integer powers from 0 through N, and calls linearRegressBeta.

Returns:
An array of doubles. The index of each element corresponds to the exponent. For example, the x^2 term will have an index of 2.
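Because index i holds the coefficient of x^i, evaluating the fitted polynomial is a direct sum over the returned array. A sketch of that convention (the function name is illustrative, not part of dstats):

```python
def eval_poly(betas, x):
    """betas[i] is the coefficient of x**i, matching polyFitBeta's layout."""
    return sum(b * x ** i for i, b in enumerate(betas))

eval_poly([1.0, 0.0, 2.0], 3.0)  # 1 + 0*3 + 2*9 = 19.0
```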

double[] polyFitBetaBuf (T, U)(double[] buf, U Y, T X, uint N, double ridge = 0);
Same as polyFitBeta, but allows the caller to provide an explicit buffer to return the coefficients in. If it's too short, a new one will be allocated. Otherwise, results will be returned in the user-provided buffer.

PolyFitRes!(PowMap!(uint,T)[]) polyFit (T, U)(U Y, T X, uint N, double confInt = 0.95);
Convenience function that takes a forward range X and a forward range Y, creates an array of PowMap structs for integer powers 0 through N, and calls linearRegress.

Returns:
A PolyFitRes containing the array of PowMap structs created and a RegressRes. The PolyFitRes is alias this'd to the RegressRes.

double[] linearRegressPenalized (Y, X...)(Y yIn, X xIn, double lasso, double ridge);
Performs lasso (L1) and/or ridge (L2) penalized linear regression. Due to the way the data is standardized, no intercept term should be included in x (unlike linearRegress and linearRegressBeta). The intercept coefficient is implicitly included and returned in the first element of the returned array. Usage is otherwise identical.

Note:
Setting lasso equal to zero is equivalent to performing ridge regression. This can also be done with linearRegressBeta. However, the linearRegressBeta algorithm is optimized for memory efficiency and large samples. This algorithm is optimized for large feature sets.
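The coordinate-descent approach of the cited Friedman et al. paper is built around the soft-thresholding operator, which is what produces the sparse (zeroed-out) coefficients characteristic of the lasso. A sketch of that operator (an illustration of the cited technique, not dstats's internal code):

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0): the lasso shrinkage step."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

soft_threshold(3.0, 1.0)   # 2.0: large coefficients are shrunk toward zero
soft_threshold(-0.5, 1.0)  # 0.0: small coefficients are zeroed, giving sparsity
```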

Returns:
The beta coefficients for the regression model.

References:
Friedman, J., et al. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;2:302-332.

Goeman, J. J., L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52(1), 70-84.

Eilers, P., Boer, J., Van Ommen, G., Van Houwelingen, H. 2001 Classification of microarray data with penalized logistic regression. Proceedings of SPIE. Progress in Biomedical Optics and Images vol. 4266, pp. 187-198

double[] logisticRegressBeta (T, U...)(T yIn, U xRidge);
Computes a logistic regression using a maximum likelihood estimator and returns the beta coefficients. This is a generalized linear model with the link function f(XB) = 1 / (1 + exp(-XB)). This is generally used to model the probability that a binary Y variable is 1 given a set of X variables.

For the purpose of this function, Y variables are interpreted as Booleans, regardless of their type. X may be either a range of ranges or a tuple of ranges. However, note that unlike in linearRegress, they are copied to an array if they are not random access ranges. Note that each value is accessed several times, so if your range is a map to something expensive, you may want to evaluate it eagerly.

If the last parameter passed in is a numeric value instead of a range, it is interpreted as a ridge parameter and ridge regression is performed. This penalizes the L2 norm of the beta vector (in a scaled space) and results in more parsimonious models. It limits the usefulness of inference techniques (p-values, confidence intervals), however, and is therefore not offered in logisticRegress().

If no ridge parameter is passed, or equivalently if the ridge parameter is zero, then ordinary maximum likelihood regression is performed.

Note that, while this implementation of ridge regression was tested against the R Design Package implementation, it uses slightly different conventions that make the results not comparable without transformation. dstats uses a biased estimate of the variance to scale the beta vector penalties, while Design uses an unbiased estimate. Furthermore, Design penalizes by 1/2 of the L2 norm, whereas dstats penalizes by the L2 norm. Therefore, if n is the sample size, and lambda is the penalty used with dstats, the proper penalty to use in Design to get the same results is 2 * (n - 1) * lambda / n.
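The stated penalty conversion can be checked with a one-line calculation; for example, with a sample size of n = 10 and a dstats penalty of 5 (the function name below is illustrative):

```python
def design_penalty(n, lam):
    """Convert a dstats ridge penalty to the equivalent R Design penalty."""
    return 2 * (n - 1) * lam / n

design_penalty(10, 5.0)  # 2 * 9 * 5 / 10 = 9.0
```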

Also note that, as in linearRegress, repeat(1) can be used for the intercept term.

Returns:
The beta coefficients for the regression model.

References:


http://en.wikipedia.org/wiki/Logistic_regression

http://socserv.mcmaster.ca/jfox/Courses/UCLA/logistic-regression-notes.pdf

S. Le Cessie and J. C. Van Houwelingen. Ridge Estimators in Logistic Regression. Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 41, No. 1(1992), pp. 191-201

Frank E Harrell Jr (2009). Design: Design Package. R package version 2.3-0.

http://CRAN.R-project.org/package=Design

struct LogisticRes ;
Plain old data struct to hold the results of a logistic regression.

double[] betas ;
The coefficients, one for each range in X. These will be in the order that the X ranges were passed in.

double[] stdErr ;
The standard error terms of the X ranges passed in.

double[] lowerBound ;
The Wald lower confidence bounds of the beta terms, at the specified confidence level (default 0.95).

double[] upperBound ;
The Wald upper confidence bounds of the beta terms, at the specified confidence level (default 0.95).

double[] p ;
The P-value for the alternative that the corresponding beta value is different from zero against the null that it is equal to zero. These are calculated using the Wald Test.

double nullLogLikelihood ;
The log likelihood for the null model.

double logLikelihood ;
The log likelihood for the model fit.

const pure nothrow @property @safe double aic ();
Akaike Information Criterion, which is a complexity-penalized goodness-of-fit score, equal to 2 * k - 2 * L, where L is the log likelihood and k is the number of parameters.
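The formula is a one-liner; a Python sketch (the function name is illustrative):

```python
def aic(log_likelihood, k):
    """AIC = 2*k - 2*L, where L is the log likelihood and k the parameter count."""
    return 2 * k - 2 * log_likelihood

aic(-10.0, 3)  # 2*3 - 2*(-10) = 26.0
```

Lower AIC is better: adding a parameter must raise the log likelihood by at least 1 to pay for itself.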

double overallP ;
The P-value for the model as a whole, based on the likelihood ratio test. The null here is that the model has no predictive value, the alternative is that it does have predictive value.

string toString ();
Returns a string representation of the results in the default format.

LogisticRes logisticRegress (T, V...)(T yIn, V input);
Similar to logisticRegressBeta, but returns a LogisticRes with useful stuff for statistical inference. If the last element of input is a floating point number instead of a range, it is used to specify the confidence interval calculated. Otherwise, the default of 0.95 is used.

References:


http://en.wikipedia.org/wiki/Wald_test

http://en.wikipedia.org/wiki/Akaike_information_criterion

pure nothrow @safe double logistic (double xb);
The logistic function used in logistic regression.
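The logistic function maps the linear predictor XB onto a probability in (0, 1); a Python sketch of the same formula:

```python
from math import exp

def logistic(xb):
    """1 / (1 + exp(-xb)): maps the linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-xb))

logistic(0.0)  # 0.5: a zero linear predictor means even odds
```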

double[] logisticRegressPenalized (Y, X...)(Y yIn, X xIn, double lasso, double ridge);
Performs lasso (L1) and/or ridge (L2) penalized logistic regression. Due to the way the data is standardized, no intercept term should be included in x (unlike logisticRegress and logisticRegressBeta). The intercept coefficient is implicitly included and returned in the first element of the returned array. Usage is otherwise identical.

Note:
Setting lasso equal to zero is equivalent to performing ridge regression. This can also be done with logisticRegressBeta. However, the logisticRegressBeta algorithm is optimized for memory efficiency and large samples. This algorithm is optimized for large feature sets.

Returns:
The beta coefficients for the regression model.

References:
Friedman, J., et al. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;2:302-332.

Goeman, J. J., L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52(1), 70-84.

Eilers, P., Boer, J., Van Ommen, G., Van Houwelingen, H. 2001 Classification of microarray data with penalized logistic regression. Proceedings of SPIE. Progress in Biomedical Optics and Images vol. 4266, pp. 187-198

Page was generated on Wed May 25 22:15:55 2011