lassoBasic

Syntax

lassoBasic(Y, X, [mode=0], [alpha=1.0], [intercept=true], [normalize=false], [maxIter=1000], [tolerance=0.0001], [positive=false], [swColName], [checkInput=true])

Details

Perform lasso regression.

Minimize the following objective function:
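
For orientation, the usual form of the lasso objective is sketched below. The 1/(2n) scaling of the squared-error term is a common convention and is an assumption here; sample weights are left out for simplicity. Y, X and alpha are the arguments described below, n is the length of Y, and the intercept and coefficient vector are written as beta_0 and beta.

```latex
\min_{\beta_0,\,\beta}\ \frac{1}{2n}\,\lVert Y - \beta_0 - X\beta \rVert_2^2 \;+\; \alpha\,\lVert \beta \rVert_1
```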

Arguments

Y is a numeric vector indicating the dependent variable.

X is a numeric vector/tuple/matrix/table, indicating the independent variables.
  • When X is a vector/tuple, its length must be equal to the length of Y.

  • When X is a matrix/table, its number of rows must be equal to the length of Y.
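
The accepted forms of X can be sketched as follows, reusing the sample data from the Examples section. The use of matrix() and table() to build X from the same vectors is an illustrative assumption, and no output is shown.

```dolphindb
x1=1 3 5 7 11 16 23
x2=2 8 11 34 56 54 100
y=0.1 4.2 5.6 8.8 22.1 35.6 77.2

// X as a tuple of vectors, each as long as y
print(lassoBasic(y, (x1,x2)));

// X as a matrix with one column per independent variable (7 rows, like y)
print(lassoBasic(y, matrix(x1, x2)));

// X as a table with one column per independent variable (7 rows, like y)
print(lassoBasic(y, table(x1, x2)));
```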

mode is an integer that determines the form of the return value. It can take the following three values:
  • 0 (default) : a vector of the coefficient estimates.

  • 1: a table with coefficient estimates, standard error, t-statistics, and p-values.

  • 2: a dictionary with the following keys: ANOVA, RegressionStat, Coefficient, and Residual.

    Table 1. ANOVA (one-way analysis of variance)
    Source of Variance  DF (degrees of freedom)  SS (sum of squares)             MS (mean square)                      F (F-score)  Significance
    ------------------  -----------------------  ------------------------------  ------------------------------------  -----------  ------------
    Regression          p                        sum of squares regression, SSR  regression mean square, MSR = SSR/p   MSR/MSE      p-value
    Residual            n-p-1                    sum of squares error, SSE       mean square error, MSE = SSE/(n-p-1)
    Total               n-1                      sum of squares total, SST

    Here n is the number of observations and p is the number of independent variables.

    Table 2. RegressionStat (Regression statistics)
    Item          Description
    ------------  -----------
    R2            R-squared
    AdjustedR2    The R-squared adjusted for the degrees of freedom, which accounts for the sample size relative to the number of terms in the regression model.
    StdError      The residual standard error (deviation), corrected for the degrees of freedom.
    Observations  The sample size.

    Table 3. Coefficient
    Item      Description
    --------  -----------
    factor    Independent variables
    beta      Estimated regression coefficients
    stdError  Standard error of the regression coefficients
    tstat     t statistic, indicating the significance of the regression coefficients
    pvalue    p-value of the t statistic

    Residual: the residual of each observation, that is, the actual value minus the predicted value.
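
A minimal sketch of working with the mode = 2 dictionary, reusing the sample data from the Examples section. The variable names model, coefTable, b, and fitted are hypothetical, and the sketch assumes the rows of the Coefficient table are ordered intercept, x1, x2, as in the example output. It recomputes the residuals from the coefficient estimates as actual minus fitted values.

```dolphindb
x1=1 3 5 7 11 16 23
x2=2 8 11 34 56 54 100
y=0.1 4.2 5.6 8.8 22.1 35.6 77.2

model = lassoBasic(y, (x1,x2), mode=2)

// look up individual components by key
print(model["RegressionStat"]);
print(model["Residual"]);

// rebuild the fitted values from the coefficient estimates
coefTable = model["Coefficient"]
b = coefTable["beta"]
fitted = b[0] + b[1]*x1 + b[2]*x2

// actual minus fitted; intended to match model["Residual"]
print(y - fitted);
```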

alpha is a floating-point number representing the constant that multiplies the L1-norm penalty term. The default value is 1.0.
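
To get a feel for alpha, the same data can be fitted with different penalty strengths. This is a minimal sketch that reuses the sample data from the Examples section; the values 0.1 and 10.0 are arbitrary choices for illustration, and no output is shown.

```dolphindb
x1=1 3 5 7 11 16 23
x2=2 8 11 34 56 54 100
y=0.1 4.2 5.6 8.8 22.1 35.6 77.2

// weaker penalty than the default alpha=1.0
print(lassoBasic(y, (x1,x2), mode=0, alpha=0.1));

// stronger penalty than the default alpha=1.0
print(lassoBasic(y, (x1,x2), mode=0, alpha=10.0));
```

In general, a larger alpha shrinks the coefficient estimates more strongly toward zero and can set some of them exactly to zero.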

intercept is a Boolean variable indicating whether the regression includes the intercept. If it is true, the system automatically adds a column of "1"s to X to generate the intercept. The default value is true.

normalize is a Boolean value. If true, the regressors are normalized before regression by subtracting the mean and dividing by the L2-norm. If intercept = false, this parameter is ignored. The default value is false.
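
One way to read this normalization is sketched below, assuming the L2-norm is taken over the centered column (the text does not say whether the norm is computed before or after centering). Here x_ij is the value of regressor j for observation i and n is the number of observations.

```latex
\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}},
\qquad
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}
```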

maxIter is a positive integer indicating the maximum number of iterations. The default value is 1000.

tolerance is a floating-point number. The iterations stop when the improvement in the objective function value is smaller than tolerance. The default value is 0.0001.
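
The two stopping controls can be set together. A minimal sketch that reuses the sample data from the Examples section; the values 5000 and 0.000001 are arbitrary illustrations, not recommendations.

```dolphindb
x1=1 3 5 7 11 16 23
x2=2 8 11 34 56 54 100
y=0.1 4.2 5.6 8.8 22.1 35.6 77.2

// allow more iterations and require a smaller improvement before stopping
print(lassoBasic(y, (x1,x2), mode=0, maxIter=5000, tolerance=0.000001));
```

Iteration ends when either limit is reached, so a tighter tolerance usually needs a larger maxIter to take effect.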

positive is a Boolean value indicating whether to force the coefficient estimates to be positive. The default value is false.
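
A minimal sketch of the sign constraint, reusing the sample data from the Examples section. No output is shown.

```dolphindb
x1=1 3 5 7 11 16 23
x2=2 8 11 34 56 54 100
y=0.1 4.2 5.6 8.8 22.1 35.6 77.2

// unconstrained fit (default)
print(lassoBasic(y, (x1,x2), mode=0, positive=false));

// fit with the coefficient estimates forced to be positive
print(lassoBasic(y, (x1,x2), mode=0, positive=true));
```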

swColName is a STRING indicating a column name of ds. The specified column is used as the sample weight. If it is not specified, the sample weight is treated as 1.

checkInput is a BOOLEAN value. It determines whether to validate the parameters Y, X, and swColName.
  • If checkInput = true (default), the system checks the parameters for invalid values and throws an error if a NULL value exists.

  • If checkInput = false, the parameters are not checked for invalid values.

Note: It is recommended to specify checkInput = true. If it is false, make sure that the input parameters contain no invalid values and that no invalid values are generated during intermediate calculations; otherwise, the returned model may be inaccurate.

Examples

x1=1 3 5 7 11 16 23
x2=2 8 11 34 56 54 100
y=0.1 4.2 5.6 8.8 22.1 35.6 77.2;

print(lassoBasic(y, (x1,x2), mode = 0));
// output
[-9.133706333069543,2.535935196073186,0.189298948643987]


print(lassoBasic(y, (x1,x2), mode = 1));
// output
factor    beta               stdError          tstat              pvalue
--------- ------------------ ----------------- ------------------ -----------------
intercept -9.133706333069543 5.247492365971091 -1.740584968222107 0.156730846105191
x1        2.535935196073186  1.835793667840723 1.38138356205138   0.239309472176311
x2        0.189298948643987  0.410201227095842 0.461478260277749  0.66843504931137


print(lassoBasic(y, (x1,x2), mode = 2));
// output
Coefficient->
factor    beta               stdError          tstat              pvalue
--------- ------------------ ----------------- ------------------ -----------------
intercept -9.133706333069543 5.247492365971091 -1.740584968222107 0.156730846105191
x1        2.535935196073186  1.835793667840723 1.38138356205138   0.239309472176311
x2        0.189298948643987  0.410201227095842 0.461478260277749  0.66843504931137

RegressionStat->
item         statistics
------------ -----------------
R2           0.931480447323074
AdjustedR2   0.897220670984611
StdError     8.195817208870076
Observations 7

ANOVA->
Breakdown  DF SS                   MS                   F                  Significance
---------- -- -------------------- -------------------- ------------------ -----------------
Regression 2  4165.242566095043912 2082.621283047521956 31.004574440904473 0.003672076469395
Residual   4  268.685678884843582  67.171419721210895
Total      6  4471.637142857141952

Residual->
[6.319173239708383,4.21150915569809,-0.028258082380245,-6.254004293338318,-7.262321947798779,-6.063400030876729,9.077301958987561]
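
The statistics in the mode = 2 output above can be cross-checked by hand, with n = 7 observations and p = 2 independent variables. The relationships below are standard regression definitions that reproduce the printed values (rounded to four decimals); that lassoBasic computes them in exactly this way internally is an assumption.

```latex
\begin{aligned}
\mathrm{MSR} &= \mathrm{SSR}/p = 4165.2426 / 2 \approx 2082.6213 \\
\mathrm{MSE} &= \mathrm{SSE}/(n-p-1) = 268.6857 / 4 \approx 67.1714 \\
F &= \mathrm{MSR}/\mathrm{MSE} \approx 2082.6213 / 67.1714 \approx 31.0046 \\
\mathrm{StdError} &= \sqrt{\mathrm{MSE}} \approx 8.1958 \\
R^2 &= \mathrm{SSR}/\mathrm{SST} = 4165.2426 / 4471.6371 \approx 0.9315 \\
\mathrm{AdjustedR2} &= 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \approx 1 - 0.06852 \times \frac{6}{4} \approx 0.8972
\end{aligned}
```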