Data Science/R (32)
[R] Simulation Study : Prediction Performance 1. How do we run a simulation on prediction performance? If \(X\) has size \(n \times p\) (\(200 \times 2000\)), we need to find which variables are selected and which variables affect the target most (coefficients). So we need to compare variable-selection models that predict well. - M1 : \(\hat{\beta}^{lasso} + \lambda_{min}\) - M2 : \(\hat{\beta}^{lasso} + \lambda_{1se}\) - M3 : \(\hat{\beta}^{lasso} + \lambda_{mi..
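A minimal sketch of the simulation setup this post describes, using cv.glmnet from the glmnet package; the names sim_x and sim_beta and the five-signal design are hypothetical stand-ins:

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 2000
sim_x    <- matrix(rnorm(n * p), n, p)         # hypothetical 200 x 2000 design
sim_beta <- c(rep(2, 5), rep(0, p - 5))        # assume only 5 true signals
y        <- sim_x %*% sim_beta + rnorm(n)

cv <- cv.glmnet(sim_x, y, alpha = 1)           # lasso path with 10-fold CV
coef_min <- coef(cv, s = "lambda.min")         # M1: lambda minimizing CV error
coef_1se <- coef(cv, s = "lambda.1se")         # M2: largest lambda within 1 SE
```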
[R] Useful Functions for Regression Problems 1. model.matrix() Make dummy variables with categories and an intercept : model.matrix(~., x) Make dummy variables with categories but no intercept : model.matrix(~., x)[, -1] Make predictions from regsubsets, glm, or lm fits : model.matrix(~., x) %*% coef(g, id=i) Make predictions from glmnet : model.matrix(~., x) %*% coef(g, s=g$lambda[1]) The model.matrix function converts original data into categorical..
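A hedged sketch of the model.matrix() patterns listed above on a toy data frame; df and fit are hypothetical names, and the manual matrix product reproduces predict(fit):

```r
df <- data.frame(y = rnorm(10),
                 g = factor(rep(c("a", "b"), 5)),   # one categorical column
                 x = rnorm(10))

X  <- model.matrix(~ ., df[, -1])        # dummy variables plus intercept
X1 <- model.matrix(~ ., df[, -1])[, -1]  # same, with the intercept dropped

fit  <- lm(y ~ ., df)
pred <- model.matrix(~ ., df[, -1]) %*% coef(fit)   # manual prediction
```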
[R] Regularization Methods : Binary 1. Regularization Methods Regularization methods are based on a penalized likelihood : \(Q_{\lambda}(\beta_0, \beta) = -l(\beta_0, \beta) + p_{\lambda}(\beta)\) \((\hat{\beta}_0, \hat{\beta}) = \arg\min Q_{\lambda}(\beta_0, \beta)\) Penalized likelihood for a quantitative response. Linear regression model : \(y_i = \beta_0 + x_i^T \beta + \epsilon_i\) \(l_1\)-norm : \(\lambda \sum_{j=1}^{p}|\beta_j|\) \(l_2\)-norm : \(\lambd..
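A minimal sketch of fitting the penalized binary model with glmnet, which adds exactly these penalties to the negative log-likelihood; the simulated xb and yb are assumptions for illustration:

```r
library(glmnet)

set.seed(1)
xb <- matrix(rnorm(100 * 20), 100, 20)
pr <- 1 / (1 + exp(-(xb[, 1] - xb[, 2])))   # assumed true logistic signal
yb <- rbinom(100, 1, pr)                    # binary response

fit_l1 <- glmnet(xb, yb, family = "binomial", alpha = 1)  # l1 (lasso) penalty
fit_l2 <- glmnet(xb, yb, family = "binomial", alpha = 0)  # l2 (ridge) penalty
```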
[R] Variable Selection Methods : Lasso 1. Lasso Regression Ridge has the disadvantage of including all p predictors in the final model. What we want is variable selection. Lasso shrinks \(\hat{\beta}\) towards zero : \(RSS + \lambda\sum_{j=1}^{p}|\beta_j|\) The \(l_1\)-norm of \(\hat{\beta}_{\lambda}\) : \(||\hat{\beta}_{\lambda}||_1\). At \(\lambda_1 = \lambda_{max}\), \(df(\hat{\beta}_{\lambda_1}) = 0\)..
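A short sketch of the lasso penalty in practice, assuming glmnet and simulated data with three true signals; the coefficient path shows estimates hitting exactly zero:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 50), 100, 50)
y <- x[, 1:3] %*% c(3, 2, -2) + rnorm(100)  # assume 3 nonzero coefficients

fit <- glmnet(x, y, alpha = 1)   # alpha = 1 selects the lasso penalty
plot(fit, xvar = "lambda")       # coefficients shrink to exactly zero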
[R] Variable Selection Methods : Ridge 1. Variable Selection Methods We cannot use subset selection models when \(p > n\). Shrinkage reduces \(Var(\hat{\beta}^{sh})\). Examples : Ridge, Lasso, Elastic Net (Ridge + Lasso). 3. Ridge Regression \(RSS + \lambda\sum_{j=1}^{p}\beta_j^2\) where \(\lambda \ge 0\) is a tuning parameter. For a grid of \(\lambda\) : \(\lambda_{max} = \lambda_1 > ... > \lambda_m = \lambda_{min}\). The \(l_2\)-norm of \(\hat{\b..
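A sketch of fitting ridge over an explicit grid \(\lambda_{max} > ... > \lambda_{min}\), assuming glmnet; the grid endpoints are illustrative choices:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 50), 100, 50)
y <- x %*% rnorm(50, sd = 0.5) + rnorm(100)

grid <- 10^seq(4, -2, length = 100)             # lambda_max down to lambda_min
fit  <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 selects ridge
dim(coef(fit))                                  # (p + 1) coefficients per lambda
```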
[R] Best Subset Selection 1. Three classes of solving problems To solve the problem (variance becomes higher as the number of features grows), we need to make p smaller than n. Subset Selection : Identify a subset of the p predictors that we believe to be related to the response. Shrinkage : Fit a model involving all p predictors, but the estimated coefficients are shrunk towards zero relative to the OLS estimates. Di..
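A minimal best-subset sketch with regsubsets from the leaps package, using the built-in mtcars data as a stand-in:

```r
library(leaps)

fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)  # best subset up to size 5
summary(fit)$which   # which predictors enter each subset size
summary(fit)$bic     # BIC for choosing among subset sizes
```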
[R] Linear Model 1. OLS (Ordinary Least Squares) model The linear regression model : \(Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\) OLS Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression. All parameters of the OLS model are unbiased estimators : \(E(\hat{\beta}^{OLS}) = \beta\), and \(Var(\hat{\beta}^{OLS})\) is the smallest among linear unbiased estimators (Gauss-Markov). Problems in multiple li..
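A small illustration of an OLS fit with lm() on the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # OLS estimates of beta_0, beta_1, beta_2
coef(fit)                                # unbiased under the OLS assumptions
summary(fit)$sigma                       # residual standard error, estimate of sd(epsilon)
```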
[R] Cross Validation 1. What is Cross Validation? In the real world, we can't get test data for \(MSE_{test}\). So we should divide the training data into a train set and a test set. Test-set error estimation Mathematical Adjustment : \(C_p\), \(AIC\), \(BIC\), Adjusted \(R^2\) Hold out : holding out a subset of the training set. Validation set approach K-fold Cross Validation LOOCV, LpOCV 2. Validation Set Approach Divide training set..
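A sketch of the validation set approach and 10-fold CV, assuming the built-in mtcars data; train_id is a hypothetical name, and cv.glm comes from the boot package:

```r
set.seed(1)
train_id <- sample(nrow(mtcars), nrow(mtcars) / 2)  # hold out half the data

fit   <- lm(mpg ~ wt + hp, data = mtcars[train_id, ])
preds <- predict(fit, newdata = mtcars[-train_id, ])
mean((mtcars$mpg[-train_id] - preds)^2)             # held-out MSE estimate

library(boot)
g <- glm(mpg ~ wt + hp, data = mtcars)              # glm needed for cv.glm
cv.glm(mtcars, g, K = 10)$delta[1]                  # 10-fold CV estimate
```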
[R] Assessing Model Accuracy 1. How do we assess model accuracy? Quantitative : MSE (mean squared error) Qualitative : Classification error rate Type of dataset Training set : To fit statistical learning models Validation set : To select the optimal tuning parameter Test set : To select the best model 2. MSE (Mean Squared Error) Suppose we fit a model \(\hat{f}(x)\) from the training dataset \((x_i, y_i)\). \(MSE_{train} = \frac{1..
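A one-line sketch of \(MSE_{train}\) as defined above, computed for an lm fit on built-in data:

```r
fit <- lm(mpg ~ wt, data = mtcars)
mse_train <- mean((mtcars$mpg - fitted(fit))^2)  # (1/n) * sum (y_i - f_hat(x_i))^2
mse_train
```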
[R] Flexibility and Interpretability 1. Parametric and Non-Parametric Methods Parametric methods : Make an assumption about the functional form or shape of \(f\). Non-Parametric methods : Do not make explicit assumptions about the functional form of \(f\). 2. Flexibility and Interpretability Flexibility : The flexibility of a model can be described as how much the model's behavior is influenced by the characteristics of the data. So, if fl..
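A tiny sketch contrasting a parametric fit (lm assumes a linear form for \(f\)) with a non-parametric one (loess makes no such assumption), on the built-in cars data:

```r
fit_par    <- lm(dist ~ speed, data = cars)     # rigid form, easy to interpret
fit_nonpar <- loess(dist ~ speed, data = cars)  # flexible, driven by the data
```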
[R] Supervised Learning 1. Model based on Supervised Learning Ideal model : \(Y = f(X) + \epsilon\) Good \(f(X)\) can make predictions of \(Y\) at new points \(X = x\). Statistical Learning refers to a set of approaches for estimating the function \(f(X)\).
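A minimal sketch of the ideal model \(Y = f(X) + \epsilon\) with a simulated \(f\); the choice \(f(x) = \sin(2x)\) and the loess estimator are assumptions for illustration:

```r
set.seed(1)
x   <- runif(100, 0, 3)
y   <- sin(2 * x) + rnorm(100, sd = 0.2)     # f(x) = sin(2x), epsilon ~ N(0, 0.2^2)
fit <- loess(y ~ x)                          # estimate f from the data
predict(fit, newdata = data.frame(x = 1.5))  # prediction of Y at a new point X = x
```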
[R] Introduction to Statistical Learning 1. Definitions of Statistical Learning Statistical Learning is a set of tools for modeling and understanding complex datasets. Supervised Statistical Learning builds a statistical model for predicting or estimating an output based on one or more inputs. Unsupervised Statistical Learning learns relationships and structure from data that has inputs but no supervising output. 2. Supervis..