Data Science

(74)
[R] Simulation Study : Prediction Performance 1. How do we simulate prediction performance? If \(X\) is of size \(n \times p\) (\(200 \times 2000\)), we need to find which variables are selected and which variables affect the target most (coefficients). So we need to consider various models that predict well. - M1 : \(\hat{\beta}^{lasso} + \lambda_{min}\) - M2 : \(\hat{\beta}^{lasso} + \lambda_{1se}\) - M3 : \(\hat{\beta}^{lasso} + \lambda_{mi..
[pandas] Optimizing DataFrame's Memory 1. Estimating the amount of memory The pandas DataFrame.info() method reports the non-null counts, dtypes, and memory usage of a DataFrame. Passing memory_usage='deep' gives a more accurate memory estimate. import pandas as pd df = pd.read_csv('file.csv') df.info(memory_usage='deep') 1.1 Pandas BlockManager The pandas BlockManager class optimizes data by type and stores it sepa..
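As a small illustration of the memory estimate described in this entry, the sketch below compares the shallow and deep figures from `DataFrame.memory_usage`. The DataFrame, its column names, and its values are made up for the example and do not come from the post; the string column is forced to `object` dtype so that the difference is visible.

```python
import pandas as pd

# Hypothetical data: one integer column and one string column.
# dtype="object" is set explicitly so the strings are stored as Python objects.
df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "city": pd.Series(["Seoul", "Busan", "Incheon", "Daegu"], dtype="object"),
})

# Shallow estimate: object columns are counted as references only.
shallow = df.memory_usage(deep=False).sum()
# Deep estimate: also measures the Python string objects themselves.
deep = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")

# The deep figure is what info() reports when memory_usage='deep' is passed.
df.info(memory_usage="deep")
```

The gap between the two numbers comes entirely from the object column, which is why string-heavy frames are usually the first place to look when reducing memory.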
[R] Useful Functions for Regression Problems 1. model.matrix() Create dummy variables for categorical predictors, with an intercept : model.matrix(~., x) Create dummy variables for categorical predictors, without an intercept : model.matrix(~., x)[, -1] Make predictions from the regsubsets, glm, or lm function : model.matrix(~., x) %*% coef(g, id=i) Make predictions from glmnet : model.matrix(~., x) %*% coef(g, s=g$lambda[i]) The model.matrix function converts original data into categorical..
[R] Regularization Methods : Binary 1. Regularization Methods Regularization methods are based on a penalized likelihood : \(Q_{\lambda}(\beta_0, \beta) = -l(\beta_0, \beta) + p_{\lambda}(\beta)\) \((\hat{\beta}_0, \hat{\beta}) = \arg\min Q_{\lambda}(\beta_0, \beta)\) Penalized likelihood for quantitative Linear regression model : \(y_i = \beta_0 + x_i^T \beta + \epsilon_i\) l1-norm : \(\lambda \sum|\beta_j|\) l2-norm : \(\lambd..
[R] Variable Selection Methods : Lasso 1. Lasso Regression Ridge has the disadvantage of including all \(p\) predictors in the final model. What we want to do is variable selection. Lasso shrinks \(\hat{\beta}\) towards zero. \(RSS + \lambda\sum_{j=1}^{p}|\beta_j|\) The \(l_1\)-norm of \(\hat{\beta}\) : \(df(\hat{\beta}_{\lambda_1}) = 0
[R] Variable Selection Methods : Ridge 1. Variable Selection Methods We cannot use subset selection model in \(n > Var(\hat{\beta}^{sh})\) Examples Ridge Lasso Elastic Net : Ridge + Lasso 3. Ridge Regression \(RSS + \lambda\sum_{j=1}^{p}\beta_j^2\) where \(\lambda \ge 0\) is a tuning parameter. For a grid of \(\lambda\) : \(\lambda_{max} = \lambda_1 > ... > \lambda_m = \lambda_{min}\). The \(l_2\)-norm of \(\hat{\beta}\) : \(||\hat{\b..
[R] Best Subset Selection 1. Three classes of methods To solve the problem (the variance becomes higher as the number of features grows), we need to make p lower than n. Subset Selection : Identify a subset of the p predictors that we believe to be related to the response. Shrinkage : Fit a model involving all p predictors, but the estimated coefficients are shrunk towards zero relative to the OLS estimates. Di..
[R] Linear Model 1. OLS(Ordinary Least Squares) model The linear regression model : \(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon\) OLS Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression. All OLS coefficient estimates are unbiased estimators. \(E(\hat{\beta}^{OLS}) = \beta\) \(Var(\hat{\beta}^{OLS}) ↓\) Problems in multiple li..
[R] Cross Validation 1. What is Cross Validation? In the real world, we usually have no test data for computing \(MSE_{test}\). So we should divide the training data into a training set and a test set. Test-set error estimation Mathematical Adjustment : \(C_p\), \(AIC\), \(BIC\), Adjusted \(R^2\) Hold out : holding out a subset of the training set. Validation set approach K-fold Cross Validation LOOCV, LpOCV 2. Validation Set Approach Divide training set..
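The K-fold idea named in this entry can be sketched in a few lines of plain Python. This is a minimal sketch, not the post's R code: the helper names `k_fold_indices` and `k_fold_splits` are hypothetical, and the folds are contiguous and unshuffled, whereas in practice the indices are usually shuffled first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    # The first (n % k) folds get one extra element.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) once per fold."""
    folds = k_fold_indices(n, k)
    for i, validation in enumerate(folds):
        # Every fold except the i-th is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_splits(n=10, k=3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Setting `k` equal to `n` gives one observation per validation fold, which corresponds to LOOCV.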
[R] Assessing Model Accuracy 1. How do we assess model accuracy? Quantitative : MSE(mean squared error) Qualitative : Classification error rate Type of dataset Training set : To fit statistical learning models Validation set : To select optimal tuning parameter Test set : To select the best model 2. MSE(Mean Squared Error) Suppose we fit a model \(\hat{f}(x)\) on the training dataset \((x_i, y_i)\). \(MSE_{train} = \frac{1..
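For a concrete number, the sketch below computes a mean squared error under the usual definition, the average of the squared differences between observed and predicted values. The function name and the sample values are assumptions made for illustration; they are not taken from the post.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observations and predictions.
y_observed = [3.0, 5.0, 7.0]
y_predicted = [2.0, 5.0, 9.0]

# Squared errors are 1, 0, and 4, so the mean is 5/3.
mse = mean_squared_error(y_observed, y_predicted)
print(mse)
```

Applying the same function to held-out data instead of the training data gives the test-side counterpart of the estimate.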
[R] Flexibility and Interpretability 1. Parametric and Non-Parametric Methods Parametric methods : Make an assumption about the functional form or shape of \(f\). Non-Parametric methods : Do not make explicit assumptions about the functional form of \(f\). 2. Flexibility and Interpretability Flexibility : The flexibility of a model can be described as how much the model's behavior is influenced by characteristics of the data. So, if fl..
[R] Supervised Learning 1. Model based on Supervised Learning Ideal model : \(Y = f(X) + \epsilon\) A good \(f(X)\) can make predictions of \(Y\) at new points \(X = x\). Statistical Learning refers to a set of approaches for estimating the function \(f(X)\). # Indexing without index
[R] Introduction to Statistical Learning 1. Definitions of Statistical Learning Statistical Learning is a set of tools for modeling and understanding complex datasets. Supervised Statistical Learning builds a statistical model for predicting or estimating an output based on one or more inputs. Unsupervised Statistical Learning learns relationships and structure from data that has inputs but no supervising output. 2. Supervis..
[pandas] Introduction to Pandas 1. Importing Pandas library The Numpy library provides useful operations for algebraic computation and has the advantage of fast execution time. However, Numpy's ndarray has the disadvantage that it handles only numerical data well and stores only a single data type. The Pandas library allows us to work with different data types. import pandas as pd 2. Reading csv format file using Pan..
[numpy] Processing Datasets, Boolean, and Datatypes in Numpy 1. Datasets in Numpy 1.1 Load csv file into ndarray The numpy.genfromtxt() method reads the numeric data in a text file into an ndarray. import numpy as np file = np.genfromtxt('file.csv', delimiter=',') first_five = file[:5,:] The values stored in the ndarray are displayed in scientific notation or as nan. np.nan stands for "not a number" and here marks character data in the csv file that could not be read as a number. In addition, since each..
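The nan behavior described in this entry can be reproduced without a file on disk. In the sketch below, the CSV text is a made-up stand-in for 'file.csv' and is passed through `io.StringIO`, which `numpy.genfromtxt` accepts in place of a file name.

```python
import io

import numpy as np

# Hypothetical CSV contents: a header row, a text column, and numeric columns.
csv_text = "year,city,value\n2019,Seoul,1.5\n2020,Busan,2.5\n"

# With the default float dtype, anything that is not a number becomes nan,
# including the entire header row and the 'city' column.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(data)

# skip_header=1 drops the header row so that only data rows remain.
body = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(body)
```

The text column still comes back as nan after the header is skipped, since a float ndarray cannot hold strings.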
[numpy] Arithmetics with Numpy Arrays 1. Basic Arithmetics 1.1 Adding values Adding two ndarrays of the same length produces a new ndarray whose elements are the sums at each index. With plain Python lists, the elements of each list must be extracted and added one by one, so the numpy package is much more efficient for algebraic operations. The reason why the numpy object is much faster than the Python list is that the numpy package is written in the C lan..
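A short sketch of the element-wise addition described in this entry, next to the plain-list version it is contrasted with. The array values are arbitrary examples.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# ndarray addition is element-wise and returns a new ndarray.
total = a + b
print(total)

# With plain lists, each pair of elements is extracted and added explicitly.
list_a = [1, 2, 3]
list_b = [10, 20, 30]
list_total = [x + y for x, y in zip(list_a, list_b)]
print(list_total)
```

Note that `list_a + list_b` would concatenate the two lists rather than add them, which is one reason the explicit loop is needed without numpy.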
[numpy] Basic operations of Numpy array 1. What is Numpy? Numpy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. Numpy is an abbreviation for Numerical Python and is a Python package used in algebraic calculations. import numpy as np 2. Array 2.1 Creating an Array object Numpy's core data structure is ndarray, which has a similar structur..
[Tensorflow] Binary Classification 1. Binary Classification Classification into one of two classes is a common machine learning problem. We might want to predict whether or not a customer is likely to make a purchase, whether or not a credit card transaction was fraudulent, whether deep space signals show evidence of a new planet, or whether a medical test shows evidence of a disease. These are all binary classification problems. In our raw data..
[Tensorflow] Dropout and Batch Normalization 1. Dropout A dropout layer can help correct overfitting. Overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights. Being so specific, they tend to be fragile : remove one and the conspiracy falls apart. This is the idea behind dropout. To ..
[Tensorflow] Overfitting and Underfitting 1. Interpreting the Learning Curves We might think about the information in the training data as being of two kinds : signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data; the noise is all of the random fluctuation that comes from data in the real-world or all of the in..