All categories (150)
[Models] Underfitting and Overfitting Note: This is just a tiny subset of the full modeling workflow. We must first understand the domain knowledge behind our training datasets and do statistical analysis. 1. Underfitting and Overfitting Problems When we do modeling, we use a reliable way to measure model accuracy. Using those metrics, we can experiment with alternative models and see which gives the best predictions. Overfitting is a problem whe..
[Models] How to make a model Note: This is just a tiny subset of the full modeling workflow. We must first understand the domain knowledge behind our training datasets and do statistical analysis. Step 1: Selecting data for modeling We need to start by picking a few variables using our intuition. To choose variables/columns, we need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame. ..
[Sklearn] Cross Validation 1. What is Cross Validation? We want to evaluate almost every model we ever build. In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens? Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the..
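The mistake described in this excerpt, scoring a model on the same data it was trained on, is what cross-validation is meant to avoid. The following is a minimal sketch in plain Python, not the post's own code: `k_fold_indices` and `cross_validate_mean_model` are hypothetical helpers, and the "predict the training mean" baseline is an assumption chosen only to keep the example self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds (hypothetical helper)."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, valid_idx
        start += size


def cross_validate_mean_model(y, k):
    """Return one mean absolute error per fold for a 'predict the training mean' baseline."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(y), k):
        # "Fit" on the training fold only: the model is just the training mean
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        # Score on the held-out fold, which the model has never seen
        mae = sum(abs(y[i] - prediction) for i in valid_idx) / len(valid_idx)
        scores.append(mae)
    return scores


if __name__ == "__main__":
    print(cross_validate_mean_model([1.0, 2.0, 3.0, 4.0], k=2))
```

Each sample is used for validation exactly once and is never in the training fold at the same time. For real estimators, scikit-learn offers `cross_val_score` in `sklearn.model_selection`, which handles the splitting and scoring.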
[Sklearn] Modules 1. What is Scikit-Learn? Scikit-learn (Sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, Scip..
[pandas] Basic Data Exploration 1. Basic Exploratory Data Analysis Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. The most important object in the pandas library is the DataFrame. A DataFrame holds a table made up of rows and columns.
    import pandas as pd
    # Save filepath to variable for easier access
    melbourne_file_path = '../input..
[Theorem] Bias vs Variance 1. Intersection between Bias and Variance Let's review the overfitting and underfitting problems. Underfitting occurs when the polynomial degree of the hypothesis is too low. Overfitting occurs when the polynomial degree is too high. So, when we plot the training error \(J(\Theta)\) against the degree of the polynomial, we can see that at lower degrees the error is high,..
[Theorem] Validation Sets 1. Decide what to do? In machine learning, predictions can sometimes have large errors even when we have set up the algorithm correctly. If so, what should we try next? We can consider the following options: get more training examples; try smaller sets of features; try getting additional features; try adding polynomial features; try decreasing lambda; try increasing lambda. Then, how can we choose among these options? So we nee..
[Theorem] Optimizing Neural Network 1. Unrolling Parameters With neural networks, advanced optimization works a little differently. In logistic regression, our parameter \(\theta\) is a vector with a single column. But in a neural network, the parameters are matrices rather than vectors. So if we want to do backpropagation, we need to unroll the parameters. 2. Gradient Checking One property of backpropagation is that there are many ways to..
[Theorem] Neural Network 1. What is a Neural Network? With polynomial terms in linear regression and logistic regression, the hypothesis needs a very large number of features. For example, if we have \(50 \times 50\) pixel images, then the total number of pixels is 2500. So the total number of features for logistic regression becomes \(n = 2500 + \alpha\) (very big when applying polynomial terms). If we have too many features, we can have an overfitting problem and ..
[Theorem] Regularization 1. Regularization of Logistic Regression Because we don't know which parameters \(\theta_j\) contribute to overfitting, we make all of them small. $$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2 $$ \(\lambda\) is called the regularization parameter, which controls a trade-off between two different goals. The first goal is that we would lik..
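The regularized cost in this excerpt can be evaluated directly. The following is a minimal sketch under stated assumptions: `regularized_cost` is a hypothetical helper, the hypothesis is assumed to be linear (the excerpt's squared-error term suggests it, but the post is truncated), and the penalty is summed over the parameters \(\theta_1, \dots, \theta_n\), leaving \(\theta_0\) unpenalized.

```python
def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty (hypothetical helper).

    theta: parameters [theta_0, theta_1, ..., theta_n]
    X:     list of examples, each a list of n feature values
    y:     list of target values
    lam:   regularization parameter lambda
    """
    m = len(y)
    squared_error = 0.0
    for x_i, y_i in zip(X, y):
        # Assumed linear hypothesis: h(x) = theta_0 + sum_j theta_j * x_j
        h = theta[0] + sum(t * x for t, x in zip(theta[1:], x_i))
        squared_error += (h - y_i) ** 2
    # The penalty starts at j = 1, so theta_0 is not penalized
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return squared_error / (2 * m) + penalty


if __name__ == "__main__":
    X = [[1.0], [2.0]]
    y = [1.0, 2.0]
    print(regularized_cost([0.0, 1.0], X, y, lam=0.0))  # perfect fit, no penalty
    print(regularized_cost([0.0, 1.0], X, y, lam=0.5))  # same fit, penalized
```

With a larger `lam`, the same parameters cost more, so minimizing the cost pushes the parameters toward smaller values.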
[Theorem] Overfitting 1. Overfitting in Linear Regression When the degree of freedom is low, \(H(x)\) can only predict the output in a simple way and can't fit every case of x. This is called 'underfitting' or 'high bias'. When the degree of freedom is appropriate (not too low and not too high), it predicts the output well. When the degree of freedom is high, the model can predict the training output well, but can't generalize well to predict new data..
[Theorem] Logistic Regression 1. What is a Classification Problem? Usually classification has two discrete outputs, zero and one, where the first is the 'negative' output and the other is the 'positive' output. For example, in spam classification, zero means the mail is not spam and one means the mail is spam. $$ y \in \{0, 1\} $$ Multiclass classification has multiple discrete outputs. $$ y \in \{0, 1, 2, \dots\} $$ 2. Logistic Regressi..
[Theorem] Multivariate Linear Regression 1. Multivariate Hypothesis
    feet (x1)   number of rooms (x2)   built age (x3)   price of house
    1412        5                      30               3520
    1530        3                      45               2420
    642         2                      56               1238
\(x^{(i)}_{j}\) : value of feature j in the ith training example; \(x^{(i)}\) : the input features of the ith training example; \(m\) : the number of training examples; \(n\) : the number of features. For example, \(x^{(2)}_{3}\) is 45, and \(x_3\) is [30, 45, 56], a 3-dimensional vec..
[Theorem] Linear Regression 1. What is a Hypothesis Function? In supervised learning, we use a regression algorithm when the problem is to predict a continuous output. Using known data x, y in linear regression, we can predict \(y(n)\) when we have \(x(n)\) and a function relating \((x, y)\). The function for the one-variable case is: $$ H_{\theta}(x) = y = \theta_0 + \theta_1 x $$ \(m\) : number of record..
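To make the one-variable hypothesis concrete, here is a minimal sketch in plain Python. `hypothesis` and `fit_line` are hypothetical helpers, and the closed-form least-squares fit is an assumption swapped in for brevity; the truncated post may well fit the parameters another way (for example, with gradient descent).

```python
def hypothesis(theta0, theta1, x):
    """One-variable hypothesis: H(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x


def fit_line(xs, ys):
    """Closed-form least-squares fit for one variable (assumed method, see lead-in)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope = covariance(x, y) / variance(x); the 1/m factors cancel
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    # Intercept makes the line pass through the point of means
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


if __name__ == "__main__":
    theta0, theta1 = fit_line([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
    print(theta0, theta1)
    print(hypothesis(theta0, theta1, 4.0))
```

Once the two parameters are known, predicting for a new input is a single evaluation of the hypothesis.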
[plotly] Layout components of plotly 1. Updating or Modifying Figures Made with Plotly Express If none of the built-in Plotly Express arguments let us customize the figure the way we need to, we can use the update_* and add_* methods on the plotly.graph_objects.Figure object returned by the PX function to make further modifications to the figure. 2. Use case of those methods
    import plotly.express as px
    df = px.data.tips()
    fig = px.histog..
[plotly] Graph Objects in Python 1. What are Figure objects? The plotly Python package exists to create, manipulate and render graphical figures represented by data structures also referred to as figures. Figures can be represented in Python either as dicts or as instances of the plotly.graph_objects.Figure class, and are serialized as text in JSON before being passed to Plotly.js. Figure({ 'data': [{'hovertemplate': 'x=%{x} y=%..
[plotly] Plotly Express in Python 1. What is Plotly Express? The plotly.express module contains functions that can create entire figures at once, and is referred to as Plotly Express or PX. Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures. Every Plotly Express function uses graph objects internally and returns a plotly.graph_objects.Figure instance. 2..
[plotly] iplot in Python 1. What is iplot? iplot() comes from the Cufflinks wrapper, which connects plotly to pandas DataFrames. It seems to be the easiest way to get interactive plots with a single line of code. 2. Differences between iplot and plot iplot is an interactive plot. Plotly takes Python code and makes beautiful-looking JavaScript plots. The plot command uses Matplotlib, which is more old-school. It creates static chart..
[plotly] Getting Started with Plotly in Python 1. What is Plotly? The plotly Python library is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases. plotly also enables Python users to create beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks, saved to standalone HTML file..
[pandas] Useful personal functions for EDA 1. Check missing records
    import pandas as pd

    def missing(df):
        # Number of missing values per column, largest first
        missing_number = df.isnull().sum().sort_values(ascending=False)
        # Fraction of missing values per column, largest first
        missing_percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
        # Combine both Series into one table, one row per column of df
        missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_number', 'Missing_percent'])
        return missing_values
2. Grouping columns by their features
    def categorize(df):
        Quan..