
Data Science (74)
[Tensorflow] Stochastic Gradient Descent 1. The Loss Function The loss function measures the disparity between the target's true value and the value the model predicts. Different problems call for different loss functions. We've been looking at regression problems, where the task is to predict some numerical value. A common loss function for regression problems is the mean absolute error or MAE. For each prediction y_pred, MAE measures the..
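Below is a minimal sketch (mine, not the post's) of wiring the MAE loss and the SGD optimizer together in Keras; the layer sizes and the 11-feature input shape are assumptions.

# A minimal sketch, assuming an 11-feature regression input; not the post's own model.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[11]),  # assumed hidden layer
    layers.Dense(1),                                        # single regression output
])
model.compile(optimizer='sgd', loss='mae')  # stochastic gradient descent with mean absolute error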
[Tensorflow] Deep Neural Networks 1. Layers Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer. We could think of each layer in a neural network as performing some kind of relatively simple transformation; a deep stack of layers can transform its input in more complex ways. In a well-trained neural network, each layer is a transformation getting us a little bit closer to a..
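A short sketch, not taken from the post, of stacking Dense layers so each one transforms the output of the previous one; the unit counts and two-feature input are arbitrary choices.

# A hedged sketch of a deep stack of Dense layers; sizes are illustrative only.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(units=4, activation='relu', input_shape=[2]),  # hidden layer 1
    layers.Dense(units=3, activation='relu'),                   # hidden layer 2
    layers.Dense(units=1),                                      # linear output unit
])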
[Tensorflow] A Single Neuron 1. What is Deep Learning? Some of the most impressive advances in artificial intelligence in recent years have been in the field of deep learning. Natural language translation, image recognition, and game playing are all tasks where deep learning models have neared or even exceeded human-level performance. So what is deep learning? Deep learning is an approach to machine learning characterized b..
[Sklearn] Hyperparameter Tuning using Grid Search 1. What is Hyperparameter Tuning? A hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins. For example: C in Support Vector Machines, k in K-Nearest Neighbors, or the number of hidden layers in Neural Networks. Grid Search is an exploratory way to find hyp..
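A hedged sketch of grid search with scikit-learn's GridSearchCV; the SVC estimator, the parameter grid, and the iris data are illustrative stand-ins, not the post's own setup.

# A minimal grid-search sketch; estimator, grid, and dataset are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}  # candidate hyperparameter values
grid = GridSearchCV(SVC(), param_grid, cv=5)                   # try every combination with 5-fold CV
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)                     # best combination and its CV score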
[Sklearn] Pipeline 1. What is Pipeline? Pipelines are a simple way to keep our data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so we can use the whole bundle as if it were a single step. Many data scientists hack together models without pipelines, but pipelines have some important benefits. Cleaner code : Accounting for data at each step of prepro..
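A minimal sketch of bundling preprocessing and modeling into one step, assuming numeric features with missing values; the imputer strategy and the random forest are illustrative choices, not the post's.

# A hedged Pipeline sketch: preprocessing and modeling act as a single estimator.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

my_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),                       # preprocessing step
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),  # modeling step
])
# my_pipeline.fit(X_train, y_train); preds = my_pipeline.predict(X_valid)  # used like one model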
[Sklearn] Transforming Columns by its Type 1. What is ColumnTransformer? The scikit-learn library has a special estimator called 'ColumnTransformer'. It applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer will be concatenated to form a single feature space. This is useful for hetero..
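An illustrative sketch of ColumnTransformer; the column names 'age', 'fare', and 'embarked' are assumptions standing in for numeric and categorical columns.

# A hedged sketch: different transformers for different column subsets, outputs concatenated.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'fare']),                     # scale the numeric columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['embarked']),  # encode the categorical column
])
# X_processed = preprocessor.fit_transform(X)  # one combined feature space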
[Sklearn] Scalers 1. What is Scaling? Scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the preprocessing step. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Another rea..
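A short sketch comparing two common scalers on a made-up array with very different column ranges.

# A hedged scaling sketch; the sample values are arbitrary.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(StandardScaler().fit_transform(X))  # each column rescaled to zero mean and unit variance
print(MinMaxScaler().fit_transform(X))    # each column rescaled into the [0, 1] range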
[Sklearn] Feature Engineering Skills 1. What is Feature Engineering? Feature Engineering is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process. 2. Mutual Information Mutual Information is a lot like correlation in that it measu..
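A hedged sketch of computing mutual information scores with scikit-learn; the synthetic regression data stands in for real features and a real target.

# A minimal mutual-information sketch on synthetic data (an assumption, not the post's dataset).
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
mi_scores = mutual_info_regression(X, y, random_state=0)
print(mi_scores)  # one non-negative score per feature; higher means more information about y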
[Models] Classification Models Note : This is just a tiny subset of a full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv") x_test = pd.rea..
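A hedged, self-contained sketch of a baseline classification model; synthetic data from make_classification stands in for the airline dataset loaded above, and the random forest is just one possible model choice.

# A minimal classification baseline on synthetic data (assumption: stands in for the real CSVs).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)          # hold out a validation set
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_va, clf.predict(X_va)))                            # hold-out accuracy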
[Models] Regression Models Note : This is just a tiny subset of a full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv") x_te..
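A hedged, self-contained sketch of a baseline regression model; synthetic data from make_regression stands in for the student-score dataset loaded above.

# A minimal regression baseline on synthetic data (assumption: stands in for the real CSVs).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)   # hold out a validation set
reg = LinearRegression().fit(X_tr, y_tr)
print(mean_absolute_error(y_va, reg.predict(X_va)))               # hold-out MAE of the baseline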
[Sklearn] Dealing Categorical Variables : Encoders 1. What are Categorical Variables? A categorical variable takes only a limited number of values. Consider a survey that asks how often you eat breakfast and provides four options : "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories. If people responded to a survey about which brand of car they owned, the res..
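A small sketch of the two common encoders; the toy breakfast column mirrors the survey example in the excerpt and is made up for illustration.

# A hedged encoder sketch on a toy categorical column.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({'breakfast': ['Never', 'Rarely', 'Most days', 'Every day']})
print(OrdinalEncoder().fit_transform(df))                # one integer per category
print(OneHotEncoder().fit_transform(df).toarray())       # one indicator column per category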
[Sklearn] Dealing Missing Values : Imputers 1. What are Missing Values? Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data. There are three types of missing values according to the mechanisms of missingness. Mis..
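A minimal sketch of mean imputation with SimpleImputer on a made-up array containing NaNs.

# A hedged imputation sketch; the array is illustrative only.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy='mean')   # replace each NaN with its column's mean
print(imputer.fit_transform(X))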
[Models] Underfitting and Overfitting Note : This is just a tiny subset of a full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. 1. Underfitting and Overfitting Problems While we do modeling, we use a reliable way to measure model accuracy. Using those metrics, we can experiment with alternative models and see which gives the best predictions. Overfitting is a problem whe..
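A hedged sketch of diagnosing underfitting and overfitting by varying model complexity; the synthetic data and the candidate max_leaf_nodes values are assumptions, not the post's.

# Compare validation MAE across tree sizes: too few leaves underfits, too many overfits.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
for max_leaf_nodes in [5, 50, 500, 5000]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0).fit(X_tr, y_tr)
    mae = mean_absolute_error(y_va, model.predict(X_va))
    print(max_leaf_nodes, round(mae, 1))   # pick the complexity with the lowest validation error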
[Models] How to make model Note : This is just a tiny subset of a full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. Step 1 : Selecting data for modeling We need to start by picking a few variables using our intuition. To choose variables/columns, we need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame. ..
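A short sketch of this first step, assuming a hypothetical DataFrame; the column names are illustrative and not the post's actual dataset.

# A hedged column-selection sketch on a made-up housing table.
import pandas as pd

df = pd.DataFrame({'Rooms': [2, 3, 4], 'Bathroom': [1, 1, 2], 'Price': [450000, 650000, 900000]})
print(df.columns)                      # list all columns to pick candidate features from
feature_names = ['Rooms', 'Bathroom']  # variables chosen by intuition
X = df[feature_names]                  # the features
y = df['Price']                        # the prediction target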
[Sklearn] Cross Validation 1. What is Cross Validation? We want to evaluate almost every model we ever build. In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens? Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the..
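A minimal sketch of cross-validation, which scores the model on held-out folds rather than on its own training data; the estimator, the synthetic data, and the 5-fold setting are illustrative choices.

# A hedged cross-validation sketch; negate the sklearn score to get MAE.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
scores = -cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                          X, y, cv=5, scoring='neg_mean_absolute_error')
print(scores.mean())   # average MAE across the 5 held-out folds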
[Sklearn] Modules 1. What is Scikit-Learn? Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon Numpy, Scip..
[pandas] Basic Data Exploration 1. Basic Exploratory Data Analysis Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. The most important object in the pandas library is the DataFrame. A DataFrame is a table-like object consisting of rows and columns. import pandas as pd # Save filepath to variable for easier access melbourne_file_path = '../input..
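A hedged sketch of the basic exploration calls; a tiny inline DataFrame is used here so the sketch runs on its own, whereas the post reads the (truncated) CSV path with pd.read_csv instead.

# A minimal exploration sketch; the inline values are made up.
import pandas as pd

melbourne_data = pd.DataFrame({'Rooms': [2, 3, 4, 3], 'Price': [1035000, 1465000, 1600000, 850000]})
print(melbourne_data.describe())  # count, mean, std, min, quartiles, max per numeric column
print(melbourne_data.head())      # the first rows of the table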
[Theorem] Bias vs Variance 1. Intersection between Bias and Variance Let's review the overfitting and underfitting problems. Underfitting occurs when we use a polynomial of too low a degree; overfitting occurs when we use a polynomial of too high a degree. So, when we plot the training error \(J(\Theta)\) against the degree of the polynomial, we can see that at lower degrees the error is high,..
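For reference, the errors being plotted against the polynomial degree can be written as follows (standard course notation; this restatement is mine, not quoted from the post):

\[
J_{\mathrm{train}}(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\Theta\left(x^{(i)}\right) - y^{(i)}\right)^2,
\qquad
J_{\mathrm{cv}}(\Theta) = \frac{1}{2m_{\mathrm{cv}}}\sum_{i=1}^{m_{\mathrm{cv}}}\left(h_\Theta\left(x_{\mathrm{cv}}^{(i)}\right) - y_{\mathrm{cv}}^{(i)}\right)^2
\]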
[Theorem] Validation Sets 1. Decide what to do? In machine learning, errors can arise even when we have set the algorithm up correctly. If so, what should we try next? We can try some of the following solutions : Get more training examples Try smaller sets of features Try getting additional features Try adding polynomial features Try decreasing lambda Try increasing lambda Then, how can we select among the above solutions? So we nee..
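A hedged sketch of carving out a separate validation (cross-validation) set so the candidate fixes above can be compared fairly; the 60/20/20 split and the synthetic data are assumptions, not the post's.

# Split into train / validation / test so model choices are judged on unseen data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_cv), len(X_test))  # 600 training, 200 validation, 200 test examples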
[Theorem] Optimizing Neural Network 1. Unrolling Parameters In a neural network, there are differences in advanced optimization. In logistic regression, our parameter \(\theta\) is a single vector. But in a neural network, the parameters are matrices rather than one vector. So if we want to do back propagation, we need to unroll the parameters. 2. Gradient Checking One property of back propagation is that there are many ways to..
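A hedged NumPy sketch of gradient checking: compare a two-sided numerical gradient with an analytic one on a toy cost function. The quadratic cost is an illustration, not the post's code.

# Gradient checking with a two-sided finite difference on a toy cost J(theta) = sum(theta^2).
import numpy as np

def cost(theta):
    return np.sum(theta ** 2)          # toy cost J(theta)

def analytic_grad(theta):
    return 2 * theta                   # the gradient backprop would compute for this cost

# (Unrolling parameter matrices into one vector would look like
#  np.concatenate([Theta1.ravel(), Theta2.ravel()]) before running a check like this.)
theta = np.array([1.0, -2.0, 0.5])
eps = 1e-4
num_grad = np.zeros_like(theta)
for i in range(theta.size):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps
    minus[i] -= eps
    num_grad[i] = (cost(plus) - cost(minus)) / (2 * eps)   # two-sided difference estimate
print(np.max(np.abs(num_grad - analytic_grad(theta))))      # should be very small if backprop is right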