
Machine Learning (27)
[Tensorflow] Binary Classification 1. Binary Classification Classification into one of two classes is a common machine learning problem. We might want to predict whether or not a customer is likely to make a purchase, whether or not a credit card transaction was fraudulent, whether deep space signals show evidence of a new planet, or whether a medical test shows evidence of a disease. These are all binary classification problems. In our raw data..
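The preview above cuts off before the raw data is introduced, so the sketch below only illustrates the general recipe the post points at : a Keras model with a single sigmoid output trained with binary cross-entropy. The data is randomly generated stand-in data, not the post's dataset.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in data : 500 samples, 10 numeric features, and a 0/1 target.
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)

# A sigmoid output unit turns the final score into a probability of the positive class.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Binary cross-entropy is the usual loss for two-class problems.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)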
[Tensorflow] Dropout and Batch Normalization 1. Dropout A Dropout layer can help correct overfitting. Overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights. Being so specific, they tend to be fragile : remove one and the conspiracy falls apart. This is the idea behind dropout. To ..
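A minimal sketch of how Dropout and BatchNormalization layers are typically interleaved with Dense layers in Keras; the layer sizes, dropout rate, and input shape here are illustrative assumptions, not values from the post.

from tensorflow import keras
from tensorflow.keras import layers

# Dropout randomly zeroes a fraction of a layer's outputs at each training step,
# breaking up the fragile "conspiracies" of weights described above.
# BatchNormalization rescales each batch to keep training stable.
model = keras.Sequential([
    keras.Input(shape=(20,)),          # assumed number of input features
    layers.Dense(128, activation="relu"),
    layers.Dropout(rate=0.3),
    layers.BatchNormalization(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(rate=0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])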
[Tensorflow] Overfitting and Underfitting 1. Interpreting the Learning Curves We might think about the information in the training data as being of two kinds : signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data; the noise is all of the random fluctuation that comes from data in the real world or all of the in..
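A small sketch of how learning curves are usually obtained in Keras : the history returned by fit() records the per-epoch training and validation loss, which can then be plotted. The data and model here are stand-ins, not the post's.

import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in regression data.
X = np.random.rand(500, 10)
y = X.sum(axis=1) + np.random.normal(scale=0.1, size=500)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")

# validation_split holds out part of the data; the gap between the two curves
# suggests how much noise, rather than signal, the model has learned.
history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
pd.DataFrame(history.history).plot()  # training vs validation learning curves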
[Tensorflow] Stochastic Gradient Descent 1. The Loss Function The loss function measures the disparity between the target's true value and the value the model predicts. Different problems call for different loss functions. We've been looking at regression problems, where the task is to predict some numerical value. A common loss function for regression problems is the mean absolute error, or MAE. For each prediction y_pred, MAE measures the..
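The preview truncates the MAE definition, so here is a small worked example of the quantity it describes, with hypothetical targets and predictions.

import numpy as np

# MAE is the average absolute difference between the true targets and the predictions.
y_true = np.array([3.0, 5.0, 2.5])      # hypothetical targets
y_pred = np.array([2.5, 5.0, 4.0])      # hypothetical predictions
mae = np.mean(np.abs(y_true - y_pred))  # (0.5 + 0.0 + 1.5) / 3 = 0.666...

# In Keras the same loss is selected by name when compiling a model,
# e.g. model.compile(optimizer="sgd", loss="mae")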
[Tensorflow] Deep Neural Networks 1. Layers Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer. We could think of each layer in a neural network as performing some kind of relatively simple transformation, and a stack of such layers can transform its input in more complex ways. In a well-trained neural network, each layer is a transformation getting us a little bit closer to a..
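A minimal sketch of the dense-layer stack described above, using the Keras Sequential API; the unit counts and input shape are assumptions for illustration.

from tensorflow import keras
from tensorflow.keras import layers

# A stack of dense layers: each hidden layer applies a relatively simple
# transformation, and the stack as a whole can represent more complex functions.
model = keras.Sequential([
    keras.Input(shape=(8,)),                   # assumed number of input features
    layers.Dense(units=32, activation="relu"), # hidden layer 1
    layers.Dense(units=32, activation="relu"), # hidden layer 2
    layers.Dense(units=1),                     # linear output unit
])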
[Tensorflow] A Single Neuron 1. What is Deep Learning? Some of the most impressive advances in artificial intelligence in recent years have been in the field of deep learning. Natural language translation, image recognition, and game playing are all tasks where deep learning models have neared or even exceeded human-level performance. So what is deep learning? Deep learning is an approach to machine learning characterized b..
[Sklearn] Hyperparameter Tuning using Grid Search 1. What is Hyperparameter Tuning? A hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins. For example, C in Support Vector Machines, k in K-Nearest Neighbors, or the number of hidden layers in Neural Networks. Grid Search is an exploratory way to find hyp..
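A short sketch of Grid Search with scikit-learn's GridSearchCV, using the C parameter of an SVM mentioned above; the synthetic data and the grid values are arbitrary examples, not the post's.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in classification data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Every combination in the grid is tried with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)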
[Sklearn] Pipeline 1. What is Pipeline? Pipelines are a simple way to keep our data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so we can use the whole bundle as if it were a single step. Many data scientists hack together models without pipelines, but pipelines have some important benefits. Cleaner code : Accounting for data at each step of prepro..
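A minimal sketch of bundling a preprocessing step and a model into a single scikit-learn Pipeline; the imputer, the random forest, and the synthetic data are illustrative choices, not the post's.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # stand-in data

# The preprocessing and modeling steps are bundled and used as one step.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipe.fit(X, y)
preds = pipe.predict(X)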
[Sklearn] Transforming Columns by its Type 1. What is ColumnTransformer? The scikit-learn library has a special function called 'ColumnTransformer'. It applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer will be concatenated to form a single feature space. This is useful for hetero..
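A small sketch of ColumnTransformer applying a different transformer to numeric and categorical columns and concatenating the results; the tiny DataFrame and its column names are hypothetical.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical heterogeneous DataFrame with one numeric and one categorical column.
df = pd.DataFrame({"age": [22, 35, 58], "city": ["Seoul", "Busan", "Seoul"]})

# Each transformer is applied only to the columns listed with it,
# and the outputs are concatenated into a single feature space.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
features = preprocessor.fit_transform(df)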
[Sklearn] Scalers 1. What is Scaling? Scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the preprocessing step. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Another rea..
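A brief sketch contrasting two common scikit-learn scalers on features with very different ranges; the array values are made up for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# StandardScaler centers each column to mean 0 and unit variance;
# MinMaxScaler rescales each column into the [0, 1] range.
X_std = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)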
[Sklearn] Feature Engineering Skills 1. What is Feature Engineering? Feature Engineering is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process. 2. Mutual Information Mutual Information is a lot like correlation in that it measu..
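A minimal sketch of computing mutual information scores with scikit-learn; the synthetic data is deliberately built so that one feature relates to the target and the other is noise, which is an assumption made only for this example.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: the first column drives the target, the second is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Higher scores indicate features that share more information with the target.
scores = mutual_info_regression(X, y, random_state=0)
print(scores)  # the first feature should score noticeably higher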
[Models] Classification Models Note : This is just a tiny subset of the full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv") x_test= pd.rea..
[Models] Regression Models Note : This is just a tiny subset of the full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv") x_te..
[Sklearn] Dealing Categorical Variables : Encoders 1. What are Categorical Variables? A categorical variable takes only a limited number of values. Consider a survey that asks how often you eat breakfast and provides four options : "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories. If people responded to a survey about which brand of car they owned, the res..
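A small sketch of the two common scikit-learn encoders, reusing the breakfast survey example from the preview; the explicit category order passed to OrdinalEncoder is an assumption about how one might rank the responses.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# The breakfast survey responses as a small DataFrame.
df = pd.DataFrame({"breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"]})

# OrdinalEncoder maps each category to an integer; passing the category order
# preserves the natural ranking of the responses.
ordinal = OrdinalEncoder(categories=[["Never", "Rarely", "Most days", "Every day"]])
ordered = ordinal.fit_transform(df[["breakfast"]])

# OneHotEncoder creates one 0/1 column per category, with no implied order.
dummies = OneHotEncoder().fit_transform(df[["breakfast"]]).toarray()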
[Sklearn] Dealing Missing Values : Imputers 1. What are Missing Values? Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data. There are three types of missing values according to the mechanisms of missingness. Mis..
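A minimal sketch of filling missing values with scikit-learn's SimpleImputer; the column name and values are hypothetical.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with missing entries.
df = pd.DataFrame({"income": [40000.0, np.nan, 52000.0, np.nan, 61000.0]})

# SimpleImputer fills missing entries with a statistic of the observed values;
# "mean" is the default, "median" and "most_frequent" are common alternatives.
imputer = SimpleImputer(strategy="mean")
df[["income"]] = imputer.fit_transform(df[["income"]])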
[Models] Underfitting and Overfitting Note : This is just a tiny subset of the full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. 1. Underfitting and Overfitting Problems While we do modeling, we use a reliable way to measure model accuracy. Using those metrics, we can experiment with alternative models and see which gives the best predictions. Overfitting is the problem whe..
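A short sketch of the kind of comparison this post builds toward : fitting the same model at several capacities and comparing validation error. The DecisionTreeRegressor, the max_leaf_nodes values, and the synthetic data are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)  # stand-in data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# A very small tree underfits, a very deep tree overfits; comparing validation
# MAE across candidate sizes shows which setting generalizes best.
for max_leaf_nodes in [5, 50, 500]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_valid, model.predict(X_valid))
    print(max_leaf_nodes, round(mae, 1))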
[Models] How to make model Note : This is just a tiny subset of the full modeling workflow. We must understand the domain knowledge of our training datasets and do statistical analysis first. Step 1 : Selecting data for modeling We need to start by picking a few variables using our intuition. To choose variables/columns, we need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame. ..
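A minimal sketch of Step 1 as described above : listing the columns, picking a few features by intuition, and fitting a first model. The DataFrame, column names, and choice of DecisionTreeRegressor are assumptions for illustration, not the post's dataset.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical DataFrame; in practice this would come from a CSV file.
df = pd.DataFrame({"Rooms": [2, 3, 4, 3], "Bathroom": [1, 2, 2, 1], "Price": [300, 450, 620, 480]})

print(df.columns)                 # list every column to pick candidate features
features = ["Rooms", "Bathroom"]  # selected by intuition, as described above
X = df[features]
y = df["Price"]

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)
print(model.predict(X.head()))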
[Sklearn] Cross Validation 1. What is Cross Validation? We want to evaluate almost every model we ever build. In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens? Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the..
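A small sketch of cross-validation with scikit-learn's cross_val_score, which scores the model only on folds it was not trained on; the model, scoring metric, and synthetic data are illustrative choices.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # stand-in data

# The data is split into 5 folds; each fold is held out once and the model
# is scored on it, so no prediction is compared against its own training data.
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())  # average MAE across the 5 folds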
[Sklearn] Modules 1. What is Scikit-Learn? Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface in Python. This library, which is largely written in Python, is built upon Numpy, Scip..
[Theorem] Bias vs Variance 1. Intersection between Bias and Variance Let's review the overfitting and underfitting problems. Underfitting is the problem that arises when we use too low a polynomial degree. Overfitting is the problem that arises when we use too high a polynomial degree. So, when we plot the training error \(J(\Theta)\) against the degree of the polynomial, we can see that at lower degrees the error is high,..
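The preview stops mid-sentence, but the trade-off it plots is commonly summarized by the standard bias-variance decomposition of the expected squared error; the identity below is the textbook form, offered here as context rather than as the post's own derivation.

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{large when underfitting}} + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{large when overfitting}} + \sigma^2
\]

where \(\sigma^2\) is the irreducible noise in the data.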