Data Science/Scikit-Learn

(9)
[Sklearn] Hyperparameter Tuning using Grid Search 1. What is Hyperparameter Tuning? A hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of a hyperparameter has to be set before the learning process begins. Examples include C in Support Vector Machines, k in K-Nearest Neighbors, and the number of hidden layers in Neural Networks. Grid Search is an exploratory way to find hyp..
[Sklearn] Pipeline 1. What is a Pipeline? Pipelines are a simple way to keep our data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so we can use the whole bundle as if it were a single step. Many data scientists hack together models without pipelines, but pipelines have some important benefits. Cleaner code: Accounting for data at each step of prepro..
[Sklearn] Transforming Columns by Type 1. What is ColumnTransformer? The scikit-learn library has a special class called `ColumnTransformer`. It applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer are concatenated to form a single feature space. This is useful for hetero..
[Sklearn] Scalers 1. What is Scaling? Scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the preprocessing step. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Another rea..
[Sklearn] Feature Engineering Skills 1. What is Feature Engineering? Feature Engineering is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process. 2. Mutual Information Mutual Information is a lot like correlation in that it measu..
[Sklearn] Dealing with Categorical Variables: Encoders 1. What are Categorical Variables? A categorical variable takes only a limited number of values. Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories. If people responded to a survey about which brand of car they owned, the res..
[Sklearn] Dealing with Missing Values: Imputers 1. What are Missing Values? Missing data (or missing values) is defined as a data value that is not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data. There are three types of missing values according to the mechanisms of missingness. Mis..
[Sklearn] Cross Validation 1. What is Cross Validation? We want to evaluate almost every model we ever build. In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens? Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the..
[Sklearn] Modules 1. What is Scikit-Learn? Scikit-learn (Sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, Scip..
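
The topics listed above (imputers, scalers, encoders, `ColumnTransformer`, `Pipeline`, cross validation and grid search) fit together in one short workflow. The following is only a minimal sketch under stated assumptions, not code taken from any of the posts: the toy data, the column names `age` and `breakfast`, and the choice of K-Nearest Neighbors with candidate values 1, 3 and 5 are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with gaps, one categorical column.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "breakfast": rng.choice(["Never", "Rarely", "Most days", "Every day"], size=n),
})
X.loc[X.index % 10 == 0, "age"] = np.nan  # inject missing values
y = X["breakfast"].isin(["Most days", "Every day"]).astype(int)  # toy target

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
categorical = OneHotEncoder(handle_unknown="ignore")

# Apply each transformer to its own columns and concatenate the results.
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["breakfast"]),
])

# Bundle preprocessing and the model so they behave like a single estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier()),
])

# Grid search over k, scoring each candidate with 3-fold cross validation.
param_grid = {"knn__n_neighbors": [1, 3, 5]}
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
best_score = search.best_score_
```

Because the preprocessing lives inside the pipeline, each cross validation fold fits the imputer, scaler and encoder on its own training split only, which avoids leaking information from the validation split.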