
All categories

(150)
[chardet] Encoding and Representing Text 1. What is Encoding? Encoding is a method of converting information from one form into another; character encoding, specifically, maps a set of characters to numeric codes. Since a computer works only with 0s and 1s, an encoding is required to represent characters. ASCII encoding has 128 character codes. In the case of ASCII encoding, in addition..
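As a minimal sketch of the idea in the excerpt above, the snippet below uses only Python built-ins to show how characters map to numeric codes and then to bits. The sample strings and the helper name `is_ascii_encodable` are illustrative assumptions, not taken from the original post.

```python
# Illustrative sample text (an assumption, not from the original post).
text = "Hi!"

# Each character maps to an integer code; ASCII covers the codes 0..127.
codes = [ord(ch) for ch in text]

# Encoding turns the string into bytes; in ASCII, one byte per character.
ascii_bytes = text.encode("ascii")

# The same bytes written out as 8-bit binary strings (the 0s and 1s).
bits = [format(b, "08b") for b in ascii_bytes]


def is_ascii_encodable(s: str) -> bool:
    """Hypothetical helper: True if every character fits in ASCII's 128 codes."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(codes)
print(bits)
print(is_ascii_encodable("café"))
```

Text containing characters outside those 128 codes, such as "é", cannot be represented in ASCII, which is why wider encodings exist.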
[csv] Read files 1. What does csv library do? Python's csv library lets us import data in CSV and similar delimited formats into our program. 2. How to use it? 2.1 Open files one by one First, open the file we want using the open built-in function. Then, make a reader object using the reader function from the csv library. We should close the file we opened once we have read the data through the reader object. The final data we will use will be stored..
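The steps in the excerpt above can be sketched roughly as follows. The sample file contents and the function name `read_csv_rows` are assumptions made for illustration; the original post's file and variable names are not visible in the excerpt.

```python
import csv
import os
import tempfile

# Create a small sample file so the sketch is self-contained
# (the contents are made up for illustration).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("name,score\nalice,90\nbob,85\n")
    sample_path = tmp.name


def read_csv_rows(path):
    """Hypothetical helper: open a CSV file, read every row, close the file."""
    opened_file = open(path, newline="", encoding="utf-8")
    reader = csv.reader(opened_file)
    # The reader is lazy, so pull the rows into a list before closing the file.
    rows = list(reader)
    opened_file.close()
    return rows


rows = read_csv_rows(sample_path)
os.remove(sample_path)  # clean up the temporary sample file
print(rows)
```

A `with open(...) as f:` block is a common alternative that closes the file automatically, even if reading raises an error.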
[Roadmap] Data Engineering Roadmap The image above is an approximate Data Engineering Roadmap provided by Seattle Data Guy. The roadmap spans Coding Basics, Data Warehouse, Workflow, NoSQL, Cloud, Streaming, Distributed Systems, and UI/UX. The PDF file at the bottom of the video provides the YouTube sources and lecture URLs, so let's check it out. Source: https://www.youtube.com/watch?v=SpaFPPByOhM&t=957s
[Course] DataCamp vs Dataquest DataCamp covers many languages and tools, such as Python and Scala. For example, it offers lectures on many of the tools Data Engineers work with, such as PySpark, Airflow, Postgres, Hadoop, Hive, and Presto. However, unlike Dataquest, its lectures assume some knowledge of Python and SQL, so it is necessary to learn the basics before taking them. As mentioned earlier, Dataqu..
[Cheat Sheets] Data Science Cheat Sheets Here is a very useful quick reference to make a data scientist happier. On Kaggle, I found a data science cheat sheet repository, which is a collection of cheat sheets for various data-science-related languages and topics. Enjoy it and have fun. :) Kaggle source: https://www.kaggle.com/datasets/timoboz/data-science-cheat-sheets Original source: https://github.com/abhat222/Data-Scien..
[Tensorflow] Binary Classification 1. Binary Classification Classification into one of two classes is a common machine learning problem. We might want to predict whether or not a customer is likely to make a purchase, whether or not a credit card transaction was fraudulent, whether deep space signals show evidence of a new planet, or whether a medical test shows evidence of a disease. These are all binary classification problems. In our raw data..
[Tensorflow] Dropout and Batch Normalization 1. Dropout A dropout layer can help correct overfitting. Overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the conspiracy falls apart. This is the idea behind dropout. To ..
[Tensorflow] Overfitting and Underfitting 1. Interpreting the Learning Curves We might think about the information in the training data as being of two kinds: signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data; the noise is all of the random fluctuation that comes from data in the real world or all of the in..
[Tensorflow] Stochastic Gradient Descent 1. The Loss Function The loss function measures the disparity between the target's true value and the value the model predicts. Different problems call for different loss functions. We've been looking at regression problems, where the task is to predict some numerical value. A common loss function for regression problems is the mean absolute error, or MAE. For each prediction y_pred, MAE measures the..
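To make the loss function mentioned in the excerpt above concrete, here is a minimal plain-Python sketch of the mean absolute error: the average of the absolute differences between true and predicted values. The function name and the sample numbers are illustrative assumptions; the original post presumably relies on TensorFlow's built-in loss instead.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Made-up targets and predictions, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # absolute errors are 1, 0, 2, 1, so the mean is 1.0
```

A lower MAE means the predictions sit closer to the targets on average, which is what training tries to achieve.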
[Tensorflow] Deep Neural Networks 1. Layers Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer. We could think of each layer in a neural network as performing some kind of relatively simple transformation; through a deep stack of layers, a network can transform its inputs in more complex ways. In a well-trained neural network, each layer is a transformation getting us a little bit closer to a..
[Tensorflow] A Single Neuron 1. What is Deep Learning? Some of the most impressive advances in artificial intelligence in recent years have been in the field of deep learning. Natural language translation, image recognition, and game playing are all tasks where deep learning models have neared or even exceeded human-level performance. So what is deep learning? Deep learning is an approach to machine learning characterized b..
[Sklearn] Hyperparameter Tuning using Grid Search 1. What is Hyperparameter Tuning? A hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of a hyperparameter has to be set before the learning process begins. Examples include C in Support Vector Machines, k in K-Nearest Neighbors, and the number of hidden layers in Neural Networks. Grid Search is an exploratory way to find hyp..
[Sklearn] Pipeline 1. What is Pipeline? Pipelines are a simple way to keep our data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so we can use the whole bundle as if it were a single step. Many data scientists hack together models without pipelines, but pipelines have some important benefits. Cleaner code: Accounting for data at each step of prepro..
[Sklearn] Transforming Columns by Their Type 1. What is ColumnTransformer? The scikit-learn library has a special class called 'ColumnTransformer'. It applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer will be concatenated to form a single feature space. This is useful for hetero..
[Sklearn] Scalers 1. What is Scaling? Scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the preprocessing step. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Another rea..
[Sklearn] Feature Engineering Skills 1. What is Feature Engineering? Feature Engineering is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process. 2. Mutual Information Mutual Information is a lot like correlation in that it measu..
[Models] Classification Models Note: This is just a tiny subset of the full modeling workflow. We must first understand the domain knowledge behind our training datasets and do statistical analysis. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv") x_test= pd.rea..
[Models] Regression Models Note: This is just a tiny subset of the full modeling workflow. We must first understand the domain knowledge behind our training datasets and do statistical analysis. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv") x_te..
[Sklearn] Dealing with Categorical Variables: Encoders 1. What is a Categorical Variable? A categorical variable takes only a limited number of values. Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories. If people responded to a survey about which brand of car they owned, the res..
[Sklearn] Dealing with Missing Values: Imputers 1. What are Missing Values? Missing data (or missing values) is defined as a data value that is not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data. There are three types of missing values according to the mechanisms of missingness. Mis..