Machine Learning with Scikit-Learn

We will use a machine learning library called Scikit-Learn to implement machine learning algorithms.

ML algorithms can be divided roughly into three categories:

  • supervised learning

    1. classification

    2. categorical

    3. regression

    4. forecasting

  • unsupervised learning

    1. clustering

    2. dimensionality reduction

  • reinforcement learning

    1. Monte Carlo methods


    3. Policy Gradient methods

ML in Scikit-Learn

  1. Classification

    7 options: SGD Classifier, KNeighborsClassifier, LinearSVC, NaiveBayes, SVC, Kernel approximation, EnsembleClassifiers

  2. Regression

    7 options: SGD Regressor, Lasso, ElasticNet, RidgeRegression, SVR(kernel='linear'), SVR(kernel='rbf'), EnsembleRegressors

  3. Clustering

  4. Dimensionality Reduction

Algorithms are chosen based on:

  • amount of data

  • presence of labels (whether ground-truth answers exist)

  • type of data (quantitative or categorical)

About Scikit-Learn


  • Python (>= 3.6)

  • NumPy (>= 1.13.3)

  • SciPy (>= 0.19.1)

  • joblib (>= 0.11)


#!pip install scikit-learn
import sklearn


Commonly used API


Data Expression

Datasets can be expressed using

  • NumPy ndarray

  • Pandas DataFrame

  • SciPy Sparse Matrix

In Scikit-Learn, data is expressed as a feature matrix and a target vector.

The feature matrix holds the input data, one row per sample and one column per feature. The target vector holds the result or answer corresponding to each sample.
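
As a minimal sketch of this convention, the snippet below builds a tiny feature matrix and target vector by hand. The values and the names X_demo and y_demo are made up for illustration only; they do not come from any dataset used in this post.

```python
import numpy as np

# Hypothetical toy data: 4 samples, each described by 2 features.
X_demo = np.array([[5.1, 3.5],
                   [4.9, 3.0],
                   [6.2, 2.9],
                   [5.9, 3.0]])

# One target value (label) per sample.
y_demo = np.array([0, 0, 1, 1])

n_samples, n_features = X_demo.shape
print(X_demo.shape)  # feature matrix is 2-D: (n_samples, n_features)
print(y_demo.shape)  # target vector is 1-D: (n_samples,)
```

The number of rows in the feature matrix must match the length of the target vector.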

Regression Model

We will build a model that uses data and machine learning to predict an outcome.

import numpy as np
import matplotlib.pyplot as plt
r = np.random.RandomState(10)
x = 10 * r.rand(100)
y = 2 * x - 3 * r.rand(100)
plt.scatter(x, y)



Both x and y have shape (100,), so each is a 1-dimensional vector.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
# ! Raises an error: fit() expects a 2-D feature matrix, but x is 1-D
model.fit(x, y)

If we use x as it is, we get an error. We have to change x into a matrix. Since x is a NumPy ndarray, we can use reshape().

# x is a NumPy ndarray, so reshape() is a good fit here.
X = x.reshape(100,1)
model.fit(X,y)
x_new = np.linspace(-1, 11, 100)
X_new = x_new.reshape(100,1)
y_new = model.predict(X_new)

In reshape(), passing -1 for one dimension tells NumPy to infer that dimension's size from the array's length and the other dimensions.

X_ = x_new.reshape(-1,1)
X_.shape
(100, 1)
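
The following small sketch illustrates how -1 is resolved in reshape(). The array a and the variable names are hypothetical and exist only for this illustration.

```python
import numpy as np

a = np.arange(6)               # 1-D array with shape (6,)

# One column requested, so the number of rows (6) is inferred.
col = a.reshape(-1, 1)

# Two rows requested, so the number of columns (3) is inferred.
two_rows = a.reshape(2, -1)

print(col.shape, two_rows.shape)
```

Only one dimension may be given as -1, since NumPy needs the others to work out the missing size.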

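We can use sklearn.metrics to check how accurate the regression model's predictions are. The mean squared error compares the true targets with the model's predictions for the same inputs.

from sklearn.metrics import mean_squared_error

# Compare the true targets y with the predictions for the original inputs x
error = mean_squared_error(y, model.predict(x.reshape(-1, 1)))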
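
To make the metric concrete, here is a minimal sketch that computes the mean squared error by hand and compares it with mean_squared_error. The arrays y_true and y_hat are made-up values, not outputs of the model above. Note that the function takes the true values first and the predictions second.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true targets and predictions for four samples.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# MSE is the average of the squared differences.
manual_mse = np.mean((y_true - y_hat) ** 2)
sklearn_mse = mean_squared_error(y_true, y_hat)

print(manual_mse, sklearn_mse)
```

A lower value means the predictions are closer to the true targets; 0 would mean a perfect match.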

plt.scatter(x, y, label='input data')
plt.plot(X_new, y_new, color='red', label='regression line')


The regression line appears to follow the input data closely.

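The datasets Module

The sklearn.datasets module provides ready-made datasets for us to use.

  • datasets.load_boston(): regression problem, predicting house prices in Boston, USA (removed in scikit-learn 1.2)

  • datasets.load_breast_cancer(): classification problem, breast cancer diagnosis

  • datasets.load_digits(): classification problem, classifying the digits 0 to 9

  • datasets.load_iris(): classification problem, classifying iris species

  • datasets.load_wine(): classification problem, classifying wines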

from sklearn.datasets import load_wine
data = load_wine()

data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
data.data
array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])
data.data.shape
(178, 13)
data.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])
data.target_names
array(['class_0', 'class_1', 'class_2'], dtype='<U7')
print(data.DESCR)
.. _wine_dataset:

Wine recognition dataset

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
    :Summary Statistics:
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.


Lichman, M. (2013). UCI Machine Learning Repository
[]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).

Dataset Classification

#!pip install pandas

import pandas as pd
pd.DataFrame(data.data, columns=data.feature_names)
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0

178 rows × 13 columns
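
The 'frame' key listed among the dataset's keys suggests another route to a DataFrame. As a hedged sketch, assuming a scikit-learn version that supports the as_frame parameter of load_wine (0.23 or later) and that pandas is installed, the loader can build the DataFrame itself. The names wine and df are chosen here for illustration.

```python
from sklearn.datasets import load_wine

# as_frame=True asks the loader to return pandas objects.
wine = load_wine(as_frame=True)

# wine.frame holds the 13 feature columns plus a "target" column.
df = wine.frame

print(df.shape)
print(df["target"].value_counts().sort_index())
```

Having the target in the same table as the features makes it easy to inspect how many samples belong to each class.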

X = data.data
y = data.target
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
y_pred = model.predict(X)
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Pass the target vector (the labels) y and the predictions y_pred as arguments.
print(classification_report(y, y_pred))
# Print the accuracy.
print("accuracy = ", accuracy_score(y, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        59
           1       1.00      1.00      1.00        71
           2       1.00      1.00      1.00        48

    accuracy                           1.00       178
   macro avg       1.00      1.00      1.00       178
weighted avg       1.00      1.00      1.00       178

accuracy =  1.0

The accuracy is 1.0. This number is misleading, because the model was evaluated on the very data it was trained on. How to fix this problem is explained below.


An estimator is trained with fit() and makes predictions with predict(). The same interface is used for both supervised and unsupervised learning.
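
As a rough sketch of that shared interface, the snippet below runs one supervised and one unsupervised estimator on the wine data. The choice of KNeighborsClassifier and KMeans, their parameter values, and the variable names are assumptions made for this illustration, not something prescribed by this post.

```python
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X_wine, y_wine = load_wine(return_X_y=True)

# Supervised: fit() takes the feature matrix and the target vector.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_wine, y_wine)
clf_pred = clf.predict(X_wine[:5])

# Unsupervised: fit() takes only the feature matrix.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X_wine)
km_pred = km.predict(X_wine[:5])

print(clf_pred.shape, km_pred.shape)
```

The cluster indices returned by KMeans are arbitrary and need not line up with the class labels in the target vector.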

Split Train and Test Data

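We used the same dataset for both training and testing, which is why the accuracy came out as 100%. To measure how well the model handles unseen data, we need to hold out part of the data for testing.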

from sklearn.datasets import load_wine
data = load_wine()
print(data.data.shape)
(178, 13)
X_train = data.data[:142]
X_test = data.data[142:]
print(X_train.shape, X_test.shape)
(142, 13) (36, 13)
y_train = data.target[:142]
y_test = data.target[142:]
print(y_train.shape, y_test.shape)
(142,) (36,)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score

print("accuracy =", accuracy_score(y_test, y_pred))
accuracy = 0.9444444444444444

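We can also use the train_test_split function to split the dataset easily. It shuffles the samples before splitting and returns the pieces as a list.

from sklearn.model_selection import train_test_split

result = train_test_split(X, y, test_size=0.2, random_state=42)
print(type(result))
<class 'list'>
print(result[0].shape)
(142, 13)
print(result[1].shape)
(36, 13)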
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
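
The wine targets are stored in class order, as the target array printed earlier shows, so how the samples are distributed between the two sets matters. One option, sketched here as an assumption rather than as the method used in this post, is the stratify parameter of train_test_split, which keeps the class proportions similar in both sets. The variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_all, y_all = load_wine(return_X_y=True)

# The returned list holds, in order: X_train, X_test, y_train, y_test.
parts = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)
X_tr, X_te, y_tr, y_te = parts

print(len(parts))
print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test samples per class
```

With stratification, every class appears in the test set, which a plain slice of class-ordered data cannot guarantee.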


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_wine()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=11)
# Train the model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Print the accuracy
print("accuracy =", accuracy_score(y_test, y_pred))
accuracy = 0.9722222222222222

