ML Intro | Iris Analysis - Import Data and Check

Machine Learning/ML with Python Library 2024. 1. 23. 19:02

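Let's build a first ML application with scikit-learn: a model that predicts the species of an iris. We have measurements, in cm, of the length and width of each flower's petals and sepals, together with the species (setosa, versicolor, or virginica) that experts assigned to each measured iris. Using these values, we will build an application that decides which species a flower belongs to, that is, one that performs classification. The possible iris species are therefore called classes, and the species of one particular flower is called its label. The individual measurements of a flower are its features.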

First, in Colab, I imported the iris data that sklearn provides and took a look at it.

from sklearn.datasets import load_iris
iris_dataset = load_iris()

print("iris_dataset keys:", iris_dataset.keys())

Checking the keys of the dataset shows that there is a key named 'DESCR':

iris_dataset keys: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

It stands for "description".

print(iris_dataset['DESCR'][:1000] + '\n...')

Printing the first 1,000 characters of the description gives the following.

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ===========
...
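The summary statistics in the description can be cross-checked against the data array itself. The following is a minimal sketch, not part of the original notebook. The variable names are my own, and using `ddof=1` (sample standard deviation) is an assumption about how the table was computed. Because the table is rounded, the last digit may differ slightly.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset['data']  # shape (150, 4): one row per flower

# axis=0 aggregates over the 150 samples, giving one value per feature
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation

rows = zip(iris_dataset['feature_names'], col_min, col_max, col_mean, col_std)
for name, lo, hi, mean, sd in rows:
    print(f"{name:<18} min={lo:.1f} max={hi:.1f} mean={mean:.2f} sd={sd:.2f}")
```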

Next, looking at target_names gives the following names.

print("Target Name:", iris_dataset['target_names'])

Target Name: ['setosa' 'versicolor' 'virginica']

Beyond this, I also looked at the feature names, the data, and the data size.

https://peaceadegbite1.medium.com/iris-flower-classification-60790e9718a1

print("Feature Names:", iris_dataset['feature_names'])
print("Data Type:", type(iris_dataset['data']))
print("Data Size:", iris_dataset['data'].shape)

Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Data Type: <class 'numpy.ndarray'>
Data Size: (150, 4)

The four features are stored in the order sepal length, sepal width, petal length, petal width, and data is a NumPy array. Its shape is (150, 4): 150 flowers, with four measurements for each.
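To make the row and column layout concrete, here is a small sketch (my own illustration, not from the original notebook) that unpacks the shape and pairs each feature name with the matching column of the first flower. The names `n_samples`, `n_features`, and `first_flower` are hypothetical.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
data = iris_dataset['data']
feature_names = iris_dataset['feature_names']

# Rows are samples (flowers); columns are features (measurements)
n_samples, n_features = data.shape
print("samples:", n_samples, "features:", n_features)

# Column j of data corresponds to feature_names[j]
first_flower = dict(zip(feature_names, data[0]))
for name, value in first_flower.items():
    print(f"{name}: {value}")
```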

Now let's look at the first five rows of the data.

print("Data's first 5 rows:\n", iris_dataset['data'][:5])

Data's first 5 rows:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Since the columns are in the order ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], we can see that the petal width is 0.2 cm for all five flowers, and that the first flower has the longest sepal of the five (5.1 cm).
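Both observations can also be checked with column slicing rather than by eye. This is a minimal sketch of my own. The constants `SEPAL_LENGTH` and `PETAL_WIDTH` are hypothetical names for the column indices implied by the feature order.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
first_five = iris_dataset['data'][:5]

# Column indices follow the order of feature_names
SEPAL_LENGTH, PETAL_WIDTH = 0, 3

# True if every petal width in the first five rows is 0.2 cm
all_petal_width_02 = bool(np.allclose(first_five[:, PETAL_WIDTH], 0.2))

# Index (within the first five rows) of the flower with the longest sepal
longest_sepal_row = int(np.argmax(first_five[:, SEPAL_LENGTH]))

print("all petal widths are 0.2:", all_petal_width_02)
print("row with the longest sepal:", longest_sepal_row)
```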

Now that we have checked the features, let's check the targets and see in what form the target, which indicates the species, is stored.

print("target type:", type(iris_dataset['target']))

target type: <class 'numpy.ndarray'>

target is a one-dimensional NumPy array. How big is it?

print("target size:", iris_dataset['target'].shape)

target size: (150,)

150 targets are stored. This makes sense: when we looked at the feature data, it held measurements for 150 flowers, and the target tells us which species each of those flowers is. Let's look at the target data.

print("Target:\n", iris_dataset['target'])

Target:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
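Each entry of target belongs to the row of data at the same index, so the two arrays can be read side by side. The sketch below is my own illustration. The sample indices 0, 50, and 100 are arbitrary picks from the three blocks visible in the output above.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
data = iris_dataset['data']
target = iris_dataset['target']

# There should be exactly one target value per row of data
same_length = data.shape[0] == target.shape[0]
print("one target per row:", same_length)

# Row i of data and element i of target describe the same flower
for i in (0, 50, 100):
    print(f"flower {i}: measurements={data[i]}, target={target[i]}")
```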

We can see 150 integers ranging from 0 to 2. What do these numbers mean?

print("Target Names:", iris_dataset['target_names'])

Target Names: ['setosa' 'versicolor' 'virginica']

Checking target_names shows that 0 is setosa, 1 is versicolor, and 2 is virginica. The three species are evenly represented, with 50 samples each.
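The mapping from integers to species names, and the balance between the classes, can be confirmed directly. This is a short sketch of my own using NumPy integer indexing and `np.bincount`. The names `species` and `counts` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Integer indexing: each label (0, 1, 2) selects its species name
species = target_names[target]
print(species[0], species[50], species[100])

# Count how many samples carry each label
counts = np.bincount(target)
for name, count in zip(target_names, counts):
    print(f"{name}: {count}")
```

Seeing the same count for every class is what "evenly represented" means here.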

Loading the data is now done. Next, let's prepare the data for actually running a model! (Next time...)

The Colab notebook for this post is below.

https://colab.research.google.com/drive/1ISPKQ-D2bGmJvDlqhqTieOUz-1ujrXbu?usp=sharing


Reference

https://www.yes24.com/Product/Goods/42806875

파이썬 라이브러리를 활용한 머신러닝 (Korean edition of Introduction to Machine Learning with Python) - YES24
