Model Selection and Evaluation

Machine Learning/Stanford ML Specialization 2024. 4. 27. 11:55

What should you do when you have built a machine learning model and its results are seriously poor? There are several options, including the following.

Collect more training data
Add or remove features
Use polynomial features or try other feature engineering techniques
Decrease or increase the alpha/gamma values
Try a different model

Evaluation

However, trying all of these at random could take months. That is why it is very important to evaluate the model properly and find out what problem is actually occurring.

For Linear Regression, the first approach is to compute the train and test errors.

https://datascience.stackexchange.com/questions/86343/train-error-vs-test-error-in-linear-regression-by-samples-analysis

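The commonly used Squared Error Cost can be applied here. When the model fits the training data closely, the training error will be close to 0, but the test examples will deviate from the predictions learned from the training data.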

https://twitter.com/akshay_pachaar/status/1752310076566917561

Computing the Squared Error Cost on the test data shows how large that error is.

https://www.machinelearningworks.com/tutorials/mean-squared-error-cost-function
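For reference, the squared error costs mentioned above can be written out as follows. This is a sketch in notation assumed here rather than taken from the post: $f_{w,b}$ is the trained model, and $m_{train}$ and $m_{test}$ are the numbers of training and test examples. The factor $1/2$ matches the "divide by 2" convention used in the code later in this post.

```latex
J_{train}(w,b) = \frac{1}{2\,m_{train}} \sum_{i=1}^{m_{train}}
    \left( f_{w,b}\big(x_{train}^{(i)}\big) - y_{train}^{(i)} \right)^2
\qquad
J_{test}(w,b) = \frac{1}{2\,m_{test}} \sum_{i=1}^{m_{test}}
    \left( f_{w,b}\big(x_{test}^{(i)}\big) - y_{test}^{(i)} \right)^2
```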

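For Classification problems, you can likewise use test and train data and compute Jtest and Jtrain to check what percentage of the examples the model predicted correctly (or misclassified).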

Model Selection

To decide which model to use, there are many candidates ranging from simple to complex, for example from a 1st-degree to a 10th-degree polynomial. As above, you can compute Jtest for every model: start from d=1 (f = wx + b), the 1st-degree polynomial, go up to, say, the 10th-degree polynomial, and see which one has the lowest Jtest. If, for example, Jtest(w5, b5), the 5th-degree polynomial, is the lowest, you could consider 5th-degree polynomial Linear Regression as the model.

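However, because the degree d was itself chosen using the test set, that Jtest is likely to be an overly optimistic estimate of how well the model generalizes. To avoid this and make a better model choice, the training and testing procedure can be modified. Take house price prediction as an example. Given the size and price of houses, if there are 10 examples, 60% can go into the training set, 20% into the cross validation set, and the remaining 20% can be used as the test set. The model is then selected based on these three sets. For regression, classification, and Neural Network models alike, you pick the model with the smallest Jcv (cross validation cost) and then compute Jtest for it.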
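The 60/20/20 split described above can be sketched with plain numpy index slicing. The data values and variable names below are made up for illustration; only the split ratios come from the text.

```python
import numpy as np

# Hypothetical toy data: 10 houses with a size and a price (values are made up).
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, size=10)
price = 3.0 * size + rng.normal(0, 10, size=10)

# Shuffle the example indices once, then slice them 60% / 20% / 20%.
idx = rng.permutation(len(size))
n_train = len(idx) * 6 // 10   # 6 examples
n_cv = len(idx) * 2 // 10      # 2 examples

train_idx = idx[:n_train]
cv_idx = idx[n_train:n_train + n_cv]
test_idx = idx[n_train + n_cv:]  # remaining 2 examples

print(len(train_idx), len(cv_idx), len(test_idx))
```

Shuffling before slicing matters when the raw data is sorted (for example by house size); otherwise the three sets would cover different ranges of the feature.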

Linear Regression Evaluation with Python

Now let's test the evaluation process with actual code. First, a Regression model is prepared. The dataset has 50 examples in total, each with 1 feature and a target.

Now, as discussed above, split the data into training, cross validation, and test sets at a 6:2:2 ratio.

from sklearn.model_selection import train_test_split

# Get 60% of the dataset as the training set. Put the remaining 40% in temporary variables: x_ and y_.
x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Split the 40% subset above into two: one half for cross validation and the other for the test set
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

# Delete temporary variables
del x_, y_

print(f"the shape of the training set (input) is: {x_train.shape}")
print(f"the shape of the training set (target) is: {y_train.shape}\n")
print(f"the shape of the cross validation set (input) is: {x_cv.shape}")
print(f"the shape of the cross validation set (target) is: {y_cv.shape}\n")
print(f"the shape of the test set (input) is: {x_test.shape}")
print(f"the shape of the test set (target) is: {y_test.shape}")

the shape of the training set (input) is: (30, 1)
the shape of the training set (target) is: (30, 1)

the shape of the cross validation set (input) is: (10, 1)
the shape of the cross validation set (target) is: (10, 1)

the shape of the test set (input) is: (10, 1)
the shape of the test set (target) is: (10, 1)

Visualized, the split looks as follows.

Before training the model, apply Feature scaling to improve performance. Because the x values are large, StandardScaler is used.

from sklearn.preprocessing import StandardScaler

# Initialize the class
scaler_linear = StandardScaler()

# Compute the mean and standard deviation of the training set then transform it
X_train_scaled = scaler_linear.fit_transform(x_train)

print(f"Computed mean of the training set: {scaler_linear.mean_.squeeze():.2f}")
print(f"Computed standard deviation of the training set: {scaler_linear.scale_.squeeze():.2f}")

# Plot the results
utils.plot_dataset(x=X_train_scaled, y=y_train, title="scaled input vs. target")

Computed mean of the training set: 2504.06
Computed standard deviation of the training set: 574.85

The scaled values, which should help model performance, can also be checked visually. The scaled input now falls between about -1.5 and 2.0, whereas the mean of the original training set was very large at 2504.
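What the scaler computes is the z-score, (x - mean) / standard deviation. A minimal numpy sketch, using made-up values rather than the dataset from this post, shows the arithmetic and how the training statistics are reused for another split:

```python
import numpy as np

# Hypothetical raw feature values (made up for illustration).
x_train_demo = np.array([[1500.0], [2000.0], [2500.0], [3000.0], [3500.0]])
x_cv_demo = np.array([[1800.0], [3200.0]])

# Statistics come from the training set only.
mu = x_train_demo.mean(axis=0)
sigma = x_train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

z_train = (x_train_demo - mu) / sigma
z_cv = (x_cv_demo - mu) / sigma   # reuse the training mu and sigma

print(z_train.ravel())
print(z_cv.ravel())
```

After scaling, the training feature has mean 0 and standard deviation 1. The other split generally does not, because it is shifted and scaled by the training statistics, which is intended: the model only ever saw inputs on the training scale.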

Now let's train the model.

from sklearn.linear_model import LinearRegression

# Initialize the class
linear_model = LinearRegression()

# Train the model
linear_model.fit(X_train_scaled, y_train )

Next, run the Evaluation. scikit-learn's mean_squared_error() makes this easy. If you want to compute it by hand, you can use a for loop as shown in the code below. Both give the same value: Jtrain is about 406.19.

from sklearn.metrics import mean_squared_error

# Feed the scaled training set and get the predictions
yhat = linear_model.predict(X_train_scaled)

# Use scikit-learn's utility function and divide by 2
print(f"training MSE (using sklearn function): {mean_squared_error(y_train, yhat) / 2}")

# for-loop implementation
total_squared_error = 0

for i in range(len(yhat)):
    squared_error_i = (yhat[i] - y_train[i])**2
    total_squared_error += squared_error_i

mse = total_squared_error / (2*len(yhat))

print(f"training MSE (for-loop implementation): {mse.squeeze()}")

training MSE (using sklearn function): 406.19374192533155
training MSE (for-loop implementation): 406.19374192533155
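The for-loop above can also be written as a single vectorized expression. The helper name `half_mse` and the sample numbers below are hypothetical, chosen only to exercise the function:

```python
import numpy as np

def half_mse(y_true, y_pred):
    """Squared error cost: mean squared error divided by 2."""
    # ravel() guards against accidental broadcasting between (m, 1) and (m,) arrays.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean((y_pred - y_true) ** 2) / 2

# Made-up targets and predictions.
y_demo = np.array([[3.0], [5.0], [7.0]])
yhat_demo = np.array([[2.0], [5.0], [9.0]])

print(half_mse(y_demo, yhat_demo))  # squared errors are 1, 0, 4, so the cost is 5/6
```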

Now compute the error on the Cross Validation set.

This data is also scaled first, using the mean and standard deviation of the training set, and then mean_squared_error() is applied.

# Scale the cross validation set using the mean and standard deviation of the training set
X_cv_scaled = scaler_linear.transform(x_cv)

print(f"Mean used to scale the CV set: {scaler_linear.mean_.squeeze():.2f}")
print(f"Standard deviation used to scale the CV set: {scaler_linear.scale_.squeeze():.2f}")

# Feed the scaled cross validation set
yhat = linear_model.predict(X_cv_scaled)

# Use scikit-learn's utility function and divide by 2
print(f"Cross validation MSE: {mean_squared_error(y_cv, yhat) / 2}")

Mean used to scale the CV set: 2504.06
Standard deviation used to scale the CV set: 574.85
Cross validation MSE: 551.7789026952216

Now let's add Polynomial Features. Having obtained the CV error of the 1st-degree Linear Regression model, compare it with a higher-degree model.

from sklearn.preprocessing import PolynomialFeatures

# Instantiate the class to make polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)

# Compute the number of features and transform the training set
X_train_mapped = poly.fit_transform(x_train)

# Instantiate the class
scaler_poly = StandardScaler()

# Compute the mean and standard deviation of the training set then transform it
X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)

# Initialize the class
model = LinearRegression()

# Train the model
model.fit(X_train_mapped_scaled, y_train )

# Compute the training MSE
yhat = model.predict(X_train_mapped_scaled)
print(f"Training MSE: {mean_squared_error(y_train, yhat) / 2}")

# Add the polynomial features to the cross validation set
X_cv_mapped = poly.transform(x_cv)

# Scale the cross validation set using the mean and standard deviation of the training set
X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)

# Compute the cross validation MSE
yhat = model.predict(X_cv_mapped_scaled)
print(f"Cross validation MSE: {mean_squared_error(y_cv, yhat) / 2}")

Training MSE: 49.11160933402521
Cross validation MSE: 87.69841211111924

Everything else is nearly the same, except that PolynomialFeatures is used with degree set to 2. The resulting Cross Validation MSE (87.70) is much lower than the 551.78 obtained above.
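To make the feature mapping concrete, here is a small sketch of what `PolynomialFeatures(degree=2, include_bias=False)` produces for a single input feature. The input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[2.0], [3.0]])  # made-up single-feature inputs

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
x_demo_mapped = poly_demo.fit_transform(x_demo)

# Each row now holds [x, x^2]; include_bias=False leaves out the constant column of 1s.
print(x_demo_mapped)
```

Because x^2 grows much faster than x, the two columns end up on very different scales, which is one reason the mapped features are scaled again before fitting.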

Now build a model this way for every polynomial from degree 1 to degree 10 and look at the Cross Validation MSE to see at once which degree gives the smallest error.

# Initialize lists to save the errors, models, and feature transforms
train_mses = []
cv_mses = []
models = []
polys = []
scalers = []

# Loop over 10 times. Each adding one more degree of polynomial higher than the last.
for degree in range(1,11):
    
    # Add polynomial features to the training set
    poly = PolynomialFeatures(degree, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    polys.append(poly)
    
    # Scale the training set
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)
    
    # Create and train the model
    model = LinearRegression()
    model.fit(X_train_mapped_scaled, y_train )
    models.append(model)
    
    # Compute the training MSE
    yhat = model.predict(X_train_mapped_scaled)
    train_mse = mean_squared_error(y_train, yhat) / 2
    train_mses.append(train_mse)
    
    # Add polynomial features and scale the cross validation set
    X_cv_mapped = poly.transform(x_cv)
    X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
    
    # Compute the cross validation MSE
    yhat = model.predict(X_cv_mapped_scaled)
    cv_mse = mean_squared_error(y_cv, yhat) / 2
    cv_mses.append(cv_mse)
    
# Plot the results
degrees=range(1,11)
utils.plot_train_cv_mses(degrees, train_mses, cv_mses, title="degree of polynomial vs. train and CV MSEs")

Compared with degree 1, the MSE drops sharply from degree 2 onward. From about degree 6, the training MSE keeps decreasing slightly while the CV MSE rises instead. Now let's pick the best model.

# Get the model with the lowest CV MSE (add 1 because list indices start at 0)
# This also corresponds to the degree of the polynomial added
degree = np.argmin(cv_mses) + 1
print(f"Lowest CV MSE is found in the model with degree={degree}")

Lowest CV MSE is found in the model with degree=4

The 4th-degree model shows the lowest Cross Validation MSE (Mean Squared Error), so the degree-4 model is the most likely to predict well.
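As an alternative to keeping three parallel lists (`polys`, `scalers`, `models`), the same degree search could be written with scikit-learn's `make_pipeline`, which bundles the feature mapping, the scaler, and the regressor into one object. This is a sketch on synthetic data, not the dataset or the exact code used in this post; all `*_demo` and short variable names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the dataset used in this post).
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 0.5 * x_demo**2 - x_demo + rng.normal(0, 2, size=(50, 1))

# Same 60/20/20 split as in the post.
x_tr, x_tmp, y_tr, y_tmp = train_test_split(x_demo, y_demo, test_size=0.40, random_state=1)
x_val, x_te, y_val, y_te = train_test_split(x_tmp, y_tmp, test_size=0.50, random_state=1)

pipelines, cv_costs = [], []
for d in range(1, 11):
    # Each pipeline fits its own feature mapping and scaler on the training set only.
    pipe = make_pipeline(
        PolynomialFeatures(d, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    pipe.fit(x_tr, y_tr)
    pipelines.append(pipe)
    cv_costs.append(mean_squared_error(y_val, pipe.predict(x_val)) / 2)

# Choose the degree by CV cost, then touch the test set exactly once.
best = int(np.argmin(cv_costs))
test_cost = mean_squared_error(y_te, pipelines[best].predict(x_te)) / 2
print(f"degree={best + 1}, CV cost={cv_costs[best]:.2f}, test cost={test_cost:.2f}")
```

With a pipeline, calling `predict` on raw inputs applies the stored mapping and scaling automatically, so there is no risk of pairing a model with the wrong scaler.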

# Add polynomial features to the test set
X_test_mapped = polys[degree-1].transform(x_test)

# Scale the test set
X_test_mapped_scaled = scalers[degree-1].transform(X_test_mapped)

# Compute the test MSE
yhat = models[degree-1].predict(X_test_mapped_scaled)
test_mse = mean_squared_error(y_test, yhat) / 2

print(f"Training MSE: {train_mses[degree-1]:.2f}")
print(f"Cross Validation MSE: {cv_mses[degree-1]:.2f}")
print(f"Test MSE: {test_mse:.2f}")

Training MSE: 47.15
Cross Validation MSE: 79.43
Test MSE: 104.63

The code above adds polynomial features to the test set as well and computes its MSE, using the best-performing degree, 4.

Neural Networks Evaluation with Python

This time, let's write code to select a neural network model. First, train the Neural Network models on degree-1 features. Scaling is again applied to get good results.

# Add polynomial features
degree = 1
poly = PolynomialFeatures(degree, include_bias=False)
X_train_mapped = poly.fit_transform(x_train)
X_cv_mapped = poly.transform(x_cv)
X_test_mapped = poly.transform(x_test)

# Scale the features using the z-score
scaler = StandardScaler()
X_train_mapped_scaled = scaler.fit_transform(X_train_mapped)
X_cv_mapped_scaled = scaler.transform(X_cv_mapped)
X_test_mapped_scaled = scaler.transform(X_test_mapped)

Now compute the Cross Validation MSE for the 3 provided models.

# Initialize lists that will contain the errors for each model
nn_train_mses = []
nn_cv_mses = []

# Build the models
nn_models = utils.build_models()

# Loop over the models
for model in nn_models:
    
    # Setup the loss and optimizer
    model.compile(
    loss='mse',
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    )

    print(f"Training {model.name}...")
    
    # Train the model
    model.fit(
        X_train_mapped_scaled, y_train,
        epochs=300,
        verbose=0
    )
    
    print("Done!\n")

    
    # Record the training MSEs
    yhat = model.predict(X_train_mapped_scaled)
    train_mse = mean_squared_error(y_train, yhat) / 2
    nn_train_mses.append(train_mse)
    
    # Record the cross validation MSEs 
    yhat = model.predict(X_cv_mapped_scaled)
    cv_mse = mean_squared_error(y_cv, yhat) / 2
    nn_cv_mses.append(cv_mse)

    
# print results
print("RESULTS:")
for model_num in range(len(nn_train_mses)):
    print(
        f"Model {model_num+1}: Training MSE: {nn_train_mses[model_num]:.2f}, " +
        f"CV MSE: {nn_cv_mses[model_num]:.2f}"
        )

Training model_1...
Done!

Training model_2...
Done!

Training model_3...
Done!

RESULTS:
Model 1: Training MSE: 73.44, CV MSE: 113.87
Model 2: Training MSE: 73.40, CV MSE: 112.28
Model 3: Training MSE: 44.56, CV MSE: 88.51

Model 3 shows the lowest CV MSE, so Model 3 is selected.

# Select the model with the lowest CV MSE
model_num = 3

# Compute the test MSE
yhat = nn_models[model_num-1].predict(X_test_mapped_scaled)
test_mse = mean_squared_error(y_test, yhat) / 2

print(f"Selected Model: {model_num}")
print(f"Training MSE: {nn_train_mses[model_num-1]:.2f}")
print(f"Cross Validation MSE: {nn_cv_mses[model_num-1]:.2f}")
print(f"Test MSE: {test_mse:.2f}")

Selected Model: 3
Training MSE: 44.56
Cross Validation MSE: 88.51
Test MSE: 87.77
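Rather than hardcoding `model_num = 3`, the selection could be derived from the recorded CV errors, in the same way `np.argmin` was used for the polynomial degree. A small sketch, with the CV MSE values copied from the results printed above and a hypothetical `_demo` suffix on the names:

```python
import numpy as np

# CV MSEs copied from the results printed above (Models 1 to 3).
nn_cv_mses_demo = [113.87, 112.28, 88.51]

# argmin returns a 0-based index; add 1 to match the "Model N" numbering.
model_num_demo = int(np.argmin(nn_cv_mses_demo)) + 1
print(f"Selected Model: {model_num_demo}")
```

This keeps the selection correct if retraining changes which model has the lowest CV error, since neural network training is not deterministic without fixed seeds.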

Classification Evaluation with Python

This time it is a classification model. The task is to classify data with the characteristics below as y=1 or y=0.

First, split the data 6:2:2 and apply scaling as before.

from sklearn.model_selection import train_test_split

# Get 60% of the dataset as the training set. Put the remaining 40% in temporary variables.
x_bc_train, x_, y_bc_train, y_ = train_test_split(x_bc, y_bc, test_size=0.40, random_state=1)

# Split the 40% subset above into two: one half for cross validation and the other for the test set
x_bc_cv, x_bc_test, y_bc_cv, y_bc_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

# Delete temporary variables
del x_, y_

# Scale the features
# Initialize the class
scaler_linear = StandardScaler()

# Compute the mean and standard deviation of the training set then transform it
x_bc_train_scaled = scaler_linear.fit_transform(x_bc_train)
x_bc_cv_scaled = scaler_linear.transform(x_bc_cv)
x_bc_test_scaled = scaler_linear.transform(x_bc_test)

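Then run the Evaluation. Whereas MSE was used before, this time the error is computed from how many predictions are wrong. Below is an example. Suppose the model outputs a probability, with 0.5 or above treated as 1 and anything below treated as 0. If every true label is 1, how is the misclassified fraction computed?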

# Sample model output
probabilities = np.array([0.2, 0.6, 0.7, 0.3, 0.8])

# Apply a threshold to the model output. If greater than or equal to 0.5, set to 1. Else 0.
predictions = np.where(probabilities >= 0.5, 1, 0)

# Ground truth labels
ground_truth = np.array([1, 1, 1, 1, 1])

# Initialize counter for misclassified data
misclassified = 0

# Get number of predictions
num_predictions = len(predictions)

# Loop over each prediction
for i in range(num_predictions):
    
    # Check if it matches the ground truth
    if predictions[i] != ground_truth[i]:
        
        # Add one to the counter if the prediction is wrong
        misclassified += 1

# Compute the fraction of the data that the model misclassified
fraction_error = misclassified/num_predictions

print(f"probabilities: {probabilities}")
print(f"predictions with threshold=0.5: {predictions}")
print(f"targets: {ground_truth}")
print(f"fraction of misclassified data (for-loop): {fraction_error}")
print(f"fraction of misclassified data (with np.mean()): {np.mean(predictions != ground_truth)}")

The predicted probabilities are 0.2, 0.6, 0.7, 0.3, 0.8, so predictions becomes [0, 1, 1, 0, 1]. Each value is then compared with ground_truth, and misclassified is incremented by 1 for every wrong prediction. Here 2 are misclassified, so the count is 2. Dividing by the total number of predictions (5) gives 0.4, i.e. 40% misclassified. The same value can be computed in one line with numpy's mean.

Now train the provided models and use this method to check the Classification Error.

# Initialize lists that will contain the errors for each model
nn_train_error = []
nn_cv_error = []

# Build the models
models_bc = utils.build_models()

# Loop over each model
for model in models_bc:
    
    # Setup the loss and optimizer
    model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    )

    print(f"Training {model.name}...")

    # Train the model
    model.fit(
        x_bc_train_scaled, y_bc_train,
        epochs=200,
        verbose=0
    )
    
    print("Done!\n")
    
    # Set the threshold for classification
    threshold = 0.5
    
    # Record the fraction of misclassified examples for the training set
    yhat = model.predict(x_bc_train_scaled)
    yhat = tf.math.sigmoid(yhat)
    yhat = np.where(yhat >= threshold, 1, 0)
    train_error = np.mean(yhat != y_bc_train)
    nn_train_error.append(train_error)

    # Record the fraction of misclassified examples for the cross validation set
    yhat = model.predict(x_bc_cv_scaled)
    yhat = tf.math.sigmoid(yhat)
    yhat = np.where(yhat >= threshold, 1, 0)
    cv_error = np.mean(yhat != y_bc_cv)
    nn_cv_error.append(cv_error)

# Print the result
for model_num in range(len(nn_train_error)):
    print(
        f"Model {model_num+1}: Training Set Classification Error: {nn_train_error[model_num]:.5f}, " +
        f"CV Set Classification Error: {nn_cv_error[model_num]:.5f}"
        )

Training model_1...
Done!

Training model_2...
Done!

Training model_3...
Done!

Model 1: Training Set Classification Error: 0.05833, CV Set Classification Error: 0.17500
Model 2: Training Set Classification Error: 0.06667, CV Set Classification Error: 0.15000
Model 3: Training Set Classification Error: 0.05000, CV Set Classification Error: 0.15000
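The loop above compiles each model with `from_logits=True`, so `model.predict` is presumed to return raw logits (this assumes the models from `utils.build_models()` end in a linear output layer, which the post does not show), and `tf.math.sigmoid` turns them into probabilities before thresholding. The following numpy sketch, with made-up logits, illustrates that thresholding the probability at 0.5 is the same as thresholding the logit at 0:

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits standing in for raw model outputs.
logits_demo = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])

pred_from_prob = np.where(sigmoid(logits_demo) >= 0.5, 1, 0)
pred_from_logit = np.where(logits_demo >= 0.0, 1, 0)

print(pred_from_prob)
print(pred_from_logit)
```

The sigmoid step still matters if a threshold other than 0.5 is wanted, since thresholds are easier to reason about on the probability scale.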

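What if the CV set errors are equal? In that case, choosing the smaller, lighter model can be advantageous in terms of both resources and speed. Here Model 1 is the smallest model and Model 3 is the largest. Models 2 and 3 give the same CV error, but Model 3 is slightly better on the Training Set Error, so let's select Model 3.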

# Select the model with the lowest error
model_num = 3

# Compute the test error
yhat = models_bc[model_num-1].predict(x_bc_test_scaled)
yhat = tf.math.sigmoid(yhat)
yhat = np.where(yhat >= threshold, 1, 0)
nn_test_error = np.mean(yhat != y_bc_test)

print(f"Selected Model: {model_num}")
print(f"Training Set Classification Error: {nn_train_error[model_num-1]:.4f}")
print(f"CV Set Classification Error: {nn_cv_error[model_num-1]:.4f}")
print(f"Test Set Classification Error: {nn_test_error:.4f}")

Selected Model: 3
Training Set Classification Error: 0.0500
CV Set Classification Error: 0.1500
Test Set Classification Error: 0.1750
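The tie between Models 2 and 3 on CV error can be resolved in code in more than one way. The sketch below uses the error values printed earlier and contrasts the rule used in this post (lower training error wins) with the "prefer the smaller model" rule that was also mentioned. The `_demo` names are hypothetical, and the second rule assumes the models are listed from smallest to largest.

```python
import numpy as np

# Errors copied from the results printed above (Models 1 to 3).
train_err_demo = np.array([0.05833, 0.06667, 0.05000])
cv_err_demo = np.array([0.17500, 0.15000, 0.15000])

# 0-based indices of the models that tie for the lowest CV error.
tied = np.flatnonzero(np.isclose(cv_err_demo, cv_err_demo.min()))

# Rule A (used in this post): among the tied models, prefer the lower training error.
pick_by_train = int(tied[np.argmin(train_err_demo[tied])]) + 1

# Rule B (alternative): among the tied models, prefer the smaller one.
# Assumes the list is ordered from smallest to largest model.
pick_smallest = int(tied[0]) + 1

print(pick_by_train, pick_smallest)
```

Rule A selects Model 3 and Rule B selects Model 2. Which rule is appropriate depends on whether inference cost or a slightly better training fit matters more for the application.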
