Machine Learning Workflow (Simplified)

Dataset Description


Note

  • This dataset originally comes from the Uber Fares Dataset.
  • We made several edits for mentoring purposes, so please use the dataset provided here.

Description:
  • We’re looking to predict the fare of Uber trips.
  • The dataset contains the following fields:

Feature            Type   Description
order_id           int    a unique identifier for each trip
pickup_time        str    the pickup-time class: 04-10, 10-16, 16-22, or 22-04; e.g. 04-10 means the pickup occurred between 04:00 and 10:00
pickup_longitude   float  the longitude where the meter was engaged
pickup_latitude    float  the latitude where the meter was engaged
dropoff_longitude  float  the longitude where the meter was disengaged
dropoff_latitude   float  the latitude where the meter was disengaged
passenger_count    float  the number of passengers in the vehicle (driver-entered value)
fare_amount        float  the cost of each trip in USD (our target)
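
For illustration only, a minimal sketch of how a raw pickup hour would map into these classes (the helper name is hypothetical; the published dataset already ships the buckets):

def pickup_bucket(hour):
    # Map an hour of day (0-23) to the pickup_time classes above
    if 4 <= hour < 10:
        return '04-10'
    elif 10 <= hour < 16:
        return '10-16'
    elif 16 <= hour < 22:
        return '16-22'
    return '22-04'  # 22:00-04:00 wraps past midnight

print(pickup_bucket(7), pickup_bucket(23))  # 04-10 22-04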

Modeling Workflow


1. Import data to Python
2. Data Preprocessing
3. Training Machine Learning Models
4. Predictions & Evaluations

1. Import data to Python


# Import Numpy and Pandas library
import pandas as pd
import numpy as np
# Create a function to read the data
def read_data(fname):
    data = pd.read_csv(fname)
    print('Data shape raw               :', data.shape)
    print('Number of duplicate order id :', data.duplicated(subset='order_id').sum())
    data = data.drop_duplicates(subset='order_id', keep='last')
    data = data.set_index('order_id')
    print('Data shape final             :', data.shape)
    return data
# Read the Uber data
data = read_data(fname='uber_edit.csv')
Data shape raw               : (194814, 8)
Number of duplicate order id : 0
Data shape final             : (194814, 7)
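
As a quick sanity check (output omitted here), the column dtypes should line up with the description table above:

# Confirm the dtypes match the dataset description
print(data.dtypes)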
data.head()
fare_amount pickup_time pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
order_id
24238194 7.5 16-22 -73.999817 40.738354 -73.999512 40.723217 1.0
27835199 7.7 16-22 -73.994355 40.728225 -73.994710 40.750325 1.0
44984355 12.9 16-22 -74.005043 40.740770 -73.962565 40.772647 1.0
25894730 5.3 04-10 -73.976124 40.790844 -73.965316 40.803349 3.0
17610152 16.0 16-22 -73.925023 40.744085 -73.973082 40.761247 5.0

2. Data Preprocessing


The preprocessing pipeline:

2.1 Input-Output Split
2.2 Train-Valid-Test Split
2.3 Separate Numerical and Categorical Features
2.4 Numerical Imputation
2.5 Categorical Imputation
2.6 Preprocess Categorical Features
2.7 Join the Data
2.8 Feature Engineering the Data
2.9 Create a Preprocessing Function

2.1. Input-Output Split


  • We’re going to split input & output according to the modeling objective.
  • Create a function to split the input & output
def split_input_output(data, target_col):
    X = data.drop(columns=target_col)
    y = data[target_col]
    print('X shape:', X.shape)
    print('y shape:', y.shape)
    return X, y
X, y = split_input_output(data=data,
                          target_col='fare_amount')
X shape: (194814, 6)
y shape: (194814,)
X.head()  
pickup_time pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
order_id
24238194 16-22 -73.999817 40.738354 -73.999512 40.723217 1.0
27835199 16-22 -73.994355 40.728225 -73.994710 40.750325 1.0
44984355 16-22 -74.005043 40.740770 -73.962565 40.772647 1.0
25894730 04-10 -73.976124 40.790844 -73.965316 40.803349 3.0
17610152 16-22 -73.925023 40.744085 -73.973082 40.761247 5.0
y.head()
order_id
24238194     7.5
27835199     7.7
44984355    12.9
25894730     5.3
17610152    16.0
Name: fare_amount, dtype: float64

2.2. Train-Valid-Test Split


  • Now, we want to split the data before modeling.
  • Split the data into three sets:
    • Train, for training the model
    • Validation, for choosing the best model
    • Test, for error generalization
from sklearn.model_selection import train_test_split
def split_train_test(X, y, test_size, seed):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    print('X train shape:', X_train.shape)
    print('y train shape:', y_train.shape)
    print('X test shape :', X_test.shape)
    print('y test shape :', y_test.shape)
    return X_train, X_test, y_train, y_test
# Split the data
# First, split the train & not train
X_train, X_not_train, y_train, y_not_train =  split_train_test(X, y, test_size=0.2, seed=123) 

# Then, split the valid & test
X_valid, X_test, y_valid, y_test = split_train_test(X_not_train, y_not_train, test_size=0.5, seed=123)
X train shape: (155851, 6)
y train shape: (155851,)
X test shape : (38963, 6)
y test shape : (38963,)
X train shape: (19481, 6)
y train shape: (19481,)
X test shape : (19482, 6)
y test shape : (19482,)
print(len(X_train)/len(X))  # should be 0.8
print(len(X_valid)/len(X))  # should be 0.1
print(len(X_test)/len(X))   # should be 0.1
0.7999989733797366
0.09999794675947314
0.1000030798607903
X_train.head()
pickup_time pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
order_id
51655713 16-22 -73.979392 40.735734 -73.906281 40.745539 2.0
37525839 16-22 -73.986575 40.761473 -73.981880 40.768660 5.0
55058970 16-22 -73.972533 40.782260 -73.952761 40.708980 1.0
15663447 10-16 -73.979967 40.751612 -73.976313 40.758427 6.0
13325650 16-22 -73.976192 40.744026 -73.980935 40.733946 1.0

2.3. Separate Numerical and Categorical Features


  • We now prepare to perform data preprocessing.
  • But first, we separate the data into numerical and categorical features.
def split_num_cat(data, num_cols, cat_cols):
    data_num = data[num_cols]
    data_cat = data[cat_cols]
    print('Data num shape:', data_num.shape)
    print('Data cat shape:', data_cat.shape)
    return data_num, data_cat
num_cols = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
cat_cols = ['pickup_time']
X_train_num, X_train_cat = split_num_cat(X_train, num_cols, cat_cols)
Data num shape: (155851, 5)
Data cat shape: (155851, 1)
X_train_num.head()
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
order_id
51655713 -73.979392 40.735734 -73.906281 40.745539 2.0
37525839 -73.986575 40.761473 -73.981880 40.768660 5.0
55058970 -73.972533 40.782260 -73.952761 40.708980 1.0
15663447 -73.979967 40.751612 -73.976313 40.758427 6.0
13325650 -73.976192 40.744026 -73.980935 40.733946 1.0
X_train_cat.head()
pickup_time
order_id
51655713 16-22
37525839 16-22
55058970 16-22
15663447 10-16
13325650 16-22

EDA before Preprocessing


  • Find the percentage of missing values
100 * (X_train.isna().sum(0) / len(X_train))
pickup_time          0.000000
pickup_longitude     0.000000
pickup_latitude      0.000000
dropoff_longitude    0.000000
dropoff_latitude     0.000000
passenger_count      0.606348
dtype: float64
  • We will impute any variable that contains missing values

  • First, check the distribution of the numerical features

import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 8))
axes = ax.flatten()

for i, col in enumerate(X_train_num.columns):
    sns.kdeplot(X_train_num[col], ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')

axes[-1].axis('off')  # hide the unused sixth panel
plt.tight_layout()
plt.show()

  • All the distributions are skewed, so we can impute missing values with each feature’s median (a small demo follows).
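
A tiny demonstration of why the median is the safer fill value under skew (the numbers are made up):

skewed = pd.Series([1, 1, 2, 2, 3, 100])    # one extreme value
print('mean  :', skewed.mean())             # 18.17 -- dragged by the outlier
print('median:', skewed.median())           # 2.0  -- robust to the outlier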

  • Next, explore the pickup_time

X_train['pickup_time'].value_counts(normalize=True)
pickup_time
16-22    0.328160
10-16    0.286376
22-04    0.221648
04-10    0.157599
-        0.006217
Name: proportion, dtype: float64
  • There are missing values marked with the symbol '-' in pickup_time

  • We can impute them with the constant 'UNKNOWN'

  • Explore the relation between pickup_time and fare

train_data = pd.concat((X_train, y_train), axis=1)
train_data.head()
pickup_time pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count fare_amount
order_id
51655713 16-22 -73.979392 40.735734 -73.906281 40.745539 2.0 16.5
37525839 16-22 -73.986575 40.761473 -73.981880 40.768660 5.0 3.7
55058970 16-22 -73.972533 40.782260 -73.952761 40.708980 1.0 18.9
15663447 10-16 -73.979967 40.751612 -73.976313 40.758427 6.0 4.1
13325650 16-22 -73.976192 40.744026 -73.980935 40.733946 1.0 5.0
sns.boxplot(data=train_data[train_data['fare_amount'] < 50],
            x='pickup_time',
            y='fare_amount')
plt.show()

  • There is no significant fare difference between the pickup_time classes.
  • We can perform one-hot encoding on this feature.

Conclusion for preprocessing:
  • Impute missing passenger_count values with the feature median.
  • Impute missing pickup_time values with 'UNKNOWN'.
  • Engineer the pickup and dropoff coordinates into a single distance feature; we use Euclidean distance for simplicity.

2.4. Numerical Imputation


  • Now, let’s perform the numerical imputation
  • First, check the missing values in the numerical data
X_train_num.isna().sum(0)
pickup_longitude       0
pickup_latitude        0
dropoff_longitude      0
dropoff_latitude       0
passenger_count      945
dtype: int64
  • Create functions to fit and apply a numerical imputer
from sklearn.impute import SimpleImputer
def num_imputer_fit(data):
    imputer = SimpleImputer(strategy='median')
    imputer.fit(data)
    return imputer

def num_imputer_transform(data, imputer):
    data_imputed = imputer.transform(data)
    data_imputed = pd.DataFrame(data_imputed, columns=data.columns, index=data.index)
    return data_imputed
  • Perform imputation
# Get the numerical imputer
num_imputer = num_imputer_fit(X_train_num) 
# Transform the data
X_train_num_imputed = num_imputer_transform(X_train_num, num_imputer) 
X_train_num_imputed.isna().sum(0)
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64
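
As a quick check, the fill values are the train medians; SimpleImputer stores them in its statistics_ attribute after fitting:

# Learned medians, in the same order as the numerical columns
print(dict(zip(num_cols, num_imputer.statistics_)))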

2.5. Categorical Imputation


  • Next, let’s perform the categorical imputation
X_train_cat.value_counts(normalize=True)
pickup_time
16-22          0.328160
10-16          0.286376
22-04          0.221648
04-10          0.157599
-              0.006217
Name: proportion, dtype: float64
  • Create functions to fit and apply a categorical imputer
def cat_imputer_fit(data): 
    imputer = SimpleImputer(missing_values='-', strategy='constant', fill_value='UNKNOWN')
    imputer.fit(data)
    return imputer
def cat_imputer_transform(data, imputer):
    data_imputed = imputer.transform(data)
    data_imputed = pd.DataFrame(data_imputed, columns=data.columns, index=data.index)
    return data_imputed
  • Perform imputation
# Fit the categorical imputer
cat_imputer = cat_imputer_fit(X_train_cat) 

# Transform
X_train_cat_imputed = cat_imputer_transform(X_train_cat, cat_imputer) 
X_train_cat_imputed.value_counts(normalize=True)
pickup_time
16-22          0.328160
10-16          0.286376
22-04          0.221648
04-10          0.157599
UNKNOWN        0.006217
Name: proportion, dtype: float64

Great!

2.6. Preprocess Categorical Features


  • We will create a one-hot encoder (see the EDA above) for the categorical features
  • Create functions to fit and apply the encoder
from sklearn.preprocessing import OneHotEncoder
def cat_encoder_fit(data):
    # sparse_output=False returns a dense array ('sparse' was deprecated in sklearn 1.2)
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoder.fit(data)
    return encoder

def cat_encoder_transform(data, encoder):
    data_encoded = encoder.transform(data)
    data_encoded = pd.DataFrame(data_encoded, columns=encoder.get_feature_names_out(data.columns), index=data.index)
    return data_encoded
  • Perform encoding
# Fit the one-hot encoder
cat_encoder = cat_encoder_fit(X_train_cat_imputed) 

# Transform
X_train_cat_encoded = cat_encoder_transform(X_train_cat_imputed, cat_encoder) 
print('Original shape:', X_train_cat_imputed.shape)
print('Encoded shape :', X_train_cat_encoded.shape)
Original shape: (155851, 1)
Encoded shape : (155851, 5)
X_train_cat_encoded.head()
pickup_time_04-10 pickup_time_10-16 pickup_time_16-22 pickup_time_22-04 pickup_time_UNKNOWN
order_id
51655713 0.0 0.0 1.0 0.0 0.0
37525839 0.0 0.0 1.0 0.0 0.0
55058970 0.0 0.0 1.0 0.0 0.0
15663447 0.0 1.0 0.0 0.0 0.0
13325650 0.0 0.0 1.0 0.0 0.0
X_train_cat_imputed.head()
pickup_time
order_id
51655713 16-22
37525839 16-22
55058970 16-22
15663447 10-16
13325650 16-22
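
A note on handle_unknown='ignore': if a category unseen during fitting ever reaches the encoder (e.g. in the valid or test set), it is encoded as an all-zero row instead of raising an error. A minimal sketch (the category 'NEW' is made up; sklearn may warn about the unknown category):

unseen = pd.DataFrame({'pickup_time': ['NEW']})
print(cat_encoder.transform(unseen))  # expected: [[0. 0. 0. 0. 0.]]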

Great!

2.7. Join the Data


  • After all features are numeric and complete, we can join the data
  • Create a function to join the data
def concat_data(num_data, cat_data):
    data = pd.concat((num_data, cat_data), axis=1)
    return data
  • Perform concatenation
X_train_concat = concat_data(X_train_num_imputed, X_train_cat_encoded)
print('Numerical data shape  :', X_train_num_imputed.shape)
print('Categorical data shape:', X_train_cat_encoded.shape)
print('Concat data shape     :', X_train_concat.shape)
Numerical data shape  : (155851, 5)
Categorical data shape: (155851, 5)
Concat data shape     : (155851, 10)
# Validate
X_train_concat.head()
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count pickup_time_04-10 pickup_time_10-16 pickup_time_16-22 pickup_time_22-04 pickup_time_UNKNOWN
order_id
51655713 -73.979392 40.735734 -73.906281 40.745539 2.0 0.0 0.0 1.0 0.0 0.0
37525839 -73.986575 40.761473 -73.981880 40.768660 5.0 0.0 0.0 1.0 0.0 0.0
55058970 -73.972533 40.782260 -73.952761 40.708980 1.0 0.0 0.0 1.0 0.0 0.0
15663447 -73.979967 40.751612 -73.976313 40.758427 6.0 0.0 1.0 0.0 0.0 0.0
13325650 -73.976192 40.744026 -73.980935 40.733946 1.0 0.0 0.0 1.0 0.0 0.0

2.8. Feature Engineering the Data


  • The raw pickup and dropoff coordinates are not directly usable as features.
  • We can summarize them with a single engineered feature, distance, between the pickup and dropoff points.
def map_distance(data):
    data = data.copy()  # avoid mutating the caller's DataFrame in place
    # Euclidean distance (in raw degrees) between pickup and dropoff points
    data['distance'] = np.sqrt((data['pickup_longitude'] - data['dropoff_longitude'])**2 + (data['pickup_latitude'] - data['dropoff_latitude'])**2)
    data = data.drop(columns=['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])
    return data
  • Perform distance calculation
X_train_concat_fe = map_distance(X_train_concat)
print('Original data shape:', X_train_concat.shape)
print('Mapped data shape  :', X_train_concat_fe.shape)
Original data shape: (155851, 10)
Mapped data shape  : (155851, 7)
X_train_concat_fe.head()
passenger_count pickup_time_04-10 pickup_time_10-16 pickup_time_16-22 pickup_time_22-04 pickup_time_UNKNOWN distance
order_id
51655713 2.0 0.0 0.0 1.0 0.0 0.0 0.073766
37525839 5.0 0.0 0.0 1.0 0.0 0.0 0.008585
55058970 1.0 0.0 0.0 1.0 0.0 0.0 0.075901
15663447 6.0 0.0 1.0 0.0 0.0 0.0 0.007733
13325650 1.0 0.0 0.0 1.0 0.0 0.0 0.011140
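
As a hand check of the first row above: dlon = -73.979392 - (-73.906281) = -0.073111 and dlat = 40.735734 - 40.745539 = -0.009805, so distance = sqrt(0.073111^2 + 0.009805^2) ≈ 0.073766, matching the table. Note this distance is in raw degrees, not kilometers; it works as a relative feature, but if a geographically faithful distance were ever needed, a haversine distance could be substituted. A sketch (not used in the pipeline above):

def haversine_km(lon1, lat1, lon2, lat2):
    # Great-circle distance in kilometers (R = mean Earth radius)
    R = 6371.0
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = np.sin((lat2 - lat1) / 2)**2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return 2 * R * np.arcsin(np.sqrt(a))

print(haversine_km(-73.979392, 40.735734, -73.906281, 40.745539))  # ~6.3 km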
  • And finally, we standardize the data so that all features are on a comparable scale during model optimization
from sklearn.preprocessing import StandardScaler
def fit_scaler(data):
    scaler = StandardScaler()
    scaler.fit(data)
    return scaler
def transform_scaler(data, scaler):
    data_scaled = scaler.transform(data)
    data_scaled = pd.DataFrame(data_scaled, columns=data.columns, index=data.index)
    return data_scaled
# Fit the scaler
scaler = fit_scaler(X_train_concat_fe) 
# Scale the training data
X_train_clean = transform_scaler(X_train_concat_fe, scaler)
X_train_clean.describe().round(4)
passenger_count pickup_time_04-10 pickup_time_10-16 pickup_time_16-22 pickup_time_22-04 pickup_time_UNKNOWN distance
count 155851.0000 155851.0000 155851.0000 155851.0000 155851.0000 155851.0000 155851.0000
mean -0.0000 0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000
std 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
min -0.5267 -0.4325 -0.6335 -0.6989 -0.5336 -0.0791 -0.0383
25% -0.5267 -0.4325 -0.6335 -0.6989 -0.5336 -0.0791 -0.0340
50% -0.5267 -0.4325 -0.6335 -0.6989 -0.5336 -0.0791 -0.0311
75% 0.2412 -0.4325 1.5786 1.4308 -0.5336 -0.0791 -0.0256
max 3.3130 2.3120 1.5786 1.4308 1.8739 12.6427 218.5245
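
StandardScaler standardizes each column as (x - mean_) / scale_ using statistics learned on the train set; a quick way to inspect them:

# Per-column means and standard deviations learned by the scaler
print(pd.DataFrame({'mean': scaler.mean_, 'std': scaler.scale_},
                   index=X_train_concat_fe.columns).round(4))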

2.9. Create a Preprocessing Function


  • Now, let’s create a function to preprocess the other data sets (valid & test) so that we can generate predictions on them
def preprocess_data(data, num_cols, cat_cols, num_imputer, cat_imputer, cat_encoder, scaler):
    data_num, data_cat = split_num_cat(data, num_cols, cat_cols)
    data_num_imputed = num_imputer_transform(data_num, num_imputer)
    data_cat_imputed = cat_imputer_transform(data_cat, cat_imputer)
    data_cat_encoded = cat_encoder_transform(data_cat_imputed, cat_encoder)
    data_concat = concat_data(data_num_imputed, data_cat_encoded)
    data_mapped = map_distance(data_concat)
    data_scaled = transform_scaler(data_mapped, scaler)
    return data_scaled
X_train_clean = preprocess_data(data=X_train, 
                                num_cols=num_cols, 
                                cat_cols=cat_cols, 
                                num_imputer=num_imputer, 
                                cat_imputer=cat_imputer, 
                                cat_encoder=cat_encoder, 
                                scaler=scaler)

Data num shape: (155851, 5)
Data cat shape: (155851, 1)
print('Original data shape:', X_train.shape)
print('Cleaned data shape :', X_train_clean.shape)
X_train_clean.head()
Original data shape: (155851, 6)
Cleaned data shape : (155851, 7)
passenger_count pickup_time_04-10 pickup_time_10-16 pickup_time_16-22 pickup_time_22-04 pickup_time_UNKNOWN distance
order_id
51655713 0.241233 -0.432531 -0.633481 1.430838 -0.533634 -0.079097 -0.013978
37525839 2.545080 -0.432531 -0.633481 1.430838 -0.533634 -0.079097 -0.035436
55058970 -0.526715 -0.432531 -0.633481 1.430838 -0.533634 -0.079097 -0.013275
15663447 3.313029 -0.432531 1.578579 -0.698891 -0.533634 -0.079097 -0.035716
13325650 -0.526715 -0.432531 -0.633481 1.430838 -0.533634 -0.079097 -0.034594
X_valid_clean = preprocess_data(data=X_valid, 
                                num_cols=num_cols, 
                                cat_cols=cat_cols, 
                                num_imputer=num_imputer, 
                                cat_imputer=cat_imputer, 
                                cat_encoder=cat_encoder, 
                                scaler=scaler)
X_test_clean = preprocess_data(data=X_test, 
                               num_cols=num_cols, 
                               cat_cols=cat_cols, 
                               num_imputer=num_imputer, 
                               cat_imputer=cat_imputer, 
                               cat_encoder=cat_encoder, 
                               scaler=scaler)
print('Cleaned X_valid data shape :', X_valid_clean.shape)
print('Cleaned X_test data shape :', X_test_clean.shape)
Data num shape: (19481, 5)
Data cat shape: (19481, 1)
Data num shape: (19482, 5)
Data cat shape: (19482, 1)
Cleaned X_valid data shape : (19481, 7)
Cleaned X_test data shape : (19482, 7)

3. Training Machine Learning Models


3.1 Prepare train & evaluate model function
3.2 Train & evaluate several models
3.3 Choose the best model

3.1. Prepare Train & Evaluate Model Functions


  • Before modeling, let’s prepare functions to train & evaluate a model
from sklearn.metrics import mean_squared_error
def train_model(estimator, X_train, y_train):
    estimator.fit(X_train, y_train)
    
def evaluate_model(estimator, X_train, y_train, X_valid, y_valid):
    y_train_pred = estimator.predict(X_train)
    y_valid_pred = estimator.predict(X_valid)
    rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    rmse_valid = np.sqrt(mean_squared_error(y_valid, y_valid_pred))
    return rmse_train, rmse_valid
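
For reference, RMSE = sqrt((1/n) * sum_i (y_i - y_hat_i)^2). A two-line check with made-up numbers:

y_true, y_pred = np.array([3.0, 5.0]), np.array([4.0, 7.0])
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # 1.5811...
print(np.sqrt(np.mean((y_true - y_pred)**2)))       # same value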

3.2. Train and Evaluate Several Models


  • Now, let’s train & evaluate several models

  • Check which of the following models performs best:

    1. Baseline model
    2. k-NN with k=1
    3. k-NN with k=100
    4. k-NN with k=200
from sklearn.neighbors import KNeighborsRegressor
from sklearn.dummy import DummyRegressor
reg_1 = DummyRegressor()
reg_2 = KNeighborsRegressor(n_neighbors=1)
reg_3 = KNeighborsRegressor(n_neighbors=100) 
reg_4 = KNeighborsRegressor(n_neighbors=200)
# Train the model
train_model(reg_1, X_train_clean, y_train)
train_model(reg_2, X_train_clean, y_train)
train_model(reg_3, X_train_clean, y_train)
train_model(reg_4, X_train_clean, y_train)
import time

for reg in [reg_1, reg_2, reg_3, reg_4]:
    t0 = time.time()

    # Generate the rmse
    rmse_train, rmse_valid = evaluate_model(estimator=reg,
                                            X_train=X_train_clean,
                                            y_train=y_train,
                                            X_valid=X_valid_clean,
                                            y_valid=y_valid)

    # Logging
    elapsed = time.time() - t0
    print(f'model : {str(reg):40s} '
          f'| RMSE train: {rmse_train:.4f} '
          f'| RMSE valid: {rmse_valid:.4f} '
          f'| Time elapsed: {elapsed*1000:.2f} ms')
model : DummyRegressor()                         | RMSE train: 8.9221 | RMSE valid: 8.8614 | Time elapsed: 2.00 ms
model : KNeighborsRegressor(n_neighbors=1)       | RMSE train: 1.4510 | RMSE valid: 5.4185 | Time elapsed: 11933.78 ms
model : KNeighborsRegressor(n_neighbors=100)     | RMSE train: 3.9589 | RMSE valid: 3.9791 | Time elapsed: 23782.22 ms
model : KNeighborsRegressor(n_neighbors=200)     | RMSE train: 4.1513 | RMSE valid: 4.1420 | Time elapsed: 22960.39 ms
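
Optionally, a quick sweep over k makes the bias-variance trade-off visible (the k grid below is an assumption, and each evaluation is slow on ~156k training rows):

for k in [1, 10, 50, 100, 200, 400]:
    reg_k = KNeighborsRegressor(n_neighbors=k)
    train_model(reg_k, X_train_clean, y_train)
    _, rmse_valid = evaluate_model(reg_k, X_train_clean, y_train,
                                   X_valid_clean, y_valid)
    print(f'k = {k:3d} | RMSE valid: {rmse_valid:.4f}')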

3.3. Choose the best model


From the previous results, which one is the best model, and why?

The k-NN model with k=100 (reg_3) gives the lowest validation RMSE (3.9791): the baseline badly underfits, k=1 overfits (train RMSE 1.45 vs. valid RMSE 5.42), and k=200 smooths a little too much (valid RMSE 4.14). We store it as reg_best:

reg_best = reg_3

4. Predictions & Evaluations


4.1 Predict & Evaluate on the Train Data
4.2 Predict & Evaluate on the Test Data

4.1. Predict & evaluate on train data


# Predict
y_train_pred = reg_best.predict(X_train_clean)
plt.scatter(y_train, y_train_pred)

plt.plot([0, 200], [0, 200], c='red')
plt.xlim(0, 200); plt.ylim(0, 200)
plt.xlabel('y actual'); plt.ylabel('y predicted')
plt.title('Comparison of y actual vs y predicted on Train Data')
plt.show()

4.2. Predict & evaluate on test data


# Predict
y_test_pred = reg_best.predict(X_test_clean)
# Visualize & compare the prediction
plt.scatter(y_test, y_test_pred)

plt.plot([0, 200], [0, 200], c='red')
plt.xlim(0, 200); plt.ylim(0, 200)
plt.xlabel('y actual'); plt.ylabel('y predicted')
plt.title('Comparison of y actual vs y predicted on Test Data')
plt.show()

# RMSE 
rmse_train, rmse_test = evaluate_model(estimator=reg_best,
                                       X_train=X_train_clean,
                                       y_train=y_train,
                                       X_valid=X_test_clean,
                                       y_valid=y_test)
print(f'| RMSE train: {rmse_train:.4f} '
      f'| RMSE test: {rmse_test:.4f} ')
| RMSE train: 3.9589 | RMSE test: 4.0970
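
Finally, the fitted artifacts can score a brand-new order end-to-end. A sketch (the coordinates and passenger count below are made up):

new_order = pd.DataFrame([{
    'pickup_time': '16-22',
    'pickup_longitude': -73.98, 'pickup_latitude': 40.74,
    'dropoff_longitude': -73.95, 'dropoff_latitude': 40.77,
    'passenger_count': 2.0,
}], index=pd.Index([0], name='order_id'))

new_clean = preprocess_data(data=new_order, num_cols=num_cols, cat_cols=cat_cols,
                            num_imputer=num_imputer, cat_imputer=cat_imputer,
                            cat_encoder=cat_encoder, scaler=scaler)
print('Predicted fare (USD):', reg_best.predict(new_clean)[0])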