Non-Linear Regression

Wage Data

This dataset contains information on individuals, including their wage, education, work experience, and other demographic characteristics. Each variable is defined as follows:

Column      Data Type             Description
wage        Numeric (float)       Individual's wage in dollars per week
education   Numeric (int)         Number of years of formal education completed by the individual
experience  Numeric (int)         Number of years of the individual's work experience
ethnicity   Categorical (string)  The individual's ethnic group
area_type   Categorical (string)  Type of area where the individual lives
region      Categorical (string)  Geographic region where the individual works
parttime    Categorical (string)  The individual's employment status, part-time or full-time

Modeling

Objective: predict an individual's wage from the available variables. The wage variable will be used as the target (Y); the remaining variables will serve as predictors (X).

import pandas as pd
import numpy as np

df1 = pd.read_csv('wagedata.csv')
df1.head()
wage education experience ethnicity area_type region parttime
0 498.58 14 15 cauc urban south no
1 205.76 9 47 cauc urban south yes
2 490.39 13 14 cauc urban west no
3 237.42 13 4 cauc urban west yes
4 759.73 12 44 cauc urban northeast no

Descriptive Overview

df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   wage        500 non-null    float64
 1   education   500 non-null    int64  
 2   experience  500 non-null    int64  
 3   ethnicity   500 non-null    object 
 4   area_type   500 non-null    object 
 5   region      500 non-null    object 
 6   parttime    500 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.5+ KB

Data Preprocessing

Exploratory Data Analysis (EDA)

!! Exercise !!

Explore all of the variables in the dataset (a minimal starting sketch follows this list). Please carry out:
- Descriptive statistics for the numeric variables
- Frequency counts for the categorical variables
- Distribution plots for the numeric variables
- Frequency plots for the categorical variables
- Plots of the relationship between wage and the other predictors, for example a scatter plot of wage against experience and a boxplot of wage by education
- A scatter plot of wage against experience with wage converted to a logarithmic scale
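A minimal starting sketch for the exercise, assuming matplotlib is available; adapt and extend it as needed.

import matplotlib.pyplot as plt

# Descriptive statistics for the numeric variables
print(df1[['wage', 'education', 'experience']].describe())

# Frequency counts for the categorical variables
for col in ['ethnicity', 'area_type', 'region', 'parttime']:
    print(df1[col].value_counts(), '\n')

# Distributions of the numeric variables
df1[['wage', 'education', 'experience']].hist(bins=30, figsize=(10, 3))
plt.tight_layout()
plt.show()

# Frequencies of the categorical variables
for col in ['ethnicity', 'area_type', 'region', 'parttime']:
    df1[col].value_counts().plot(kind='bar', title=col)
    plt.show()

# Relationship between wage and experience
df1.plot(kind='scatter', x='experience', y='wage')
plt.show()

# Relationship between wage and education
df1.boxplot(column='wage', by='education')
plt.show()

# Scatter plot of wage vs experience with wage on a log scale
ax = df1.plot(kind='scatter', x='experience', y='wage')
ax.set_yscale('log')
plt.show()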

Split Input Output

def split_input_output(data, target_col):
    X = data.drop(columns=target_col)
    y = data[target_col]
    print('X shape:', X.shape)
    print('y shape:', y.shape)
    return X, y
X, y = split_input_output(data=df1,
                          target_col='wage')
X shape: (500, 6)
y shape: (500,)
X.head()  
education experience ethnicity area_type region parttime
0 14 15 cauc urban south no
1 9 47 cauc urban south yes
2 13 14 cauc urban west no
3 13 4 cauc urban west yes
4 12 44 cauc urban northeast no
y.head()
0    498.58
1    205.76
2    490.39
3    237.42
4    759.73
Name: wage, dtype: float64

Train Test Split

from sklearn.model_selection import train_test_split
def split_train_test(X, y, test_size, seed):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    print('X train shape:', X_train.shape)
    print('y train shape:', y_train.shape)
    print('X test shape :', X_test.shape)
    print('y test shape :', y_test.shape)
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test =  split_train_test(X, y, test_size=0.1, seed=123) 
X train shape: (450, 6)
y train shape: (450,)
X test shape : (50, 6)
y test shape : (50,)
X_train.head()
education experience ethnicity area_type region parttime
289 16 6 cauc urban midwest no
182 16 1 cauc urban midwest yes
13 6 36 cauc urban northeast no
114 12 0 cauc rural south no
475 18 24 cauc urban northeast no
y_train.head()
289    1234.57
182     237.04
13      344.25
114     123.46
475    1414.61
Name: wage, dtype: float64

Split Numeric and Categorical

def split_num_cat(data, num_cols, cat_cols):
    data_num = data[num_cols]
    data_cat = data[cat_cols]
    print('Data num shape:', data_num.shape)
    print('Data cat shape:', data_cat.shape)
    return data_num, data_cat
num_cols = ['education', 'experience']
cat_cols = ['ethnicity', 'area_type', 'region', 'parttime']

X_train_num, X_train_cat = split_num_cat(X_train, num_cols, cat_cols) 
Data num shape: (450, 2)
Data cat shape: (450, 4)

One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder
def cat_encoder_fit(data):
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoder.fit(data)
    return encoder

def cat_encoder_transform(data, encoder):
    data_encoded = encoder.transform(data)
    data_encoded = pd.DataFrame(data_encoded, columns=encoder.get_feature_names_out(data.columns), index=data.index)
    return data_encoded
cat_encoder = cat_encoder_fit(X_train_cat) 
X_train_cat_encoded = cat_encoder_transform(X_train_cat, cat_encoder) 
print('Original shape:', X_train_cat.shape)
print('Encoded shape :', X_train_cat_encoded.shape)
Original shape: (450, 4)
Encoded shape : (450, 10)
X_train_cat.head()
ethnicity area_type region parttime
289 cauc urban midwest no
182 cauc urban midwest yes
13 cauc urban northeast no
114 cauc rural south no
475 cauc urban northeast no
X_train_cat_encoded.head()
ethnicity_afam ethnicity_cauc area_type_rural area_type_urban region_midwest region_northeast region_south region_west parttime_no parttime_yes
289 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
182 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
13 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
114 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
475 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
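
Because the encoder was fitted with handle_unknown='ignore', a category that never appeared in the training data is encoded as all zeros for that feature instead of raising an error. A small illustration (the value 'suburban' below is hypothetical and does not occur in the dataset):

# Hypothetical unseen category: 'suburban' does not occur in the training data
unseen = pd.DataFrame([['cauc', 'suburban', 'south', 'no']], columns=cat_cols)
# area_type_rural and area_type_urban are both 0.0 because 'suburban' was never seen during fit
print(cat_encoder_transform(unseen, cat_encoder))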

Combine Data

def concat_data(num_data, cat_data):
    data = pd.concat((num_data, cat_data), axis=1)
    return data
X_train_concat = concat_data(X_train_num, X_train_cat_encoded)
print('Numerical data shape  :', X_train_num.shape)
print('Categorical data shape:', X_train_cat_encoded.shape)
print('Concat data shape     :', X_train_concat.shape)
Numerical data shape  : (450, 2)
Categorical data shape: (450, 10)
Concat data shape     : (450, 12)

Preprocessing Function

def preprocess_data(data, num_cols, cat_cols, cat_encoder):
    X_num, X_cat = split_num_cat(data, num_cols, cat_cols)
    X_cat_encoded = cat_encoder_transform(X_cat, cat_encoder)
    X_concat = concat_data(X_num, X_cat_encoded)
    return X_concat
X_train_clean = preprocess_data(X_train, num_cols, cat_cols, cat_encoder)
X_train_clean.head()
Data num shape: (450, 2)
Data cat shape: (450, 4)
education experience ethnicity_afam ethnicity_cauc area_type_rural area_type_urban region_midwest region_northeast region_south region_west parttime_no parttime_yes
289 16 6 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
182 16 1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
13 6 36 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
114 12 0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
475 18 24 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
X_train_clean.shape
(450, 12)

Train Model

Dummy Regression

from sklearn.dummy import DummyRegressor
reg_1 = DummyRegressor()
reg_1.fit(X_train_clean, y_train)
DummyRegressor()
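With its default strategy='mean', DummyRegressor ignores the features entirely and predicts the mean of y_train for every row, which makes it a useful baseline. A quick check:

# Every prediction equals the training mean
print(reg_1.predict(X_train_clean.head(3)))
print(y_train.mean())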

Linear Regression

from sklearn.linear_model import LinearRegression
reg_2 = LinearRegression()
reg_2.fit(X_train_clean, y_train)
LinearRegression()

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Specify which columns get the polynomial expansion
numeric_features = ["experience"]  # Only `experience` will be expanded into polynomial terms
other_features = ["education"]  # `education` is kept as-is

# Transformer that applies PolynomialFeatures only to `experience`
poly_transformer = ColumnTransformer([
    ("poly_exp", PolynomialFeatures(degree=2, include_bias=False), ["experience"]),
    ("passthrough", "passthrough", other_features)  # `education` is passed through; remaining columns are dropped by default
])

# Regression pipeline with a polynomial transformation applied only to `experience`
reg_3 = Pipeline([
    ("poly_features", poly_transformer),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_3.fit(X_train_clean, y_train)
Pipeline(steps=[('poly_features',
                 ColumnTransformer(transformers=[('poly_exp',
                                                  PolynomialFeatures(include_bias=False),
                                                  ['experience']),
                                                 ('passthrough', 'passthrough',
                                                  ['education'])])),
                ('regressor', LinearRegression())])
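Note that ColumnTransformer drops any column that is not listed (remainder='drop' is the default), so reg_3 is effectively fitted on just experience, experience squared, and education; the one-hot dummy columns are discarded. This can be verified with get_feature_names_out (available in recent scikit-learn versions):

# Only three features reach the regressor: experience, experience^2, and education
print(reg_3.named_steps['poly_features'].get_feature_names_out())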
# Here all columns of X_train_clean (including the one-hot dummies) are expanded into degree-2 polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
poly_features = PolynomialFeatures(degree=2, include_bias=False)
reg_4 = make_pipeline(poly_features, LinearRegression())
reg_4.fit(X_train_clean, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression())])
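By contrast, reg_4 applies PolynomialFeatures to all twelve columns of X_train_clean, including the one-hot dummies, so the design matrix contains every square and pairwise interaction (90 columns for degree 2 without the bias term). A quick check of the expanded shape:

# Degree-2 expansion of 12 features: 12 linear + 78 squared/interaction terms = 90 columns
X_train_poly = reg_4.named_steps['polynomialfeatures'].transform(X_train_clean)
print(X_train_poly.shape)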

Piecewise Regression

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Piecewise transformation for `experience`: identity below 10, slope 0.5 above 10
def piecewise_transform(X):
    return np.where(X < 10, X, 10 + (X - 10) * 0.5)

# Wrap the function so it can be used inside a ColumnTransformer
piecewise_transformer = FunctionTransformer(piecewise_transform, validate=True)

# Specify the features
numeric_features = ["experience"]  # Only `experience` gets the piecewise transformation
other_numeric_features = ["education"]  # Used without any transformation

# Apply the piecewise transformation to `experience` and pass `education` through
preprocessor = ColumnTransformer([
    ("piecewise_exp", piecewise_transformer, ["experience"]),  # piecewise transform for `experience`
    ("passthrough", "passthrough", other_numeric_features)
])

# Build the regression pipeline
reg_5 = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_5.fit(X_train_clean, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('piecewise_exp',
                                                  FunctionTransformer(func=<function piecewise_transform at 0x000001B9B0743F60>,
                                                                      validate=True),
                                                  ['experience']),
                                                 ('passthrough', 'passthrough',
                                                  ['education'])])),
                ('regressor', LinearRegression())])
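The piecewise function above leaves experience unchanged up to 10 years and then grows with a slope of 0.5, i.e. a single knot at 10 years with a flatter segment afterwards (the knot position and slope are illustrative choices, not values estimated from the data). A quick sanity check:

# Below the knot the value passes through; above it each extra year counts half
print(piecewise_transform(np.array([[5.0], [10.0], [20.0]])))
# expected: [[ 5.], [10.], [15.]]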

Spline Regression

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import SplineTransformer

# Specify which columns get the spline expansion
numeric_features = ["experience"]  # Only `experience` is expanded with a spline basis
other_numeric_features = ["education"]  # Other numeric feature kept without transformation

# ColumnTransformer that applies the spline expansion only to `experience`
preprocessor = ColumnTransformer([
    ("spline_exp", SplineTransformer(degree=2, n_knots=4, include_bias=False), ["experience"]),
    ("pass_numeric", "passthrough", other_numeric_features)
])

# Regression pipeline: spline-transformed `experience` plus the other predictor
reg_6 = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_6.fit(X_train_clean, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('spline_exp',
                                                  SplineTransformer(degree=2,
                                                                    include_bias=False,
                                                                    n_knots=4),
                                                  ['experience']),
                                                 ('pass_numeric', 'passthrough',
                                                  ['education'])])),
                ('regressor', LinearRegression())])
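SplineTransformer expands experience into several overlapping B-spline basis columns (the number depends on n_knots and degree), and the linear regression then fits one coefficient per basis column, which is what lets the fitted curve bend smoothly. To inspect how many columns the preprocessor actually produces here:

# Spline basis columns for experience plus the passed-through education column
X_train_spline = reg_6.named_steps['preprocessor'].transform(X_train_clean)
print(X_train_spline.shape)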

Model Evaluation

from sklearn.metrics import mean_squared_error
def evaluate_model(model, X, y):
    y_pred = model.predict(X)
    rmse = mean_squared_error(y, y_pred, squared=False)
    return rmse
rmse_1 = evaluate_model(reg_1, X_train_clean, y_train)
rmse_2 = evaluate_model(reg_2, X_train_clean, y_train)
rmse_3 = evaluate_model(reg_3, X_train_clean, y_train)
rmse_4 = evaluate_model(reg_4, X_train_clean, y_train)
rmse_5 = evaluate_model(reg_5, X_train_clean, y_train)
rmse_6 = evaluate_model(reg_6, X_train_clean, y_train)

print('RMSE Dummy Regressor:', rmse_1)
print('RMSE Linear Regressor:', rmse_2)
print('RMSE Polynomial Regressor:', rmse_3)
print('RMSE Polynomial Regressor 2:', rmse_4)
print('RMSE Piecewise Regressor:', rmse_5)
print('RMSE Spline Regressor:', rmse_6)
RMSE Dummy Regressor: 397.1282408894203
RMSE Linear Regressor: 326.46844565698024
RMSE Polynomial Regressor: 337.1922218217974
RMSE Polynomial Regressor 2: 302.8829003193245
RMSE Piecewise Regressor: 352.2093852655176
RMSE Spline Regressor: 334.9618629115691

Model Evaluation on the Test Data

X_test_clean = preprocess_data(X_test, num_cols, cat_cols, cat_encoder)
X_test_clean.head()
Data num shape: (50, 2)
Data cat shape: (50, 4)
education experience ethnicity_afam ethnicity_cauc area_type_rural area_type_urban region_midwest region_northeast region_south region_west parttime_no parttime_yes
229 15 20 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
337 14 23 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
327 18 18 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
416 12 12 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
306 11 24 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
rmse_1_test = evaluate_model(reg_1, X_test_clean, y_test)
rmse_2_test = evaluate_model(reg_2, X_test_clean, y_test)
rmse_3_test = evaluate_model(reg_3, X_test_clean, y_test)
rmse_4_test = evaluate_model(reg_4, X_test_clean, y_test)
rmse_5_test = evaluate_model(reg_5, X_test_clean, y_test)
rmse_6_test = evaluate_model(reg_6, X_test_clean, y_test)

print('RMSE Dummy Regressor:', rmse_1_test)
print('RMSE Linear Regressor:', rmse_2_test)
print('RMSE Polynomial Regressor:', rmse_3_test)
print('RMSE Polynomial Regressor 2:', rmse_4_test)
print('RMSE Piecewise Regressor:', rmse_5_test)
print('RMSE Spline Regressor:', rmse_6_test)
RMSE Dummy Regressor: 497.65831671172754
RMSE Linear Regressor: 437.80311521309835
RMSE Polynomial Regressor: 441.7499051748951
RMSE Polynomial Regressor 2: 410.76789730687176
RMSE Piecewise Regressor: 462.3779894756133
RMSE Spline Regressor: 442.6886935062947

Which model performs best?
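
One way to answer this is to put the train and test RMSE of all six models side by side and look for the lowest test RMSE together with a reasonable train-test gap; a minimal sketch:

# Collect the train and test RMSE values computed above into one table
results = pd.DataFrame({
    'model': ['Dummy', 'Linear', 'Polynomial (experience only)',
              'Polynomial (all features)', 'Piecewise', 'Spline'],
    'rmse_train': [rmse_1, rmse_2, rmse_3, rmse_4, rmse_5, rmse_6],
    'rmse_test': [rmse_1_test, rmse_2_test, rmse_3_test,
                  rmse_4_test, rmse_5_test, rmse_6_test],
})
print(results.sort_values('rmse_test'))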