Non-Linear Regression

Wage Data

This dataset contains information on individuals, including their wage, education, work experience, and other demographic characteristics. Each variable is defined as follows:

Column      Data Type             Description
wage        Numeric (float)       Individual's wage in dollars per week
education   Numeric (int)         Number of years of formal education completed by the individual
experience  Numeric (int)         Number of years of the individual's work experience
ethnicity   Categorical (string)  The individual's ethnic group
area_type   Categorical (string)  Type of area where the individual lives
region      Categorical (string)  Geographic region where the individual works
parttime    Categorical (string)  The individual's employment status, part-time or full-time

Modeling

Objective: predict an individual's wage from the available variables. The wage variable will be used as the target (Y); the remaining variables will serve as predictors (X).

import pandas as pd
import numpy as np

df1 = pd.read_csv('wagedata.csv')
df1.head()
wage education experience ethnicity area_type region parttime
0 498.58 14 15 cauc urban south no
1 205.76 9 47 cauc urban south yes
2 490.39 13 14 cauc urban west no
3 237.42 13 4 cauc urban west yes
4 759.73 12 44 cauc urban northeast no

Descriptive Overview

df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   wage        500 non-null    float64
 1   education   500 non-null    int64  
 2   experience  500 non-null    int64  
 3   ethnicity   500 non-null    object 
 4   area_type   500 non-null    object 
 5   region      500 non-null    object 
 6   parttime    500 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.5+ KB

Data Preprocessing

Exploratory Data Analysis (EDA)

!! Exercise !!

Explore all of the variables in the dataset (a minimal starting sketch follows this list). Please carry out:
- Descriptive statistics for the numeric variables
- Frequency counts for the categorical variables
- Distribution plots for the numeric variables
- Frequency plots for the categorical variables
- Plots of the relationship between wage and the other predictors, for example a scatter plot of wage against experience and a boxplot of wage by education
- A scatter plot of wage against experience with wage converted to a logarithmic scale
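A minimal starting sketch for the exercise, assuming matplotlib is available; adapt and extend it as needed.

import matplotlib.pyplot as plt

# Descriptive statistics for the numeric variables
print(df1[['wage', 'education', 'experience']].describe())

# Frequency counts for the categorical variables
for col in ['ethnicity', 'area_type', 'region', 'parttime']:
    print(df1[col].value_counts(), '\n')

# Distributions of the numeric variables
df1[['wage', 'education', 'experience']].hist(bins=30, figsize=(10, 3))
plt.tight_layout()
plt.show()

# Frequencies of the categorical variables
for col in ['ethnicity', 'area_type', 'region', 'parttime']:
    df1[col].value_counts().plot(kind='bar', title=col)
    plt.show()

# Relationship between wage and experience
df1.plot(kind='scatter', x='experience', y='wage')
plt.show()

# Relationship between wage and education
df1.boxplot(column='wage', by='education')
plt.show()

# Scatter plot of wage vs experience with wage on a log scale
ax = df1.plot(kind='scatter', x='experience', y='wage')
ax.set_yscale('log')
plt.show()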

Split Input Output

def split_input_output(data, target_col):
    X = data.drop(columns=target_col)
    y = data[target_col]
    print('X shape:', X.shape)
    print('y shape:', y.shape)
    return X, y
X, y = split_input_output(data=df1,
                          target_col='wage')
X shape: (500, 6)
y shape: (500,)
X.head()  
education experience ethnicity area_type region parttime
0 14 15 cauc urban south no
1 9 47 cauc urban south yes
2 13 14 cauc urban west no
3 13 4 cauc urban west yes
4 12 44 cauc urban northeast no
y.head()
0    498.58
1    205.76
2    490.39
3    237.42
4    759.73
Name: wage, dtype: float64

Train Test Split

from sklearn.model_selection import train_test_split
def split_train_test(X, y, test_size, seed):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    print('X train shape:', X_train.shape)
    print('y train shape:', y_train.shape)
    print('X test shape :', X_test.shape)
    print('y test shape :', y_test.shape)
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test =  split_train_test(X, y, test_size=0.1, seed=123) 
X train shape: (450, 6)
y train shape: (450,)
X test shape : (50, 6)
y test shape : (50,)
X_train.head()
education experience ethnicity area_type region parttime
289 16 6 cauc urban midwest no
182 16 1 cauc urban midwest yes
13 6 36 cauc urban northeast no
114 12 0 cauc rural south no
475 18 24 cauc urban northeast no
y_train.head()
289    1234.57
182     237.04
13      344.25
114     123.46
475    1414.61
Name: wage, dtype: float64

Split Numeric and Categorical

def split_num_cat(data, num_cols, cat_cols):
    data_num = data[num_cols]
    data_cat = data[cat_cols]
    print('Data num shape:', data_num.shape)
    print('Data cat shape:', data_cat.shape)
    return data_num, data_cat
num_cols = ['education', 'experience']
cat_cols = ['ethnicity', 'area_type', 'region', 'parttime']

X_train_num, X_train_cat = split_num_cat(X_train, num_cols, cat_cols) 
Data num shape: (450, 2)
Data cat shape: (450, 4)

One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder
def cat_encoder_fit(data):
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoder.fit(data)
    return encoder

def cat_encoder_transform(data, encoder):
    data_encoded = encoder.transform(data)
    data_encoded = pd.DataFrame(data_encoded, columns=encoder.get_feature_names_out(data.columns), index=data.index)
    return data_encoded
cat_encoder = cat_encoder_fit(X_train_cat) 
X_train_cat_encoded = cat_encoder_transform(X_train_cat, cat_encoder) 
print('Original shape:', X_train_cat.shape)
print('Encoded shape :', X_train_cat_encoded.shape)
Original shape: (450, 4)
Encoded shape : (450, 10)
X_train_cat.head()
ethnicity area_type region parttime
289 cauc urban midwest no
182 cauc urban midwest yes
13 cauc urban northeast no
114 cauc rural south no
475 cauc urban northeast no
X_train_cat_encoded.head()
ethnicity_afam ethnicity_cauc area_type_rural area_type_urban region_midwest region_northeast region_south region_west parttime_no parttime_yes
289 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
182 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
13 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
114 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
475 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
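
Because the encoder was fitted with handle_unknown='ignore', a category that never appeared in the training data is encoded as all zeros for that feature instead of raising an error. A small illustration (the value 'suburban' below is hypothetical and does not occur in the dataset):

# Hypothetical unseen category: 'suburban' does not occur in the training data
unseen = pd.DataFrame([['cauc', 'suburban', 'south', 'no']], columns=cat_cols)
# area_type_rural and area_type_urban are both 0.0 because 'suburban' was never seen during fit
print(cat_encoder_transform(unseen, cat_encoder))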

Combine Data

def concat_data(num_data, cat_data):
    data = pd.concat((num_data, cat_data), axis=1)
    return data
X_train_concat = concat_data(X_train_num, X_train_cat_encoded)
print('Numerical data shape  :', X_train_num.shape)
print('Categorical data shape:', X_train_cat_encoded.shape)
print('Concat data shape     :', X_train_concat.shape)
Numerical data shape  : (450, 2)
Categorical data shape: (450, 10)
Concat data shape     : (450, 12)

Preprocessing Function

def preprocess_data(data, num_cols, cat_cols, cat_encoder):
    X_num, X_cat = split_num_cat(data, num_cols, cat_cols)
    X_cat_encoded = cat_encoder_transform(X_cat, cat_encoder)
    X_concat = concat_data(X_num, X_cat_encoded)
    return X_concat
X_train_clean = preprocess_data(X_train, num_cols, cat_cols, cat_encoder)
X_train_clean.head()
Data num shape: (450, 2)
Data cat shape: (450, 4)
education experience ethnicity_afam ethnicity_cauc area_type_rural area_type_urban region_midwest region_northeast region_south region_west parttime_no parttime_yes
289 16 6 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
182 16 1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
13 6 36 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
114 12 0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
475 18 24 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
X_train_clean.shape
(450, 12)

Train Model

Dummy Regression

from sklearn.dummy import DummyRegressor
reg_1 = DummyRegressor()
reg_1.fit(X_train_clean, y_train)
DummyRegressor()
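With its default strategy='mean', DummyRegressor ignores the features entirely and predicts the mean of y_train for every row, which makes it a useful baseline. A quick check:

# Every prediction equals the training mean
print(reg_1.predict(X_train_clean.head(3)))
print(y_train.mean())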

Linear Regression

from sklearn.linear_model import LinearRegression
reg_2 = LinearRegression()
reg_2.fit(X_train_clean, y_train)
LinearRegression()

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Specify which columns get the polynomial expansion
numeric_features = ["experience"]  # Only `experience` will be expanded into polynomial terms
other_features = ["education"]  # `education` is kept as-is

# Transformer that applies PolynomialFeatures only to `experience`
poly_transformer = ColumnTransformer([
    ("poly_exp", PolynomialFeatures(degree=2, include_bias=False), ["experience"]),
    ("passthrough", "passthrough", other_features)  # `education` is passed through; remaining columns are dropped by default
])

# Regression pipeline with a polynomial transformation applied only to `experience`
reg_3 = Pipeline([
    ("poly_features", poly_transformer),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_3.fit(X_train_clean, y_train)
Pipeline(steps=[('poly_features',
                 ColumnTransformer(transformers=[('poly_exp',
                                                  PolynomialFeatures(include_bias=False),
                                                  ['experience']),
                                                 ('passthrough', 'passthrough',
                                                  ['education'])])),
                ('regressor', LinearRegression())])
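Note that ColumnTransformer drops any column that is not listed (remainder='drop' is the default), so reg_3 is effectively fitted on just experience, experience squared, and education; the one-hot dummy columns are discarded. This can be verified with get_feature_names_out (available in recent scikit-learn versions):

# Only three features reach the regressor: experience, experience^2, and education
print(reg_3.named_steps['poly_features'].get_feature_names_out())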
# Here all columns of X_train_clean (including the one-hot dummies) are expanded into degree-2 polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
poly_features = PolynomialFeatures(degree=2, include_bias=False)
reg_4 = make_pipeline(poly_features, LinearRegression())
reg_4.fit(X_train_clean, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression())])
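By contrast, reg_4 applies PolynomialFeatures to all twelve columns of X_train_clean, including the one-hot dummies, so the design matrix contains every square and pairwise interaction (90 columns for degree 2 without the bias term). A quick check of the expanded shape:

# Degree-2 expansion of 12 features: 12 linear + 78 squared/interaction terms = 90 columns
X_train_poly = reg_4.named_steps['polynomialfeatures'].transform(X_train_clean)
print(X_train_poly.shape)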

Piecewise Regression

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Piecewise transformation for `experience`: identity below 10, slope 0.5 above 10
def piecewise_transform(X):
    return np.where(X < 10, X, 10 + (X - 10) * 0.5)

# Wrap the function so it can be used inside a ColumnTransformer
piecewise_transformer = FunctionTransformer(piecewise_transform, validate=True)

# Specify the features
numeric_features = ["experience"]  # Only `experience` gets the piecewise transformation
other_numeric_features = ["education"]  # Used without any transformation

# Apply the piecewise transformation to `experience` and pass `education` through
preprocessor = ColumnTransformer([
    ("piecewise_exp", piecewise_transformer, ["experience"]),  # piecewise transform for `experience`
    ("passthrough", "passthrough", other_numeric_features)
])

# Build the regression pipeline
reg_5 = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_5.fit(X_train_clean, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('piecewise_exp',
                                                  FunctionTransformer(func=<function piecewise_transform at 0x000001B9B0743F60>,
                                                                      validate=True),
                                                  ['experience']),
                                                 ('passthrough', 'passthrough',
                                                  ['education'])])),
                ('regressor', LinearRegression())])
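The piecewise function above leaves experience unchanged up to 10 years and then grows with a slope of 0.5, i.e. a single knot at 10 years with a flatter segment afterwards (the knot position and slope are illustrative choices, not values estimated from the data). A quick sanity check:

# Below the knot the value passes through; above it each extra year counts half
print(piecewise_transform(np.array([[5.0], [10.0], [20.0]])))
# expected: [[ 5.], [10.], [15.]]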

Spline Regression

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import SplineTransformer

# Specify which columns get the spline expansion
numeric_features = ["experience"]  # Only `experience` is expanded with a spline basis
other_numeric_features = ["education"]  # Other numeric feature kept without transformation

# ColumnTransformer that applies the spline expansion only to `experience`
preprocessor = ColumnTransformer([
    ("spline_exp", SplineTransformer(degree=2, n_knots=4, include_bias=False), ["experience"]),
    ("pass_numeric", "passthrough", other_numeric_features)
])

# Regression pipeline: spline-transformed `experience` plus the other predictor
reg_6 = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_6.fit(X_train_clean, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('spline_exp',
                                                  SplineTransformer(degree=2,
                                                                    include_bias=False,
                                                                    n_knots=4),
                                                  ['experience']),
                                                 ('pass_numeric', 'passthrough',
                                                  ['education'])])),
                ('regressor', LinearRegression())])
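SplineTransformer expands experience into several overlapping B-spline basis columns (the number depends on n_knots and degree), and the linear regression then fits one coefficient per basis column, which is what lets the fitted curve bend smoothly. To inspect how many columns the preprocessor actually produces here:

# Spline basis columns for experience plus the passed-through education column
X_train_spline = reg_6.named_steps['preprocessor'].transform(X_train_clean)
print(X_train_spline.shape)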

Model Evaluation

from sklearn.metrics import mean_squared_error
def evaluate_model(model, X, y):
    y_pred = model.predict(X)
    rmse = mean_squared_error(y, y_pred, squared=False)
    return rmse
rmse_1 = evaluate_model(reg_1, X_train_clean, y_train)
rmse_2 = evaluate_model(reg_2, X_train_clean, y_train)
rmse_3 = evaluate_model(reg_3, X_train_clean, y_train)
rmse_4 = evaluate_model(reg_4, X_train_clean, y_train)
rmse_5 = evaluate_model(reg_5, X_train_clean, y_train)
rmse_6 = evaluate_model(reg_6, X_train_clean, y_train)

print('RMSE Dummy Regressor:', rmse_1)
print('RMSE Linear Regressor:', rmse_2)
print('RMSE Polynomial Regressor:', rmse_3)
print('RMSE Polynomial Regressor 2:', rmse_4)
print('RMSE Piecewise Regressor:', rmse_5)
print('RMSE Spline Regressor:', rmse_6)
RMSE Dummy Regressor: 397.1282408894203
RMSE Linear Regressor: 326.46844565698024
RMSE Polynomial Regressor: 337.1922218217974
RMSE Polynomial Regressor 2: 302.8829003193245
RMSE Piecewise Regressor: 352.2093852655176
RMSE Spline Regressor: 334.9618629115691

Model Evaluation on the Test Data

X_test_clean = preprocess_data(X_test, num_cols, cat_cols, cat_encoder)
X_test_clean.head()
Data num shape: (50, 2)
Data cat shape: (50, 4)
education experience ethnicity_afam ethnicity_cauc area_type_rural area_type_urban region_midwest region_northeast region_south region_west parttime_no parttime_yes
229 15 20 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
337 14 23 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
327 18 18 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
416 12 12 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
306 11 24 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0
rmse_1_test = evaluate_model(reg_1, X_test_clean, y_test)
rmse_2_test = evaluate_model(reg_2, X_test_clean, y_test)
rmse_3_test = evaluate_model(reg_3, X_test_clean, y_test)
rmse_4_test = evaluate_model(reg_4, X_test_clean, y_test)
rmse_5_test = evaluate_model(reg_5, X_test_clean, y_test)
rmse_6_test = evaluate_model(reg_6, X_test_clean, y_test)

print('RMSE Dummy Regressor:', rmse_1_test)
print('RMSE Linear Regressor:', rmse_2_test)
print('RMSE Polynomial Regressor:', rmse_3_test)
print('RMSE Polynomial Regressor 2:', rmse_4_test)
print('RMSE Piecewise Regressor:', rmse_5_test)
print('RMSE Spline Regressor:', rmse_6_test)
RMSE Dummy Regressor: 497.65831671172754
RMSE Linear Regressor: 437.80311521309835
RMSE Polynomial Regressor: 441.7499051748951
RMSE Polynomial Regressor 2: 410.76789730687176
RMSE Piecewise Regressor: 462.3779894756133
RMSE Spline Regressor: 442.6886935062947

Which model performs best?
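
One way to answer this is to put the train and test RMSE of all six models side by side and look for the lowest test RMSE together with a reasonable train-test gap; a minimal sketch:

# Collect the train and test RMSE values computed above into one table
results = pd.DataFrame({
    'model': ['Dummy', 'Linear', 'Polynomial (experience only)',
              'Polynomial (all features)', 'Piecewise', 'Spline'],
    'rmse_train': [rmse_1, rmse_2, rmse_3, rmse_4, rmse_5, rmse_6],
    'rmse_test': [rmse_1_test, rmse_2_test, rmse_3_test,
                  rmse_4_test, rmse_5_test, rmse_6_test],
})
print(results.sort_values('rmse_test'))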