This dataset contains information about individuals, including wage, education, work experience, and other demographic characteristics. Each variable is defined as follows:
| Column | Data Type | Description |
|---|---|---|
| wage | Numeric (float) | Individual's wage in dollars per week |
| education | Numeric (int) | Number of years of formal education completed |
| experience | Numeric (int) | Number of years of work experience |
| ethnicity | Categorical (string) | Individual's ethnic group |
| area_type | Categorical (string) | Type of area where the individual lives |
| region | Categorical (string) | Geographic region where the individual works |
| parttime | Categorical (string) | Employment status: part-time or full-time |
Modeling
Goal: predict an individual's wage from the available variables. The variable wage is the target variable (Y); all remaining variables are predictors (X).
import pandas as pd
import numpy as np

df1 = pd.read_csv('wagedata.csv')
df1.head()
Explore all of the variables in the data. Please do the following:
- Descriptive statistics for the numeric variables
- Frequency counts for the categorical variables
- Distribution plots for the numeric variables
- Frequency plots for the categorical variables
- Plots of the relationship between wage and the other predictors, for example a scatter plot of wage against experience and a boxplot of wage by education
- A scatter plot of wage against experience with wage on a logarithmic scale
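The numeric part of the exploration steps above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: the small DataFrame `df` is invented stand-in data (the real values come from `wagedata.csv`), and the names `num_cols`, `cat_cols`, `desc`, `freqs`, and `log_wage` are hypothetical. The plots can then be drawn from the same columns with a plotting library of choice.

```python
import numpy as np
import pandas as pd

# Invented stand-in for wagedata.csv; the values are for illustration only.
df = pd.DataFrame({
    "wage": [350.0, 520.5, 410.0, 980.0, 275.0, 640.0],
    "education": [12, 16, 12, 18, 10, 14],
    "experience": [5, 10, 20, 15, 2, 30],
    "region": ["south", "west", "south", "northeast", "midwest", "west"],
    "parttime": ["no", "no", "yes", "no", "yes", "no"],
})

# Separate numeric and categorical columns by dtype.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Descriptive statistics for the numeric variables.
desc = df[num_cols].describe()
print(desc)

# Frequency counts for each categorical variable.
freqs = {col: df[col].value_counts() for col in cat_cols}
for col, counts in freqs.items():
    print(f"\n{col}\n{counts}")

# Log-transformed wage, as used for the log-scale scatter plot against experience.
df["log_wage"] = np.log(df["wage"])
print(df[["experience", "wage", "log_wage"]])
```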
Split Input Output
def split_input_output(data, target_col):
    X = data.drop(columns=target_col)
    y = data[target_col]
    print('X shape:', X.shape)
    print('y shape:', y.shape)
    return X, y
X, y = split_input_output(data=df1, target_col='wage')
Original shape: (450, 4)
Encoded shape : (450, 10)
X_train_cat.head()
| | ethnicity | area_type | region | parttime |
|---|---|---|---|---|
| 289 | cauc | urban | midwest | no |
| 182 | cauc | urban | midwest | yes |
| 13 | cauc | urban | northeast | no |
| 114 | cauc | rural | south | no |
| 475 | cauc | urban | northeast | no |
X_train_cat_encoded.head()
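| | ethnicity_afam | ethnicity_cauc | area_type_rural | area_type_urban | region_midwest | region_northeast | region_south | region_west | parttime_no | parttime_yes |
|---|---|---|---|---|---|---|---|---|---|---|
| 289 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 182 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 13 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 114 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 475 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |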
Concatenate Data
def concat_data(num_data, cat_data):
    data = pd.concat((num_data, cat_data), axis=1)
    return data
X_train_concat = concat_data(X_train_num, X_train_cat_encoded)
print('Numerical data shape :', X_train_num.shape)
print('Categorical data shape:', X_train_cat_encoded.shape)
print('Concat data shape :', X_train_concat.shape)
Numerical data shape : (450, 2)
Categorical data shape: (450, 10)
Concat data shape : (450, 12)
from sklearn.dummy import DummyRegressor

reg_1 = DummyRegressor()
reg_1.fit(X_train_clean, y_train)
DummyRegressor()
Linear Regression
from sklearn.linear_model import LinearRegression

reg_2 = LinearRegression()
reg_2.fit(X_train_clean, y_train)
LinearRegression()
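To see why a `DummyRegressor` is fitted before the linear model, the sketch below compares the two on synthetic data. This is an illustration under stated assumptions: `X` and `y` are randomly generated stand-ins rather than the wage data, and the variable names are hypothetical. With its default settings `DummyRegressor` always predicts the mean of the training target, so its error serves as the baseline that any real model should beat.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: two numeric predictors and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(200, 2))
y = 100 + 30 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 20, size=200)

# Baseline: ignores X and predicts the training mean of y for every row.
baseline = DummyRegressor().fit(X, y)

# Ordinary least squares on the same data.
linear = LinearRegression().fit(X, y)

mse_baseline = mean_squared_error(y, baseline.predict(X))
mse_linear = mean_squared_error(y, linear.predict(X))

print("Baseline MSE:", round(mse_baseline, 2))
print("Linear MSE  :", round(mse_linear, 2))
```

These are training-set errors; for model selection the same comparison would normally be made on held-out data or with cross-validation.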
Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Define the numeric columns
numeric_features = ["experience"]  # Only `experience` gets polynomial terms
other_features = ["education"]     # Other numeric features are left unchanged

# Build a transformer that applies the polynomial expansion only to `experience`.
# Columns not listed here are dropped, because ColumnTransformer defaults to remainder="drop".
poly_transformer = ColumnTransformer([
    ("poly_exp", PolynomialFeatures(degree=2, include_bias=False), numeric_features),
    ("passthrough", "passthrough", other_features)  # Passed through unchanged
])

# Build a regression pipeline with the polynomial transformation on `experience` only
reg_3 = Pipeline([
    ("poly_features", poly_transformer),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_3.fit(X_train_clean, y_train)
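To make the effect of this transformer concrete, the sketch below applies the same `ColumnTransformer` configuration to a tiny invented DataFrame. The data values and the names `X_demo` and `transformed` are hypothetical. It shows that `experience` is expanded into `experience` and `experience**2`, that `education` is passed through, and that any column not listed (here `region`) is dropped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

# Tiny invented example; `region` stands in for any column not named in the transformer.
X_demo = pd.DataFrame({
    "education": [12, 16, 10],
    "experience": [1, 2, 10],
    "region": ["south", "west", "midwest"],
})

poly_transformer = ColumnTransformer([
    ("poly_exp", PolynomialFeatures(degree=2, include_bias=False), ["experience"]),
    ("passthrough", "passthrough", ["education"]),
])

# Output columns, in order: experience, experience**2, education.
transformed = poly_transformer.fit_transform(X_demo)
print(transformed)
```

Because unlisted columns are dropped, a pipeline built this way fits the regression on `experience`, its square, and `education` only.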
ColumnTransformer(transformers=[('piecewise_exp',
FunctionTransformer(func=<function piecewise_transform at 0x000001B9B0743F60>,
validate=True),
['experience']),
('passthrough', 'passthrough', ['education'])])
LinearRegression()
Spline Regression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import SplineTransformer

# Define the numeric and categorical features
numeric_features = ["experience"]              # Only `experience` gets the spline basis
other_numeric_features = ["education"]         # Other numeric features are left untransformed
categorical_features = ["region", "parttime"]  # Categorical features (defined but not used by the preprocessor below)

# Build a ColumnTransformer that applies the spline only to `experience`
preprocessor = ColumnTransformer([
    ("spline_exp", SplineTransformer(degree=2, n_knots=4, include_bias=False), numeric_features),
    ("pass_numeric", "passthrough", other_numeric_features)
])

# Build a regression pipeline with the spline transformation plus the other numeric predictor
reg_6 = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Fit the model on the training data
reg_6.fit(X_train_clean, y_train)