Project-Based Learning Guide (Structured Assignment)

Author

Deri Siswara

Introduction

The dataset used in this project is the Statlog German Credit Data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data). It contains information about loans and credit risk in Germany, and the goal is to predict whether a borrower is a good or a bad credit risk. It has become a standard reference dataset for classification tasks in finance and credit risk management.

How to Import the Data

Installing the ucimlrepo library

# pip install ucimlrepo

Import data

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
statlog_german_credit_data = fetch_ucirepo(id=144) 
  
# data (as pandas dataframes) 
X = statlog_german_credit_data.data.features 
y = statlog_german_credit_data.data.targets

{'uci_id': 144, 'name': 'Statlog (German Credit Data)', 'repository_url': 'https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data', 'data_url': 'https://archive.ics.uci.edu/static/public/144/data.csv', 'abstract': 'This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1000, 'num_features': 20, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Other', 'Marital Status', 'Age', 'Occupation'], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5NC77', 'creators': ['Hans Hofmann'], 'intro_paper': None, 'additional_info': {'summary': 'Two datasets are provided.  the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".   \r\n \r\nFor algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric".  This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables.   Several attributes that are ordered categorical (such as attribute 17) have been coded as integer.    This was the form used by StatLog.\r\n\r\nThis dataset requires use of a cost matrix (see below)\r\n\r\n ..... 
1        2\r\n----------------------------\r\n  1   0        1\r\n-----------------------\r\n  2   5        0\r\n\r\n(1 = Good,  2 = Bad)\r\n\r\nThe rows represent the actual classification and the columns the predicted classification.\r\n\r\nIt is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).\r\n', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Attribute 1:  (qualitative)      \r\n Status of existing checking account\r\n             A11 :      ... <    0 DM\r\n\t       A12 : 0 <= ... <  200 DM\r\n\t       A13 :      ... >= 200 DM / salary assignments for at least 1 year\r\n               A14 : no checking account\r\n\r\nAttribute 2:  (numerical)\r\n\t      Duration in month\r\n\r\nAttribute 3:  (qualitative)\r\n\t      Credit history\r\n\t      A30 : no credits taken/ all credits paid back duly\r\n              A31 : all credits at this bank paid back duly\r\n\t      A32 : existing credits paid back duly till now\r\n              A33 : delay in paying off in the past\r\n\t      A34 : critical account/  other credits existing (not at this bank)\r\n\r\nAttribute 4:  (qualitative)\r\n\t      Purpose\r\n\t      A40 : car (new)\r\n\t      A41 : car (used)\r\n\t      A42 : furniture/equipment\r\n\t      A43 : radio/television\r\n\t      A44 : domestic appliances\r\n\t      A45 : repairs\r\n\t      A46 : education\r\n\t      A47 : (vacation - does not exist?)\r\n\t      A48 : retraining\r\n\t      A49 : business\r\n\t      A410 : others\r\n\r\nAttribute 5:  (numerical)\r\n\t      Credit amount\r\n\r\nAttibute 6:  (qualitative)\r\n\t      Savings account/bonds\r\n\t      A61 :          ... <  100 DM\r\n\t      A62 :   100 <= ... <  500 DM\r\n\t      A63 :   500 <= ... < 1000 DM\r\n\t      A64 :          .. 
>= 1000 DM\r\n              A65 :   unknown/ no savings account\r\n\r\nAttribute 7:  (qualitative)\r\n\t      Present employment since\r\n\t      A71 : unemployed\r\n\t      A72 :       ... < 1 year\r\n\t      A73 : 1  <= ... < 4 years  \r\n\t      A74 : 4  <= ... < 7 years\r\n\t      A75 :       .. >= 7 years\r\n\r\nAttribute 8:  (numerical)\r\n\t      Installment rate in percentage of disposable income\r\n\r\nAttribute 9:  (qualitative)\r\n\t      Personal status and sex\r\n\t      A91 : male   : divorced/separated\r\n\t      A92 : female : divorced/separated/married\r\n              A93 : male   : single\r\n\t      A94 : male   : married/widowed\r\n\t      A95 : female : single\r\n\r\nAttribute 10: (qualitative)\r\n\t      Other debtors / guarantors\r\n\t      A101 : none\r\n\t      A102 : co-applicant\r\n\t      A103 : guarantor\r\n\r\nAttribute 11: (numerical)\r\n\t      Present residence since\r\n\r\nAttribute 12: (qualitative)\r\n\t      Property\r\n\t      A121 : real estate\r\n\t      A122 : if not A121 : building society savings agreement/ life insurance\r\n              A123 : if not A121/A122 : car or other, not in attribute 6\r\n\t      A124 : unknown / no property\r\n\r\nAttribute 13: (numerical)\r\n\t      Age in years\r\n\r\nAttribute 14: (qualitative)\r\n\t      Other installment plans \r\n\t      A141 : bank\r\n\t      A142 : stores\r\n\t      A143 : none\r\n\r\nAttribute 15: (qualitative)\r\n\t      Housing\r\n\t      A151 : rent\r\n\t      A152 : own\r\n\t      A153 : for free\r\n\r\nAttribute 16: (numerical)\r\n              Number of existing credits at this bank\r\n\r\nAttribute 17: (qualitative)\r\n\t      Job\r\n\t      A171 : unemployed/ unskilled  - non-resident\r\n\t      A172 : unskilled - resident\r\n\t      A173 : skilled employee / official\r\n\t      A174 : management/ self-employed/\r\n\t\t     highly qualified employee/ officer\r\n\r\nAttribute 18: (numerical)\r\n\t      Number of people being liable to provide maintenance 
for\r\n\r\nAttribute 19: (qualitative)\r\n\t      Telephone\r\n\t      A191 : none\r\n\t      A192 : yes, registered under the customers name\r\n\r\nAttribute 20: (qualitative)\r\n\t      foreign worker\r\n\t      A201 : yes\r\n\t      A202 : no\r\n', 'citation': None}}
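The cost matrix in the metadata above (rows are the actual class, columns the predicted class) can be used to score a model by total misclassification cost rather than plain accuracy. The following is only a minimal sketch with NumPy: the function name `total_cost` and the example labels are hypothetical and are not part of the dataset or of ucimlrepo.

```python
import numpy as np

# Cost matrix from the dataset metadata: rows = actual class, columns = predicted class.
# Index 0 corresponds to class 1 (good), index 1 to class 2 (bad).
COST = np.array([[0, 1],
                 [5, 0]])

def total_cost(y_true, y_pred):
    """Sum the misclassification cost over all predictions (labels are 1 or 2)."""
    y_true = np.asarray(y_true) - 1  # shift labels 1/2 to indices 0/1
    y_pred = np.asarray(y_pred) - 1
    return int(COST[y_true, y_pred].sum())

# Hypothetical labels, for illustration only
actual = [1, 1, 2, 2, 1]
predicted = [1, 2, 1, 2, 1]
print(total_cost(actual, predicted))  # one good->bad error (1) + one bad->good error (5) = 6
```

Because predicting "good" for a bad customer costs five times as much as the opposite error, two models with the same accuracy can have very different total costs.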

# combine features and target into one DataFrame, then export it to CSV
import pandas as pd
df = X.copy()  # copy so that adding a column does not modify X itself
df['target'] = y
df.to_csv('statlog_german_credit_data.csv', index=False)
df

	Attribute1	Attribute2	Attribute3	Attribute4	Attribute5	Attribute6	Attribute7	Attribute8	Attribute9	Attribute10	...	Attribute12	Attribute13	Attribute14	Attribute15	Attribute16	Attribute17	Attribute18	Attribute19	Attribute20	target
0	A11	6	A34	A43	1169	A65	A75	4	A93	A101	...	A121	67	A143	A152	2	A173	1	A192	A201	1
1	A12	48	A32	A43	5951	A61	A73	2	A92	A101	...	A121	22	A143	A152	1	A173	1	A191	A201	2
2	A14	12	A34	A46	2096	A61	A74	2	A93	A101	...	A121	49	A143	A152	1	A172	2	A191	A201	1
3	A11	42	A32	A42	7882	A61	A74	2	A93	A103	...	A122	45	A143	A153	1	A173	2	A191	A201	1
4	A11	24	A33	A40	4870	A61	A73	3	A93	A101	...	A124	53	A143	A153	2	A173	2	A191	A201	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	A14	12	A32	A42	1736	A61	A74	3	A92	A101	...	A121	31	A143	A152	1	A172	1	A191	A201	1
996	A11	30	A32	A41	3857	A61	A73	4	A91	A101	...	A122	40	A143	A152	1	A174	1	A192	A201	1
997	A14	12	A32	A43	804	A61	A75	4	A93	A101	...	A123	38	A143	A152	1	A173	1	A191	A201	1
998	A11	45	A32	A43	1845	A61	A73	4	A93	A101	...	A124	23	A143	A153	1	A173	1	A192	A201	2
999	A12	45	A34	A41	4576	A62	A71	3	A93	A101	...	A123	27	A143	A152	1	A173	1	A191	A201	1

1000 rows × 21 columns
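A few quick checks can confirm that the combined table has the expected shape and no missing values (the metadata reports `'has_missing_values': 'no'`). This is a sketch under assumptions: the helper name `check_frame` is hypothetical, and the small frame below is made-up data so that the example runs without downloading anything.

```python
import pandas as pd

def check_frame(frame, expected_rows, expected_cols):
    """Return a dict of basic sanity checks for a DataFrame."""
    return {
        "shape_ok": frame.shape == (expected_rows, expected_cols),
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Made-up miniature frame using the same column naming as the real data
demo = pd.DataFrame({
    "Attribute1": ["A11", "A12", "A14"],
    "Attribute2": [6, 48, 12],
    "target": [1, 2, 1],
})
report = check_frame(demo, expected_rows=3, expected_cols=3)
print(report)
```

In the notebook, the equivalent call on the real data would be `check_frame(df, 1000, 21)`.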

Dataset Description

Input Data:

The Statlog German Credit dataset contains 20 attributes that describe the characteristics of borrowers and their credit history.

Attributes in the Dataset:

Status of existing checking account (categorical: A11, A12, A13, A14)
- A11: < 0 DM
- A12: 0 <= ... < 200 DM
- A13: >= 200 DM / salary assignments for at least 1 year
- A14: no checking account
Credit duration in months (numeric)
Credit history (categorical: A30, A31, A32, A33, A34)
- A30: no credits taken / all credits paid back duly
- A31: all credits at this bank paid back duly
- A32: existing credits paid back duly till now
- A33: delay in paying off in the past
- A34: critical account / other credits existing (not at this bank)
Purpose (categorical: A40, A41, A42, A43, A44, A45, A46, A47, A48, A49, A410)
- A40: car (new)
- A41: car (used)
- A42: furniture/equipment
- A43: radio/television
- A44: domestic appliances
- A45: repairs
- A46: education
- A47: vacation (marked "does not exist?" in the dataset metadata)
- A48: retraining
- A49: business
- A410: others
Credit amount (numeric)
Savings account/bonds (categorical: A61, A62, A63, A64, A65)
- A61: < 100 DM
- A62: 100 <= ... < 500 DM
- A63: 500 <= ... < 1000 DM
- A64: >= 1000 DM
- A65: unknown / no savings account
Present employment since (categorical: A71, A72, A73, A74, A75)
- A71: unemployed
- A72: < 1 year
- A73: 1 <= ... < 4 years
- A74: 4 <= ... < 7 years
- A75: >= 7 years
Installment rate as a percentage of disposable income (numeric)
Personal status and sex (categorical: A91, A92, A93, A94, A95)
- A91: male: divorced/separated
- A92: female: divorced/separated/married
- A93: male: single
- A94: male: married/widowed
- A95: female: single
Other debtors / guarantors (categorical: A101, A102, A103)
- A101: none
- A102: co-applicant
- A103: guarantor
Length of residence at the present address (numeric: years)
Property (categorical: A121, A122, A123, A124)
- A121: real estate
- A122: if not A121: building society savings agreement / life insurance
- A123: if not A121/A122: car or other property, not covered by the savings attribute
- A124: unknown / no property
Age (numeric: years)
Other installment plans (categorical: A141, A142, A143)
- A141: bank
- A142: stores
- A143: none
Housing (categorical: A151, A152, A153)
- A151: rent
- A152: own
- A153: for free
Number of existing credits at this bank (numeric)
Job (categorical: A171, A172, A173, A174)
- A171: unemployed / unskilled, non-resident
- A172: unskilled, resident
- A173: skilled employee / official
- A174: management / self-employed / highly qualified employee / officer
Number of dependents, that is, people the borrower is liable to provide maintenance for (numeric)
Telephone (categorical: A191, A192)
- A191: none
- A192: yes, registered under the customer's name
Foreign worker (categorical: A201, A202)
- A201: yes
- A202: no
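The coded categories listed above can be mapped to readable labels for reporting, and one-hot encoded for models that need numeric inputs. The sketch below shows both steps for the first attribute only. It assumes the column names `Attribute1` and `Attribute2` seen in the exported table; the `CHECKING_STATUS` dictionary and the `demo` rows are illustrative and not part of the dataset.

```python
import pandas as pd

# Labels for Attribute1, copied from the dataset metadata
CHECKING_STATUS = {
    "A11": "< 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": ">= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}

# Made-up rows in the same format as the real data
demo = pd.DataFrame({
    "Attribute1": ["A11", "A14", "A12"],
    "Attribute2": [6, 12, 48],
})

# Readable labels, useful for tables and plots
demo["checking_status"] = demo["Attribute1"].map(CHECKING_STATUS)

# One-hot encoding: one indicator column per category code
encoded = pd.get_dummies(demo[["Attribute1", "Attribute2"]], columns=["Attribute1"])
print(encoded.columns.tolist())
```

Whether one-hot encoding is the right choice depends on the model; ordered attributes such as the savings or employment brackets could also be encoded as ordinal integers.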

Target Variable:

Credit risk classification: 1 = good risk, 2 = bad risk
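Many classifiers and metrics are easier to use with a 0/1 target than with the 1/2 coding. One possible recoding is sketched below; treating the "bad" class as the positive class (1) is an assumption made here for illustration, not something the dataset prescribes, and the values are made up.

```python
import pandas as pd

# Made-up target values using the dataset's coding (1 = good, 2 = bad)
target = pd.Series([1, 2, 1, 1, 2], name="target")

# Assumed convention: 1 marks a bad credit risk, 0 a good one
is_bad = (target == 2).astype(int)
print(is_bad.tolist())  # [0, 1, 0, 0, 1]
```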

Additional Information:

The dataset has 1000 instances (rows)
There are 20 attributes (features) and 1 target variable
Categorical attributes are stored as coded values (for example, A11)
The classes are imbalanced: 700 instances of class “good” and 300 instances of class “bad”
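One common way to account for the 700/300 imbalance is to weight each class inversely to its frequency. The sketch below computes such weights with the formula n_samples / (n_classes * class_count), which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name `balanced_class_weights` is hypothetical, and other remedies (resampling, threshold tuning, cost-sensitive evaluation) may suit the project better.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Labels reproducing the 700 / 300 split described above (1 = good, 2 = bad)
labels = np.array([1] * 700 + [2] * 300)
weights = balanced_class_weights(labels)
print({k: round(v, 3) for k, v in weights.items()})  # {1: 0.714, 2: 1.667}
```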

Assignment

Create a Jupyter notebook for an end-to-end machine learning project.
Create presentation slides to present your project. Example Project Presentation Slides