Project-Based Learning Guide (Structured Assignment)

Author

Deri Siswara

Introduction

The dataset used in this project is the Statlog German Credit Data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data). It contains information about loans and credit risk in Germany, and the goal is to predict whether a borrower is a good or a bad credit risk. The dataset has become a standard reference for classification tasks in finance and credit risk management.

How to Import the Data

  1. Install the ucimlrepo library
# pip install ucimlrepo
  2. Import the data
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
statlog_german_credit_data = fetch_ucirepo(id=144) 
  
# data (as pandas dataframes) 
X = statlog_german_credit_data.data.features 
y = statlog_german_credit_data.data.targets 
{'uci_id': 144, 'name': 'Statlog (German Credit Data)', 'repository_url': 'https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data', 'data_url': 'https://archive.ics.uci.edu/static/public/144/data.csv', 'abstract': 'This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1000, 'num_features': 20, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Other', 'Marital Status', 'Age', 'Occupation'], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5NC77', 'creators': ['Hans Hofmann'], 'intro_paper': None, 'additional_info': {'summary': 'Two datasets are provided.  the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".   \r\n \r\nFor algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric".  This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables.   Several attributes that are ordered categorical (such as attribute 17) have been coded as integer.    This was the form used by StatLog.\r\n\r\nThis dataset requires use of a cost matrix (see below)\r\n\r\n ..... 
1        2\r\n----------------------------\r\n  1   0        1\r\n-----------------------\r\n  2   5        0\r\n\r\n(1 = Good,  2 = Bad)\r\n\r\nThe rows represent the actual classification and the columns the predicted classification.\r\n\r\nIt is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).\r\n', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Attribute 1:  (qualitative)      \r\n Status of existing checking account\r\n             A11 :      ... <    0 DM\r\n\t       A12 : 0 <= ... <  200 DM\r\n\t       A13 :      ... >= 200 DM / salary assignments for at least 1 year\r\n               A14 : no checking account\r\n\r\nAttribute 2:  (numerical)\r\n\t      Duration in month\r\n\r\nAttribute 3:  (qualitative)\r\n\t      Credit history\r\n\t      A30 : no credits taken/ all credits paid back duly\r\n              A31 : all credits at this bank paid back duly\r\n\t      A32 : existing credits paid back duly till now\r\n              A33 : delay in paying off in the past\r\n\t      A34 : critical account/  other credits existing (not at this bank)\r\n\r\nAttribute 4:  (qualitative)\r\n\t      Purpose\r\n\t      A40 : car (new)\r\n\t      A41 : car (used)\r\n\t      A42 : furniture/equipment\r\n\t      A43 : radio/television\r\n\t      A44 : domestic appliances\r\n\t      A45 : repairs\r\n\t      A46 : education\r\n\t      A47 : (vacation - does not exist?)\r\n\t      A48 : retraining\r\n\t      A49 : business\r\n\t      A410 : others\r\n\r\nAttribute 5:  (numerical)\r\n\t      Credit amount\r\n\r\nAttibute 6:  (qualitative)\r\n\t      Savings account/bonds\r\n\t      A61 :          ... <  100 DM\r\n\t      A62 :   100 <= ... <  500 DM\r\n\t      A63 :   500 <= ... < 1000 DM\r\n\t      A64 :          .. 
>= 1000 DM\r\n              A65 :   unknown/ no savings account\r\n\r\nAttribute 7:  (qualitative)\r\n\t      Present employment since\r\n\t      A71 : unemployed\r\n\t      A72 :       ... < 1 year\r\n\t      A73 : 1  <= ... < 4 years  \r\n\t      A74 : 4  <= ... < 7 years\r\n\t      A75 :       .. >= 7 years\r\n\r\nAttribute 8:  (numerical)\r\n\t      Installment rate in percentage of disposable income\r\n\r\nAttribute 9:  (qualitative)\r\n\t      Personal status and sex\r\n\t      A91 : male   : divorced/separated\r\n\t      A92 : female : divorced/separated/married\r\n              A93 : male   : single\r\n\t      A94 : male   : married/widowed\r\n\t      A95 : female : single\r\n\r\nAttribute 10: (qualitative)\r\n\t      Other debtors / guarantors\r\n\t      A101 : none\r\n\t      A102 : co-applicant\r\n\t      A103 : guarantor\r\n\r\nAttribute 11: (numerical)\r\n\t      Present residence since\r\n\r\nAttribute 12: (qualitative)\r\n\t      Property\r\n\t      A121 : real estate\r\n\t      A122 : if not A121 : building society savings agreement/ life insurance\r\n              A123 : if not A121/A122 : car or other, not in attribute 6\r\n\t      A124 : unknown / no property\r\n\r\nAttribute 13: (numerical)\r\n\t      Age in years\r\n\r\nAttribute 14: (qualitative)\r\n\t      Other installment plans \r\n\t      A141 : bank\r\n\t      A142 : stores\r\n\t      A143 : none\r\n\r\nAttribute 15: (qualitative)\r\n\t      Housing\r\n\t      A151 : rent\r\n\t      A152 : own\r\n\t      A153 : for free\r\n\r\nAttribute 16: (numerical)\r\n              Number of existing credits at this bank\r\n\r\nAttribute 17: (qualitative)\r\n\t      Job\r\n\t      A171 : unemployed/ unskilled  - non-resident\r\n\t      A172 : unskilled - resident\r\n\t      A173 : skilled employee / official\r\n\t      A174 : management/ self-employed/\r\n\t\t     highly qualified employee/ officer\r\n\r\nAttribute 18: (numerical)\r\n\t      Number of people being liable to provide maintenance 
for\r\n\r\nAttribute 19: (qualitative)\r\n\t      Telephone\r\n\t      A191 : none\r\n\t      A192 : yes, registered under the customers name\r\n\r\nAttribute 20: (qualitative)\r\n\t      foreign worker\r\n\t      A201 : yes\r\n\t      A202 : no\r\n', 'citation': None}}
# export to pandas dataframe
import pandas as pd
df = X.copy()  # work on a copy so that adding the target column leaves X unchanged
df['target'] = y
df.to_csv('statlog_german_credit_data.csv', index=False)
df
Attribute1 Attribute2 Attribute3 Attribute4 Attribute5 Attribute6 Attribute7 Attribute8 Attribute9 Attribute10 ... Attribute12 Attribute13 Attribute14 Attribute15 Attribute16 Attribute17 Attribute18 Attribute19 Attribute20 target
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 A14 12 A32 A42 1736 A61 A74 3 A92 A101 ... A121 31 A143 A152 1 A172 1 A191 A201 1
996 A11 30 A32 A41 3857 A61 A73 4 A91 A101 ... A122 40 A143 A152 1 A174 1 A192 A201 1
997 A14 12 A32 A43 804 A61 A75 4 A93 A101 ... A123 38 A143 A152 1 A173 1 A191 A201 1
998 A11 45 A32 A43 1845 A61 A73 4 A93 A101 ... A124 23 A143 A153 1 A173 1 A192 A201 2
999 A12 45 A34 A41 4576 A62 A71 3 A93 A101 ... A123 27 A143 A152 1 A173 1 A191 A201 1

1000 rows × 21 columns
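
Before modelling, it can help to confirm the shape of the frame and to separate numeric from coded (categorical) columns. The sketch below is only an illustration: it uses a three-row stand-in built from a subset of the columns shown in the table above, and `split_column_types` is a hypothetical helper name, not part of pandas or ucimlrepo.

```python
import pandas as pd


def split_column_types(frame, target="target"):
    """Return (numeric_columns, categorical_columns), excluding the target column."""
    features = frame.drop(columns=[target])
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in features.columns if c not in numeric]
    return numeric, categorical


# Stand-in for df: rows 0-2 of the table above, restricted to a few columns.
sample = pd.DataFrame({
    "Attribute1": ["A11", "A12", "A14"],
    "Attribute2": [6, 48, 12],
    "Attribute3": ["A34", "A32", "A34"],
    "Attribute5": [1169, 5951, 2096],
    "target": [1, 2, 1],
})

numeric_cols, categorical_cols = split_column_types(sample)
print(sample.shape)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the full frame the same call would be `split_column_types(df)`, assuming the target column is named `target` as in the export step.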

Dataset Description

Input Data:

The Statlog German Credit dataset contains 20 attributes that describe the characteristics of borrowers and their credit history.

Dataset Attributes:

  1. Status of existing checking account (categorical: A11, A12, A13, A14)

    • A11: < 0 DM
    • A12: 0 to < 200 DM
    • A13: >= 200 DM, or salary assignments for at least 1 year
    • A14: no checking account
  2. Credit duration in months (numeric)

  3. Credit history (categorical: A30, A31, A32, A33, A34)

    • A30: no credits taken / all credits paid back duly
    • A31: all credits at this bank paid back duly
    • A32: existing credits paid back duly until now
    • A33: delay in paying off in the past
    • A34: critical account / other credits existing (not at this bank)
  4. Purpose (categorical: A40, A41, A42, A43, A44, A45, A46, A47, A48, A49, A410)

    • A40: car (new)
    • A41: car (used)
    • A42: furniture/equipment
    • A43: radio/television
    • A44: domestic appliances
    • A45: repairs
    • A46: education
    • A47: vacation
    • A48: retraining
    • A49: business
    • A410: others
  5. Credit amount (numeric)

  6. Savings account/bonds (categorical: A61, A62, A63, A64, A65)

    • A61: < 100 DM
    • A62: 100 to < 500 DM
    • A63: 500 to < 1000 DM
    • A64: >= 1000 DM
    • A65: unknown / no savings account
  7. Present employment since (categorical: A71, A72, A73, A74, A75)

    • A71: unemployed
    • A72: < 1 year
    • A73: 1 to < 4 years
    • A74: 4 to < 7 years
    • A75: >= 7 years
  8. Installment rate as a percentage of disposable income (numeric)

  9. Personal status and sex (categorical: A91, A92, A93, A94, A95)

    • A91: male: divorced/separated
    • A92: female: divorced/separated/married
    • A93: male: single
    • A94: male: married/widowed
    • A95: female: single
  10. Other debtors / guarantors (categorical: A101, A102, A103)

    • A101: none
    • A102: co-applicant
    • A103: guarantor
  11. Length of residence at current address (numeric: years)

  12. Property (categorical: A121, A122, A123, A124)

    • A121: real estate
    • A122: building society savings agreement / life insurance (if not A121)
    • A123: car or other property not covered by attribute 6 (if not A121/A122)
    • A124: unknown / no property
  13. Age (numeric: years)

  14. Other installment plans (categorical: A141, A142, A143)

    • A141: bank
    • A142: stores
    • A143: none
  15. Housing (categorical: A151, A152, A153)

    • A151: rent
    • A152: own
    • A153: for free
  16. Number of existing credits at this bank (numeric)

  17. Job (categorical: A171, A172, A173, A174)

    • A171: unemployed / unskilled, non-resident
    • A172: unskilled, resident
    • A173: skilled employee / official
    • A174: management / self-employed / highly qualified employee / officer
  18. Number of dependents (numeric)

  19. Telephone (categorical: A191, A192)

    • A191: none
    • A192: yes, registered under the customer's name
  20. Foreign worker (categorical: A201, A202)

    • A201: yes
    • A202: no
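
For exploratory analysis, the coded values can be replaced with readable labels using the attribute list above. The following is a minimal sketch, not a complete mapping: `CODE_LABELS` and `decode_codes` are hypothetical names introduced here, only two attributes are filled in, and the stand-in frame uses values from rows 0, 1 and 3 of the table shown earlier.

```python
import pandas as pd

# Partial lookup tables taken from the attribute list; extend with the other attributes as needed.
CODE_LABELS = {
    "Attribute1": {
        "A11": "< 0 DM",
        "A12": "0 to < 200 DM",
        "A13": ">= 200 DM",
        "A14": "no checking account",
    },
    "Attribute15": {"A151": "rent", "A152": "own", "A153": "for free"},
}


def decode_codes(frame, code_labels=CODE_LABELS):
    """Return a copy of frame with known codes replaced by labels.

    Codes without an entry in code_labels are left unchanged.
    """
    decoded = frame.copy()
    for column, labels in code_labels.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(labels).fillna(decoded[column])
    return decoded


sample = pd.DataFrame({
    "Attribute1": ["A11", "A12", "A11"],
    "Attribute2": [6, 48, 42],
    "Attribute15": ["A152", "A152", "A153"],
})
decoded = decode_codes(sample)
print(decoded)
```

Decoding is meant for inspection and plots; for model training the original codes can be passed directly to an encoder.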

Target Variable:

  1. Credit risk classification: 1 = good risk, 2 = bad risk
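
Many classifiers and metrics expect a 0/1 target, so the 1/2 coding is often recoded first. The sketch below treats "bad" as the positive class (1), which is a modelling choice assumed here rather than something the dataset prescribes; `encode_target` is a hypothetical helper name.

```python
import pandas as pd


def encode_target(target):
    """Map 1 (good) to 0 and 2 (bad) to 1; raise if any other code appears."""
    mapping = {1: 0, 2: 1}
    encoded = target.map(mapping)
    if encoded.isna().any():
        raise ValueError("target contains values other than 1 and 2")
    return encoded.astype(int)


# First five target values shown in the table above.
raw_target = pd.Series([1, 2, 1, 1, 2], name="target")
is_bad = encode_target(raw_target)
print(is_bad.tolist())  # [0, 1, 0, 0, 1]
```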

Additional Information:

  • The dataset has 1000 instances (rows)
  • There are 20 attributes (features) and 1 target variable
  • Categorical values are provided in coded form
  • The classes are imbalanced: 700 instances of class “good” and 300 instances of class “bad”
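
Because the classes are imbalanced, accuracy alone can be misleading. The dataset metadata also supplies a cost matrix in which classifying a bad customer as good costs 5 and classifying a good customer as bad costs 1. The worked example below compares two trivial baselines on the 700/300 class split; the function names are illustrative only.

```python
# Cost matrix from the dataset metadata: keys are (actual, predicted), 1 = good, 2 = bad.
COST = {
    (1, 1): 0,  # good customer predicted as good
    (1, 2): 1,  # good customer predicted as bad
    (2, 1): 5,  # bad customer predicted as good
    (2, 2): 0,  # bad customer predicted as bad
}


def total_cost(actual, predicted, cost=COST):
    """Sum the misclassification cost over all instances."""
    return sum(cost[(a, p)] for a, p in zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


actual = [1] * 700 + [2] * 300   # class distribution stated above
always_good = [1] * 1000         # baseline: approve everyone
always_bad = [2] * 1000          # baseline: reject everyone

print(accuracy(actual, always_good), total_cost(actual, always_good))  # 0.7 1500
print(accuracy(actual, always_bad), total_cost(actual, always_bad))    # 0.3 700
```

The "approve everyone" baseline reaches 70% accuracy but incurs more than twice the cost of "reject everyone", which is why a cost-based or class-sensitive metric is worth reporting alongside accuracy.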

Tasks

  1. Create a Jupyter notebook for an end-to-end machine learning project.

  2. Create a slide deck to present your project. Example: Project Presentation Slides
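
As a starting point for the notebook, an end-to-end flow could be structured as preprocessing, a stratified train/test split, model fitting, and evaluation. The skeleton below is one possible sketch using scikit-learn, not a required design: the choice of logistic regression, the helper names, and the synthetic toy data (used so the example runs without downloading anything) are all assumptions made for illustration.

```python
import random

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    """Scale numeric columns, one-hot encode coded columns, then fit a classifier."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


def make_toy_data(n_rows=200, seed=0):
    """Synthetic stand-in with the same column style as the real data (NOT real records)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        duration = rng.randint(4, 72)
        status = rng.choice(["A11", "A12", "A13", "A14"])
        # Arbitrary rule so that the toy target is learnable.
        target = 2 if (duration > 36 and status in ("A11", "A12")) else 1
        rows.append({"Attribute1": status, "Attribute2": duration, "target": target})
    return pd.DataFrame(rows)


toy = make_toy_data()
X_toy, y_toy = toy.drop(columns=["target"]), toy["target"]

# stratify keeps the good/bad ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

pipeline = build_pipeline(numeric_cols=["Attribute2"], categorical_cols=["Attribute1"])
pipeline.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"toy test accuracy: {test_accuracy:.2f}")
```

With the real data, the toy frame would be replaced by the exported DataFrame and the column lists by the full sets of numeric and categorical attributes; other models and metrics can be swapped into the same structure.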