ETL for Credit Card Default dataset

This ETL tool is built to:

  • Extract the CSV file (UCI_Credit_Card.csv)
  • Transform selected column names and values
  • Load the result into MySQL

Dataset from Kaggle: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

Import libraries

  • Pandas is used to read the CSV file and turn it into a DataFrame
  • MySQLdb is the library that enables Python to connect to the MySQL server
In [1]:
import pandas as pd
import MySQLdb
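
Note: the MySQLdb module is typically provided by the mysqlclient package; if the import fails, installing that package (for example with pip install mysqlclient) usually resolves it.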

SQL Insert command

Write the SQL INSERT statement template and assign it to INSERT_DB:

INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
In [2]:
INSERT_DB = "INSERT INTO card_default(id, limit_bal, sex, education, marriage, age,\
pay_status1, pay_status2, pay_status3, pay_status4, pay_status5,  pay_status6,\
bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5,  bill_amt6,\
pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5,  pay_amt6,\
pay_default) \
VALUES ({0}, {1}, '{2}', '{3}', '{4}', {5}\
, '{6}', '{7}', '{8}', '{9}', '{10}', '{11}'\
, {12}, {13}, {14}, {15}, {16}, {17} \
, {18}, {19}, {20}, {21}, {22}, {23}\
, {24})" 
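
Side note: the template above quotes string values by hand inside str.format. An alternative sketch (not used in the rest of this notebook) is to let the driver handle quoting and escaping by using %s placeholders and passing the values as a tuple to cursor.execute, a parameter style MySQLdb supports:

# Alternative sketch: parameterized INSERT with %s placeholders.
# The driver quotes and escapes values, so no manual quoting is needed.
INSERT_DB_PARAM = """INSERT INTO card_default(
    id, limit_bal, sex, education, marriage, age,
    pay_status1, pay_status2, pay_status3, pay_status4, pay_status5, pay_status6,
    bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt6,
    pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6,
    pay_default)
VALUES (%s, %s, %s, %s, %s, %s,
        %s, %s, %s, %s, %s, %s,
        %s, %s, %s, %s, %s, %s,
        %s, %s, %s, %s, %s, %s,
        %s)"""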

Connect with MySQL

Fill in the hostname, username, password, and database name.

In [3]:
db = MySQLdb.connect(host="database",
                     user="name",
                     passwd="password",
                     db="scyap")
cur = db.cursor()
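
The target table card_default must already exist before loading. The notebook does not show its schema; a minimal sketch of a matching table, with column types that are assumptions rather than taken from the notebook, could be created like this:

# Sketch only: create the target table if it does not exist yet.
# Column types are assumptions based on the dataset, not from the notebook.
CREATE_TABLE = """CREATE TABLE IF NOT EXISTS card_default (
    id INT PRIMARY KEY,
    limit_bal DECIMAL(12,2),
    sex VARCHAR(1),
    education VARCHAR(20),
    marriage VARCHAR(10),
    age INT,
    pay_status1 VARCHAR(20), pay_status2 VARCHAR(20), pay_status3 VARCHAR(20),
    pay_status4 VARCHAR(20), pay_status5 VARCHAR(20), pay_status6 VARCHAR(20),
    bill_amt1 DECIMAL(12,2), bill_amt2 DECIMAL(12,2), bill_amt3 DECIMAL(12,2),
    bill_amt4 DECIMAL(12,2), bill_amt5 DECIMAL(12,2), bill_amt6 DECIMAL(12,2),
    pay_amt1 DECIMAL(12,2), pay_amt2 DECIMAL(12,2), pay_amt3 DECIMAL(12,2),
    pay_amt4 DECIMAL(12,2), pay_amt5 DECIMAL(12,2), pay_amt6 DECIMAL(12,2),
    pay_default TINYINT
)"""
cur.execute(CREATE_TABLE)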

Extract the CSV into a DataFrame using Pandas

In [4]:
card_default = pd.read_csv("UCI_Credit_Card.csv") 

Transform Data

Change column names

  • default.payment.next.month : renamed to DEFAULT (shorter and without dots, which could cause SQL errors)
  • PAY_0 : renamed to PAY_1 (for consistency with the other columns, e.g. BILL_AMT1 and PAY_AMT1)
In [5]:
card_default = card_default.rename(columns={'default.payment.next.month': 'DEFAULT', 'PAY_0': 'PAY_1'})

Change values in EDUCATION

  • Change numerical 1, 2, 3 to Graduate School, University, High School
  • Change numerical 0, 4, 5, 6 to Other (no clear description in the dataset, so all are assumed to be Other)
  • Print out unique values to check the changes

Changing numerical codes to strings makes analysis in SQL more convenient: when doing a GROUP BY, it makes more sense to show the name of each category instead of its numeric code.

In [6]:
card_default.loc[(card_default.EDUCATION == 1), 'EDUCATION'] = 'Graduate School'
card_default.loc[(card_default.EDUCATION == 2), 'EDUCATION'] = 'University'
card_default.loc[(card_default.EDUCATION == 3), 'EDUCATION'] = 'High School'

fill = (card_default.EDUCATION == 0) | (card_default.EDUCATION == 4) | (card_default.EDUCATION == 5) | (card_default.EDUCATION == 6)
card_default.loc[fill, 'EDUCATION'] = 'Other'

print('EDUCATION ' + str(card_default['EDUCATION'].unique()))
EDUCATION ['University' 'Graduate School' 'High School' 'Other']
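
The same recoding can also be expressed with a single dictionary; a minimal equivalent sketch (run instead of the cell above, on the original numeric codes) using pandas' replace:

# Equivalent sketch: one replace() call instead of several .loc assignments.
education_map = {1: 'Graduate School', 2: 'University', 3: 'High School',
                 0: 'Other', 4: 'Other', 5: 'Other', 6: 'Other'}
card_default['EDUCATION'] = card_default['EDUCATION'].replace(education_map)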

Change values in SEX

  • Change numerical 1 & 2 to M & F
  • Print out unique values to check the changes
In [7]:
card_default.loc[(card_default.SEX == 1), 'SEX'] = 'M'
card_default.loc[(card_default.SEX == 2), 'SEX'] = 'F'

print('SEX ' + str(card_default['SEX'].unique()))
SEX ['F' 'M']

Change values in MARRIAGE

  • Change numerical 0, 1, 2, 3 to Other, Married, Single, Divorce
  • Print out unique values to check the changes
In [8]:
card_default.loc[(card_default.MARRIAGE == 0), 'MARRIAGE'] = 'Other'
card_default.loc[(card_default.MARRIAGE == 1), 'MARRIAGE'] = 'Married'
card_default.loc[(card_default.MARRIAGE == 2), 'MARRIAGE'] = 'Single'
card_default.loc[(card_default.MARRIAGE == 3), 'MARRIAGE'] = 'Divorce'

print('MARRIAGE ' + str(card_default['MARRIAGE'].unique()))
MARRIAGE ['Married' 'Single' 'Divorce' 'Other']

Change values in PAY_1 to PAY_6

  • Change numerical -2, -1, 0, 1, 2, ... to no consumption, paid full, paid min, delay 1 mth, delay 2 mths, ...
  • Print out unique values to check the changes
In [9]:
# map the numeric repayment status codes in PAY_1..PAY_6 to descriptive labels
for i in range(1,7):
    pay = 'PAY_' + str(i)
    card_default.loc[(card_default[pay] == -2), pay] = 'no consumption'
    card_default.loc[(card_default[pay] == -1), pay] = 'paid full'
    card_default.loc[(card_default[pay] == 0), pay] = 'paid min'
    card_default.loc[(card_default[pay] == 1), pay] = 'delay 1 mth'
    card_default.loc[(card_default[pay] == 2), pay] = 'delay 2 mths'
    card_default.loc[(card_default[pay] == 3), pay] = 'delay 3 mths'
    card_default.loc[(card_default[pay] == 4), pay] = 'delay 4 mths'
    card_default.loc[(card_default[pay] == 5), pay] = 'delay 5 mths'
    card_default.loc[(card_default[pay] == 6), pay] = 'delay 6 mths'
    card_default.loc[(card_default[pay] == 7), pay] = 'delay 7 mths'
    card_default.loc[(card_default[pay] == 8), pay] = 'delay 8 mths'
    
print('PAY_1 ' + str(sorted(card_default['PAY_1'].unique())))
PAY_1 ['delay 1 mth', 'delay 2 mths', 'delay 3 mths', 'delay 4 mths', 'delay 5 mths', 'delay 6 mths', 'delay 7 mths', 'delay 8 mths', 'no consumption', 'paid full', 'paid min']

Pre-loading

After transforming the column names and values, we show the top 5 rows as a sample to check how the data looks.

In [10]:
card_default.head()
Out[10]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT
0 1 20000.0 F University Married 24 delay 2 mths delay 2 mths paid full paid full ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2 120000.0 F University Single 26 paid full delay 2 mths paid min paid min ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3 90000.0 F University Single 34 paid min paid min paid min paid min ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4 50000.0 F University Married 37 paid min paid min paid min paid min ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5 50000.0 M University Married 57 paid full paid min paid full paid min ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0

5 rows × 25 columns

Loading Data

Use a for loop to read the DataFrame row by row and insert each row into MySQL.

In [13]:
for index, row in card_default.iterrows():
    sql = INSERT_DB.format(
        row["ID"],
        float(row["LIMIT_BAL"]),
        row["SEX"],
        row["EDUCATION"],
        row["MARRIAGE"],
        int(row["AGE"]),
        row["PAY_1"],
        row["PAY_2"],
        row["PAY_3"],
        row["PAY_4"],
        row["PAY_5"],
        row["PAY_6"],
        float(row["BILL_AMT1"]),
        float(row["BILL_AMT2"]),
        float(row["BILL_AMT3"]),
        float(row["BILL_AMT4"]),
        float(row["BILL_AMT5"]),
        float(row["BILL_AMT6"]),
        float(row["PAY_AMT1"]),
        float(row["PAY_AMT2"]),
        float(row["PAY_AMT3"]),
        float(row["PAY_AMT4"]),
        float(row["PAY_AMT5"]),
        float(row["PAY_AMT6"]),
        int(row["DEFAULT"])
    )
    # Optional: to print out loading status when running
    #   if index % 1000 == 0:
    #       print(str(index) + ' row done') 
    cur.execute(sql)
    db.commit()
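
Committing after every row works but is slow for 30,000 rows. A possible variant (not what this notebook does) is to build all value tuples first and send them in one batch with cursor.executemany, committing once at the end. The sketch below assumes the parameterized INSERT_DB_PARAM template shown earlier:

# Sketch: batch insert with executemany and a single commit.
columns = ["ID", "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE",
           "PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
           "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
           "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
           "DEFAULT"]
rows = []
for _, row in card_default.iterrows():
    values = [row[c] for c in columns]
    # Convert numpy scalars to native Python types so the driver can escape them.
    values = [v.item() if hasattr(v, "item") else v for v in values]
    rows.append(tuple(values))
cur.executemany(INSERT_DB_PARAM, rows)
db.commit()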
In [12]:
cur.close()
db.close()

ETL completed

The dataset has been extracted, transformed, and loaded into MySQL. We can now move to MySQL to check the data.
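
As a quick sanity check (a sketch, assuming the same connection parameters as above), the row count in MySQL can be compared against the DataFrame:

# Sketch: verify that all rows arrived, using a fresh connection.
db = MySQLdb.connect(host="database", user="name", passwd="password", db="scyap")
cur = db.cursor()
cur.execute("SELECT COUNT(*) FROM card_default")
print("Rows in MySQL:", cur.fetchone()[0], "| Rows in DataFrame:", len(card_default))
cur.close()
db.close()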
