This ETL tool is built for:
Dataset from Kaggle: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset
import pandas as pd
import MySQLdb
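Note: depending on your environment, the MySQLdb module is provided by the mysqlclient package (on Python 3), so you may need to run pip install mysqlclient first.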
Write the SQL INSERT statement and assign it to INSERT_DB. The general syntax is:
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
INSERT_DB = "INSERT INTO card_default(id, limit_bal, sex, education, marriage, age,\
pay_status1, pay_status2, pay_status3, pay_status4, pay_status5, pay_status6,\
bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt6,\
pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6,\
pay_default) \
VALUES ({0}, {1}, '{2}', '{3}', '{4}', {5}\
, '{6}', '{7}', '{8}', '{9}', '{10}', '{11}'\
, {12}, {13}, {14}, {15}, {16}, {17} \
, {18}, {19}, {20}, {21}, {22}, {23}\
, {24})"
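Building the SQL by string formatting works for this known, clean dataset, but it is vulnerable to SQL injection and quoting mistakes. A safer sketch (the INSERT_DB_PARAM name is ours, not part of the original tool) uses MySQLdb's %s placeholders so the driver escapes each value itself:
# Hypothetical parameterized template: 25 %s placeholders, one per column
INSERT_DB_PARAM = ("INSERT INTO card_default (id, limit_bal, sex, education, marriage, age, "
                   "pay_status1, pay_status2, pay_status3, pay_status4, pay_status5, pay_status6, "
                   "bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt6, "
                   "pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6, "
                   "pay_default) VALUES (" + ", ".join(["%s"] * 25) + ")")
# Usage: cur.execute(INSERT_DB_PARAM, tuple_of_25_values)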
Fill in the hostname, username, password, and database name:
db = MySQLdb.connect(host="database",
                     user="name",
                     passwd="password",
                     db="scyap")
cur = db.cursor()
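The INSERT statement assumes a card_default table already exists in the database. If it does not, a possible schema is sketched below; the column types and sizes are our assumptions based on the CSV, not part of the original tool:
# Assumed schema sketch; adjust the types and sizes to your needs
cur.execute("""
    CREATE TABLE IF NOT EXISTS card_default (
        id INT PRIMARY KEY,
        limit_bal DECIMAL(12, 2),
        sex VARCHAR(1),
        education VARCHAR(20),
        marriage VARCHAR(10),
        age INT,
        pay_status1 VARCHAR(20), pay_status2 VARCHAR(20), pay_status3 VARCHAR(20),
        pay_status4 VARCHAR(20), pay_status5 VARCHAR(20), pay_status6 VARCHAR(20),
        bill_amt1 DECIMAL(12, 2), bill_amt2 DECIMAL(12, 2), bill_amt3 DECIMAL(12, 2),
        bill_amt4 DECIMAL(12, 2), bill_amt5 DECIMAL(12, 2), bill_amt6 DECIMAL(12, 2),
        pay_amt1 DECIMAL(12, 2), pay_amt2 DECIMAL(12, 2), pay_amt3 DECIMAL(12, 2),
        pay_amt4 DECIMAL(12, 2), pay_amt5 DECIMAL(12, 2), pay_amt6 DECIMAL(12, 2),
        pay_default TINYINT
    )
""")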
card_default = pd.read_csv("UCI_Credit_Card.csv")
# Rename the target column, and PAY_0 to PAY_1 so the repayment-status
# columns are numbered consistently (the raw data has PAY_0, PAY_2..PAY_6)
card_default = card_default.rename(columns={'default.payment.next.month': 'DEFAULT', 'PAY_0': 'PAY_1'})
Change the numeric codes to strings, because strings are more convenient when analysing in SQL: when doing a GROUP BY, it makes more sense to show the name of each category instead of its numeric code.
card_default.loc[(card_default.EDUCATION == 1), 'EDUCATION'] = 'Graduate School'
card_default.loc[(card_default.EDUCATION == 2), 'EDUCATION'] = 'University'
card_default.loc[(card_default.EDUCATION == 3), 'EDUCATION'] = 'High School'
fill = (card_default.EDUCATION == 0) | (card_default.EDUCATION == 4) | (card_default.EDUCATION == 5) | (card_default.EDUCATION == 6)
card_default.loc[fill, 'EDUCATION'] = 'Other'
print('EDUCATION ' + str(card_default['EDUCATION'].unique()))
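As an aside, the same recoding can be written in one step with Series.map instead of the four .loc assignments above; a sketch:
# Equivalent one-step recoding (sketch): codes 0 and 4-6 all mean 'Other'
education_map = {1: 'Graduate School', 2: 'University', 3: 'High School',
                 0: 'Other', 4: 'Other', 5: 'Other', 6: 'Other'}
card_default['EDUCATION'] = card_default['EDUCATION'].map(education_map)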
card_default.loc[(card_default.SEX == 1), 'SEX'] = 'M'
card_default.loc[(card_default.SEX == 2), 'SEX'] = 'F'
print('SEX ' + str(card_default['SEX'].unique()))
card_default.loc[(card_default.MARRIAGE == 0), 'MARRIAGE'] = 'Other'
card_default.loc[(card_default.MARRIAGE == 1), 'MARRIAGE'] = 'Married'
card_default.loc[(card_default.MARRIAGE == 2), 'MARRIAGE'] = 'Single'
card_default.loc[(card_default.MARRIAGE == 3), 'MARRIAGE'] = 'Divorce'
print('MARRIAGE ' + str(card_default['MARRIAGE'].unique()))
# Map the PAY_1..PAY_6 repayment-status codes to descriptive labels
for i in range(1, 7):
    pay = 'PAY_' + str(i)
    card_default.loc[(card_default[pay] == -2), pay] = 'no consumption'
    card_default.loc[(card_default[pay] == -1), pay] = 'paid full'
    card_default.loc[(card_default[pay] == 0), pay] = 'paid min'
    card_default.loc[(card_default[pay] == 1), pay] = 'delay 1 mth'
    card_default.loc[(card_default[pay] == 2), pay] = 'delay 2 mths'
    card_default.loc[(card_default[pay] == 3), pay] = 'delay 3 mths'
    card_default.loc[(card_default[pay] == 4), pay] = 'delay 4 mths'
    card_default.loc[(card_default[pay] == 5), pay] = 'delay 5 mths'
    card_default.loc[(card_default[pay] == 6), pay] = 'delay 6 mths'
    card_default.loc[(card_default[pay] == 7), pay] = 'delay 7 mths'
    card_default.loc[(card_default[pay] == 8), pay] = 'delay 8 mths'
print('PAY_1 ' + str(sorted(card_default['PAY_1'].unique())))
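The loop above can likewise be collapsed into a single shared mapping applied to all six columns at once; a sketch using DataFrame.replace:
# Sketch: one mapping for PAY_1..PAY_6, applied in a single call
pay_map = {-2: 'no consumption', -1: 'paid full', 0: 'paid min',
           1: 'delay 1 mth', 2: 'delay 2 mths', 3: 'delay 3 mths',
           4: 'delay 4 mths', 5: 'delay 5 mths', 6: 'delay 6 mths',
           7: 'delay 7 mths', 8: 'delay 8 mths'}
pay_cols = ['PAY_' + str(i) for i in range(1, 7)]
card_default[pay_cols] = card_default[pay_cols].replace(pay_map)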
After transforming the column names and values, we show the top 5 rows as a sample to check how the data looks.
card_default.head()
Use a for loop to read the data row by row and load it into MySQL:
for index, row in card_default.iterrows():
    sql = INSERT_DB.format(
        row["ID"],
        float(row["LIMIT_BAL"]),
        row["SEX"],
        row["EDUCATION"],
        row["MARRIAGE"],
        int(row["AGE"]),
        row["PAY_1"],
        row["PAY_2"],
        row["PAY_3"],
        row["PAY_4"],
        row["PAY_5"],
        row["PAY_6"],
        float(row["BILL_AMT1"]),
        float(row["BILL_AMT2"]),
        float(row["BILL_AMT3"]),
        float(row["BILL_AMT4"]),
        float(row["BILL_AMT5"]),
        float(row["BILL_AMT6"]),
        float(row["PAY_AMT1"]),
        float(row["PAY_AMT2"]),
        float(row["PAY_AMT3"]),
        float(row["PAY_AMT4"]),
        float(row["PAY_AMT5"]),
        float(row["PAY_AMT6"]),
        int(row["DEFAULT"])
    )
    # Optional: print the loading status while running
    # if index % 1000 == 0:
    #     print(str(index) + ' row done')
    cur.execute(sql)
# Commit once after all rows are inserted
db.commit()
cur.close()
db.close()
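Executing one INSERT per row with iterrows is simple but slow for the roughly 30,000 rows in this dataset. A faster bulk-load sketch, assuming the parameterized INSERT_DB_PARAM template from earlier and that the DataFrame columns are in the same order as the INSERT column list:
# Bulk-load alternative (sketch): one executemany call instead of one execute per row
params = [tuple(v.item() if hasattr(v, 'item') else v for v in row)  # numpy scalars -> native Python types
          for row in card_default.itertuples(index=False, name=None)]
cur.executemany(INSERT_DB_PARAM, params)
db.commit()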
The dataset has been extracted, transformed, and loaded into MySQL. We may now proceed to MySQL to check our data.
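For example, the GROUP BY use case mentioned earlier can be checked directly in the MySQL client (a sample query, assuming the column names used above):
SELECT education, COUNT(*) AS clients, SUM(pay_default) AS defaults
FROM card_default
GROUP BY education;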