BA 648 Project This project is an individual exercise intended to practice the art of ‘data wrangling’ with the application of SQL. At each step, you will need to figure out what needs to be done and in what order, like what you will need to do in any other business analytics project.

BA 648 Project
This project is an individual exercise intended to practice the art of ‘data wrangling’ with the application of
SQL. At each step, you will need to figure out what needs to be done and in what order, like what you will
need to do in any other business analytics project.
Problem Introduction
You will be working as a business with Hardy Business Loans (HBL), a hypothetical company that lends
money to small businesses for small to medium sized equipment acquisitions. The borrower can be either the
business itself or one or more of the business’s owners or partners. For example, Hardy might lend money to
a restaurant for a pizza oven; in the case of a transportation company, it could be a tractor (truck).
HBL has recently acquired portfolios from “Beachside Lenders” and “Wilson & Sons” and their CEO would
like for you to analyze whether or not their lending practices fit within HBL’s current models. Of course, the
data files from each of these companies has different layouts, formats, and fields. Each portfolio consists of
applications and loans. Each loan should have a corresponding loan application but not all loan applications
lead to a loan. Applications that do not lead to loans are termed “unbooked” and applications that lead to
loans are termed “booked”. Reasons that an application may be “unbooked” may be that the lender declined
the application, or the borrower acquired a loan elsewhere, to name two reasons for an application failing to
The lender, as required, consults several other credit rating agencies to collect information to help HBL
4/30/2021 64754 – BA 648 ProjectThis project is an individual exercise 3/8
determine the creditworthiness of its borrower. Ordinarily, credit reporting agencies supply information
regarding the credit habits and experience of individuals as well as businesses. The HBL “agency” uses data
collected from these credit reporting agencies.
When requested, data reporting agencies supply a variety of “low-level” data. This “low-level” data is
specific to a single “customer”. It also provides a single creditworthiness score, or the “agency score”. Lowlevel data might include information regarding the number of times a customer was 30 days late paying a
credit card bill, the number of credit cards the customer has, whether the customer has declared bankruptcy,
and other items relating to the customer’s creditworthiness. The agency score is a single number intended to
represent the customer’s overall creditworthiness by combining low-level customer data using its own
proprietary algorithms. It is often the case that the lender, such as HBL, will make the same request for
customer information to multiple credit reporting agencies, requesting the agency score as well as low-level
data for its customer. The result of these multiple requests is that multiple agency scores will likely be
collected for each customer. For example, HBL would use agency scores from credit rating companies such
as Experian, TransUnion, and Equifax. Additionally, lenders such as HBL, often create their own assessment
from low-level data provided in the same request.
The Objective
The overall objective of this project is to demonstrate your ability to clean data and develop SQL
applications. As your analysis progresses, you will begin to create reports and collect detail important for
communicating your work. You will combine these reports as a collection of deliverables at the time of the
project completion. The objective of this project can be broken down into several smaller, general objectives;
data cleaning, model building, and reporting. For the purpose of this class project, we will be focusing on the
data cleaning and SQL coding objectives. A short description for each of the deliverables to support this
effort follows.
1. Executive Summary
2. Summary Statistics
3. Graphs and Charts
4. Test and Training Data Files
5. Data Cleaning Journal
6. SQL Code
1. Executive Summary
The executive summary is a summary of your work. It should be approximately 1 page in length, between
350 – 500 words. It should describe the project as well as summarize your results, and conclusions. A clear,
well-articulated statement regarding your recommendation is also required.
2. Summary Statistics
Summary statistics should include statistical descriptions of the data used to asses each loan application used
in rating the borrower’s credit. For instance, it could be useful to know the average level of income or the debt
to income ratio median for a region.
3. Graphs and Charts
Graphs and charts are extremely important for visualizing data and trends. Graphs should mirror the
information that you are attempting to convey in the analysis. Use descriptive labels for all axes, titles, and
data. Scale your charts and graphs appropriately for the question you are attempting to answer for the data
provided. Graphs and charts should require little explanation and be readable. Use fonts, font sizes, and
colors that highlight important details and enhance readability.
4. Test and Training Data Files
You will create two datasets, a test set and a training set. These two datasets will be used to develop a model
for evaluating future loan application and to determine the fit of your model. The first data set, Booked Data,
comes from the booked loans. You will need to construct a response variable for each record in the loan data
set. This response variable will be used for future modeling efforts to contain the measurement that describes
the outcome of the loan decision. The response variable represents the valuation of the loan and can be
assigned one of several levels. (We will define these levels more completely later in this document.) An
example of levels is “Good”, “Bad”, or -Inteterminate-. In other words, each record in the first data set will
contain 1) a constructed variable, referred to as the response variable, that represents a
measurement of the loan, 2) a number of measurements on different variables, referred to as “predictors”, and
3) several keys that will link the final data to the components used to build it.
4/30/2021 64754 – BA 648 ProjectThis project is an individual exercise 4/8
5. Data Cleaning Journal
The data cleaning journal is used to document the actions you used to transform your data in preparation for
analysis. This would include any mathematical operations necessary to combine variables. For instance, it
could be that you needed to apply a logarithmic transformation to correct issues with your data causing it to
violate model assumptions. Sources of your data should be recorded and any pertinent data relating to
accessing these sources should be preserved.
Ultimately, the objective of the data cleaning exercise is to create tabular data with one row per loan
including all the necessary data (columns) of descriptive data. In addition to the loan data listed in each row,
any observations that were omitted and the reason for the omission must be reported. For example, was there
too much missing data, no matching records in the agency data, or some other reason for the omission.
6. SQL Code
SQL is often used for cleaning, manipulating, and reporting data and is required for this project. Your SQL
code should be readable and adhere to best practices where possible. It should also follow ANSI standards
defined by the PostgreSQL implementation to allow for maximum portability.
Modeling Considerations
Modeling considerations include dividing the data sets into multiple data sets, such as creating a training data
set for building models and a test data set for testing. This step will most likely be determined by randomly
selecting the data for each of the data sets being created. There are other methods of combining and
recombining data as well. However, while the expectation is that modeling is not part of this exercise, you
will need to prepare the data as if you were proceeding with the next step. Analytical models are based on
meeting certain assumptions for satisfying specific criteria. Assumptions of normality or missing data are two
such examples.
Examples of Things to Check
Where possible you should check every column (variable) in the data against the data dictionary. Be sure to
look for duplicates and classes of data that fall in the range of expected values. Check the minimum and
maximum values and determine appropriate ranges, tabulate categorical variables.
Begin by examining each column independently, then consider pairs of columns and so forth. It is often
worthwhile to create cross tabula or multi-way tables of all your categorical variables as part of your
preliminary data cleaning efforts. Get to know your data. Be diligent regarding unexpected values or missing
values. Missing dates, missing categories are two examples.
In practice, predictor variables may be available from multiple sources. In such cases, you may want to
consult the source to determine if one is more reliable than another. Likely, neither source will be perfect.
You will need to determine which is best and then make improvements by cleaning, etc.
The Data
You will need to construct the data for your project by combining inputs from several sources. Each of the
sources is described in more detail provided after the following summary:
• The two loan application portfolios, Beachside and Wilson & Sons, provide information about the
borrower’s loan application.
We will use rules to determine whether the performance of a loan is -Good-, -Bad-, or -Indeterminate-. For
the purposes of this project exclude the unbooked records or records that contain an indeterminant response
used in the final set of data for the initial modeling effort.
Steps for creating the Predictor Data for Modeling Step 1. Combine the two portfolios create loan application.
Identify the corresponding variables in each of the 2 portfolios and combine into a single data file or table.
You may need to combine variables or otherwise manipulate the data in each portfolio to complete the
action(s). For each loan application, data may not be yet available on the initial step.
2. Assemble the Predictor table
Leveraging the data in the loan application, create a new predictor table based on the specification provided
by Table 1.
3. Rate loan application and assign value to response variable.
Create a response variable in your table that indicates whether the loan was “Good”, “Bad”, or –
Indeterminate-. Use the following business logic to determine the value for the response variable.
• Loans granted in the last 6 months are “Indeterminant”
• Loans graded below “C” are “Bad”
• Months since last delinquent is less than 3 are “Bad”
4/30/2021 64754 – BA 648 ProjectThis project is an individual exercise 5/8
• Debt to income ratio greater than 30% is “Bad”
• Loans with a status of “Charged Off” or “Default are “Bad”
• All other loans are “Good”
4. Create your test and training data sets
After completing steps 1-3, randomly split the combined data so that you have a training set that contains
80% of the records and a test set that contains 20% of the records.
Five Important Fields (Predictor Data):
Application Number: A 10-digit number that identifies the application. Every application has a unique
application number, and application numbers should be distinct between the two portfolios.
Loan Number: A field that identifies loans for booked applications. The six-digit loan numbers may not be
unique, meaning that they can appear in both data portfolios by chance.
Application Date: The date that the application was completed. In the loan application process, we will
constrain ourselves to looking only at the information available as of this date for the purposes of evaluating
unbooked loans in the modeling process. For example, if a loan application is dated December 31, 2016, we
will ignore scores.
Loan Date: This is the date that the loan is “booked” (approved). After this date, scores and other data will be
Active Date: The term “active date” is used to identify the application date for an unbooked loan and the loan
date for booked loans. This date marks the date that the approval decision is made. It is important because it
determines the instance of time at which relating information is ignored. Unlike other important date, Active
Date is inferred.
Loan and Application Portfolios
Information for the HBL loan portfolio come from two sources: Beachside Lenders and Wilson & Sons. The
sources provide low-level reporting data that has been recorded at or before the active date for both booked
and unbooked loans.
The Beachside Lenders data file is a CSV named Beach Loans.csv. Data for the Wilson portfolio can be
found in WilsonandSons.csv. The Wilson data may contain multiple loans for a single borrower; each record
may have a different application number.
Most of the variables that we need to extract may, in some cases, be constructs of the variables listed in Table
1. Predictive Model Columns.
Table 1. Predictive Model Variables
Descriptor (Name) Beachside Wilson Notes
Member Number MBR_NO Member Number Member Number
Member Date MBR_DT Member Date Member Date
Loan Number LOAN_NO Loan Number Loan Number (Booked)
Loan Date LOAN_DT Loan Date Loan Date (booked)
Loan Stat STAT Loan Status Booked, Declined
Title CUST_TITLE Customer Job Employment Title
Time Employed CUST_EMP Length of
Employment Time in job
State CUST_ST Customer State Customer State/Province
Zip CUST_ZIP Customer Zip Customer Zip {Region)
Loan Amount LOAN_AMT Net Loan Amount Loan Amount
Down Derived Derived Down Payment
CredRisk Derived Derived Credit Risk (‘Good’, ‘Bad’, ‘Indeterminate’)
Source Derived Derived Beachside or Wilson
Credit Score FICO High FICO FICO Credit Score
Grade GRADE Grade Loan Grade
Response Derived Derived Loan Indicator
Table 2. Loan Application
No Loan Application Variable Description
1 id A unique LC assigned ID for the loan listing.
2 addr_state The state provided by the borrower in the loan application
3 application_type Indicates whether the loan is an individual application or a joint application with two co-
4/30/2021 64754 – BA 648 ProjectThis project is an individual exercise 6/8
4 emp_length Employment length in years. Possible values are between 0 and 10 where 0 means less than
one year and 10 means ten or more years.
5 annual_inc The self-reported annual income provided by the borrower during registration.
6 annual_inc_joint The combined self-reported annual income provided by the coborrowers during
7 verified_status_joint Indicates if the co-borrowers’ joint income was verified by LC, not verified, or if the
income source was verified
8 home_ownership The home ownership status provided by the borrower during registration. Our values are:
9 title The loan title provided by the borrower
10 purpose A category provided by the borrower for the loan request.
11 loan_amnt The listed amount of the loan applied for by the borrower. If at some point in time, the credit
department reduces the loan amount, then it will be reflected in this value.
12 pymnt_plan Indicates if a payment plan has been put in place for the loan
13 int_rate Interest Rate on the loan
14 installment The monthly payment owed by the borrower if the loan originates.
15 loan_status Current status of the loan
16 issue_d The month which the loan was funded
17 last_credit_pull_d The most recent month LC pulled credit for this loan
18 earliest_cr_line The month the borrower’s earliest reported credit line was opened
19 mths_since_last_delinq The number of months since the borrower’s last delinquency.
20 mths_since_last_record The number of months since the last public record.
21 open_acc The number of open credit lines in the borrower’s credit file.
22 pub_rec Number of derogatory public records
23 inq_last_6mths The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
24 fico_range_high The upper boundary range the borrower’s FICO at loan origination belongs to.
25 fico_range_low The lower boundary range the borrower’s FICO at loan origination belongs to.
26 revol_bal Total credit revolving balance
27 revol_util Revolving line utilization rate, or the amount of credit the borrower is using relative to all
available revolving credit.
28 tot_coll_amt Total collection amounts ever owed
29 tot_cur_bal Total current balance of all accounts
30 total_acc The total number of credit lines currently in the borrower’s credit file
31 funded_amnt The total amount committed to that loan at that point in time.
32 term The number of payments on the loan. Values are in months and can be either 36 or 60.