Looking at the first round of 2020 PPP loans
At the time, last year’s Payroll Protection Program legislation was the largest federal expenditure ever (although it has now been surpassed!).
Kaggle features this data set including PPP loan data to small businesses through August 8, 2020.
“Paycheck Protection Program loan data by state for loans of $150,000 and above.”
Objective
The program was designed to save local jobs, so let’s take a closer look at the data to see what we can learn about the effects of the program.
Getting Started
Load the file into PyCharm with the standard libraries for analysis and then look at descriptors.
The shape of our data set is (661,218 rows, 16 columns). That’s a lot! We can cross-check the number of loans issued with information from the US Department of Treasury website and see there is a difference of 1,267 rows, or about 0.19%.
If we look closer at the schema of our data frame:
RangeIndex: 661218 entries, 0 to 661217
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LoanRange 661218 non-null object
1 BusinessName 661210 non-null object
2 Address 661201 non-null object
3 City 661203 non-null object
4 State 661218 non-null object
5 Zip 661202 non-null float64
6 NAICSCode 654435 non-null float64
7 BusinessType 659789 non-null object
8 RaceEthnicity 661218 non-null object
9 Gender 661218 non-null object
10 Veteran 661218 non-null object
11 NonProfit 42462 non-null object
12 JobsRetained 620712 non-null float64
13 DateApproved 661218 non-null object
14 Lender 661218 non-null object
15 CD 661218 non-null object
dtypes: float64(3), object(13)
memory usage: 80.7+ MB
Data Pre-processing
Our objective is to learn about JobsRetained
, so let’s process the data set and make it more manageable to manipulate.
In this analysis, we are going to look for insights relating to geographic distribution, classification of companies receiving stimulus money, and the number of jobs retained.
NAICS Code refers to codes from the North American Industry Classification System used for categorization and statistical analysis of the US economy. It’s important to note this is a categorical variable and not continuous.
Otherwise, the data looks straightforward except that LoanRange
feature is not in the ideal format for computations. We’ll get to that later. For now, let’s look directly at the JobsRetained
column.
Summary statistics about JobsRetained
reveal that it is estimated that the total number of jobs saved stood at 31,456,513.
We can contextualize the statistical summary data by graphing the distribution.
Notice how the graph appears to be normally distributed with skew to the right hand size.
The statistics indicate that 75% of loans were loaned organizations reporting 54 or less JobsRetained
. We could see this as positive outcome of the program, because many smaller businesses have less access to alternative financing routes that are needed to raise money to weather the pandemic.
But, we can’t look at the number of jobs saved alone. The absolute number should be contextualized with a look at what the cost of such a program is.
In order to access loan value as an integer, we need to manipulate the current string
in LoanRange
and engineer new data frame features.
Our JobsRetained
, LoanMin
, and LoanMax
data is prepared for analysis.
Our range is value, at best, but we can estimate that the cost of saving a single job is priced somewhere between ~$8k and ~$19.4k. These numbers don’t mean much alone, and I want to cross-check these with available information about 2019 incomes provided by the US Census Bureau.
In 2019 the median annual income in the US was $68,703. PPP loans were designed to directly replace the wages of workers and subsidize payrolls. With the numbers on hand, one could suggest we could expect a PPP loan to allow the median worker to stay on payroll for a period of time in the range of 42 days to 103 calendar days.
There is a huge room for variance, as the exact size of loans distributed and number of actual jobs retained is imprecise.
It is also realistic to imagine that when the salary of a job retained is lower than the medianIncome
we would see the lower and upper bounds of our range increase. So, while a medianIncome
job might be saved for 1–3 months we would expect that an individual with an income of significantly less might be able to stay on payroll for an even longer period of time.
We can take this idea further by exploring data according to the unique values of theLoanRange
feature.
Our final data frame shows us the upper and lower bounds of the cost range for each single JobsRetained
in a given LoanRange
. Here, a fascinating insight is revealed to us when we visualize our result:
We see that the upper bound of potential cost for saving a job through a loan in the range of $350k-$1mm is high compared other ranges.
Final Thoughts (for now)
The efficiency of loans issued in the $350k-$1mm range have an unclear economic impact on jobs savings which requires further explanation.
Loans made in the $150k-$350k range appear to be the most efficient at saving jobs given their relative affordability. What is the optimal distribution of loans to this segment to maximize the number of jobs saved?
The cost range of saving a job with a loan size between $5mm-$10mm is about 2x that of a job saved using a loan size between $150k-$350k. How do we assess and compare the benefits of saving jobs that are in different LoanRange
?
End
Thanks for following me this far! If you have any recommendations on how to improve my approach, please share in the comments below. Stay tuned for future analyses!