Looking at the first round of 2020 PPP loans

5 min readMar 21, 2021

At the time, last year’s Payroll Protection Program legislation was the largest federal expenditure ever (although it has now been surpassed!).

Kaggle features this data set including PPP loan data to small businesses through August 8, 2020.

“Paycheck Protection Program loan data by state for loans of $150,000 and above.”

Objective

The program was designed to save local jobs, so let’s take a closer look at the data to see what we can learn about the effects of the program.

Getting Started

Load the file into PyCharm with the standard libraries for analysis and then look at descriptors.

The shape of our data set is (661,218 rows, 16 columns). That’s a lot! We can cross-check the number of loans issued with information from the US Department of Treasury website and see there is a difference of 1,267 rows, or about 0.19%.

If we look closer at the schema of our data frame:

RangeIndex: 661218 entries, 0 to 661217
Data columns (total 16 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   LoanRange      661218 non-null  object 
 1   BusinessName   661210 non-null  object 
 2   Address        661201 non-null  object 
 3   City           661203 non-null  object 
 4   State          661218 non-null  object 
 5   Zip            661202 non-null  float64
 6   NAICSCode      654435 non-null  float64
 7   BusinessType   659789 non-null  object 
 8   RaceEthnicity  661218 non-null  object 
 9   Gender         661218 non-null  object 
 10  Veteran        661218 non-null  object 
 11  NonProfit      42462 non-null   object 
 12  JobsRetained   620712 non-null  float64
 13  DateApproved   661218 non-null  object 
 14  Lender         661218 non-null  object 
 15  CD             661218 non-null  object 
dtypes: float64(3), object(13)
memory usage: 80.7+ MB

Data Pre-processing

Our objective is to learn about JobsRetained, so let’s process the data set and make it more manageable to manipulate.

In this analysis, we are going to look for insights relating to geographic distribution, classification of companies receiving stimulus money, and the number of jobs retained.

NAICS Code refers to codes from the North American Industry Classification System used for categorization and statistical analysis of the US economy. It’s important to note this is a categorical variable and not continuous.

Otherwise, the data looks straightforward except that LoanRange feature is not in the ideal format for computations. We’ll get to that later. For now, let’s look directly at the JobsRetained column.

Summary statistics about JobsRetained reveal that it is estimated that the total number of jobs saved stood at 31,456,513.

We can contextualize the statistical summary data by graphing the distribution.

Notice how the graph appears to be normally distributed with skew to the right hand size.

The statistics indicate that 75% of loans were loaned organizations reporting 54 or less JobsRetained. We could see this as positive outcome of the program, because many smaller businesses have less access to alternative financing routes that are needed to raise money to weather the pandemic.

But, we can’t look at the number of jobs saved alone. The absolute number should be contextualized with a look at what the cost of such a program is.

In order to access loan value as an integer, we need to manipulate the current string in LoanRange and engineer new data frame features.

Our JobsRetained, LoanMin, and LoanMaxdata is prepared for analysis.

Our range is value, at best, but we can estimate that the cost of saving a single job is priced somewhere between ~$8k and ~$19.4k. These numbers don’t mean much alone, and I want to cross-check these with available information about 2019 incomes provided by the US Census Bureau.

In 2019 the median annual income in the US was $68,703. PPP loans were designed to directly replace the wages of workers and subsidize payrolls. With the numbers on hand, one could suggest we could expect a PPP loan to allow the median worker to stay on payroll for a period of time in the range of 42 days to 103 calendar days.

There is a huge room for variance, as the exact size of loans distributed and number of actual jobs retained is imprecise.

It is also realistic to imagine that when the salary of a job retained is lower than the medianIncome we would see the lower and upper bounds of our range increase. So, while a medianIncome job might be saved for 1–3 months we would expect that an individual with an income of significantly less might be able to stay on payroll for an even longer period of time.

We can take this idea further by exploring data according to the unique values of theLoanRange feature.

Our final data frame shows us the upper and lower bounds of the cost range for each single JobsRetained in a given LoanRange. Here, a fascinating insight is revealed to us when we visualize our result:

We see that the upper bound of potential cost for saving a job through a loan in the range of $350k-$1mm is high compared other ranges.

Final Thoughts (for now)

The efficiency of loans issued in the $350k-$1mm range have an unclear economic impact on jobs savings which requires further explanation.

Loans made in the $150k-$350k range appear to be the most efficient at saving jobs given their relative affordability. What is the optimal distribution of loans to this segment to maximize the number of jobs saved?

The cost range of saving a job with a loan size between $5mm-$10mm is about 2x that of a job saved using a loan size between $150k-$350k. How do we assess and compare the benefits of saving jobs that are in different LoanRange?

End

Thanks for following me this far! If you have any recommendations on how to improve my approach, please share in the comments below. Stay tuned for future analyses!

Looking at the first round of 2020 PPP loans

Written by Zach Mueller

No responses yet