Tuesday, April 16, 2019

Using Excel for Large Data Sets

I've used Excel for many, many years and know a whole lot about ins and outs of the program. I'm also an expert on Googling, and can find answers to a whole lot of problems within seconds. Interestingly, these two different worlds often combine and help me to understand the limitations of a software like Excel for large data sets.

This problem came up a few times in my analysis of PCGS and the population report of US pennies. At first, I tried to collect the data from 1792 to present, but Excel was running extremely slowly and wouldn't even begin to graph or analyze trends from roughly 20,000 data points. I would run into a variety of errors, and so decided to cut down my data to just the wheat pennies in the PCGS population report of their penny grading. As we will see, this wasn't enough of a reduction. I still have around 3,000 data points from 400 subjects, and Excel maxes out with it's graphical analysis around 150 data points.

My first question dealt with the correlation between the mintage number and the number of pennies graded. The most rare coins are obviously the most sought after and graded percentage wise, but I wondered if there was a general relationship between these two variables. As you can see if you explore my data set and the visuals at the bottom of the document linked below, as a general rule, roughly 5,000,000 more pennies minted means that on average, around 100 more pennies are graded by PCGS.

Another question that I had was relating the mintage number to the year. I assumed that this was fairly highly correlated, but was also aware that many years, including 1921 and 1936 there were fewer pennies made than even 1909, so I wondered if the general trend was consistent. As can be seen in the graph and correlation calculation at the bottom of the document, roughly 53% of the variation in mintage can be explained by the year. Which is still pretty significant.

A final question that I explored was the relationship between the year and the total number of pennies graded by PCGS in that year. The answer was fairly surprising, there was no direct relationship (up to 10% variation) between these variables. That is to say, just because a penny is older than another doesn't necessarily mean it has more or less chance of being graded by PCGS. These are fairly independent variables. However, showing this graphically in Excel is impossible, as showing this trend would require roughly 400 different data points on either axis, which is impossible according to the variety of Excel boards that I read.

Anyway, this has been interesting. I much prefer using other software to analyze large data, but for now all that I have available is Excel. Maybe someday I'll do this analysis again in a more conducive environment for large data sets.

Here is a link to my Excel Document. Note that some functionality may be broken if you are not using an updated version or have the statistical toolbar enabled/installed.

https://drive.google.com/open?id=1axGsr4hjR9xPq15kpbDXviVYKU6d2jQU

No comments:

Post a Comment