This blog entry was originally written by Patrick Florer; I am just migrating the post to the new SIRA site.
(this is the second of three posts)
Does the shape of a PERT distribution match up to an actual data set?
At the beginning of this series (Part 1), I raised the question of whether the PERT function is valid and whether the distributions it creates correspond to anything in actual data.
What follows are the results of an attempt to answer this question using a small data set extracted from a Ponemon Institute report called “Compliance Cost Associated with the Storage of Unstructured Information”, sponsored by Novell and published in May 2011. I selected this report because, starting on page 14, all of the raw data are presented in tabular format. As an aside, this is the first report I have come across that publishes the raw data. Please take note, Verizon, if you are reading this!
Here is a histogram of the 94 actual observations, created using the standard functionality in Excel (Data\Data Analysis\Histogram) and tweaked a bit to show probability instead of frequency.
As you can see, the histogram suggests a positively skewed distribution, with some exceptions: there are several peaks and valleys. What these peaks and valleys mean is unclear. They could simply reflect missing observations, since the study was small (N = 94 organizations), or they could be real. Only more observations would tell us.
At this point, I asked myself: what if the Ponemon study had captured and published minimum, maximum, and most likely values instead of single point estimates? If it had, then we could have constructed a more informative histogram.
In an attempt to simulate what things might have looked like, I took the Ponemon study raw data, computed minimum and maximum values for each of the 94 data points, and then ran a Monte Carlo simulation, using the following parameters:
Most Likely = the actual reported cost estimate provided by the report.
Min = Most Likely x a random number between 0 and 1
Max = Most Likely x (1 + a random number between 0 and 1)
gamma/lambda (the shape parameter) was set to 4 for all 94 functions.
Since true minimum and maximum values were not reported by the study, I decided that using a random number as a multiplier to calculate both the minimum and the maximum values seemed as defensible as anything else for the purpose of my simulation.
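For readers who want to try something similar outside of Excel, here is a minimal Python sketch of the procedure described above. It is not the spreadsheet I used. It assumes the common BetaPERT parameterization (a Beta distribution rescaled to the range between Min and Max, with shape parameters driven by gamma/lambda), and it reads the Max rule as Most Likely x (1 + a random number between 0 and 1). The function names and the sample costs in the usage note are hypothetical placeholders, not the Ponemon data.

```python
import random

def pert_sample(low, mode, high, lam=4.0, rng=random):
    """Draw one value from a BetaPERT(low, mode, high) distribution.

    Assumes the usual parameterization: a Beta(alpha, beta) variate
    rescaled from [0, 1] to [low, high].
    """
    span = high - low
    if span == 0:
        return mode  # degenerate case: no spread at all
    alpha = 1.0 + lam * (mode - low) / span
    beta = 1.0 + lam * (high - mode) / span
    return low + span * rng.betavariate(alpha, beta)

def simulate(most_likely_values, iterations=10_000, seed=0):
    """Pool `iterations` BetaPERT draws for each reported cost estimate."""
    rng = random.Random(seed)
    draws = []
    for ml in most_likely_values:
        low = ml * rng.random()            # Min = Most Likely x U(0, 1)
        high = ml * (1.0 + rng.random())   # Max = Most Likely x (1 + U(0, 1)), assumed reading
        draws.extend(
            pert_sample(low, ml, high, lam=4.0, rng=rng)
            for _ in range(iterations)
        )
    return draws
```

Calling `simulate([100.0, 250.0, 1000.0], iterations=10_000)` with these made-up costs returns 30,000 pooled draws. With the 94 reported estimates it would return the 940,000 values discussed next. Note that Min and Max are drawn once per data point, not once per iteration.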
I then ran 10,000 iterations of Monte Carlo simulation for each of the 94 BetaPERT functions, which resulted in 940,000 total estimates. Using 940,000 data points, standard functionality in Excel (Data\Data Analysis\Histogram), and a tweak to show probability instead of frequency, I created the following histogram:
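The "probability instead of frequency" tweak amounts to dividing each bin count by the total number of observations. A rough Python sketch of that idea follows. The function name and the equal-width binning are my assumptions here, and the way values on a bin edge are assigned may differ from what Excel's Histogram tool does.

```python
def probability_histogram(values, bin_count=20):
    """Return (bin_edges, probabilities) for equal-width bins.

    Each probability is the bin count divided by the number of values,
    so the probabilities sum to 1.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count or 1.0  # avoid zero width if all values are equal
    counts = [0] * bin_count
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), bin_count - 1)
        counts[idx] += 1
    total = len(values)
    edges = [lo + i * width for i in range(bin_count + 1)]
    probs = [c / total for c in counts]
    return edges, probs
```

The same function works for the 94 raw observations and for the 940,000 simulated values. Only the input list changes.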
This histogram is even more suggestive of a positively-skewed distribution.
But the same questions remain: Are the peaks and valleys the result of missing observations, or are they real? How well would a BetaPERT function predict the shape of this histogram? How well would any other probability function perform, for that matter? And, perhaps most importantly, what, if anything, can we extrapolate from this data set about other compliance cost data sets?
So, it was time for another experiment or two!
To be continued …