Every year around St. Patrick’s Day, NCAA Division I men’s basketball holds a single-elimination tournament to select a national champion. It’s more commonly known as the “Final Four” or “March Madness.” For math aficionados, it provides a wealth of data to analyze.
Since 1985, the tournament has started with 64 teams entering the First Round. (Since 2001, there have been Opening Round games prior to the First Round, but we’re going to focus on the field of 64.) The 64 teams are divided into four regions (usually East, West, Midwest, and South). In each region of 16 teams, the teams are seeded, or ranked, from 1 to 16, with #1 considered the top seed.
There are 32 games in the First Round. In each of the four regions, the games are:
#1 seed plays #16 seed
#8 plays #9
#5 plays #12
#4 plays #13
#6 plays #11
#3 plays #14
#7 plays #10
#2 plays #15
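The pairings above follow a simple pattern: the two seeds in every game add up to 17. Here is a minimal Python sketch of that rule (Python is used purely for illustration; `FAVORED_SEEDS`, `first_round_pairings`, and `games_in_first_round` are hypothetical names, not part of any Mathcad worksheet):

```python
# Favored seeds in the order the games are listed above.
FAVORED_SEEDS = [1, 8, 5, 4, 6, 3, 7, 2]


def first_round_pairings():
    """Return (favorite, underdog) pairs; the two seeds always sum to 17."""
    return [(seed, 17 - seed) for seed in FAVORED_SEEDS]


def games_in_first_round(regions=4):
    """Eight games per region, four regions by default."""
    return regions * len(FAVORED_SEEDS)


pairings = first_round_pairings()
total_games = games_in_first_round()
```

With four regions, `total_games` comes out to the 32 First Round games mentioned above.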
With such a rich set of data, I decided to analyze First Round upsets. Since the tournament is single elimination, an early upset could have devastating consequences for predictions in later rounds. I decided to look back at the First Round games from 1986 through 2017: 32 years of 32 games, for a total of 1,024 data points.
I’ve always maintained anecdotally that the most likely upsets have been #9 beating #8, and #12 beating #5. Let me check my assumptions using PTC Mathcad.
First, I created a spreadsheet where I marked each First Round upset with a U:
Then I read the information into PTC Mathcad and performed a basic analysis of the upsets:
Historically, there have been about 8 upsets in the First Round every year, or 25.2% of the 32 games. That’s actually higher than I expected.
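For anyone following along outside Mathcad, the same basic tally can be sketched in Python. This is only an illustration under assumptions: the layout (a dictionary mapping each year to a list of 32 cells, with "U" marking an upset) and the name `upset_summary` are hypothetical, and the data below is a made-up placeholder, not the real tournament results.

```python
def upset_summary(marks_by_year):
    """Summarize upsets, assuming each year maps to a list of "U" / "" cells."""
    total_games = sum(len(marks) for marks in marks_by_year.values())
    total_upsets = sum(marks.count("U") for marks in marks_by_year.values())
    return {
        "upsets": total_upsets,
        "games": total_games,
        "fraction": total_upsets / total_games,
        "per_year": total_upsets / len(marks_by_year),
    }


# Placeholder data (NOT real results): two "years" of 32 games each.
toy_marks = {
    "year A": ["U"] * 8 + [""] * 24,
    "year B": ["U"] * 6 + [""] * 26,
}
summary = upset_summary(toy_marks)
```

As a sanity check on the real figures, 25.2% of 1,024 games works out to roughly 258 upsets, or just over 8 per year.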
Next, let’s write a program to count the upsets for each year:
Some built-in functions allow us to determine that the maximum number of upsets was 13 in 2016, and the minimum was 3 in 2000:
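A rough Python equivalent of the per-year count and the max/min lookup might look like the following. It is a sketch rather than the Mathcad program itself: the function names are hypothetical, the year-to-list layout is an assumption, and the data is a made-up placeholder.

```python
def upsets_per_year(marks_by_year):
    """Count the "U" cells for each year."""
    return {year: marks.count("U") for year, marks in marks_by_year.items()}


def extremes(counts):
    """Return ((year, most upsets), (year, fewest upsets))."""
    most = max(counts, key=counts.get)
    fewest = min(counts, key=counts.get)
    return (most, counts[most]), (fewest, counts[fewest])


# Placeholder data (NOT real results): three "years" of 32 games each.
toy_marks = {
    "year A": ["U"] * 5 + [""] * 27,
    "year B": ["U"] * 2 + [""] * 30,
    "year C": ["U"] * 9 + [""] * 23,
}
counts = upsets_per_year(toy_marks)
highest, lowest = extremes(counts)
```

Python’s built-in `max` and `min` with a `key` argument play the role of the built-in functions mentioned above.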
We’ll plot the data to see any fluctuations:
Oddly, the number of upsets generally appears to ping-pong between high and low years. It’ll be interesting to see if there is an unusually high number of upsets in 2018.
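One rough way to put a number on that ping-pong impression is to count how often the year-to-year change reverses direction. The sketch below is an assumption-laden illustration, not something taken from the worksheet: `direction_flips` is a hypothetical helper, and the sample counts are invented.

```python
def direction_flips(counts):
    """Return (flips, opportunities) for a sequence of yearly upset counts.

    A flip is a rise followed by a fall, or a fall followed by a rise.
    Years with no change are ignored.
    """
    changes = [b - a for a, b in zip(counts, counts[1:])]
    changes = [c for c in changes if c != 0]
    flips = sum(1 for a, b in zip(changes, changes[1:]) if (a > 0) != (b > 0))
    return flips, max(len(changes) - 1, 0)


# Invented sequences (NOT real results).
zigzag = direction_flips([8, 5, 9, 6, 10])
steady = direction_flips([3, 4, 5, 6])
```

A sequence that flips at nearly every opportunity is ping-ponging; one that rarely flips is trending.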
I’m itching to test my hypothesis regarding which seeds are most likely to experience upsets. Again, I’ll write a program, but this one will need a nested for-loop, since there are four games each year per seed:
The program generates a matrix listing the seed, the total number of upsets, the fraction of upsets for that seed (upsets divided by 128 games), and upsets per year (upsets divided by 32 years):
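In Python, the same nested-loop idea could be sketched as follows. Everything here is an assumption made for illustration: the layout `results[year][region][favored_seed]`, the name `upsets_by_seed`, and the placeholder data, which is invented and is not the real tournament record.

```python
FAVORED_SEEDS = [1, 2, 3, 4, 5, 6, 7, 8]


def upsets_by_seed(results):
    """Build rows of (seed, upsets, upsets per game, upsets per year).

    results[year][region][favored_seed] is "U" for an upset; a missing
    key means the favorite won.
    """
    n_years = len(results)
    rows = []
    for seed in FAVORED_SEEDS:
        upsets = 0
        games = 0
        for regions in results.values():      # loop over years
            for marks in regions.values():    # loop over that year's regions
                games += 1
                if marks.get(seed) == "U":
                    upsets += 1
        rows.append((seed, upsets, upsets / games, upsets / n_years))
    return rows


# Placeholder data (NOT real results): two "years", four regions each.
toy_results = {
    "year A": {"East": {8: "U", 5: "U"}, "West": {8: "U"}, "Midwest": {}, "South": {5: "U"}},
    "year B": {"East": {}, "West": {8: "U"}, "Midwest": {5: "U"}, "South": {}},
}
rows = upsets_by_seed(toy_results)
```

With 32 years and four regions, `games` would reach 128 for every seed, which matches the divisor described above.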
Conclusions about seed upsets:
Well, whaddaya know, my anecdotal hunches have a ring of truth.
We would expect that the four regions should balance out in terms of upsets. I’ll write a program similar to the previous one, with nested for-loops to count the upsets per region:
(A note about regions: The Southeast region was replaced by the South region in 1998, but made a reappearance in 2011. In this analysis, South and Southeast were collected together. Also in 2011, there was a Southwest region instead of a Midwest region. The 2011 Southwest results are included in the Midwest results.)
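The region grouping described in the note can be sketched in Python as a simple lookup applied inside the nested loops. As before, this is an illustration under assumptions: `REGION_GROUPS`, `upsets_by_region`, and the `results[year][region][favored_seed]` layout are hypothetical, and the data is an invented placeholder.

```python
from collections import Counter

# Fold renamed regions together, following the note above.
REGION_GROUPS = {
    "East": "East",
    "West": "West",
    "Midwest": "Midwest",
    "Southwest": "Midwest",
    "South": "South / Southeast",
    "Southeast": "South / Southeast",
}
GAMES_PER_REGION = 8  # eight First Round games per region per year


def upsets_by_region(results):
    """Build rows of (region group, upsets, fraction of games upset)."""
    upsets = Counter()
    games = Counter()
    for regions in results.values():             # loop over years
        for region, marks in regions.items():    # loop over that year's regions
            group = REGION_GROUPS[region]
            games[group] += GAMES_PER_REGION
            upsets[group] += sum(1 for mark in marks.values() if mark == "U")
    return [(group, upsets[group], upsets[group] / games[group]) for group in sorted(games)]


# Placeholder data (NOT real results): two "years" with different region names.
toy_results = {
    "year A": {"East": {8: "U"}, "West": {}, "Southwest": {5: "U", 6: "U"}, "Southeast": {7: "U"}},
    "year B": {"East": {}, "West": {8: "U"}, "Midwest": {5: "U"}, "South": {}},
}
region_rows = upsets_by_region(toy_results)
```

Keeping the grouping in one dictionary means a renamed region only has to be handled in one place.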
Once again, we’ll generate a matrix of results:
Wow! The East, West, and South / Southeast are about the same, but the Midwest has a disproportionately high number of upsets.
I’ve been following the NCAA tournament for years, but this is the first time I’ve taken advantage of the information that’s always been available to me. Analyzing the rich dataset allows me to follow the events with additional insight and clearer expectations for the outcomes. What assumptions can you confirm or dispel by analyzing your data with PTC Mathcad?
Want to check some championship hypotheses for yourself? Download PTC Mathcad Express for free.