Large Numbers Madness

For sports fans, it’s the most wonderful time of the year: the first week of the NCAA men’s basketball tournament!

The bracket for the 2021 NCAA men's basketball tournament
This year’s NCAA men’s basketball tournament bracket – click for a larger version

Last night, the qualifying teams were announced, along with their matchups in the first round. And here it is – click for a larger version that you can read.

That means that today, millions of people in the U.S. and around the world are carefully studying this bracket, trying to pick the winners of each game in each round, for fun and profit. And so a question that often comes up this time of year is this:

What are the chances of picking a perfect NCAA bracket?

That is, what is the probability of correctly predicting the winner of every single game, from the first first-round game through to the championship?

Seemingly simple games can produce absurdly large numbers quickly, as we saw earlier with a simple deck of cards. March Madness doesn’t produce numbers quite as absurd as the cards did, but they are still so large that it’s hard to wrap your mind around them.

The UMBC Retrievers celebrate after winning the first 16-vs-1 matchup in NCAA men's basketball tournament history
Miracles do happen: (16) UMBC 74, (1) Virginia 54 in 2018

Let’s ignore the play-in games on Wednesday night, since those exist only to confuse and annoy. That leaves 64 teams, which means that it will take 63 games to crown a champion. Assume that in each game, each team has a 50-50 chance of winning, meaning you can flip a coin to predict the winner. (This is not a realistic assumption, since, for example, 1-seeds have won 139 of their 140 games against 16-seeds – but since miracles can happen, let’s go for it.)

That means that the probability of getting every game right is 1 in 2^63, which is

1 in 9,223,372,036,854,775,808

That’s one correct choice out of 9 quintillion 223 quadrillion 372 trillion 36 billion 854 million 775 thousand 808 possible combinations.

Loki from the Avengers movie series
What kind of divine being would be annoying enough… oh, right. Never mind.

How absurdly huge is this? See below for the calculations, but in short, let me put it this way.

Let’s say that some really annoying divine being decided they wanted to cheat by getting humans to pick every possible bracket. They would have to kidnap every human on Earth – all eight billion of us – and endow us all with eternal life and super-speed to pick brackets.

Then they would magically transport us back to the year 1983, just in time to watch one of the most famous moments in college basketball history:

Then we would get to work – all eight billion of us – picking one possible NCAA bracket per second. No stopping to eat or sleep, just filling out March Madness brackets once per second, all day all night all year. It could have happened earlier, but in the worst case it would be just about now – after about 38 years of work – that someone, somewhere would get a perfect bracket. And that’s just for this year; picking next year’s bracket could keep us busy until the year 2060.

But take heart – you don’t have to get them all right to win your office pool. As they say, when you and your friend are being chased by a bear, you don’t have to outrun the bear. You just have to outrun your friend.

Replace the bear in that analogy with the laws of probability, and good luck with your picks this year!

The (highly approximate) calculations

  • There are 2^63 possible combinations, which is about 9 x 10^18
  • There are 8 billion people on Earth (8 x 10^9)
  • So everyone picking one bracket per second is 8 x 10^9 brackets per second
  • There are about 30 million seconds in a year (3 x 10^7)
  • So every human picking one bracket per second for a year multiplies to about 2.4 x 10^17 brackets
  • 9 x 10^18 brackets divided by 2.4 x 10^17 brackets per year is about 38 years
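
The arithmetic above is easy to check in a few lines of Python. This is only a sketch using the same rounded inputs as the list (8 billion people, 30 million seconds per year); the variable names are mine, chosen for illustration.

```python
# Back-of-the-envelope check of the bracket arithmetic, using the rounded
# figures from the list above (approximations, not exact values).
GAMES = 63                  # 64 teams, single elimination
PEOPLE = 8e9                # roughly 8 billion people on Earth
SECONDS_PER_YEAR = 3e7      # roughly 30 million seconds in a year

brackets = 2 ** GAMES       # each game has two possible winners
brackets_per_year = PEOPLE * SECONDS_PER_YEAR  # one bracket per person per second
years = brackets / brackets_per_year

print(f"{brackets:,} possible brackets")
print(f"{brackets_per_year:.1e} brackets per year")
print(f"about {years:.0f} years to try them all")
```

Using the exact value of 2^63 instead of the rounded 9 x 10^18 barely moves the answer, which still comes out at about 38 years.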

Dealing with large numbers

Take a deck of cards. Give it a good shuffle, and then lay the cards out one by one, like the picture below. Here’s the question:

How many possibilities are there for which cards will go where? And how many times will you likely have to deal to get the same arrangement of cards again?

Fortunately my spouse is a huge nerd and has Lord of the Rings playing cards

To deal a pack of cards, start with the first card. There are 52 possibilities; I got the 7 of spades. Once the first card is dealt, you have 51 cards left, so there are 51 possibilities for what the next card will be. The next card could be one of 50 possibilities, the one after that could be one of 49, and so on – until the last card, which can only have one possibility. To find the total number of possible orderings, multiply out all the possibilities to get:

52 x 51 x 50 x 49 x ………… x 3 x 2 x 1

Math has a special name for this operation: factorial, symbolized with an exclamation point (!). The factorial of a number is equal to that number times the previous number, times the number before that, and so on all the way down to 1. The exclamation mark is an appropriate symbol, because the factorial can get shockingly big shockingly quickly.

The first few:

1! = 1
2! = 2 x 1 = 2
3! = 3 x 2 x 1 = 6
4! = 4 x 3 x 2 x 1 = 24
5! = 5 x 4 x 3 x 2 x 1 = 120
6! = 6 x 5 x 4 x 3 x 2 x 1 = 720
7! = 7 x 6 x 5 x 4 x 3 x 2 x 1 = 5,040
8! = 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 40,320

That means that if you deal seven cards, you have 5,040 possibilities for their order. If you deal eight cards, you have 40,320 possibilities for their order.

What about for the full 52 cards – what does 52! equal?

52! = 80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000

That means that when you shuffle and deal a deck of cards, there are 80 unvigintillion 658 vigintillion 175 novemdecillion 170 octodecillion 943 septendecillion 878 sexdecillion 571 quindecillion 660 quattuordecillion 636 tredecillion 856 duodecillion 403 undecillion 766 decillion 975 nonillion 289 octillion 505 septillion 440 sextillion 883 quintillion 277 quadrillion 824 trillion different possibilities for how the cards could fall.
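
If you would like to check these numbers yourself, Python integers have arbitrary precision, so the exact value of 52! can be computed directly. A minimal sketch using the standard library's `math.factorial`:

```python
import math

# n! counts the possible orderings of n cards.
for n in range(1, 9):
    print(f"{n}! = {math.factorial(n):,}")

deck_orderings = math.factorial(52)
print(f"52! = {deck_orderings:,}")
print(f"52! is about {deck_orderings:.2e}")          # scientific notation
print(f"52! has {len(str(deck_orderings))} digits")
```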

How big a number is that? It is unimaginably, unfathomably, amazingly huge. To call this number “astronomical” would be an insult to astronomy.

To get a sense for exactly how huge this number is, let’s consider the second half of the question I asked above: how many times would you have to deal before you could expect to see the same arrangement of cards again?

Because the cards are shuffled anew each time, every deal is independent of those that came before and after, and thus there is no definitive way of predicting the answer. It could happen again pretty soon, or it could never happen again for the rest of the history of the Universe.

But we can get some insight into the question by considering another random chance event that is much, much more likely: rolling a die. On average, how many times do you have to roll a six-sided die before you might expect to roll a 6?

A table of numbers showing the number of times I had to roll a 6-sided die to get a 6
How many times did I need to roll a 6-sided die to get a 6?
Yes, I really did roll a die this many times. I am committed to my art.

If you know enough statistics to know that the number of rolls needed follows a geometric distribution (and if you know why the answer depends somewhat on what you mean by “you might expect”), you can skip to the answer – but it’s always more fun to figure it out. So I did an experiment. I rolled a die, and kept rolling until I got a 6. I wrote down the number of rolls that it took. And then I did it again, and again, for a total of 200 times.

Over those 200 trials, the average (mean) number of rolls required to get a six was 6 rolls. I tried again with rolling an 8-sided die, and over 50 trials it took on average 8 rolls to get an 8. And it sort of makes sense that the number of trials you need to get a specific outcome should be roughly equal to the number of possible outcomes.
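
If you would rather not roll a die hundreds of times, a simulation does the same job. The sketch below is not my original experiment, just a stand-in for it using Python's `random` module; the function names and the fixed seed are my own choices. For a geometric distribution with success probability p, the mean number of trials is 1/p, so the averages should settle near 6 and 8.

```python
import random

def rolls_until(target, sides, rng):
    """Roll a fair die with `sides` faces until `target` comes up; return the roll count."""
    count = 1
    while rng.randint(1, sides) != target:
        count += 1
    return count

def mean_rolls(sides, trials, seed=0):
    """Average number of rolls needed to see the highest face, over `trials` runs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(rolls_until(sides, sides, rng) for _ in range(trials))
    return total / trials

print(mean_rolls(sides=6, trials=200))      # a small experiment, like the one above
print(mean_rolls(sides=6, trials=100_000))  # many more trials, 6-sided die
print(mean_rolls(sides=8, trials=100_000))  # many more trials, 8-sided die
```

With only 200 trials the average can wander a fair bit from 6; with 100,000 trials it stays very close.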

Which means that for dealing cards, where there are 80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000 possible outcomes, you should expect to have to deal out the deck 80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000 times before getting the cards in the same order again.

How many times is that? Here’s a way of thinking about how big that number is, adapted from an essay by fellow data scientist Scott Czep.

Stand in the middle of Times Square in New York City. Deal out a deck of cards and note the arrangement. Put the deck back together, shuffle, and take a step. Repeat: deal, shuffle, step. Deal, shuffle, step. Continue until you are standing on the beach in Los Angeles, California. But you’re not done yet. Far from it. Turn around and deal, shuffle, step, all the way back to New York. Then turn around and do the same thing: deal, shuffle, step back to LA.

Now take an eyedropper and remove one drop of water from the Pacific Ocean. You’re there anyway. Turn around and deal, shuffle, step back to New York, then back to LA. Then remove another drop from the Pacific Ocean.

Continue doing this until the Pacific Ocean is empty.

You’re still not done, though. Magically refill the entire Pacific Ocean, and then remove a pebble from the top of Mount Everest. Turn around and deal, shuffle, step back to New York. Deal, shuffle, step, New York, deal, shuffle, step, LA, drop, New York, LA, drop, New York, LA, drop, ocean empty, pebble off Mount Everest.

Continue until you have removed all of Mount Everest.

Congratulations! You have removed Mount Everest once, after having emptied the Pacific one drop at a time nearly six million billion times, while walking back and forth between New York and LA between drops, dealing and shuffling cards each time. To celebrate your accomplishment, lay a piece of paper on the ground. Magically refill the Pacific, magically rebuild Mount Everest all the way to the top, and repeat.

Deal, shuffle, step, New York, LA, drop, deal, shuffle, step, New York, LA, drop, ocean empty, pebble, deal, shuffle, step, New York, LA, drop, ocean empty, Mount Everest gone, add a sheet of paper to the stack, repeat, keep stacking.

Continue until the stack of papers reaches the Sun.

After all this – countless pieces of paper on the stack, countless countless removals of Everest, countless countless countless emptyings of the Pacific, and countless countless countless countless hands of cards, are you done? No.

Repeat this entire process 66 more times. Sometime during your 67th trip, you reach the number of deals at which you might expect to see the same layout you saw on your first deal, back in New York so long ago.

Think about this next time you see a pack of cards: enormous complexity can hide in the simplest things.

Postscript: The Numbers

Doing these calculations was way easier because Scott Czep did most of the work in his essay about 52 factorial; I just changed a few things. If you see any errors, they are mine rather than his, and please let me know in the comments! And definitely check out Scott’s essay and the rest of his blog.

  • 52! in scientific notation is about 8 x 10^67, so that is the number we are aiming for.
  • An average person has a stride length of a bit less than one meter, but let’s round up to 1 meter to make the calculations easier. It’s one hand of cards per step.
  • The distance from New York to Los Angeles is 4500 km, so back and forth is 9000 km = 9 x 10^7 meters. At one step per meter and one hand of cards per step, that’s 9 x 10^6 hands. (Note: we’ll ignore the initial New York to LA walk since it will make so little difference with the number of times we’ll have to do the round trip.)
    • Current total: 9 x 10^7 hands dealt
  • There are about 20 drops of water per milliliter
  • The Pacific Ocean contains 700 million cubic kilometers of water (7 x 10^8 km^3 = 7 x 10^23 mL = 1.4 x 10^25 drops)
    • Current total: 1.2 x 10^33 hands dealt
  • Assume a pebble weighs about 1 ounce.
  • Mount Everest weighs 357 trillion pounds = 5.7 x 10^15 ounces = 5.7 x 10^15 pebbles.
    • Current total: 6.8 x 10^48 hands dealt
  • A sheet of paper is about 0.1 mm thick
  • The distance from the Earth to the Sun is 150 billion km = 1.5 x 10^17 mm = 1.5 x 10^18 sheets of paper
    • Current total: 1.2 x 10^66 hands dealt
  • Repeat the entire cycle 66.7 times to get to the target of 8 x 10^67 hands

Kill moose and squirrel: Russians pretend to be Americans online

Boris, Natasha, and Fearless Leader from an episode of The Adventures of Rocky and Bullwinkle and Friends

“Justin” set his Twitter location as Austin, Texas, but his time zone was set to Moscow Standard Time.

When a Smart Data Science Friend (hi Scott!) shared this in October 2017, I knew that organizations in Russia had used social media to support Donald Trump’s presidential campaign, but I hadn’t realized the scale or effectiveness of their efforts. Looking back, we can see that “Justin” wasn’t alone. How many other Russians were out there on Twitter pretending to be Americans?

I’d like to find out.

Last Wednesday, I wrote about an article published on FiveThirtyEight.com: “Why We’re Sharing 3 Million Russian Troll Tweets.” As I said there, I downloaded the dataset that they generously made available on GitHub (technically, I forked the repository into my own GitHub space) and then loaded it into the SciServer online science suite. Let me know if you’d like to join this effort.

I used Python’s regular expression features to do a quick search through 1,849,687 English-language tweets in the FiveThirtyEight dataset. I looked for tweets that showed evidence of claiming to be Americans – featuring “I’m” or “I am” plus some form of “USA” or “America” or “American”. The screenshot below shows me running that command inside a Jupyter notebook in SciServer Compute:

Python commands using the re and pandas modules
Python commands to find tweets like “I’m American”
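
In case the screenshot is hard to read, here is a rough sketch of that kind of search. It is not the exact command from my notebook: the sample records, the field names `author` and `content`, and the pattern itself are assumptions made for illustration, and it uses only the `re` module on a small list rather than pandas on the full dataset.

```python
import re

# Hypothetical sample records, invented for this sketch; the real dataset
# has millions of rows and many more fields.
tweets = [
    {"author": "EXAMPLE_A", "content": "For I'm American! #TCOT"},
    {"author": "EXAMPLE_B", "content": "I am proud of the USA today"},
    {"author": "EXAMPLE_C", "content": "Traffic in Austin is terrible"},
    {"author": "EXAMPLE_A", "content": "America is beautiful this morning"},
]

# "I'm" (straight or curly apostrophe) or "I am", followed later in the
# same tweet by USA, America, or American. This pattern is a guess at a
# reasonable search, not the original one.
pattern = re.compile(r"\bI(?:['’]m| am)\b.*\b(?:USA|American?)\b", re.IGNORECASE)

matches = [t for t in tweets if pattern.search(t["content"])]
authors = {t["author"] for t in matches}

print(f"{len(matches)} matching tweets from {len(authors)} authors")
for t in matches:
    print(f'@{t["author"]}: {t["content"]}')
```

A pattern this loose catches plenty of tweets that mention America without claiming to be American, which is why the results still need to be read by hand.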

The search returned 177 tweets from 84 separate authors – counts that should in no way be considered scientific or used for any kind of analysis, either quantitative or qualitative. I then read through all 177 tweets and selected only those that unambiguously claimed to be American citizens/voters.

I was left with 29 tweets from 20 separate Twitter handles, covering the period from December 2014 to August 2017. (Of course, there is no reason to think that Russians impersonating Americans suddenly stopped in August 2017.) Here are five selected randomly:

  • @ISRAEL_WILLS on February 9, 2015: Hope everyone had a great day yesterday. I’m happy we don’t have a war on the American soil. Thank you to all the military serving today. 🙂
  • @JANI_S_JAC on July 4, 2015: #HappyIndependenceDay I’m a patriot and it’s sad for me to see what’s happening to America today
  • @TEN_GOP on February 3, 2016: If I were a dem I’d be embarrassed by who represents my party. But I could never be a dem. For I’m American! #TCOT
  • @TEN_GOP on March 25, 2016: ‘@COJeepGirl well b/c I’m American and Hussein is the President atm’
  • @JANSKEESTR on August 15, 2017: I voted for Trump because I knew he’s the only man who could save America from liberal degeneracy. I’m still sure that I made a right choice

It’s worth reiterating: all of these people are claiming — directly, unambiguously — to be American citizens. All of them are Russian. It is not illegal in most circumstances to claim that you are someone else online. Nor is it illegal for a foreign citizen to have an opinion on a U.S. election. But I find it profoundly disturbing that we are only now realizing the full extent of these Russian operations. The best we can do right now is to try to understand how these trolls have operated in the past in hopes of preventing similar incidents in the future.

One thing is clear: America has never been so ready for your Rocky & Bullwinkle references. Russian catfisher @TEN_GOP says it best:

Trump was a strong & fearless LEADER today. I’m proud to be an American.

P.S. Here is an Excel spreadsheet containing the tweets that I identified, if you’d like to play with the data yourself.

How do you influence an election?

One of the most fascinating stories of the 2016 U.S. Presidential election was the story of how a well-planned social media campaign based in Russia may (or may not?) have influenced the result.

There is now no doubt that this campaign existed, according to multiple reliable sources. And the fact that we had no idea at the time should make us very, very worried.

If we didn’t know it at the time, can we at least look back with hindsight and understand how it happened? That’s the idea behind a new analysis, published yesterday on my favorite source for news and analysis, FiveThirtyEight.com.

The article describes the research of two professors at Clemson University, Darren Linvill and Patrick Warren. They used Clemson’s Social Media Listening Center to recover tweets from 3,841 Twitter handles associated with the Internet Research Agency, the most prominent of the Russia-based organizations accused of creating fake accounts to influence the election. Their dataset covers the period from June 2015 to December 2017, and includes nearly three million tweets.

The result of the two researchers’ work is a preprint called “Troll Factories: The Internet Research Agency and State-Sponsored Agenda Building,” currently undergoing peer review (PDF available on Warren’s website).

The image below, from the FiveThirtyEight article, shows how the number of tweets from these accounts varies with time.

Chart from the FiveThirtyEight article showing the number of tweets from these accounts over time

The best part of all this is that Linvill and Warren have worked with FiveThirtyEight to publish their entire dataset online through FiveThirtyEight’s GitHub account. And I have uploaded their dataset into the SciServer online science platform. If you’re interested in looking at this data with me, send me an email.

Of course, a dataset is only as useful as the questions that you ask of it. So what can we learn from this one? I have no interest in questions that reduce to “lol Trump voters are stupid” – that is neither useful nor even true. What questions will give us insights into how social media can influence public perception? And what questions will give us insights into how to make sure this doesn’t happen again in the 2018 elections?

Here are a few questions off the top of my head:

  • How did the topics discussed by these troll accounts change after Trump won the election?
  • What strategies did the trolls employ when talking to Democrats?
  • If we identify a control sample of accounts who are genuine Trump supporters (or genuine Black Lives Matter activists, etc.) and blindly run a content analysis, can we tell the difference? If so, how?

What research questions occur to you?