An Unexpected Border

The world’s most unexpected border: these two countries seem to be a world apart, but high in the Hindu Kush mountains is a curvy 50-mile border between Afghanistan and China.

A screenshot of Google Earth showing the Afghanistan-China border

The only sane place to cross is the illegal unmarked border at Wakhjir Pass (15,780 feet), used only by the occasional drug smuggler:


This is also the largest time zone jump in the world (excluding the International Date Line) – cross into China at noon, and suddenly it’s 3:30 PM.
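For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes fixed offsets of UTC+4:30 for Afghanistan and UTC+8 for China, which are not stated above, and the variable names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets from UTC (an assumption of this sketch, not from the post).
AFGHANISTAN = timezone(timedelta(hours=4, minutes=30))
CHINA = timezone(timedelta(hours=8))

# Noon on the Afghan side of the pass; the date itself is arbitrary.
noon_afghan = datetime(2018, 8, 1, 12, 0, tzinfo=AFGHANISTAN)

# The same instant, read off a clock on the Chinese side.
same_moment_china = noon_afghan.astimezone(CHINA)

# Size of the jump between the two clocks.
jump = CHINA.utcoffset(None) - AFGHANISTAN.utcoffset(None)

print(same_moment_china.strftime("%H:%M"), jump)
```

Under those assumed offsets the two clocks differ by three and a half hours, which matches the noon-to-3:30 jump.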

Russia doesn’t care

Boris, Natasha, and Fearless Leader from an episode of The Adventures of Rocky and Bullwinkle and Friends

Another news day, another insight into the role of social media in the 2016 election cycle. I’ve written before about the role that fake social media accounts originating in Russia played in providing support for Donald Trump and other candidates. Unsurprisingly – at least in retrospect – those social media efforts went beyond support for specific candidates.

In a new study published today (Broniatowski et al., 2018) and reported on by many reliable media outlets, researchers at George Washington University studied three years’ worth of tweets containing keywords related to vaccines.

The reason that vaccines are important is that there is an ongoing debate in the U.S. about whether vaccines cause autism (they don’t), whether the initial study saying they did was a deliberate fraud (it was), and whether parents should vaccinate their children (they should). I have a lot of sympathy for parents who are reluctant about vaccinating their children — injecting your children with known disease-causing agents is undeniably creepy. But it works, to protect them and to protect other children too.

I haven’t read the full paper yet, but the bottom line is that during the period from 2014 to 2017 (and of course continuing through to today), Russian agents used (and continue to use) the vaccine debate to shift how Americans talk about social and political issues. But there is a fascinating difference between the candidate-based study I wrote about before and this one:

When discussing vaccines, the Russian trolls took both sides in the debate.

They didn’t care.

This at last provides some insight into why the Russians have invested time and energy into running social media campaigns in the U.S. They are seeking to divide our country, because a divided country is a weak country.

Don’t let them do it. Find someone who disagrees with you and talk to them, right now.

The Flag Test

A man holds the flag of Nepal at a parade

If there’s one thing I am obsessed with – other than democratizing science, learning from data, evidence-based practice, people who aren’t what they seem, exploring the world with Google Earth, and Australian Rules Football – it’s flags.

A flag is a symbol of a group of people, a source of pride, and looks beautiful waving in the breeze. A flag represents all the best and worst impulses of humanity. When they all come together, like at the United Nations, you get a real sense of how we all come together as a world.

I’m not alone; there is a large and somewhat nerdy community of flag lovers called vexillophiles, from the Latin for “lover of flags.” We have an unofficial website, multiple YouTube channels, and an international organization (which of course has its own flag).

In my interactions with fellow vexillophiles, I have discovered an informal, tongue-in-cheek personality test, which is:

What do you think of the flag of Nepal? Is it kind of cool, or is it an Abomination Unto Flags?


The flag of Nepal, shown in the photo above and the illustration here, is the world’s only non-rectangular national flag (although there are some sub-national non-rectangular flags, most famously those of the state of Ohio and the city of Tampa, Florida). It consists of two red triangles, the bottom one slightly larger. The top triangle has a stylized, symbolic drawing of the moon, and the bottom has a similar drawing of the sun. Both triangles are outlined with a thin blue border.

To be in the first category, you don’t have to like the flag; it’s enough to say, “That’s kind of cool, I respect what they were going for there.” Being in the second category requires strong opinions about what does or does not constitute a “real” flag.

Your opinion on the Nepal flag correlates with many other things, especially your opinion about whether people should get off your lawn. It’s not an absolute predictor, and of course correlation is not causation, but it’s an effect that I have noticed.

Zero guesses which side is the “get off my lawn” side, and zero guesses which side I’m on.

(Spoiler alert, highlight to reveal: It’s cool.)

Nepal flag photo: Hom Lamsal, Nepal Republic Media
Nepal flag illustration: Wikimedia users Pumbaa80 and Achim1999.

Except they weren’t: Iron Eyes Cody

Except they weren’t: An occasional series about people and things which are Not What They Seem

A middle-aged Iron Eyes Cody, dressed in a traditional Native American cloak and with a feather on his head, in an undated publicity photo

To three generations of movie fans, Iron Eyes Cody was THE Hollywood Indian. He was born in Oklahoma in 1904 to a Cherokee father and a Cree mother. He spent his youth performing in traveling Wild West shows, where he taught himself the sign languages of other Nations. In 1924, he moved to California, and within two years was appearing as an uncredited extra in Hollywood.

His career took off from there, and he eventually appeared in more than 200 films and TV series, particularly Westerns. He played in films with A-list actors like John Wayne (The Big Trail in 1930) and Richard Harris (A Man Called Horse in 1970). But his most famous role came at age 65 in a Public Service Announcement TV commercial that was an early advocate for the environmental conservation movement. It’s horribly dated now, but it had a real impact on changing public attitudes:

Cody wrote an autobiography, died in 1999 at age 94, and is buried in the “cemetery of the stars,” Hollywood Forever Cemetery. He is in the mausoleum with his beloved wife Bertha, not far from stars like Victor Fleming, James Garner, and Marilyn Monroe.

Over nearly 70 years, Iron Eyes Cody’s career perfectly traced America’s changing attitudes toward the people known first as Indians, then as American Indians, then as Native Americans – all the while staying true to his heritage as a Native American.

Except he wasn’t.

He was born as Espera Oscar de Corti in small-town Louisiana, the son of two immigrants from Sicily who ran the town grocery store. He moved to California at 19, where he used his dark skin, talent for telling a good story, and genuine acting talent to score a long and successful career as an actor.

The truth began to come out in 1996, when his half-sister gave an interview to the New Orleans Times-Picayune newspaper. De Corti/Cody denied the rumor, but it was officially confirmed after his death.

What do we make of his story? Was this the worst kind of cultural appropriation, the story of a white man literally taking on a fake Native American identity? Was it a well-meaning fib that had a happy ending and actually did some good? Did it start out for convenience, but then eventually de Corti managed to convince himself he really was Cody?

If it helps: he married a for-reals Native American woman, adopted two children from reservations in his fake-home state of Oklahoma, and spent much of his life advocating and fundraising for Native-led charities and causes.

Questions like these are why I find these except-they-weren’t stories so fascinating.

What do you think?

Perfect Part 2: I’m one of 360,000!

Did I mention that I was in the stands for Felix Hernandez’s perfect game?

Oh, I did? Cool.

On Wednesday, I asked a question that my friend Jon had asked me soon after that 2012 game: How many people alive today have seen a perfect Major League Baseball game in person?

In Wednesday’s post, I outlined an approach to solving this seemingly-difficult problem. To repeat:

  1. We have the total attendance for each game. (It varies from the 6,298 people who saw Catfish Hunter’s perfect game in 1968 to the 64,519 who saw Don Larsen’s in the 1956 World Series.)
  2. Assume that the percentage of people at each age and sex at the game was the same as the percentage of people at each age and sex in the U.S. as a whole. Again, this is almost certainly not true, but I’m not sure how to do better.
  3. All those people are older by the number of years that have passed since the game. Someone who saw David Wells’s 1998 perfect game at age 48 would be 68 today.
  4. Assume that anyone whose age today turns out to be greater than 76 (for men) or 81 (for women) has gone to the big game in the sky.
  5. Add up the number of people still alive who saw each game to get the total alive across all games. That is the answer to Jon’s question.
  6. Additional step! This extra step was suggested by my other friend Ed, who asked “What about the weirdos that have seen more than one perfect game?” Are there cases where adding up the attendance would double-count some people, and if so, how do we account for that?

I closed Wednesday’s post asking how to get the data we would need for this approach. We can get attendance figures from the Wikipedia article on MLB perfect games, and I suggested that we should be able to find demographic data (age and sex) from the U.S. Census. Did anyone think of how to find that data? No one commented, but that doesn’t mean you weren’t thinking about it.

But to jump to the reveal:

I got the U.S. population by age and sex from 1900 to 2000 from Demographic Trends in the 20th Century (table 5 on page A-7), and for 2010 from 2010 Census Briefs: Age and Sex Composition. Both are free to download, and use for any purpose, by following the links.

Graph of male and female population in the U.S. census from 1900 to 2010.  Population scale goes from 0 to 180 million. The numbers stay equal until about 1950, then the number of women begins to exceed the number of men.
From the census data: number of men (blue) and women (orange) living in the U.S. in each decade from 1900 to 2010. Click for a larger version.

BONUS QUESTION: Notice how the number of men and women is about 50/50 until 1950, and after that, women exceed men. Why is that?

Of course, the U.S. census is only done every 10 years, and perfect games occurred in interim years like 1965 and 2004. So to get the age/sex populations in those interim years, I interpolated between the once-a-decade censuses by assuming a linear population growth, with slope (P2 – P1) / 10 for each of the categories. Note that this assumes that populations shifted within categories only every 10 years, which is not a good assumption but is good enough for this quick analysis. For games in 2011 and 2012, I carried the 2000-2010 growth rate forward another two years.
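To make the interpolation concrete, here is a minimal sketch of the slope formula described above. The function name and the population counts are made up for illustration; they are not the real census figures.

```python
def interpolate_population(year, census_year, p1, p2):
    """Estimate a bin's population in `year` from two censuses ten years apart.

    p1 is the count at `census_year`; p2 is the count at `census_year + 10`.
    A year beyond `census_year + 10` simply continues along the same line,
    which is how the 2011 and 2012 games can be handled.
    """
    slope = (p2 - p1) / 10  # people gained (or lost) per year
    return p1 + slope * (year - census_year)


# Hypothetical counts for a single age/sex bin, for illustration only.
estimate_1965 = interpolate_population(1965, 1960, 9_000_000, 10_000_000)
estimate_2012 = interpolate_population(2012, 2000, 10_000_000, 11_000_000)

print(estimate_1965, estimate_2012)
```

The same function would be applied separately to every age/sex bin, since each bin has its own P1 and P2.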

The census doesn’t publish population counts for every age; instead, it reports a single count for all ages within five-year intervals (bins), starting with “0-4” and ending with “85+”. Within each age bin, it reports the number of men and the number of women. It also gives the total number of men and women overall, and the bottom-line U.S. population total.

From these numbers, I calculated the percentage of people in each age/sex bin in years in which perfect games took place. I then multiplied this number by the attendance of each perfect game to find the number of people expected in each age/sex bin (as predicted by our simple model in which spectators at a baseball game are a representative sample of the U.S. population).
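Here is a small sketch of that share-times-attendance step. The four-bin population table is invented to keep the example short (real census tables have many more bins), and the function name is my own; only the 64,519 attendance figure comes from the analysis.

```python
def expected_attendees(population_by_bin, attendance):
    """Split a crowd across age/sex bins in proportion to the population."""
    total_population = sum(population_by_bin.values())
    return {
        bin_key: attendance * count / total_population
        for bin_key, count in population_by_bin.items()
    }


# Made-up population counts (in millions) for four toy bins.
toy_population = {
    ("M", "0-4"): 10,
    ("F", "0-4"): 10,
    ("M", "5-9"): 15,
    ("F", "5-9"): 15,
}

# Attendance at Don Larsen's 1956 perfect game.
larsen_crowd = expected_attendees(toy_population, 64_519)

for bin_key, people in larsen_crowd.items():
    print(bin_key, round(people))
```

Because every bin is scaled by the same factor, the bins always add back up to the game's attendance.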

Then I calculated the number of years between that perfect game and today, and added that number to the ages in each bin. That shows us how old that game’s crowd would be today. I added up the counts in only the bins for people who are today 76 or younger (for men) or 81 or younger (for women).

So, calculate the percentage of people in each age/sex bin in each census. Multiply that percentage by the total number of people at each game to figure out how many members of each age/sex bin there were at the game. Increase the age of the population to today, and remove any bins that would be greater than 76/81 today. For example, Don Larsen’s perfect game took place 62 years ago, so I added up population counts for men who were then younger than (76 – 62 =) 14. That includes everyone from the age bins 0-4, 5-9, and 10-14. From a similar analysis of women, I added up everyone from age bins 0-4, 5-9, 10-14, 15-19. (In some cases the years overlapped bins – for example, Kenny Rogers’s perfect game in 1994 returns women 57 or younger, which overlaps the 55-59 age bin. So I added 2/5 of the count from that age bin.)
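The aging-and-cutoff step can be sketched like this. It is one possible reading of the procedure, not the exact spreadsheet behind the numbers: it treats the cutoff age as inclusive, and a different boundary convention would shift a partially covered bin by a fifth. The function names and the toy crowd are hypothetical.

```python
LIFESPAN = {"M": 76, "F": 81}  # average lifespans used in this analysis
TODAY = 2018
BIN_WIDTH = 5  # census bins cover five ages each, e.g. 10-14


def surviving_fraction(bin_start, sex, game_year):
    """Fraction of the bin (ages bin_start to bin_start + 4 at the game) kept."""
    # Oldest age at the game that is still at or under the lifespan today.
    cutoff = LIFESPAN[sex] - (TODAY - game_year)
    # Number of ages in this bin at or below the cutoff (assumed inclusive).
    ages_kept = cutoff - bin_start + 1
    return min(max(ages_kept, 0), BIN_WIDTH) / BIN_WIDTH


def survivors(attendees_by_bin, game_year):
    """attendees_by_bin maps (sex, bin_start) to expected attendees."""
    return sum(
        count * surviving_fraction(bin_start, sex, game_year)
        for (sex, bin_start), count in attendees_by_bin.items()
    )


# Toy crowd for the 1956 game: 100 people in each of a few bins.
toy_crowd = {("M", 0): 100, ("M", 10): 100, ("M", 15): 100,
             ("F", 15): 100, ("F", 20): 100}
larsen_survivors = survivors(toy_crowd, 1956)
print(larsen_survivors)
```

For the 1956 game this keeps the men’s bins through 10-14 and the women’s bins through 15-19, as in the Don Larsen example.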

Adding up all remaining age bins leaves the number of people who attended each perfect game who are probably still alive in 2018. For example, of the 64,519 people who attended Don Larsen’s perfect game (it was in the World Series!), an estimated 21,422 are still alive today. We’ve made so many assumptions that I wouldn’t trust that number to be anywhere near exact, so let’s report it as “about 21,000”.

Doing this for all perfect games and then adding up for the total gives 363,000. But then, there’s Ed’s question: were there some people who saw multiple perfect games, meaning that we’ve double-counted some people? Do we need to adjust our estimate down to make up for it?

If this were 2011, I’d say no. Until then, no two perfect games had happened in the same city in the same season. But then in 2012, both Philip Humber’s and Felix Hernandez’s perfect games took place at Safeco Field in Seattle. Thus, a Mariners 2012 season ticket holder would have seen metaphorical lightning strike twice – two perfect games in one stadium in one season.

So to account for this effect, we need to estimate the percentage of the crowd at a Major League Baseball game that are season ticket holders who attend every game (or at least attended two games, and were incredibly lucky in choosing the two games they attended). I have no idea how to estimate that percentage. But consider that we know for sure of two people who attended Hernandez’s perfect game but not Humber’s – me and my lovely spouse. And I know several Baltimore Orioles season ticket holders, and none of them has attended every single game this season. So, let’s pick an estimate that I think is on the high side of reasonable: 10 percent.

That means that to account for the double-counting, we need to subtract about 10% of the crowd for Hernandez’s perfect game because we suspect they had already seen Humber’s perfect game four months before. Ten percent of 21,889 is (rounding off to remind ourselves this is just an estimate) about 2,000. So subtract out 2,000 people from our preliminary estimate of 363,000 to leave 361,000 people who have seen one or more perfect games.

| Perfect game number | Pitcher | Date | People who saw the game | People who saw the game who are alive today |
|---|---|---|---|---|
| 1 | Lee Richmond | 6/12/1880 | unknown | 0 |
| 2 | John Montgomery Ward | 6/17/1880 | unknown | 0 |
| 3 | Cy Young | 5/5/1904 | 10,267 | 0 |
| 4 | Addie Joss | 10/2/1908 | 10,598 | 0 |
| 5 | Charlie Robertson | 4/30/1922 | 25,000 | 0 |
| 6 | Don Larsen | 10/8/1956 | 64,519 | 21,000 |
| 7 | Jim Bunning | 6/21/1964 | 32,026 | 14,000 |
| 8 | Sandy Koufax | 9/9/1965 | 29,139 | 13,000 |
| 9 | Catfish Hunter | 5/8/1968 | 6,298 | 3,000 |
| 10 | Len Barker | 5/15/1981 | 7,290 | 4,000 |
| 11 | Mike Witt | 9/30/1984 | 8,375 | 5,000 |
| 12 | Tom Browning | 9/16/1988 | 16,591 | 11,000 |
| 13 | Dennis Martinez | 7/28/1991 | 45,560 | 33,000 |
| 14 | Kenny Rogers | 7/28/1994 | 46,581 | 34,000 |
| 15 | David Wells | 5/17/1998 | 49,820 | 38,000 |
| 16 | David Cone | 7/18/1999 | 41,930 | 33,000 |
| 17 | Randy Johnson | 5/18/2004 | 23,381 | 19,000 |
| 18 | Mark Buehrle | 7/23/2009 | 28,036 | 24,000 |
| 19 | Dallas Braden | 5/9/2010 | 12,288 | 11,000 |
| 20 | Roy Halladay | 5/29/2010 | 25,086 | 22,000 |
| 21 | Philip Humber | 4/21/2012 | 22,472 | 20,000 |
| 22 | Matt Cain | 6/13/2012 | 42,298 | 38,000 |
| 23 | Felix Hernandez | 8/15/2012 | 21,889 | 20,000 |
| | (remove double-counts from Seattle 2012) | | -2,000 | -2,000 |
| | TOTAL | | 567,444 | 361,000 |

Again, there are a lot of potential sources of systematic error in this analysis, so I don’t think we can be confident enough in our estimates to go down to the level of 1,000 people either way. So again, let’s round to 360,000.
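For anyone who wants to retrace the bottom line, this sketch adds up the rounded per-game estimates, applies the 10 percent double-count guess, and rounds the result. The dictionary keys and variable names are my own labels; the numbers are the rounded estimates reported in this post.

```python
# Rounded estimates of attendees still alive, per game.
alive_by_game = {
    "Larsen 1956": 21_000, "Bunning 1964": 14_000, "Koufax 1965": 13_000,
    "Hunter 1968": 3_000, "Barker 1981": 4_000, "Witt 1984": 5_000,
    "Browning 1988": 11_000, "Martinez 1991": 33_000, "Rogers 1994": 34_000,
    "Wells 1998": 38_000, "Cone 1999": 33_000, "Johnson 2004": 19_000,
    "Buehrle 2009": 24_000, "Braden 2010": 11_000, "Halladay 2010": 22_000,
    "Humber 2012": 20_000, "Cain 2012": 38_000, "Hernandez 2012": 20_000,
}

preliminary = sum(alive_by_game.values())

# Guess: 10% of the Hernandez crowd had already seen Humber's game.
HERNANDEZ_ATTENDANCE = 21_889
REPEAT_SHARE = 0.10
double_counted = int(round(HERNANDEZ_ATTENDANCE * REPEAT_SHARE, -3))  # nearest 1,000

adjusted = preliminary - double_counted
final_estimate = round(adjusted, -4)  # nearest 10,000

print(preliminary, double_counted, adjusted, final_estimate)
```

Changing `REPEAT_SHARE` is a quick way to see how little the final rounded figure depends on that particular guess.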

Thus, our estimate shows that about 360,000 people alive today have seen a Major League Baseball perfect game. And I am one of them.

This has been a quick analysis of a small, self-contained question, but it showcases many features of the thought process that data scientists go through each day. I hope it’s been fun to follow along. The most important part is to always keep in the back of your mind: How might I be wrong?

With that in mind, then: how might this analysis be wrong? What assumptions did we make that we should not have made? What can we do to improve our estimates?

I’d love to hear your thoughts on this, no matter what your experience with math and science has been. The entire point of this blog is to bring the excitement of science to everyone.

Don’t make Eeyore sad: comment below with your thoughts!

Dank meme: A picture of Eeyore with the caption "No one commented"

Perfect Part 1: How many people alive today have seen a perfect MLB game?

Me in the stands at Safeco Field for Felix Hernandez's perfect game on August 15, 2012

Today marks the sixth anniversary of a baseball accomplishment: the most recent perfect game in Major League Baseball history. Since 1876, Major League Baseball teams have played 216,449 games, and only 23 have ended in perfect games (and here is a list of all of them). And I was there for the most recent one.

Which was nice.

A “perfect game” is when a pitcher goes an entire game without allowing any hits, walks, or errors – and thus no one from the opposing team even reaches first base. This photo shows me looking gleeful at the conclusion of the one six years ago. My lovely spouse and I were on vacation in Seattle, and decided to go watch a Seattle Mariners game. Or rather, I wanted to watch the game, so my lovely spouse came along and brought a book to read. (Did I mention my spouse is lovely?)

Someday the baseball gods will extract their terrible revenge on me for all the gloating I have done in the last six years. But the reason I tell the story today is not to further gloat (although: did I mention I saw a perfect game???), but rather to use it as an opportunity to answer a question that my friend Jon posed when he first heard the story:

How many people alive today have seen a perfect game?

The scoreboard at the end of the game reads: "Felix Hernandez, first perfect game in Mariners history"
Did I mention I saw Felix Hernandez’s perfect game six years ago? Which was nice.

So what approach can we take to find an answer to Jon’s question? The question can be divided into two parts: how many people have seen a perfect game, and how many of those people are alive today?

The first part of that question is straightforward, mostly. All MLB games today have their attendance recorded as part of their official scoring, and nearly all baseball scorecards ever are available from the Baseball Reference website. For example, the box score from perfect game #4 – Addie Joss’s in Cleveland on October 2, 1908 – shows that there were 10,598 people in attendance. There are no attendance figures for the first two perfect games, which were both pitched in 1880, but there are for the other 21.

I looked up the attendance from the box scores for each game of the 21 for which figures are available. The total attendance is 570,144. That means that, excluding the first two games in the 1800s (for which the attendance is unknown), 570,144 people have seen an MLB perfect game in person.

So now for the more interesting part: how many of those people are alive today? There’s no way to know for sure, of course – at least not without making the ridiculous effort to track down every person who was there. But we can get a decent estimate based on population and lifespan data. (Note: this is why we don’t need to worry about attendance at those first two games – no one born in the 1800s is alive today.)

The approach is this: make a guess of who would have been at each game. How many men and how many women? How many people of each age – children, seniors, etc.? Then follow those people into their futures, using population statistics to figure out how many of them are still alive in 2018.

For example: when I saw Felix Hernandez’s perfect game in 2012, I was a 34-year-old male. Now I am a 40-year-old male. But if the perfect game I had seen had instead been Cy Young’s in 1904, I would today be a 148-year-old male, and there are no 148-year-old males alive.

The scorecard for the Tampa Bay Rays in that game: 27 batters, 27 out
My scorecard for the game: F. Hernandez (SEA): 27 batters faced, 27 outs

This approach requires a major starting assumption: what are the demographic characteristics of a Major League Baseball audience? As a starting point, let’s assume that the stadium audience for these perfect games was representative of the whole U.S. population at the time. That is almost certainly not true, but it’s an easy starting point. Toward the end of this post, I give some thoughts on how you might improve estimates.

Next, assuming that the people at each game were representative of the U.S. population, let’s “age the population,” by applying lifespan data for this distribution. The average lifespan in the U.S. today is 76 for men and 81 for women. But, of course, some people live longer than that, and some not as long. At first, I thought I would need to calculate simulated lifespans for everyone who might have been in the stadium. For example, if we wanted to see how many people alive today saw Charlie Robertson’s perfect game in 1922, we would need to estimate the likelihood that someone who was there – say, a one-year-old baby – is still alive at age 97. But then I realized – no need!

If the research question were about that 1922 game specifically, then we would indeed have to use such a detailed statistical approach. But the research question is about how many people have seen ANY Major League Baseball perfect game. This is where the math gets slightly depressing, sorry, but if that now-97-year-old baby is still alive, s/he is balanced out by someone who saw one of the later games and died young. So we can just use the average lifespan for men and women as an estimate of the probability that someone is still alive – if they would today be older than 76-for-men-81-for-women, we can assume they are no longer with us. That also means that we don’t need to consider games for which the youngest person who might have been there is now older than 81. So we can start with Don Larsen’s perfect game in 1956.
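That last shortcut, skipping any game whose youngest possible attendee would now be past the cutoff, is easy to sketch. The list below holds only the game years named in this post, and the function name is hypothetical.

```python
TODAY = 2018
MAX_LIFESPAN = 81  # the larger of the two average lifespans (women)

# Years of the perfect games mentioned in this post (not the full list of 23).
game_years = [1880, 1880, 1904, 1908, 1922, 1956, 1968, 1998, 2012]


def could_have_survivors(game_year):
    """True if a newborn at the game would be at or under the cutoff today."""
    return TODAY - game_year <= MAX_LIFESPAN


games_to_analyze = [year for year in game_years if could_have_survivors(year)]
print(games_to_analyze)
```

Under this rule the 1922 game drops out, because even a newborn in the stands would be 96 today, and the 1956 game is the earliest one left.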

So here is the approach we will use:

  1. We have the total attendance for each game. (It varies from the 6,298 people who saw Catfish Hunter’s perfect game in 1968 to the 64,519 who saw Larsen’s in the 1956 World Series.)
  2. Assume that the percentage of people at each age and sex at the game was the same as the percentage of people at each age and sex in the U.S. as a whole. Again, this is almost certainly not true, but I’m not sure how to do better.
  3. All those people are older by the number of years that have passed since the game. Someone who saw David Wells’s 1998 perfect game at age 48 would be 68 today.
  4. Assume that anyone whose age today turns out to be greater than 76 (for men) or 81 (for women) has gone to the big game in the sky.
  5. Add up the number of people still alive who saw each game to get the total alive across all games. That is the answer to Jon’s question.
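
Steps 3 and 4 boil down to a rule you can apply to any single spectator. Here is a minimal sketch, using the 76 and 81 cutoffs from above; the function names are my own, and the rule is the simple average-lifespan shortcut rather than a real survival model.

```python
LIFESPAN = {"M": 76, "F": 81}  # average U.S. lifespans used as cutoffs
TODAY = 2018


def age_today(age_at_game, game_year):
    """Step 3: everyone is older by the years elapsed since the game."""
    return age_at_game + (TODAY - game_year)


def counted_as_alive(age_at_game, sex, game_year):
    """Step 4: keep only people at or under the average lifespan today."""
    return age_today(age_at_game, game_year) <= LIFESPAN[sex]


# Someone who saw David Wells's 1998 game at age 48.
print(age_today(48, 1998))

# A 34-year-old man at the 2012 game versus at Cy Young's 1904 game.
print(counted_as_alive(34, "M", 2012), counted_as_alive(34, "M", 1904))
```

Step 5 is then just a sum of the people who pass this test across all the games.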

So the only additional data we need, besides the attendance for each game, is the percentage of people at each age and sex in the U.S. population at the time each game occurred. We can multiply the percentages of each age and sex by the total number at each game to find the number of people at each age and sex at each game. Data for age and sex for the entire history of the U.S. is easily obtainable from the U.S. Census at

So what’s the answer?

Tune in Friday to find out!