Digital devils response
July 16th, 2009
Bernd Beber has responded to my (and others’) criticisms of his article on suspected Iranian election fraud authored with Alexandra Scacco. An annotated and updated version of the article is available.
A key criticism leveled against Alex Scacco’s and my Washington Post op-ed on the election in Iran is that we argue that a fair election is unlikely to produce a lot of variation in last-digit frequencies, but then use an inappropriate test in evaluating the data from Iran against this claim. We should have reported the results from a chi-square test, not the probability of particular digits occurring more or less often than expected.
Is a chi-square test the most appropriate statistic for this type of data? Yes. That’s exactly why we report the result in the annotated version of our op-ed. (We initially reported only a nearly equivalent test statistic involving the standard deviation of last-digit frequencies, but since then we’ve clarified that this is the same result one obtains from a chi-square test.)
But is this test the most appropriate one for a general audience? Only if there isn’t a more transparent alternative that captures the same intuition and gives the same substantive result. In our view, the test statistic we report is precisely such an alternative.
I agree that it would be inappropriate to walk through a chi-square test in the limited space provided in an opinion column. That said, the event identified in the op-ed has a 3.5% chance of occurring, while a similar standard deviation (or chi-square statistic; I agree that both tests are fine) would occur 7.7% of the time. I disagree that a more easily digestible argument that overstates the rarity of the Iranian results by over 50% is an appropriate way to present the case to a lay audience.
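To make the omnibus test concrete, here is a minimal sketch of a chi-square statistic on last-digit counts, with its p-value estimated by simulating fair digit sets rather than read from a table. The counts below are hypothetical and are not the Iranian returns; the function names, the trial count, and the seed are my own choices for illustration.

```python
import random

def chi_square_stat(counts):
    """Pearson chi-square statistic against equal expected frequencies."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def simulated_p_value(counts, trials=20000, seed=0):
    """Share of simulated uniform digit sets at least as uneven as `counts`."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = chi_square_stat(counts)
    hits = 0
    for _ in range(trials):
        sim = [0] * 10
        for _ in range(n):
            sim[rng.randrange(10)] += 1  # draw one uniformly random last digit
        if chi_square_stat(sim) >= observed:
            hits += 1
    return hits / trials

# Hypothetical last-digit counts for digits 0..9 (NOT the Iranian data).
example_counts = [10, 12, 8, 11, 9, 10, 13, 7, 10, 10]
stat = chi_square_stat(example_counts)
p = simulated_p_value(example_counts, trials=2000)
print(f"chi-square = {stat:.2f}, simulated p = {p:.3f}")
```

The point of the omnibus statistic is that it asks how often any pattern this uneven would turn up, not how often the particular digits that happened to stand out would do so.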
Beber notes that p<0.1 is viewed as significant in political science. True, but my criticism isn’t about whether the Iranian results vary significantly from random expectations. It’s about whether the observed anomalies “leave little room for reasonable doubt.”
We can quibble over the exact language we used to describe our finding. I’m happy to concede that we find “significant” as opposed to “strong” evidence, or that a fair election is “significantly unlikely” as opposed to “extremely unlikely” to produce the kind of digit patterns we see in the data from Iran. But the substantive conclusion doesn’t change.
The intended conclusion may very well have been that a fair election was unlikely to result in this exact occurrence, but this isn’t how the article was received. Online articles linking to the WaPo piece describe it as a “nice analysis of why the Iranian election is probably fraudulent” and “a persuasive case … that the vote totals reported by the Iranian government were fabricated,” and the authors’ own political science department summarizes the article by saying, “The probability that Iran’s presidential election was fair is less than .005.” The article was perceived to establish the likelihood that the election was fraudulent, not the likelihood that a fair election would have produced the same results. This might seem like a distinction without a difference, but that’s not the case. Applying the exact same test used here, the 1944 US election is much more likely to be fraudulent than the Iranian election.
Beber’s response to criticisms about cherry picking tests is largely that the same tests were applied in analyzing a Nigerian election where fraud was observed through other means:
The analysis in that paper shows that there are two types of tests that are effective in the sense that they (a) have a theoretical rationale, (b) don’t sound an alarm when they shouldn’t (in “clean” election data from Sweden), (c) sound an alarm when they should (in very probably manipulated election data from Nigeria). Those two kinds of tests focus on last-digit frequencies and non-adjacency in last and second-to-last digits. Those are exactly the tests we apply to the data from Iran, tests shown to be efficacious before Iran’s election took place. Again, that’s not cherry-picking.
The tests applied to Iran are similar, but they are not the same. In Nigeria, the frequencies of zeros were seen to be outside the 95% confidence interval for three different columns on ward return sheets. In Iran, a combination of all candidates’ results (but not registered voter totals, which weren’t reported, or total voters, which were) didn’t result in any digit being represented outside the 95% confidence interval. Only by combining the probabilities that two digits would simultaneously stray from their expected frequencies was anything significant detected, and when a more appropriate test is applied (chi-square or standard deviation), the results do not pass the 95% threshold examined in Nigeria. The significant reported result is obtained by combining this probability with the probability of seeing a lower-than-expected number of unique, non-adjacent digits in the final two digits of returns. However, there is no precedent in the work on Nigeria either for looking at non-adjacent digits on a nationwide level (they were used to analyze the likelihood of fraud in individual wards) or for employing the combined likelihood of digit frequency and digit adjacency as a test of fraud.
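For readers unfamiliar with the second test, here is a rough sketch of how the share of distinct, non-adjacent final digit pairs could be counted. It assumes that 0 and 9 count as neighbours (circular adjacency), which is my assumption and not something stated in this post; the vote counts are hypothetical and the function names are my own.

```python
def is_nonadjacent_distinct(n):
    """True if the last two digits of n differ and are not neighbours.

    Assumption: 0 and 9 are treated as neighbours (circular adjacency).
    """
    last, second = n % 10, (n // 10) % 10
    gap = abs(last - second)
    return gap not in (0, 1, 9)

def nonadjacent_share(values):
    """Fraction of counts whose final two digits are distinct and non-adjacent."""
    values = list(values)
    return sum(is_nonadjacent_distinct(v) for v in values) / len(values)

# Baseline if the final two digits are uniformly random: enumerate all 100 endings.
baseline = nonadjacent_share(range(100))

# Hypothetical vote counts (NOT actual returns).
sample = [4512, 9870, 1233, 6077, 2590, 8841]
print(f"baseline = {baseline:.2f}, sample share = {nonadjacent_share(sample):.2f}")
```

Under that adjacency assumption, 70 of the 100 possible endings qualify, so the test asks whether the observed share falls far enough below the baseline to be unlikely by chance.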
I don’t see that a case was made to claim that there’s “little room for reasonable doubt” that Iran’s provincial election results were fabricated based on this evidence. Given that, weeks later, there’s still no physical evidence of fraud even though tens of thousands of Iranians (a large fraction supporting Mousavi) were involved in counting tens of millions of ballots, and results for individual precincts and ballot boxes have been released, I think my skepticism was warranted. There are structural problems in Iran that prevent any election from being fair, and procedural problems that were raised by Mousavi before the election began, but evidence for outright theft is scant.