Archive for June, 2009

No more complaints about “reverse discrimination,” please?

Monday, June 29th, 2009

Now that the Court’s actually ruled in favor of this stupid argument (White people discriminating against White people … clearly what the Civil Rights Act is meant to ameliorate), will this mean more or fewer absurd complaints about reverse racism and reverse discrimination?

Invitrogen started charging for Vector NTI…

Saturday, June 27th, 2009

So I had to come up with a new way to easily find out whether mutations that come up in cloning are something to worry about or not.  Here’s a Python script to do the job.  Paste in the intended sequence (in frame) and then a chunk of the mutated sequence either beginning or ending with the mutated base (enough of the sequence so that there’s a unique alignment), and the script figures out what the mutation was.  The dictionary specifying the genetic code is courtesy of someone else on a Python mailing list (lost the link).  Easy to use:

python whatsthemutation.py

Original Sequence? ATGTTAAAACGTATCAAAATTGTGACCAGCTTACTGCTGGTTTTGGCCGTTTTTGGCCTT
Mutated region? CAATTGTGACCAGCTTACTG
AAA -> ACA
K -> T
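
For readers who want something to run, here is a minimal sketch of how such a lookup could work. It is not the original whatsthemutation.py; the function name find_mutation and the way the codon table is built are assumptions made for illustration. It assumes the standard genetic code, an in-frame original sequence, and a mutated region whose first or last base is the single substituted base.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (base order TCAG).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_mutation(original, mutated_region):
    """Return (old_codon, new_codon, old_aa, new_aa) for a single substitution.

    Hypothetical helper: the mutated base must be the first or last base of
    mutated_region, and the remaining bases must align to the original.
    """
    original = original.upper()
    mutated_region = mutated_region.upper()

    # Case 1: the mutated base is the first base of the region.
    idx = original.find(mutated_region[1:])
    if idx > 0:
        pos, new_base = idx - 1, mutated_region[0]
    else:
        # Case 2: the mutated base is the last base of the region.
        idx = original.find(mutated_region[:-1])
        pos = idx + len(mutated_region) - 1
        if idx < 0 or pos >= len(original):
            raise ValueError("could not align the mutated region")
        new_base = mutated_region[-1]

    start = pos - pos % 3  # first base of the codon containing the mutation
    old_codon = original[start:start + 3]
    if len(old_codon) < 3:
        raise ValueError("mutation falls in an incomplete codon")
    offset = pos - start
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    return old_codon, new_codon, CODON_TABLE[old_codon], CODON_TABLE[new_codon]


original = "ATGTTAAAACGTATCAAAATTGTGACCAGCTTACTGCTGGTTTTGGCCGTTTTTGGCCTT"
old_codon, new_codon, old_aa, new_aa = find_mutation(original, "CAATTGTGACCAGCTTACTG")
print(old_codon, "->", new_codon)
print(old_aa, "->", new_aa)
```

The sketch only uses the first alignment it finds, so the pasted region still has to be long enough to be unique.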

After some investigation, it looks like CLC Sequence Viewer is the best program to use to access all of my old Vector NTI data without losing annotations and things.  Of course, there’s the problem that someone might try to charge for that in the future, but oh well.  There’s definitely demand from at least the handful of people I know using Vector NTI now for software that can import its data completely, has a database, and allows for simple construction and annotation of plasmids.  Primer design and integration with sequence alignment are nice, too, but not as necessary.

Thinking I should trust my gut more

Saturday, June 27th, 2009

I just noticed that I added this little box to the sidebar of the blog in early 2008:

HOT, CURRENTLY.

Trellises. Trading dollars for risky mortgages, so long as I’m not doing it. The rapidly approaching annual lapse in winter vegetables’ exclusive appeal.

Seems like it would’ve been smart to act on that hunch.  I also wanted to buy NVIDIA stock in early 2000 and I think it posted one of the biggest gains in the market over the next two years in the middle of the tech crash.  I think this goes to show, more than anything else, that it’s easy to see what will likely happen in the economy but impossible to predict when it will happen.

More on that devil

Thursday, June 25th, 2009

Following up on the last post, here’s an exercise in applying Beber and Scacco’s analysis to random numbers.  I’m going to generate 10 sets of 116 random numbers and see how many contain similarly suspicious patterns.  Here is the code and output from MATLAB:

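freqs = [];
for s = 1:10
    % 116 random digits, each uniform on 0..9
    a = ceil(10.*rand(116,1)) - 1;
    aFreq = [];
    for d = 0:9   % count how often each digit occurs
        aFreq = [aFreq length(find(a==d))];
    end
    freqs = [freqs; aFreq];   % one row per set of 116 digits
end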

freqs =
     9    14    13    14     8    10     7    16    12    13
    11    12    11    10     9    16    10    13    12    12
    20    11     6     9    15    13    11    15     9     7
    12    13     6    13    14    12     6    16    15     9
    16    10    15    12     7    11    14    14     9     8
    10    12    11    11    10    12    11    13    16    10
    16    11    14    10    14     7    11    10     9    14
    10    11    12    10    18    12    10     9     9    15
     8    10    12    11    24    10     9     9    10    13
    13     7    13     4    11    16    12    16    17     7

Each row is the frequency of the digits 0 through 9 in a set of 116 random numbers.  Beber and Scacco identify fraud based on the premise that “humans are bad at making up numbers. Cognitive psychologists have found that study participants in lab experiments asked to write sequences of random digits will tend to select some digits more frequently than others.”  In the Iranian example, they see 20 sevens and 5 fives in the last digit of 116 vote counts from Iranian elections.  How often can we identify an equivalent phenomenon in random numbers?

I’ll simulate the number of times each event happens in 10,000 simulations using this code:

ct = 0;
for sim = 1:10000
    a = ceil(10.*rand(116,1)) - 1;   % 116 random digits, 0..9
    aFreq = [];
    for i = 0:9
        aFreq = [aFreq; length(find(a==i))];
    end
    % aFreq(2:4) holds the counts of the digits 1, 2, and 3
    if length(find(aFreq(2:4)>=13))==3
        ct = ct+1;
    end
end

Here, the condition I’m looking for is an overabundance of the numbers 1, 2, and 3, which is what Beber and Scacco identify as indicative of human manipulation in their work on Nigerian elections.  Seeing the numbers 1-3 each 13 times or more occurs in only 365 of 10,000 simulations – it is as rare as the phenomenon observed in Iran, and fits better with experimental observations of fraudulent random numbers.

Let’s look at all of these numbers and see which ones show unexpected rare phenomena:

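(N = number of digits meeting the condition, X = how many times a digit occurs; Times = simulations out of 10,000 in which the condition held)

Row  Times  Condition
1    365    1,2,3>=13 (too many low #s)
2    284    N>=9 9<=X<=13 (too little variation)
3    227    N>=3 X>=15 & N>=1 X>=20 (3 high, 1 very high)
4    573    N>=2 X>=15 & N>=2 X<=6 (2 high, 2 low)
5    -      (no obvious rare pattern)
6    50     N>=8 10<=X<=12 (too little variation)
7    -      (no obvious rare pattern)
8    228    N>=8 9<=X<=12 (too little variation)
9    53     N>=1 X>=24 (too many 4s)
10   97     N>=3 X>=16 & N>=1 X<=4 (too many 5s,7s,8s and too few 3s)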
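
Any row of this table can be checked the same way without MATLAB. The following is a rough Python sketch, not the code that produced the numbers above; the helper names (digit_counts, estimate_frequency, and the two example conditions) are made up for illustration, and the estimates will wobble from run to run and seed to seed.

```python
import random


def digit_counts(rng, n=116):
    """Counts of the digits 0..9 among n uniformly random digits."""
    counts = [0] * 10
    for _ in range(n):
        counts[rng.randrange(10)] += 1
    return counts


def estimate_frequency(condition, sims=10000, seed=0):
    """Fraction of simulated digit sets for which condition(counts) is true."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(sims) if condition(digit_counts(rng)))
    return hits / sims


def too_many_low_digits(counts):
    # Row 1: the digits 1, 2, and 3 each occur at least 13 times.
    return all(c >= 13 for c in counts[1:4])


def one_extreme_digit(counts):
    # Row 9: some digit occurs at least 24 times.
    return max(counts) >= 24


print("row 1:", estimate_frequency(too_many_low_digits))
print("row 9:", estimate_frequency(one_extreme_digit))
```

Each condition is just a function of the ten counts, so the other rows can be written the same way.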

So for 10 random sets of numbers it’s pretty easy to find phenomena in 8 of them that are as rare as or rarer than what happened in Iran.  Samples 5 and 7 are now suspicious because they don’t display any obvious rare pattern… was the person faking this data on to my game?

Is the devil in the digits?

Wednesday, June 24th, 2009

In the Washington Post, two Columbia political science students claim to “use statistics more systematically to show [that the Iranian election results were altered behind closed doors].”  They are confident that this leaves “very little room for reasonable doubt” that the results were at least partially fabricated.  They identify two unexpected occurrences in the last two digits of the number of votes received by the four candidates in Iran’s provinces (116 total numbers):

1. In the last digit, the number 7 occurs 17% of the time (N=20) and the number 5 occurs 4% of the time (N=5).  The probability of this phenomenon for 116 random numbers is 3.5%.

2. The last digit and the penultimate digit are non-adjacent only 62% of the time (0 is adjacent to both 9 and 1, so there is a 70% probability that a random digit will be distinct from and non-adjacent to any other digit).  The probability of this is 4.2%.

3. The probability of both occurrences happening simultaneously is 0.5% [sic - it is actually 0.15%, the product of the probabilities of each event, assuming the two are independent]
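
The 70% baseline in point 2 and the product in point 3 are easy to check by brute force. This small sketch assumes that "non-adjacent" means the two digits are distinct and do not differ by 1, with 0 and 9 counted as neighbors, and that the two events in point 3 are independent; the names used are illustrative only.

```python
def is_non_adjacent(a, b):
    """True if digits a and b are distinct and not neighbors (0 and 9 are neighbors)."""
    return (a - b) % 10 not in (0, 1, 9)


# Enumerate all 100 equally likely (penultimate, last) digit pairs.
non_adjacent_pairs = sum(
    1 for a in range(10) for b in range(10) if is_non_adjacent(a, b)
)
print(non_adjacent_pairs, "of 100 pairs are non-adjacent")

# Joint probability of the two quoted events if they are independent.
joint = 0.035 * 0.042
print("joint probability: {:.2%}".format(joint))
```

Each digit has itself and two neighbors excluded, which leaves 7 of 10 partners.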

They correctly state that the odds of this happening in a fair election are extremely low; they incorrectly infer that this leaves little doubt of fraud.  Focusing on the first point, let’s see what the authors have to say here:

Why would fraudulent numbers look any different? The reason is that humans are bad at making up numbers. Cognitive psychologists have found that study participants in lab experiments asked to write sequences of random digits will tend to select some digits more frequently than others.

The numbers look suspicious. We find too many 7s and not enough 5s in the last digit. We expect each digit (0, 1, 2, and so on) to appear at the end of 10 percent of the vote counts. But in Iran’s provincial results, the digit 7 appears 17 percent of the time, and only 4 percent of the results end in the number 5.

Indeed, cognitive psychologists say this.  What’s more, the authors previously looked into possible Nigerian election fraud and discussed this further:

We showed that we can expect the last digits of electoral results to occur with equal frequency given a wide range of distributional assumptions, and we then emphasized the fact that humans tend to be biased in the production of random numbers: They tend to select small digits, avoid repetition, and favor adjacent numerals.

None of the literature they cite says anything about the numbers 5 and 7, and the phenomenon observed here actually runs counter to experimental evidence of human attempts at producing random numbers.

They equate the probability of seeing one number too frequently and one number too infrequently with the probability that the last digits are random.  These probabilities are not equivalent.  It’s easy to see that there are any number of equivalent, similarly improbable events:

1. X appears too frequently and Y appears too infrequently

2. Both X and Y appear too frequently

3. Both X and Y appear too infrequently

4. X, Y, and Z appear too frequently

5. X, Y, and Z appear too infrequently

etc.

It’s trivial to continue and think of dozens of equivalent events all with a 3.5% probability.  In fact, there is a 100% chance that a string of 116 random digits will feature such a pattern (update: I suspect this, but I’m not remotely capable of proving it).

The correct way to investigate whether a set of numbers might be random is using Pearson’s chi-square test.  We first calculate the chi-square test statistic for an expected digit frequency of 11.6 per 116 numbers.  The digits 0 through 9 are observed 9, 11, 8, 9, 10, 5, 14, 20, 17, and 13 times.  The test statistic is 15.6.  Since our data has 10 possible values there are 9 degrees of freedom, and the critical value required to reject the null hypothesis at a 95% confidence level is 16.9 – you simply can’t conclude with a high degree of confidence that the numbers aren’t entirely random.
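
The arithmetic behind that test can be reproduced in a few lines. This is a plain-Python sketch using the counts quoted above; the critical value 16.919 is the standard table value for 9 degrees of freedom at the 5% level, hard-coded here rather than computed, and the function name is just for illustration.

```python
# Observed last-digit counts for the digits 0 through 9 (116 vote counts).
observed = [9, 11, 8, 9, 10, 5, 14, 20, 17, 13]


def chi_square_uniform(counts):
    """Pearson's chi-square statistic against equal expected counts."""
    expected = sum(counts) / len(counts)  # 11.6 for 116 numbers
    return sum((c - expected) ** 2 / expected for c in counts)


# Table value for chi-square with 9 degrees of freedom, alpha = 0.05.
CRITICAL_95_DF9 = 16.919

stat = chi_square_uniform(observed)
reject = stat > CRITICAL_95_DF9
print("chi-square statistic: {:.2f}".format(stat))
print("reject uniformity at the 5% level:", reject)
```

The statistic falls just short of the critical value, so uniformity is not rejected at this level.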

What’s more, here is the authors’ example of the results for a fair election:

As a point of comparison, we can analyze the state-by-state vote counts for John McCain and Barack Obama in last year’s U.S. presidential election. The frequencies of last digits in these election returns never rise above 14 percent or fall below 6 percent, a pattern we would expect to see in seventy out of a hundred fair elections.

Why look at the last digit when the second-to-last digit should also be random?  If you look at the second-to-last digit in this same data set, you’ll find 20% 7s and 5% 8s.  The odds of this happening are 1.5% – well below the odds of the 7s and 5s phenomenon in Iran.