Qs about palindrome assignment

Bioinformatics Models and Algorithms

Moderator: KevinKarplus

Qs about palindrome assignment

Postby Pheller » 10/31/2009 05:26 pm

A few questions about the palindrome assignment:

1) The formula for the variance depends on N, the number of "almost independent observations". Is this the number of same-length-as-the-palindrome substrings of the sequence we're checking?


2) Given N we can compute the Z-score, so: the score_palindromes program reports palindromes whose E-value are less than some maximum. Am I right that small E means either unusually common or unusually rare? And how do we compute E? Is it the natural log of the Z-score?

TIA,
Phil
Pheller
User
 
Posts: 21
Joined: 09/26/2009 02:06 pm

Re: Qs about palindrome assignment

Postby jstjohn » 10/31/2009 09:07 pm

Hey Phil,
Not sure exactly how to calculate E scores from Z scores. I do know that it isn't the natural log of the Z score though. You can check your ideas on the numbers that Kevin provides:

#kmer observed expected Z_score E_value
GTTAAC 38 673.62 -24.49 5.716e-131
GAATTC 604 1337.26 -20.06 5.812e-88
... and so on

On the assignment he did give us the formula for calculating the P value from the Z score which was:

P(z > t) = erfc(t/sqrt(2))/2
P(z < t) = erfc(-t/sqrt(2))/2
(given the erfc function that he told us how to import from the c math libraries)

Which in our case I translated to being (Warning! I might be wrong here!)
Pval(zscore) = erfc(-zscore/sqrt(2))/2 if zscore is negative

Pval(zscore) = erfc(zscore/sqrt(2))/2 otherwise


Now I am in the process of figuring out change the P value into an E value. In class he mentioned some stuff about how the Evalue = nP_1 and how P_n = 1-(1-P_1)^n. I think the P value we are getting is P_n? I am not sure whether n is the number of palindromes we are checking, or something else...
User avatar
jstjohn
User
 
Posts: 18
Joined: 09/30/2009 03:01 pm

Re: Qs about palindrome assignment

Postby jstjohn » 10/31/2009 09:19 pm

Also this may be a coincidence... but he mentioned that P_n is roughly equal to nP_1 when nP_1 is small. Just out of curiosity I multiplied the smallest P value (from Kevin's z score -24.49) by his number of palindromes considered (64) and I got:

>>> libm.erfcl(24.49/math.sqrt(2))/2
9.4395048289635416e-133
>>> 64 * 9.4395048289635416e-133
6.0412830905366666e-131

which is very close to his reported E value of 5.716e-131.


Maybe our result is P_1 then? I am not sure...
User avatar
jstjohn
User
 
Posts: 18
Joined: 09/30/2009 03:01 pm

Re: Qs about palindrome assignment

Postby Pheller » 10/31/2009 09:51 pm

I'll ask about Z to E in class on Monday if we don't figure it out before then.
Pheller
User
 
Posts: 21
Joined: 09/26/2009 02:06 pm

Re: Qs about palindrome assignment

Postby jstjohn » 10/31/2009 10:09 pm

I found one other thing on NCBI:

P = 1 - e-E

so looking at the last example given we have E=9.825e-03 and Z=-3.61

compute P from Z:
>>> libm.erfcl(3.61/math.sqrt(2))/2
0.00015309850257375551

compute P from E:
>>> 1 - (math.e ** - 9.825e-03)
0.0097768923689033338

>>> 0.00015309850257375551 * 64
0.0097983041647203527

close enough for me...
User avatar
jstjohn
User
 
Posts: 18
Joined: 09/30/2009 03:01 pm

Re: Qs about palindrome assignment

Postby jstjohn » 10/31/2009 10:25 pm

Ugg.. maybe it isn't close enough... Lets ask if we can't get it exactly. I can't chalk that up to a counting error because I am using the professor's Z and E values going into this computation. I should be getting closer.
User avatar
jstjohn
User
 
Posts: 18
Joined: 09/30/2009 03:01 pm

cross-platform erf

Postby Pheller » 10/31/2009 11:53 pm

Yeah, it's definitely too late for us to do this without adult supervision.

One last observation: When I pasted in the code to get the erf and erfc functions, of course the whole thing crashed on my windows box. I copied the code to my Mac and it worked fine. You have to qualify the function names: libm.erfc(stuff).

Where's the yawning smiley? How about this one? :lol: Looks like he's about to sneeze.


-- Phil
Pheller
User
 
Posts: 21
Joined: 09/26/2009 02:06 pm

Re: Qs about palindrome assignment

Postby lanolson » 11/01/2009 09:58 am

john, you said:

Which in our case I translated to being (Warning! I might be wrong here!)
Pval(zscore) = erfc(-zscore/sqrt(2))/2 if zscore is negative
Pval(zscore) = erfc(zscore/sqrt(2))/2 otherwise

I think this is the correct interpretation of getting the p-val. This gives us P_1, i.e. the one-hypothesis p-val, no? To get the E-val then, don't we just multiply P_1 by n, the number of hypotheses we're considering (or the number of different palindromes we're considering for a given length?)

I'm just looking at this, and haven't written code yet. I may have completely misinterpreted E-value and should wait to play with this before opening my mouth.
lanolson
Newbie
 
Posts: 12
Joined: 09/27/2009 02:11 pm

Re: Qs about palindrome assignment

Postby KevinKarplus » 11/01/2009 10:00 am

P(z<t) or P(z>t) is a P_1 value, not a P_n value. The E value is n*P_1, where n is the number of different hypotheses being tested.

The erfc method for computing the P_1 value is a little more precise than some of the approximations mentioned here, assuming a normal distribution.

It is possible to get still more precise by using the binomial distribution, rather than the normal approximation to the binomial distribution, but the difference will be very small when there are many sites that contribute to the binomial distribution—the asymptotically good normal approximation is good enough when we are doing genome-wide searches.
User avatar
KevinKarplus
Guru
 
Posts: 76
Joined: 09/08/2009 02:15 pm
Location: PSB 318

Re: Qs about palindrome assignment

Postby Pheller » 11/01/2009 06:45 pm

I'm still unclear as to how to convert Z-score to E-score. If I'm not alone in this, can we go over it in class on Monday?
Pheller
User
 
Posts: 21
Joined: 09/26/2009 02:06 pm

Re: Qs about palindrome assignment

Postby KevinKarplus » 11/01/2009 08:09 pm

User avatar
KevinKarplus
Guru
 
Posts: 76
Joined: 09/08/2009 02:15 pm
Location: PSB 318

Re: Qs about palindrome assignment

Postby jstjohn » 11/03/2009 02:14 pm

Unrelated to converting scores, but related to this assignment: In our output, do you want us to put the # commented lines to standard error and the tab seperated lines to standard out? For example your output looks like:

# Reading from /cse/faculty/karplus/.html/bme205/f05/H.influenzae.fa.gz
# Reading from H.influenzae.fa.gz
...
#kmer observed expected Z_score E_value
GTTAAC 38 673.62 -24.49 5.716e-131
...

and I was thinking that I would throw all of the lines that begin with a # to standard error. But then this line seems like a good header for an output file: (#kmer observed expected Z_score E_value) so the user knows what he or she is looking at.

Anyhow I am just wondering exactly where you want our output so it is easiest for you to test whether or not our programs are working.

One other point is that when the E_value is extremely small, say less that 1E-250, my output zeros out and I get stuff like 0E+0.0. Is this ok?
User avatar
jstjohn
User
 
Posts: 18
Joined: 09/30/2009 03:01 pm

Re: Qs about palindrome assignment

Postby KevinKarplus » 11/03/2009 02:38 pm

I was thinking of the lines starting with "#" as comment lines in the same file as the data, so stdout would be fine for them. By using "#" as the comment character, one can still use gnuplot to look at the data, and many other programs also will ignore "#" comments.

I don't really like seeing E values of 0 (resulting from floating-point underflow), though lots of programs do have that flaw. The program I wrote gets underflow to 0 for Evalues on lots of palindromes. For H.influenzae, I got that underflow for palindromes like TATA, GC.GC, and GATC. So for this assignment, I would not worry about the underflow.

For those who want bonus points, try looking up
log erfc
with google and finding ways to report E-values that are very, very small.
User avatar
KevinKarplus
Guru
 
Posts: 76
Joined: 09/08/2009 02:15 pm
Location: PSB 318

Re: Qs about palindrome assignment

Postby jstjohn » 11/03/2009 03:03 pm

Oh one more thing. Is there a reason you printed out both of these lines, given that they refer to the same reading of the same file, and if so would you like us to follow that as well? Right now I just print the file name as the user input it, and only do that once per file during the reading stage.

# Reading from /cse/faculty/karplus/.html/bme205/f05/H.influenzae.fa.gz
# Reading from H.influenzae.fa.gz
User avatar
jstjohn
User
 
Posts: 18
Joined: 09/30/2009 03:01 pm

Re: Qs about palindrome assignment

Postby KevinKarplus » 11/03/2009 03:06 pm

That must have been bad copy and paste. There is only one "Reading from" line output by my program.
User avatar
KevinKarplus
Guru
 
Posts: 76
Joined: 09/08/2009 02:15 pm
Location: PSB 318

Next

Return to BME 205: Bioinformatics Models and Algorithms

Who is online

Users browsing this forum: No registered users and 1 guest