log-2008-07
- 1 250k array problems
- 2 2008-07-15 Candidate gene list rank test
- 3 2008-08-22 Top SNP HG test against candidate gene list
- 4 2008-07-30 sql to check bad or missing 192 in 250k
- 5 2008-07-31 encode FRI deletions
- 5.1 Deletion 3
- 5.2 2008-08-01 Kruskal-Wallis 250k vs 471 FRI SNPs from alignment 1843
- 5.3 2008-08-04 create version 4 SNPs for 3 FRI deletions
- 5.4 2008-08-05 epistasis among 3 FRI deletions
- 5.5 2008-09-26 merge FRI deletion data with other 250k data for next-step QC/imputation
- 5.6 2008-11-24 add 3 FRI-deletion SNPs into snps_context of stock_250k
- 5.7 2009-9-19 Association of FLC expr (id=43) phenotype with FRI alleles as cofactor
- 6 2008-08-16 ecotypeid typo in /Network/Data/250k/dataFreeze_080608/pipeline_log.txt
1 250k array problems
1.1 2008-07-04 check out the intensity histogram of QC probes in CHLA-06-27
Arrays coming out of CHLA on 06-27-2008 have 4 bad ones (mismatch rate >=12%) out of 9 arrays. DNA quantity and call NA rate don't seem to perfectly correlate with mismatch rate. Finally, an ad-hoc badness of intensity histogram, which roughly corresponds to how well the histogram conforms to that of the 107 normalized arrays, predicts the mismatch rate well.
| Table 1 | |||||||
|---|---|---|---|---|---|---|---|
| MAC# | Sample ID | EcotypeID | Array ID | USC concen | gDNA nanodrop ng/ul | Amp DNA nanodrop | total hyb ug |
| O661 | CS75405_Dog-8 | 9064 | 498 | 88.1 | 3.5 | 243.9 | 15.3657 |
| O662 | CS75418_Lerik1-6 | 9077 | 499 | 49.8 | 4.4 | 292.7 | 18.4401 |
| O663 | CS75445_Lag1-6 | 9104 | 500 | 58.7 | 9.3 | 391.8 | 24.6834 |
| O664 | CS75474_Yeg-7 | 9133 | 501 | 66.1 | 14.1 | 324.1 | 20.4183 |
| O665 | CS75493_Cala-8 | 9152 | 502 | 29.4 | 11.3 | 267.2 | 16.8336 |
| O667 | CS75506_Truk-5 | 9165 | 503 | 44.1 | 5.3 | 314.3 | 19.8009 |
| O668 | CS75510_Kastel-1 | 9169 | 504 | 49.6 | 12.6 | 224.8 | 14.1624 |
| O669 | CS75520_Ayu-Dag-3 | 9179 | 505 | 35.3 | 7.8 | 273.9 | 17.2557 |
| O670 | CS75530_Koch-5 | 9189 | 506 | 63.5 | 46.5 | 206.9 | 13.0347 |
| O666 | CS75494_SanMartin-1 | 84.2 | 12.5 | 118.7 | 7.4781 | ||
| Table 2 | ||||
|---|---|---|---|---|
| oligo call prob <=0.85 | histogram | simple call | ||
| MAC# | mismatch rate | NA rate | badness | NA rate |
| O661 | 0.349206 | 0.11634297 | 5 | 0.494372123 |
| O662 | 0.0625 | 0.13308178 | 3 | 0.289242268 |
| O663 | 0.0169492 | 0.12275931 | 1 | 0.220464712 |
| O664 | 0.245902 | 0.15323995 | 5 | 0.460922665 |
| O665 | 0.0793651 | 0.12949747 | 2 | |
| O667 | 0.0833333 | 0.15901667 | 2 | |
| O668 | 0.127273 | 0.15527548 | 4 | 0.426447398 |
| O669 | 0.047619 | 0.11306439 | 1 | 0.235252469 |
| O670 | 0.403226 | 0.12322193 | 5 | 0.503318797 |
have 3 doubts:
- the array quality could be bad. magnus said it's expired in April. or it could be something happened in my trip from chicago to here, i put them in a luggage bag, the x-ray machine of airport security could have damaged them ...
- our calling algorithm needs to adjust to the new intensity distribution of these arrays.
- scanner alignment is not well.
For 1, we can't really do anything. hope to tell it apart from the image.
For 2, we'll try and see.
For 3, Aaron suggested to look at the boundaries of the raw images.
Eye-balling of the raw images showed the arrays are well-aligned but are not as bright as those chicago arrays. Also bad arrays are also less brighter than the good arrays. This leads to 2 possibilities:
- DNA quantity is not enough for bad arrays. The numbers in the table were not accurate?
- DNA hybridization doesn't work well for bad arrays. Washed away?? Check what the intensity histogram of QC probes look like?
Procedures to get QC probe intensity histogram
commandline to output QC probe intensity:
crocea@banyan:~/script/affy/sdk/calvin_files$ for ((i=498;i<507;i++)); do echo $i; filename=/Network/Data/250k/db/raw_data/$i\_raw_data.cel; ./OutputQCIntensity -i $filename -d \ ~/script/variation/genotyping/250ksnp/data/atSNPtilx520433_rev2/Full/atSNPtil_geno/LibFiles/atSNPtil_geno.cdf \ -o /tmp/CHLA-06-27/$i.tsv; done
command line to draw the intensity (internally calling above to get intensity matrix):
./src/PlotQCProbeIntensityHistogram.py -i 498 -o /tmp/qc_plots/498.png -m 50000
The 9 histograms are in qc-probe-intensity-histogram/ in the order of array-id (Field 'id' in table stock_250k.array_info). There's no difference between bad arrays (id=498, 501, 504, 506) and good arrays. This means:
- QC probes work fine.
- DNA hybridization works fine. Otherwise, QC probes should have bad hybridization in bad arrays, which result in lower intensity.
- The DNA quantity is lower in bad arrays. The sharp peak in the intensity histogram of all probes in bad arrays, which the "badness" measure is based upon, represents noise?
On the other hand, compare the QC probe histogram between CHLA and chicago, although not always, the ones from chicago , peak >20000, are right-shifted (bigger) as to CHLA arrays, peak around 15000, which suggests the overall CHLA array intensity is lower than chicago.
1.1.1 2008-07-13 mismatch rates correlates with simple-calling NA rate
NA rate calculated from simple calling algorithm (last column in Table 2) totally correlates with mismatch rate. now it totally makes sense.
1.2 2009-3-10 CHLA 01-2009 batch
QC for total 429 arrays (ecotype id of 1 array isn't resolved). 104 (=24.2%) has mismatch rate above 10%. a few might be ecotype id mis-linking (might be mislabelled upstream procedures, not necessarily glenda's fault) because the NA (non-calling) rate is pretty low (26 out of 104 has NA rate below 10%). but most of 104 have high NA rates.
391 arrays have jpegs. (430 arrays total though). ~24 have either a curvy line crossing the array or a small smudge, ~13 have large smudge. there're also some arrays substantially less brighter that others.
1.2.1 streak of badness
bad is defined as among those 104 (mismatch rate >10%).
14 out of 20 from P811 to P830 are bad.
8 out of 10 from P997 to Q006 are bad.
23 out of 31 from Q010 to Q040 (CHLA id) are bad.
16 out of 19 from Q087 to Q105 are bad.
10 out of 14 from Q117 to Q130 are bad.
total = 71.
1.2.2 2009-3-13 cross match
MIB-27 (177) <= UKID115-122 (5821-5828)
LDV-9 (158) <= TOU-L-5(389)
UKSE06-514 (5262) <= UKSE06-406 (5194) UKSE06-407 (5195) UKSE06-409( 5197)
By 4 mismatches out of 61 pairs:
UKSE06-465 (5231) <= 5224/UKSE06-453 5235/UKSE06-469
1.3 2009-4-16 27 mis-labelled arrays among all current arrays
The way to get this is to compare 250k calls of failed arrays against all 149SNP data. True identity is called if the mismatch rate is <=2% and the number of corresponding pairs is >=40. Sometimes, one failed 250k array could be matched to dozens of 149SNP data. Then all of them are listed. so explains why there are 215 of them. They actually correspond to only 27 failed runs.
code:
call_method_id = 3 qc_method_id = 9 output_fname = '/tmp/250k_good_bad_cross_label_arrays.tsv' reportBadArrays(db_250k, call_method_id, qc_method_id, output_fname)
| ecotype_id | nativename | array_original_filename | array_created | quality | true ecotype_id | true nativename |
|---|---|---|---|---|---|---|
| 8419 | Wil-1 | ./mnt/chla/03-04-09Arabidopsis/Q259-A-atSNPtilx520433-01-1 (Wil-1_8419).CEL | 2 | 7412 | Wil-1 | |
| 7329 | Santa Clara | ./mnt/chla/03-04-09Arabidopsis/Q263-A-atSNPtilx520433-01-1 (Santa Clara_7329).CEL | 2009/04/05 00:41:59 | 2 | 7226;6909...; | Li-5;Col-0... |
| 6088 | Stu1-1 | ./mnt/chla/03-04-09Arabidopsis/Q269-A-atSNPtilx520433-01-1 (Stu1-1_6088).CEL | 2 | 6089 | Stu1-2 | |
| 9394 | Hag 2 | /Network/Data/250k/raw_data/CHLA-01-2009/P776-A-atSNPtilx520433-01-1 (Hag 2).CEL | 2 | 9370 | EkS 3 | |
| 9370 | EkS 3 | /Network/Data/250k/raw_data/CHLA-01-2009/P777-A-atSNPtilx520433-01-1 (EkS 3).CEL | 2 | 9368;9369 | EkS 1;EkS 2 | |
| 177 | MIB-27 | /Network/Data/250k/raw_data/CHLA-01-2009/P845-A-atSNPtilx520433-01-1 (MIB-27).CEL | 2 | 5821;... | UKID115;... | |
| 158 | LDV-9 | /Network/Data/250k/raw_data/CHLA-01-2009/P890-A-atSNPtilx520433-01-1 (LDV-9).CEL | 2 | 389 | TOU-L-5 | |
| 171 | MIB-20 | /Network/Data/250k/raw_data/CHLA-01-2009/P905-A-atSNPtilx520433-01-1 (MIB-20).CEL | 2 | 199;29;226 | MIB-51;CAM-22;MIB-88 | |
| 5262 | UKSE06-514 | /Network/Data/250k/raw_data/CHLA-01-2009/Q091-A-atSNPtilx520433-01-1 (UKSE06-514).CEL | 2 | 5194;... | UKSE06-406;... | |
| 5270 | UKSE06-527 | /Network/Data/250k/raw_data/CHLA-01-2009/Q092-A-atSNPtilx520433-01-1 (UKSE06-527).CEL | 2 | 5179;... | UKSE06-384;... | |
| 5247 | UKSE06-484 | /Network/Data/250k/raw_data/CHLA-01-2009/Q098-A-atSNPtilx520433-01-1 (UKSE06-484).CEL | 2009/03/05 00:37:21 | 2 | 436;439... | EM-076;EM-126;... |
| 5273 | UKSE06-530 | /Network/Data/250k/raw_data/CHLA-01-2009/Q099-A-atSNPtilx520433-01-1 (UKSE06-530).CEL | 2 | 5180;5182;... | UKSE06-385;UKSE06-387;... | |
| 5192 | UKSE06-404 | /Network/Data/250k/raw_data/CHLA-01-2009/Q102-A-atSNPtilx520433-01-1 (UKSE06-404).CEL | 2 | 5264;5265;... | UKSE06-520;UKSE06-521(1);... | |
| 5201 | UKSE06-413 | /Network/Data/250k/raw_data/CHLA-01-2009/Q103-A-atSNPtilx520433-01-1 (UKSE06-413).CEL | 2 | 5255;5261;... | UKSE06-502;UKSE06-513;... | |
| 5229 | UKSE06-463 | /Network/Data/250k/raw_data/CHLA-01-2009/Q104-A-atSNPtilx520433-01-1 (UKSE06-463).CEL | 2 | 400;4762;... | WHA2;UKSW06-162;... | |
| 1366 | Ham-10-239 | /Network/Data/250k/raw_data/CHLA-01-2009/Q145-A-atSNPtilx520433-01-1 (Ham-10-239).CEL | 2 | 1349;1351;... | Ham-4;Ham-11;... | |
| 6112 | T540 | /Network/Data/250k/raw_data/CHLA-12-29-08/P642-A-atSNPtilx520433-01-1 (T540).CEL | 2 | 6105 | T450 | |
| 4802 | UKSW06-202 | /Network/Data/250k/raw_data/JB-atSNPtile-24S-4-30-08/UKCW06202.CEL | 2 | 4805 | UKSW06-205 | |
| 5202 | UKSE06-414 | /Network/Data/250k/raw_data/JB-atSNPtile-24S-4-30-08/UKKT06414.CEL | 2 | 5196;5198;... | UKSE06-408;UKSE06-410;... | |
| 362 | TOU-C-3 | /Network/Data/250k/raw_data/JB-atSNPTile-24S-5-19-08/FR373TOU.CEL | 2 | 222 | MIB-83 | |
| 373 | TOU-H-12 | /Network/Data/250k/raw_data/JB-atSNPTile-24S-5-7-08/FR386TOU.CEL | 2 | 309 | TOU-A1-23 | |
| 5341 | UKSE06-628 | /Network/Data/250k/raw_data/JB-atSNPTile-24S-5-7-08/UKKT06628.CEL | 2 | 5339;4755;... | UKSE06-626;UKSW06-155;... | |
| 6005 | DraIV 6-35 | /Network/Data/250k/raw_data/JB-atSNPtile-8S-4-9-08/DRAIV6-35.CEL | 2 | 6006 | DraIV 6-36 | |
| 5380 | UKNW06-059 | /Network/Data/250k/raw_data/JB-atSNPtile-8S-4-9-08/UKLD06059.CEL | 2 | 5291;5292 | UKSE06-555;UKSE06-556 | |
| 5381 | UKNW06-060 | /Network/Data/250k/raw_data/JB-atSNPtile-8S-4-9-08/UKLD06060.CEL | 2 | 5292;5291 | UKSE06-556;UKSE06-555 | |
| 5264 | UKSE06-520 | /Network/Data/250k/raw_data/JB-atSNPtile-8S-6-4-08/UKKT06520.CEL | 2 | 5192 | UKSE06-404 | |
| 5207 | UKSE06-429 | /Network/Data/250k/raw_data/yanli2-7-08/UKKT06429.CEL | 2 | 4755;4756;... | UKSW06-155;UKSW06-156;... |
2 2008-07-15 Candidate gene list rank test
2.1 Data & Methods
Two sample rank sum test, aka Mann-Whitney (use R's wilcox.test). P-values from the candidate gene list is one sample. P-values from non-candidate genes is another sample. P-value for a gene is defined as the p-value from the most-significant SNP within the gene or nearby (within 50kb). Association between genes and SNPs is defined in table snps_context.
Candidate genes are stored in table candidate_gene_list and candidate_gene_list_type.
2.2 Results
| Kruskal Wallis | KW | Other methods | Emma from this column | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Phenotype | all data | 96 accession | margarita | Random Forest | K.250k | transform | K.2010 | K.250k-i1-2010 | K.2010_u2_250K | K.2010_i3u2_250K |
| LD | 1.19e-3 | 1.26e-2 | 1.70e-2 | 1.946e-5 | 1.287e-5 | 3.05e-5 | 3.32e-3 | 7.22e-4 | 3.32e-3 | 5.58e-4 |
| LDV | 1.70e-4 | 1.98e-2 | 3.33e-2 | 2.21e-2 | 4.12e-4 | 2.33e-4 | 2.23e-4 | 4.48e-4 | 2.24e-4 | 5.65e-4 |
| SD | 6.40e-3 | 1.85e-1 | 9.63e-2 | 1.13e-2 | 7.14e-4 | 1.69e-3 | 2.24e-3 | 3.02e-4 | 2.24e-3 | 6.33e-5 |
| SDV | 2.63e-4 | 3.70e-2 | 5.11e-2 | 1.16e-2 | 3.12e-4 | 7.29e-05 | 1.97e-4 | 2.61e-6 | 1.97e-4 | 1.26e-06 |
2.2.1 Notations:
- K.250k (analysis_method_id=4):
- Kinship matrix in emma is generated by 250k data alone. phenotype not transformed.
- transform (analysis_method_id=7):
- Kinship ditto. phenotype log transformed.
- K.2010 (analysis_method_id=8):
- Kinship matrix is generated by a SNP matrix of 2010 + 250k accession X 2010 SNP loci. giving 2010 priority if they disagree. phenotype log transformed.
- K.250k-i1-2010 (analysis_method_id=9):
- Kinship matrix is generated by the overlapping sites between 250k and 2010, but giving 250K priority if they disagree. phenotype log transformed.
- K.2010_u2_250K (analysis_method_id=10):
- Kinship matrix is generated by a SNP matrix of 2010 accession X 2010 SNP loci. 250k substitutes at 2010 loci for accessions who don't have 2010 data. phenotype log transformed.
- K.2010_i3u2_250K (analysis_method_id=11):
- Kinship is generated by the overlapping loci between 2010 and 250K, giving 2010 priority if they disagree. phenotype log transformed.
2.3 Conclusions
as far as that flowering time short list is not very bogus (suzi?), a few conclusions are obvious:
margarita is not that great.
One question here is whether Margarita tends to miss the 50 kb window, or simply find more population structure. Either way, I don't view Margarita as one of the main methods. We're basically running it to see if it finds something the others miss.
~170 accession is better than 96 accessions.
random forest is unstable. for LD, it's perfect. but not for the other 3 phenotype.
kinship matrix generated by 2010 alone is less optimal? K.2010 and K.2010_u2_250K gave same stuff.
2.3.1 biases (from magnus)
For all SNPs associated with a gene, take the most significant one or all of them?
watch out for biases. For example, do we have more SNPs within 50 kb of the candidate genes? If so, could this explain the result when you take the minimal p-value? To get rid of bias, perhaps compare the ranks of ALL SNPs, divided into those within X kb of a candidate, and those within X kb of a non-candidate. Since this would generate overlapping sets, perhaps also try those within X kb of a candidate, and those not within X kb of a candidate (risk of other bias here due to varying gene content).
2.3.1.1 2008-08-22
This bias does the rank test result. In the other method, taking all SNPs associated with a gene, the scene that some candidate gene lists (Hyaloperonaspora_14, FT_list3_28) are significant across all phenotypes disappeared.
margarita results came out way stronger in all than most significant. It also greatly surpassed the significance level achieved by other methods under same phenotype and candidate gene list.
2.3.2 Questions/Bugs
2008-07-22 for KW test, why rank sum test is very significant for FT_short, but not for FT_long? considering FT_long is FT_short + half other genes.
it's a bug when taking maximum SNP score/pvalue for a gene. For pvalue, without -log() transformation, maximum becomes the least significant.
3 2008-08-22 Top SNP HG test against candidate gene list
How many top SNPs to choose?
pick the closest gene on the 2 sides or all genes within a threshold?
Flowering time phenotypes match FT gene lists better in the all than closest mode. 78_YEL_Lesion significantly matches the FT lists in the closest mode. The all mode enables several candidate gene lists (ion and others) to be able to have genes in the top SNP list and show significance for some phenotypes.
How much should that threshold be?
which number is the appropriate minimum MAF?
3.1 commandlines
submit results from one analysis method but for all phenotypes into db:
for ((i=1;i<=186;i++)); do echo $i; ~/script/variation/src/Results2DB_250k.py -i \ /Network/Data/250k/tmp-bvilhjal/emma_results/Emma_trans_censored_Kinship_250K_i1_2010_$i\_* \ -l 9 -a 6 -e $i -f Emma_trans_censor_ana9_ph$i -u yh -c -p pass**d ; done;
output rank test p-values for several association results:
./script/variation/src/GeneListRankTest.py -e \ 19,20,21,22,571,572,573,574,389,390,391,392,204,205,206,207,943,\ 944,1045,1046,1151,1152,1153,1154,1337,1338,1339,1340,1522,1523,1524,1525,1707,1708,1709,1710 \ -l 1 -o ~/Desktop/rank_test_results_pheno1234_noRF.tsv -u yh
4 2008-07-30 sql to check bad or missing 192 in 250k
4.1 check missing 192 in 250k
thru stock.batch_ecotype to see how many are genotyped in 250k
select distinct b.ecotypeid from stock.batch_ecotype b, stock.ecotype_duplicate2tg_ecotypeid et, view_qc v where v.ecotype_id=et.tg_ecotypeid and b.batchid=2 and b.ecotypeid=et.ecotypeid and v.call_method_id=3;
thru at.magnus_192_vs_accession to see how many are genotyped in 250k
select distinct a.* from at.magnus_192_vs_accession a, at.accession2tg_ecotypeid ae, call_info c, array_info ar where ar.ecotype_id=ae.ecotype_id and a.accession_id=ae.accession_id and c.method_id=3 and c.array_id=ar.id;
show all 192, with not genotyped shown as NULL ecotype_id
select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar where c.method_id=3 and c.array_id=ar.id) order by ecotype_id, linename;
show all 192, the ones not genotyped or genotyped but mismatch_rate>=0.12 having NULL ecotype_id
select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar, call_qc q where q.call_info_id=c.id and c.method_id=3 and c.array_id=ar.id and q.qc_method_id=9 and q.mismatch_rate<0.12) order by ecotype_id, linename;
to get only the ones not genotyped or genotyped but mismatch_rate>=0.12
select newt.accession_id, ae.nativename, ae.stockparent, ae.ecotype_id from (select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar, call_qc q where q.call_info_id=c.id and c.method_id=3 and c.array_id=ar.id and q.qc_method_id=9 and q.mismatch_rate<0.12) order by ecotype_id, linename) as newt, at.accession2tg_ecotypeid ae where newt.ecotype_id is NULL and newt.accession_id=ae.accession_id;
get the Qc info
select table2.*, q.call_info_id, q.QC_method_id, q.NA_rate, q.no_of_totals, q.mismatch_rate, q.no_of_mismatches, q.no_of_non_NA_pairs from (select newt.accession_id, ae.nativename, ae.stockparent, ae.ecotype_id from (select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar, call_qc q where q.call_info_id=c.id and c.method_id=3 and c.array_id=ar.id and q.qc_method_id=9 and q.mismatch_rate<0.12) order by ecotype_id, linename) as newt, at.accession2tg_ecotypeid ae where newt.ecotype_id is NULL and newt.accession_id=ae.accession_id) as table2, call_qc q where table2.ecotype_id=q.ecotype_id and q.qc_method_id=9 and q.call_method_id=3;
outer join, to get QC info, also including the ones with no QC
select table2.*, q.call_info_id, q.QC_method_id, q.NA_rate, q.no_of_totals, q.mismatch_rate, q.no_of_mismatches, q.no_of_non_NA_pairs from (select newt.accession_id, ae.nativename, ae.stockparent, ae.ecotype_id from (select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar, call_qc q where q.call_info_id=c.id and c.method_id=3 and c.array_id=ar.id and q.qc_method_id=9 and q.mismatch_rate<0.12) order by ecotype_id, linename) as newt, at.accession2tg_ecotypeid ae where newt.ecotype_id is NULL and newt.accession_id=ae.accession_id) as table2 left outer join call_qc q on table2.ecotype_id=q.ecotype_id and q.qc_method_id=9 and q.call_method_id=3;
get the array id, outer join
select t3.*, c.array_id from (select table2.*, q.call_info_id, q.QC_method_id, q.NA_rate, q.no_of_totals, q.mismatch_rate, q.no_of_mismatches, q.no_of_non_NA_pairs from (select newt.accession_id, ae.nativename, ae.stockparent, ae.ecotype_id from (select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar, call_qc q where q.call_info_id=c.id and c.method_id=3 and c.array_id=ar.id and q.qc_method_id=9 and q.mismatch_rate<0.12) order by ecotype_id, linename) as newt, at.accession2tg_ecotypeid ae where newt.ecotype_id is NULL and newt.accession_id=ae.accession_id) as table2 left outer join call_qc q on table2.ecotype_id=q.ecotype_id and q.qc_method_id=9 and q.call_method_id=3) as t3 left outer join call_info c on c.id=t3.call_info_id;
get array original_filename, outer join
select t4.*, a.original_filename from (select t3.*, c.array_id from (select table2.*, q.call_info_id, q.QC_method_id, q.NA_rate, q.no_of_totals, q.mismatch_rate, q.no_of_mismatches, q.no_of_non_NA_pairs from (select newt.accession_id, ae.nativename, ae.stockparent, ae.ecotype_id from (select distinct a.accession_id, a.linename, ae.nativename, ae.stockparent, ae.ecotype_id from at.magnus_192_vs_accession a left outer join at.accession2tg_ecotypeid ae on a.accession_id=ae.accession_id and ae.ecotype_id in (select distinct ar.maternal_ecotype_id from call_info c, array_info ar, call_qc q where q.call_info_id=c.id and c.method_id=3 and c.array_id=ar.id and q.qc_method_id=9 and q.mismatch_rate<0.12) order by ecotype_id, linename) as newt, at.accession2tg_ecotypeid ae where newt.ecotype_id is NULL and newt.accession_id=ae.accession_id) as table2 left outer join call_qc q on table2.ecotype_id=q.ecotype_id and q.qc_method_id=9 and q.call_method_id=3) as t3 left outer join call_info c on c.id=t3.call_info_id) as t4 left outer join array_info a on t4.array_id=a.id order by accession_id;
4.2 2009-4-22 check missing 192 in one 250k call-method
select a.ecotype_id, nt.* from (select distinct m.*, newt.accession_id aid from (select a.accession_id, v.* from view_call v, at.accession2tg_ecotypeid a where v.call_method_id =32 and v.ecotype_id=a.ecotype_id) as newt right outer join at.magnus_192_vs_accession m on m.accession_id=newt.accession_id order by aid, accession_id) as nt, at.accession2tg_ecotypeid a where a.accession_id=nt.accession_id order by aid, accession_id;
4.3 to Provide a list of unique lines within each of the following regions
simply do that
select distinct e.id, e.name, e.stockparent, e.nativename, s.name as site_name, c.abbr as country_abbr, q.call_info_id, q.mismatch_rate from stock.ecotype e, stock.site s, stock.address a, stock.country c, call_qc q where q.ecotype_id=e.id and e.siteid=s.id and s.addressid=a.id and a.countryid=c.id and q.qc_method_id=9 and q.mismatch_rate<0.12 and q.call_method_id=3 order by country_abbr;
remove call_info_id, mismatch_rate
select distinct e.id as ecotype_id, e.name, e.stockparent, e.nativename, s.name as site_name, c.abbr as country_abbr from stock.ecotype e, stock.site s, stock.address a, stock.country c, call_qc q where q.ecotype_id=e.id and e.siteid=s.id and s.addressid=a.id and a.countryid=c.id and q.qc_method_id=9 and q.mismatch_rate<0.12 and q.call_method_id=3 order by country_abbr;
get the count for each country
select country_abbr, count(ecotype_id) as no_of_ecotypes from (select distinct e.id as ecotype_id, e.name, e.stockparent, e.nativename, s.name as site_name, c.abbr as country_abbr from stock.ecotype e, stock.site s, stock.address a, stock.country c, call_qc q where q.ecotype_id=e.id and e.siteid=s.id and s.addressid=a.id and a.countryid=c.id and q.qc_method_id=9 and q.mismatch_rate<0.12 and q.call_method_id=3 order by country_abbr) t1 group by country_abbr order by no_of_ecotypes;
5 2008-07-31 encode FRI deletions
check /research/ref1/JohansonEtAl2000 for details about 2 deletions and another BsmFI site polymorphism.
| symbol | TAIR ID | gene id | chromosome | strand | start | stop |
|---|---|---|---|---|---|---|
| FRI | AT4G00650 | 828044 | 4 | +1 | 269026 | 271503 |
| alignment id | 1843 | 1856 | 1857 | 2709 | 2716 | 2175 (fake alignment made by chris to denote deletion 1) |
|---|---|---|---|---|---|---|
| start | 268482 | 270596 | 270464 | 268482 | 269684 | 268715 |
| stop | 271540 | 271051 | 271052 | 270480 | 270480 | 269962 |
| covering deletion | 1 2 3 | 3 | 3 | 1 2 | 2 | NULL |
| alignment id | 1843 | 1856 | 1857 | 2709 | 2716 |
|---|---|---|---|---|---|
| 1843 | deletion 3 in 2-3.54% | all SNPs<1.4% | ~40 SNPs in deletion 1 shows ~20%, others <1% | (deletion 2 and another @270303)<1.4% |
email from chris 2008-07-23:
This file is the reference (Col-0) sequence around FRI. The grey highlighted sequence absent in Ler (see lower in the file for the extra sequence that Ler has there). Red highlights the ATG start codon of FRI. Red also highlights the bases between which accessions that don't have the Col-0 deletion have 16 more bases. The Ler deletion position starts at 268715 and the Col deletion starts after 269962.
| Deletion 1 (Ler allele) | Insertion 1 | Insertion 2 (relative to non-Col, Col-allele) | Bsm FI restriction site | deletion 3 | |
| start (relative to Col) | 268715 | 268715 | 269962 | 269461 (might be off 1 or 2) | 270702 |
| stop | 269090 (pass ATG promoter) | 268715 | 269962 | 269469 | 270780 |
| sequence | (omitted) | TCTACAAGTATGGACATTAATCTACAAGTTT | TTGTCCCTATGGTCTC | GAACGTATA (in COl = ERI amino acid) | |
| sequence-length | 376 | 31 | 16 | GGACGTATG (=GRM) | 79 (not continuous, some islands in between) |
| Ler | Y | Y | Y (N) | GRM | N |
| Col | N | N | N (Y) | ERI | N |
original statement from chris:
| Ler deletion position: | |
|---|---|
| 268715 | |
| Ler has 31 insert: | |
| TCTACAAGTATGGACATTAATCTACAAGTTT | |
| another insertion position relative to Col-0: | |
| 269962 | |
| insertion sequence relative to Col: | |
| TTGTCCCTATGGTCTC | |
Figure 1. Alignment around FRI, for 193 accessions. more figures in svn repository, variation/doc/FRI.
5.1 Deletion 3
It is discovered by looking at 2010 alignments as in Table 6. Despite the start-stop described in Table 8, there are 5 positions right ahead of this deletion and 5 positions right after this deletion, harboring a different haplotype compared to non-deletion haplotype.
5.2 2008-08-01 Kruskal-Wallis 250k vs 471 FRI SNPs from alignment 1843
In table at.alignment, id=1843 is around FRI. It's sequenced for 193 accessions. Pull polymorphic positions out of it as SNPs. deletion is also counted as polymorphic. Do a simple KW on it against phenotype LD. compare it with 250k (~170 accession) KW result.
Figure 2. Kruskal-Wallis 250k vs 471 FRI SNPs from alignment 1843. The gene under the 183kb-on-the-right peak, (highest in the 250k margarita result) is AT4G01040, "glycosyl hydrolase family 18 protein".
So the FRI deletions didn't pop out as strong as i would expect but considering the dependency of two deletions (Col has deletion 1 seq but not deletion 2 seq, FRI is dead still), it's sort of ok. maybe we can code two deletions as one SNP, or we just run margarita or some other haplotype-based algorithm ...
The peak within deletion 1 (around Position 268800) is ~200bp upstream of FRI's starting codon, might be some key regulatory site. The peak @268809 in deletion 1 has one crucial difference with other SNPs (such as @268785) to make it stand out. It has a base call for ecotype 8240 while SNP @268785 regard it as deletion. Due to the fact ecotype 8240's LD(long-day vernalization flowering time) is 200 (late flowering), this partial deletion 1 doesn't affect FRI function.
5.3 2008-08-04 create version 4 SNPs for 3 FRI deletions
A new alignment with same info as alignment id=1843. Pick 3 distinctive SNPs representing each deletion based on a SNP matrix figure (by DrawSNPMatrix.py).
Steps:
table alignment. insert a new alignment with same (chromosome, start, end) as 1843
table locus. create 3 new loci. as (chromosome,position,offset): (4,268809,0), (4,269962,8), (4,270712,0)
table allele. create 2 alleles for each locus. based on same alleles at those 3 loci
table genotype. copy genotype info over from 6 alignment=1843 alleles:
insert into fri_deletion_alignment_1843 (allele, accession) select 262564, accession from genotype where allele=172949; insert into fri_deletion_alignment_1843 (allele, accession) select 262565, accession from genotype where allele=172950; insert into fri_deletion_alignment_1843 (allele, accession) select 262566, accession from genotype where allele=173485; insert into fri_deletion_alignment_1843 (allele, accession) select 262567, accession from genotype where allele=173484; insert into genotype (allele, accession, quality) select 262568, accession, quality from genotype where allele=173566; insert into genotype (allele, accession, quality) select 262569, accession, quality from genotype where allele=173567;
| name | (chromosome,position,offset) | new alignment id | new locus id | allele ids | id of old locus from alignment 1843 | allele ids of old locus |
|---|---|---|---|---|---|---|
| deletion 1 (Ler allele) | (4,268809,0) | 2924 | 125250 | (262564=A, 262565=-) | 82038 | (172949=A, 172950=-) |
| deletion 2 (Col allele) | (4,269962,8) | 2924 | 125251 | (262566=T, 262567=-) | 82293 | (173485=T, 173484=-) |
| deletion 3 (new) | (4,270712,0) | 2924 | 125252 | (262568=A, 262569=-) | 82333 | (173566=A, 173567=-) |
5.3.1 2008-08-06 FRI Deletion data from Shindo2005
Hagenblad2004 claims to have data for 196 accessions. but according to magnus, its accessions don't match the ones in db at. Besides, it doesn't have table summarizing all info and has only sequences submitted to GenBank. So resort to Shindo2005.
Source 1 is file data.csv from Magnus. It has 192 accessions.
Source 2 is file Supplemental Figure 4A/B from Shindo2005 (extraction aided by character recognition program 'gocr', function outputShindo2005() in variation/src/misc.py). Panel A has 94 accessions. Panel B has 95 accessions. Branch l in panel A belongs to deletion 2, 17 in total. Branch a from panel A, b-e, 18 from panel B belong to deletion 1, 32 in total. Some name correction in table below, not mentioning numerous swedish letters, extra hyphens, etc.
| Shindo2005 Name | name in table at.magnus_192_vs_accession |
| Kas-1 | Kas-2 |
| Blh-0 | Blh-1 |
| Wil-2 | Will |
Source 3 is inference from alignment 1843, in table fri_deletion_alignment_1843 (which is temporary table).
Source 2 and 3 don't have conflicting assignment. But source 1 has conflicts with both 2 and 3. check table below.
| Accessname Name | accession id | ecotype id | wrong assignment | correct assignment | Supp Fig 4 | Alignment Evidence |
|---|---|---|---|---|---|---|
| Kent | 289 | 8238 | no deletion | deletion 1 | branch b-e in panel B | alignment 1843, 2709 |
| Blh-1 | 317 | 8265 | no deletion | deletion 2 | branch l in panel A | alignment 1843, 2716 |
| RRS-7 | 1 | 7514 | no deletion | deletion 1 | branch b-e in panel B | alignment 1843, 2709 |
| Ren-1 | 47 | 6959 | no deletion | deletion 1 | branch a from panel A | alignment 1843, 2709 left-shifted a bit |
| Ms-0 | 92 | 6938 | deletion 1 | deletion 2 | doesn't exist | alignment 1843, 2709, 2716 |
As source 2 and 3 are backed up by publication figure and real sequences, i regard source 1 as faulty. Source 2 and 3 complement each other. Final data is a distinct combination of source 2 & 3 and put into table genotype.
sql commandline to identify conflicts between different sources:
select f1.* , f2.* from fri_deletion_alignment_1843 f1, allele a1, fri_deletion_Shindo2005_Magnus_data_csv f2, allele a2 where f1.accession=f2.accession and f1.allele=a1.id and f2.allele=a2.id and a1.locus=a2.locus and f1.allele!=f2.allele order by f2.allele, f2.accession;
5.4 2008-08-05 epistasis among 3 FRI deletions
Figure 2. a primitive epistasis study among the 3 deletions in FRI (represented by '4_268809_0', '4_269962_8' and '4_270712_0'). using boolean OR to merge deletions. data is 471 SNPs discovered from alignment 1843 for 193 accessions.
Panel b,c,d are not exactly like Hagenblad2004 Figure 4, in which ecotypes are removed based on their phenotype and genotype. It has similar spirit which is that multiple genetic defects contribute to early flowering. It echoes a sense of conditional probability in bayesian networks.
Panel e,f proves the power of boolean thinking. besides, it doesn't suffer the sample size shrinking problem in conditional probability.
This figure also proves that deletion 3 in the 3 UTR region has similar effect although only 4 ecotypes have it.
5.5 2008-09-26 merge FRI deletion data with other 250k data for next-step QC/imputation
Output two deletions with no offset (and other SNPs) by setting version=4 in variation/src/Output2010InCertainSNPs.py.:
./src/Output2010InCertainSNPs.py -o data/2010/data_2010_ecotype_id_y0002_version4_offset_8.tsv -r -y0002 -v 4 -f 0
Remove all NA columns (only 3 deletions in version 4) by calling:
./src/FilterStrainSNPMatrix.py -i data/2010/data_2010_ecotype_id_y0002_version4.tsv -o data/2010/data_2010_ecotype_id_y0002_version4_n1c1d110.tsv -n 1 -c 1 -d 110 -r
Output the other deletion by setting version=4 and offset=8 in Output2010InCertainSNPs.py.:
./src/Output2010InCertainSNPs.py -o data/2010/data_2010_ecotype_id_y0002_version4_offset_8.tsv -r -y0002 -v 4 -f 8
Remove all NA columns:
./src/FilterStrainSNPMatrix.py -i data/2010/data_2010_ecotype_id_y0002_version4_offset_8.tsv -o data/2010/data_2010_ecotype_id_y0002_version4_offset8_n1c1d110.tsv -n 1 -c 1 -d 110 -r
identify the column with SNP_id=`4_269962` (useless because this dataset has different number of strains):
awk '$1=="ecotype_id"{for (i=1;i<=NF;i++) if($i=="4_269962"){print i}}' data/2010/data_2010_ecotype_id_y0002_version4_offset_8.tsv cut -f 176 data/2010/data_2010_ecotype_id_y0002_version4_offset_8.tsv > data/2010/4_269962_offset8.tsvmerge 3 deletions all together:
../pymodule/TwoSNPData.py -i data/2010/data_2010_ecotype_id_y0002_version4_n1c1d110.tsv -j data/2010/data_2010_ecotype_id_y0002_version4_offset8_n1c1d110.tsv -o data/2010/4_269962_offset8.tsv -c1 -w1
mask the deletion integer, -1, as a nucleotide code int, 2, to avoid being dropped when merging.
2009-1-6 add 3 FRI-deletion SNPs into 250k data matrix. Note: TwoSNPData.py removes the 2nd-column array id and replaces them with ecotype id.:
run `pymodule/TwoSNPData.py` ~/script/pymodule/TwoSNPData.py -i ~/panfs/NPUTE_data/input/250k_l3_y.85_uniq_ecotype_20090119.tsv -j ./banyan_home/script/variation/data/2010/4_269962_offset8.tsv -o ~/panfs/NPUTE_data/input/250k_l3_y.85_uniq_ecotype_20090119_3_FRI_del_no_array_id.tsv -c1 -w0 -m1
2009-1-27 not necessary anymore. -m1 in TwoSNPData.py carry out this role. 2009-1-6 run recoverArrayID2ndCol() from variation/src/misc.py to restore array id in the merged 250k-FRI-3-del file. WATCH: the 1st column (strain id) of input_fname has to be unique, otherwise a random one among duplicates would get a possibly wrong array id assigned.:
input_fname = os.path.expanduser('~/panfs/NPUTE_data/input/250k_l3_y.85_uniq_ecotype_20080919_3_FRI_del_no_array_id.tsv') data_with_array_id_fname = os.path.expanduser('~/panfs/NPUTE_data/input/250k_l3_y.85_uniq_ecotype_20080919.tsv') output_fname = os.path.expanduser('~/panfs/NPUTE_data/input/250k_l3_y.85_uniq_ecotype_20080919_3_FRI_del.tsv') recoverArrayID2ndCol(input_fname, data_with_array_id_fname, output_fname)
5.6 2008-11-24 add 3 FRI-deletion SNPs into snps_context of stock_250k
They were already added into table snps on 2008-08-17 by me.
information in table snps is
| id | name | chromosome | position | end_position | allele1 | allele2 +--------+---------------------+------------+----------+--------------+---------+--------- | 250924 | 4_268809_0_FRI_del1 | 4 | 268809 | NULL | A | - | 250925 | 4_269962_8_FRI_del2 | 4 | 269962 | NULL | T | - | 250926 | 4_270712_0_FRI_del3 | 4 | 270712 | NULL | A | -
So add them into table snps_context as touch on FRI (gene_id=828044):
snps_id disp_pos gene_id gene_strand left_or_right disp_pos_comment 250924 64 828044 + touch touch 250925 936 828044 + touch touch 250926 1686 828044 + touch touch
The 1st deletion covers both promoter and initial coding region. disp_pos is calculated as gene_start_pos-deletion_stop_pos.
Found out the 3 FRI-deletion SNPs are already in table snps_context on 2008-08-19. But they only appear in papaya's db, not banyan's. what happened??
5.7 2009-9-19 Association of FLC expr (id=43) phenotype with FRI alleles as cofactor
Figure. The top two panels are LM and EMMA without any cofactor. Every two-panel beneath is one pair of comparison. The title, such as 'LM_cof_Col 6.76' , means linear model, with FRI_Col Allele as cofactor, -log(pvalue)=6.76 at that line. "cof" in the title stands for cofactor. Scores below 0.5 are omitted. NGA4 is the only gene classified as "Flower development" in GeneOntology while not in our flowering time list.
Nothing really changes (except scale). Everything we said in the paper is still valid. The increase of top pvalue in cofactor scenario is still evident. The two peaks, one around FRI, one between GA1 and DFL2, switch the top position from Col-cofactor to Ler-cofactor, which really gives us some insight into the underlying haplotype sharing (1st peak in LD with FRI-Ler allele, 2nd peak in LD with FRI-Col allele).
6 2008-08-16 ecotypeid typo in /Network/Data/250k/dataFreeze_080608/pipeline_log.txt
In step 9, quote The accessions (ecotype Ids) tossed were: Santa Clara (7329), Uod-2 (8428), Blh-1 (7034). The last one, Blh-1's ecotypeid is 8265. 7034 is another ecotype in 149SNP with same nativename but different stockparent.
| nativename | ecotypeid | array id |
|---|---|---|
| Santa Clara | 7329 | 347 |
| Uod-2 | 8428 | 198 |
| Blh-1 | 8265 | 243 |



