log-2009-01
- 1 2009-2-9 interaction between DOG1 and others
- 2 2009-2-9 CNV
- 3 2009-3-9 LD
- 4 2009-3-24 Caroline Dean's 30 sequences around FLC
- 5 2009-5-1 formalize public version of the database
1 2009-2-9 interaction between DOG1 and others
1.1 Intra DOG1
by boolean KW, with FLC, FRI, peak at chr4 1.3Mb.
1.2 2009-2-15 Boolean KW
for pairs resulting high pvalues under a certain boolean operator, why is that? can you do more?
proportion of each operator as the dominant operator in different pvalue range.
Figure Intra-Gene (SNPs within 10k vicinity of the gene) Top Boolean Operator Percentage by log pvalue
1.3 2009-2-15 linear model epistasis
questions:
- figure out how many SNP pairs among all don't have all 4-allele combination.
- epistasis pvalue=0, what happened?
2 2009-2-9 CNV
2.1 GADA/mGADAr2 parameters
| name | meaning | usual value | reference |
|---|---|---|---|
| aAlpha | a in Gamma(a;b), controls magnitude of the jump between breakpoints, higher => less jumps | 0.5 | Pique-Regi2008Supp |
| TBackElim | (amp1-amp2)/stddev | 4 | Pique-Regi2008Supp |
| MinSegLen | minimum number of probes to call a segment | 5 | Pique-Regi2008Supp |
Choose the cutoff to call duplication/deletion based on the histogram of amplitudes for known duplication/deletions.
2.2 QC
2.2.1 position of insertion matters
One CNV probe (1, 5923348) is adjacent to one insertion from 1_5923357_1 until 1_5923357_23. But the insertion is on the border of the probe with 4 base-pair overlap. The first 2 and the 4th base of the insertion matches the reference sequence and with 1_5923357_0 a polymorphism not matching the insertion. The probe has 0 or 1 base mismatch.
2.2.2 2009-2-17 observations
based on the 2010 SNP data, for each CNV probe, i extracted following information
- how many mismatches (polymorphic SNPs) for the accessions we know
- how many deletions
- how many insertions
the deletion/insertion count is not perfect because i didn't blast the probe sequence against the raw fragments. i just look at whether there are known SNPs in the probe and whether it's the reference allele. as a result, the region with a deletion could still match the CNV probe. the region with insertion could still match the probe as well. we saw a couple of cases like this.
so i probably need to refine it by blasting. this paper, http://banyan.usc.edu/research/ref1/Zhang2008, already did something similar, however didn't provide data download.
might have to get the duplication QC dataset from somewhere else cuz 2010 is not genome-wide. maybe the solexa and the Ler contigs mentioned in Zhang2008 above.
we have a few observations:
- most mismatch/deletion/insertion has an effect on the probe intensity, whose size depends on how many and which position in the probe.
- genome-wide, we can clearly see 5kb, 10kb, 20kb deletions. there are regions (~800kb window) encompassing lots of polymorphisms (frequent jump between deletions/SNPs and reference). also regions look pretty much like reference throughout.
- boundary of inversion (maybe indel) regions given by tina matches the crude deletion calls in the CNV.
- i did a GWA(genome-wide association) based on the crudely-called CNV. i choosed a cutoff based on one known FRI deletion and only made deletion calls, no duplication calls (high intensity just regarded as normal). the end-result captured the SVP very well (highest). a couple of other peaks overlapping with SNP gwa, a couple don't. give you the sort of news that it's not crap.
2.2.3 2009-2-15 what kind of polymorphism can we get out of CNV data?
The probe intensity could be affected even by one single mismatch, of course the position of the mismatch has a big impact on the hybridization.
For SNP polymorphism, how many polymorphic sites and what positions would ensure a statistically-acceptable separation between perfect match and mismatch.
small indels or large?
Focusing on small indels and SNP polymorphisms makes this thing inching close to Justin's SFP.
2.2.4 2009-2-15 raw intensity vs amplitude by GADA
raw intensity shows difference between different probe-mismatching. while amplitude doesn't. GADA probably smoothed it out.
2.3 2009-6-5 Regions from Karel Riha
http://www.gmi.oeaw.ac.at/en/research/research-groups/karel-riha/
| phenotype | chromosome | start | stop |
|---|---|---|---|
| 7 | 1 | 7128000 | 7140000 |
code to draw CNV matrices underlying it:
#use CNV discrete calls (1/-1) as SNP matrix DrawSNPRegion.py -i /tmp/CycB25CycB23Region.tsv -I /Network/Data/250k/tmp-yh/CNV/call_method_17_CNV_array_intensity_norm_chr1_GADA_M10_discrete_m.3125.tsv -N /Network/Data/250k/tmp-yh/phenotype.tsv -l 28 -o /Network/Data/250k/tmp-yh/snp_region_CNV_M10_discrete_m.3125 -j /Network/Data/250k/tmp-yh/at_gene_model_pickelf -e 17 -u yh -s -V 2 #use CNV amplitude (float) output as SNP matrix DrawSNPRegion.py -i /tmp/CycB25CycB23Region.tsv -I /Network/Data/250k/tmp-yh/CNV/call_method_17_CNV_array_intensity_norm_chr1_GADA_M10_out_amp.tsv -N /Network/Data/250k/tmp-yh/phenotype.tsv -l 28 -o /Network/Data/250k/tmp-yh/snp_region_CNV_M10_amp -j /Network/Data/250k/tmp-yh/at_gene_model_pickelf -e 17 -u yh -s -V 3
| phenotype | chromosome | start | stop |
|---|---|---|---|
| 7 | 2 | 1479516 | 1579515 |
| 7 | 2 | 1842649 | 1942649 |
2.4 2009-7-31 compile mGADAr2 (Pique-Regi2009) mex on hpc-cmb
All the code has been imported into svn, https://dl324b-1.cmb.usc.edu/projects/hapmap/browser/variation/trunk/bin/GADA/mGADAr2. Read the wiki page regarding how to check it out.
Roger uses c++ comment style // in his c files. By default, matlab adds -ansi to CFLAGS which forces gcc to exit upon encountering //. So need to override that option first.
Run below and choose gccopts.sh
/usr/usc/matlab/default/bin/mex -setup
modify ~/.matlab/R2008a/mexopts.sh to remove -ansi from CFLAGS (there're several of them for different architectures).
Run MakeMex.m in matlab to compile all c extentions.
Run testFitGadaModel.m to do a test on simulation data.
Read svn documentation of testFitGadaModelVersion2.m and figure out how to scale it
2.5 2009-10-12 CNV normalization
Col-0 (reference genome) intensity ranges from 1.6x to 4.7x (log transformed). There are apparent duplications (probes are not well-chosen to be unique in ref genome or ref genome/TAIR 8 contains errors). Same probe appears with high intensity in different arrays for the same Col-0.
The normalization procedure proposed by Roger (Quantile Normalization + Column Median subtraction + Row Median subtraction) seems to cause the genome-wide pattern to become more zigzag. The column and row median subtraction step is questionable. The idea is to get rid of reference intensity so that zero corresponds to reference, while column and row median are not good reference candidate.
A better idea is to use intensity from multiple Col-0 arrays as row or column reference.
2.6 2009-11-04 False negative/positive rate with PE-deletions as gold standard
Compare tiling-array derived deletions with Paired-end derived ones.
| PE-deletions: | Paired-end derived deletions |
|---|---|
| tiling-deletion: | |
| deletions derived from tiling array data. All segments from the algorithm whose amplitudes are below cutoff. | |
| min_QC_segment_size: | |
| =5 (minimum segment length of PE-deletion. Setting it to 5 renders it useless.) | |
| min_no_of_probes: | |
| =5 (minimum number of tiling probes covered by a PE-deletion, to exclude small deletions | |
| max_boundary_diff: | |
| =10kb (maximum distance tolerated between endpoints of one tiling-deletion and PE-deletion) | |
| max_diff_perc: | diff_perc=max_boundary_diff/tiling-deletion-length. maximum diff_perc tolerated. |
| cutoff: | maximum amplitude to call a deletion. |
Besides criteria above, if a tiling-deletion is within a PE-deletion , it's also called a match. The rationale is because tiling-deletions haven't undergone a merging process. A large deletion might be fragmented into several small adjacent deletions due to probe intensity fluctuation.
| FNR: | false negative rate |
|---|---|
| FPR: | false positive rate |
| Ecotype | FNR (cutoff=-0.33) | FNR (cutoff=-0.5) | FPR (cutoff=-0.33) | FPR (cutoff=-0.5) |
|---|---|---|---|---|
| Cvi-0 (6911)* | 349/473=73.8% | 365/473=77.2% | 6019/6227=96.6% | 3379/3549=95.2% |
| Fei-0 (8215) | 196/372=52.7% | 231/372=62.1% | 1882/2111.=89.1% | 1019/1187=85.8% |
| Kastel-1 (9169) | 335/405=82.7% | 373/405=92.1% | 600/672=89.3% | 376/409=91.9% |
| Sha (6962) | 103/342=30.1% | 122/342=35.6% | 7563/8315=90.9% | 4634/5248=88.3% |
* Cvi-0 PE data has only two chromosomes (1&4). FPR is inflated.
| Ecotype | FNR (cutoff=-0.33) | FNR (cutoff=-0.5) | FPR (cutoff=-0.33) | FPR (cutoff=-0.5) |
|---|---|---|---|---|
| Cvi-0 (6911)* | 324/473=68.5% | 344/473=72.7% | 5979/6227=96.0% | 3346/3549=94.2% |
| Fei-0 (8215) | 174/372=46.7% | 216/372=58.1% | 1859/2111=88.1% | 1004/1187=84.6% |
| Kastel-1 (9169) | 333/405=82.2% | 371/405=91.6% | 598/672=89.1% | 374/409=91.4% |
| Sha (6962) | 77/342=22.5% | 97/342=28.4% | 7495/8315=90.1% | 4580/5248=87.3% |
3 2009-3-9 LD
dataset: call method 29
Figure Histogram of LD using snp dataset call method id 29. Only pairs with r2>=0.85 are plotted here.
Figure Histogram of LD using snp dataset call method id 29 with only 96 "2010" accessions. Only pairs with r2>=0.85 are plotted here.
3.1 SNP pairs with LD (r2) >= 0.95
31591 pairs with LD (r2) >=0.95. around 4500 pairs with r2=1.
13 out of 31591 are inter-chromosomal which roughly correspond to 4 regional links. 3rd column is number of base-pairs that are identical by dynamic programming, match=1, mismatch=-1, del=-2. 4th column is r2.:
link 1:
1-14286837, 5-8441445, 11, 0.96324
1-15607026, 5-8441445, 16, 0.96324
link 2:
1, 17398608, 5, 12107290, 15, 0.95658
1, 17398608, 5, 12283101, 9, 0.97071
1, 17398608, 5, 12350652, 16, 0.95578
1, 17398608, 5, 12442929, 13, 0.97024
1, 17398608, 5, 12579638, 13, 0.95658
1, 17398608, 5, 12623022, 13, 0.95658
1, 17398608, 5, 13072637, 11, 0.95658
link 3:
1, 25993897, 3, 15752587, 9, 0.96007
1, 25997144, 3, 15752587, 23, 0.96007
1, 25997609, 3, 15752587, 13/14, 0.96998
link 4:
1, 26287598, 4, 18405157, 15, 0.95170
500 out of 31578 are over 130,046 bp apart on the same chromosome. 250 out of 31578 are over 296,165 bp apart on the same chromosome.
| #pairs | min distance |
|---|---|
| 500 | 130046 |
| 250 | 296165 |
| 100 | 464200 |
| 50 | 633269 |
| 10 | 857081 |
| 5 | 965347 |
| 3 | 1320189 |
| 2 | 2320103 |
| 1 | 18805634 |
3.2 2009-4-12 genotype combinations of SNP pairs
For two SNPs, look at the genotype combinations across accessions with non-NA phenotype. This is from call method 29 and phenotype 7.:
mpiexec ~/script/variation/src/MpiInterSNPPairAsso.py -i ~/panfs/250k/call_method_29_binary.tsv -n ~/panfs/250k/snps_context_g0_m10000 -p ~/panfs/250k/phenotype.tsv -o ~/panfs/250k/InterSNPCount -w 7 -y3 -u yh -a yh324 -t ~/panfs/db/results/type_1/ -l 29 -f 23000
out of 6,014,661,600 (need to figure out why it's this number) pairs, majority (5,877,321,290=97.7%) are 4-combos.
| #pairs | combo type | intra-chromosomal | inter-chromosomal |
|---|---|---|---|
| 1621 | 2-combo | 1621 | 0 |
| 137338689 (2.28%) | 3-combo | 25932687 | 111406002 (81.1%) |
| 5877321290 (97.7%) | 4-combo | 1086044792 | 4791276498 (81.5%) |
4 2009-3-24 Caroline Dean's 30 sequences around FLC
sequence spanning from 3170860 to 3182869 on chromosome 5. figured out by blasting head and tail of Col-0's sequence against At genome at NCBI. their sequences are in reverse-complement order (in FLC transcription order though).
get the reference from db and output in fasta format:
from pymodule.db import get_sequence_segment seq=get_sequence_segment(db_250k.metadata.bind, 186532933, 3170000, 3183500, annot_assembly_table='genome.annot_assembly', raw_sequence_table='genome.raw_sequence')
reverse complement (reverseComplementFastaInput() in class AnalyzeSNPData of variation.src.misc.py):
input_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC.seq') output_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC_rc.seq') AnalyzeSNPData.reverseComplementFastaInput(input_fname, output_fname)paste the reference and caroline sequences into one:
cat ref.seq CarolineDeanFLC_rc.seq >CarolineDeanFLC_rc_with_ref.seq
run clustalx. after full alignment, adjust by aligning a few selected ones with apparent alignment-shift problems.
discover 7159 SNPs (spurious deletions in the beginning and end):
fasta_input_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC_rc_with_ref.align.fasta') output_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC_rc_with_ref_snp_matrix.tsv') AnalyzeSNPData.outputSNPsOutOfFastaAlignMatrixInYuFormat(fasta_input_fname, output_fname, chromosome=5, start=3170860)chop off beginning and end deletions:
cut -f 1-2,600-6000 CarolineDeanFLC_rc_with_ref_snp_matrix.tsv > CarolineDeanFLC_rc_with_ref_snp_matrix_600_6000.tsv
assign 7087(Col-1) as id to name='ref' (reference sequence), didn't choose 8279 (Col-0) because same id would cause one to step on the other and the other disappear. SNP data only allows one unique ecotypeid.
draw SNP region by ~/script/variation/src/DrawSNPRegion.py:
~/script/variation/src/DrawSNPRegion.py -i /tmp/FLC_region.tsv -I ~/script/variation/doc/FLC/CarolineDeanFLC_rc_with_ref_snp_matrix_600_6000.tsv -N /Network/Data/250k/tmp-yh/phenotype.tsv -l 28 -o ~/script/variation/doc/FLC/snp_region_FLC -j /Network/Data/250k/tmp-yh/at_gene_model_pickelf -e 29 -u yh -s -p yh324 -V 4
| ecotypeid | nativename |
|---|---|
| 6046 | LOV-5 |
| 6900 | BIL-5 |
| 6901 | BIL-7 |
| 8382 | SPR-1-2 |
| 7519 | OMO-2-3 |
| 6956 | PU2-7 |
| 6969 | TAMM-27 |
| 6826 | KZ-1 |
| 6980 | WS-0 |
| 6897 | AG-0 |
| 7183 | KAS-1 |
| 6945 | NOK-3 |
| 6971 | TS-5 |
| 6929 | KONDARA |
| 7182 | KA-0 |
| 7285 | OST-0 |
| 8306 | HOV-4-1 |
| 8247 | SAN-2 |
| 8266 | BOO-2-1 |
| 8242 | LILL-1 |
| 8258 | B-4-1 |
| 6074 | ...R-1 |
| 8240 | KULTUREN-1 |
| 8237 | KVLINGE-1 |
| 6064 | NYL-2 |
| 8284 | DRAII-1 |
| 8279 | Col |
| 6043 | Lov1 |
| 8397 | Ull-2-5 |
| 8402 | Var-2-6 |
5 2009-5-1 formalize public version of the database
5.1 2009-5-1 db dump outline
5.1.1 whole dump
whole dump of following databases from papaya to cypress:
at, chip, dbsnp, genome, stock
5.1.2 stock_250k
Select dump of following tables:
analysis_method array_info biology_category call_info call_info_pc_values call_method call_method_eigen_values call_phenotype_qq_plots call_qc candidate_gene_list candidate_gene_list_type candidate_list_super_type contaminant_type maf_vs_score_plot phenotype_avg phenotype_hist_plots phenotype_method probes qc_method readme results results_gene results_gene2type results_method results_method_json results_method_type score_rank_histogram_type snp_annotation snp_annotation_type snps snps_context web_help null_distribution_type score_rank_histogram_type candidate_gene_top_snp_test_rm_type view_array view_call view_qc
views:
view_array, view_call, view_qc
5.2 2009-4-8 Phenotypes to be published
id as in table phenotype_method: 1-35,39-48,57-82,158-159,161-179,182-186,272-274,277-283
non-flowering time ids: 8-35,60-79,158-159,161-179,182-186
5.3 2009-5-1 procedures
outlined the procedures of chopping the full 250k database down to the public one. (instead of create a select dump out of the full 250k, i'll use copy and delete. it's easier.)
- dataset: only keep the 192 version (actually encompassing 199 accessions)
- remove analysis methods other than KW and EMMA
- associations results: only the association results based on 192 dataset(above)
- candidate gene lists: only recently added ones, id>=130.
- phenotypes: Delete phenotypes not supposed to be public. For some phenotypes, delete the phenotype data beyond "192" or even "96".
sql:
delete from candidate_gene_list where id <130; delete from call_method where id not in (29, 32); delete from analysis_method where id not in (1,4,7); delete from phenotype_method where id not in ();
5.4 2009-5-4 todo
add on delete cascade on foreign keys in table results_method. 2009-5-19 done.
create a master-slave relationship from cypress to papaya. Before that, tables that need to be synchronized.
Tables that are changed on cypress
- stock_250k: web_help phenotype_method
Tables to be changed on cypress
stock_250k: remove phenotype values outside 199 accessions in table phenotype_avg:
delete from phenotype_avg where ecotype_id not in (select ecotype_id from view_call where call_method_id=32);
Tables that are changed on papaya
- dbsnp: new table accession2call_method, accession, call_method. (re-dump)
- stock_250k: qc_method (two new columns and 3 new entries)
After master-slave is established
- update stock.ecotype on cypress based on chicago's version.
- add index (tax_id, gene_symbol) to table gene_symbol2id in db genome





