Personal tools
You are here: Home research variation log-2009-01
Document Actions

log-2009-01

1   2009-2-9 interaction between DOG1 and others

1.1   Intra DOG1

by boolean KW, with FLC, FRI, peak at chr4 1.3Mb.

1.2   2009-2-15 Boolean KW

for pairs resulting high pvalues under a certain boolean operator, why is that? can you do more?

proportion of each operator as the dominant operator in different pvalue range.

results/Intra-gene-10k-SNPpair_7_FT_22C_top_bool_type_pvalue_hist_perc_bool_by_log_pvalue.png

Figure Intra-Gene (SNPs within 10k vicinity of the gene) Top Boolean Operator Percentage by log pvalue

results/Inter-Top5Perc-SNP-7_FT_22C_top_bool_type_pvalue_hist_perc_bool_by_log_pvalue.png

Figure Inter-Top5%-SNP Top Boolean Operator Percentage by log pvalue

1.3   2009-2-15 linear model epistasis

questions:

  1. figure out how many SNP pairs among all don't have all 4-allele combination.
  2. epistasis pvalue=0, what happened?

2   2009-2-9 CNV

2.1   GADA/mGADAr2 parameters

2009-8-18 Three important parameters
name meaning usual value reference
aAlpha a in Gamma(a;b), controls magnitude of the jump between breakpoints, higher => less jumps 0.5 Pique-Regi2008Supp
TBackElim (amp1-amp2)/stddev 4 Pique-Regi2008Supp
MinSegLen minimum number of probes to call a segment 5 Pique-Regi2008Supp

Choose the cutoff to call duplication/deletion based on the histogram of amplitudes for known duplication/deletions.

2.2   QC

2.2.1   position of insertion matters

One CNV probe (1, 5923348) is adjacent to one insertion from 1_5923357_1 until 1_5923357_23. But the insertion is on the border of the probe with 4 base-pair overlap. The first 2 and the 4th base of the insertion matches the reference sequence and with 1_5923357_0 a polymorphism not matching the insertion. The probe has 0 or 1 base mismatch.

results/alignment_265_y1.png.png

Figure 1 Alignment 265 with 2nd insertion at position (1,5923357).

results/alignment_265_y1.png_legend.png

Figure 2 12-color legend for figure 1.

2.2.2   2009-2-17 observations

based on the 2010 SNP data, for each CNV probe, i extracted following information

  1. how many mismatches (polymorphic SNPs) for the accessions we know
  2. how many deletions
  3. how many insertions

the deletion/insertion count is not perfect because i didn't blast the probe sequence against the raw fragments. i just look at whether there are known SNPs in the probe and whether it's the reference allele. as a result, the region with a deletion could still match the CNV probe. the region with insertion could still match the probe as well. we saw a couple of cases like this.

so i probably need to refine it by blasting. this paper, http://banyan.usc.edu/research/ref1/Zhang2008, already did something similar, however didn't provide data download.

might have to get the duplication QC dataset from somewhere else cuz 2010 is not genome-wide. maybe the solexa and the Ler contigs mentioned in Zhang2008 above.

we have a few observations:

  1. most mismatch/deletion/insertion has an effect on the probe intensity, whose size depends on how many and which position in the probe.
  2. genome-wide, we can clearly see 5kb, 10kb, 20kb deletions. there are regions (~800kb window) encompassing lots of polymorphisms (frequent jump between deletions/SNPs and reference). also regions look pretty much like reference throughout.
  3. boundary of inversion (maybe indel) regions given by tina matches the crude deletion calls in the CNV.
  4. i did a GWA(genome-wide association) based on the crudely-called CNV. i choosed a cutoff based on one known FRI deletion and only made deletion calls, no duplication calls (high intensity just regarded as normal). the end-result captured the SVP very well (highest). a couple of other peaks overlapping with SNP gwa, a couple don't. give you the sort of news that it's not crap.
2.2.3   2009-2-15 what kind of polymorphism can we get out of CNV data?

The probe intensity could be affected even by one single mismatch, of course the position of the mismatch has a big impact on the hybridization.

For SNP polymorphism, how many polymorphic sites and what positions would ensure a statistically-acceptable separation between perfect match and mismatch.

small indels or large?

Focusing on small indels and SNP polymorphisms makes this thing inching close to Justin's SFP.

2.2.4   2009-2-15 raw intensity vs amplitude by GADA

raw intensity shows difference between different probe-mismatching. while amplitude doesn't. GADA probably smoothed it out.

2.3   2009-6-5 Regions from Karel Riha

http://www.gmi.oeaw.ac.at/en/research/research-groups/karel-riha/

region around CycB25 and CycB23
phenotype chromosome start stop
7 1 7128000 7140000

code to draw CNV matrices underlying it:

#use CNV discrete calls (1/-1) as SNP matrix
DrawSNPRegion.py -i /tmp/CycB25CycB23Region.tsv
-I /Network/Data/250k/tmp-yh/CNV/call_method_17_CNV_array_intensity_norm_chr1_GADA_M10_discrete_m.3125.tsv
-N /Network/Data/250k/tmp-yh/phenotype.tsv -l 28 -o /Network/Data/250k/tmp-yh/snp_region_CNV_M10_discrete_m.3125
-j /Network/Data/250k/tmp-yh/at_gene_model_pickelf -e 17 -u yh -s -V 2

#use CNV amplitude (float) output as SNP matrix
DrawSNPRegion.py -i /tmp/CycB25CycB23Region.tsv
-I /Network/Data/250k/tmp-yh/CNV/call_method_17_CNV_array_intensity_norm_chr1_GADA_M10_out_amp.tsv
-N /Network/Data/250k/tmp-yh/phenotype.tsv -l 28 -o /Network/Data/250k/tmp-yh/snp_region_CNV_M10_amp
-j /Network/Data/250k/tmp-yh/at_gene_model_pickelf -e 17 -u yh -s -V 3
regions around At2g04395 and At2g05210
phenotype chromosome start stop
7 2 1479516 1579515
7 2 1842649 1942649

2.4   2009-7-31 compile mGADAr2 (Pique-Regi2009) mex on hpc-cmb

All the code has been imported into svn, https://dl324b-1.cmb.usc.edu/projects/hapmap/browser/variation/trunk/bin/GADA/mGADAr2. Read the wiki page regarding how to check it out.

Roger uses c++ comment style // in his c files. By default, matlab adds -ansi to CFLAGS which forces gcc to exit upon encountering //. So need to override that option first.

  1. Run below and choose gccopts.sh

    /usr/usc/matlab/default/bin/mex -setup

  2. modify ~/.matlab/R2008a/mexopts.sh to remove -ansi from CFLAGS (there're several of them for different architectures).

  3. Run MakeMex.m in matlab to compile all c extentions.

  4. Run testFitGadaModel.m to do a test on simulation data.

  5. Read svn documentation of testFitGadaModelVersion2.m and figure out how to scale it

2.5   2009-10-12 CNV normalization

Col-0 (reference genome) intensity ranges from 1.6x to 4.7x (log transformed). There are apparent duplications (probes are not well-chosen to be unique in ref genome or ref genome/TAIR 8 contains errors). Same probe appears with high intensity in different arrays for the same Col-0.

The normalization procedure proposed by Roger (Quantile Normalization + Column Median subtraction + Row Median subtraction) seems to cause the genome-wide pattern to become more zigzag. The column and row median subtraction step is questionable. The idea is to get rid of reference intensity so that zero corresponds to reference, while column and row median are not good reference candidate.

A better idea is to use intensity from multiple Col-0 arrays as row or column reference.

2.6   2009-11-04 False negative/positive rate with PE-deletions as gold standard

Compare tiling-array derived deletions with Paired-end derived ones.

PE-deletions:Paired-end derived deletions
tiling-deletion:
 deletions derived from tiling array data. All segments from the algorithm whose amplitudes are below cutoff.
min_QC_segment_size:
 =5 (minimum segment length of PE-deletion. Setting it to 5 renders it useless.)
min_no_of_probes:
 =5 (minimum number of tiling probes covered by a PE-deletion, to exclude small deletions
max_boundary_diff:
 =10kb (maximum distance tolerated between endpoints of one tiling-deletion and PE-deletion)
max_diff_perc:diff_perc=max_boundary_diff/tiling-deletion-length. maximum diff_perc tolerated.
cutoff:maximum amplitude to call a deletion.

Besides criteria above, if a tiling-deletion is within a PE-deletion , it's also called a match. The rationale is because tiling-deletions haven't undergone a merging process. A large deletion might be fragmented into several small adjacent deletions due to probe intensity fluctuation.

FNR:false negative rate
FPR:false positive rate
False negative/positive rates max_diff_perc=0.2
Ecotype FNR (cutoff=-0.33) FNR (cutoff=-0.5) FPR (cutoff=-0.33) FPR (cutoff=-0.5)
Cvi-0 (6911)* 349/473=73.8% 365/473=77.2% 6019/6227=96.6% 3379/3549=95.2%
Fei-0 (8215) 196/372=52.7% 231/372=62.1% 1882/2111.=89.1% 1019/1187=85.8%
Kastel-1 (9169) 335/405=82.7% 373/405=92.1% 600/672=89.3% 376/409=91.9%
Sha (6962) 103/342=30.1% 122/342=35.6% 7563/8315=90.9% 4634/5248=88.3%

* Cvi-0 PE data has only two chromosomes (1&4). FPR is inflated.

False negative/positive rates max_diff_perc=0.3
Ecotype FNR (cutoff=-0.33) FNR (cutoff=-0.5) FPR (cutoff=-0.33) FPR (cutoff=-0.5)
Cvi-0 (6911)* 324/473=68.5% 344/473=72.7% 5979/6227=96.0% 3346/3549=94.2%
Fei-0 (8215) 174/372=46.7% 216/372=58.1% 1859/2111=88.1% 1004/1187=84.6%
Kastel-1 (9169) 333/405=82.2% 371/405=91.6% 598/672=89.1% 374/409=91.4%
Sha (6962) 77/342=22.5% 97/342=28.4% 7495/8315=90.1% 4580/5248=87.3%

3   2009-3-9 LD

dataset: call method 29

results/call_method_29_LD_D_prime_m0.8_r2_min_r2_0.85_hist.png

Figure Histogram of LD using snp dataset call method id 29. Only pairs with r2>=0.85 are plotted here.

results/call_method_29_only96_LD_D_prime_m0.8_r2_min_r2_0.85_hist.png

Figure Histogram of LD using snp dataset call method id 29 with only 96 "2010" accessions. Only pairs with r2>=0.85 are plotted here.

3.1   SNP pairs with LD (r2) >= 0.95

31591 pairs with LD (r2) >=0.95. around 4500 pairs with r2=1.

13 out of 31591 are inter-chromosomal which roughly correspond to 4 regional links. 3rd column is number of base-pairs that are identical by dynamic programming, match=1, mismatch=-1, del=-2. 4th column is r2.:

link 1:

1-14286837, 5-8441445, 11, 0.96324

1-15607026, 5-8441445, 16, 0.96324

link 2:

1, 17398608, 5, 12107290, 15, 0.95658

1, 17398608, 5, 12283101, 9, 0.97071

1, 17398608, 5, 12350652, 16, 0.95578

1, 17398608, 5, 12442929, 13, 0.97024

1, 17398608, 5, 12579638, 13, 0.95658

1, 17398608, 5, 12623022, 13, 0.95658

1, 17398608, 5, 13072637, 11, 0.95658

link 3:

1, 25993897, 3, 15752587, 9, 0.96007

1, 25997144, 3, 15752587, 23, 0.96007

1, 25997609, 3, 15752587, 13/14, 0.96998

link 4:

1, 26287598, 4, 18405157, 15, 0.95170

500 out of 31578 are over 130,046 bp apart on the same chromosome. 250 out of 31578 are over 296,165 bp apart on the same chromosome.

Table Number of Pairs Over some Distanace apart on the same chromosome
#pairs min distance
500 130046
250 296165
100 464200
50 633269
10 857081
5 965347
3 1320189
2 2320103
1 18805634

3.2   2009-4-12 genotype combinations of SNP pairs

For two SNPs, look at the genotype combinations across accessions with non-NA phenotype. This is from call method 29 and phenotype 7.:

mpiexec ~/script/variation/src/MpiInterSNPPairAsso.py -i ~/panfs/250k/call_method_29_binary.tsv -n
~/panfs/250k/snps_context_g0_m10000  -p ~/panfs/250k/phenotype.tsv -o ~/panfs/250k/InterSNPCount -w 7 -y3
-u yh -a yh324 -t ~/panfs/db/results/type_1/ -l 29 -f 23000

out of 6,014,661,600 (need to figure out why it's this number) pairs, majority (5,877,321,290=97.7%) are 4-combos.

stats of genotype combinations
#pairs combo type intra-chromosomal inter-chromosomal
1621 2-combo 1621 0
137338689 (2.28%) 3-combo 25932687 111406002 (81.1%)
5877321290 (97.7%) 4-combo 1086044792 4791276498 (81.5%)

4   2009-3-24 Caroline Dean's 30 sequences around FLC

  • sequence spanning from 3170860 to 3182869 on chromosome 5. figured out by blasting head and tail of Col-0's sequence against At genome at NCBI. their sequences are in reverse-complement order (in FLC transcription order though).

  • get the reference from db and output in fasta format:

    from pymodule.db import get_sequence_segment
    seq=get_sequence_segment(db_250k.metadata.bind, 186532933, 3170000, 3183500,
    annot_assembly_table='genome.annot_assembly', raw_sequence_table='genome.raw_sequence')
    
  • reverse complement (reverseComplementFastaInput() in class AnalyzeSNPData of variation.src.misc.py):

    input_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC.seq')
    output_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC_rc.seq')
    AnalyzeSNPData.reverseComplementFastaInput(input_fname, output_fname)
    
  • paste the reference and caroline sequences into one:

    cat ref.seq CarolineDeanFLC_rc.seq >CarolineDeanFLC_rc_with_ref.seq
    
  • run clustalx. after full alignment, adjust by aligning a few selected ones with apparent alignment-shift problems.

  • discover 7159 SNPs (spurious deletions in the beginning and end):

    fasta_input_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC_rc_with_ref.align.fasta')
    output_fname = os.path.expanduser('~/script/variation/doc/FLC/CarolineDeanFLC_rc_with_ref_snp_matrix.tsv')
    AnalyzeSNPData.outputSNPsOutOfFastaAlignMatrixInYuFormat(fasta_input_fname, output_fname, chromosome=5, start=3170860)
    
  • chop off beginning and end deletions:

    cut -f 1-2,600-6000 CarolineDeanFLC_rc_with_ref_snp_matrix.tsv >
    CarolineDeanFLC_rc_with_ref_snp_matrix_600_6000.tsv
    
  • assign 7087(Col-1) as id to name='ref' (reference sequence), didn't choose 8279 (Col-0) because same id would cause one to step on the other and the other disappear. SNP data only allows one unique ecotypeid.

  • draw SNP region by ~/script/variation/src/DrawSNPRegion.py:

    ~/script/variation/src/DrawSNPRegion.py -i /tmp/FLC_region.tsv
    -I ~/script/variation/doc/FLC/CarolineDeanFLC_rc_with_ref_snp_matrix_600_6000.tsv
    -N /Network/Data/250k/tmp-yh/phenotype.tsv -l 28 -o ~/script/variation/doc/FLC/snp_region_FLC
    -j /Network/Data/250k/tmp-yh/at_gene_model_pickelf -e 29 -u yh -s -p yh324 -V 4
    
30 accessions
ecotypeid nativename
6046 LOV-5
6900 BIL-5
6901 BIL-7
8382 SPR-1-2
7519 OMO-2-3
6956 PU2-7
6969 TAMM-27
6826 KZ-1
6980 WS-0
6897 AG-0
7183 KAS-1
6945 NOK-3
6971 TS-5
6929 KONDARA
7182 KA-0
7285 OST-0
8306 HOV-4-1
8247 SAN-2
8266 BOO-2-1
8242 LILLš-1
8258 BŒ-4-1
6074 ...R-1
8240 KULTUREN-1
8237 KŠVLINGE-1
6064 NYL-2
8284 DRAII-1
8279 Col
6043 Lov1
8397 Ull-2-5
8402 Var-2-6

5   2009-5-1 formalize public version of the database

5.1   2009-5-1 db dump outline

5.1.1   whole dump

whole dump of following databases from papaya to cypress:

at, chip, dbsnp, genome, stock
5.1.2   stock_250k

Select dump of following tables:

analysis_method array_info biology_category call_info call_info_pc_values call_method call_method_eigen_values
call_phenotype_qq_plots call_qc candidate_gene_list candidate_gene_list_type candidate_list_super_type 
contaminant_type maf_vs_score_plot  phenotype_avg phenotype_hist_plots phenotype_method probes 
qc_method readme results results_gene results_gene2type results_method
results_method_json results_method_type score_rank_histogram_type snp_annotation
snp_annotation_type snps snps_context web_help 
null_distribution_type score_rank_histogram_type candidate_gene_top_snp_test_rm_type  view_array view_call view_qc 

views:

view_array, view_call, view_qc

5.2   2009-4-8 Phenotypes to be published

id as in table phenotype_method: 1-35,39-48,57-82,158-159,161-179,182-186,272-274,277-283

non-flowering time ids: 8-35,60-79,158-159,161-179,182-186

5.3   2009-5-1 procedures

outlined the procedures of chopping the full 250k database down to the public one. (instead of create a select dump out of the full 250k, i'll use copy and delete. it's easier.)

  1. dataset: only keep the 192 version (actually encompassing 199 accessions)
  2. remove analysis methods other than KW and EMMA
  3. associations results: only the association results based on 192 dataset(above)
  4. candidate gene lists: only recently added ones, id>=130.
  5. phenotypes: Delete phenotypes not supposed to be public. For some phenotypes, delete the phenotype data beyond "192" or even "96".

sql:

delete from candidate_gene_list where id <130;
delete from call_method where id not in (29, 32);
delete from analysis_method where id not in (1,4,7);
delete from phenotype_method where id not in ();

5.4   2009-5-4 todo

  1. add on delete cascade on foreign keys in table results_method. 2009-5-19 done.

  2. create a master-slave relationship from cypress to papaya. Before that, tables that need to be synchronized.

    1. Tables that are changed on cypress

      1. stock_250k: web_help phenotype_method
    2. Tables to be changed on cypress

      1. stock_250k: remove phenotype values outside 199 accessions in table phenotype_avg:

        delete from phenotype_avg where ecotype_id not in (select ecotype_id from view_call where call_method_id=32);
        
    3. Tables that are changed on papaya

      1. dbsnp: new table accession2call_method, accession, call_method. (re-dump)
      2. stock_250k: qc_method (two new columns and 3 new entries)
  3. After master-slave is established

    1. update stock.ecotype on cypress based on chicago's version.
    2. add index (tax_id, gene_symbol) to table gene_symbol2id in db genome
« November 2009 »
Su Mo Tu We Th Fr Sa
1234567
89101112 1314
1516171819 20 21
22232425262728
2930
 

Powered by Plone CMS, the Open Source Content Management System

This site conforms to the following standards: