Abstract
In bioinformatics, various computational techniques, such as statistics,
combinatorial methods, and machine learning, have been widely applied. In
this talk, we will present our attempts to solve several computational
molecular biology problems with machine learning techniques including both
supervised and unsupervised learning. First, we designed a class of edit
kernels that incorporate prior biological information to measure
sequence similarity in support vector machines (SVMs). We then
applied the newly developed edit kernels with SVMs to predict
translation initiation sites (TISs) in eukaryotic mRNAs with
high accuracy (99.9%). Second, we proposed generalized linear discriminant
analysis (GLDA) to address the curse of dimensionality and the
small sample size problem in cancer diagnosis from gene expression
profiles. For unsupervised learning, we developed both clustering and
biclustering algorithms to analyze gene expression data. The new
clustering method, by minimizing conditional entropy, can automatically
estimate the number of clusters and detect outliers. This makes the method
practical for biological research, where the exact number of clusters is
usually unknown and outliers often carry important information. For
gene expression data collected under heterogeneous conditions, conventional
clustering algorithms do not apply because they assume that related
genes are co-expressed across all conditions. For this kind of data, we proposed
a universal biclustering algorithm, based on the theory of Kolmogorov
complexity, that groups genes and conditions simultaneously. Based on this
research, we believe that machine learning will play an important role in
future genomic research.
Biosketch
Haifeng Li received a B.S. degree in Computational Mathematics from Sichuan
University in 1998 and an M.S. degree in Statistics from the University of
New Orleans in 2002. Currently, he is a Ph.D. candidate in the Department
of Computer Science and Engineering at the University of California,
Riverside. His research interests are in the general area of
bioinformatics and computational biology, machine learning, and design and
analysis of algorithms.