Clustering of temporal gene expression data by regularized spline regression and an energy based similarity measure

Authors:

Abstract

Clustering analysis of temporal gene expression data is widely used to study dynamic biological systems, for example to identify sets of genes that are regulated by the same mechanism. However, temporal gene expression data often contain noise, missing data points, and non-uniformly sampled time points, which make it difficult for traditional clustering methods to extract meaningful information. In this paper, we introduce an improved clustering approach based on regularized spline regression and an energy-based similarity measure. The proposed approach models each gene expression profile as a B-spline expansion whose coefficients are estimated from the observed data by a regularized least squares scheme. To compensate for the limited information in a noisy and short gene expression profile, we use the genes correlated with it as the test set for choosing the optimal number of basis functions and the regularization parameter. We show that this treatment helps to avoid over-fitting. After fitting continuous representations of the gene expression profiles, we cluster them using an energy-based similarity measure. This measure captures the temporal information and the relative changes of each time series through its first and second derivatives. We demonstrate that our method is robust to noise and produces meaningful clustering results.
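The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the regularizer, so a second-difference penalty on the spline coefficients is assumed; the energy is assumed to take the Teager-Kaiser form x'(t)^2 - x(t) x''(t), which is consistent with the "Energy operator" keyword and the use of first and second derivatives; and the RMS distance between energy curves is only a placeholder dissimilarity. The names fit_profile, select_model, energy and energy_distance are hypothetical.

```python
import numpy as np
from scipy.interpolate import BSpline


def clamped_knots(a, b, n_basis, degree=3):
    """Clamped, uniformly spaced knot vector giving n_basis B-splines on [a, b]."""
    inner = np.linspace(a, b, n_basis - degree + 1)
    return np.concatenate([[a] * degree, inner, [b] * degree])


def fit_profile(times, values, n_basis=8, lam=1e-2, degree=3):
    """Penalized least-squares fit of one expression profile; returns a BSpline."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    knots = clamped_knots(times.min(), times.max(), n_basis, degree)
    # Design matrix: column j holds the j-th basis function at the observed times.
    B = BSpline(knots, np.eye(n_basis), degree)(times)
    # Assumed penalty (not from the paper): squared second differences of the coefficients.
    D = np.diff(np.eye(n_basis), n=2, axis=0)
    coef = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ values)
    return BSpline(knots, coef, degree)


def select_model(times, values, test_times, test_values, n_grid, lam_grid):
    """Pick (n_basis, lam) with the lowest squared error on held-out test points.

    In the setting of the abstract, the test points would come from genes
    correlated with the one being fitted; how those genes are chosen is not
    specified here. Test times are assumed to lie inside the fitted time range.
    """
    best = None
    for n_basis in n_grid:
        for lam in lam_grid:
            curve = fit_profile(times, values, n_basis, lam)
            err = np.mean((curve(test_times) - test_values) ** 2)
            if best is None or err < best[0]:
                best = (err, n_basis, lam)
    return best[1], best[2]


def energy(curve, grid):
    """Assumed Teager-Kaiser form: x'(t)^2 - x(t) * x''(t)."""
    return curve.derivative(1)(grid) ** 2 - curve(grid) * curve.derivative(2)(grid)


def energy_distance(curve_a, curve_b, grid):
    """Illustrative dissimilarity: RMS difference of the two energy curves."""
    diff = energy(curve_a, grid) - energy(curve_b, grid)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Because each profile is fitted as a continuous curve, genes sampled at different or missing time points can be compared on a common evaluation grid. A pairwise matrix of energy_distance values could then be passed to any standard clustering routine.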

Keywords: Spline model, Regularized regression, Energy operator, Temporal gene expression data analysis, Clustering

Article history: Received 7 November 2009, Revised 22 June 2010, Accepted 2 July 2010, Available online 13 July 2010.

Paper URL: https://doi.org/10.1016/j.patcog.2010.07.011