
Physics of Atomic Nuclei (Yadernaya Fizika), 2010, Vol. 73, No. 2, pp. 263-267

ELEMENTARY PARTICLES AND FIELDS

COMBINED CLUSTERING MODELS FOR THE ANALYSIS

OF GENE EXPRESSION

©2010 M. Angelova*, J. Ellman**

Northumbria University, Newcastle upon Tyne, UK

Received April 22, 2009

Clustering has become one of the fundamental tools for analyzing gene expression and producing gene classifications. Clustering models enable finding patterns of similarity in order to understand gene function, gene regulation, cellular processes and sub-types of cells. The clustering results, however, have to be combined with sequence data or knowledge about gene functionality in order to draw biologically meaningful conclusions. In this work, we explore a new model that integrates gene expression with sequence or text information.

1. INTRODUCTION

Life sciences are currently undergoing an information revolution as a result of the development of techniques and tools that allow the collection of biological information in great detail and in large quantities. Microarray technology provides some of the most promising tools available to researchers today, as it allows the expression levels of thousands of genes to be measured simultaneously under controlled experimental conditions. The ability of this technology to take a snapshot of a whole gene expression pattern opens enormous possibilities. For example, DNA microarrays have been successfully used to study genome-wide patterns of gene expression [1—4] and are capable of providing fundamental insights into biological processes such as gene function and gene regulation [1, 2], cell cycle [1, 4], and cancer [2, 3]. The motivation for large-scale gene expression analysis lies with the central dogma of molecular biology [5, 6], which justifies the premise that information about the functional state of an organism is to a great extent determined by information on the gene expression.

One of the most powerful automatic techniques for the analysis of high-throughput gene expression data is clustering [4]. It is the exploratory, unsupervised process of partitioning data into groups (clusters) by finding similarity patterns within gene expression data. An underlying assumption in clustering is that genes in a cluster are functionally related. This implies that many of the genes could also be co-regulated and thus share transcription factor binding motifs in their upstream sequences [7]. Clustering results need to be evaluated against biologically significant information, such as previously known biological facts, theories and results. Biological and medical literature databases store such published information and can be used to cross-reference experimental and analytical results, and even to drive the interpretation and organization of the expression data [8, 9].

*E-mail: maia.angelova@unn.ac.uk
**E-mail: jeremy.ellman@unn.ac.uk

In this paper we discuss a combined model that integrates gene expression results with sequence data and published knowledge about gene functionalities in order to produce clustering results with more biological significance. In Section 2 we review three clustering models used in the analysis of gene expression data from microarray experiments. Combined clustering models are described in Section 3, where sequence data and published knowledge about gene functions are considered. In Section 4, a case study illustrates the models with results for the bacterium E. coli, followed by conclusions and discussion in Section 5.

2. CLUSTERING MODELS

Clustering is accomplished by finding similarities between data objects according to characteristics found in actual data. Genes with similar expression patterns, known as co-expressed genes, can be clustered together. Co-expressed genes in the same cluster are likely to have similar cellular functions, or be involved in the same cellular processes. A strong correlation of expression patterns could indicate co-regulation [4].

Gene expression data from a microarray experiment can be represented by a real-valued n x m expression matrix, W = {w_{ij} | 1 \leq i \leq n, 1 \leq j \leq m}, where the rows of W, g_i = {w_{ij} | 1 \leq j \leq m}, form the expression patterns of the genes, the columns of W, s_j = {w_{ij} | 1 \leq i \leq n}, represent the expression profiles of the samples, and each cell of the expression matrix W, w_{ij}, is the measured expression level of gene i in sample j, i = 1, 2, ..., n, j = 1, 2, ..., m. In what follows, the vector r indicates a gene expression data object, which can represent a gene pattern g_i in the m-dimensional array space or a sample expression profile s_j in the n-dimensional gene space. The original gene expression matrix contains noise, missing values, and systematic variations arising from the experimental procedure. Data preparation, normalization, and preliminary statistical analysis are in most cases necessary before clustering analysis is performed.
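As a minimal illustration of this layout (all numbers are invented), the matrix W can be stored row-wise, so that gene patterns g_i are rows and sample profiles s_j are columns:

```python
# A toy n x m expression matrix W (n = 3 genes, m = 4 samples);
# the values are invented for illustration only.
W = [
    [2.1, 0.5, 1.8, 0.9],  # g_1: expression pattern of gene 1
    [1.9, 0.4, 2.0, 1.1],  # g_2
    [0.2, 3.1, 0.1, 2.8],  # g_3
]
n, m = len(W), len(W[0])

def gene_pattern(i):
    """Row i of W: expression of gene i across all m samples."""
    return W[i]

def sample_profile(j):
    """Column j of W: expression of all n genes in sample j."""
    return [W[i][j] for i in range(n)]
```

Under this convention, clustering the rows groups co-expressed genes, while clustering the columns groups samples with similar expression profiles.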

The similarity is defined as a function, Sim, that usually measures distance or correlation between data objects, representing genes or samples in the expression matrix. The choice of similarity measure determines the output of the clustering algorithm and the interpretation of the results. The distance measures the proximity between data objects ri and rj and represents the dissimilarity or unlikeness between the data objects. A typical distance measure is the Euclidean distance,

D(r_i, r_j) = \sqrt{\sum_{d=1}^{m} (w_{id} - w_{jd})^2}, \quad i, j = 1, 2, \ldots  (1)

Other distance measures used in [10] are the Manhattan and Minkowski distances.
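These three distances can be sketched in a few lines of Python; the Minkowski distance of order p generalizes the other two (p = 2 gives Euclidean, p = 1 Manhattan). The two example patterns are invented:

```python
def minkowski(ri, rj, p):
    """Minkowski distance of order p between two expression patterns."""
    return sum(abs(a - b) ** p for a, b in zip(ri, rj)) ** (1.0 / p)

def euclidean(ri, rj):
    """Euclidean distance, cf. Eq. (1): Minkowski with p = 2."""
    return minkowski(ri, rj, 2)

def manhattan(ri, rj):
    """Manhattan (city-block) distance: Minkowski with p = 1."""
    return minkowski(ri, rj, 1)

# Two invented expression patterns over m = 3 samples.
g1, g2 = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
```

Here euclidean(g1, g2) is 5.0 and manhattan(g1, g2) is 7.0; larger p weights the largest per-sample difference more heavily.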

The correlation measures the similarity or alikeness between the shapes of two objects r_i and r_j. It measures the relationship between gene expression profiles. A typical correlation function is the Pearson correlation coefficient,

P(r_i, r_j) = \frac{\sum_{d=1}^{m} (w_{id} - \mu_i)(w_{jd} - \mu_j)}{\sqrt{\sum_{d=1}^{m} (w_{id} - \mu_i)^2 \sum_{d=1}^{m} (w_{jd} - \mu_j)^2}},  (2)

where \mu_i and \mu_j are the means of the objects r_i and r_j, respectively. The cosine correlation coefficient, Jaccard similarity, and Dice similarity are other metrics used in [10].
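A direct transcription of the Pearson correlation coefficient (2) might look as follows; this is an illustrative sketch rather than production code (it assumes neither pattern is constant, so the denominator is nonzero):

```python
import math

def pearson(ri, rj):
    """Pearson correlation coefficient between two expression patterns,
    cf. Eq. (2): covariance normalized by the standard deviations."""
    m = len(ri)
    mu_i = sum(ri) / m
    mu_j = sum(rj) / m
    num = sum((a - mu_i) * (b - mu_j) for a, b in zip(ri, rj))
    den = math.sqrt(sum((a - mu_i) ** 2 for a in ri) *
                    sum((b - mu_j) ** 2 for b in rj))
    return num / den
```

Patterns with the same shape give +1 and mirror-image shapes give -1, regardless of absolute expression level, which is why correlation rather than distance is often used to detect co-expression.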

The choice of similarity metric has received much discussion [4, 11], but little work has been done on evaluating different metrics in the analysis of gene expression. In [10], the impact of similarity metrics on the K-means and Gaussian mixture (EM) models was investigated. K-means and EM partition data into clusters according to a chosen similarity function. Both methods require the user to specify the number of clusters, which is often difficult to know in advance, and are very sensitive to the choice of similarity measure and the number of clusters.

K-means is an iterative partitioning method in which objects are moved among a pre-specified number of clusters, K, until an optimal solution is reached. The algorithm minimizes a global error criterion, known as the cost function [4], which depends on the number of clusters K, the cluster centres m_i, and the similarity function Sim. Although there is no universally accepted definition and the cost function should be tailored to the problem, it is usually defined as the "within-cluster" sum of the squared distances between each data object r belonging to the cluster C_i and its cluster centre m_i,

CF_2 = \sum_{i=1}^{K} \sum_{r \in C_i} \|r - m_i\|^2  (3)

and represents the total error. The K-means algorithm minimizes CF_2 but converges to a local rather than the global minimum, depending on the initial parameters.
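A bare-bones K-means in this notation, using the Euclidean distance as Sim and the within-cluster sum of squared distances as the cost, might look like the following sketch; the data set, K, and the fixed random seed are invented for illustration:

```python
import random

def kmeans(data, K, iters=50, seed=0):
    """Partition `data` into K clusters by alternating assignment and
    centre-update steps; converges to a local minimum of the
    within-cluster sum of squared distances (the CF_2 cost)."""
    rng = random.Random(seed)
    centres = rng.sample(data, K)
    clusters = [[] for _ in range(K)]
    for _ in range(iters):
        # Assignment step: move each object r to its nearest centre.
        clusters = [[] for _ in range(K)]
        for r in data:
            k = min(range(K),
                    key=lambda k: sum((a - b) ** 2
                                      for a, b in zip(r, centres[k])))
            clusters[k].append(r)
        # Update step: each centre becomes the mean of its cluster.
        for k, c in enumerate(clusters):
            if c:
                centres[k] = [sum(col) / len(c) for col in zip(*c)]
    # Total error: the within-cluster sum of squared distances.
    cost = sum(sum((a - b) ** 2 for a, b in zip(r, centres[k]))
               for k, c in enumerate(clusters) for r in c)
    return centres, clusters, cost

# Two well-separated groups of points (invented data).
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centres, clusters, cost = kmeans(data, K=2)
```

Re-running with a different seed can land in a different local minimum, which is exactly the sensitivity to initial parameters noted above.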

In probabilistic models, data is assumed to be drawn from a series of probability distributions, usually assumed to be multivariate Gaussian distributions. These models use the Expectation-Maximization (EM) algorithm [12] to produce the best fit between the data and a series of Gaussian distributions.

The EM algorithm uses the likelihood as a similarity measure [4]. The algorithm takes into account that each object can belong to each cluster with a certain probability and finds the maximal log-likelihood, given by

L = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \lambda_k p_k(r_i | M_k) \right),  (4)

where \lambda_k is the probability that a data object r_i belongs to cluster C_k, \lambda_k > 0, \sum_k \lambda_k = 1. Each cluster C_k is represented by a model M_k, and p_k(r_i | M_k) is the probability density of r_i in the model M_k. Each model M_k can be represented by a multivariate d-dimensional Gaussian distribution with mean \mu_k and covariance \Sigma_k. Like the cost function (3), the maximal log-likelihood (4) has local extrema rather than a global extremum.

K-means and EM are related: when the cost function corresponds to an underlying probabilistic mixture model, K-means can be regarded as an approximation to the classical EM algorithm on a (spherical) Gaussian mixture model.
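For intuition, the log-likelihood (4) can be evaluated directly for a one-dimensional, two-component Gaussian mixture; the data and parameters below are invented for illustration rather than fitted by EM:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))

def log_likelihood(data, lambdas, mus, sigmas):
    """L = sum_i log( sum_k lambda_k p_k(r_i | M_k) ), cf. Eq. (4),
    for a 1-D mixture with mixing weights lambdas."""
    return sum(math.log(sum(lam * gauss_pdf(x, mu, sig)
                            for lam, mu, sig in zip(lambdas, mus, sigmas)))
               for x in data)

# Four invented observations, drawn near 0 and near 5.
data = [0.1, -0.2, 5.0, 4.9]
```

Parameters that match the data (components near 0 and 5) yield a higher L than badly mismatched ones; the EM iteration exploits exactly this by repeatedly adjusting the parameters to increase L.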

The entropy-based models use the entropy of the clusters as a similarity metric. The entropy measures the uncertainty of a random variable. In thermodynamics, the entropy is a measure of the disorder in the system. Applied to clustering, the concept of entropy means that each cluster should have a low entropy, as objects in the same cluster are similar. Thus, the search for clusters with minimal entropy can be used as a clustering criterion.

Following Shannon's definition [13], the entropy of the clusters can be written as

H = \sum_{j=1}^{K} p_j H(X | C_j),  (5)

where H(X | C_j) is the entropy of the cluster C_j, p_j is the probability of the cluster C_j such that \sum_j p_j = 1, and K is the number of clusters. A clustering algorithm that minimizes (5) has been developed in [14]. The entropy of the cluster H(X | C_j) can be measured using the actual relationship between data objects and clusters. The choice of a particular data distribution (such as a Gaussian distribution) can lead to a poor representation of the data. An alternative method is based on the actual density of data objects using the Parzen density approach [15]. The probability p(C_j | r_i) is evaluated using the Parzen density estimation for the clustering problem as [14]

p(C_j | r_i) = \frac{n_{ij}}{n_i},  (6)

where n_{ij} is the number of samples in the selected region R(r_i) that belong to cluster C_j, and n_i is the number of all samples located in R(r_i). The entropy clustering criterion can be written as

H = -\sum_{i=1}^{n} \sum_{j=1}^{K} \frac{n_{ij}}{n_i} \log\left(\frac{n_{ij}}{n_i}\right).  (7)
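Assuming the counts n_{ij} and n_i are obtained by counting neighbours in the regions R(r_i), the entropy criterion reduces to a plain double sum; the counts below are invented:

```python
import math

def entropy_criterion(counts):
    """H = -sum_i sum_j (n_ij / n_i) * log(n_ij / n_i), where counts[i][j]
    plays the role of n_ij (neighbours of object i, within R(r_i), that lie
    in cluster C_j) and n_i = sum_j counts[i][j]."""
    H = 0.0
    for row in counts:
        n_i = sum(row)
        for n_ij in row:
            if n_ij > 0:  # the 0 * log(0) terms are taken as 0
                p = n_ij / n_i
                H -= p * math.log(p)
    return H
```

Pure neighbourhoods, each dominated by a single cluster, give H = 0, while evenly mixed neighbourhoods maximize H; minimizing H therefore favours well-separated clusters.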
