COMPARISON OF QSAR MODELS BASED ON COMBINATIONS OF GENETIC ALGORITHM, STEPWISE MULTIPLE LINEAR REGRESSION, AND ARTIFICIAL NEURAL NETWORK METHODS TO PREDICT Kd OF SOME DERIVATIVES OF AROMATIC SULFONAMIDES AS CARBONIC ANHYDRASE II INHIBITORS

AFSHIN MALEKI; ARAM FARAJI; HIUA DARAEI; LOGHMAN ALAEI


*Kurdistan Environmental Health Research Center, Kurdistan University of Medical Sciences, Sanandaj, Iran
**Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
***Faculty of Pharmacy, Kermanshah Medical Sciences University, Kermanshah, Iran

Received March 4, 2013; in final form, May 14, 2013

Four stepwise multiple linear regression (SMLR) models and a genetic algorithm (GA) based multiple linear regression (MLR) model, together with artificial neural network (ANN) models, were applied to quantitative structure-activity relationship (QSAR) modeling of the dissociation constants (Kd) of 62 arylsulfonamide (ArSA) derivatives as human carbonic anhydrase II (HCA II) inhibitors. The best subsets of molecular descriptors were selected by the SMLR and GA-MLR methods. The selected variables were then used to generate MLR and ANN models. The predictive power of the models was examined with an external test set and by cross-validation. In addition, further tests were performed to examine other aspects of the models. The results show that for certain purposes GA-MLR is better than SMLR, and for others ANN outperforms the MLR models.

Keywords: human carbonic anhydrase II, dissociation constants, QSAR, genetic algorithm, artificial neural network

DOI: 10.7868/S0132342313060067

INTRODUCTION

Carbonic anhydrases (CA; carbonate hydro-lyase, EC 4.2.1.1) hold a remarkable position among the zinc-containing enzymes studied in recent years. In higher vertebrates, such as humans, 15 different isozymes have been discovered; among them, human carbonic anhydrase II (HCA II) is physiologically one of the most important. CAs are important because they are involved in crucial physiological pathways connected with the catalysis of the reversible hydration of carbon dioxide. The chemical species involved in this reaction are important in numerous biological processes. Owing to this significant role, inhibition of these enzymes by carbonic anhydrase inhibitors (CAIs) may be beneficial for the design of therapeutic agents for the management and prevention of many related diseases [1-3].

Sulfonamides (SA) represent an important class of biologically active compounds. Mann and Keilin studied the inhibitory effects of chemicals, which led to important drugs such as the SAs with CA inhibitory properties. The aryl sulfonamides (ArSA) have several characteristics that make them a particularly suitable case for biophysical and organic chemistry studies of inhibitor binding and drug design. With the SAs as the lead structure, different classes of pharmacological agents have been obtained. Many derivatives of SA belonging to the heterocyclic and aromatic classes have been synthesized and investigated for their biological activity [2, 4-7].

# Corresponding author (tel./fax: +98 (871) 662-51-31, e-mail: hiua.daraei@muk.ac.ir).

Several analytical methods exist for determining activity parameters; in the present work we focus on the dissociation constant, Kd [8]. However, these approaches are laborious, expensive, and time-consuming, and they require an adequate amount of pure compound, so they are not suitable for high-throughput screening of diverse structures [9, 10].

The physiological activity of a molecule can be quantitatively connected to certain molecular parameters (descriptors). Quantitative structure-activity relationship (QSAR) modeling, a valuable tool in rational drug design, seeks to connect chemical structure and biological activity by means of mathematical equations [11, 12]. A wide range of descriptors, including constitutional, topological, and quantum chemical descriptors, has been described for use in QSAR analysis [13, 14]. QSAR models are suitable tools in drug design because they can decrease the time and effort needed to create new molecules by reducing expensive and time-consuming trial-and-error tests. Given the biological importance of sulfonamide compounds as potent CA inhibitors (CAIs), QSAR models have been suggested for predicting the CA inhibitory effects of different aromatic and heterocyclic sulfonamides using different molecular descriptors. QSAR studies on different properties of sulfonamide compounds, including some sulfa drugs using distance-based topological indices, benzene sulfonamide carbonic anhydrase inhibitors, and a series of sulfa drugs as inhibitors of Pneumocystis carinii dihydropteroate synthetase, are found in recent scientific reports [15-17].

The main goal of this work is to establish an accurate QSAR model relating the molecular descriptors of 62 compounds to Kd values measured in different laboratories, using MLR and ANN modeling algorithms. For these analyses, GA-MLR and SMLR are applied as variable selection methods. The most important aspect of the proposed QSAR models is the comparison of ANN with MLR as modeling approaches and of GA-MLR with SMLR as descriptor selection methods.

MATERIALS AND METHODS

The Data Set

The first step in building QSAR equations is to compile a list of compounds for which experimentally determined inhibitory activities are known. To this end, a list of 62 derivatives was extracted from the values reported by Krishnamurthy et al. [8]. Our data set included para-, meta-, and ortho-substituted derivatives, which are shown in Table 1. The logKd values range from -1.00 to 3.72. The data set was randomly divided into three sections, the training (60%), the validation (20%), and the external test (20%) sets, containing 37, 12, and 13 structures, respectively. The training and validation sets were used to build and optimize the QSAR model, and the external test set was used to evaluate the predictive power of the resulting model. To avoid complexity and interference from the properties of different isoenzymes, only the human type of CA was used (PDB no. 1CA2).
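The random 60/20/20 partition described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_dataset`, the fixed random seed, and the rounding rule (truncate the training and validation sizes, give the remainder to the test set) are not taken from the original work, although this rounding does reproduce the reported set sizes of 37, 12, and 13.

```python
import random


def split_dataset(n_compounds=62, fractions=(0.6, 0.2), seed=0):
    """Randomly partition compound numbers into training, validation, and test sets.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    indices = list(range(1, n_compounds + 1))  # compound numbers 1..62
    rng = random.Random(seed)                  # fixed seed only for reproducibility
    rng.shuffle(indices)

    n_train = int(fractions[0] * n_compounds)  # 0.6 * 62 = 37.2, truncated to 37
    n_valid = int(fractions[1] * n_compounds)  # 0.2 * 62 = 12.4, truncated to 12

    train = sorted(indices[:n_train])
    valid = sorted(indices[n_train:n_train + n_valid])
    test = sorted(indices[n_train + n_valid:])  # the remaining 13 compounds
    return train, valid, test


train, valid, test = split_dataset()
print(len(train), len(valid), len(test))  # 37 12 13
```

Keeping the external test set out of every model-building and optimization step is what allows it to serve as an independent check of predictive power.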

Molecular Descriptors Generation

An important step in QSAR investigations is the numerical representation of the chemical structures by so-called molecular descriptors. Model performance and the accuracy of the results depend to a large extent on the way the descriptors are obtained. In this study, the molecules were first drawn in ChemDraw and then in Chem3D Ultra (version 8.0). Geometry optimization was then carried out with the semi-empirical PM3 method, until the root mean square (RMS) gradient reached 0.01, in the MOPAC interface of ChemOffice [18]. After that, the MOPAC output files were used by ChemPropPro, ChemPropStd, ClogP, and MM2 as MOPAC servers. More than 60 molecular descriptors were derived to properly characterize the chemical structures of the 62 derivatives. They were grouped into five classes: constitutional, geometrical, topological, electronic, and quantum chemical descriptors. All of these descriptors were obtained solely from the molecular structure, and no experimental data were required.

Feature Selection

In QSAR studies, it is important to construct a model with the fewest structure-based molecular descriptors, because this leads to a simple and interpretable model. Therefore, a pre-selection of descriptors was carried out by eliminating descriptors that are not available for every structure, descriptors with only a small variation in magnitude across all structures, and descriptors that show a very small correlation with the logKd values. To reduce the number of descriptors further, two additional common methods were used, namely SMLR and GA-MLR.
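A minimal sketch of such a pre-selection filter is given below. The cut-off values (`min_rel_std`, `min_abs_r`), the function names, and the toy descriptor values are all hypothetical; the original text does not state the thresholds that were used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def preselect(descriptors, log_kd, min_rel_std=0.01, min_abs_r=0.1):
    """Return the names of descriptors that pass three filters (thresholds are assumptions)."""
    kept = []
    for name, values in descriptors.items():
        # 1) drop descriptors that are missing for at least one structure
        if any(v is None for v in values):
            continue
        # 2) drop descriptors with a small variation relative to their mean
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        if std / max(abs(mean), 1e-12) < min_rel_std:
            continue
        # 3) drop descriptors that barely correlate with logKd
        if abs(pearson_r(values, log_kd)) < min_abs_r:
            continue
        kept.append(name)
    return kept


# Toy data (invented for illustration only, not taken from the paper)
log_kd = [-1.0, 0.5, 1.2, 2.0, 3.7]
descriptors = {
    "ClogP": [0.3, 0.9, 1.4, 2.1, 3.0],     # varies and correlates: kept
    "Dipole": [4.1, None, 3.8, 4.4, 4.0],   # missing value: dropped
    "Constant": [1.0, 1.0, 1.0, 1.0, 1.0],  # no variation: dropped
    "Noise": [1.0, -1.0, 0.0, -1.0, 1.0],   # |r| of about 0.03: dropped
}
print(preselect(descriptors, log_kd))  # ['ClogP']
```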

Stepwise MLR is a commonly used regression method that evaluates only a small number of descriptor subsets by adding or deleting one variable at a time according to a given criterion. The number of variables retained in the model is determined by the significance levels assumed for inclusion and exclusion of variables, namely 0.05 and 0.1, respectively [2, 19].

The GA-MLR used here was the same as that reported previously [9, 20, 21]. It applies a binary coding scheme: the presence or absence of a descriptor in a chromosome is encoded as 1 or 0, respectively [20, 21]. The genetic algorithm starts with a randomly generated population of 100 factor subsets, each represented by a chromosome. Each chromosome represents a possible solution of the optimization problem and is characterized by a fitness function (η). The fitness function was introduced by Depczynski et al. [20]. The GA-MLR algorithm was implemented in a program written in MATLAB 7. The root mean square errors of calibration (RMSEC) and prediction (RMSEP), along with the fitness function, were calculated according to equation (1) [20].
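The binary encoding and the quantity in equation (1) can be illustrated with a short sketch. It is written in Python rather than MATLAB and is not the authors' program; the function names, the descriptor count of 20, and the example values (RMSEC = 0.3, RMSEP = 0.4, 37 calibration and 12 prediction compounds, 4 selected descriptors) are assumptions chosen only to make the arithmetic concrete.

```python
import math
import random


def random_population(n_descriptors, pop_size=100, seed=0):
    """Create pop_size random binary chromosomes (1 = descriptor present, 0 = absent)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    return [[rng.randint(0, 1) for _ in range(n_descriptors)]
            for _ in range(pop_size)]


def decode(chromosome, names):
    """Translate a binary chromosome into the list of selected descriptor names."""
    return [name for bit, name in zip(chromosome, names) if bit == 1]


def fitness_eq1(rmsec, rmsep, m_c, m_p, n):
    """Evaluate equation (1) as printed: a pooled error over calibration and prediction.

    m_c, m_p: numbers of calibration and prediction compounds;
    n: number of descriptors selected in the chromosome.
    """
    if m_c - n - 1 <= 0:
        raise ValueError("too many descriptors for the calibration set")
    numerator = (m_c - n - 1) * rmsec ** 2 + m_p * rmsep ** 2
    return math.sqrt(numerator / (m_c + m_p - n - 1))


names = ["d%d" % i for i in range(1, 21)]  # 20 hypothetical descriptor names
population = random_population(len(names))
print(len(population), decode([1, 0, 1] + [0] * 17, names))  # 100 ['d1', 'd3']

# (32 * 0.09 + 12 * 0.16) / 44 = 4.8 / 44, whose square root is about 0.3303
print(round(fitness_eq1(0.3, 0.4, m_c=37, m_p=12, n=4), 4))  # 0.3303
```

In a full GA-MLR run, each chromosome would first be decoded into a descriptor subset, an MLR model would be fitted on the calibration compounds, and the resulting RMSEC and RMSEP would be passed to the expression above.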

$$\eta = \left\{ \frac{(m_c - n - 1)\,\mathrm{RMSEC}^2 + m_p\,\mathrm{RMSEP}^2}{m_c + m_p - n - 1} \right\}^{1/2} \qquad (1)$$

where mc and mp are the numbers of compounds in the calibration and prediction data sets, respectively. Chromosomes with the smallest number of selected descriptors (n) and the highest fitness functions were considered informative (parental chromosomes) and were used to produce the daughter chromosomes. The operators used in this study to produce a new generation were selection (20%), crossover (70%), and mutation (10%). In the selection

Table 1. Chemical structures of 62 derivatives

General structure: H2NO2S-[benzene ring]-R

No.   R                No.   R                No.   R
1     -H               2     -Cl              3     -NO2
4     -NH2             5     -CH2CH3          6     -(CH2)2CH3
7     -(CH2)3CH3       8     -(CH2)4CH3
9     -NH2             10    -NHCH2CH3        11    -NH(CH2)2CH3
12    -NH(CH2)4CH3     13    -NH(CH2)5CH3     14    -NH(CH2)6CH3
15    -NH(CH2)7CH3

[Entries 16-32 are structural drawings that are not recoverable from the extracted text.]
