2024-05-01T21:30:33Z http://open-archive.highwire.org/handler

oai:open-archive.highwire.org:bioinfo:23/11/1309 2015-07-29 HighWire OUP bioinfo:23:11

Simulating psoriasis by altering transit amplifying cells Grabe, Niels Neuber, Karsten SYSTEMS BIOLOGY Computational models of tissue homeostasis will facilitate a deeper understanding of many diseases. They link molecular networks, cellular differentiation and the spatial and temporal organization of tissues. Here we show an approach which is able to computationally turn a healthy <it>in silico</it> epidermis into one with four central properties of psoriatic epidermis. We achieve this by altering a single simulation parameter in the cellular differentiation program of the simulated epidermal keratinocytes: the fractional time period during which transit amplifying cells proliferate (τ). Prolonging τ results in the four main pathological characteristics of psoriatic skin: (1) an absolute increase of the germinative compartment, (2) an absolute increase of the differentiated compartment, (3) a higher proportion of germinative cells and (4) a marked reduction in turnover time. The prolongation of <it>τ</it> is able to increase the proliferation capacity of the epidermal tissue without altering the cell cycle frequency. Contact: <inter-ref locator="niels.grabe@med.uni-heidelberg.de" locator-type="email">niels.grabe@med.uni-heidelberg.de</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1309 http://dx.doi.org/10.1093/bioinformatics/btm042 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1313 2015-07-29 HighWire OUP bioinfo:23:11

A sequential Monte Carlo EM approach to the transcription factor binding site identification problem Jackson, Edmund S. Fitzgerald, William J. SEQUENCE ANALYSIS Motivation: A significant and stubbornly intractable problem in genome sequence analysis has been the <it>de novo</it> identification of transcription factor binding sites in promoter regions. Although theoretically pleasing, probabilistic methods have faced difficulties due to model mismatch and the nature of the biological sequence. These problems result in inference in a high dimensional, highly multimodal space, and consequently often display only local convergence and hence unsatisfactory performance. Algorithm: In this article, we derive and demonstrate a novel method utilizing a sequential Monte Carlo-based expectation-maximization (EM) optimization to improve performance in this scenario. The Monte Carlo element should increase the robustness of the algorithm compared to classical EM. Furthermore, the parallel nature of the sequential Monte Carlo algorithm should be more robust than Gibbs sampling approaches to multimodality problems. Results: We demonstrate the superior performance of this algorithm on both semi-synthetic and real data from <it>Escherichia coli</it>. Availability: http://sigproc-eng.cam.ac.uk/~ej230/smc_em_tfbsid.tar.gz Contact: <inter-ref locator="ej230@cam.ac.uk" locator-type="email">ej230@cam.ac.uk</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1313 http://dx.doi.org/10.1093/bioinformatics/btm054 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1321 2015-07-29 HighWire OUP bioinfo:23:11

De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures Ng, Kwang Loong Stanley Mishra, Santosh K. STRUCTURAL BIOINFORMATICS Motivation: MicroRNAs (miRNAs) are small ncRNAs participating in diverse cellular and physiological processes through the post-transcriptional gene regulatory pathway. Critically associated with miRNA biogenesis, the hairpin structure is a necessary feature for the computational classification of novel precursor miRNAs (<it>pre</it>-<it>miRs</it>). Though many of the abundant genomic inverted repeats (pseudo hairpins) can be filtered computationally, novel species-specific <it>pre</it>-<it>miRs</it> are likely to remain elusive. Results: <it>miPred</it> is a <it>de novo</it> Support Vector Machine (SVM) classifier for identifying <it>pre</it>-<it>miRs</it> without relying on phylogenetic conservation. To achieve significantly higher sensitivity and specificity than existing (quasi) <it>de novo</it> predictors, it employs a Gaussian Radial Basis Function kernel (RBF) as a similarity measure for 29 global and intrinsic hairpin folding attributes. They characterize a <it>pre</it>-<it>miR</it> at the dinucleotide sequence, hairpin folding, non-linear statistical thermodynamics and topological levels. Trained on 200 human <it>pre-miRs</it> and 400 pseudo hairpins, <it>miPred</it> achieves 93.50% (5-fold cross-validation accuracy) and 0.9833 (ROC score). Tested on the remaining 123 human <it>pre-miRs</it> and 246 pseudo hairpins, it reports 84.55% (sensitivity), 97.97% (specificity) and 93.50% (accuracy). Validated on 1918 <it>pre-miRs</it> across 40 non-human species and 3836 pseudo hairpins, it yields 87.65% (92.08%), 97.75% (97.42%) and 94.38% (95.64%) for the mean (overall) sensitivity, specificity and accuracy. 
Notably, <it>A.mellifera</it>, <it>A.geoffroyi</it>, <it>C.familiaris</it>, <it>E.Barr</it>, <it>H.Simplex virus</it>, <it>H.cytomegalovirus</it>, <it>O.aries</it>, <it>P.patens</it>, <it>R.lymphocryptovirus</it>, <it>Simian virus</it> and <it>Z.mays</it> are unambiguously classified with 100.00% (sensitivity) and >93.75% (specificity). Availability: Data sets, raw statistical results and source codes are available at <inter-ref locator="http://web.bii.a-star.edu.sg/~stanley/Publications" locator-type="url">http://web.bii.a-star.edu.sg/~stanley/Publications</inter-ref> Contact: <inter-ref locator="stanley@bii.a-star.edu.sg" locator-type="email">stanley@bii.a-star.edu.sg</inter-ref>; <inter-ref locator="santosh@bii.a-star.edu.sg" locator-type="email">santosh@bii.a-star.edu.sg</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1321 http://dx.doi.org/10.1093/bioinformatics/btm026 en Copyright (C) 2007, Oxford University Press
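The test-set figures in the miPred abstract can be checked with a little arithmetic. The sketch below is illustrative only: the counts of true positives (104) and true negatives (241) are not stated in the abstract; they are the only integers consistent with the reported percentages and the stated set sizes (123 pre-miRs, 246 pseudo hairpins). The `rbf_kernel` helper shows the standard Gaussian RBF form; its `gamma` value is a hypothetical parameter, not one taken from the paper.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def rbf_kernel(x, y, gamma):
    """Gaussian RBF similarity exp(-gamma * ||x - y||^2) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Inferred counts (an assumption): 104 of 123 pre-miRs and 241 of 246 pseudo
# hairpins classified correctly on the human test set.
sens, spec, acc = confusion_metrics(tp=104, fn=19, tn=241, fp=5)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

Because the negative class is twice the size of the positive class, the overall accuracy sits closer to the specificity than to the sensitivity.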

oai:open-archive.highwire.org:bioinfo:23/11/1331 2015-07-29 HighWire OUP bioinfo:23:11

Searching for three-dimensional secondary structural patterns in proteins with ProSMoS Shi, Shuoyong Zhong, Yi Majumdar, Indraneel Sri Krishna, S. Grishin, Nick V. STRUCTURAL BIOINFORMATICS Motivation: Many evolutionarily distant, but functionally meaningful links between proteins come to light through comparison of spatial structures. Most programs that assess structural similarity compare two proteins to each other and find regions in common between them. Structural classification experts look for a particular structural motif instead. Programs base similarity scores on superposition or closeness of either Cartesian coordinates or inter-residue contacts. Experts pay more attention to the general orientation of the main chain and mutual spatial arrangement of secondary structural elements. There is a need for a computational tool to find proteins with the same secondary structures, topological connections and spatial architecture, regardless of subtle differences in 3D coordinates. Results: We developed ProSMoS, a Protein Structure Motif Search program that emulates an expert. Starting from a spatial structure, the program uses previously delineated secondary structural elements. A meta-matrix of interactions between the elements (parallel or antiparallel), taking into account the handedness of connections (left or right) and other features (e.g. element lengths and hydrogen bonds), is constructed prior to or during the searches. All structures are reduced to such meta-matrices that contain just enough information to define a protein fold, but this definition remains very general and deviations in 3D coordinates are tolerated. The user supplies a meta-matrix for a structural motif of interest, and ProSMoS finds all proteins in the protein data bank (PDB) that match the meta-matrix. ProSMoS performance is compared to other programs and is illustrated on a β-Grasp motif. A brief analysis of all β-Grasp-containing proteins is presented. 
Program availability: ProSMoS is freely available for non-commercial use from <inter-ref locator="ftp://iole.swmed.edu/pub/ProSMoS" locator-type="url">ftp://iole.swmed.edu/pub/ProSMoS</inter-ref>. Contact: <inter-ref locator="grishin@chop.swmed.edu" locator-type="email">grishin@chop.swmed.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1331 http://dx.doi.org/10.1093/bioinformatics/btm121 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1339 2015-07-29 HighWire OUP bioinfo:23:11

Modeling nonlinearity in dilution design microarray data Zheng, Xiuwen Huang, Hung-Chung Li, Wenyuan Liu, Peng Li, Quan-Zhen Liu, Ying GENE EXPRESSION Motivation: Dilution design (Mixed tissue RNA) has been utilized by some researchers to evaluate and assess the performance of multiple microarray platforms. Current microarray data analysis approaches assume that the quantified signal intensities are linearly related to the expression of the corresponding genes in the sample. However, there are sources of nonlinearity in microarray expression measurements. Studying such nonlinearity in the expression of RNA mixtures provides a new way to analyze gene expression data, and we argue that the nonlinearity can reveal novel information for microarray data analysis. Therefore, we proposed a statistical model, called the proportion model, which is based on linear regression analysis. To approximately quantify the nonlinearity in the dilution design, a new calibration, the beta ratio (BR), was derived from the proportion model. Furthermore, a new adjusted fold change (adj-FC) was proposed to predict the true FC without nonlinearity, in particular for large FC. Results: We applied our method to one microarray dilution dataset. The experimental results indicated that, to some extent, there are global biases compared with the linear assumption for the significant genes. Further analysis of those highly expressed genes with significant nonlinearity revealed some promising results, e.g. a ‘poison’ effect was discovered for some genes in RNA mixtures. The adj-FCs of those genes with the ‘poison’ effect indicate that the nonlinearity can also be caused by the inherent features of the genes, besides signal noise and technical variation. Moreover, when percentage of overlapping genes (POG) was used as a cross-platform consistency measure, adj-FC outperformed simple fold change in showing that Affymetrix and Illumina platforms are consistent. 
Availability: The R code that implements all described methods, and some Supplementary material, is freely available from <inter-ref locator="http://www.utdallas.edu/~ying.liu/BetaRatio.htm" locator-type="url">http://www.utdallas.edu/~ying.liu/BetaRatio.htm</inter-ref> Contact: <inter-ref locator="ying.liu@utdallas.edu" locator-type="email">ying.liu@utdallas.edu</inter-ref> <it>Supplementary information</it>: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1339 http://dx.doi.org/10.1093/bioinformatics/btm002 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1348 2015-07-29 HighWire OUP bioinfo:23:11

Quantitating tissue specificity of human genes to facilitate biomarker discovery Vasmatzis, George Klee, Eric W. Kube, Dagmar M. Therneau, Terry M. Kosari, Farhad GENE EXPRESSION We describe a method to identify candidate cancer biomarkers by analyzing numeric approximations of tissue specificity of human genes. These approximations were calculated by analyzing predicted tissue expression distributions of genes derived from mapping expressed sequence tags (ESTs) to the human genome sequence using a binary indexing algorithm. Tissue-specificity values facilitated high-throughput analysis of the human genes and enabled the identification of genes highly specific to different tissues. Tissue expression distributions for several genes were compared to estimates obtained from other public gene expression datasets and experimentally validated using quantitative RT-PCR on RNA isolated from several human tissues. Our results demonstrate that most human genes (∼98%) are expressed in many tissues (low specificity), and only a small number of genes possess very specific tissue expression profiles. These genes comprise a rich dataset from which novel therapeutic targets and novel diagnostic serum biomarkers may be selected. Contact: <inter-ref locator="vasm@mayo.edu" locator-type="email">vasm@mayo.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1348 http://dx.doi.org/10.1093/bioinformatics/btm102 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1356 2015-07-29 HighWire OUP bioinfo:23:11

Extending the pathway analysis framework with a test for transcriptional variance implicates novel pathway modulation during myogenic differentiation Kemp, Daniel M. Nirmala, N. R. Szustakowski, Joseph D. GENE EXPRESSION Motivation: We describe an extension of the pathway-based enrichment approach for analyzing microarray data via a robust test for transcriptional variance. The use of a variance test is intended to identify additional patterns of transcriptional regulation in which many genes in a pathway are up- and down-regulated. Such patterns may be indicative of the reciprocal regulation of pathway activators and inhibitors or of the differential regulation of separate biological sub-processes and should extend the number of detectable patterns of transcriptional modulation. Results: We validated this new statistical approach on a microarray experiment that captures the temporal transcriptional profile of muscle differentiation in mouse C2C12 cells. Comparisons of the transcriptional state of myoblasts and differentiated myotubes via a robust variance test implicated several novel pathways in muscle cell differentiation previously overlooked by a standard enrichment analysis. Specifically, pathways involved in cell structure, calcium-mediated signaling and muscle-specific signaling were identified as differentially modulated based on their increased transcriptional variance. These biologically relevant results validate this approach and demonstrate the flexible nature of pathway-based methods of data analysis. Availability: The software is available as Supplementary Material. Contact: <inter-ref locator="joseph.szustakowski@novartis.com" locator-type="email">joseph.szustakowski@novartis.com</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. 
Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1356 http://dx.doi.org/10.1093/bioinformatics/btm116 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1363 2015-07-29 HighWire OUP bioinfo:23:11

Classification based upon gene expression data: bias and precision of error rates Wood, Ian A. Visscher, Peter M. Mengersen, Kerrie L. GENE EXPRESSION Motivation: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean. Results: Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3–5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors. 
Availability: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from <inter-ref locator="http://www.maths.qut.edu.au/profiles/wood/permr.jsp" locator-type="url">http://www.maths.qut.edu.au/profiles/wood/permr.jsp</inter-ref> Contact: i.wood@qut.edu.au Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1363 http://dx.doi.org/10.1093/bioinformatics/btm117 en Copyright (C) 2007, Oxford University Press
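The two-level external cross-validation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' R code: a k-nearest-neighbour classifier stands in for the PAMR classifier named in the abstract, and the function names, fold counts and candidate values are all hypothetical. The point it shows is that the tuning parameter is chosen by an inner cross-validation that sees only the outer training portion, so the outer test fold never influences the choice.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle indices and deal them into k roughly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(train, k, x):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(
        train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x))
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(sorted(votes), key=lambda lab: votes[lab])


def cv_error(data, idx, k_neighbors, n_folds, rng):
    """Single-level cross-validated error rate restricted to the samples in idx."""
    errors = 0
    for fold in k_folds(idx, n_folds, rng):
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        errors += sum(
            knn_predict(train, k_neighbors, data[i][0]) != data[i][1] for i in fold
        )
    return errors / len(idx)


def two_level_cv(data, candidates, outer_k=5, inner_k=4, seed=0):
    """Outer loop estimates error; inner loop selects the tuning parameter."""
    rng = random.Random(seed)
    all_idx = list(range(len(data)))
    errors = 0
    for test_fold in k_folds(all_idx, outer_k, rng):
        held_out = set(test_fold)
        train_idx = [i for i in all_idx if i not in held_out]
        # Parameter selection uses the outer training portion only.
        best = min(
            candidates, key=lambda c: cv_error(data, train_idx, c, inner_k, rng)
        )
        train = [data[i] for i in train_idx]
        errors += sum(
            knn_predict(train, best, data[i][0]) != data[i][1] for i in test_fold
        )
    return errors / len(data)


# Toy data: two well-separated clusters of (features, label) pairs.
toy = [((0.1 * i, 0.1 * i), 0) for i in range(10)]
toy += [((10 + 0.1 * i, 10 + 0.1 * i), 1) for i in range(10)]
print(two_level_cv(toy, candidates=[1, 3, 5]))
```

Under the same assumptions, the permutation check mentioned in the abstract could be approximated by shuffling the labels of `toy` before calling `two_level_cv`; an unbiased procedure should then report an error rate near the baseline rather than an optimistic one.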

oai:open-archive.highwire.org:bioinfo:23/11/1371 2015-07-29 HighWire OUP bioinfo:23:11

Simulating Epstein-Barr virus infection with C-ImmSim Castiglione, Filippo Duca, Karen Jarrah, Abdul Laubenbacher, Reinhard Hochberg, Donna Thorley-Lawson, David SYSTEMS BIOLOGY Motivation: Epstein-Barr virus (EBV) infects greater than 90% of humans benignly for life but can be associated with tumors. It is a uniquely human pathogen that is amenable to quantitative analysis; however, there is no applicable animal model. Computer models may provide a virtual environment to perform experiments not possible in human volunteers. Results: We report the application of a relatively simple stochastic cellular automaton (C-ImmSim) to the modeling of EBV infection. Infected B-cell dynamics in the acute and chronic phases of infection correspond well to clinical data including the establishment of a long term persistent infection (up to 10 years) that is absolutely dependent on access of latently infected B cells to the peripheral pool where they are not subject to immunosurveillance. In the absence of this compartment the infection is cleared. Availability: The latest version 6 of C-ImmSim is available under the GNU General Public License and is downloadable from <inter-ref locator="www.iac.cnr.it/~filippo/cimmsim.html" locator-type="url">www.iac.cnr.it/~filippo/cimmsim.html</inter-ref> Contact: <inter-ref locator="david.thorley-lawson@tufts.edu" locator-type="email">david.thorley-lawson@tufts.edu</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1371 http://dx.doi.org/10.1093/bioinformatics/btm044 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1378 2015-07-29 HighWire OUP bioinfo:23:11

From structure to dynamics of metabolic pathways: application to the plant mitochondrial TCA cycle Steuer, Ralf Nesi, Adriano Nunes Fernie, Alisdair R. Gross, Thilo Blasius, Bernd Selbig, Joachim SYSTEMS BIOLOGY Motivation: Mitochondrial metabolism, dominated by the reactions of the tricarboxylic acid (TCA) cycle, is of vital importance for a wide range of metabolic processes. In particular for autotrophic tissue, such as plant leaves, the TCA cycle marks the point of divergence of anabolic pathways and plays an essential role in biosynthesis. However, despite extensive knowledge about its stoichiometric properties, the function and the dynamical capabilities of the TCA cycle remain largely unknown. Methods and Results: Based on a recently proposed formalism, we investigate the dynamic and functional properties of the mitochondrial TCA cycle of plants. Starting with the structural properties, as described by the elementary flux modes of the system, we aim for the transition from structure to the dynamics of the TCA cycle. Using a parametric description of the system, encompassing all possible differential equations and parameter values, we detect and quantify regimes of different dynamic behavior. Optimizing the system with respect to dynamic stability, we demonstrate that maximal stability is associated with specific (relative) metabolite concentrations and flux values that are subsequently compared to the experimental literature. Our analysis also serves as a general example of how to elucidate the transition from the structure to the dynamics of metabolic pathways. Contact: <inter-ref locator="steuer@agnld.uni-potsdam.de" locator-type="email">steuer@agnld.uni-potsdam.de</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. 
Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1378 http://dx.doi.org/10.1093/bioinformatics/btm065 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1386 2015-07-29 HighWire OUP bioinfo:23:11

Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases Alekseyenko, Alexander V. Lee, Christopher J. DATA AND TEXT MINING Motivation: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size and construction/updating of the interval database. Results: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is <it>O</it>(<it>n</it> + log <it>N</it>), where <it>N</it> is the database size and <it>n</it> is the size of the result set. In all cases tested, this query algorithm is 5–500-fold faster than other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are ∼100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications. Availability: NCList data structure is part of Pygr, a bioinformatics graph database library, available at <inter-ref locator="http://sourceforge.net/projects/pygr" locator-type="url">http://sourceforge.net/projects/pygr</inter-ref> Contact: <inter-ref locator="leec@chem.ucla.edu" locator-type="email">leec@chem.ucla.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. 
Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1386 http://dx.doi.org/10.1093/bioinformatics/btl647 en Copyright (C) 2007, Oxford University Press
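The abstract states the NCList query bound but not the construction, so the following is only a sketch of how a nested containment list might look in Python; it is not Pygr's implementation. It assumes half-open intervals `(start, end)`, and the names `build_nclist` and `query_nclist` are hypothetical. The idea is that intervals fully contained in another interval are pushed into that interval's sublist, so within every list both starts and ends are in ascending order and the first overlapping entry can be found by binary search.

```python
def build_nclist(intervals):
    """Nest each interval inside the closest preceding interval that contains it."""
    # Sort by start ascending, end descending, so containers come before contents.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top, stack = [], []
    for start, end in ordered:
        node = (start, end, [])
        # Pop ancestors that end too early to contain this interval.
        while stack and end > stack[-1][1]:
            stack.pop()
        (stack[-1][2] if stack else top).append(node)
        stack.append(node)
    return top


def _first_end_after(nodes, query_start):
    """Binary search for the first node whose end exceeds query_start."""
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] > query_start:
            hi = mid
        else:
            lo = mid + 1
    return lo


def query_nclist(nodes, query_start, query_end, out=None):
    """Collect all stored intervals overlapping [query_start, query_end)."""
    if out is None:
        out = []
    i = _first_end_after(nodes, query_start)
    while i < len(nodes) and nodes[i][0] < query_end:
        start, end, children = nodes[i]
        out.append((start, end))
        # Only children of an overlapping interval can overlap the query.
        query_nclist(children, query_start, query_end, out)
        i += 1
    return out


stored = [(0, 10), (2, 5), (3, 4), (6, 9), (12, 20), (15, 18)]
index = build_nclist(stored)
print(sorted(query_nclist(index, 4, 7)))  # [(0, 10), (2, 5), (6, 9)]
```

Each list visited costs one binary search plus one step per reported hit, which is the intuition behind a query time that grows with the result size and only logarithmically with the database size.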

oai:open-archive.highwire.org:bioinfo:23/11/1394 2015-07-29 HighWire OUP bioinfo:23:11

Data reduction of isotope-resolved LC-MS spectra Du, Peicheng Sudha, Rajagopalan Prystowsky, Michael B. Angeletti, Ruth Hogue DATA AND TEXT MINING Motivation: Data reduction of liquid chromatography-mass spectrometry (LC-MS) spectra can be a challenge due to the inherent complexity of biological samples, noise and non-flat baseline. We present a new algorithm, LCMS-2D, for reliable data reduction of LC-MS proteomics data. Results: LCMS-2D can reliably reduce LC-MS spectra with multiple scans to a list of elution peaks, and subsequently to a list of peptide masses. It is capable of noise removal, and deconvoluting peaks that overlap in <it>m</it>/<it>z</it>, in retention time, or both, by using a novel iterative peak-picking step, a ‘rescue’ step, and a modified variable selection method. LCMS-2D performs well with three sets of annotated LC-MS spectra, yielding results that are better than those from PepList, msInspect and the vendor software BioAnalyst. Availability: The software LCMS-2D is available under the GNU general public license from <inter-ref locator="http://www.bioc.aecom.yu.edu/labs/angellab/" locator-type="url">http://www.bioc.aecom.yu.edu/labs/angellab/</inter-ref> as a standalone C program running on LINUX. Contact: <inter-ref locator="pdu@us.ibm.com" locator-type="email">pdu@us.ibm.com</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1394 http://dx.doi.org/10.1093/bioinformatics/btm083 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1401 2015-07-29 HighWire OUP bioinfo:23:11

Regression analysis and modelling of data acquisition for SELDI-TOF mass spectrometry Sköld, Martin Rydén, Tobias Samuelsson, Viktoria Bratt, Charlotte Ekblad, Lars Olsson, Håkan Baldetorp, Bo DATA AND TEXT MINING Motivation: Pre-processing of SELDI-TOF mass spectrometry data is currently performed on a largely <it>ad hoc</it> basis. This makes comparison of results from independent analyses troublesome and does not provide a framework for distinguishing different sources of variation in data. Results: In this article, we consider the task of pooling a large number of single-shot spectra, a task commonly performed automatically by the instrument software. By viewing the underlying statistical problem as one of heteroscedastic linear regression, we provide a framework for introducing robust methods and for dealing with missing data resulting from a limited span of recordable intensity values provided by the instrument. Our framework provides an interpretation of currently used methods as a maximum-likelihood estimator and allows theoretical derivation of its variance. We observe that this variance depends crucially on the total number of ionic species, which can vary considerably between different pooled spectra. This variation in variance can potentially invalidate the results from naive methods of discrimination/classification and we outline appropriate data transformations. Introducing methods from robust statistics did not improve the standard errors of the pooled samples. Imputing missing values using the EM algorithm, however, had a notable effect on the result; for our data, the pooled height of peaks which were frequently truncated increased by up to 30%. Contact: <inter-ref locator="martins@maths.lth.se" locator-type="email">martins@maths.lth.se</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. 
Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1401 http://dx.doi.org/10.1093/bioinformatics/btm104 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1410 2015-07-29 HighWire OUP bioinfo:23:11

SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data Shatkay, Hagit Höglund, Annette Brady, Scott Blum, Torsten Dönnes, Pierre Kohlbacher, Oliver DATA AND TEXT MINING Motivation: Knowing the localization of a protein within the cell helps elucidate its role in biological processes, its function and its potential as a drug target. Thus, subcellular localization prediction is an active research area. Numerous localization prediction systems are described in the literature; some focus on specific localizations or organisms, while others attempt to cover a wide range of localizations. Results: We introduce SherLoc, a new comprehensive system for predicting the localization of eukaryotic proteins. It integrates several types of sequence and text-based features. While applying the widely used support vector machines (SVMs), SherLoc’s main novelty lies in the way in which it selects its text sources and features, and integrates those with sequence-based features. We test SherLoc on previously used datasets, as well as on a new set devised specifically to test its predictive power, and show that SherLoc consistently improves on previously reported results. We also report the results of applying SherLoc to a large set of yet-unlocalized proteins. Availability: SherLoc, along with Supplementary Information, is available at: <inter-ref locator="http://www-bs.informatik.uni-tuebingen.de/Services/SherLoc/" locator-type="url">http://www-bs.informatik.uni-tuebingen.de/Services/SherLoc/</inter-ref> Contact: <inter-ref locator="shatkay@cs.queensu.ca" locator-type="email">shatkay@cs.queensu.ca</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1410 http://dx.doi.org/10.1093/bioinformatics/btm115 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/1418 2015-07-29 HighWire OUP bioinfo:23:11

MedicCyc: a biochemical pathway database for Medicago truncatula Urbanczyk-Wochniak, Ewa Sumner, Lloyd W. DATABASES AND ONTOLOGIES Motivation: There is an imperative need to integrate functional genomics data to obtain a more comprehensive systems-biology view of the results. We believe that this is best achieved through the visualization of data within the biological context of metabolic pathways. Accordingly, metabolic pathway reconstruction was used to predict the metabolic composition for <it>Medicago truncatula</it> and these pathways were engineered to enable the correlated visualization of integrated functional genomics data. Results: Metabolic pathway reconstruction was used to generate a pathway database for <it>M. truncatula</it> (MedicCyc), which currently features more than 250 pathways with related genes, enzymes and metabolites. MedicCyc was assembled from more than 225 000 <it>M. truncatula</it> ESTs (MtGI Release 8.0) and available genomic sequences using the Pathway Tools software and the MetaCyc database. The predicted pathways in MedicCyc were verified through comparison with other plant databases such as AraCyc and RiceCyc. The comparison with other plant databases provided crucial information concerning enzymes still missing from the ongoing, but currently incomplete <it>M. truncatula</it> genome sequencing project. MedicCyc was further manually curated to remove non-plant pathways, and Medicago-specific pathways including isoflavonoid, lignin and triterpene saponin biosynthesis were modified or added based upon available literature and in-house expertise. Additional metabolites identified in metabolic profiling experiments were also used for pathway predictions. Once the metabolic reconstruction was completed, MedicCyc was engineered to visualize <it>M. truncatula</it> functional genomics datasets within the biological context of metabolic pathways. 
Availability: freely accessible at <inter-ref locator="http://www.noble.org/MedicCyc/" locator-type="url">http://www.noble.org/MedicCyc/</inter-ref> Contact: <inter-ref locator="lwsumner@noble.org" locator-type="email">lwsumner@noble.org</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1418 http://dx.doi.org/10.1093/bioinformatics/btm040 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/14242015-07-29HighWireOUPbioinfo:23:11

Unsupervised segmentation of continuous genomic data Day, Nathan Hemmaplardh, Andrew Thurman, Robert E. Stamatoyannopoulos, John A. Noble, William S. GENOME ANALYSIS Summary: The advent of high-density, high-volume genomic data has created the need for tools to summarize large datasets at multiple scales. HMMSeg is a command-line utility for the scale-specific segmentation of continuous genomic data using hidden Markov models (HMMs). Scale specificity is achieved by an optional wavelet-based smoothing operation. HMMSeg is capable of handling multiple datasets simultaneously, rendering it ideal for integrative analysis of expression, phylogenetic and functional genomic data. Availability: http://noble.gs.washington.edu/proj/hmmseg Contact: <inter-ref locator="rthurman@u.washington.edu" locator-type="email">rthurman@u.washington.edu</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1424 http://dx.doi.org/10.1093/bioinformatics/btm096 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/14272015-07-29HighWireOUPbioinfo:23:11

Ngila: global pairwise alignments with logarithmic and affine gap costs Cartwright, Reed A. SEQUENCE ANALYSIS Summary: Ngila is an application that will find the best alignment of a pair of sequences using log-affine gap costs, which are the most biologically realistic gap costs. Availability: Portable source code for Ngila can be downloaded from its development website, <inter-ref locator="http://scit.us/projects/ngila/" locator-type="url">http://scit.us/projects/ngila/</inter-ref>. It compiles on most operating systems. Contact: <inter-ref locator="racartwr@ncsu.edu" locator-type="email">racartwr@ncsu.edu</inter-ref> or <inter-ref locator="reed@scit.us" locator-type="email">reed@scit.us</inter-ref> Supplementary information: Appendices Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1427 http://dx.doi.org/10.1093/bioinformatics/btm095 en Copyright (C) 2007, Oxford University Press
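
Illustrative note (not part of the harvested record): a log-affine gap cost can be sketched as below, assuming the common form g(k) = a + b*k + c*ln(k) for a gap of length k. The function name and the parameter values are hypothetical and are not taken from Ngila or its defaults.

```python
import math


def log_affine_gap_cost(k, a=2.0, b=0.25, c=0.5):
    """Cost of a gap of length k >= 1 under an assumed log-affine model.

    a: gap-opening term, b: linear extension term, c: logarithmic term.
    All three defaults are illustrative placeholders.
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * k + c * math.log(k)
```

Setting c to 0 reduces this to an affine cost, and setting b to 0 gives a purely logarithmic one, which is why the combined form is called log-affine.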

oai:open-archive.highwire.org:bioinfo:23/11/14292015-07-29HighWireOUPbioinfo:23:11

PROTMAP2D: visualization, comparison and analysis of 2D maps of protein structure Pietal, Michal J. Tuszynska, Irina Bujnicki, Janusz M. STRUCTURAL BIOINFORMATICS Motivation: Protein structure comparison is a fundamental problem in structural biology and bioinformatics. Two-dimensional maps of distances between residues in the structure contain sufficient information to restore the 3D representation, while maps of contacts reveal characteristic patterns of interactions between secondary and super-secondary structures and are very attractive for visual analysis. The overlap of 2D maps of two structures can be easily calculated, providing a sensitive measure of protein structure similarity. PROTMAP2D is a software tool for calculation of contact and distance maps based on user-defined criteria, quantitative comparison of pairs or series of contact maps (e.g. alternative models of the same protein, model versus native structure, different trajectories from molecular dynamics simulations, etc.) and visualization of the results. Availability: PROTMAP2D for Windows / Linux / MacOSX is freely available for academic users from <inter-ref locator="http://genesilico.pl/protmap2d.htm" locator-type="url">http://genesilico.pl/protmap2d.htm</inter-ref> Contact: <inter-ref locator="iamb@genesilico.pl" locator-type="email">iamb@genesilico.pl</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1429 http://dx.doi.org/10.1093/bioinformatics/btm124 en Copyright (C) 2007, Oxford University Press
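
Illustrative note (not part of the harvested record): the idea of computing a contact map and the overlap of two maps can be sketched as follows. The 8.0 distance cutoff, the function names and the Jaccard-style overlap measure are assumptions chosen for illustration. PROTMAP2D lets users define their own criteria, and its actual scoring may differ.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map: 1 where two distinct residues lie within cutoff.

    coords is a list of (x, y, z) tuples, one per residue.
    The cutoff value is an assumed, user-adjustable threshold.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def map_overlap(map_a, map_b):
    """Shared contacts divided by the union of contacts (pairs i < j only).

    This Jaccard-style measure is one possible overlap definition.
    """
    shared = union = 0
    for i in range(len(map_a)):
        for j in range(i + 1, len(map_a)):
            a, b = map_a[i][j], map_b[i][j]
            shared += 1 if a and b else 0
            union += 1 if a or b else 0
    # Two maps with no contacts at all are treated as identical.
    return shared / union if union else 1.0
```

Both maps must describe the same number of residues, as when comparing a model against the native structure of the same protein.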

oai:open-archive.highwire.org:bioinfo:23/11/14312015-07-29HighWireOUPbioinfo:23:11

IlluminaGUI: Graphical User Interface for analyzing gene expression data generated on the Illumina platform Schultze, Joachim L. Eggle, Daniela GENE EXPRESSION Summary: IlluminaGUI is a graphical user interface implemented for analyzing microarray data from the Illumina BeadChip platform. All key components of a microarray experiment, including quality control, normalization, inference and classification methods are provided in a ‘point and click’ approach. IlluminaGUI is implemented as an R package based on the R-Tcl/Tk interface and is available for platforms on which R runs including Windows, Mac and Unix-type machines. Availability: <inter-ref locator="http://IlluminaGUI.dnsalias.org" locator-type="url">http://IlluminaGUI.dnsalias.org</inter-ref> Contact: <inter-ref locator="joachim.schultze@uk-koeln.de" locator-type="email">joachim.schultze@uk-koeln.de</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1431 http://dx.doi.org/10.1093/bioinformatics/btm101 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/14342015-07-29HighWireOUPbioinfo:23:11

uBioRSS: Tracking taxonomic literature using RSS Leary, Patrick R. Remsen, David P. Norton, Catherine N. Patterson, David J. Sarkar, Indra Neil DATA AND TEXT MINING Summary: Web content syndication through standard formats such as RSS and ATOM has become an increasingly popular mechanism for publishers, news sources and blogs to disseminate regularly updated content. These standardized syndication formats deliver content directly to the subscriber, allowing them to locally aggregate content from a variety of sources instead of having to find the information on multiple websites. The uBioRSS application is a ‘taxonomically intelligent’ service customized for the biological sciences. It aggregates syndicated content from academic publishers and science news feeds, and then uses a taxonomic Named Entity Recognition algorithm to identify and index taxonomic names within those data streams. The resulting name index is cross-referenced to current global taxonomic datasets to provide context for browsing the publications by taxonomic group. This process, called taxonomic indexing, draws upon services developed specifically for biological sciences, collectively referred to as ‘taxonomic intelligence’. Such value-added enhancements can provide biologists with accelerated and improved access to current biological content. Availability: <inter-ref locator="http://names.ubio.org/rss/" locator-type="url">http://names.ubio.org/rss/</inter-ref> Contact: <inter-ref locator="sarkar@mbl.edu" locator-type="email">sarkar@mbl.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1434 http://dx.doi.org/10.1093/bioinformatics/btm109 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/11/14372015-07-29HighWireOUPbioinfo:23:11

BioDownloader: bioinformatics downloads and updates in a few clicks Shapovalov, Maxim V. Canutescu, Adrian A. Dunbrack, Roland L. DATABASES AND ONTOLOGIES Summary: There are many ftp or http servers storing data required for biological research. While some download applications are available, there is no user-friendly download application with a graphical interface specifically designed and adapted to meet the requirements of bioinformatics. BioDownloader is a program for downloading and updating files from ftp and http servers. It is optimized to work robustly with large numbers of files. It allows the selective retrieval of only the required files (batch downloads, multiple file masks, ls-lR file parsing, recursive search, recent updates, etc.). BioDownloader has a built-in repository containing the settings for common bioinformatics file-synchronization needs, including the Protein Data Bank (PDB) and National Center for Biotechnology Information (NCBI) databases. It can post-process downloaded files, including archive extraction and file conversions. Availability: The program can be installed from <inter-ref locator="http://dunbrack.fccc.edu/BioDownloader" locator-type="url">http://dunbrack.fccc.edu/BioDownloader</inter-ref>. The software is freely available for both non-commercial and commercial users under the BSD license. Contact: <inter-ref locator="Roland.Dunbrack@fccc.edu" locator-type="email">Roland.Dunbrack@fccc.edu</inter-ref> Oxford University Press 2007-06-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/11/1437 http://dx.doi.org/10.1093/bioinformatics/btm120 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/11812015-07-29HighWireOUPbioinfo:23:10

IMEx: Imperfect Microsatellite Extractor Mudunuri, Suresh B. Nagarajaram, Hampapathalu A. GENOME ANALYSIS Motivation: Microsatellites, also known as simple sequence repeats, are tandem repeats of nucleotide motifs of size 1–6 bp found in every genome known so far. Their importance in genomes is well known. Microsatellites are associated with various disease genes, have been used as molecular markers in linkage analysis and DNA fingerprinting studies, and also seem to play an important role in genome evolution. Therefore, it is important to study the distribution, enrichment and polymorphism of microsatellites in the genomes of interest. For this, the prerequisite is the availability of a computational tool for extraction of microsatellites (perfect as well as imperfect) and their related information from whole genome sequences. Examination of available tools revealed certain lacunae in them and prompted us to develop a new tool. Results: In order to efficiently screen genome sequences for microsatellites (perfect as well as imperfect), we developed a new tool called IMEx (Imperfect Microsatellite Extractor). IMEx uses a simple string-matching algorithm with a sliding-window approach to screen DNA sequences for microsatellites and reports the motif, copy number, genomic location, nearby genes, mutational events and many other features useful for in-depth studies. IMEx is more sensitive, efficient and useful than the available widely used tools. IMEx is available in the form of a stand-alone program as well as in the form of a web-server.
Availability: A World Wide Web server and the stand-alone program are available for free access at <inter-ref locator="http://203.197.254.154/IMEX/" locator-type="url">http://203.197.254.154/IMEX/</inter-ref> or <inter-ref locator="http://www.cdfd.org.in/imex" locator-type="url">http://www.cdfd.org.in/imex</inter-ref> Contact: <inter-ref locator="han@cdfd.org.in" locator-type="email">han@cdfd.org.in</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1181 http://dx.doi.org/10.1093/bioinformatics/btm097 en Copyright (C) 2007, Oxford University Press
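
Illustrative note (not part of the harvested record): a sliding-window, string-matching scan for microsatellites can be sketched as below. This minimal version finds perfect repeats only. The function name, the min_copies threshold and the skip-ahead rule are assumptions made for illustration, and IMEx's handling of imperfect repeats is not reproduced here.

```python
def find_perfect_repeats(seq, min_copies=3, max_motif=6):
    """Return (start, motif, copies) for perfect tandem repeats in seq.

    The window slides along the sequence; at each position motif sizes
    1..max_motif are tried in order and the first size that repeats at
    least min_copies times is reported. Both thresholds are illustrative.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        found = None
        for m in range(1, max_motif + 1):
            motif = seq[i:i + m]
            if len(motif) < m:
                break  # not enough sequence left for this motif size
            copies = 1
            # Count how many times the motif is repeated back to back.
            while seq[i + copies * m:i + (copies + 1) * m] == motif:
                copies += 1
            if copies >= min_copies:
                found = (i, motif, copies)
                break
        if found:
            hits.append(found)
            i += len(found[1]) * found[2]  # jump past the reported tract
        else:
            i += 1
    return hits
```

For example, scanning "GGCACACACATT" with the default settings reports a single tract: the motif CA, starting at index 2, with four copies.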

oai:open-archive.highwire.org:bioinfo:23/10/11882015-07-29HighWireOUPbioinfo:23:10

GAPWM: a genetic algorithm method for optimizing a position weight matrix Li, Leping Liang, Yu Bass, Robert L. SEQUENCE ANALYSIS Motivation: Position weight matrices (PWMs) are simple models commonly used in motif-finding algorithms to identify short functional elements, such as <it>cis</it>-regulatory motifs, on genes. When few experimentally verified motifs are available, estimation of the PWM may be poor. The resultant PWM may not reliably discriminate a true motif from a false one. While experimentally identifying such motifs remains time-consuming and expensive, low-resolution binding data from techniques such as ChIP-on-chip and ChIP-PET have become available. We propose a novel but simple method to improve a poorly estimated PWM using ChIP data. Methodology: Starting from an existing PWM, a set of ChIP sequences, and a set of background sequences, our method, GAPWM, derives an improved PWM via a genetic algorithm that maximizes the area under the receiver operating characteristic (ROC) curve. GAPWM can easily incorporate prior information such as base conservation. We tested our method on two PWMs (Oct4/Sox2 and p53) using three recently published ChIP data sets (human Oct4, mouse Oct4 and human p53). Results: GAPWM substantially increased the sensitivity/specificity of a poorly estimated PWM and further improved the quality of a good PWM. Furthermore, it still functioned when the starting PWM contained a major error. The ROC performance of GAPWM compared favorably with that of MEME and others. With increasing availability of ChIP data, our method provides an alternative for obtaining high-quality PWMs for genome-wide identification of transcription factor binding sites.
Availability: The C source code and all data used in this report are available at <inter-ref locator="http://dir.niehs.nih.gov/dirbb/gapwm" locator-type="url">http://dir.niehs.nih.gov/dirbb/gapwm</inter-ref> Contact: <inter-ref locator="li3@niehs.nih.gov" locator-type="email">li3@niehs.nih.gov</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1188 http://dx.doi.org/10.1093/bioinformatics/btm080 en Copyright (C) 2007, Oxford University Press
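
Illustrative note (not part of the harvested record): the quantity a genetic algorithm could maximize here, the area under the ROC curve for a PWM that separates ChIP sequences from background sequences, can be sketched as follows. The function names, the uniform 0.25 background and the best-window scoring rule are assumptions. The genetic algorithm itself is not shown, and this is not the GAPWM source code.

```python
import math


def best_pwm_score(seq, pwm, background=0.25):
    """Highest log-odds score of any window of seq under the PWM.

    pwm is a list of dicts mapping each base to a probability > 0
    (pseudocounts are assumed to have been applied already).
    """
    width = len(pwm)
    best = -math.inf
    for i in range(len(seq) - width + 1):
        score = sum(math.log2(pwm[j][seq[i + j]] / background)
                    for j in range(width))
        best = max(best, score)
    return best


def roc_auc(pos_scores, neg_scores):
    """AUC as the chance that a positive outscores a negative (ties = 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def pwm_fitness(pwm, chip_seqs, background_seqs):
    """Fitness of a candidate PWM: AUC of ChIP versus background scores."""
    pos = [best_pwm_score(s, pwm) for s in chip_seqs]
    neg = [best_pwm_score(s, pwm) for s in background_seqs]
    return roc_auc(pos, neg)
```

A genetic algorithm would mutate and recombine candidate matrices and keep those with higher pwm_fitness values. The pairwise AUC computation is quadratic in the number of sequences, which is acceptable for a sketch but would typically be replaced by a rank-based formula on larger data sets.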

oai:open-archive.highwire.org:bioinfo:23/10/11952015-07-29HighWireOUPbioinfo:23:10

A fast and flexible approach to oligonucleotide probe design for genomes and gene families Feng, Shengzhong Tillier, Elisabeth R.M. SEQUENCE ANALYSIS Motivation: With hundreds of completely sequenced microbial genomes available, and advancements in DNA microarray technology, the detection of genes in microbial communities consisting of hundreds of thousands of sequences may be possible. The existing strategies developed for DNA probe design, geared toward identifying specific sequences, are not suitable due to the lack of coverage, flexibility and efficiency necessary for applications in metagenomics. Methods: ProDesign is a tool developed for the selection of oligonucleotide probes to detect members of gene families present in environmental samples. Gene family-specific probe sequences are generated based on specific and shared words, which are found with the spaced seed hashing algorithm. To detect more sequences, those sharing some common words are re-clustered into new families, then probes specific for the new families are generated. Results: The program is very flexible in that it can be used for designing probes for detecting many gene families simultaneously and specifically in one or more genomes. Neither the length nor the melting temperature of the probes needs to be predefined. We have found that ProDesign provides more flexibility, coverage and speed than other software programs used in the selection of probes for genomic and gene family arrays. Availability: ProDesign is licensed free of charge to academic users. ProDesign and Supplementary Material can be obtained by contacting the authors.
A web server for ProDesign is available at <inter-ref locator="http://www.uhnresearch.ca/labs/tillier/ProDesign/ProDesign.html" locator-type="url">http://www.uhnresearch.ca/labs/tillier/ProDesign/ProDesign.html</inter-ref> Contact: <inter-ref locator="e.tillier@utoronto.ca" locator-type="email">e.tillier@utoronto.ca</inter-ref> or <inter-ref locator="fsz@ncic.ac.cn" locator-type="email">fsz@ncic.ac.cn</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1195 http://dx.doi.org/10.1093/bioinformatics/btm114 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12032015-07-29HighWireOUPbioinfo:23:10

AutoSCOP: automated prediction of SCOP classifications using unique pattern-class mappings Gewehr, Jan E. Hintermair, Volker Zimmer, Ralf STRUCTURAL BIOINFORMATICS Motivation: The sequence patterns contained in the available motif and hidden Markov model (HMM) databases are a valuable source of information for protein sequence annotation. For structure prediction and fold recognition purposes, we computed mappings from such pattern databases to the protein domain hierarchy given by the ASTRAL compendium and applied them to the prediction of SCOP classifications. Our aim is to make highly confident predictions also for non-trivial cases if possible and abstain from a prediction otherwise, and thus to provide a method that can be used as a first step in a pipeline of prediction methods. We describe two successful examples for such pipelines. With the AutoSCOP approach, it is possible to make predictions in a large-scale manner for many domains of the available sequences in the well-known protein sequence databases. Results: AutoSCOP computes unique sequence patterns and pattern combinations for SCOP classifications. For instance, we assign a SCOP superfamily to a pattern found in its members whenever the pattern does not occur in any other SCOP superfamily. Especially on the fold and superfamily level, our method achieves both high sensitivity (above 93%) and high specificity (above 98%) on the difference set between two ASTRAL versions, due to being able to abstain from unreliable predictions. Further, on a harder test set filtered at low sequence identity, the combination with profile–profile alignments improves accuracy and performs comparably even to structure alignment methods. Integrating our method with structure alignment, we are able to achieve an accuracy of 99% on SCOP fold classifications on this set. 
In an analysis of false assignments of domains from new folds/superfamilies/families to existing SCOP classifications, AutoSCOP correctly abstains for more than 70% of the domains belonging to new folds and superfamilies, and more than 80% of the domains belonging to new families. These findings show that our approach is a useful additional filter for SCOP classification prediction of protein domains in combination with well-known methods such as profile–profile alignment. Availability: A web server where users can input their domain sequences is available at <inter-ref locator="http://www.bio.ifi.lmu.de/autoscop" locator-type="url">http://www.bio.ifi.lmu.de/autoscop</inter-ref> Contact: <inter-ref locator="jan.gewehr@ifi.lmu.de" locator-type="email">jan.gewehr@ifi.lmu.de</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1203 http://dx.doi.org/10.1093/bioinformatics/btm089 en Copyright (C) 2007, Oxford University Press
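
Illustrative note (not part of the harvested record): the unique pattern-class mapping rule, together with abstention when no confident call is possible, can be sketched as below. The function names, the input layout and the agreement rule used for prediction are assumptions for illustration and do not reproduce AutoSCOP's implementation or its pattern combinations.

```python
from collections import defaultdict


def unique_pattern_map(pattern_hits):
    """Map each pattern to a superfamily only if it occurs in no other one.

    pattern_hits is an iterable of (pattern_id, superfamily) pairs taken
    from domains with known classification (hypothetical input layout).
    """
    classes = defaultdict(set)
    for pattern, superfamily in pattern_hits:
        classes[pattern].add(superfamily)
    return {p: next(iter(s)) for p, s in classes.items() if len(s) == 1}


def predict(patterns_in_query, mapping):
    """Return a superfamily if all unique patterns agree, else None.

    Returning None models abstaining from an unreliable prediction.
    """
    votes = {mapping[p] for p in patterns_in_query if p in mapping}
    return votes.pop() if len(votes) == 1 else None
```

A query with no uniquely mapped pattern, or with patterns pointing to different superfamilies, yields None and can be handed to the next method in a prediction pipeline.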

oai:open-archive.highwire.org:bioinfo:23/10/12112015-07-29HighWireOUPbioinfo:23:10

Glycan classification with tree kernels Yamanishi, Yoshihiro Bach, Francis Vert, Jean-Philippe STRUCTURAL BIOINFORMATICS Motivation: Glycans are covalent assemblies of sugar that play crucial roles in many cellular processes. Recently, comprehensive data about the structure and function of glycans have been accumulated; therefore, the need for methods and algorithms to analyze these data is growing fast. Results: This article presents novel methods for classifying glycans and detecting discriminative glycan motifs with support vector machines (SVM). We propose a new class of tree kernels to measure the similarity between glycans. These kernels are based on the comparison of tree substructures, and take into account several glycan features such as the sugar type, the sugar bound type or layer depth. The proposed methods are tested on their ability to classify human glycans into four blood components: leukemia cells, erythrocytes, plasma and serum. They are shown to outperform a previously published method. We also applied a feature selection approach to extract glycan motifs which are characteristic of each blood component. We confirmed that some leukemia-specific glycan motifs detected by our method corresponded to several results in the literature. Availability: Software is available upon request. Contact: <inter-ref locator="yoshi@kuicr.kyoto-u.ac.jp" locator-type="email">yoshi@kuicr.kyoto-u.ac.jp</inter-ref> Supplementary information: Datasets are available at the following website: <inter-ref locator="http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/glycankernel/" locator-type="url">http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/glycankernel/</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1211 http://dx.doi.org/10.1093/bioinformatics/btm090 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12172015-07-29HighWireOUPbioinfo:23:10

Pooling mRNA in microarray experiments and its effect on power Zhang, Wuyan Carriquiry, Alicia Nettleton, Dan Dekkers, Jack C.M. GENE EXPRESSION Motivation: Microarrays can simultaneously measure the expression levels of many genes and are widely applied to study complex biological problems at the genetic level. To contain costs, instead of obtaining a microarray on each individual, mRNA from several subjects can be first pooled and then measured with a single array. mRNA pooling is also necessary when there is not enough mRNA from each subject. Several studies have investigated the impact of pooling mRNA on inferences about gene expression, but have typically modeled the process of pooling as if it occurred in some transformed scale. This assumption is unrealistic. Results: We propose modeling the gene expression levels in a pool as a weighted average of mRNA expression of all individuals in the pool on the original measurement scale, where the weights correspond to individual sample contributions to the pool. Based on these improved statistical models, we develop the appropriate F statistics to test for differentially expressed genes. We present formulae to calculate the power of various statistical tests under different strategies for pooling mRNA and compare resulting power estimates to those that would be obtained by following the approach proposed by Kendziorski <it>et al</it>. (<cross-ref type="bib" refid="B5">2003</cross-ref>). We find that the Kendziorski estimate tends to exceed true power and that the estimate we propose, while somewhat conservative, is less biased. We argue that it is possible to design a study that includes mRNA pooling at a significantly reduced cost but with little loss of information. 
Contact: <inter-ref locator="alicia@iastate.edu" locator-type="email">alicia@iastate.edu</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1217 http://dx.doi.org/10.1093/bioinformatics/btm081 en Copyright (C) 2007, Oxford University Press
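
Illustrative note (not part of the harvested record): the difference between pooling on the original measurement scale and pooling on a transformed scale can be seen in a small numeric sketch. The expression values, the equal weights and the log2 transform are hypothetical choices for illustration and are not taken from the article.

```python
import math


def pooled_expression(levels, weights):
    """Weighted average of per-individual values for one gene in one pool.

    weights are the individual sample contributions to the pool.
    """
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, levels)) / total


# Two hypothetical subjects contributing equally to one pool.
levels = [100.0, 400.0]
weights = [1.0, 1.0]

# Physical pooling averages mRNA on the original scale ...
raw_pool = pooled_expression(levels, weights)
log_of_pool = math.log2(raw_pool)

# ... which is not the same as averaging the log-transformed values.
mean_of_logs = pooled_expression([math.log2(x) for x in levels], weights)

print(raw_pool, round(log_of_pool, 3), round(mean_of_logs, 3))
```

Because the logarithm is concave, the log of the pooled value is never smaller than the average of the individual logs. A model that treats pooling as averaging on the log scale therefore describes a different quantity from the one measured on a pooled array.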

oai:open-archive.highwire.org:bioinfo:23/10/12252015-07-29HighWireOUPbioinfo:23:10

Domain-enhanced analysis of microarray data using GO annotations Liu, Jiajun Hughes-Oliver, Jacqueline M. Menius, J. Alan GENE EXPRESSION Motivation: New biological systems technologies give scientists the ability to measure thousands of bio-molecules including genes, proteins, lipids and metabolites. We use domain knowledge, e.g. the Gene Ontology, to guide analysis of such data. By focusing on domain-aggregated results at, say, the molecular function level, increased interpretability is available to biological scientists beyond what is possible if results are presented at the gene level. Results: We use a ‘top–down’ approach to perform domain aggregation by first combining gene expressions before testing for differentially expressed patterns. This is in contrast to the more standard ‘bottom–up’ approach, where genes are first tested individually then aggregated by domain knowledge. The benefit is greater sensitivity for detecting signals. Our method, domain-enhanced analysis (DEA), is assessed and compared to other methods using simulation studies and analysis of two publicly available leukemia data sets. Availability: Our DEA method uses functions available in R (<inter-ref locator="http://www.r-project.org/" locator-type="url">http://www.r-project.org/</inter-ref>) and SAS (<inter-ref locator="http://www.sas.com/" locator-type="url">http://www.sas.com/</inter-ref>). The two experimental data sets used in our analysis are available in R as Bioconductor packages, ‘ALL’ and ‘golubEsets’ (<inter-ref locator="http://www.bioconductor.org/" locator-type="url">http://www.bioconductor.org/</inter-ref>). Contact: <inter-ref locator="jliu6@stat.ncsu.edu" locator-type="email">jliu6@stat.ncsu.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online.
Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1225 http://dx.doi.org/10.1093/bioinformatics/btm092 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12352015-07-29HighWireOUPbioinfo:23:10

Using DNA microarrays to study gene expression in closely related species Oshlack, Alicia Chabot, Adrien E. Smyth, Gordon K. Gilad, Yoav GENE EXPRESSION Motivation: Comparisons of gene expression levels within and between species have become a central tool in the study of the genetic basis for phenotypic variation, as well as in the study of the evolution of gene regulation. DNA microarrays are a key technology that enables these studies. Currently, however, microarrays are only available for a small number of species. Thus, in order to study gene expression levels in species for which microarrays are not available, researchers face three sets of choices: (i) use a microarray designed for another species, but only compare gene expression levels within species, (ii) construct a new microarray for every species whose gene expression profiles will be compared or (iii) build a multi-species microarray with probes from each species of interest. Here, we use data collected using a multi-primate cDNA array to evaluate the reliability of each approach. Results: We find that, for inter-species comparisons, estimates of expression differences based on multi-species microarrays are more accurate than those based on multiple species-specific arrays. We also demonstrate that within-species expression differences can be estimated using a microarray for a closely related species, without discernible loss of information. Contact: A.O. (<inter-ref locator="oshlack@wehi.edu.au" locator-type="email">oshlack@wehi.edu.au</inter-ref>) or Y.G. (<inter-ref locator="gilad@uchicago.edu" locator-type="email">gilad@uchicago.edu</inter-ref>) Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1235 http://dx.doi.org/10.1093/bioinformatics/btm111 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12432015-07-29HighWireOUPbioinfo:23:10

A mixture model approach to the tests of concordance and discordance between two large-scale experiments with two-sample groups Lai, Yinglei Adam, Bao-ling Podolsky, Robert She, Jin-Xiong GENE EXPRESSION Motivation: Due to advances in experimental technologies, such as microarray, mass spectrometry and nuclear magnetic resonance, it is feasible to obtain large-scale data sets, in which measurements for a large number of features can be simultaneously collected. However, the sample sizes of these data sets are usually small due to their relatively high costs, which leads to the issue of concordance among different data sets collected for the same study: features should have consistent behavior in different data sets. There is a lack of rigorous statistical methods for evaluating this concordance or discordance. Methods: Based on a three-component normal-mixture model, we propose two likelihood ratio tests for evaluating the concordance and discordance between two large-scale data sets with two sample groups. The parameter estimation is achieved through the expectation-maximization (E-M) algorithm. A normal-distribution-quantile-based method is used for data transformation. Results: To evaluate the proposed tests, we conducted simulation studies, which suggested satisfactory performance. As applications, the proposed tests were applied to three SELDI-MS data sets with replicates. One data set has replicates from different platforms and the other two have replicates from the same platform. We found that data generated by SELDI-MS showed satisfactory concordance between replicates from the same platform but unsatisfactory concordance between replicates from different platforms.
Availability: The R codes are freely available at <inter-ref locator="http://home.gwu.edu/~ylai/research/Concordance" locator-type="url">http://home.gwu.edu/~ylai/research/Concordance</inter-ref> Contact: <inter-ref locator="ylai@gwu.edu" locator-type="email">ylai@gwu.edu</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1243 http://dx.doi.org/10.1093/bioinformatics/btm103 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12512015-07-29HighWireOUPbioinfo:23:10

Modeling sequence–sequence interactions for drug response Lin, Min Li, Hongying Hou, Wei Johnson, Julie A. Wu, Rongling GENETICS AND POPULATION ANALYSIS Motivation: Genetic interactions or epistasis may play an important role in the genetic etiology of drug response. With the availability of large-scale, high-density single nucleotide polymorphism markers, a great challenge is how to associate haplotype structures and complex drug response through its underlying pharmacodynamic mechanisms. Results: We have derived a general statistical model for detecting an interactive network of DNA sequence variants that encode pharmacodynamic processes based on the haplotype map constructed by single nucleotide polymorphisms. The model was validated by a pharmacogenetic study for two predominant beta-adrenergic receptor (βAR) subtypes expressed in the heart, β1AR and β2AR. Haplotypes from these two receptors trigger significant interaction effects on the response of heart rate to different dose levels of dobutamine. This model will have implications for pharmacogenetic and pharmacogenomic research and drug discovery. Availability: A computer program written in Matlab can be downloaded from the webpage of statistical genetics group at the University of Florida. Contact: <inter-ref locator="rwu@mail.ifas.ufl.edu" locator-type="email">rwu@mail.ifas.ufl.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1251 http://dx.doi.org/10.1093/bioinformatics/btm110 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12582015-07-29HighWireOUPbioinfo:23:10

Metabolic systems cost-benefit analysis for interpreting network structure and regulation Carlson, Ross P. SYSTEMS BIOLOGY Motivation: Interpretation of bioinformatics data in terms of cellular function is a major challenge facing systems biology. This question is complicated by robust metabolic networks filled with structural features like parallel pathways and isozymes. Under conditions of nutrient sufficiency, metabolic networks are well known to be regulated for thermodynamic efficiency; however, efficient biochemical pathways are anabolically expensive to construct. While parameters like thermodynamic efficiency have been extensively studied, a systems-based analysis of anabolic proteome synthesis ‘costs’ and the cellular function implications of these costs has not been reported. Results: A cost-benefit analysis of an <it>in silico Escherichia coli</it> network revealed the relationship between metabolic pathway proteome synthesis requirements, DNA-coding sequence length, thermodynamic efficiency and substrate affinity. The results highlight basic metabolic network design principles. Pathway proteome synthesis requirements appear to have shaped biochemical network structure and regulation. Under conditions of nutrient scarcity and other general stresses, <it>E.coli</it> expresses pathways with relatively inexpensive proteome synthesis requirements instead of more efficient but also anabolically more expensive pathways. This evolutionary strategy provides a cellular function-based explanation for common network motifs like isozymes and parallel pathways and possibly explains ‘overflow’ metabolisms observed during nutrient scarcity. Contact: <inter-ref locator="alicia@iastate.edu" locator-type="email">alicia@iastate.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. 
Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1258 http://dx.doi.org/10.1093/bioinformatics/btm082 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12652015-07-29HighWireOUPbioinfo:23:10

The impact of function perturbations in Boolean networks Xiao, Yufei Dougherty, Edward R. SYSTEMS BIOLOGY Motivation: A network is said to be <it>robust</it> relative to a certain network characteristic if a small change in network structure does not significantly affect the characteristic. From the perspective of network stability, robustness is desirable; however, from the perspective of intervention to exert influence on network behavior, it is undesirable. For Boolean networks, there are two fundamental types of robustness. One type pertains to perturbing the state of the network and the other to perturbing the rule-based structure. Results: This article explores the impact of function perturbations in Boolean networks from two aspects: (1) analysis: predict the impact on network state transitions and attractors via analytical approaches or identify a perturbation by observing its consequences; (2) synthesis: preserve or modify the network characteristics, especially attractors, by introducing a judicious change to the functions. The results are applied to achieve intervention that structurally alters the network to achieve a more favorable steady-state distribution and to identify the function perturbation that has led to altered observed behavior. The intervention procedure is applied to a WNT5A network to reduce the risk of metastasis in melanoma, and the identification procedure is applied to a <it>Drosophila melanogaster</it> segmentation polarity gene network to identify regulatory function perturbation. Contact: <inter-ref locator="edward@ece.tamu.edu" locator-type="email">edward@ece.tamu.edu</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1265 http://dx.doi.org/10.1093/bioinformatics/btm093 en Copyright (C) 2007, Oxford University Press
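
The effect of a function perturbation on attractors can be illustrated with a small sketch. The two-gene network below, its update rules and the helper names `make_step` and `attractors` are hypothetical and chosen only for illustration; they are not taken from the article, and exhaustive enumeration like this is only practical for very small networks.

```python
from itertools import product


def make_step(tables):
    """Build a synchronous update function from per-gene truth tables.

    tables[i] maps a full network state (tuple of 0/1) to gene i's next value.
    """
    def step(state):
        return tuple(table[state] for table in tables)
    return step


def attractors(step, n):
    """Return all attractors of an n-gene network, each as a frozenset of states."""
    found = set()
    for start in product((0, 1), repeat=n):
        seen = []
        state = start
        # Follow the trajectory until a state repeats.
        while state not in seen:
            seen.append(state)
            state = step(state)
        # The states from the first repeated state onward form the cycle.
        found.add(frozenset(seen[seen.index(state):]))
    return found


# Hypothetical toy network: x1' = x1 AND x2, x2' = x1 OR x2.
states = list(product((0, 1), repeat=2))
and_table = {s: s[0] & s[1] for s in states}
or_table = {s: s[0] | s[1] for s in states}
original = attractors(make_step([and_table, or_table]), 2)

# Function perturbation: flip a single truth-table entry of gene 2.
perturbed_or = dict(or_table)
perturbed_or[(0, 1)] = 0
perturbed = attractors(make_step([and_table, perturbed_or]), 2)
```

In this toy case the unperturbed network has three fixed points, and the one-entry change removes the fixed point (0, 1), which is the kind of structural intervention the abstract refers to.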

oai:open-archive.highwire.org:bioinfo:23/10/12742015-07-29HighWireOUPbioinfo:23:10

A new method to measure the semantic similarity of GO terms Wang, James Z. Du, Zhidian Payattakool, Rapeeporn Yu, Philip S. Chen, Chin-Fu DATA AND TEXT MINING Motivation: Although controlled biochemical or biological vocabularies, such as Gene Ontology (GO) (<inter-ref locator="http://www.geneontology.org" locator-type="url">http://www.geneontology.org</inter-ref>), address the need for consistent descriptions of genes in different data sources, there is still no effective method to determine the functional similarities of genes based on gene annotation information from heterogeneous data sources. Results: To address this critical need, we proposed a novel method to encode a GO term's semantics (biological meanings) into a numeric value by aggregating the semantic contributions of its ancestor terms (including this specific term) in the GO graph and, in turn, designed an algorithm to measure the semantic similarity of GO terms. Based on the semantic similarities of GO terms used for gene annotation, we designed a new algorithm to measure the functional similarity of genes. The results of using our algorithm to measure the functional similarities of genes in pathways retrieved from the Saccharomyces Genome Database (SGD) and the outcomes of clustering these genes based on the similarity values obtained by our algorithm are shown to be consistent with human perspectives. Furthermore, we developed a set of online tools for gene similarity measurement and knowledge discovery. 
Availability: The online tools are available at: <inter-ref locator="http://bioinformatics.clemson.edu/G-SESAME" locator-type="url">http://bioinformatics.clemson.edu/G-SESAME</inter-ref> Contact: <inter-ref locator="jzwang@cs.clemson.edu" locator-type="email">jzwang@cs.clemson.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.clemson.edu/Publication/Supplement/gsp.htm" locator-type="url">http://bioinformatics.clemson.edu/Publication/Supplement/gsp.htm</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1274 http://dx.doi.org/10.1093/bioinformatics/btm087 en Copyright (C) 2007, Oxford University Press
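
One plausible reading of "aggregating the semantic contributions of ancestor terms" can be sketched as follows. The single edge weight of 0.8, the toy graph and the function names `s_values` and `similarity` are assumptions made for illustration; the record itself does not give the weights or the exact formula.

```python
def s_values(term, parents, weight=0.8):
    """Contribution of `term` and each of its ancestors to the semantics of `term`.

    The term contributes 1.0 to itself; an ancestor contributes the largest
    product of edge weights along any path from the term up to that ancestor.
    `weight` is an assumed uniform edge weight.
    """
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            candidate = weight * contrib[child]
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                frontier.append(parent)
    return contrib


def similarity(a, b, parents, weight=0.8):
    """Shared-ancestor contributions divided by the total semantic values of a and b."""
    sa = s_values(a, parents, weight)
    sb = s_values(b, parents, weight)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Hypothetical toy graph: C and D are both children of B, and B is a child of A.
toy_parents = {"C": ["B"], "D": ["B"], "B": ["A"]}
same = similarity("C", "C", toy_parents)
siblings = similarity("C", "D", toy_parents)
```

With this construction a term is maximally similar to itself, and sibling terms score below 1 because neither contributes to the other's ancestor set.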

oai:open-archive.highwire.org:bioinfo:23/10/12822015-07-29HighWireOUPbioinfo:23:10

UniRef: comprehensive and non-redundant UniProt reference clusters Suzek, Baris E. Huang, Hongzhan McGarvey, Peter Mazumder, Raja Wu, Cathy H. DATABASES AND ONTOLOGIES Motivation: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. Results: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of ∼10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. 
Availability: UniRef is updated biweekly and is available for online search and retrieval at <inter-ref locator="http://www.uniprot.org" locator-type="url">http://www.uniprot.org</inter-ref>, as well as for download at <inter-ref locator="ftp://ftp.uniprot.org/pub/databases/uniprot/uniref" locator-type="url">ftp://ftp.uniprot.org/pub/databases/uniprot/uniref</inter-ref> Contact: <inter-ref locator="bes23@georgetown.edu" locator-type="email">bes23@georgetown.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1282 http://dx.doi.org/10.1093/bioinformatics/btm098 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12892015-07-29HighWireOUPbioinfo:23:10

Enhancements and modifications of primer design program Primer3 Koressaar, Triinu Remm, Maido SEQUENCE ANALYSIS Summary: The determination of annealing temperature is a critical step in PCR design. This parameter is typically derived from the melting temperature of the PCR primers, so for successful PCR work it is important to determine the melting temperature of primer accurately. We introduced several enhancements in the widely used primer design program Primer3. The improvements include a formula for calculating melting temperature and a salt correction formula. Also, the new version can take into account the effects of divalent cations, which are included in most PCR buffers. Another modification enables using lowercase masked template sequences for primer design. Availability: Features described in this article have been implemented into the development code of Primer3 and will be available in future versions (version 1.1 and newer) of Primer3. Also, a modified version is compiled under the name of mPrimer3 which is distributed independently. The web-based version of mPrimer3 is available at <inter-ref locator="http://bioinfo.ebc.ee/mprimer3/" locator-type="url">http://bioinfo.ebc.ee/mprimer3/</inter-ref> and the binary code is freely downloadable from the URL <inter-ref locator="http://bioinfo.ebc.ee/download/" locator-type="url">http://bioinfo.ebc.ee/download/</inter-ref>. Contact: <inter-ref locator="maido.remm@ut.ee" locator-type="email">maido.remm@ut.ee</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1289 http://dx.doi.org/10.1093/bioinformatics/btm091 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12922015-07-29HighWireOUPbioinfo:23:10

iPTREE-STAB: interpretable decision tree based method for predicting protein stability changes upon mutations Huang, Liang-Tsung Gromiha, M. Michael Ho, Shinn-Ying STRUCTURAL BIOINFORMATICS Summary: We have developed a web server, iPTREE-STAB, for discriminating the stability of proteins (stabilizing or destabilizing) and predicting their stability changes (ΔΔG) upon single amino acid substitutions from amino acid sequence. The discrimination and prediction are mainly based on a decision tree coupled with an adaptive boosting algorithm, and a classification and regression tree, respectively, using three neighboring residues of the mutant site along N- and C-terminals. Our method showed an accuracy of 82% for discriminating the stabilizing and destabilizing mutants, and a correlation of 0.70 for predicting protein stability changes upon mutations. Availability: <inter-ref locator="http://bioinformatics.myweb.hinet.net/iptree.htm" locator-type="url">http://bioinformatics.myweb.hinet.net/iptree.htm</inter-ref> Contact: <inter-ref locator="michael-gromiha@aist.go.jp" locator-type="email">michael-gromiha@aist.go.jp</inter-ref> Supplementary information: Dataset and other details are given. Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1292 http://dx.doi.org/10.1093/bioinformatics/btm100 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12942015-07-29HighWireOUPbioinfo:23:10

GenABEL: an R library for genome-wide association analysis Aulchenko, Yurii S. Ripke, Stephan Isaacs, Aaron van Duijn, Cornelia M. GENETICS AND POPULATION ANALYSIS Here we describe an R library for genome-wide association (GWA) analysis. It implements effective storage and handling of GWA data, fast procedures for genetic data quality control, testing of association of single nucleotide polymorphisms with binary or quantitative traits, visualization of results and also provides easy interfaces to standard statistical and graphical procedures implemented in base R and special R libraries for genetic analysis. We evaluated GenABEL using one simulated and two real data sets. We conclude that GenABEL enables the analysis of GWA data on desktop computers. Availability: <inter-ref locator="http://cran.r-project.org" locator-type="url">http://cran.r-project.org</inter-ref> Contact: <inter-ref locator="i.aoultchenko@erasmusmc.nl" locator-type="email">i.aoultchenko@erasmusmc.nl</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1294 http://dx.doi.org/10.1093/bioinformatics/btm108 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12972015-07-29HighWireOUPbioinfo:23:10

SBML export interface for the systems biology toolbox for MATLAB Schmidt, Henning Drews, Gunnar Vera, Julio Wolkenhauer, Olaf SYSTEMS BIOLOGY Summary: In this application note, we present a Systems Biology Markup Language (SBML) export interface for the Systems Biology Toolbox for MATLAB. This interface allows modelers to automatically convert models, represented in the toolbox's own format (SBmodels), to SBML files. Since SBmodels do not explicitly contain all the information that is required to generate SBML, the necessary information is gathered by parsing SBmodels. The export can be done in two different ways. First, it is possible to call the export from the command line, thereby directly converting a model to an SBML file. The second option is to inspect and edit the conversion results with the help of a graphical user interface and to subsequently export the model to SBML. Availability: The SBML export interface has been integrated into the Systems Biology Toolbox for MATLAB, which is open source and freely available from <inter-ref locator="http://www.sbtoolbox2.org" locator-type="url">http://www.sbtoolbox2.org</inter-ref>. The website also contains a tutorial, extensive documentation and examples. Contact: <inter-ref locator="henning@fcc.chalmers.se" locator-type="email">henning@fcc.chalmers.se</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1297 http://dx.doi.org/10.1093/bioinformatics/btm105 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/12992015-07-29HighWireOUPbioinfo:23:10

Cyclone: java-based querying and computing with Pathway/Genome databases Fèvre, François Le Smidtas, Serge Schächter, Vincent SYSTEMS BIOLOGY Summary: Cyclone aims at facilitating the use of BioCyc, a collection of Pathway/Genome Databases (PGDBs). Cyclone provides a fully extensible Java Object API to analyze and visualize these data. Cyclone can read and write PGDBs, and can write its own data in the CycloneML format. This format is automatically generated from the BioCyc ontology by Cyclone itself, ensuring continued compatibility. Cyclone objects can also be stored in a relational database CycloneDB. Queries can be written in SQL, and in an intuitive and concise object-oriented query language, Hibernate Query Language (HQL). In addition, Cyclone interfaces easily with Java software including the Eclipse IDE for HQL edition, the Jung API for graph algorithms or Cytoscape for graph visualization. Availability: Cyclone is freely available under an open source license at: <inter-ref locator="http://sourceforge.net/projects/nemo-cyclone" locator-type="url">http://sourceforge.net/projects/nemo-cyclone</inter-ref> Contact: <inter-ref locator="cyclone@genoscope.cns.fr" locator-type="email">cyclone@genoscope.cns.fr</inter-ref> Supplementary information: For download and installation instructions, tutorials, use cases and examples, see <inter-ref locator="http://nemo-cyclone.sourceforge.net" locator-type="url">http://nemo-cyclone.sourceforge.net</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1299 http://dx.doi.org/10.1093/bioinformatics/btm107 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/13012015-07-29HighWireOUPbioinfo:23:10

BioGuideSRS: querying multiple sources with a user-centric perspective Cohen-Boulakia, Sarah Biton, Olivier Davidson, Susan Froidevaux, Christine DATABASES AND ONTOLOGIES Summary: Biologists are frequently faced with the problem of integrating information from multiple heterogeneous sources with their own experimental data. Given the large number of public sources, it is difficult to choose which sources to integrate without assistance. When doing this manually, biologists differ in their <it>preferences</it> concerning the sources to be queried as well as the <it>strategies</it>, i.e. the querying process they follow for navigating through the sources. In response to these findings, we have developed BioGuide to assist scientists search for relevant data within external sources while taking their preferences and strategies into account. In this article, we present BioGuideSRS, a user-friendly system which automatically retrieves instances of data by using BioGuide on top of the sequence retrieval system (SRS). BioGuideSRS is an Applet that can be run from its web page on any system with Java 5.0. Availability: <inter-ref locator="http://www.bioguide-project.net" locator-type="url">http://www.bioguide-project.net</inter-ref> Contact: <inter-ref locator="sarahcb@seas.upenn.edu" locator-type="email">sarahcb@seas.upenn.edu</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1301 http://dx.doi.org/10.1093/bioinformatics/btm088 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/13042015-07-29HighWireOUPbioinfo:23:10

Mediante: a web-based microarray data manager Le Brigand, Kevin Barbry, Pascal DATABASES AND ONTOLOGIES Summary: Mediante is a MIAME-compliant microarray data manager that links together annotations and experimental data. Developed as a J2EE three-tier application, Mediante integrates a management system for production of long oligonucleotide microarrays, an experimental data repository suitable for home made or commercial microarrays, and a user interface dedicated to the management of microarrays projects. Several tools allow quality control of hybridizations and submission of validated data to public repositories. Availability: <inter-ref locator="http://www.microarray.fr" locator-type="url">http://www.microarray.fr</inter-ref> Contact: <inter-ref locator="barbry@ipmc.cnrs.fr" locator-type="email">barbry@ipmc.cnrs.fr</inter-ref> or <inter-ref locator="lebrigand@ipmc.cnrs.fr" locator-type="email">lebrigand@ipmc.cnrs.fr</inter-ref> Supplementary information: <inter-ref locator="http://www.microarray.fr/SP/lebrigand2007/" locator-type="url">http://www.microarray.fr/SP/lebrigand2007/</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1304 http://dx.doi.org/10.1093/bioinformatics/btm106 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/10/13072015-07-29HighWireOUPbioinfo:23:10

DPTF: a database of poplar transcription factors Zhu, Qi-Hui Guo, An-Yuan Gao, Ge Zhong, Ying-Fu Xu, Meng Huang, Minren Luo, Jinchu DATABASES AND ONTOLOGIES Summary: The database of poplar transcription factors (DPTF) is a plant transcription factor (TF) database containing 2576 putative poplar TFs distributed in 64 families. These TFs were identified from both computational prediction and manual curation. We have provided extensive annotations including sequence features, functional domains, GO assignment and expression evidence for all TFs. In addition, DPTF contains cross-links to the <it>Arabidopsis</it> and rice transcription factor databases making it a unique resource for genome-scale comparative studies of transcriptional regulation in model plants. Availability: DPTF is available at <inter-ref locator="http://dptf.cbi.pku.edu.cn" locator-type="url">http://dptf.cbi.pku.edu.cn</inter-ref> Contact: <inter-ref locator="dptf@mail.cbi.pku.edu.cn" locator-type="email">dptf@mail.cbi.pku.edu.cn</inter-ref> Oxford University Press 2007-05-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/10/1307 http://dx.doi.org/10.1093/bioinformatics/btm113 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/15732015-07-29HighWireOUPbioinfo:23:13

OSLay: optimal syntenic layout of unfinished assemblies Richter, Daniel C. Schuster, Stephan C. Huson, Daniel H. GENOME ANALYSIS Summary: The whole genome shotgun approach to genome sequencing results in a collection of contigs that must be ordered and oriented to facilitate efficient gap closure. We present a new tool, OSLay, that uses synteny between matching sequences in a target assembly and a reference assembly to lay out the contigs (or scaffolds) in the target assembly. The underlying algorithm is based on maximum weight matching. The tool provides an interactive visualization of the computed layout and the result can be imported into the assembly editing tool Consed to support the design of primer pairs for gap closure. Motivation: To enhance efficiency in the gap closure phase of a genome project it is crucial to know which contigs are adjacent in the target genome. Related genome sequences can be used to lay out contigs in an assembly. Availability: OSLay is freely available from: <inter-ref locator="http://www-ab.informatik.unituebingen.de/software/oslay" locator-type="url">http://www-ab.informatik.unituebingen.de/software/oslay</inter-ref> Contact: <inter-ref locator="drichter@informatik.uni-tuebingen.de" locator-type="email">drichter@informatik.uni-tuebingen.de</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1573 http://dx.doi.org/10.1093/bioinformatics/btm153 en Copyright (C) 2007, Oxford University Press
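
The idea of choosing contig adjacencies by maximum weight matching can be sketched on a tiny example. Everything below is an illustrative assumption rather than OSLay's implementation: the node names stand for contig ends, the weights stand for synteny support, and the exhaustive search in `max_weight_matching` is only feasible for a handful of nodes.

```python
def max_weight_matching(edges):
    """Exhaustively find a maximum weight matching.

    edges maps frozenset({u, v}) to a weight. Returns (total_weight, pairs),
    where no node appears in more than one pair.
    """
    nodes = tuple(sorted({node for edge in edges for node in edge}))

    def solve(remaining):
        if len(remaining) < 2:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unmatched.
        best = solve(rest)
        # Option 2: pair `first` with any later node it shares an edge with.
        for i, other in enumerate(rest):
            weight = edges.get(frozenset((first, other)))
            if weight is None:
                continue
            sub_weight, sub_pairs = solve(rest[:i] + rest[i + 1:])
            if weight + sub_weight > best[0]:
                best = (weight + sub_weight, [(first, other)] + sub_pairs)
        return best

    return solve(nodes)


# Hypothetical contig ends (L = left end, R = right end) with synteny-based weights.
support = {
    frozenset(("c1.R", "c2.L")): 5,
    frozenset(("c2.R", "c3.L")): 4,
    frozenset(("c1.R", "c3.L")): 6,
}
total, adjacencies = max_weight_matching(support)
```

Here the heaviest single edge (c1.R to c3.L, weight 6) is not part of the best layout, because taking it blocks the two compatible adjacencies whose combined weight is 9. That is the reason a matching formulation is preferable to greedily accepting the strongest link.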

oai:open-archive.highwire.org:bioinfo:23/13/15802015-07-29HighWireOUPbioinfo:23:13

iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences Gaëta, Bruno A. Malming, Harald R. Jackson, Katherine J.L. Bain, Michael E. Wilson, Patrick Collins, Andrew M. SEQUENCE ANALYSIS Motivation: Immunoglobulin heavy chain (IGH) genes in mature B lymphocytes are the result of recombination of IGHV, IGHD and IGHJ germline genes, followed by somatic mutation. The correct identification of the germline genes that make up a variable VH domain is essential to our understanding of the process of antibody diversity generation as well as to clinical investigations of some leukaemias and lymphomas. Results: We have developed iHMMune-align, an alignment program that uses a hidden Markov model (HMM) to model the processes involved in human IGH gene rearrangement and maturation. The performance of iHMMune-align was compared to that of other immunoglobulin gene alignment utilities using both clonally related and randomly selected IGH sequences. This evaluation suggests that iHMMune-align provides a more accurate identification of component germline genes than other currently available IGH gene characterization programs. Availability: iHMMune-align cross-platform Java executable and web interface are freely available to academic users and can be accessed at <inter-ref locator="http://www.emi.unsw.edu.au/~ihmmune/" locator-type="url">http://www.emi.unsw.edu.au/~ihmmune/</inter-ref> Contact: <inter-ref locator="bgaeta@cse.unsw.edu.au" locator-type="email">bgaeta@cse.unsw.edu.au</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1580 http://dx.doi.org/10.1093/bioinformatics/btm147 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/15882015-07-29HighWireOUPbioinfo:23:13

Murlet: a practical multiple alignment tool for structural RNA sequences Kiryu, Hisanori Tabei, Yasuo Kin, Taishin Asai, Kiyoshi STRUCTURAL BIOINFORMATICS Motivation: Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences. Results: We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the other programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is ∼300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named ‘Murlet’. 
Availability: The C++ source code of the Murlet software and the test dataset used in this study are available at <inter-ref locator="http://www.ncrna.org/papers/Murlet/" locator-type="url">http://www.ncrna.org/papers/Murlet/</inter-ref> Contact: <inter-ref locator="kiryu-h@aist.go.jp" locator-type="email">kiryu-h@aist.go.jp</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1588 http://dx.doi.org/10.1093/bioinformatics/btm146 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/15992015-07-29HighWireOUPbioinfo:23:13

Meta-analysis of gene expression data: a predictor-based approach Fishel, Irit Kaufman, Alon Ruppin, Eytan GENE EXPRESSION Motivation: With the increasing availability of cancer microarray data sets there is a growing need for integrative computational methods that evaluate multiple independent microarray data sets investigating a common theme or disorder. Meta-analysis techniques are designed to overcome the low sample size typical of microarray experiments and yield more valid and informative results than each experiment separately. Results: We propose a new meta-analysis technique that aims at finding a set of classifying genes, whose expression level may be used to answer the classification question at hand. Specifically, we apply our method to two independent lung cancer microarray data sets and identify a joint core subset of genes which putatively play an important role in tumor genesis of the lung. The robustness of the identified joint core set is demonstrated on a third unseen lung cancer data set, where it leads to successful classification using very few top-ranked genes. Identifying such a set of genes is of significant importance when searching for biologically meaningful biomarkers. Contact: <inter-ref locator="ruppin@post.tau.ac.il" locator-type="email">ruppin@post.tau.ac.il</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1599 http://dx.doi.org/10.1093/bioinformatics/btm149 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/16072015-07-29HighWireOUPbioinfo:23:13

Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach Pihur, Vasyl Datta, Susmita Datta, Somnath GENE EXPRESSION Motivation: Biologists often employ clustering techniques in the explorative phase of microarray data analysis to discover relevant biological groupings. Given the availability of numerous clustering algorithms in the machine-learning literature, a user might want to select one that performs the best for his/her data set or application. While various validation measures have been proposed over the years to judge the quality of clusters produced by a given clustering algorithm including their biological relevance, unfortunately, a given clustering algorithm can perform poorly under one validation measure while outperforming many other algorithms under another validation measure. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed. Results: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated as well as three real gene expression data sets from various platforms where we rank a total of eleven clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters <it>k</it> and also for an entire range of <it>k</it>. Availability: R code for all validation measures and rank aggregation is available from the authors upon request. 
Contact: <inter-ref locator="somnath.datta@louisville.edu" locator-type="email">somnath.datta@louisville.edu</inter-ref> Supplementary information: Supplementary information is available at <inter-ref locator="http://www.somnathdatta.org/Supp/RankCluster/supp.htm" locator-type="url">http://www.somnathdatta.org/Supp/RankCluster/supp.htm</inter-ref>. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1607 http://dx.doi.org/10.1093/bioinformatics/btm158 en Copyright (C) 2007, Oxford University Press
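
Weighted rank aggregation under a distance criterion can be sketched as below. This is not the Monte Carlo cross-entropy search described in the abstract: the sketch swaps in exhaustive enumeration of all orderings, uses the Spearman footrule as an assumed distance, and relies on hypothetical algorithm names and equal weights. It only shows what objective such a search would be optimizing.

```python
from itertools import permutations


def footrule(candidate, ranking):
    """Spearman footrule: sum of absolute position differences between two orderings."""
    position = {item: index for index, item in enumerate(ranking)}
    return sum(abs(index - position[item]) for index, item in enumerate(candidate))


def aggregate(rankings, weights):
    """Return the ordering that minimizes the weighted sum of footrule distances.

    Exhaustive over all permutations, so only usable for a few items.
    """
    items = sorted(rankings[0])

    def cost(candidate):
        return sum(w * footrule(candidate, r) for r, w in zip(rankings, weights))

    return min(permutations(items), key=cost)


# Hypothetical rankings of three clustering algorithms under three validation measures.
rankings = [
    ["kmeans", "hier", "som"],
    ["kmeans", "som", "hier"],
    ["hier", "kmeans", "som"],
]
consensus = aggregate(rankings, weights=[1.0, 1.0, 1.0])
```

For eleven algorithms the number of orderings is already 11! (about 40 million), which is why a stochastic search such as the one named in the abstract becomes attractive in place of enumeration.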

oai:open-archive.highwire.org:bioinfo:23/13/16162015-07-29HighWireOUPbioinfo:23:13

A network-based method for target selection in metabolic networks Guimerà, R. Sales-Pardo, M. Amaral, L.A.N. SYSTEMS BIOLOGY Motivation: The lack of new antimicrobials, combined with increasing microbial resistance to old ones, poses a serious threat to public health. With hundreds of genomes sequenced, systems biology promises to help in solving this problem by uncovering new drug targets. Results: Here, we propose an approach that is based on the mapping of the interactions between biochemical agents, such as proteins and metabolites, onto complex networks. We report that nodes and links in complex biochemical networks can be grouped into a small number of classes, based on their role in connecting different functional modules. Specifically, for metabolic networks, in which nodes represent metabolites and links represent enzymes, we demonstrate that some enzyme classes are more likely to be essential, some are more likely to be species-specific and some are likely to be both essential and specific. Our network-based enzyme classification scheme is thus a promising tool for the identification of drug targets. Contact: <inter-ref locator="rguimera@northwestern.edu" locator-type="email">rguimera@northwestern.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1616 http://dx.doi.org/10.1093/bioinformatics/btm150 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1623 2015-07-29 HighWire OUP bioinfo:23:13

Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method Fujita, A. Sato, J.R. Garay-Malpartida, H.M. Morettin, P.A. Sogayar, M.C. Ferreira, C.E. SYSTEMS BIOLOGY Motivation: Many biological cellular processes are achieved through a variety of extracellular regulators, signal transduction, protein–protein interactions and differential gene expression. Understanding the mechanisms underlying these processes requires a detailed molecular description of the protein and gene networks involved. To better understand these molecular networks, we propose a statistical method to estimate time-varying gene regulatory networks from time series microarray data. One well-known problem when inferring connectivity in gene regulatory networks is that the relationships found are correlations that do not allow causation to be inferred, for which a priori biological knowledge is required. Moreover, it is also necessary to know the time period at which this causation occurs. Here, we present the Dynamic Vector Autoregressive model as a solution to these problems. Results: We have applied the Dynamic Vector Autoregressive model to estimate time-varying gene regulatory networks based on gene expression profiles obtained from microarray experiments. The network is determined entirely from gene expression profile data, without any prior biological knowledge. Through construction of three gene regulatory networks (of p53, NF-κB and <it>c-myc</it>) for HeLa cells, we were able to predict the connectivity, Granger-causality and dynamics of the information flow in these networks.
Contact: <inter-ref locator="cef@ime.usp.br" locator-type="email">cef@ime.usp.br</inter-ref> Supplementary information: Additional figures may be found at <inter-ref locator="http://mariwork.iq.usp.br/dvar/" locator-type="url">http://mariwork.iq.usp.br/dvar/</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1623 http://dx.doi.org/10.1093/bioinformatics/btm151 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1631 2015-07-29 HighWire OUP bioinfo:23:13

Alignment of molecular networks by integer quadratic programming Li, Zhenping Zhang, Shihua Wang, Yong Zhang, Xiang-Sun Chen, Luonan SYSTEMS BIOLOGY Motivation: With more and more data on molecular networks (e.g. protein interaction networks, gene regulatory networks and metabolic networks) available, the discovery of conserved patterns or signaling pathways by comparing various kinds of networks among different species or within a species becomes an increasingly important problem. However, most of the conventional approaches either restrict comparative analysis to special structures, such as pathways, or adopt heuristic algorithms due to computational burden. Results: In this article, to find the conserved substructures, we develop an efficient algorithm for aligning molecular networks based on both molecule similarity and architecture similarity, by using integer quadratic programming (IQP). Such an IQP can be relaxed into the corresponding quadratic programming (QP) which almost always ensures an integer solution, thereby making molecular network alignment tractable without any approximation. The proposed framework is very flexible and can be applied to many kinds of molecular networks including weighted and unweighted, directed and undirected networks with or without loops. Availability: Matlab code and data are available from <inter-ref locator="http://zhangroup.aporc.org/bioinfo/MNAligner" locator-type="url">http://zhangroup.aporc.org/bioinfo/MNAligner</inter-ref> or <inter-ref locator="http://intelligent.eic.osaka-sandai.ac.jp/chenen/software/MNAligner" locator-type="url">http://intelligent.eic.osaka-sandai.ac.jp/chenen/software/MNAligner</inter-ref>, or upon request from authors. 
Contact: <inter-ref locator="zxs@amt.ac.cn" locator-type="email">zxs@amt.ac.cn</inter-ref>, <inter-ref locator="chen@eic.osaka-sandai.ac.jp" locator-type="email">chen@eic.osaka-sandai.ac.jp</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1631 http://dx.doi.org/10.1093/bioinformatics/btm156 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1640 2015-07-29 HighWire OUP bioinfo:23:13

Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data Soranzo, Nicola Bianconi, Ginestra Altafini, Claudio SYSTEMS BIOLOGY Motivation: Inferring a gene regulatory network exclusively from microarray expression profiles is a difficult but important task. The aim of this work is to compare the predictive power of some of the most popular algorithms in different conditions (like data taken at equilibrium or time courses) and on both synthetic and real microarray data. We are in particular interested in comparing similarity measures both of linear type (like correlations and partial correlations) and of non-linear type (mutual information and conditional mutual information), and in investigating the underdetermined case (fewer samples than genes). Results: In our simulations we see that all network inference algorithms perform better on data produced with ‘structural’ perturbations, like gene knockouts at steady state, than on data from any dynamical perturbation. The predictive power of all algorithms is confirmed on a reverse engineering problem from <it>Escherichia coli</it> gene profiling data: the edges of the ‘physical’ network of transcription factor–binding sites are significantly overrepresented among the highest weighting edges of the graph that we infer directly from the data without any structure supervision. Comparing synthetic and <it>in vivo</it> data on the same network graph allows us to give an indication of how much more complex a real transcriptional regulation program is with respect to an artificial model.
Availability: Software is freely available at the URL <inter-ref locator="http://people.sissa.it/~altafini/papers/SoBiAl07/" locator-type="url">http://people.sissa.it/~altafini/papers/SoBiAl07/</inter-ref> Contact: <inter-ref locator="altafini@sissa.it" locator-type="email">altafini@sissa.it</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1640 http://dx.doi.org/10.1093/bioinformatics/btm163 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1648 2015-07-29 HighWire OUP bioinfo:23:13

An efficient method for the detection and elimination of systematic error in high-throughput screening Makarenkov, Vladimir Zentilli, Pablo Kevorkov, Dmytro Gagarin, Andrei Malo, Nathalie Nadon, Robert DATA AND TEXT MINING Motivation: High-throughput screening (HTS) is an early-stage process in drug discovery which allows thousands of chemical compounds to be tested in a single study. We report a method for correcting HTS data prior to the hit selection process (i.e. selection of active compounds). The proposed correction minimizes the impact of systematic errors which may affect the hit selection in HTS. The introduced method, called a <it>well correction</it>, proceeds by correcting the distribution of measurements within wells of a given HTS assay. We use simulated and experimental data to illustrate the advantages of the new method compared to other widely-used methods of data correction and hit selection in HTS. Contact: <inter-ref locator="makarenkov.vladimir@uqam.ca" locator-type="email">makarenkov.vladimir@uqam.ca</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1648 http://dx.doi.org/10.1093/bioinformatics/btm145 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1658 2015-07-29 HighWire OUP bioinfo:23:13

A quantitative model for linking two disparate sets of articles in MEDLINE Torvik, Vetle I. Smalheiser, Neil R. DATA AND TEXT MINING Background: Identifying information that implicitly links two disparate sets of articles is a fundamental and intuitive data mining strategy that can help investigators address real scientific questions. The Arrowsmith two-node search finds title words and phrases (so-called B-terms) that are shared across two sets of articles within MEDLINE and displays them in a manner that facilitates human assessment. A serious stumbling-block has been the lack of a quantitative model for predicting which of the hundreds if not thousands of B-terms computed for a given search are most likely to be relevant to the investigator. Methodology/Principal Findings: Using a public two-node search interface, field testers devised a set of two-node searches under real life conditions and a certain number of B-terms were marked relevant. These were employed as ‘gold standards;’ each B-term was characterized according to eight complementary features that were strongly correlated with relevance. A logistic regression model was developed that permits one to estimate the probability of relevance for each B-term, to rank B-terms according to their likely relevance, and to estimate the overall number of relevant B-terms inherent in a given two-node search. Conclusions/Significance: The model greatly simplifies and streamlines the process of carrying out a two-node search, and may be applicable to a number of other literature-based discovery applications, including the so-called one-node search and related gene-centric strategies that incorporate implicit links to predict how genes may be related to each other and to human diseases. This should encourage much wider exploration of text mining for implicit information among the general scientific community. 
Availability: Two-node searches can be carried out freely at <inter-ref locator="http://arrowsmith.psych.uic.edu" locator-type="url">http://arrowsmith.psych.uic.edu</inter-ref> Contact: <inter-ref locator="neils@uic.edu" locator-type="email">neils@uic.edu</inter-ref>, <inter-ref locator="vtorvik@uic.edu" locator-type="email">vtorvik@uic.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1658 http://dx.doi.org/10.1093/bioinformatics/btm161 en Copyright (C) 2007, Oxford University Press
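
The logistic regression step described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the eight features or the fitted coefficients, so the two features, the weights and the intercept below are hypothetical placeholders. Summing the per-term probabilities to estimate the number of relevant B-terms is the standard expectation under a logistic model and is an assumption about how the authors compute that estimate.

```python
import math

def relevance_probability(features, weights, intercept):
    """Logistic model: P(relevant) = 1 / (1 + exp(-(intercept + w . x)))."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def rank_b_terms(b_terms, weights, intercept):
    """Rank B-terms by estimated relevance and estimate how many are relevant.

    b_terms maps each term to its feature vector (hypothetical features here).
    Returns (list of (probability, term) sorted best first, expected count).
    """
    scored = [
        (relevance_probability(features, weights, intercept), term)
        for term, features in b_terms.items()
    ]
    # Highest probability first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Expected number of relevant terms = sum of the individual probabilities.
    expected_relevant = sum(p for p, _ in scored)
    return scored, expected_relevant
```

Calling `rank_b_terms({"term a": [2.0, 0.0], "term b": [0.0, 2.0]}, [1.0, -1.0], 0.0)` ranks "term a" first, because its first (positively weighted) feature is larger.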

oai:open-archive.highwire.org:bioinfo:23/13/1666 2015-07-29 HighWire OUP bioinfo:23:13

Phenotypic clustering of yeast mutants based on kinetochore microtubule dynamics Jaqaman, K. Dorn, J. F. Marco, E. Sorger, P. K. Danuser, G. DATA AND TEXT MINING Motivation: Kinetochores are multiprotein complexes which mediate chromosome attachment to microtubules (MTs) of the mitotic spindle. They regulate MT dynamics during chromosome segregation. Our goal is to identify groups of kinetochore proteins with similar effects on MT dynamics, revealing pathways through which kinetochore proteins transform chemical and mechanical input signals into cues of MT regulation. Results: We have developed a hierarchical, agglomerative clustering algorithm that groups <it>Saccharomyces cerevisiae</it> strains based on MT-mediated chromosome dynamics measured by high-resolution live cell microscopy. Clustering is based on parameters of autoregressive moving average (ARMA) models of the probed dynamics. We have found that the regulation of wildtype MT dynamics varies with cell cycle and temperature, but not with the chromosome an MT is attached to. By clustering the dynamics of mutants, we discovered that the three genes <it>IPL1, DAM1</it> and <it>KIP3</it> co-regulate MT dynamics. Our study establishes the clustering of chromosome and MT dynamics by ARMA descriptors as a sensitive framework for the systematic identification of kinetochore protein subcomplexes and pathways for the regulation of MT dynamics. Availability: The clustering code, written in M<scp>atlab</scp>, can be downloaded from <inter-ref locator="http://lccb.scripps.edu" locator-type="url">http://lccb.scripps.edu</inter-ref> (‘download’ hyperlink at bottom of website). Contact: <inter-ref locator="kjaqaman@scripps.edu" locator-type="email">kjaqaman@scripps.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online.
Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1666 http://dx.doi.org/10.1093/bioinformatics/btm230 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1674 2015-07-29 HighWire OUP bioinfo:23:13

The qualitative and time-dependent character of spatial relations in biomedical ontologies Bittner, Thomas Goldberg, Louis J. DATABASES AND ONTOLOGIES Motivation: The formal representation of mereological aspects of canonical anatomy (parthood relations) is relatively well understood. The formal representation of other aspects of canonical anatomy, such as connectedness and adjacency relations between anatomical parts, their shape and size, and the spatial arrangement of anatomical parts within larger anatomical structures, is, however, much less well understood and is insufficiently represented in existing computational anatomical and bio-medical ontologies. Results: In this article, we provide a methodology for incorporating this kind of information into anatomical and bio-medical ontologies by applying techniques for representing qualitative spatial information from Artificial Intelligence. In particular, we focus on how to explicitly take into account the qualitative and time-dependent character of these relations. As a running example, we use the human temporomandibular joint (TMJ). Availability: Using the presented methodology, a formal ontology was developed which is accessible at <inter-ref locator="http://www.ifomis.org/bfo/fol" locator-type="url">http://www.ifomis.org/bfo/fol</inter-ref>. This ontology may help to improve the logical and ontological rigor of bio-medical ontologies such as the OBO relation ontology. Contact: <inter-ref locator="bittner3@buffalo.edu" locator-type="email">bittner3@buffalo.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1674 http://dx.doi.org/10.1093/bioinformatics/btm155 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1683 2015-07-29 HighWire OUP bioinfo:23:13

SciRoKo: a new tool for whole genome microsatellite search and investigation Kofler, Robert Schlötterer, Christian Lelley, Tamas GENOME ANALYSIS Summary: SciRoKo is a user-friendly software tool for the identification of microsatellites in genomic sequences. The combination of an extremely fast search algorithm with a built-in summary statistic tool makes SciRoKo an excellent tool for full genome analysis. Compared to other already existing tools, SciRoKo also allows the analysis of compound microsatellites. Availability: free for use: <inter-ref locator="www.kofler.or.at/Bioinformatics" locator-type="url">www.kofler.or.at/Bioinformatics</inter-ref> Contact: <inter-ref locator="robert.kofler@boku.ac.at" locator-type="email">robert.kofler@boku.ac.at</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1683 http://dx.doi.org/10.1093/bioinformatics/btm157 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1686 2015-07-29 HighWire OUP bioinfo:23:13

CTX-BLAST: context sensitive version of protein BLAST Gambin, Anna Wojtalewicz, Piotr SEQUENCE ANALYSIS Summary: We present a software tool, CTX-BLAST, that incorporates a contextual alignment model into the popular protein BLAST program. Our alignment tool allows us to investigate the effect of context-dependency in protein alignment much more efficiently than with previous dynamic algorithms. The software makes use of non-symmetric contextual substitution tables and calculates the statistical significance of a given alignment according to the contextual statistical model. Availability: CTX-BLAST is open source software freely available from <inter-ref locator="www.sourceforge.net/projects/CTX-BLAST" locator-type="url">www.sourceforge.net/projects/CTX-BLAST</inter-ref>. A program for statistical estimation of E-value parameters and the contextual substitution table CTX-BLOSUM62 are also provided. Contact: <inter-ref locator="aniag@mimuw.edu.pl" locator-type="email">aniag@mimuw.edu.pl</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1686 http://dx.doi.org/10.1093/bioinformatics/btm136 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1689 2015-07-29 HighWire OUP bioinfo:23:13

AutoCSA, an algorithm for high throughput DNA sequence variant detection in cancer genomes Dicks, E. Teague, J. W. Stephens, P. Raine, K. Yates, A. Mattocks, C. Tarpey, P. Butler, A. Menzies, A. Richardson, D. Jenkinson, A. Davies, H. Edkins, S. Forbes, S. Gray, K. Greenman, C. Shepherd, R. Stratton, M. R. Futreal, P. A. Wooster, R. SEQUENCE ANALYSIS The undertaking of large-scale DNA sequencing screens for somatic variants in human cancers requires accurate and rapid processing of traces for variants. Due to their often aneuploid nature and admixed normal tissue, heterozygous variants found in primary cancers are often subtle and difficult to detect. To address these issues, we have developed a mutation detection algorithm, AutoCSA, specifically optimized for the high throughput screening of cancer samples. Availability: <inter-ref locator="http://www.sanger.ac.uk/genetics/CGP/Software/AutoCSA" locator-type="url">http://www.sanger.ac.uk/genetics/CGP/Software/AutoCSA</inter-ref>. Contact: <inter-ref locator="mrs@sanger.ac.uk" locator-type="email">mrs@sanger.ac.uk</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1689 http://dx.doi.org/10.1093/bioinformatics/btm152 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1692 2015-07-29 HighWire OUP bioinfo:23:13

SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates Hayes, Ben J. Nilsen, Kjetil Berg, Paul R. Grindflek, Eli Lien, Sigbjørn SEQUENCE ANALYSIS Motivation: Single nucleotide polymorphism (SNP) detection exploiting redundancy in expressed sequence tag (EST) collections that arises from the presence of transcripts of the same gene from different individuals has been used to generate large collections of SNPs for many species. A second source of redundancy, namely that EST collections can contain multiple transcripts of the same gene from the same individual, can be exploited to distinguish true SNPs from sequencing error. In this article, we demonstrate with Atlantic salmon and pig EST collections that splitting the EST collection in two, detecting SNPs in both subsets, then accepting only cross-validated SNPs increases validation rates. Results: In the pig data set, 676 cross-validated putative SNPs were detected in a collection of 160 689 ESTs. When a subset of these was validated by genotyping on MassARRAY, 85.1% of SNPs were polymorphic in successful assays. In the salmon data set, 856 cross-validated putative SNPs were detected in a collection of 243 674 ESTs. Validation by genotyping showed that 81.0% of the cross-validated putative SNPs were polymorphic in successful assays. Availability: Cross-validated SNPs are available at dbSNP (<inter-ref locator="http://www.ncbi.nlm.nih.gov/projects/SNP/" locator-type="url">http://www.ncbi.nlm.nih.gov/projects/SNP/</inter-ref>), ss69371838-ss69372575 for the salmon SNPs and ss69372587-ss69373226 for the pig SNPs. Contact: <inter-ref locator="ben.hayes@dpi.vic.gov.au" locator-type="email">ben.hayes@dpi.vic.gov.au</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1692 http://dx.doi.org/10.1093/bioinformatics/btm154 en Copyright (C) 2007, Oxford University Press
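
The split-and-cross-validate idea in this abstract can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the EST data are reduced to hypothetical `(site, base)` observations, the random split and the toy caller `call_snps` (a site counts as a putative SNP when at least two bases each have `min_reads` supporting reads) are stand-ins, and the authors' actual splitting rule and SNP detection pipeline are not described in the abstract.

```python
import random

def split_in_two(observations, seed=0):
    """Randomly split the observations into two halves (assumed splitting rule)."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def call_snps(observations, min_reads=2):
    """Toy caller: a site is a putative SNP if two or more different bases
    are each supported by at least min_reads observations."""
    counts = {}
    for site, base in observations:
        per_site = counts.setdefault(site, {})
        per_site[base] = per_site.get(base, 0) + 1
    snps = set()
    for site, per_site in counts.items():
        supported = [b for b, n in per_site.items() if n >= min_reads]
        if len(supported) >= 2:
            snps.add(site)
    return snps

def cross_validated_snps(subset_a, subset_b):
    """Accept only the sites called independently in both subsets."""
    return call_snps(subset_a) & call_snps(subset_b)
```

A site whose minor allele is supported in only one subset is rejected, which is how the intersection screens out isolated sequencing errors in this simplified setting.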

oai:open-archive.highwire.org:bioinfo:23/13/1694 2015-07-29 HighWire OUP bioinfo:23:13

TFmodeller: comparative modelling of protein DNA complexes Contreras-Moreira, Bruno Branger, Pierre-Alain Collado-Vides, Julio STRUCTURAL BIOINFORMATICS Summary: Interactions between proteins and DNA molecules lie at the core of the fundamental cellular processes such as transcriptional regulation. Some of these interactions have been experimentally described at atomic scale, but the molecular details of many others remain to be discovered. TFmodeller exploits the current knowledge about protein–DNA interfaces contained in the Protein Data Bank and uses it to model similar interfaces related by homology. Results are emailed to the user and include an evolutionary contact matrix, a schematic representation of the putative binding interface and atomic coordinates of the modelled complex. The library of complexes used by TFmodeller is updated on a weekly basis and is available for download. Availability: TFmodeller and its web service interface are free for academic users at <inter-ref locator="http://www.ccg.unam.mx/tfmodeller" locator-type="url">http://www.ccg.unam.mx/tfmodeller</inter-ref> Contact: <inter-ref locator="contrera@ccg.unam.mx" locator-type="email">contrera@ccg.unam.mx</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1694 http://dx.doi.org/10.1093/bioinformatics/btm148 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1697 2015-07-29 HighWire OUP bioinfo:23:13

GAzer: gene set analyzer Kim, Sang-Bae Yang, Sungjin Kim, Seon-Kyu Kim, Sang Cheol Woo, Hyun Goo Volsky, David J. Kim, Seon-Young Chu, In-Sun GENE EXPRESSION Summary: Gene Set Analyzer (GAzer) is a web-based integrated gene set analysis tool covering previously reported parametric and non-parametric models. Based on a simulation test for the reported algorithms, we classified and implemented three main statistical methods consisting of the <it>z</it>-statistic, gene permutation and sample permutation for ten gene set categories including Gene Ontology (GO) for human, mouse, rat and yeast. This tool identifies significantly altered gene sets scored by <it>z</it>-statistics and <it>P</it>-values from the <it>z</it>-test or permutation test and provides <it>q</it>-values and Bonferroni <it>P</it>-values to correct for multiple hypothesis testing. GAzer allows users to observe changes in expression of each gene in a gene set or to see the significance of the gene sets containing one or more genes of interest, thus allowing interactive data analysis both at the gene and gene set level. Moreover, GAzer offers extensive annotation for each gene. Availability: The GAzer gene set analyzer is freely available at <inter-ref locator="http://integromics.kobic.re.kr/GAzer/" locator-type="url">http://integromics.kobic.re.kr/GAzer/</inter-ref> Contact: <inter-ref locator="kimsy@kribb.re.kr" locator-type="email">kimsy@kribb.re.kr</inter-ref> and <inter-ref locator="chu@kribb.re.kr" locator-type="email">chu@kribb.re.kr</inter-ref> Supplementary information: This can be found on the web page (<inter-ref locator="http://integromics.kobic.re.kr/GAzer/supplement.jsp" locator-type="url">http://integromics.kobic.re.kr/GAzer/supplement.jsp</inter-ref>) Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1697 http://dx.doi.org/10.1093/bioinformatics/btm144 en Copyright (C) 2007, Oxford University Press
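
As a rough illustration of <it>z</it>-statistic scoring with a Bonferroni correction, the Python sketch below uses one common parametric form, z = (set mean − overall mean) · √m / overall SD, with a two-sided normal <it>P</it>-value. The abstract does not state the exact formula GAzer implements, so this form, the function names and the per-gene `scores` input are assumptions made for illustration only.

```python
import math
from statistics import mean, pstdev

def gene_set_z(scores, gene_set):
    """z-statistic of a gene set against the distribution of all gene scores.

    scores maps gene -> score (e.g. a log fold change); genes of the set that
    are absent from scores are ignored.
    """
    all_scores = list(scores.values())
    mu = mean(all_scores)
    sigma = pstdev(all_scores)
    members = [scores[g] for g in gene_set if g in scores]
    return (mean(members) - mu) * math.sqrt(len(members)) / sigma

def two_sided_p(z):
    """Two-sided P-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni(p_values):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]
```

With scores `{"g1": 2.0, "g2": 2.0, "g3": -2.0, "g4": -2.0}`, the set `{"g1", "g2"}` has a positive z because its mean lies above the overall mean of zero.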

oai:open-archive.highwire.org:bioinfo:23/13/1700 2015-07-29 HighWire OUP bioinfo:23:13

CALIB: a Bioconductor package for estimating absolute expression levels from two-color microarray data Zhao, Hui Engelen, Kristof De Moor, Bart Marchal, Kathleen GENE EXPRESSION In this article we describe a new Bioconductor package ‘CALIB’ for normalization of two-color microarray data. This approach is based on the measurements of external controls and estimates an absolute target level for each gene and condition pair, as opposed to working with log-ratios as a relative measure of expression. Moreover, this method makes no assumptions regarding the distribution of gene expression divergence. Availability: <inter-ref locator="http://bioconductor.org/packages/2.0/bioc" locator-type="url">http://bioconductor.org/packages/2.0/bioc</inter-ref> Open Source Contact: <inter-ref locator="Kathleen.marchal@biw.kuleuven.be" locator-type="email">Kathleen.marchal@biw.kuleuven.be</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1700 http://dx.doi.org/10.1093/bioinformatics/btm159 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1702 2015-07-29 HighWire OUP bioinfo:23:13

WilcoxCV: an R package for fast variable selection in cross-validation Boulesteix, Anne-Laure GENE EXPRESSION Summary: In the last few years, numerous methods have been proposed for microarray-based class prediction. Although many of them have been designed especially for the case <it>n</it> ≪ <it>p</it> (many more variables than observations), preliminary variable selection is almost always necessary when the number of genes reaches several tens of thousands, as is usual in recent data sets. In the two-class setting, the Wilcoxon rank sum test statistic is, with the <it>t</it>-statistic, one of the standard approaches for variable selection. It is well known that the variable selection step must be seen as a part of classifier construction and, as such, be performed based on training data only. When classifier accuracy is evaluated via cross-validation or Monte Carlo cross-validation, it means that we have to perform <it>p</it> Wilcoxon or <it>t</it>-tests for each iteration, which becomes a daunting task for increasing <it>p</it>. As a consequence, many authors perform variable selection only once using all the available data, which can induce a dramatic underestimation of the error rate and thus a misleading report of predictive power. We propose a very fast implementation of variable selection based on the Wilcoxon test for use in cross-validation and Monte Carlo cross-validation (also known as random splitting into learning and test sets). This implementation is based on a simple mathematical formula using only the ranks calculated from the original data set.
Availability: Our method is implemented in the freely available <ty>R</ty> package <ty>WilcoxCV</ty> which can be downloaded from the Comprehensive R Archive Network at <inter-ref locator="http://cran.r-project.org/src/contrib/Descriptions/WilcoxCV.html" locator-type="url">http://cran.r-project.org/src/contrib/Descriptions/WilcoxCV.html</inter-ref> Contact: <inter-ref locator="boulesteix@slcmsr.org" locator-type="email">boulesteix@slcmsr.org</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1702 http://dx.doi.org/10.1093/bioinformatics/btm162 en Copyright (C) 2007, Oxford University Press
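
The requirement that variable selection use training data only can be illustrated with a plain Python sketch. It recomputes the Wilcoxon rank sum from scratch on each training subset, so it is the slow baseline that the abstract describes, not the package's fast rank-based formula (which the abstract does not spell out). The function names, the 0/1 class coding and the list-of-rows data layout are assumptions for illustration.

```python
def rank_sum(values, labels):
    """Wilcoxon rank sum of the class-1 observations, with midranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return sum(r for r, y in zip(ranks, labels) if y == 1)

def select_variables(X, y, train_idx, n_select):
    """Pick the n_select variables whose rank sum deviates most from its
    null expectation, using only the observations listed in train_idx."""
    y_train = [y[i] for i in train_idx]
    n1 = sum(1 for label in y_train if label == 1)
    expected = n1 * (len(train_idx) + 1) / 2  # E[W] under the null hypothesis
    scored = []
    for j in range(len(X[0])):
        column = [X[i][j] for i in train_idx]
        scored.append((abs(rank_sum(column, y_train) - expected), j))
    # Largest deviation first; ties broken by variable index for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:n_select]]
```

In a cross-validation loop, `select_variables` would be called once per iteration with that iteration's `train_idx`, so the held-out observations never influence which variables are kept.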

oai:open-archive.highwire.org:bioinfo:23/13/1705 2015-07-29 HighWire OUP bioinfo:23:13

PQuad: a visual analysis platform for proteomic data exploration of microbial organisms Webb-Robertson, Bobbie-Jo M. Peterson, Elena S. Singhal, Mudita Klicker, Kyle R. Oehmen, Christopher S. Adkins, Joshua N. Havre, Susan L. SYSTEMS BIOLOGY Summary: The visual Platform for Proteomics Peptide and Protein data exploration (PQuad) is a multi-resolution environment that visually integrates genomic and proteomic data for prokaryotic systems, overlays categorical annotation and compares differential expression experiments. PQuad requires Java 1.5 and has been tested to run across different operating systems. Availability: <inter-ref locator="http://ncrr.pnl.gov/software" locator-type="url">http://ncrr.pnl.gov/software</inter-ref> Contact: <inter-ref locator="bobbie-jo.webb-robertson@pnl.gov" locator-type="email">bobbie-jo.webb-robertson@pnl.gov</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1705 http://dx.doi.org/10.1093/bioinformatics/btm132 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1708 2015-07-29 HighWire OUP bioinfo:23:13

FASPAD: fast signaling pathway detection Hüffner, Falk Wernicke, Sebastian Zichner, Thomas SYSTEMS BIOLOGY Summary: F<scp>aspad</scp> is a user-friendly tool that detects candidates for linear signaling pathways in protein interaction networks based on an approach by Scott <it>et al</it>. (Journal of Computational Biology, <cross-ref type="bib" refid="B3">2006</cross-ref>). Using recent algorithmic insights, it can solve the underlying NP-hard problem quite fast: for protein networks of typical size (several thousand nodes), pathway candidates of length up to 13 proteins can be found within seconds and with a 99.9% probability of optimality. F<scp>aspad</scp> graphically displays all candidates that are found; for evaluation and comparison purposes, an overlay of several candidates and the surrounding network context can also be shown. Availability: F<scp>aspad</scp> is available as free software under the GPL license at <inter-ref locator="http://theinf1.informatik.uni-jena.de/faspad/" locator-type="url">http://theinf1.informatik.uni-jena.de/faspad/</inter-ref> and runs under Linux and Windows. Contact: <inter-ref locator="hueffner@minet.uni-jena.de" locator-type="email">hueffner@minet.uni-jena.de</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1708 http://dx.doi.org/10.1093/bioinformatics/btm160 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/1710 2015-07-29 HighWire OUP bioinfo:23:13

DITOP: drug-induced toxicity related protein database Zhang, Jing-Xian Huang, Wei-Juan Zeng, Jing-Hua Huang, Wen-Hui Wang, Yi Zhao, Rui Han, Bu-Cong Liu, Qing-Feng Chen, Yu-Zong Ji, Zhi-Liang DATABASES AND ONTOLOGIES Motivation: Drug-induced toxicity related proteins (DITRPs) are proteins that mediate adverse drug reactions (ADRs) or toxicities through their binding to drugs or reactive metabolites. Collection of these proteins facilitates better understanding of the molecular mechanisms of drug-induced toxicity and rational drug discovery. The drug-induced toxicity related protein database (DITOP) is intended to provide comprehensive information on DITRPs. Currently, DITOP contains 1501 records, covering 618 distinct literature-reported DITRPs, 529 drugs/ligands and 418 distinct toxicity terms. These proteins were confirmed experimentally to interact with drugs or their reactive metabolites, thereby directly or indirectly causing adverse effects or toxicities. Five major types of drug-induced toxicities or ADRs are included in DITOP: idiosyncratic adverse drug reactions, dose-dependent toxicities, drug–drug interactions, immune-mediated adverse drug effects (IMADEs) and toxicities caused by genetic susceptibility. Molecular mechanisms underlying the toxicity and cross-links to related resources are also provided where available. Moreover, a series of user-friendly interfaces were designed for flexible retrieval of DITRP-related information.
The DITOP can be accessed freely at <inter-ref locator="http://bioinf.xmu.edu.cn/databases/ADR/index.html" locator-type="url">http://bioinf.xmu.edu.cn/databases/ADR/index.html</inter-ref> Contact: <inter-ref locator="zhiliang.ji@gmail.com" locator-type="email">zhiliang.ji@gmail.com</inter-ref> or <inter-ref locator="appo@bioinf.xmu.edu.cn" locator-type="email">appo@bioinf.xmu.edu.cn</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/1710 http://dx.doi.org/10.1093/bioinformatics/btm139 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i12015-07-29HighWireOUPbioinfo:23:13

ISMB/ECCB 2007 EDITORIALS Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i1 http://dx.doi.org/10.1093/bioinformatics/btm285 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i102015-07-29HighWireOUPbioinfo:23:13

Multiway analysis of epilepsy tensors Acar, Evrim Aykut-Bingol, Canan Bingol, Haluk Bro, Rasmus Yener, Bülent ORIGINAL PAPERS Motivation: The success or failure of an epilepsy surgery depends greatly on the localization of epileptic focus (origin of a seizure). We address the problem of identification of a seizure origin through an analysis of ictal electroencephalogram (EEG), which is proven to be an effective standard in epileptic focus localization. Summary: With a goal of developing an automated and robust way of visual analysis of large amounts of EEG data, we propose a novel approach based on multiway models to study epilepsy seizure structure. Our contributions are 3-fold. First, we construct an <it>Epilepsy Tensor</it> with three modes, i.e. time samples, scales and electrodes, through wavelet analysis of multi-channel ictal EEG. Second, we demonstrate that multiway analysis techniques, in particular parallel factor analysis (PARAFAC), provide promising results in modeling the complex structure of an epilepsy seizure, localizing a seizure origin and extracting artifacts. Third, we introduce an approach for removing artifacts using multilinear subspace analysis and discuss its merits and drawbacks. Results: Ictal EEG analyses of 10 seizures from 7 patients are included in this study. Our results for 8 seizures match clinical observations in terms of seizure origin and extracted artifacts. On the other hand, for 2 of the seizures, seizure localization is not achieved using an initial trial of PARAFAC modeling. In these cases, first, we apply an artifact removal method and subsequently apply the PARAFAC model on the epilepsy tensor from which potential artifacts have been removed. This method successfully identifies the seizure origin in both cases. 
Contact: <inter-ref locator="acare@cs.rpi.edu" locator-type="email">acare@cs.rpi.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i10 http://dx.doi.org/10.1093/bioinformatics/btm210 en Copyright (C) 2007, Oxford University Press
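
Illustration (not part of the record): the three-mode PARAFAC model named in the abstract above writes each tensor entry as a sum of rank-one terms, x[i,j,k] = sum over r of A[i,r] * B[j,r] * C[k,r]. The following is a minimal sketch of fitting that model by alternating least squares with NumPy. It is not the authors' implementation; the function names `khatri_rao` and `parafac_als`, the rank, the iteration count and the toy tensor are all assumptions made for this example.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R) and (K x R) -> (J*K x R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def parafac_als(X, rank, n_iter=200, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfold the tensor along each mode (row-major ordering of the other two).
    X1 = X.reshape(I, J * K)                     # rows i, columns (j, k)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)  # rows j, columns (i, k)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)  # rows k, columns (i, j)
    for _ in range(n_iter):
        # Each update is an ordinary least-squares solve with two factors fixed.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Hypothetical toy tensor with modes (time samples, scales, electrodes).
time_f = np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.], [2., 1.]])
scale_f = np.array([[1., 2.], [0., 1.], [1., 0.], [-1., 1.]])
elec_f = np.array([[1., 0.], [0., 1.], [1., 1.]])
toy = np.einsum('ir,jr,kr->ijk', time_f, scale_f, elec_f)

A, B, C = parafac_als(toy, rank=2)
rel_err = np.linalg.norm(toy - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(toy)
print("relative reconstruction error:", rel_err)
```

In a setting like the one the abstract describes, the columns of the third factor would play the role of electrode signatures, so the electrode with the largest loading in a seizure component would be a candidate for the seizure origin; that reading is an interpretation of the abstract, not a statement of the authors' procedure.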

oai:open-archive.highwire.org:bioinfo:23/13/i1042015-07-29HighWireOUPbioinfo:23:13

Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants Danziger, Samuel A. Zeng, Jue Wang, Ying Brachmann, Rainer K. Lathrop, Richard H. ORIGINAL PAPERS Motivation: Many biomedical projects would benefit from reducing the time and expense of <it>in vitro</it> experimentation by using computer models for <it>in silico</it> predictions. These models may help determine which expensive biological data are most useful to acquire next. Active Learning techniques for choosing the most informative data enable biologists and computer scientists to optimize experimental data choices for rapid discovery of biological function. To explore design choices that affect this desirable behavior, five novel and five existing Active Learning techniques, together with three control methods, were tested on 57 previously unknown p53 cancer rescue mutants for their ability to build classifiers that predict protein function. The best of these techniques, Maximum Curiosity, improved the baseline accuracy from 56% to 77%. This article shows that Active Learning is a useful tool for biomedical research, and provides a case study of interest to others facing similar discovery challenges. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i104 http://dx.doi.org/10.1093/bioinformatics/btm166 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1152015-07-29HighWireOUPbioinfo:23:13

Structural templates predict novel protein interactions and targets from pancreas tumour gene expression data Dawelbait, Gihan Winter, Christof Zhang, Yanju Pilarsky, Christian Grützmann, Robert Heinrich, Jörg-Christian Schroeder, Michael ORIGINAL PAPERS Motivation: Pancreatic ductal adenocarcinoma (PDAC) eludes early detection and is characterized by its aggressiveness and resistance to current therapies. A number of gene expression screens have been carried out to identify genes differentially expressed in cancerous tissue. To identify molecular markers and suitable targets, these genes have been mapped to protein interactions to gain an understanding at systems level. Results: Here, we take such a network-centric approach to pancreas cancer by re-constructing networks from known interactions and by predicting novel protein interactions from structural templates. The pathways we find to be largely affected are signal transduction, actin cytoskeleton regulation, cell growth and cell communication. Our analysis indicates that the alteration of the calcium pathway plays an important role in pancreas-specific tumorigenesis. Furthermore, our structural prediction method identifies 40 novel interactions including the tissue factor pathway inhibitor 2 (TFPI2) interacting with the transmembrane protease serine 4 (TMPRSS4). Since TMPRSS4 is involved in metastasis formation, we hypothesize that the upregulation of TMPRSS4 and the downregulation of its predicted inhibitor TFPI2 play an important role in this process. Moreover, we examine the potential role of BVDU (RP101) as an inhibitor of TMPRSS4. BVDU is known to support apoptosis and prevent the acquisition of chemoresistance. Our results suggest that BVDU might bind to the active site of TMPRSS4, thus reducing its assistance in metastasis. 
Contact: <inter-ref locator="ms@biotec.tu-dresden.de" locator-type="email">ms@biotec.tu-dresden.de</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i115 http://dx.doi.org/10.1093/bioinformatics/btm188 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1252015-07-29HighWireOUPbioinfo:23:13

Kernel-based data fusion for gene prioritization De Bie, Tijl Tranchevent, Léon-Charles van Oeffelen, Liesbeth M. M. Moreau, Yves ORIGINAL PAPERS Motivation: Hunting disease genes is a problem of primary importance in biomedical research. Biologists usually approach this problem in two steps: first a set of candidate genes is identified using traditional positional cloning or high-throughput genomics techniques; second, these genes are further investigated and validated in the wet lab, one by one. To speed up discovery and limit the number of costly wet lab experiments, biologists must test the candidate genes starting with the most probable candidates. So far, biologists have relied on literature studies, extensive queries to multiple databases and hunches about expected properties of the disease gene to determine such an ordering. Recently, we have introduced the data mining tool ENDEAVOUR (Aerts <it>et al</it>., <cross-ref type="bib" refid="B1">2006</cross-ref>), which performs this task automatically by relying on different genome-wide data sources, such as Gene Ontology, literature, microarray, sequence and more. Results: In this article, we present a novel kernel method that operates in the same setting: based on a number of different views on a set of training genes, a prioritization of test genes is obtained. We furthermore provide a thorough learning theoretical analysis of the method's guaranteed performance. Finally, we apply the method to the disease data sets on which ENDEAVOUR (Aerts <it>et al</it>., <cross-ref type="bib" refid="B1">2006</cross-ref>) has been benchmarked, and report a considerable improvement in empirical performance. Availability: The MATLAB code used in the empirical results will be made publicly available. 
Contact: <inter-ref locator="tijl.debie@gmail.com" locator-type="email">tijl.debie@gmail.com</inter-ref> or <inter-ref locator="yves.moreau@esat.kuleuven.be" locator-type="email">yves.moreau@esat.kuleuven.be</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i125 http://dx.doi.org/10.1093/bioinformatics/btm187 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1332015-07-29HighWireOUPbioinfo:23:13

Co-occurrence analysis of insertional mutagenesis data reveals cooperating oncogenes de Ridder, Jeroen Kool, Jaap Uren, Anthony Bot, Jan Wessels, Lodewyk Reinders, Marcel ORIGINAL PAPERS Motivation: Cancers are caused by an accumulation of multiple independent mutations that collectively deregulate cellular pathways, such as those regulating cell division and cell death. The publicly available Retroviral Tagged Cancer Gene Database (RTCGD) contains the data of many insertional mutagenesis screens, in which the virally induced mutations result in tumor formation in mice. The insertion loci therefore indicate the location of putative cancer genes. Additionally, the presence of multiple independent insertions within one tumor hints at cooperation between the insertionally mutated genes. In this study we focus on the detection of statistically significant co-mutations. Results: We propose a two-dimensional Gaussian Kernel Convolution method (2DGKC), a computational technique that identifies the cooperating mutations in insertional mutagenesis data. We define the Common Co-occurrence of Insertions (CCI), signifying the co-mutations that are statistically significant across all different screens in the RTCGD. Significance estimates are made on multiple scales, and the results visualized in a scale space, thereby providing valuable extra information on the putative cooperation. The multidimensional analysis of the insertion data results in the discovery of 86 statistically significant co-mutations, indicating the presence of cooperating oncogenes that play a role in tumor development. Since oncogenes may cooperate with several members of a parallel pathway, we combined the co-occurrence data with gene family information to find significant cooperations between oncogenes and families of genes. We show, for instance, the interchangeable cooperation of <it>Myc</it> insertions with insertions in the <it>Pim</it> family. 
Availability: A list of the resulting CCIs is available at: <inter-ref locator="http://ict.ewi.tudelft.nl/~jeroen/CCI/CCI_list.txt" locator-type="url">http://ict.ewi.tudelft.nl/~jeroen/CCI/CCI_list.txt</inter-ref> Contact: <inter-ref locator="m.j.t.reinders@tudelft.nl" locator-type="email">m.j.t.reinders@tudelft.nl</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i133 http://dx.doi.org/10.1093/bioinformatics/btm202 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1422015-07-29HighWireOUPbioinfo:23:13

Cotranslational protein folding: fact or fiction? Deane, Charlotte M. Dong, Mingqiang Huard, Fabien P.E. Lance, Braddon K. Wood, Graham R. ORIGINAL PAPERS Motivation: Experimentalists have amassed extensive evidence over the past four decades that proteins appear to fold during production by the ribosome. Protein structure prediction methods, however, do not incorporate this property of folding. A thorough study to find the fingerprint of such sequential folding is the first step towards using it in folding algorithms, so assisting structure prediction. Results: We explore computationally the existence of evidence for cotranslational folding, based on large sets of experimentally determined structures in the PDB. Our perspective is that cotranslational folding is the norm, but that the effect is masked in most classes. We show that it is most evident in <it>α</it>/<it>β</it> proteins, confirming recent findings. We also find mild evidence that older proteins may fold cotranslationally. A tool is provided for determining, within a protein, where cotranslation is most evident. Contact: <inter-ref locator="gwood@efs.mq.edu.au" locator-type="email">gwood@efs.mq.edu.au</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i142 http://dx.doi.org/10.1093/bioinformatics/btm175 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1492015-07-29HighWireOUPbioinfo:23:13

Identification of functional modules from conserved ancestral protein–protein interactions Dutkowski, Janusz Tiuryn, Jerzy ORIGINAL PAPERS Motivation: The increasing availability of large-scale protein–protein interaction (PPI) data has fuelled the efforts to elucidate the building blocks and organization of cellular machinery. Previous studies have shown cross-species comparison to be an effective approach in uncovering functional modules in protein networks. This has in turn driven the research for new network alignment methods with a more solid grounding in network evolution models and better scalability, to allow multiple network comparison. Results: We develop a new framework for protein network alignment, based on reconstruction of an ancestral PPI network. The reconstruction algorithm is built upon a proposed model of protein network evolution, which takes into account phylogenetic history of the proteins and the evolution of their interactions. The application of our methodology to the PPI networks of yeast, worm and fly reveals that the most probable conserved ancestral interactions are often related to known protein complexes. By projecting the conserved ancestral interactions back onto the input networks we are able to identify the corresponding conserved protein modules in the considered species. In contrast to most of the previous methods, our algorithm is able to compare many networks simultaneously. The performed experiments demonstrate the ability of our method to uncover many functional modules with high specificity. Availability: Information for obtaining software and supplementary results is available at <inter-ref locator="http://bioputer.mimuw.edu.pl/papers/cappi" locator-type="url">http://bioputer.mimuw.edu.pl/papers/cappi</inter-ref>. 
Contact: <inter-ref locator="januszd@mimuw.edu.pl" locator-type="email">januszd@mimuw.edu.pl</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i149 http://dx.doi.org/10.1093/bioinformatics/btm194 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1592015-07-29HighWireOUPbioinfo:23:13

Computational prediction of host-pathogen protein–protein interactions Dyer, Matthew D. Murali, T. M. Sobral, Bruno W. ORIGINAL PAPERS Motivation: Infectious diseases such as malaria result in millions of deaths each year. An important aspect of any host-pathogen system is the mechanism by which a pathogen can infect its host. One method of infection is via protein–protein interactions (PPIs) where pathogen proteins target host proteins. Developing computational methods that identify which PPIs enable a pathogen to infect a host has great implications in identifying potential targets for therapeutics. Results: We present a method that integrates known intra-species PPIs with protein-domain profiles to predict PPIs between host and pathogen proteins. Given a set of intra-species PPIs, we identify the functional domains in each of the interacting proteins. For every pair of functional domains, we use Bayesian statistics to assess the probability that two proteins with that pair of domains will interact. We apply our method to the <it>Homo sapiens</it> – <it>Plasmodium falciparum</it> host-pathogen system. Our system predicts 516 PPIs between proteins from these two organisms. We show that pairs of human proteins we predict to interact with the same <it>Plasmodium</it> protein are close to each other in the human PPI network and that <it>Plasmodium</it> pairs predicted to interact with the same human protein are co-expressed in DNA microarray datasets measured during various stages of the <it>Plasmodium</it> life cycle. Finally, we identify functionally enriched sub-networks spanned by the predicted interactions and discuss the plausibility of our predictions. 
Availability: Supplementary data are available at <inter-ref locator="http://staff.vbi.vt.edu/dyermd/publications/dyer2007a.html" locator-type="url">http://staff.vbi.vt.edu/dyermd/publications/dyer2007a.html</inter-ref> Contact: <inter-ref locator="dyermd@vbi.vt.edu" locator-type="email">dyermd@vbi.vt.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i159 http://dx.doi.org/10.1093/bioinformatics/btm208 en Copyright (C) 2007, Oxford University Press
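
Illustration (not part of the record): the abstract above describes scoring every pair of functional domains by the probability that two proteins carrying that pair interact. One simple way to sketch this idea is a Beta-Binomial estimate built from intra-species counts. The abstract does not give the authors' formula, so the Beta(alpha, beta) prior, the function names `domain_pair_scores` and `score_cross_species`, the max-over-pairs rule and the toy proteins and domains below are all assumptions.

```python
from collections import Counter
from itertools import combinations, product

def domain_pair_scores(protein_domains, interactions, alpha=1.0, beta=1.0):
    """Posterior mean of P(interact | domain pair) under an assumed Beta prior.

    protein_domains: dict mapping protein id -> set of domain names
    interactions:    iterable of (protein, protein) pairs known to interact
    """
    interacting = {frozenset(pair) for pair in interactions}
    n_all, n_int = Counter(), Counter()
    for p, q in combinations(sorted(protein_domains), 2):
        # Unordered domain pairs carried by this protein pair.
        pairs = {tuple(sorted((d, e)))
                 for d, e in product(protein_domains[p], protein_domains[q])}
        for dp in pairs:
            n_all[dp] += 1
            if frozenset((p, q)) in interacting:
                n_int[dp] += 1
    return {dp: (n_int[dp] + alpha) / (n_all[dp] + alpha + beta) for dp in n_all}

def score_cross_species(host_domains, pathogen_domains, scores, default=0.0):
    """Score a host-pathogen protein pair by its best-supported domain pair."""
    return max((scores.get(tuple(sorted((d, e))), default)
                for d, e in product(host_domains, pathogen_domains)),
               default=default)

# Hypothetical intra-species training data.
protein_domains = {"A": {"SH3"}, "B": {"PRM"}, "C": {"SH3"},
                   "D": {"PRM"}, "E": {"KIN"}}
interactions = [("A", "B"), ("C", "D"), ("A", "D")]

scores = domain_pair_scores(protein_domains, interactions)
print(scores[("PRM", "SH3")])
print(score_cross_species({"SH3", "KIN"}, {"PRM"}, scores))
```

Taking the maximum over domain pairs is only one possible way to combine evidence; a thresholded or product-based rule would be equally easy to substitute.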

oai:open-archive.highwire.org:bioinfo:23/13/i1672015-07-29HighWireOUPbioinfo:23:13

GPDTI: A Genetic Programming Decision Tree Induction method to find epistatic effects in common complex diseases Estrada-Gil, Jesús K. Fernández-López, Juan C. Hernández-Lemus, Enrique Silva-Zolezzi, Irma Hidalgo-Miranda, Alfredo Jiménez-Sánchez, Gerardo Vallejo-Clemente, Edgar E. ORIGINAL PAPERS Motivation: The identification of risk-associated genetic variants in common diseases remains a challenge to the biomedical research community. It has been suggested that common statistical approaches that exclusively measure main effects are often unable to detect interactions between some of these variants. Detecting and interpreting interactions is a challenging open problem from the statistical and computational perspectives. Methods in computing science may improve our understanding of the mechanisms of genetic disease by detecting interactions even in the presence of very low heritabilities. Results: We have implemented a method using Genetic Programming that is able to induce a Decision Tree to detect interactions in genetic variants. This method has a cross-validation strategy for estimating classification and prediction errors and tests for consistencies in the results. To have better estimates, a new consistency measure that takes into account interactions and can be used in a genetic programming environment is proposed. This method detected five different interaction models with heritabilities as low as 0.008 and with prediction errors similar to the generated errors. Availability: Information on the generated data sets and executable code is available upon request. Contact: <inter-ref locator="jestrada@inmegen.gob.mx" locator-type="email">jestrada@inmegen.gob.mx</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i167 http://dx.doi.org/10.1093/bioinformatics/btm205 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1752015-07-29HighWireOUPbioinfo:23:13

Anisotropic fluctuations of amino acids in protein structures: insights from X-ray crystallography and elastic network models Eyal, Eran Chennubhotla, Chakra Yang, Lee-Wei Bahar, Ivet ORIGINAL PAPERS Motivation: A common practice in X-ray crystallographic structure refinement has been to model atomic displacements or thermal fluctuations as isotropic motions. Recent high-resolution data reveal, however, significant departures from isotropy, described by anisotropic displacement parameters (ADPs) modeled for individual atoms. Yet, ADPs are currently reported for only a limited set of structures. Results: We present a comparative analysis of the experimentally reported ADPs and those theoretically predicted by the anisotropic network model (ANM) for a representative set of structures. The relative sizes of fluctuations along different directions are shown to agree well between experiments and theory, while the cross-correlations between the (<it>x</it>-, <it>y</it>- and <it>z</it>-) components of the fluctuations show considerable deviations. Secondary structure elements and protein cores exhibit more robust anisotropic characteristics compared to disordered or flexible regions. The deviations between experimental and theoretical data are comparable to those between sets of experimental ADPs reported for the same protein in different crystal forms. These results draw attention to the effects of crystal form and refinement procedure on experimental ADPs and highlight the potential utility of ANM calculations for consolidating experimental data or assessing ADPs in the absence of experimental data. Availability: The ANM server at <inter-ref locator="http://www.ccbb.pitt.edu/anm" locator-type="url">http://www.ccbb.pitt.edu/anm</inter-ref> is upgraded to permit users to compute and visualize the theoretical ADPs for any PDB structure, thus providing insights into the anisotropic motions intrinsically preferred by equilibrium structures. 
Contact: <inter-ref locator="bahar@ccbb.pitt.edu" locator-type="email">bahar@ccbb.pitt.edu</inter-ref> Supplementary information: Two Supplementary Material files can be accessed at the journal website. The first presents the tabulated results from computations (Pearson correlations and KL distances with respect to experimental ADPs) reported for each of the 93 proteins in <it>Set I</it> (the averages over all proteins are presented above in <cross-ref type="tbl" refid="T3">Table 3</cross-ref>). The second file consists of three sections: (A) detailed derivation of Equation (<cross-ref type="fd" refid="M7">7</cross-ref>), (B) analysis of the effect of ANM parameters on computed ADPs and identification of parameters that achieve optimal correlation with experiments and (C) description of the method for computing the tangential and radial components of equilibrium fluctuations. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i175 http://dx.doi.org/10.1093/bioinformatics/btm186 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1852015-07-29HighWireOUPbioinfo:23:13

Dead-End Elimination with Backbone Flexibility Georgiev, Ivelin Donald, Bruce R. ORIGINAL PAPERS Motivation: Dead-End Elimination (<it>DEE</it>) is a powerful algorithm capable of reducing the search space for structure-based protein design by a combinatorial factor. By using a fixed backbone template, a rotamer library, and a potential energy function, DEE identifies and prunes rotamer choices that are provably not part of the Global Minimum Energy Conformation (<it>GMEC</it>), effectively eliminating the majority of the conformations that must be subsequently enumerated to obtain the GMEC. Since a fixed-backbone model biases the algorithm predictions against protein sequences for which even small backbone movements may result in a significantly enhanced stability, the incorporation of backbone flexibility can improve the accuracy of the design predictions. If explicit backbone flexibility is incorporated into the model, however, the traditional DEE criteria can no longer guarantee that the <it>flexible-backbone GMEC</it>, the lowest-energy conformation when the backbone is allowed to flex, will not be pruned. Results: We derive a novel DEE pruning criterion, <it>flexible-backbone DEE</it> (<it>BD</it>), that is provably accurate with backbone flexibility, guaranteeing that no rotamers belonging to the <it>flexible-backbone GMEC</it> are pruned; we also present further enhancements to BD for improved pruning efficiency. The results from applying our novel algorithms to redesign the β1 domain of protein G and to switch the substrate specificity of the NRPS enzyme GrsA-PheA are then compared against the results from previous fixed-backbone DEE algorithms. We confirm experimentally that traditional DEE is indeed not provably accurate with backbone flexibility and that BD is capable of generating conformations with significantly lower energies, thus confirming the feasibility of our novel algorithms. Availability: Contact authors for source code. 
Contact: <inter-ref locator="brd+ismb07@cs.duke.edu" locator-type="email">brd+ismb07@cs.duke.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i185 http://dx.doi.org/10.1093/bioinformatics/btm197 en Copyright (C) 2007, Oxford University Press
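
Illustration (not part of the record): for context on the "traditional DEE criteria" mentioned in the abstract above, here is a minimal sketch of one widely used fixed-backbone pruning rule, usually attributed to Goldstein. Rotamer r at position i is pruned when some alternative t satisfies E(i_r) - E(i_t) + sum over j != i of min over s of [E(i_r, j_s) - E(i_t, j_s)] > 0. This is not the flexible-backbone BD criterion the paper derives; the function names, the data layout and the toy energies are assumptions made for the example.

```python
def pair_energy(pair_E, i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at position j.

    pair_E is keyed by (i, j) with i < j and indexed as pair_E[(i, j)][r][s].
    """
    return pair_E[(i, j)][r][s] if i < j else pair_E[(j, i)][s][r]

def goldstein_dee(self_E, pair_E):
    """Repeatedly apply the Goldstein criterion; return surviving rotamers per position.

    self_E[i][r] is the self energy of rotamer r at position i (fixed backbone).
    """
    n = len(self_E)
    alive = [set(range(len(e))) for e in self_E]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                for t in alive[i] - {r}:
                    # Best case for r relative to t over every other position.
                    gap = self_E[i][r] - self_E[i][t]
                    for j in range(n):
                        if j != i:
                            gap += min(pair_energy(pair_E, i, r, j, s)
                                       - pair_energy(pair_E, i, t, j, s)
                                       for s in alive[j])
                    if gap > 0:
                        # r is always worse than t, so it cannot be in the GMEC.
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

# Hypothetical toy problem: 3 rotamers at position 0, 2 rotamers at position 1.
self_E = [[0.0, 1.0, 4.0], [0.0, 0.5]]
pair_E = {(0, 1): [[0.0, 2.0], [-3.0, 0.0], [0.0, 0.0]]}
print(goldstein_dee(self_E, pair_E))
```

The guarantee rests on the energies being computed once for a single fixed backbone, which is the assumption the abstract says breaks down when the backbone is allowed to flex.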

oai:open-archive.highwire.org:bioinfo:23/13/i192015-07-29HighWireOUPbioinfo:23:13

Efficient parameter estimation for RNA secondary structure prediction Andronescu, Mirela Condon, Anne Hoos, Holger H. Mathews, David H. Murphy, Kevin P. ORIGINAL PAPERS Motivation: Accurate prediction of RNA secondary structure from the base sequence is an unsolved computational challenge. The accuracy of predictions made by free energy minimization is limited by the quality of the energy parameters in the underlying free energy model. The most widely used model, the Turner99 model, has hundreds of parameters, and so a robust parameter estimation scheme should efficiently handle large data sets with thousands of structures. Moreover, the estimation scheme should also be trained using available experimental free energy data in addition to structural data. Results: In this work, we present constraint generation (CG), the first computational approach to RNA free energy parameter estimation that can be efficiently trained on large sets of structural as well as thermodynamic data. Our CG approach employs a novel iterative scheme, whereby the energy values are first computed as the solution to a constrained optimization problem. Then the newly computed energy parameters are used to update the constraints on the optimization function, so as to better optimize the energy parameters in the next iteration. Using our method on biologically sound data, we obtain revised parameters for the Turner99 energy model. We show that by using our new parameters, we obtain significant improvements in prediction accuracy over current state-of-the-art methods. 
Availability: Our CG implementation is available at <inter-ref locator="http://www.rnasoft.ca/CG/" locator-type="url">http://www.rnasoft.ca/CG/</inter-ref> Contact: <inter-ref locator="andrones@cs.ubc.ca" locator-type="email">andrones@cs.ubc.ca</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i19 http://dx.doi.org/10.1093/bioinformatics/btm223 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i1952015-07-29HighWireOUPbioinfo:23:13

Optimized design and assessment of whole genome tiling arrays Gräf, Stefan Nielsen, Fiona G. G. Kurtz, Stefan Huynen, Martijn A. Birney, Ewan Stunnenberg, Henk Flicek, Paul ORIGINAL PAPERS Motivation: Recent advances in microarray technologies have made it feasible to interrogate whole genomes with tiling arrays and this technique is rapidly becoming one of the most important high-throughput functional genomics assays. For large mammalian genomes, analyzing oligonucleotide tiling array data is complicated by the presence of non-unique sequences on the array, which increases the overall noise in the data and may lead to false positive results due to cross-hybridization. The ability to create custom microarrays using maskless array synthesis has led us to consider ways to optimize array design characteristics for improving data quality and analysis. We have identified a number of design parameters to be optimized including uniqueness of the probe sequences within the whole genome, melting temperature and self-hybridization potential. Results: We introduce the <it>uniqueness score, U</it>, a novel quality measure for oligonucleotide probes and present a method to quickly compute it. We show that <it>U</it> is equivalent to the number of shortest unique substrings in the probe and describe an efficient greedy algorithm to design mammalian whole genome tiling arrays using probes that maximize <it>U</it>. Using the mouse genome, we demonstrate how several optimizations influence the tiling array design characteristics. With a sensible set of parameters, our designs cover 78% of the mouse genome including many regions previously considered ‘untilable’ due to the presence of repetitive sequence. Finally, we compare our whole genome tiling array designs with commercially available designs. 
Availability: Source code is available under an open source license from <inter-ref locator="http://www.ebi.ac.uk/~graef/arraydesign/" locator-type="url">http://www.ebi.ac.uk/~graef/arraydesign/</inter-ref> Contact: <inter-ref locator="flicek@ebi.ac.uk" locator-type="email">flicek@ebi.ac.uk</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i195 http://dx.doi.org/10.1093/bioinformatics/btm200 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2052015-07-29HighWireOUPbioinfo:23:13

Using genome-context data to identify specific types of functional associations in pathway/genome databases Green, Michelle L. Karp, Peter D. ORIGINAL PAPERS Background: Hundreds of genes lacking homology to any protein of known function are sequenced every day. Genome-context methods have proved useful in providing clues about functional annotations for many proteins. However, genome-context methods detect many biological types of functional associations, and do not identify which type of functional association they have found. Results: We have developed two new genome-context-based algorithms. Algorithm 1 extends our previous algorithm for identifying missing enzymes in predicted metabolic pathways (pathway holes) to use genome-context features. The new algorithm has significantly improved scope because it can now be applied to pathway reactions to which sequence similarity methods cannot be applied due to an absence of known sequences for enzymes catalyzing the reaction in other organisms. The new method identifies at least one known enzyme in the top ten hits for 58% of EcoCyc reactions that lack enzyme sequences in other organisms. Surprisingly, the addition of genome-context features does not improve the accuracy of the algorithm when sequences for the enzyme do exist in other organisms. Algorithm 2 uses genome-context methods to predict three distinct types of functional relationships between pairs of proteins: pairs that occur in the same protein complex, the same pathway, or the same operon. This algorithm performs with varying degrees of accuracy on each type of relationship, and performs best in predicting pathway and protein complex relationships. 
Contact: <inter-ref locator="pkarp@ai.sri.com" locator-type="email">pkarp@ai.sri.com</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i205 http://dx.doi.org/10.1093/bioinformatics/btm213 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2122015-07-29HighWireOUPbioinfo:23:13

Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations Huang, Jim C. Kannan, Anitha Winn, John ORIGINAL PAPERS Motivation: With the recent availability of large-scale data sets profiling single nucleotide polymorphisms (SNPs) and quantitative traits data across different human subpopulations, there has been much attention directed towards discovering patterns of genetic variation and their connection to gene regulation and the onset/progression of disease. While previous work has focused primarily on correlating individual SNP markers with gene expression and disease, it has been suggested that using haplotype blocks instead of individual markers can significantly increase statistical power. Results: We present BlockMapper, a probabilistic generative model for genotype data and quantitative traits data, such as gene expression or phenotype measurements. BlockMapper discovers the block structure of genotype data and associates these inferred blocks to patterns of variation in quantitative traits data, whilst accounting for non-genetic factors. Our model achieves high accuracy for predicting Crohn's disease phenotype in Chromosome 5q31 and reveals novel cis-associations between two haplotype blocks in the ENm006 genomic region and GDI1, a gene implicated in X-linked mental retardation. Our results underscore the importance of accounting for the influence of large sets of SNPs on patterns of regulatory/phenotypic variation and represent a step towards an understanding of human genetic variation. Contact: <inter-ref locator="jwinn@microsoft.com" locator-type="email">jwinn@microsoft.com</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i212 http://dx.doi.org/10.1093/bioinformatics/btm217 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2222015-07-29HighWireOUPbioinfo:23:13

Systematic discovery of functional modules and context-specific functional annotation of human genome Huang, Yu Li, Haifeng Hu, Haiyan Yan, Xifeng Waterman, Michael S. Huang, Haiyan Zhou, Xianghong Jasmine ORIGINAL PAPERS Motivation: The rapid accumulation of microarray datasets provides unique opportunities to perform systematic functional characterization of the human genome. We designed a graph-based approach to integrate cross-platform microarray data, and extract recurrent expression patterns. A series of microarray datasets can be modeled as a series of co-expression networks, in which we search for frequently occurring network patterns. The integrative approach provides three major advantages over the commonly used microarray analysis methods: (1) enhanced signal-to-noise separation, (2) identification of functionally related genes without co-expression and (3) a way to predict gene functions in a context-specific manner. Results: We integrate 65 human microarray datasets, comprising 1105 experiments and over 11 million expression measurements. We develop a data mining procedure based on frequent itemset mining and biclustering to systematically discover network patterns that recur in at least five datasets. This resulted in 143 401 potential functional modules. Subsequently, we design a network topology statistic based on graph random walk that effectively captures characteristics of a gene's local functional environment. Function annotations based on this statistic are then assessed using the random forest method, combining six other attributes of the network modules. We assign 1126 functions to 895 genes, 779 known and 116 unknown, with a validation accuracy of 70%. Among our assignments, 20% of genes are assigned multiple functions based on different network environments.
Availability: <inter-ref locator="http://zhoulab.usc.edu/ContextAnnotation" locator-type="url">http://zhoulab.usc.edu/ContextAnnotation</inter-ref> Contact: <inter-ref locator="xjzhou@usc.edu" locator-type="email">xjzhou@usc.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i222 http://dx.doi.org/10.1093/bioinformatics/btm222 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2302015-07-29HighWireOUPbioinfo:23:13

Reconstruction of highly heterogeneous gene-content evolution across the three domains of life Iwasaki, Wataru Takagi, Toshihisa ORIGINAL PAPERS Motivation: Reconstruction of gene-content evolutionary history is fundamental in studying the evolution of genomes and biological systems. To reconstruct plausible evolutionary history, rates of gene gain/loss should be estimated by considering the high level of heterogeneity: e.g. genome duplication and parasitization, respectively, result in high rates of gene gain and loss. Gene-content evolution reconstruction methods that consider this heterogeneity and that are both effective in estimating the rates of gene gain and loss and sufficiently efficient to analyze abundant genomic data had not been developed. Results: An effective and efficient method for reconstructing heterogeneous gene-content evolution was developed. This method comprises analytically integrable modeling of gene-content evolution, analytical formulation of expectation-maximization and efficient calculation of marginal likelihood using an inside-outside-like algorithm. Simulation tests on the scale of hundreds of genomes showed that both the gene gain/loss rates and evolutionary history were effectively estimated within a few days of computational time. Subsequently, this algorithm was applied to an actual data set of nearly 200 genomes to reconstruct the heterogeneous gene-content evolution across the three domains of life. The reconstructed history, which contained several features consistent with biological observations, showed that the trends of gene-content evolution were not only drastically different between prokaryotes and eukaryotes, but were highly variable within each form of life. The results suggest that heterogeneity should be considered in studies of the evolution of gene content, genomes and biological systems. Availability: An R script that implements the algorithm is available upon request. 
Contact: <inter-ref locator="iwasaki@cb.k.u-tokyo.ac.jp" locator-type="email">iwasaki@cb.k.u-tokyo.ac.jp</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i230 http://dx.doi.org/10.1093/bioinformatics/btm165 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2402015-07-29HighWireOUPbioinfo:23:13

Different mechanistic requirements for prokaryotic and eukaryotic chaperonins: a lattice study Jacob, Etai Horovitz, Amnon Unger, Ron ORIGINAL PAPERS Motivation: The folding of many proteins <it>in vivo</it> and <it>in vitro</it> is assisted by molecular chaperones. A well-characterized molecular chaperone system is the chaperonin GroEL/GroES from <it>Escherichia coli</it> which has a homolog found in the eukaryotic cytosol called CCT. All chaperonins have a ring structure with a cavity in which the substrate protein folds. An interesting difference between prokaryotic and eukaryotic chaperonins is in the nature of the ATP-mediated conformational changes that their ring structures undergo during their reaction cycle. Prokaryotic chaperonins are known to exhibit a highly cooperative concerted change of their cavity surface while in eukaryotic chaperonins the change is sequential. Approximately 70% of proteins in eukaryotic cells are multi-domain whereas in prokaryotes single-domain proteins are more common. Thus, it was suggested that the different modes of action of prokaryotic and eukaryotic chaperonins can be explained by the need of eukaryotic chaperonins to facilitate folding of multi-domain proteins. Results: Using a 2D square lattice model, we generated two large populations of single-domain and double-domain substrate proteins. Chaperonins were modeled as static structures with a cavity wall with which the substrate protein interacts. We simulated both concerted and sequential changes of the cavity surfaces and demonstrated that folding of single-domain proteins benefits from concerted but not sequential changes whereas double-domain proteins benefit also from sequential changes. Thus, our results support the suggestion that the different modes of allosteric switching of prokaryotic and eukaryotic chaperonin rings have functional implications as it enables eukaryotic chaperonins to better assist multi-domain protein folding. 
Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i240 http://dx.doi.org/10.1093/bioinformatics/btm180 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2492015-07-29HighWireOUPbioinfo:23:13

A statistical method for alignment-free comparison of regulatory sequences Kantorovitz, Miriam R. Robinson, Gene E. Sinha, Saurabh ORIGINAL PAPERS Motivation: The similarity of two biological sequences has traditionally been assessed within the well-established framework of alignment. Here we focus on the task of identifying functional relationships between <it>cis</it>-regulatory sequences that are non-orthologous or greatly diverged. ‘Alignment-free’ measures of sequence similarity are required in this regime. Results: We investigate the use of a new score for alignment-free sequence comparison, called the <f><inline-fig> <link locator="btm211i1"></inline-fig></f> score. It is based on comparing the frequencies of all fixed-length words in the two sequences. An important, novel feature of the score is that it is comparable across sequence pairs drawn from arbitrary background distributions. We present a method that gives quadratic improvement in the time complexity of calculating the <f><inline-fig> <link locator="btm211i2"></inline-fig></f> score, over the naïve method. We then evaluate the score on several tissue-specific families of <it>cis</it>-regulatory modules (in <it>Drosophila</it> and human). The new score is highly successful in discriminating functionally related regulatory sequences from unrelated sequence pairs. The performance of the <f><inline-fig> <link locator="btm211i3"></inline-fig></f> score is compared to five other alignment-free similarity measures, and shown to be consistently superior to all of these measures. 
Availability: Our implementation of the <f><inline-fig> <link locator="btm211i4"></inline-fig></f> score will be made freely available as source code, upon publication of this article, at: <inter-ref locator="http://veda.cs.uiuc.edu/d2z/" locator-type="url">http://veda.cs.uiuc.edu/d2z/</inter-ref> Contact: <inter-ref locator="sinhas@cs.uiuc.edu" locator-type="email">sinhas@cs.uiuc.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i249 http://dx.doi.org/10.1093/bioinformatics/btm211 en Copyright (C) 2007, Oxford University Press
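
Illustrative note on the record above: the abstract says the score is based on comparing the frequencies of all fixed-length words in the two sequences. The minimal sketch below shows only that underlying idea, using the plain, unnormalized inner product of word-count vectors. It is not the authors' score, whose normalization across background distributions is not described in this record. The function names `kmer_counts` and `d2_statistic` and the word length `k` are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors.

    Hypothetical helper: an unnormalized word-match count, not the
    background-corrected score evaluated in the article.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Words absent from seq_b contribute zero (Counter returns 0 for them).
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2_statistic("ACGTACGT", "TTACGTAA", 3))
```

Because the raw count grows with sequence length and depends on base composition, it is not comparable across sequence pairs, which is the limitation the abstract says the new score addresses.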

oai:open-archive.highwire.org:bioinfo:23/13/i2562015-07-29HighWireOUPbioinfo:23:13

Learning to extract relations for protein annotation Kim, Jee-Hyub Mitchell, Alex Attwood, Teresa K. Hilario, Melanie ORIGINAL PAPERS Motivation: Protein annotation is a task that describes protein X in terms of topic Y. Usually, this is constructed using information from the biomedical literature. Until now, most literature-based protein annotation work has been done manually by human annotators. However, as the number of biomedical papers grows ever more rapidly, manual annotation becomes more difficult, and there is an increasing need to automate the process. Recently, information extraction (IE) has been used to address this problem. Typically, IE requires pre-defined relations and hand-crafted IE rules or annotated corpora, and these requirements are difficult to satisfy in real-world scenarios such as in the biomedical domain. In this article, we describe an IE system that requires only sentences labelled by domain experts as relevant or irrelevant to a given topic. Results: We applied our system to meet the annotation needs of a well-known protein family database; the results show that our IE system can annotate proteins with a set of extracted relations by learning relations and IE rules for disease, function and structure from only relevant and irrelevant sentences. Contact: <inter-ref locator="jee.kim@cui.unige.ch" locator-type="email">jee.kim@cui.unige.ch</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i256 http://dx.doi.org/10.1093/bioinformatics/btm168 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2642015-07-29HighWireOUPbioinfo:23:13

Identification of new drug classification terms in textual resources Kolárik, Corinna Hofmann-Apitius, Martin Zimmermann, Marc Fluck, Juliane ORIGINAL PAPERS Knowledge about biological effects of small molecules helps in the understanding of biological processes and supports the development of new therapeutic agents. DrugBank is a high-quality database that provides such information about drugs, including annotation of drug effects and classification of therapeutic effects. However, to broaden the scope of such a database in classifying and annotating drugs, systems for automatic extraction of classification terms and the corresponding annotation of drugs are needed. We have developed an approach for the identification of new terms used in unstructured text that provide information about drug properties. It is based on the identification and extraction of phrases corresponding to lexico-syntactic patterns (so-called Hearst patterns) that contain drug names and directly related drug annotation terms. Such phrases could be identified with high performance in DrugBank text (0.89 F-score) and in Medline abstracts (0.83 F-score). In comparison to DrugBank annotation terminology, a large number of new drug annotation terms could be found. The evaluation of terms extracted from Medline showed that 29–53% of them are new valid drug property terms. They could be assigned to existing and new drug property classes not provided by the DrugBank drug annotation. We conclude that our system can support database content updates by providing additional drug descriptions of pharmacological effects not yet found in databases like DrugBank. Moreover, we propose that automatic normalization of terms improves the annotation and the retrieval of relevant database entries.
Contact: <inter-ref locator="corinna.kolarik@scai.fraunhofer.de" locator-type="email">corinna.kolarik@scai.fraunhofer.de</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i264 http://dx.doi.org/10.1093/bioinformatics/btm196 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2732015-07-29HighWireOUPbioinfo:23:13

A geometric approach for the alignment of liquid chromatography mass spectrometry data Lange, Eva Gröpl, Clemens Schulz-Trieglaff, Ole Leinenbach, Andreas Huber, Christian Reinert, Knut ORIGINAL PAPERS Motivation: Liquid chromatography coupled to mass spectrometry (LC-MS) and combined with tandem mass spectrometry (LC-MS/MS) has become a prominent tool for the analysis of complex proteomic samples. An important step in a typical workflow is the combination of results from multiple LC-MS experiments to improve confidence in the obtained measurements or to compare results from different samples. To do so, a suitable mapping or <it>alignment</it> between the data sets needs to be estimated. The alignment has to correct for variations in mass and elution time which are present in all mass spectrometry experiments. Results: We propose a novel algorithm to align LC-MS samples and to match corresponding ion species across samples. Our algorithm matches landmark signals between two data sets using a geometric technique based on pose clustering. Variations in mass and retention time are corrected by an affine dewarping function estimated from matched landmarks. We use the pairwise dewarping in an algorithm for aligning multiple samples. We show that our pose clustering approach is fast and reliable compared with previous approaches. It is robust in the presence of noise and able to accurately align samples with only a few common ion species. In addition, we can easily handle different kinds of LC-MS data and adapt our algorithm to new mass spectrometry technologies.
Availability: This algorithm is implemented as part of the OpenMS software library for shotgun proteomics and available under the Lesser GNU Public License (LGPL) at <inter-ref locator="www.openms.de" locator-type="url">www.openms.de</inter-ref> Contact: <inter-ref locator="lange@inf.fu-berlin.de" locator-type="email">lange@inf.fu-berlin.de</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i273 http://dx.doi.org/10.1093/bioinformatics/btm209 en Copyright (C) 2007, Oxford University Press
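
Illustrative note on the record above: the abstract states that variations in retention time are corrected by an affine dewarping function estimated from matched landmarks. The sketch below shows one simple way such a map could be fitted once landmark pairs are already matched. It swaps in an ordinary least-squares fit; it does not reproduce the pose clustering step described in the article, and it is not OpenMS code. The names `fit_affine` and `dewarp` are hypothetical.

```python
def fit_affine(source, target):
    """Least-squares fit of target ~ scale * source + offset.

    source and target hold retention times of landmark pairs that are
    assumed to be matched already (the matching step is not shown).
    """
    n = len(source)
    if n < 2 or n != len(target):
        raise ValueError("need at least two matched landmark pairs")
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((s - mean_s) ** 2 for s in source)
    if var_s == 0:
        raise ValueError("source landmarks must not all be identical")
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    scale = cov / var_s
    offset = mean_t - scale * mean_s
    return scale, offset


def dewarp(values, scale, offset):
    """Apply the fitted affine map to a list of retention times."""
    return [scale * v + offset for v in values]


if __name__ == "__main__":
    scale, offset = fit_affine([10.0, 20.0, 30.0], [25.0, 45.0, 65.0])
    print(scale, offset, dewarp([15.0], scale, offset))
```

A least-squares fit is sensitive to wrongly matched landmarks, which is one reason a robust matching technique would be applied before any such fit.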

oai:open-archive.highwire.org:bioinfo:23/13/i2822015-07-29HighWireOUPbioinfo:23:13

Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks Lim, Wei Keat Wang, Kai Lefebvre, Celine Califano, Andrea ORIGINAL PAPERS Motivation: An increasingly common application of gene expression profile data is the reverse engineering of cellular networks. However, common procedures to normalize expression profiles generated using the Affymetrix GeneChips technology were originally developed for a rather different purpose, namely the accurate measure of differential gene expression between two or more phenotypes. As a result, current evaluation strategies lack comprehensive metrics to assess the suitability of available normalization procedures for reverse engineering and, in general, for measuring correlation between the expression profiles of a gene pair. Results: We benchmark four commonly used normalization procedures (MAS5, RMA, GCRMA and Li-Wong) in the context of established algorithms for the reverse engineering of protein–protein and protein–DNA interactions. Replicate sample, randomized and human B-cell data sets are used as input. Surprisingly, our study suggests that MAS5 provides the most faithful cellular network reconstruction. Furthermore, we identify a crucial step in GCRMA responsible for introducing severe artifacts in the data leading to a systematic overestimate of pairwise correlation. This has key implications not only for reverse engineering but also for other methods, such as hierarchical clustering, relying on accurate measurements of pairwise expression profile correlation. We propose an alternative implementation to eliminate this side effect. Contact: <inter-ref locator="califano@c2b2.columbia.edu" locator-type="email">califano@c2b2.columbia.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i282 http://dx.doi.org/10.1093/bioinformatics/btm201 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2892015-07-29HighWireOUPbioinfo:23:13

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes Lunter, Gerton ORIGINAL PAPERS Motivation: The two mutation processes that have the largest impact on genome evolution at small scales are substitutions, and sequence insertions and deletions (indels). While the former have been studied extensively, indels have received less attention, and in particular, the problem of inferring indel rates between pairs of divergent sequences remains unsolved. Here, I describe a novel and accurate method for estimating neutral indel rates between divergent pairs of genomes. Results: Simulations suggest that the new method for estimating indel rates is accurate to within 2%, at divergences corresponding to that of human and mouse. Applying the method to these species, I show that indel rates are up to twice as high as is apparent from alignments, and depend strongly on the local G + C content. These results indicate that at these evolutionary distances, the contribution of indels to sequence divergence is much larger than hitherto appreciated. In particular, the ratio of substitution to indel rates between human and mouse appears to be around γ = 8, rather than the currently accepted value of about γ = 14. Contact: <inter-ref locator="Gerton.lunter@dpag.ox.ac.uk" locator-type="email">Gerton.lunter@dpag.ox.ac.uk</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i289 http://dx.doi.org/10.1093/bioinformatics/btm185 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i292015-07-29HighWireOUPbioinfo:23:13

An ensemble framework for clustering protein–protein interaction networks Asur, Sitaram Ucar, Duygu Parthasarathy, Srinivasan ORIGINAL PAPERS Motivation: Protein–Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network. Results: In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins. Contact: <inter-ref locator="srini@cse.ohio-state.edu" locator-type="email">srini@cse.ohio-state.edu</inter-ref> Supplementary information: Supplementary data are available at <it>Bioinformatics</it> online.
Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i29 http://dx.doi.org/10.1093/bioinformatics/btm212 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i2972015-07-29HighWireOUPbioinfo:23:13

Inferring protein–DNA dependencies using motif alignments and mutual information Mahony, Shaun Auron, Philip E. Benos, Panayiotis V. ORIGINAL PAPERS Motivation: Mutual information can be used to explore covarying positions in biological sequences. In the past, it has been successfully used to infer RNA secondary structure conformations from multiple sequence alignments. In this study, we show that the same principles allow the discovery of transcription factor amino acids that are coevolving with nucleotides in their DNA-binding targets. Results: Given an alignment of transcription factor binding domains, and a separate alignment of their DNA target motifs, we demonstrate that mutually covarying base-amino acid positions may indicate possible protein–DNA contacts. Examples explored in this study include C2H2 zinc finger, homeodomain and bHLH DNA-binding motif families, where a number of known base-amino acid contacting positions are identified. Mutual information analyses may aid the prediction of base-amino acid contacting pairs for particular transcription factor families, thereby yielding structural insights from sequence information alone. Such inference of protein–DNA contacting positions may guide future experimental studies of DNA recognition. Contact: <inter-ref locator="shaun.mahony@ccbb.pitt.edu" locator-type="email">shaun.mahony@ccbb.pitt.edu</inter-ref> or <inter-ref locator="benos@pitt.edu" locator-type="email">benos@pitt.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i297 http://dx.doi.org/10.1093/bioinformatics/btm215 en Copyright (C) 2007, Oxford University Press
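
Illustrative note on the record above: the abstract relies on mutual information between an amino acid column of a protein alignment and a base column of the matching DNA motif alignment. The sketch below computes the standard plug-in estimate of mutual information, in bits, for two such columns. It assumes that row i of both columns belongs to the same transcription factor, and it omits any correction for small sample size or statistical testing that a real analysis might need. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Plug-in mutual information (bits) between two paired columns.

    col_x: amino acids at one alignment position, one per protein.
    col_y: bases at one motif position, one per matching target motif.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of rows")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Perfectly covarying toy columns versus independent toy columns.
    print(mutual_information("RRKK", "GGAA"))
    print(mutual_information("RRKK", "GAGA"))
```

High values flag column pairs that vary together, which the abstract interprets as candidate base-amino acid contacts rather than proof of contact.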

oai:open-archive.highwire.org:bioinfo:23/13/i3052015-07-29HighWireOUPbioinfo:23:13

Nested effects models for high-dimensional phenotyping screens Markowetz, Florian Kostka, Dennis Troyanskaya, Olga G. Spang, Rainer ORIGINAL PAPERS Motivation: In high-dimensional phenotyping screens, a large number of cellular features is observed after perturbing genes by knockouts or RNA interference. Comprehensive analysis of perturbation effects is one of the most powerful techniques for attributing functions to genes, but not much work has been done so far to adapt statistical and computational methodology to the specific needs of large-scale and high-dimensional phenotyping screens. Results: We introduce and compare probabilistic methods to efficiently infer a genetic hierarchy from the nested structure of observed perturbation effects. These hierarchies elucidate the structures of signaling pathways and regulatory networks. Our methods achieve two goals: (1) they reveal clusters of genes with highly similar phenotypic profiles, and (2) they order (clusters of) genes according to subset relationships between phenotypes. We evaluate our algorithms in the controlled setting of simulation studies and show their practical use in two experimental scenarios: (1) a data set investigating the response to microbial challenge in <it>Drosophila melanogaster</it>, and (2) a compendium of expression profiles of <it>Saccharomyces cerevisiae</it> knockout strains. We show that our methods identify biologically justified genetic hierarchies of perturbation effects. 
Availability: The software used in our analysis is freely available in the R package ‘nem’ from <inter-ref locator="www.bioconductor.org" locator-type="url">www.bioconductor.org</inter-ref> Contact: <inter-ref locator="ogt@cs.princeton.edu" locator-type="email">ogt@cs.princeton.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i305 http://dx.doi.org/10.1093/bioinformatics/btm178 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i3132015-07-29HighWireOUPbioinfo:23:13

Biases induced by pooling samples in microarray experiments Mary-Huard, Tristan Daudin, Jean-Jacques Baccini, Michela Biggeri, Annibale Bar-Hen, Avner ORIGINAL PAPERS Motivation: If there is insufficient RNA from the tissues under investigation from one organism, then it is common practice to pool RNA. An important question is to determine whether pooling introduces biases, which can lead to inaccurate results. In this article, we describe two biases related to pooling, from a theoretical as well as a practical point of view. Results: We model and quantify the respective parts of the pooling bias due to the log transform as well as the bias due to biological averaging of the samples. We also evaluate the impact of the bias on the statistical differential analysis of Affymetrix data. Contact: <inter-ref locator="maryhuar@inapg.fr" locator-type="email">maryhuar@inapg.fr</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i313 http://dx.doi.org/10.1093/bioinformatics/btm182 en Copyright (C) 2007, Oxford University Press
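
Illustrative note on the record above: the abstract attributes part of the pooling bias to the log transform. A minimal sketch of that effect follows, under the assumption (not stated in this record) that a pooled sample measures the arithmetic mean of the individual intensities. The log of a mean is never smaller than the mean of the logs (Jensen's inequality), so the pooled log-expression is shifted upwards relative to the average of individually measured log-expressions. The function name `log_transform_pooling_bias` is hypothetical, and the bias due to biological averaging mentioned in the abstract is not modeled.

```python
import math


def log_transform_pooling_bias(intensities):
    """Return log2(mean intensity) minus mean(log2 intensity).

    Assumes the pool behaves like the arithmetic mean of the
    individual, strictly positive intensities.
    """
    n = len(intensities)
    pooled = math.log2(sum(intensities) / n)
    averaged = sum(math.log2(x) for x in intensities) / n
    return pooled - averaged


if __name__ == "__main__":
    print(log_transform_pooling_bias([1.0, 4.0]))       # heterogeneous samples
    print(log_transform_pooling_bias([8.0, 8.0, 8.0]))  # identical samples
```

The gap vanishes when all pooled samples have the same intensity and grows with their variability.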

oai:open-archive.highwire.org:bioinfo:23/13/i3192015-07-29HighWireOUPbioinfo:23:13

Towards realistic codon models: among site variability and dependency of synonymous and non-synonymous rates Mayrose, Itay Doron-Faigenboim, Adi Bacharach, Eran Pupko, Tal ORIGINAL PAPERS Codon evolutionary models are widely used to infer the selection forces acting on a protein. The non-synonymous to synonymous rate ratio (denoted by Ka/Ks) is used to infer specific positions that are under purifying or positive selection. Current evolutionary models usually assume that only the non-synonymous rates vary among sites while the synonymous substitution rates are constant. This assumption ignores the possibility of selection forces acting at the DNA or mRNA levels. Towards a more realistic description of sequence evolution, we present a model that accounts for among-site variation of both synonymous and non-synonymous substitution rates. Furthermore, we relax the widespread assumption that positions evolve independently of each other. Thus, possible sources of bias caused by random fluctuations in either the synonymous or non-synonymous rate estimations at a single site are removed. Our model is based on two hidden Markov models that operate on the spatial dimension: one describes the dependency between adjacent non-synonymous rates while the other describes the dependency between adjacent synonymous rates. The presented model is applied to study the selection pressure across the HIV-1 genome. The new model better describes the evolution of all HIV-1 genes, as compared to current codon models. Using both simulations and real data analyses, we illustrate that accounting for synonymous rate variability and dependency greatly increases the accuracy of Ka/Ks estimation and in particular of the inference of positively selected sites. Finally, we discuss the applicability of the developed model to infer the selection forces in regulatory and overlapping regions of the HIV-1 genome.
Contact: <inter-ref locator="talp@post.tau.ac.il" locator-type="email">talp@post.tau.ac.il</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i319 http://dx.doi.org/10.1093/bioinformatics/btm176 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i3282015-07-29HighWireOUPbioinfo:23:13

Using dynamic programming to create isotopic distribution maps from mass spectra McIlwain, Sean Page, David Huttlin, Edward L. Sussman, Michael R. ORIGINAL PAPERS Motivation: This article presents a method to identify the isotopic distributions within a mass spectrum using a probabilistic classifier supplemented with dynamic programming. Such a system is needed for a variety of purposes, including generating robust and meaningful features from mass spectra to be used in classification. Results: The primary result of this article is that the dynamic programming approach significantly improves sensitivity, without harming specificity, of a probabilistic classifier for identifying the isotopic distributions. When annotating isotopic distributions where an expert has performed the initial ‘peak-picking’ (removal of noise peaks), the dynamic programming approach gives a true positive rate of 96% and a false positive rate of 0.0%, whereas the classifier alone has a true positive rate of only 47% when the false positive rate is 0.0%. When annotating isotopic distributions in machine peak-picked spectra, which may contain many noise peaks, the dynamic programming approach gives a true positive rate of only 22.0%, but it still keeps a low false positive rate of 1.0% and still outperforms the classifier alone. It is important to note that all these rates are when we require <it>exact</it> matches with the distributions in annotated spectra; in our evaluation a distribution is considered ‘entirely incorrect’ if it is missing even one peak or contains even one extraneous peak. We compared to the THRASH and AID-MS systems using a looser requirement: correctly identifying the distribution that contains the mono-isotopic mass. Under this measure, our dynamic programming approach achieves a true positive rate of 82% and a false positive rate of 1%, which again outperforms the classifier alone. 
The dynamic programming approach ends up being more conservative than THRASH and AID-MS, yielding both fewer true and false peaks, but the F-score of the dynamic programming approach is significantly better than those of THRASH and AID-MS. All results were obtained with 10-fold cross-validation of 99 sections of mass spectra with a total of 214 hand-annotated isotopic distributions. Availability: Programs are available via <inter-ref locator="http://www.cs.wisc.edu/~mcilwain/IDM" locator-type="url">http://www.cs.wisc.edu/~mcilwain/IDM</inter-ref> Contact: <inter-ref locator="mcilwain@cs.wisc.edu" locator-type="email">mcilwain@cs.wisc.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i328 http://dx.doi.org/10.1093/bioinformatics/btm198 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i3372015-07-29HighWireOUPbioinfo:23:13

A Chado case study: an ontology-based modular schema for representing genome-associated biological information Mungall, Christopher J. Emmert, David B. The FlyBase Consortium ORIGINAL PAPERS Motivation: A few years ago, FlyBase undertook to design a new database schema to store <it>Drosophila</it> data. It would fully integrate genomic sequence and annotation data with bibliographic, genetic, phenotypic and molecular data from the literature representing a distillation of the first 100 years of research on this major animal model system. In developing this new integrated schema, FlyBase also made a commitment to ensure that its design was <it>generic, extensible</it> and <it>available</it> as open source, so that it could be employed as the core schema of any model organism data repository, thereby avoiding redundant software development and potentially increasing interoperability. Our question was whether we could create a relational database schema that would be successfully reused. Results: Chado is a relational database schema now being used to manage biological knowledge for a wide variety of organisms, from human to pathogens, especially the classes of information that directly or indirectly can be associated with genome sequences or the primary RNA and protein products encoded by a genome. Biological databases that conform to this schema can interoperate with one another, and with application software from the Generic Model Organism Database (GMOD) toolkit. Chado is distinctive because its design is driven by ontologies. The use of ontologies (or controlled vocabularies) is ubiquitous across the schema, as they are used as a means of <it>typing</it> entities. The Chado schema is partitioned into integrated subschemas (modules), each encapsulating a different biological domain, and each described using representations in appropriate ontologies. To illustrate this methodology, we describe here the Chado modules used for describing genomic sequences. 
Availability: GMOD is a collaboration of several model organism database groups, including FlyBase, to develop a set of open-source software for managing model organism data. The Chado schema is freely distributed under the terms of the Artistic License (<inter-ref locator="http://www.opensource.org/licenses/artistic-license.php" locator-type="url">http://www.opensource.org/licenses/artistic-license.php</inter-ref>) from GMOD (<inter-ref locator="www.gmod.org" locator-type="url">www.gmod.org</inter-ref>). Contact: <inter-ref locator="cjm@fruitfly.org" locator-type="email">cjm@fruitfly.org</inter-ref> or <inter-ref locator="emmert@morgan.harvard.edu" locator-type="email">emmert@morgan.harvard.edu</inter-ref>. Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i337 http://dx.doi.org/10.1093/bioinformatics/btm189 en Copyright (C) 2007, Oxford University Press

oai:open-archive.highwire.org:bioinfo:23/13/i3472015-07-29HighWireOUPbioinfo:23:13

Prediction of DNA-binding residues from sequence Ofran, Yanay Mysore, Venkatesh Rost, Burkhard ORIGINAL PAPERS Motivation: Thousands of proteins are known to bind to DNA; for most of them the mechanism of action and the residues that bind to DNA, i.e. the binding sites, are yet unknown. Experimental identification of binding sites requires expensive and laborious methods such as mutagenesis and binding assays. Hence, such studies are not applicable on a large scale. If the 3D structure of a protein is known, it is often possible to predict DNA-binding sites <it>in silico</it>. However, for most proteins, such knowledge is not available. Results: It has been shown that DNA-binding residues have distinct biophysical characteristics. Here we demonstrate that these characteristics are so distinct that they enable accurate prediction of the residues that bind DNA directly from amino acid sequence, without requiring any additional experimental or structural information. In a cross-validation based on the largest non-redundant dataset of high-resolution protein–DNA complexes available today, we found that 89% of our predictions are confirmed by experimental data. Thus, it is now possible to identify DNA-binding sites on a proteomic scale even in the absence of any experimental data or 3D-structural information. Availability: <inter-ref locator="http://cubic.bioc.columbia.edu/services/disis" locator-type="url">http://cubic.bioc.columbia.edu/services/disis</inter-ref> Contact: <inter-ref locator="yo135@columbia.edu" locator-type="email">yo135@columbia.edu</inter-ref> Oxford University Press 2007-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/23/13/i347 http://dx.doi.org/10.1093/bioinformatics/btm174 en Copyright (C) 2007, Oxford University Press

1714599033810!0001-01-01!9999-12-31!bioinfo:23!100!oai_dc