2024-05-01T22:04:37Z http://open-archive.highwire.org/handler

oai:open-archive.highwire.org:bioinfo:27/12/1595 2015-07-29 HighWire OUP bioinfo:27:12

PathScan: a tool for discerning mutational significance in groups of putative cancer genes Wendl, Michael C. Wallis, John W. Lin, Ling Kandoth, Cyriac Mardis, Elaine R. Wilson, Richard K. Ding, Li GENOME ANALYSIS Motivation: The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate. Methods: Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher–Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each. Results: As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. 
Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies. Availability: PathScan is implemented in Perl and is available from the Genome Institute at: <inter-ref locator="http://genome.wustl.edu/software/pathscan" locator-type="url">http://genome.wustl.edu/software/pathscan</inter-ref>. Contact: <inter-ref locator="mwendl@wustl.edu" locator-type="email">mwendl@wustl.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr193/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1595 http://dx.doi.org/10.1093/bioinformatics/btr193 en Copyright (C) 2011, Oxford University Press
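
The abstract above combines per-sample probabilities into one multiple-sample value using Fisher–Lancaster theory. As a minimal sketch of the underlying idea only, the following Python code implements the classical Fisher combination for independent, continuous p-values. It is not the PathScan implementation (which is in Perl) and it omits Lancaster's adjustment for discrete probabilities; the function name `fisher_combined_pvalue` is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the classical Fisher method.

    Assumption: p-values are continuous and independent. The Lancaster
    correction for discrete tests mentioned in the abstract is NOT applied.
    Returns (chi-square statistic, combined p-value).
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    # Fisher's statistic: X = -2 * sum(ln p_i), chi-square with 2k d.o.f.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)

    # For an even number of degrees of freedom (2k) the chi-square upper
    # tail has a closed form: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, math.exp(-half) * total


if __name__ == "__main__":
    # Hypothetical per-sample p-values, for illustration only.
    stat, combined = fisher_combined_pvalue([0.04, 0.20, 0.01, 0.35])
    print(f"X = {stat:.3f}, combined p = {combined:.5f}")
```

With a single input the combined value equals that input, which is a quick sanity check on the closed-form tail sum.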

oai:open-archive.highwire.org:bioinfo:27/12/1603 2015-07-29 HighWire OUP bioinfo:27:12

Improved similarity scores for comparing motifs Tanaka, Emi Bailey, Timothy Grant, Charles E. Noble, William Stafford Keich, Uri GENOME ANALYSIS Motivation: A question that often comes up after applying a motif finder to a set of co-regulated DNA sequences is whether the reported putative motif is similar to any known motif. While several tools have been designed for this task, Habib <it>et al.</it> pointed out that the scores that are commonly used for measuring similarity between motifs do not distinguish between a good alignment of two informative columns (say, all-A) and one of two uninformative columns. This observation explains why tools such as T<scp>omtom</scp> occasionally return an alignment of uninformative columns which is clearly spurious. To address this problem, Habib <it>et al.</it> suggested a new score [Bayesian Likelihood 2-Component (BLiC)] which uses a Bayesian information criterion to penalize matches that are also similar to the background distribution. Results: We show that the BLiC score exhibits other, highly undesirable properties, and we offer instead a general approach to adjust any motif similarity score so as to reduce the number of reported spurious alignments of uninformative columns. We implement our method in T<scp>omtom</scp> and show that, without significantly compromising T<scp>omtom</scp>'s retrieval accuracy or its runtime, we can drastically reduce the number of uninformative alignments. Availability and Implementation: The modified T<scp>omtom</scp> is available as part of the MEME Suite at <inter-ref locator="http://meme.nbcr.net" locator-type="url">http://meme.nbcr.net</inter-ref>. 
Contact: <inter-ref locator="uri@maths.usyd.edu.au" locator-type="email">uri@maths.usyd.edu.au</inter-ref>; <inter-ref locator="e.tanaka@maths.usyd.edu.au" locator-type="email">e.tanaka@maths.usyd.edu.au</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr257/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1603 http://dx.doi.org/10.1093/bioinformatics/btr257 en Copyright (C) 2011, Oxford University Press
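
The distinction drawn above between an informative column (say, all-A) and an uninformative one can be made concrete with a small calculation. The sketch below measures a column's relative entropy, in bits, against a background distribution. It assumes a uniform background by default, uses a hypothetical function name, and is not the BLiC score or the adjustment implemented in Tomtom.

```python
import math


def column_information(freqs, background=None):
    """Relative entropy (in bits) of one motif column versus a background.

    `freqs` are the column's base frequencies in a fixed order (e.g. A, C,
    G, T). Assumption: a uniform background unless one is supplied.
    """
    if background is None:
        background = [1.0 / len(freqs)] * len(freqs)
    if len(freqs) != len(background):
        raise ValueError("column and background must have the same length")
    if abs(sum(freqs) - 1.0) > 1e-9 or abs(sum(background) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return sum(p * math.log2(p / q) for p, q in zip(freqs, background) if p > 0)


if __name__ == "__main__":
    all_a = [1.0, 0.0, 0.0, 0.0]        # fully informative column
    flat = [0.25, 0.25, 0.25, 0.25]     # matches the background exactly
    print(column_information(all_a), column_information(flat))
```

Two flat columns can align perfectly under a plain similarity score even though neither carries information, which is the kind of spurious match the abstract describes.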

oai:open-archive.highwire.org:bioinfo:27/12/1610 2015-07-29 HighWire OUP bioinfo:27:12

Genome annotation test with validation on transcription start site and ChIP-Seq for Pol-II binding data Bedo, Justin Kowalczyk, Adam GENOME ANALYSIS Motivation: Many ChIP-Seq experiments are aimed at developing gold standards for determining the locations of various genomic features such as transcription start or transcription factor binding sites on the whole genome. Many such pioneering experiments lack rigorous testing methods and adequate ‘gold standard’ annotations to compare against as <it>they themselves</it> are the most reliable source of empirical data available. To overcome this problem, we propose a self-consistency test whereby a dataset is tested against itself. It relies on a supervised machine learning style protocol for <it>in silico</it> annotation of a genome and accuracy estimation to guarantee, at least, self-consistency. Results: The main results use a novel performance metric (a calibrated precision) in order to assess and compare the robustness of the proposed supervised learning method across different test sets. As a proof of principle, we applied the whole protocol to two recent ChIP-Seq ENCODE datasets of STAT1 and Pol-II binding sites. STAT1 is benchmarked against <it>in silico</it> detection of binding sites using available position weight matrices. Pol-II, the main focus of this paper, is benchmarked against 17 algorithms for the closely related and well-studied problem of <it>in silico</it> transcription start site (TSS) prediction. Our results also demonstrate the feasibility of <it>in silico</it> genome annotation extension with encouraging results from a small portion of annotated genome to the remainder. Availability: Available from <inter-ref locator="http://www.genomics.csse.unimelb.edu.au/gat" locator-type="url">http://www.genomics.csse.unimelb.edu.au/gat</inter-ref>. 
Contact: <inter-ref locator="justin.bedo@nicta.com.au" locator-type="email">justin.bedo@nicta.com.au</inter-ref>; <inter-ref locator="adam.kowalczyk@nicta.com.au" locator-type="email">adam.kowalczyk@nicta.com.au</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr263/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1610 http://dx.doi.org/10.1093/bioinformatics/btr263 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1618 2015-07-29 HighWire OUP bioinfo:27:12

Mixture models for analysis of the taxonomic composition of metagenomes Meinicke, Peter Aßhauer, Kathrin Petra Lingner, Thomas SEQUENCE ANALYSIS Motivation: Inferring the taxonomic profile of a microbial community from a large collection of anonymous DNA sequencing reads is a challenging task in metagenomics. Because existing methods for taxonomic profiling of metagenomes are all based on the assignment of fragmentary sequences to phylogenetic categories, the accuracy of results largely depends on fragment length. This dependence complicates comparative analysis of data originating from different sequencing platforms or resulting from different preprocessing pipelines. Results: We here introduce a new method for taxonomic profiling based on mixture modeling of the overall oligonucleotide distribution of a sample. Our results indicate that the mixture-based profiles compare well with taxonomic profiles obtained with other methods. However, in contrast to the existing methods, our approach shows a nearly constant profiling accuracy across all read lengths and it operates at an unrivaled speed. Availability: A platform-independent implementation of the mixture modeling approach is available as a MATLAB/Octave toolbox at <inter-ref locator="http://gobics.de/peter/taxy" locator-type="url">http://gobics.de/peter/taxy</inter-ref>. In addition, a prototypical implementation within an easy-to-use interactive tool for Windows can be downloaded. Contact: <inter-ref locator="pmeinic@gwdg.de" locator-type="email">pmeinic@gwdg.de</inter-ref>; <inter-ref locator="thomas@gobics.de" locator-type="email">thomas@gobics.de</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr266/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1618 http://dx.doi.org/10.1093/bioinformatics/btr266 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1625 2015-07-29 HighWire OUP bioinfo:27:12

MAISTAS: a tool for automatic structural evaluation of alternative splicing products Floris, Matteo Raimondo, Domenico Leoni, Guido Orsini, Massimiliano Marcatili, Paolo Tramontano, Anna STRUCTURAL BIOINFORMATICS Motivation: Analysis of the human genome revealed that the amount of transcribed sequence is an order of magnitude greater than the number of predicted and well-characterized genes. A sizeable fraction of these transcripts is related to alternatively spliced forms of known protein coding genes. Inspection of the alternatively spliced transcripts identified in the pilot phase of the ENCODE project has clearly shown that often their structure might substantially differ from that of other isoforms of the same gene, and therefore that they might perform unrelated functions, or that they might even not correspond to a functional protein. Identifying these cases is obviously relevant for the functional assignment of gene products and for the interpretation of the effect of variations in the corresponding proteins. Results: Here we describe a publicly available tool that, given a gene or a protein, retrieves and analyses all its annotated isoforms, provides users with three-dimensional models of the isoform(s) of his/her interest whenever possible and automatically assesses whether homology derived structural models correspond to plausible structures. This information is clearly relevant. When the homology model of some isoforms of a gene does not seem structurally plausible, the implications are that either they assume a structure unrelated to that of the other isoforms of the same gene with presumably significant functional differences, or do not correspond to functional products. We provide indications that the second hypothesis is likely to be true for a substantial fraction of the cases. Availability: <inter-ref locator="http://maistas.bioinformatica.crs4.it/" locator-type="url">http://maistas.bioinformatica.crs4.it/</inter-ref>. 
Contact: <inter-ref locator="anna.tramontano@uniromal.it" locator-type="email">anna.tramontano@uniromal.it</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1625 http://dx.doi.org/10.1093/bioinformatics/btr198 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1630 2015-07-29 HighWire OUP bioinfo:27:12

A reference dataset for the analyses of membrane protein secondary structures and transmembrane residues using circular dichroism spectroscopy Abdul-Gader, Ali Miles, Andrew John Wallace, B. A. STRUCTURAL BIOINFORMATICS Motivation: Empirical analyses of protein secondary structures based on circular dichroism (CD) and synchrotron radiation circular dichroism (SRCD) spectroscopic data rely on the availability of reference datasets comprised of spectra of relevant proteins, whose crystal structures have been determined. Datasets comprised of only soluble proteins have not proven suitable for analysing the spectra of membrane proteins. Results: A new reference dataset, MP180, has been created containing the spectra of 30 membrane proteins encompassing the secondary structure and fold space covered by all known membrane protein structures. In addition, a mixed soluble and membrane protein dataset, SMP180, has been created, which includes 98 soluble protein spectra (SP) plus the MP180 spectra. Calculations of both membrane and soluble protein secondary structures using SMP180 are significantly improved with respect to those produced using soluble protein-only datasets. The SMP180 dataset also enables determination of the percentage of transmembrane residues, thus enhancing the information previously obtainable from CD spectroscopy. Availability and Implementation: Reference dataset online at the DichroWeb analysis server (<inter-ref locator="http://dichroweb.cryst.bbk.ac.uk" locator-type="url">http://dichroweb.cryst.bbk.ac.uk</inter-ref>); individual protein spectra in the Protein Circular Dichroism Data Bank (<inter-ref locator="http://pcddb.cryst.bbk.ac.uk" locator-type="url">http://pcddb.cryst.bbk.ac.uk</inter-ref>). 
Contact: <inter-ref locator="b.wallace@mail.cryst.bbk.ac.uk" locator-type="email">b.wallace@mail.cryst.bbk.ac.uk</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr234/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1630 http://dx.doi.org/10.1093/bioinformatics/btr234 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1637 2015-07-29 HighWire OUP bioinfo:27:12

Identification and quantification of metabolites in 1H NMR spectra by Bayesian model selection Zheng, Cheng Zhang, Shucha Ragg, Susanne Raftery, Daniel Vitek, Olga GENE EXPRESSION Motivation: Nuclear magnetic resonance (NMR) spectroscopy is widely used for high-throughput characterization of metabolites in complex biological mixtures. However, accurate interpretation of the spectra in terms of identities and abundances of metabolites can be challenging, in particular in crowded regions with heavy peak overlap. Although a number of computational approaches for this task have recently been proposed, they are not entirely satisfactory in either accuracy or extent of automation. Results: We introduce a probabilistic approach, Bayesian Quantification (<it>BQuant</it>), for fully automated database-based identification and quantification of metabolites in local regions of 1H NMR spectra. The approach represents the spectra as mixtures of reference profiles from a database, and infers the identities and the abundances of metabolites by Bayesian model selection. We show using a simulated dataset, a spike-in experiment and a metabolomic investigation of plasma samples that <it>BQuant</it> outperforms the available automated alternatives in accuracy for both identification and quantification. Availability: The R package <it>BQuant</it> is available at: <inter-ref locator="http://www.stat.purdue.edu/~ovitek/BQuant-Web/" locator-type="url">http://www.stat.purdue.edu/~ovitek/BQuant-Web/</inter-ref>. Contact: <inter-ref locator="ovitek@stat.purdue.edu" locator-type="email">ovitek@stat.purdue.edu</inter-ref>; <inter-ref locator="zhengc@purdue.edu" locator-type="email">zhengc@purdue.edu</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr118/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1637 http://dx.doi.org/10.1093/bioinformatics/btr118 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1645 2015-07-29 HighWire OUP bioinfo:27:12

Exploiting prior knowledge and gene distances in the analysis of tumor expression profiles with extended Hidden Markov Models Seifert, Michael Strickert, Marc Schliep, Alexander Grosse, Ivo GENE EXPRESSION Motivation: Changes in gene expression levels play a central role in tumors. Additional information about the distribution of gene expression levels and distances between adjacent genes on chromosomes should be integrated into the analysis of tumor expression profiles. Results: We use a Hidden Markov Model with distance-scaled transition matrices (DSHMM) to incorporate chromosomal distances of adjacent genes on chromosomes into the identification of differentially expressed genes in breast cancer. We train the DSHMM by integrating prior knowledge about potential distributions of expression levels of differentially expressed and unchanged genes in tumors. We find that especially the combination of these data and, to a lesser extent, the modeling of distances between adjacent genes contribute to a substantial improvement in the identification of differentially expressed genes in comparison to other existing methods. This performance benefit is also supported by the identification of genes well known to be associated with breast cancer. This suggests applications of DSHMMs for screening of other tumor expression profiles. Availability: The DSHMM is available as part of the open-source Java library Jstacs (<inter-ref locator="www.jstacs.de/index.php/DSHMM" locator-type="url">www.jstacs.de/index.php/DSHMM</inter-ref>). Contact: <inter-ref locator="seifert@ipk-gatersleben.de" locator-type="email">seifert@ipk-gatersleben.de</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr199/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Supplementary data files are available at the Jstacs web site. 
Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1645 http://dx.doi.org/10.1093/bioinformatics/btr199 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1653 2015-07-29 HighWire OUP bioinfo:27:12

DREME: motif discovery in transcription factor ChIP-seq data Bailey, Timothy L. GENE EXPRESSION Motivation: Transcription factor (TF) ChIP-seq datasets have particular characteristics that provide unique challenges and opportunities for motif discovery. Most existing motif discovery algorithms do not scale well to such large datasets, or fail to report many motifs associated with cofactors of the ChIP-ed TF. Results: We present DREME, a motif discovery algorithm specifically designed to find the short, core DNA-binding motifs of eukaryotic TFs, and optimized to analyze very large ChIP-seq datasets in minutes. Using DREME, we discover the binding motifs of the ChIP-ed TF and many cofactors in mouse ES cell (mESC), mouse erythrocyte and human cell line ChIP-seq datasets. For example, in mESC ChIP-seq data for the TF Esrrb, we discover the binding motifs for eight cofactor TFs important in the maintenance of pluripotency. Several other commonly used algorithms find at most two cofactor motifs in this same dataset. DREME can also perform <it>discriminative</it> motif discovery, and we use this feature to provide evidence that Sox2 and Oct4 do not bind in mES cells as an obligate heterodimer. DREME is much faster than many commonly used algorithms, scales linearly in dataset size, finds multiple, non-redundant motifs and reports a reliable measure of statistical significance for each motif found. DREME is available as part of the MEME Suite of motif-based sequence analysis tools (<inter-ref locator="http://meme.nbcr.net" locator-type="url">http://meme.nbcr.net</inter-ref>). Contact: <inter-ref locator="t.bailey@uq.edu.au" locator-type="email">t.bailey@uq.edu.au</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr261/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1653 http://dx.doi.org/10.1093/bioinformatics/btr261 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1660 2015-07-29 HighWire OUP bioinfo:27:12

An optimal peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using mixture similarity measure Kim, Seongho Fang, Aiqin Wang, Bing Jeong, Jaesik Zhang, Xiang SYSTEMS BIOLOGY Motivation: Comprehensive two-dimensional gas chromatography mass spectrometry (GC × GC–MS) brings much increased separation capacity, chemical selectivity and sensitivity for metabolomics and provides more accurate information about metabolite retention times and mass spectra. However, there is always a shift of retention times in the two columns that makes it difficult to compare metabolic profiles obtained from multiple samples exposed to different experimental conditions. Results: The existing peak alignment algorithms for GC × GC–MS data use the peak distance and the spectral similarity sequentially and require a predefined distance-based window and/or a predefined spectral similarity-based window. To overcome the limitations of the current alignment methods, we developed an optimal peak alignment using a novel mixture similarity that employs the peak distance and the spectral similarity measures simultaneously without any variation windows. In addition, we examined the effect of four different distance measures (Euclidean, Maximum, Manhattan and Canberra distances) on the peak alignment. The performance of our proposed peak alignment algorithm was compared with the existing alignment methods on two sets of GC × GC–MS data. Our analysis showed that Canberra distance performed better than the other distances and that the proposed mixture similarity peak alignment algorithm outperformed all methods reported in the literature. Availability: The data and software mSPA are available at <inter-ref locator="http://stage.louisville.edu/faculty/x0zhan17/software/software-development" locator-type="url">http://stage.louisville.edu/faculty/x0zhan17/software/software-development</inter-ref>. 
Contact: <inter-ref locator="s0kim023@louisville.edu" locator-type="email">s0kim023@louisville.edu</inter-ref>; <inter-ref locator="xiang.zhang@louisville.edu" locator-type="email">xiang.zhang@louisville.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr188/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1660 http://dx.doi.org/10.1093/bioinformatics/btr188 en Copyright (C) 2011, Oxford University Press
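
The four distance measures named in the abstract above have standard textbook definitions, sketched here in Python for two peaks given as retention-time pairs. This is an illustration only and not the mSPA code: the function names and the example coordinates are hypothetical, and treating a 0/0 Canberra term as zero is an assumption of this sketch.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def maximum(x, y):
    """Largest absolute coordinate difference (Chebyshev distance)."""
    return max(abs(a - b) for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); a 0/0 term is counted as 0 here."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


if __name__ == "__main__":
    # Hypothetical (first-column, second-column) retention times in seconds.
    peak_a = (1200.0, 2.10)
    peak_b = (1205.0, 2.35)
    for name, fn in [("Euclidean", euclidean), ("Maximum", maximum),
                     ("Manhattan", manhattan), ("Canberra", canberra)]:
        print(f"{name:10s} {fn(peak_a, peak_b):.4f}")
```

Canberra distance scales each coordinate difference by the coordinate magnitudes, so the short second-column retention time is not swamped by the much larger first-column values, unlike in the other three measures.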

oai:open-archive.highwire.org:bioinfo:27/12/1667 2015-07-29 HighWire OUP bioinfo:27:12

Pathway analysis of high-throughput biological data within a Bayesian network framework Isci, Senol Ozturk, Cengizhan Jones, Jon Otu, Hasan H. SYSTEMS BIOLOGY Motivation: Most current approaches to high-throughput biological data (HTBD) analysis perform either individual gene/protein analysis or gene/protein set enrichment analysis for a list of biologically relevant molecules. Bayesian Networks (BNs) capture linear and non-linear interactions, handle stochastic events accounting for noise, and focus on local interactions, which can be related to causal inference. Here, we describe for the first time an algorithm that models biological pathways as BNs and identifies pathways that best explain given HTBD by scoring fitness of each network. Results: The proposed method takes into account the connectivity and relatedness between nodes of the pathway by factoring pathway topology into its model. Our simulations using synthetic data demonstrated the robustness of our approach. We tested the proposed method, Bayesian Pathway Analysis (BPA), on human microarray data regarding renal cell carcinoma (RCC) and compared our results with gene set enrichment analysis. BPA was able to find broader and more specific pathways related to RCC. Availability: The accompanying BPA software (BPAS) package is freely available for academic use at <inter-ref locator="http://bumil.boun.edu.tr/bpa" locator-type="url">http://bumil.boun.edu.tr/bpa</inter-ref>. Contact: <inter-ref locator="hotu@bidmc.harvard.edu" locator-type="email">hotu@bidmc.harvard.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr269/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1667 http://dx.doi.org/10.1093/bioinformatics/btr269 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1675 2015-07-29 HighWire OUP bioinfo:27:12

Multiple-rule bias in the comparison of classification rules Yousefi, Mohammadmahdi R. Hua, Jianping Dougherty, Edward R. DATA AND TEXT MINING Motivation: There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule. Results: This article provides a careful probabilistic analysis of the second issue and the ‘multiple-rule bias’, resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators. Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at <inter-ref locator="http://gsp.tamu.edu/Publications/supplementary/yousefi11a/" locator-type="url">http://gsp.tamu.edu/Publications/supplementary/yousefi11a/</inter-ref>. Supplementary simulation results are also included. 
Contact: <inter-ref locator="edward@ece.tamu.edu" locator-type="email">edward@ece.tamu.edu</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr262/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1675 http://dx.doi.org/10.1093/bioinformatics/btr262 en Copyright (C) 2011, Oxford University Press
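
The multiple-rule bias described above can be seen in a toy simulation: when several rules are compared on one dataset and the one with minimum estimated error is reported, that minimum underestimates the true error on average. The Python sketch below is a deliberately simplified illustration and not the authors' C/MATLAB code. It assumes every rule has the same true error and that the rules' error estimates are independent, which real rules sharing a dataset would not satisfy; all names and default values are hypothetical.

```python
import random


def min_estimated_error_bias(num_rules=10, sample_size=50, true_error=0.5,
                             trials=2000, seed=0):
    """Average of (minimum estimated error - true error) over many trials.

    Toy assumptions: all rules share one true error rate, and each rule's
    estimate is an independent binomial proportion on `sample_size` points.
    A negative return value means the reported minimum is optimistic.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = []
        for _ in range(num_rules):
            # Count misclassified points for this rule on this dataset.
            mistakes = sum(rng.random() < true_error for _ in range(sample_size))
            estimates.append(mistakes / sample_size)
        # Reporting only the best-looking rule is the source of the bias.
        total += min(estimates)
    return total / trials - true_error


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(min_estimated_error_bias(num_rules=k), 4))
```

With a single rule the average bias is close to zero; it grows more negative as more rules compete, even though no rule is truly better than any other.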

oai:open-archive.highwire.org:bioinfo:27/12/1684 2015-07-29 HighWire OUP bioinfo:27:12

PLIO: an ontology for formal description of protein-ligand interactions Ivchenko, Olga Younesi, Erfan Shahid, Mohammad Wolf, Antje Müller, Bernd Hofmann-Apitius, Martin DATABASES AND ONTOLOGIES Motivation: Biomedical ontologies have proved to be valuable tools for data analysis and data interoperability. Protein–ligand interactions are key players in drug discovery and development; however, existing public ontologies that describe the knowledge space of biomolecular interactions do not cover all aspects relevant to pharmaceutical modelling and simulation. Results: The protein–ligand interaction ontology (PLIO) was developed around three main concepts, namely target, ligand and interaction, and was enriched by adding synonyms, useful annotations and references. The quality of the ontology was assessed based on structural, functional and usability features. Validation of the lexicalized ontology by means of natural language processing (NLP)-based methods showed a satisfactory performance (<it>F</it>-score = 81%). Through integration into our information retrieval environment we can demonstrate that PLIO supports lexical search in PubMed abstracts. The usefulness of PLIO is demonstrated by two use-case scenarios and it is shown that PLIO is able to capture both confirmatory and new knowledge from simulation and empirical studies. Availability: The PLIO ontology is made freely available to the public at <inter-ref locator="http://www.scai.fraunhofer.de/bioinformatics/downloads.html" locator-type="url">http://www.scai.fraunhofer.de/bioinformatics/downloads.html</inter-ref>. Contact: <inter-ref locator="martin.hofmann-apitius@scai.fraunhofer.de" locator-type="email">martin.hofmann-apitius@scai.fraunhofer.de</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr256/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1684 http://dx.doi.org/10.1093/bioinformatics/btr256 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1691 2015-07-29 HighWire OUP bioinfo:27:12

BamTools: a C++ API and toolkit for analyzing and managing BAM files Barnett, Derek W. Garrison, Erik K. Quinlan, Aaron R. Strömberg, Michael P. Marth, Gabor T. GENOME ANALYSIS Motivation: Analysis of genomic sequencing data requires efficient, easy-to-use access to alignment results and flexible data management tools (e.g. filtering, merging, sorting, etc.). However, the enormous amount of data produced by current sequencing technologies is typically stored in compressed, binary formats that are not easily handled by the text-based parsers commonly used in bioinformatics research. Results: We introduce a software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. Availability: BamTools was written in C++, and is supported on Linux, Mac OSX and MS Windows. Source code and documentation are freely available at <inter-ref locator="http://github.org/pezmaster31/bamtools" locator-type="url">http://github.org/pezmaster31/bamtools</inter-ref>. Contact: <inter-ref locator="barnetde@bc.edu" locator-type="email">barnetde@bc.edu</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1691 http://dx.doi.org/10.1093/bioinformatics/btr174 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1693 2015-07-29 HighWire OUP bioinfo:27:12

SMETHILLIUM: spatial normalization METHod for ILLumina InfinIUM HumanMethylation BeadChip Sabbah, Camille Mazo, Gildas Paccard, Caroline Reyal, Fabien Hupé, Philippe GENOME ANALYSIS Summary: DNA methylation is a major epigenetic modification in human cells. Illumina HumanMethylation27 BeadChip makes it possible to quantify the methylation state of 27 578 loci spanning 14 495 genes. We developed a non-parametric normalization method to correct the spatial background noise in order to improve the signal-to-noise ratio. The prediction performance of the proposed method was assessed on three fully methylated samples and three fully unmethylated DNA samples. We demonstrate that the spatial normalization outperforms BeadStudio to predict the methylation state of a given locus. Availability and Implementation: An R script and the data are available at the following address: <inter-ref locator="http://bioinfo.curie.fr/projects/smethillium" locator-type="url">http://bioinfo.curie.fr/projects/smethillium</inter-ref>. Contact: <inter-ref locator="smethillium@curie.fr" locator-type="email">smethillium@curie.fr</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr187/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1693 http://dx.doi.org/10.1093/bioinformatics/btr187 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1696 2015-07-29 HighWire OUP bioinfo:27:12

MEME-ChIP: motif analysis of large DNA datasets Machanick, Philip Bailey, Timothy L. GENOME ANALYSIS Motivation: Advances in high-throughput sequencing have resulted in rapid growth in large, high-quality datasets including those arising from transcription factor (TF) ChIP-seq experiments. While there are many existing tools for discovering TF binding site motifs in such datasets, most web-based tools cannot directly process such large datasets. Results: The MEME-ChIP web service is designed to analyze ChIP-seq ‘peak regions’—short genomic regions surrounding declared ChIP-seq ‘peaks’. Given a set of genomic regions, it performs (i) <it>ab initio</it> motif discovery, (ii) motif enrichment analysis, (iii) motif visualization, (iv) binding affinity analysis and (v) motif identification. It runs two complementary motif discovery algorithms on the input data—MEME and DREME—and uses the motifs they discover in subsequent visualization, binding affinity and identification steps. MEME-ChIP also performs motif enrichment analysis using the AME algorithm, which can detect very low levels of enrichment of binding sites for TFs with known DNA-binding motifs. Importantly, unlike with the MEME web service, there is no restriction on the size or number of uploaded sequences, allowing very large ChIP-seq datasets to be analyzed. The analyses performed by MEME-ChIP provide the user with a varied view of the binding and regulatory activity of the ChIP-ed TF, as well as the possible involvement of other DNA-binding TFs. Availability: MEME-ChIP is available as part of the MEME Suite at <inter-ref locator="http://meme.nbcr.net" locator-type="url">http://meme.nbcr.net</inter-ref>. 
Contact: <inter-ref locator="t.bailey@uq.edu.au" locator-type="email">t.bailey@uq.edu.au</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr189/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1696 http://dx.doi.org/10.1093/bioinformatics/btr189 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1698 2015-07-29 HighWire OUP bioinfo:27:12

Clonality: an R package for testing clonal relatedness of two tumors from the same patient based on their genomic profiles Ostrovnaya, Irina Seshan, Venkatraman E. Olshen, Adam B. Begg, Colin B. GENOME ANALYSIS Summary: If a cancer patient develops multiple tumors, it is sometimes impossible to determine whether these tumors are independent or clonal based solely on pathological characteristics. Investigators have studied how to address this diagnostic challenge by comparing the presence of loss of heterozygosity (LOH) at selected genetic locations of tumor samples, or by comparing genomewide copy number array profiles. We have previously developed statistical methodology to compare such genomic profiles for evidence of clonality. We assembled the software for these tests in a new R package called ‘<it>Clonality</it>’. For LOH profiles, the package contains significance tests. The analysis of copy number profiles includes a likelihood ratio statistic and reference distribution, as well as an option to produce various plots that summarize the results. Availability: Bioconductor (<inter-ref locator="http://bioconductor.org/packages/release/bioc/html/Clonality.html" locator-type="url">http://bioconductor.org/packages/release/bioc/html/Clonality.html</inter-ref>) and <inter-ref locator="http://www.mskcc.org/mskcc/html/13287.cfm" locator-type="url">http://www.mskcc.org/mskcc/html/13287.cfm</inter-ref>. Contact: <inter-ref locator="ostrovni@mskcc.org" locator-type="email">ostrovni@mskcc.org</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1698 http://dx.doi.org/10.1093/bioinformatics/btr267 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1700 2015-07-29 HighWire OUP bioinfo:27:12

OTUbase: an R infrastructure package for operational taxonomic unit data Beck, Daniel Settles, Matt Foster, James A. SEQUENCE ANALYSIS Summary: OTUbase is an R package designed to facilitate the analysis of operational taxonomic unit (OTU) data and sequence classification (taxonomic) data. Currently there are programs that will cluster sequence data into OTUs and/or classify sequence data into known taxonomies. However, there is a need for software that can take the summarized output of these programs and organize it into easily accessed and manipulated formats. OTUbase provides this structure and organization within R, to allow researchers to easily manipulate the data with the rich library of R packages currently available for additional analysis. Availability: OTUbase is an R package available through Bioconductor. It can be found at <inter-ref locator="http://www.bioconductor.org/packages/release/bioc/html/OTUbase.html" locator-type="url">http://www.bioconductor.org/packages/release/bioc/html/OTUbase.html</inter-ref>. Contact: <inter-ref locator="msettles@uidaho.edu" locator-type="email">msettles@uidaho.edu</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1700 http://dx.doi.org/10.1093/bioinformatics/btr196 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1702 2015-07-29 HighWire OUP bioinfo:27:12

KalignP: Improved multiple sequence alignments using position specific gap penalties in Kalign2 Shu, Nanjiang Elofsson, Arne SEQUENCE ANALYSIS Summary: Kalign2 is one of the fastest and most accurate methods for multiple alignments. However, in contrast to other methods Kalign2 does not allow externally supplied position specific gap penalties. Here, we present a modification to Kalign2, KalignP, so that it accepts such penalties. Further, we show that KalignP using position specific gap penalties obtained from predicted secondary structures makes steady improvement over Kalign2 when tested on Balibase 3.0 as well as on a dataset derived from Pfam-A seed alignments. Availability and Implementation: KalignP is freely available at <inter-ref locator="http://kalignp.cbr.su.se" locator-type="url">http://kalignp.cbr.su.se</inter-ref>. The source code of KalignP is available under the GNU General Public License, Version 2 or later from the same website. Contact: <inter-ref locator="arne@bioinfo.se" locator-type="email">arne@bioinfo.se</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr235/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1702 http://dx.doi.org/10.1093/bioinformatics/btr235 en Copyright (C) 2011, Oxford University Press
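The idea of position-specific gap penalties can be illustrated with a much simpler setting than the one the abstract describes. The following is a minimal sketch, assuming a pairwise global alignment with linear (not affine) gap costs and a match/mismatch score of +1/-1; it is not KalignP's or Kalign2's algorithm, and the names `align_score`, `gap_in_a` and `gap_in_b` are hypothetical. Here `gap_in_a[i]` is the cost of inserting a gap into sequence `a` after its first `i` residues, so each sequence needs one penalty per insertion point (length + 1).

```python
def align_score(a, b, gap_in_a, gap_in_b, match=1, mismatch=-1):
    """Best global alignment score with position-specific linear gap costs.

    gap_in_a[i]: cost of a gap placed in `a` after its first i residues.
    gap_in_b[j]: cost of a gap placed in `b` after its first j residues.
    (Illustrative scoring scheme; all parameters are assumptions.)
    """
    if len(gap_in_a) != len(a) + 1 or len(gap_in_b) != len(b) + 1:
        raise ValueError("need one gap penalty per insertion point")

    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]

    # Leading gaps: residues of one sequence placed before the other starts.
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] - gap_in_b[0]
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] - gap_in_a[0]

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,       # align a[i-1] with b[j-1]
                dp[i - 1][j] - gap_in_b[j],   # a[i-1] faces a gap in b
                dp[i][j - 1] - gap_in_a[i],   # b[j-1] faces a gap in a
            )
    return dp[-1][-1]


uniform = align_score("ACGT", "AGT", [1] * 5, [1] * 4)
# Make a gap right after the first residue of b expensive, e.g. because
# that position is predicted to lie inside a secondary structure element.
protected = align_score("ACGT", "AGT", [1] * 5, [1, 5, 1, 1])
print(uniform, protected)
```

With uniform penalties the best alignment places the gap after the first residue of `b`; raising the penalty at that position forces the gap elsewhere and lowers the achievable score.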

oai:open-archive.highwire.org:bioinfo:27/12/1704 2015-07-29 HighWire OUP bioinfo:27:12

FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes Niu, Beifang Zhu, Zhengwei Fu, Limin Wu, Sitao Li, Weizhong SEQUENCE ANALYSIS Summary: Fragment recruitment, a process of aligning sequencing reads to reference genomes, is a crucial step in metagenomic data analysis. The available sequence alignment programs are either slow or insufficient for recruiting metagenomic reads. We implemented an efficient algorithm, FR-HIT, for fragment recruitment. We applied FR-HIT and several other tools including BLASTN, MegaBLAST, BLAT, LAST, SSAHA2, SOAP2, BWA and BWA-SW to recruit four metagenomic datasets from different types of sequencers. On average, FR-HIT and BLASTN recruited significantly more reads than other programs, while FR-HIT is about two orders of magnitude faster than BLASTN. FR-HIT is slower than the fastest programs (SOAP2, BWA and BWA-SW), but it recruited 1–5 times more reads. Availability: <inter-ref locator="http://weizhongli-lab.org/frhit" locator-type="url">http://weizhongli-lab.org/frhit</inter-ref>. Contact: <inter-ref locator="liwz@sdsc.edu" locator-type="email">liwz@sdsc.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr252/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1704 http://dx.doi.org/10.1093/bioinformatics/btr252 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1706 2015-07-29 HighWire OUP bioinfo:27:12

Boulder ALignment Editor (ALE): a web-based RNA alignment tool Stombaugh, Jesse Widmann, Jeremy McDonald, Daniel Knight, Rob SEQUENCE ANALYSIS Summary: The explosion of interest in non-coding RNAs, together with improvements in RNA X-ray crystallography, has led to a rapid increase in RNA structures at atomic resolution from 847 in 2005 to 1900 in 2010. The success of whole-genome sequencing has led to an explosive growth of unaligned homologous sequences. Consequently, there is a compelling and urgent need for user-friendly tools for producing structure-informed RNA alignments. Most alignment software considers the primary sequence alone; some specialized alignment software can also include Watson–Crick base pairs, but none adequately addresses the needs introduced by the rapid influx of both sequence and structural data. Therefore, we have developed the Boulder ALignment Editor (ALE), which is a web-based RNA alignment editor, designed for editing and assessing alignments using structural information. Some features of BoulderALE include the annotation and evaluation of an alignment based on isostericity of Watson–Crick and non-Watson–Crick base pairs, along with the collapsing (horizontally and vertically) of the alignment, while maintaining the ability to edit the alignment. Availability: <inter-ref locator="http://www.microbio.me/boulderale" locator-type="url">http://www.microbio.me/boulderale</inter-ref>. Contact: <inter-ref locator="jesse.stombaugh@colorado.edu" locator-type="email">jesse.stombaugh@colorado.edu</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1706 http://dx.doi.org/10.1093/bioinformatics/btr258 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1708 2015-07-29 HighWire OUP bioinfo:27:12

FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq Li, Yang Chien, Jeremy Smith, David I. Ma, Jian SEQUENCE ANALYSIS Motivation: Fusion transcripts can be created as a result of genome rearrangement in cancer. Some of them play important roles in carcinogenesis, and can serve as diagnostic and therapeutic targets. With more and more cancer genomes being sequenced by next-generation sequencing technologies, we believe an efficient tool for reliably identifying fusion transcripts will be desirable for many groups. Results: We designed and implemented an open-source software tool, called FusionHunter, which reliably identifies fusion transcripts from transcriptional analysis of paired-end RNA-seq. We show that FusionHunter can accurately detect fusions that were previously confirmed by RT-PCR in a publicly available dataset. The purpose of FusionHunter is to identify potential fusions with high sensitivity and specificity and to guide further functional validation in the laboratory. Availability: <inter-ref locator="http://bioen-compbio.bioen.illinois.edu/FusionHunter/" locator-type="url">http://bioen-compbio.bioen.illinois.edu/FusionHunter/</inter-ref>. Contact: <inter-ref locator="jianma@illinois.edu" locator-type="email">jianma@illinois.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr265/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1708 http://dx.doi.org/10.1093/bioinformatics/btr265 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1711 2015-07-29 HighWire OUP bioinfo:27:12

A graphical interface for the FoldX forcefield Van Durme, Joost Delgado, Javier Stricher, Francois Serrano, Luis Schymkowitz, Joost Rousseau, Frederic STRUCTURAL BIOINFORMATICS Summary: A graphical user interface for the FoldX protein design program has been developed as a plugin for the YASARA molecular graphics suite. The most prominent FoldX commands such as free energy difference upon mutagenesis and interaction energy calculations can now be run entirely via a windowed menu system and the results are immediately shown on screen. Availability and Implementation: The plugin is written in Python and is freely available for download at <inter-ref locator="http://foldxyasara.switchlab.org/" locator-type="url">http://foldxyasara.switchlab.org/</inter-ref> and supported on Linux, MacOSX and MS Windows. Contact: <inter-ref locator="frederic.rousseau@switch.vib-vub.be" locator-type="email">frederic.rousseau@switch.vib-vub.be</inter-ref>; <inter-ref locator="joost.schymkowitz@switch.vib-vub.be" locator-type="email">joost.schymkowitz@switch.vib-vub.be</inter-ref>; <inter-ref locator="luis.serrano@crg.es" locator-type="email">luis.serrano@crg.es</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1711 http://dx.doi.org/10.1093/bioinformatics/btr254 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1713 2015-07-29 HighWire OUP bioinfo:27:12

Grid computing for improving conformational sampling in NMR structure calculation Mareuil, Fabien Blanchet, Christophe Malliavin, Thérèse E. Nilges, Michael STRUCTURAL BIOINFORMATICS Motivation: Methods for automatic nuclear magnetic resonance (NMR) structure determination need to face a high level of ambiguity encountered in NMR spectra recorded by solid-state NMR and by solution NMR of partially unfolded proteins, leading to time-consuming calculations. The software package Ambiguous Restraints for Iterative Assignment (ARIA) allows for straightforward parallelization of the calculation, as the conformers can be generated in parallel on many nodes. Results: Due to its architecture, the adaptation of ARIA to grid computing can be easily achieved by using the middleware glite and JDL (Job Description Language) scripts. This adaptation makes it possible to address highly ambiguous datasets, because of the much larger conformational sampling that can be generated by use of the grid computational power. Availability: The version 2.3.1 of ARIA implemented on the grid is freely available from the ARIA web site: <inter-ref locator="http://aria.pasteur.fr/downloads" locator-type="url">aria.pasteur.fr/downloads</inter-ref>. Contact: <inter-ref locator="nilges@pasteur.fr" locator-type="email">nilges@pasteur.fr</inter-ref>; <inter-ref locator="tere@pasteur.fr" locator-type="email">tere@pasteur.fr</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1713 http://dx.doi.org/10.1093/bioinformatics/btr255 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1715 2015-07-29 HighWire OUP bioinfo:27:12

APOLLO: a quality assessment service for single and multiple protein models Wang, Zheng Eickholt, Jesse Cheng, Jianlin STRUCTURAL BIOINFORMATICS Summary: We built a web server named APOLLO, which can evaluate the absolute global and local qualities of a single protein model using machine learning methods or the global and local qualities of a pool of models using a pair-wise comparison approach. Based on our evaluations on 107 CASP9 (Critical Assessment of Techniques for Protein Structure Prediction) targets, the predicted quality scores generated from our machine learning and pair-wise methods have an average per-target correlation of 0.671 and 0.917, respectively, with the true model quality scores. Based on our test on 92 CASP9 targets, our predicted absolute local qualities have an average difference of 2.60 Å with the actual distances to native structure. Availability: <inter-ref locator="http://sysbio.rnet.missouri.edu/apollo/" locator-type="url">http://sysbio.rnet.missouri.edu/apollo/</inter-ref>. Single and pair-wise global quality assessment software is also available at the site. Contact: <inter-ref locator="chengji@missouri.edu" locator-type="email">chengji@missouri.edu</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1715 http://dx.doi.org/10.1093/bioinformatics/btr268 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1717 2015-07-29 HighWire OUP bioinfo:27:12

Mcheza: a workbench to detect selection using dominant markers Antao, Tiago Beaumont, Mark A. GENETICS AND POPULATION ANALYSIS Motivation: Dominant markers (DArTs and AFLPs) are commonly used for genetic analysis in the fields of evolutionary genetics, ecology and conservation of genetic resources. The recent prominence of these markers has coincided with renewed interest in detecting the effects of local selection and adaptation at the level of the genome. Results: We present Mcheza, an application for detecting loci under selection based on a well-evaluated F<inf>ST</inf>-outlier method. The application allows robust estimates to be made of model parameters (e.g. genome-wide average, neutral F<inf>ST</inf>), provides data import and export functions, iterative contour smoothing and generation of graphics in an easy to use graphical user interface with a computation engine that supports multicore processors for enhanced performance. Mcheza also provides functionality to mitigate common analytical errors when scanning for loci under selection. Availability: Mcheza is freely available under GPL version 3 from <inter-ref locator="http://popgen.eu/soft/mcheza" locator-type="url">http://popgen.eu/soft/mcheza</inter-ref>. Contact: <inter-ref locator="tra@popgen.eu" locator-type="email">tra@popgen.eu</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1717 http://dx.doi.org/10.1093/bioinformatics/btr253 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1719 2015-07-29 HighWire OUP bioinfo:27:12

Services for prediction of drug susceptibility for HIV proteases and reverse transcriptases at the HIV drug research centre Spjuth, Ola Eklund, Martin Lapins, Maris Junaid, Muhammad Wikberg, Jarl E. S. SYSTEMS BIOLOGY Summary: The HIV Drug Research Centre (HIVDRC) has established Web services for prediction of drug susceptibility for HIV proteases and reverse transcriptases. The services are based on two proteochemometric models that accept a protease or reverse transcriptase sequence in amino acid form and output the predicted drug susceptibility values. The predictions are based on a comprehensive analysis where all the relevant inhibitors are included, resulting in models with excellent predictive capabilities. Availability and Implementation: The services are implemented as interoperable Web services (REST and XMPP), with supporting web pages to allow for individual analyses. A set of plugins was also developed, which makes the services available from the Bioclipse workbench for life science. Services are available at <inter-ref locator="http://www.hivdrc.org/services" locator-type="url">http://www.hivdrc.org/services</inter-ref>. Contact: <inter-ref locator="ola.spjuth@farmbio.uu.se" locator-type="email">ola.spjuth@farmbio.uu.se</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1719 http://dx.doi.org/10.1093/bioinformatics/btr192 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1721 2015-07-29 HighWire OUP bioinfo:27:12

RuleBender: a visual interface for rule-based modeling Xu, Wen Smith, Adam M. Faeder, James R. Marai, G. Elisabeta SYSTEMS BIOLOGY Summary: Rule-based modeling (RBM) is a powerful and increasingly popular approach to modeling intracellular biochemistry. Current interfaces for RBM are predominantly text-based and command-line driven. Better visual tools are needed to make RBM accessible to a broad range of users, to make specification of models less error prone and to improve workflows. We present R<scp>ULE</scp>B<scp>ENDER</scp>, an open-source visual interface that facilitates interactive debugging, simulation and analysis of RBMs. Availability: R<scp>ULE</scp>B<scp>ENDER</scp> is freely available for Mac, Windows and Linux at <inter-ref locator="http://rulebender.org" locator-type="url">http://rulebender.org</inter-ref>. Contact: <inter-ref locator="faeder@pitt.edu" locator-type="email">faeder@pitt.edu</inter-ref>; <inter-ref locator="marai@cs.pitt.edu" locator-type="email">marai@cs.pitt.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr197/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1721 http://dx.doi.org/10.1093/bioinformatics/btr197 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1723 2015-07-29 HighWire OUP bioinfo:27:12

Figure summarizer browser extensions for PubMed Central Agarwal, Shashank Yu, Hong DATA AND TEXT MINING Summary: Figures in biomedical articles present visual evidence for research facts and help readers understand the article better. However, when figures are taken out of context, it is difficult to understand their content. We developed a summarization algorithm to summarize the content of figures and used it in our figure search engine (<inter-ref locator="http://figuresearch.askhermes.org/" locator-type="url">http://figuresearch.askhermes.org/</inter-ref>). In this article, we report on the development of web browser extensions for Mozilla Firefox, Google Chrome and Apple Safari to display summaries for figures in PubMed Central and NCBI Images. Availability: The extensions can be downloaded from <inter-ref locator="http://figuresearch.askhermes.org/articlesearch/extensions.php" locator-type="url">http://figuresearch.askhermes.org/articlesearch/extensions.php</inter-ref>. Contact: <inter-ref locator="agarwal@uwm.edu" locator-type="email">agarwal@uwm.edu</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1723 http://dx.doi.org/10.1093/bioinformatics/btr194 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1725 2015-07-29 HighWire OUP bioinfo:27:12

Cobweb: a Java applet for network exploration and visualisation von Eichborn, Joachim Bourne, Philip E. Preissner, Robert DATA AND TEXT MINING Summary: Cobweb is a Java applet for real-time network visualization; its strength lies in enabling the interactive exploration of networks. Therefore, it allows new nodes to be interactively added to a network by querying a database on a server. The network constantly rearranges to provide the most meaningful topological view. Availability: Cobweb is available under the GPLv3 and may be freely downloaded at <inter-ref locator="http://bioinformatics.charite.de/cobweb" locator-type="url">http://bioinformatics.charite.de/cobweb</inter-ref>. Contact: <inter-ref locator="joachim.eichborn@charite.de" locator-type="email">joachim.eichborn@charite.de</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr195/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1725 http://dx.doi.org/10.1093/bioinformatics/btr195 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1727 2015-07-29 HighWire OUP bioinfo:27:12

PONDEROSA, an automated 3D-NOESY peak picking program, enables automated protein structure determination Lee, Woonghee Kim, Jin Hae Westler, William M. Markley, John L. DATA AND TEXT MINING Summary: PONDEROSA (Peak-picking Of Noe Data Enabled by Restriction of Shift Assignments) accepts input information consisting of a protein sequence, backbone and sidechain NMR resonance assignments, and 3D-NOESY (13C-edited and/or 15N-edited) spectra, and returns assignments of NOESY crosspeaks, distance and angle constraints, and a reliable NMR structure represented by a family of conformers. PONDEROSA incorporates and integrates external software packages (TALOS+, STRIDE and CYANA) to carry out different steps in the structure determination. PONDEROSA implements internal functions that identify and validate NOESY peak assignments and assess the quality of the calculated three-dimensional structure of the protein. The robustness of the analysis results from PONDEROSA's hierarchical processing steps that involve iterative interaction among the internal and external modules. PONDEROSA supports a variety of input formats: SPARKY assignment table (.shifts) and spectrum file formats (.ucsf), XEASY proton file format (.prot), and NMR-STAR format (.star). To demonstrate the utility of PONDEROSA, we used the package to determine 3D structures of two proteins: human ubiquitin and <it>Escherichia coli</it> iron-sulfur scaffold protein variant IscU(D39A). The automatically generated structural constraints and ensembles of conformers were as good as or better than those determined previously by much less automated means. Availability: The program, in the form of binary code along with tutorials and reference manuals, is available at <inter-ref locator="http://ponderosa.nmrfam.wisc.edu/" locator-type="url">http://ponderosa.nmrfam.wisc.edu/</inter-ref>. 
Contact: <inter-ref locator="whlee@nmrfam.wisc.edu" locator-type="email">whlee@nmrfam.wisc.edu</inter-ref>; <inter-ref locator="markley@nmrfam.wisc.edu" locator-type="email">markley@nmrfam.wisc.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr200/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1727 http://dx.doi.org/10.1093/bioinformatics/btr200 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/1729 2015-07-29 HighWire OUP bioinfo:27:12

Rgtsp: a generalized top scoring pairs package for class prediction Popovici, Vlad Budinská, Eva Delorenzi, Mauro DATA AND TEXT MINING Summary: A top scoring pair (TSP) classifier consists of a pair of variables whose relative ordering can be used for accurately predicting the class label of a sample. This classification rule has the advantage of being easily interpretable and more robust against technical variations in data, such as those due to different microarray platforms. Here we describe a parallel implementation of this classifier, which significantly reduces the training time, and a number of extensions, including a multi-class approach, which has the potential to improve classification performance. Availability and Implementation: Full <ty>C++</ty> source code and <ty>R</ty> package <ty>Rgtsp</ty> are freely available from <inter-ref locator="http://lausanne.isb-sib.ch/~vpopovic/research/" locator-type="url">http://lausanne.isb-sib.ch/~vpopovic/research/</inter-ref>. The implementation relies on existing OpenMP libraries. Contact: <inter-ref locator="vlad.popovici@isb-sib.ch" locator-type="email">vlad.popovici@isb-sib.ch</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1729 http://dx.doi.org/10.1093/bioinformatics/btr233 en Copyright (C) 2011, Oxford University Press
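The top scoring pair rule described in the abstract can be sketched in a few lines. This is a minimal serial illustration for two classes, assuming the usual score (the absolute difference, between classes, of how often feature i is smaller than feature j); it is not the Rgtsp implementation, which the abstract describes as parallel C++ with further extensions, and the function names `train_tsp` and `predict_tsp` are hypothetical.

```python
from itertools import combinations


def train_tsp(samples, labels):
    """Pick the feature pair (i, j) whose ordering best separates two classes.

    Returns (i, j, class_if_less, class_otherwise): a sample with
    sample[i] < sample[j] is assigned class_if_less.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")

    # Group the samples by class once, up front.
    by_class = {c: [s for s, y in zip(samples, labels) if y == c] for c in classes}

    best_score, best_model = -1.0, None
    for i, j in combinations(range(len(samples[0])), 2):
        # Fraction of samples in each class where feature i < feature j.
        freq = [
            sum(row[i] < row[j] for row in by_class[c]) / len(by_class[c])
            for c in classes
        ]
        score = abs(freq[0] - freq[1])
        if score > best_score:  # ties keep the first pair found
            if freq[0] > freq[1]:
                best_model = (i, j, classes[0], classes[1])
            else:
                best_model = (i, j, classes[1], classes[0])
            best_score = score
    return best_model


def predict_tsp(model, sample):
    """Classify one sample from the relative order of the chosen pair."""
    i, j, class_if_less, class_otherwise = model
    return class_if_less if sample[i] < sample[j] else class_otherwise


# Toy data (made up for illustration): 4 samples, 3 features.
samples = [[1, 5, 3], [2, 6, 1], [5, 1, 2], [7, 2, 9]]
labels = ["A", "A", "B", "B"]
model = train_tsp(samples, labels)
print(model, predict_tsp(model, [0.1, 0.9, 100]))
```

Because the prediction depends only on the order of two values within a sample, it is unchanged by any monotone rescaling of that sample, which is one way to see the robustness to platform effects that the abstract mentions.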

oai:open-archive.highwire.org:bioinfo:27/12/1731 2015-07-29 HighWire OUP bioinfo:27:12

ReMark: an automatic program for clustering orthologs flexibly combining a Recursive and a Markov clustering algorithms Kim, Kangseok Kim, Wonil Kim, Sunshin DATA AND TEXT MINING Summary: ReMark is a fully automatic tool for clustering orthologs that combines a recursive and a Markov clustering (MCL) algorithm. In the first step, ReMark detects and recursively clusters ortholog pairs through reciprocal BLAST best hits between multiple genomes, using the program RecursiveClustering.java. It then applies the MCL algorithm to the score matrices generated in the previous step to compute the clusters, and refines them by adjusting an inflation factor, using the program MarkovClustering.java. This method has two key features. First, it uses the diagonal scores in the matrix of the initial ortholog clusters to obtain more reliable results. Second, it clusters orthologs flexibly, with the clustering controlled by MCL through a selected inflation factor. Users can therefore tune the orthologous protein clusters to their research interests by adjusting the inflation factor. Availability and Implementation: Source code for the orthologous protein clustering software is freely available for non-commercial use at <inter-ref locator="http://dasan.sejong.ac.kr/~wikim/notice.html" locator-type="url">http://dasan.sejong.ac.kr/~wikim/notice.html</inter-ref>, implemented in Java 1.6 and supported on Windows and Linux. Contact: <inter-ref locator="wikim@sejong.ac.kr" locator-type="email">wikim@sejong.ac.kr</inter-ref>; <inter-ref locator="sskim04@hotmail.com" locator-type="email">sskim04@hotmail.com</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1731 http://dx.doi.org/10.1093/bioinformatics/btr259 en Copyright (C) 2011, Oxford University Press
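The role of the inflation factor is easier to see in a small example of generic Markov clustering. The sketch below alternates expansion (squaring the column-stochastic matrix) with inflation (element-wise power followed by column renormalization); it is a textbook-style illustration and not ReMark's MarkovClustering.java. The function names, the convergence tolerance and the toy score matrix are assumptions, and the sketch simply takes whatever diagonal the caller supplies as self-loop weights.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total == 0:
            raise ValueError("column %d has no weight; add a self-loop" % j)
        for i in range(n):
            out[i][j] = m[i][j] / total
    return out


def mcl(scores, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Generic Markov clustering on a square, non-negative score matrix."""
    n = len(scores)
    flow = normalize_columns(scores)
    for _ in range(max_iter):
        # Expansion: one step of flow spreading (matrix squared).
        expanded = [
            [sum(flow[i][k] * flow[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: strengthen strong flows, weaken weak ones.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(
            abs(inflated[i][j] - flow[i][j]) for i in range(n) for j in range(n)
        )
        flow = inflated
        if change < tol:
            break
    # Rows with weight on their own diagonal act as attractors; the columns
    # they still reach form a cluster. Identical clusters are merged.
    clusters = {
        tuple(j for j in range(n) if flow[i][j] > eps)
        for i in range(n)
        if flow[i][i] > eps
    }
    return sorted(list(c) for c in clusters)


# Toy score matrix (made up): items 0-2 are mutually similar, as are 3-4.
scores = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
clusters = mcl(scores)
print(clusters)
```

In general, a larger inflation value tends to split the data into more, smaller clusters, which is the knob the abstract refers to when it says users can adjust the inflation factor.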

oai:open-archive.highwire.org:bioinfo:27/12/17342015-07-29HighWireOUPbioinfo:27:12

'Sciencenet'--towards a global search and share engine for all scientific knowledge Lütjohann, Dominic S. Shah, Asmi H. Christen, Michael P. Richter, Florian Knese, Karsten Liebel, Urban DATABASES AND ONTOLOGIES Summary: Modern biological experiments create vast amounts of data which are geographically distributed. These datasets consist of petabytes of raw data and billions of documents. Yet to the best of our knowledge, a search engine technology that searches and cross-links all different data types in life sciences does not exist. We have developed a prototype distributed scientific search engine technology, ‘Sciencenet’, which facilitates rapid searching over this large data space. By ‘bringing the search engine to the data’, we do not require server farms. This platform also allows users to contribute to the search index and publish their large-scale data to support e-Science. Furthermore, a community-driven method guarantees that only scientific content is crawled and presented. Our peer-to-peer approach is sufficiently scalable for the science web without performance or capacity tradeoff. Availability and Implementation: The free to use search portal web page and the downloadable client are accessible at: <inter-ref locator="http://sciencenet.kit.edu" locator-type="url">http://sciencenet.kit.edu</inter-ref>. The web portal for index administration is implemented in ASP.NET, the ‘AskMe’ experiment publisher is written in Python 2.7, and the backend ‘YaCy’ search engine is based on Java 1.6. Contact: <inter-ref locator="urban.liebel@kit.edu" locator-type="email">urban.liebel@kit.edu</inter-ref> Supplementary Material: Detailed instructions and descriptions can be found on the project homepage: <inter-ref locator="http://sciencenet.kit.edu" locator-type="url">http://sciencenet.kit.edu</inter-ref>. 
Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1734 http://dx.doi.org/10.1093/bioinformatics/btr181 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/17362015-07-29HighWireOUPbioinfo:27:12

Signaling gateway molecule pages--a data model perspective Dinasarapu, Ashok Reddy Saunders, Brian Ozerlat, Iley Azam, Kenan Subramaniam, Shankar DATABASES AND ONTOLOGIES Summary: The Signaling Gateway Molecule Pages (SGMP) database provides highly structured data on proteins which exist in different functional states participating in signal transduction pathways. A molecule page starts with a <it>state</it> of a native protein, without any modification and/or interactions. New states are formed with every post-translational modification or interaction with one or more proteins, small molecules or <it>class molecules</it> and with each change in cellular location. State transitions are caused by a combination of one or more modifications, interactions and translocations which then might be associated with one or more biological processes. In a characterized biological state, a molecule can function as one of several entities or their combinations, including channel, receptor, enzyme, transcription factor and transporter. We have also exported SGMP data to the Biological Pathway Exchange (BioPAX) and Systems Biology Markup Language (SBML) as well as in our custom XML. Availability: SGMP is available at <inter-ref locator="www.signaling-gateway.org/molecule" locator-type="url">www.signaling-gateway.org/molecule</inter-ref>. Contact: <inter-ref locator="shankar@ucsd.edu" locator-type="email">shankar@ucsd.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr190/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1736 http://dx.doi.org/10.1093/bioinformatics/btr190 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/12/17392015-07-29HighWireOUPbioinfo:27:12

Molecular signatures database (MSigDB) 3.0 Liberzon, Arthur Subramanian, Aravind Pinchback, Reid Thorvaldsdóttir, Helga Tamayo, Pablo Mesirov, Jill P. DATABASES AND ONTOLOGIES Motivation: Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets. Results: We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site. Availability and Implementation: MSigDB is freely available for non-commercial use at <inter-ref locator="http://www.broadinstitute.org/msigdb" locator-type="url">http://www.broadinstitute.org/msigdb</inter-ref>. Contact: <inter-ref locator="gsea@broadinstitute.org" locator-type="email">gsea@broadinstitute.org</inter-ref> Oxford University Press 2011-06-15 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/12/1739 http://dx.doi.org/10.1093/bioinformatics/btr260 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17412015-07-29HighWireOUPbioinfo:27:13

Bioinformatics challenges for personalized medicine Fernald, Guy Haskin Capriotti, Emidio Daneshjou, Roxana Karczewski, Konrad J. Altman, Russ B. GENOME ANALYSIS Motivation: Widespread availability of low-cost, full genome sequencing will introduce new challenges for bioinformatics. Results: This review outlines recent developments in sequencing technologies and genome analysis methods for application in personalized medicine. New methods are needed in four areas to realize the potential of personalized medicine: (i) processing large-scale robust genomic data; (ii) interpreting the functional effect and the impact of genomic variation; (iii) integrating systems data to relate complex genetic interactions with phenotypes; and (iv) translating these discoveries into medical practice. Contact: <inter-ref locator="russ.altman@stanford.edu" locator-type="email">russ.altman@stanford.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr295/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1741 http://dx.doi.org/10.1093/bioinformatics/btr295 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17492015-07-29HighWireOUPbioinfo:27:13

A cautionary note for retrocopy identification: DNA-based duplication of intron-containing genes significantly contributes to the origination of single exon genes Zhang, Yong E. Vibranovski, Maria D. Krinsky, Benjamin H. Long, Manyuan GENOME ANALYSIS Motivation: Retrocopies are important genes in the genomes of almost all higher eukaryotes. However, the annotation of such genes is a non-trivial task. Intronless genes have often been considered to be retroposed copies of intron-containing paralogs. Such categorization relies on the implicit premise that alignable regions of the duplicates should be long enough to cover exon–exon junctions of the intron-containing genes, and thus intron loss events can be inferred. Here, we examined the alternative possibility that intronless genes could be generated by partial DNA-based duplication of intron-containing genes in the fruitfly genome. Results: By building pairwise protein-, transcript- and genome-level DNA alignments between intronless genes and their corresponding intron-containing paralogs, we found that alignments do not cover exon–exon junctions in 40% of cases and thus no intron loss could be inferred. For these cases, the candidate parental proteins tend to be partially duplicated, and intergenic sequences or neighboring genes are included in the intronless paralog. Moreover, we observed that it is significantly less likely for these paralogs to show inter-chromosomal duplication and testis-dominant transcription, compared to the remaining 60% of cases with evidence of clear intron loss (retrogenes). These lines of analysis reveal that DNA-based duplication contributes significantly to the 40% of cases of single exon gene duplication. Finally, we performed an analogous survey in the human genome and the result is similar, wherein 34% of the cases do not cover exon–exon junctions. Thus, genome annotation for retrogene identification should discard candidates without clear evidence of intron loss. 
Contact: <inter-ref locator="mlong@uchicago.edu" locator-type="email">mlong@uchicago.edu</inter-ref>; <inter-ref locator="zhangy@uchicago.edu" locator-type="email">zhangy@uchicago.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr280/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1749 http://dx.doi.org/10.1093/bioinformatics/btr280 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17542015-07-29HighWireOUPbioinfo:27:13

Modular model of TNFα cytotoxicity Chignola, Roberto Vyshemirsky, Vladislav Farina, Marcello Del Fabbro, Alessio Milotti, Edoardo SYSTEMS BIOLOGY Motivation: Tumour Necrosis Factor alpha (TNF) initiates a complex series of biochemical events in the cell upon binding to its type R1 receptor (TNF-R1). Recent experimental work has unravelled the molecular regulation of the signalling complexes that lead either to cell survival or death. Survival signals are activated by direct binding of TNF to TNF-R1 at the cell membrane whereas apoptotic signals by endocytosed TNF/TNF-R1 complexes. Here we describe a reduced, effective model with few free parameters, where we group some intricate mechanisms into effective modules, that successfully describes this complex set of actions. We study the parameter space to show that the model is structurally stable and robust over a broad range of parameter values. Results: We use state-of-the-art Bayesian methods (a Sequential Monte Carlo sampler) to perform inference of plausible values of the model parameters from experimental data. As a result, we obtain a robust model that can provide a solid basis for further modelling of TNF signalling. The model is also suitable for inclusion in multi-scale simulation programs that are presently under development to study the behaviour of large tumour cell populations. Availability: We provide supplementary material that includes all mathematical details and all algorithms (Matlab code) and models (SBML descriptions). Contact: <inter-ref locator="edoardo.milotti@ts.infn.it" locator-type="email">edoardo.milotti@ts.infn.it</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr297/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1754 http://dx.doi.org/10.1093/bioinformatics/btr297 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17582015-07-29HighWireOUPbioinfo:27:13

Genome-wide DNA sequence polymorphisms facilitate nucleosome positioning in yeast Dai, Zhiming Dai, Xianhua Xiang, Qian GENOME ANALYSIS Motivation: The intrinsic DNA sequence is an important determinant of nucleosome positioning. Some DNA sequence patterns can facilitate nucleosome formation, while others can inhibit nucleosome formation. Nucleosome positioning influences the overall rate of sequence evolution. However, its impacts on specific patterns of sequence evolution are still poorly understood. Results: Here, we examined whether nucleosomal DNA and nucleosome-depleted DNA show distinct polymorphism patterns to maintain adequate nucleosome architecture on a genome scale in yeast. We found that sequence polymorphisms in nucleosomal DNA tend to facilitate nucleosome formation, whereas polymorphisms in nucleosome-depleted DNA tend to inhibit nucleosome formation, which is especially evident at nucleosome-disfavored sequences in nucleosome-free regions at both ends of genes. Sequence polymorphisms facilitating nucleosome positioning correspond to stable nucleosome positioning. These results reveal that sequence polymorphisms are under selective constraints to maintain nucleosome positioning. Contact: <inter-ref locator="zhimdai@gmail.com" locator-type="email">zhimdai@gmail.com</inter-ref>; <inter-ref locator="issdxh@mail.sysu.edu.cn" locator-type="email">issdxh@mail.sysu.edu.cn</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr290/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1758 http://dx.doi.org/10.1093/bioinformatics/btr290 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17652015-07-29HighWireOUPbioinfo:27:13

Identification of prokaryotic small proteins using a comparative genomic approach Samayoa, Josue Yildiz, Fitnat H. Karplus, Kevin SEQUENCE ANALYSIS Motivation: Accurate prediction of genes encoding small proteins (on the order of 50 amino acids or less) remains an elusive open problem in bioinformatics. Some of the best methods for gene prediction use either sequence composition analysis or sequence similarity to a known protein coding sequence. These methods often fail for small proteins, however, either due to a lack of experimentally verified small protein coding genes or due to the limited statistical significance of statistics on small sequences. Our approach is based upon the hypothesis that true small proteins will be under selective pressure for encoding the particular amino acid sequence, for ease of translation by the ribosome and for structural stability. This stability can be achieved either independently or as part of a larger protein complex. Given this assumption, it follows that small proteins should display conserved local protein structure properties much like larger proteins. Our method incorporates neural-net predictions for three local structure alphabets within a comparative genomic approach using a genomic alignment of 22 closely related bacteria genomes to generate predictions for whether or not a given open reading frame (ORF) encodes for a small protein. Results: We have applied this method to the complete genome for <it>Escherichia coli</it> strain K12 and looked at how well our method performed on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11 407 possible ORFs, we found that 6 of the top 10 and 27 of the top 100 predictions belonged to the set of 60 experimentally verified small proteins. We found 35 of all the true small proteins within the top 200 predictions. 
We compared our method to Glimmer, using a default Glimmer protocol and a modified small ORF Glimmer protocol with a lower minimum size cutoff. The default Glimmer protocol identified 16 of the true small proteins (all in the top 200 predictions), but failed to predict on 34 due to size cutoffs. The small ORF Glimmer protocol made predictions for all the experimentally verified small proteins but only contained 9 of the 60 true small proteins within the top 200 predictions. Contact: <inter-ref locator="jsamayoa@jhu.edu" locator-type="email">jsamayoa@jhu.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr275/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1765 http://dx.doi.org/10.1093/bioinformatics/btr275 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17722015-07-29HighWireOUPbioinfo:27:13

An MCMC algorithm for detecting short adjacent repeats shared by multiple sequences Li, Qiwei Fan, Xiaodan Liang, Tong Li, Shuo–Yen R. SEQUENCE ANALYSIS Motivation: Repeats detection problems are traditionally formulated as string matching or signal processing problems. They cannot readily handle gaps between repeat units and are incapable of detecting repeat patterns shared by multiple sequences. This study detects short adjacent repeats with interunit insertions from multiple sequences. For biological sequences, such studies can shed light on molecular structure, biological function and evolution. Results: The task of detecting short adjacent repeats is formulated as a statistical inference problem by using a probabilistic generative model. A Markov chain Monte Carlo algorithm is proposed to infer the parameters in a <it>de novo</it> fashion. Its applications on synthetic and real biological data show that the new method not only has a competitive edge over existing methods, but also can provide a way to study the structure and the evolution of repeat-containing genes. Availability: The related C++ source code and datasets are available at <inter-ref locator="http://ihome.cuhk.edu.hk/%7Eb118998/share/BASARD.zip" locator-type="url">http://ihome.cuhk.edu.hk/%7Eb118998/share/BASARD.zip</inter-ref>. Contact: <inter-ref locator="xfan@sta.cuhk.edu.hk" locator-type="email">xfan@sta.cuhk.edu.hk</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr287/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1772 http://dx.doi.org/10.1093/bioinformatics/btr287 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17802015-07-29HighWireOUPbioinfo:27:13

Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences Lee, Tzong-Yi Lin, Zong-Qing Hsieh, Sheng-Jen Bretaña, Neil Arvin Lu, Cheng-Tsung SEQUENCE ANALYSIS Summary: Bioinformatics research often requires conservative analyses of a group of sequences associated with a specific biological function (e.g. transcription factor binding sites, micro RNA target sites or protein post-translational modification sites). Due to the difficulty in exploring conserved motifs in large-scale sequence data involving various signals, a new method, MDDLogo, is developed. MDDLogo applies maximal dependence decomposition (MDD) to cluster a group of aligned signal sequences into subgroups containing statistically significant motifs. In order to extract motifs that contain a conserved biochemical property of amino acids in protein sequences, the set of 20 amino acids is further categorized according to their physicochemical properties, e.g. hydrophobicity, charge or molecular size. MDDLogo has been demonstrated to accurately identify the kinase-specific substrate motifs in 1221 human phosphorylation sites associated with seven well-known kinase families from Phospho.ELM. Moreover, in a set of plant phosphorylation data lacking kinase information, MDDLogo has been applied to help in the investigation of substrate motifs of potential kinases and in the improvement of the identification of plant phosphorylation sites with various substrate specificities. In this study, MDDLogo is comparable with another well-known motif discovery tool, Motif-X. Contact: <inter-ref locator="francis@saturn.yzu.edu.tw" locator-type="email">francis@saturn.yzu.edu.tw</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr291/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1780 http://dx.doi.org/10.1093/bioinformatics/btr291 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17882015-07-29HighWireOUPbioinfo:27:13

A detailed investigation of accessibilities around target sites of siRNAs and miRNAs Kiryu, Hisanori Terai, Goro Imamura, Osamu Yoneyama, Hiroyuki Suzuki, Kenji Asai, Kiyoshi STRUCTURAL BIOINFORMATICS Motivation: The importance of RNA sequence analysis has been increasing since the discovery of various types of non-coding RNAs transcribed in animal cells. Conventional RNA sequence analyses have mainly focused on structured regions, which are stabilized by the stacking energies acting on adjacent base pairs. On the other hand, recent findings regarding the mechanisms of small interfering RNAs (siRNAs) and transcription regulation by microRNAs (miRNAs) indicate the importance of analyzing accessible regions where no base pairs exist. So far, relatively few studies have investigated the nature of such regions. Results: We have conducted a detailed investigation of accessibilities around the target sites of siRNAs and miRNAs. We have exhaustively calculated the correlations between the accessibilities around the target sites and the repression levels of the corresponding mRNAs. We have computed the accessibilities with an originally developed software package, called ‘Raccess’, which computes the accessibility of all the segments of a fixed length for a given RNA sequence when the maximal distance between base pairs is limited to a fixed size <it>W</it>. We show that the computed accessibilities are relatively insensitive to the choice of the maximal span <it>W</it>. We have found that the efficacy of siRNAs depends strongly on the accessibility of the very 3′-end of their binding sites, which might reflect a target site recognition mechanism in the RNA-induced silencing complex. 
We also show that the efficacy of miRNAs has a similar dependence on the accessibilities, but some miRNAs also show positive correlations between the efficacy and the accessibilities in broad regions downstream of their putative binding sites, which might imply that the downstream regions of the target sites are bound by other proteins that allow the miRNAs to implement their functions. We have also investigated the off-target effects of an siRNA as a potential RNAi therapeutic. We show that the off-target effects of the siRNA have similar correlations to the miRNA repression, indicating that they are caused by the same mechanism. Availability: The C++ source code of the Raccess software is available at <inter-ref locator="http://www.ncrna.org/software/Raccess/" locator-type="url">http://www.ncrna.org/software/Raccess/</inter-ref> The microarray data on the measurements of the siRNA off-target effects are also available at the same site. Contact: <inter-ref locator="kiryu-h@k.u-tokyo.ac.jp" locator-type="email">kiryu-h@k.u-tokyo.ac.jp</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr276/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1788 http://dx.doi.org/10.1093/bioinformatics/btr276 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/17982015-07-29HighWireOUPbioinfo:27:13

HiTRACE: high-throughput robust analysis for capillary electrophoresis Yoon, Sungroh Kim, Jinkyu Hum, Justine Kim, Hanjoo Park, Seunghyun Kladwang, Wipapat Das, Rhiju STRUCTURAL BIOINFORMATICS Motivation: Capillary electrophoresis (CE) of nucleic acids is a workhorse technology underlying high-throughput genome analysis and large-scale chemical mapping for nucleic acid structural inference. Despite the wide availability of CE-based instruments, there remain challenges in leveraging their full power for quantitative analysis of RNA and DNA structure, thermodynamics and kinetics. In particular, the slow rate and poor automation of available analysis tools have bottlenecked a new generation of studies involving hundreds of CE profiles per experiment. Results: We propose a computational method called <it>high-throughput robust analysis for capillary electrophoresis</it> (HiTRACE) to automate the key tasks in large-scale nucleic acid CE analysis, including the profile alignment that has heretofore been a rate-limiting step in the highest throughput experiments. We illustrate the application of HiTRACE on 13 datasets representing 4 different RNAs, 3 chemical modification strategies and up to 480 single mutant variants; the largest datasets each include 87 360 bands. By applying a series of robust dynamic programming algorithms, HiTRACE outperforms prior tools in terms of alignment and fitting quality, as assessed by measures including the correlation between quantified band intensities between replicate datasets. Furthermore, while the smallest of these datasets required 7–10 h of manual intervention using prior approaches, HiTRACE quantitation of even the largest datasets herein was achieved in 3–12 min. The HiTRACE method, therefore, resolves a critical barrier to the efficient and accurate analysis of nucleic acid structure in experiments involving tens of thousands of electrophoretic bands. 
Availability: HiTRACE is freely available for download at <inter-ref locator="http://hitrace.stanford.edu" locator-type="url">http://hitrace.stanford.edu</inter-ref>. Contact: <inter-ref locator="sryoon@korea.ac.kr" locator-type="email">sryoon@korea.ac.kr</inter-ref>; <inter-ref locator="rhiju@stanford.edu" locator-type="email">rhiju@stanford.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr277/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1798 http://dx.doi.org/10.1093/bioinformatics/btr277 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18062015-07-29HighWireOUPbioinfo:27:13

Evaluation of drug-human serum albumin binding interactions with support vector machine aided online automated docking Zsila, Ferenc Bikadi, Zsolt Malik, David Hari, Peter Pechan, Imre Berces, Attila Hazai, Eszter STRUCTURAL BIOINFORMATICS Motivation: Human serum albumin (HSA), the most abundant plasma protein, is well known for its extraordinary binding capacity for both endogenous and exogenous substances, including a wide range of drugs. Interaction with the two principal binding sites of HSA in subdomain IIA (site 1) and in subdomain IIIA (site 2) controls the free, active concentration of a drug, provides a reservoir for a long duration of action and ultimately affects the ADME (absorption, distribution, metabolism, and excretion) profile. Due to the continuous demand to investigate HSA binding properties of novel drugs, drug candidates and drug-like compounds, a support vector machine (SVM) model was developed that efficiently predicts albumin binding. Our SVM model was integrated into a free, web-based prediction platform (<inter-ref locator="http://albumin.althotas.com" locator-type="url">http://albumin.althotas.com</inter-ref>). Automated molecular docking calculations for prediction of complex geometry are also integrated into the web service. The platform enables the users (i) to predict if albumin binds the query ligand, (ii) to determine the probable ligand binding site (site 1 or site 2), (iii) to select the albumin X-ray structure which is complexed with the most similar ligand and (iv) to calculate complex geometry using molecular docking calculations. Our SVM model and the potential offered by the combined use of <it>in silico</it> calculation methods and experimental binding data are illustrated. 
Contact: <inter-ref locator="eszter.hazai@virtuadrug.com" locator-type="email">eszter.hazai@virtuadrug.com</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr284/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1806 http://dx.doi.org/10.1093/bioinformatics/btr284 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18142015-07-29HighWireOUPbioinfo:27:13

Prediction of peptides binding to the PKA RIIα subunit using a hierarchical strategy Hou, Tingjun Li, Youyong Wang, Wei STRUCTURAL BIOINFORMATICS Motivation: Favorable interaction between the regulatory subunit of the cAMP-dependent protein kinase (PKA) and a peptide in A-kinase anchoring proteins (AKAPs) is critical for translocating PKA to the subcellular sites where the enzyme phosphorylates its substrates. It is very hard to identify AKAP peptides binding to PKA due to the high sequence diversity of AKAPs. Results: We propose a hierarchical and efficient approach, which combines molecular dynamics (MD) simulations, free energy calculations, virtual mutagenesis (VM) and bioinformatics analyses, to predict peptides binding to the PKA RIIα regulatory subunit in the human proteome systematically. Our approach successfully retrieved 15 out of 18 documented RIIα-binding peptides. Literature curation supported that many newly predicted peptides might be true AKAPs. Here, we present the first systematic search for AKAP peptides in the human proteome, which is useful to further experimental identification of AKAPs and functional analysis of their biological roles. Contact: <inter-ref locator="tingjunhou@hotmail.com" locator-type="email">tingjunhou@hotmail.com</inter-ref>; <inter-ref locator="tjhou@suda.edu.cn" locator-type="email">tjhou@suda.edu.cn</inter-ref>; <inter-ref locator="wei-wang@ucsd.edu" locator-type="email">wei-wang@ucsd.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr294/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1814 http://dx.doi.org/10.1093/bioinformatics/btr294 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18222015-07-29HighWireOUPbioinfo:27:13

Application of the Bayesian MMSE estimator for classification error to gene expression microarray data Dalton, Lori A. Dougherty, Edward R. GENE EXPRESSION Motivation: With the development of high-throughput genomic and proteomic technologies, coupled with the inherent difficulties in obtaining large samples, biomedicine faces difficult small-sample classification issues, in particular, error estimation. Most popular error estimation methods are motivated by intuition rather than mathematical inference. A recently proposed error estimator based on Bayesian minimum mean square error estimation places error estimation in an optimal filtering framework. In this work, we examine the application of this error estimator to gene expression microarray data, including the suitability of the Gaussian model with normal–inverse-Wishart priors and how to find prior probabilities. Results: We provide an implementation for non-linear classification, where closed form solutions are not available. We propose a methodology for calibrating normal-inverse-Wishart priors based on discarded microarray data and examine the performance on synthetic high-dimensional data and a real dataset from a breast cancer study. The calibrated Bayesian error estimator has superior root mean square performance, especially with moderate to high expected true errors and small feature sizes. Availability: We have implemented in C code the Bayesian error estimator for Gaussian distributions and normal–inverse-Wishart priors for both linear classifiers, with exact closed-form representations, and arbitrary classifiers, where we use a Monte Carlo approximation. Our code for the Bayesian error estimator and a toolbox of related utilities are available at <inter-ref locator="http://gsp.tamu.edu/Publications/supplementary/dalton11a" locator-type="url">http://gsp.tamu.edu/Publications/supplementary/dalton11a</inter-ref>. Several supporting simulations are also included. 
Contact: <inter-ref locator="ldalton@tamu.edu" locator-type="email">ldalton@tamu.edu</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr272/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1822 http://dx.doi.org/10.1093/bioinformatics/btr272 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18322015-07-29HighWireOUPbioinfo:27:13

A multiple network learning approach to capture system-wide condition-specific responses Roy, Sushmita Werner-Washburne, Margaret Lane, Terran SYSTEMS BIOLOGY Motivation: Condition-specific networks capture system-wide behavior under varying conditions such as environmental stresses, cell types or tissues. These networks frequently comprise parts that are unique to each condition, and parts that are shared among related conditions. Existing approaches for learning condition-specific networks typically identify either only differences or only similarities across conditions. Most of these approaches first learn networks per condition independently, and then identify similarities and differences in a post-learning step. Such approaches do not exploit the shared information across conditions during network learning. Results: We describe an approach for learning condition-specific networks that identifies the shared and unique subgraphs during network learning simultaneously, rather than as a post-processing step. Our approach learns networks across condition sets, shares data from different conditions and produces high-quality networks that capture biologically meaningful information. On simulated data, our approach outperformed an existing approach that learns networks independently for each condition, especially for small training datasets. On microarray data of hundreds of deletion mutants in two yeast stationary-phase cell populations, the inferred network structure identified several common and population-specific effects of these deletion mutants and several high-confidence cases of double-deletion pairs, which can be experimentally tested. Our results are consistent with and extend the existing knowledge base of differentiated cell populations in yeast stationary phase. 
Availability and Implementation: C++ code can be accessed from <inter-ref locator="http://www.broadinstitute.org/~sroy/condspec/" locator-type="url">http://www.broadinstitute.org/~sroy/condspec/</inter-ref> Contact: <inter-ref locator="sroy@broadinstitute.org" locator-type="email">sroy@broadinstitute.org</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr270/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1832 http://dx.doi.org/10.1093/bioinformatics/btr270 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18392015-07-29HighWireOUPbioinfo:27:13

Creating views on integrated multidomain data Rohn, Hendrik Klukas, Christian Schreiber, Falk SYSTEMS BIOLOGY Motivation: Modern data acquisition methods in biology allow the procurement of different types of data in increasing quantity, facilitating a comprehensive view of biological systems. As data are usually gathered and interpreted by separate domain scientists, it is hard to grasp multidomain properties and structures. Consequently, there is a need for the integration of biological data from different sources and of different types in one application, providing various visualization approaches. Results: In this article, methods for the integration and visualization of multimodal biological data are presented. This is achieved based on two graphs representing the meta-relations between biological data and the measurement combinations, respectively. Both graphs are linked and serve as different views of the integrated data with navigation and exploration possibilities. Data can be combined and visualized multifariously, resulting in views of the integrated biological data. Availability: <inter-ref locator="http://vanted.ipk-gatersleben.de/hive/" locator-type="url">http://vanted.ipk-gatersleben.de/hive/</inter-ref>. Contact: <inter-ref locator="rohn@ipk-gatersleben.de" locator-type="email">rohn@ipk-gatersleben.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1839 http://dx.doi.org/10.1093/bioinformatics/btr282 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18462015-07-29HighWireOUPbioinfo:27:13

Integrative gene network construction for predicting a set of complementary prostate cancer genes Ahn, Jaegyoon Yoon, Youngmi Park, Chihyun Shin, Eunji Park, Sanghyun SYSTEMS BIOLOGY Motivation: Diagnosis and prognosis of cancer and understanding oncogenesis within the context of biological pathways is one of the most important research areas in bioinformatics. Recently, there have been several attempts to integrate interactome and transcriptome data to identify subnetworks that provide limited interpretations of known and candidate cancer genes, as well as increase classification accuracy. However, these studies provide little information about the detailed roles of identified cancer genes. Results: To provide more information to the network, we constructed the network by incorporating genetic interactions and manually curated gene regulations to the protein interaction network. To make our newly constructed network cancer specific, we identified edges where two genes show different expression patterns between cancer and normal phenotypes. We showed that the integration of various datasets increased classification accuracy, which suggests that our network is more complete than a network based solely on protein interactions. We also showed that our network contains significantly more known cancer-related genes than other feature selection algorithms. Through observations of some examples of cancer-specific subnetworks, we were able to predict more detailed and interpretable roles of oncogenes and other cancer candidate genes in the prostate cancer cells. Availability: <inter-ref locator="http://embio.yonsei.ac.kr/~Ahn/tc.php" locator-type="url">http://embio.yonsei.ac.kr/~Ahn/tc.php</inter-ref>. 
Contact: <inter-ref locator="sanghyun@cs.yonsei.ac.kr" locator-type="email">sanghyun@cs.yonsei.ac.kr</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr283/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1846 http://dx.doi.org/10.1093/bioinformatics/btr283 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18542015-07-29HighWireOUPbioinfo:27:13

Model building and intelligent acquisition with application to protein subcellular location classification Jackson, C. Glory–Afshar, E. Murphy, R. F. Kovacevic, J. SYSTEMS BIOLOGY Motivation: We present a framework and algorithms to intelligently acquire movies of protein subcellular location patterns by learning their models as they are being acquired, and simultaneously determining how many cells to acquire as well as how many frames to acquire per cell. This is motivated by the desire to minimize acquisition time and photobleaching, given the need to build such models for all proteins, in all cell types, under all conditions. Our key innovation is to build models during acquisition rather than as a post-processing step, thus allowing us to intelligently and automatically adapt the acquisition process given the model acquired. Results: We validate our framework on protein subcellular location classification, and show that the combination of model building and intelligent acquisition results in time and storage savings without loss of classification accuracy, or alternatively, higher classification accuracy for the same total acquisition time. Availability and implementation: The data and software used for this study will be made available upon publication at <inter-ref locator="http://murphylab.web.cmu.edu/software" locator-type="url">http://murphylab.web.cmu.edu/software</inter-ref> and <inter-ref locator="http://www.andrew.cmu.edu/user/jelenak/Software" locator-type="url">http://www.andrew.cmu.edu/user/jelenak/Software</inter-ref>. Contact: <inter-ref locator="jelenak@cmu.edu" locator-type="email">jelenak@cmu.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1854 http://dx.doi.org/10.1093/bioinformatics/btr286 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18602015-07-29HighWireOUPbioinfo:27:13

The role of indirect connections in gene networks in predicting function Gillis, Jesse Pavlidis, Paul SYSTEMS BIOLOGY Motivation: Gene networks have been used widely in gene function prediction algorithms, many based on complex extensions of the ‘guilt by association’ principle. We sought to provide a unified explanation for the performance of gene function prediction algorithms in exploiting network structure and thereby simplify future analysis. Results: We use co-expression networks to show that most exploited network structure simply reconstructs the original correlation matrices from which the co-expression network was obtained. We show the same principle works in predicting gene function in protein interaction networks and that these methods perform comparably to much more sophisticated gene function prediction algorithms. Availability and implementation: Data and algorithm implementation are fully described and available at <inter-ref locator="http://www.chibi.ubc.ca/extended" locator-type="url">http://www.chibi.ubc.ca/extended</inter-ref>. Programs are provided in Matlab m-code. Contact: <inter-ref locator="paul@chibi.ubc.ca" locator-type="email">paul@chibi.ubc.ca</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr288/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1860 http://dx.doi.org/10.1093/bioinformatics/btr288 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18672015-07-29HighWireOUPbioinfo:27:13

BINOCh: binding inference from nucleosome occupancy changes Meyer, Clifford A. He, Housheng H. Brown, Myles Liu, X. Shirley GENOME ANALYSIS Summary: Transcription factor binding events are frequently associated with a pattern of nucleosome occupancy changes in which nucleosomes flanking the binding site increase in occupancy, while those in the vicinity of the binding site itself are displaced. Genome-wide information on enhancer proximal nucleosome occupancy can be readily acquired using ChIP-seq targeting enhancer-related histone modifications such as H3K4me2. Here, we present a software package, BINOCh, that allows biologists to use such data to infer the identity of key transcription factors that regulate the response of a cell to a stimulus or determine a program of differentiation. Availability: The BINOCh open source Python package is freely available at <inter-ref locator="http://liulab.dfci.harvard.edu/BINOCh" locator-type="url">http://liulab.dfci.harvard.edu/BINOCh</inter-ref> under the FreeBSD license. Contact: <inter-ref locator="cliff@jimmy.harvard.edu" locator-type="email">cliff@jimmy.harvard.edu</inter-ref>; <inter-ref locator="xsliu@jimmy.harvard.edu" locator-type="email">xsliu@jimmy.harvard.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr279/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1867 http://dx.doi.org/10.1093/bioinformatics/btr279 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18692015-07-29HighWireOUPbioinfo:27:13

Sim4db and Leaff: utilities for fast batch spliced alignment and sequence indexing Walenz, Brian Florea, Liliana GENOME ANALYSIS Summary: The large number of genomes that will be sequenced will need to be annotated with genes and other functional features. Aligning gene sequences from a related species to the target genome is an economical and highly reliable method to identify genes; unfortunately, existing tools have been lacking in sensitivity and speed. A program we reported, <ty>sim4cc</ty>, was shown to be highly accurate but is limited to comparing one cDNA with one genomic sequence. We present here an optimization of the tool, implemented in the packages <ty>sim4db</ty> and <ty>leaff</ty>. The new tool performs batch alignments of cDNA and genomic sequences in a fraction of the time required by its predecessor, and thus is very well suited for genome-wide analyses. Availability: <ty>Sim4db</ty> and <ty>leaff</ty> are written in C, C++ and Perl for Linux and other Unix platforms. Source code is distributed free of charge from <inter-ref locator="http://sourceforge.net/projects/kmer/" locator-type="url">http://sourceforge.net/projects/kmer/</inter-ref>. Contact: <inter-ref locator="florea@umiacs.umd.edu" locator-type="email">florea@umiacs.umd.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr285/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1869 http://dx.doi.org/10.1093/bioinformatics/btr285 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18712015-07-29HighWireOUPbioinfo:27:13

Genome-wide association studies pipeline (GWASpi): a desktop application for genome-wide SNP analysis and management Muñiz-Fernandez, Fernando Carreño–Torres, Angel Morcillo-Suarez, Carlos Navarro, Arcadi GENOME ANALYSIS Motivation: Genome-wide association studies (GWAS) based on single nucleotide polymorphism (SNP) arrays are the most widely used approach to detect loci associated with human traits. Due to the complexity of the methods and software packages available, each with its particular format requiring intricate management workflows, the analysis of GWAS usually confronts scientists with steep learning curves. Indeed, the wide variety of tools makes the parsing and manipulation of data the most time-consuming and error-prone part of a study. To help resolve these issues, we present GWASpi, a user-friendly, multiplatform desktop application for the management and analysis of GWAS data, with a novel approach to database technologies that makes the most of commonly available desktop hardware. GWASpi aims to be a start-to-finish GWAS management application, from raw data to results, containing the most common analysis tools. As a result, GWASpi is easy to use and reduces by up to two orders of magnitude the time needed to perform the fundamental steps of a GWAS. Availability: Freely available on the web at <inter-ref locator="http://www.gwaspi.org" locator-type="url">http://www.gwaspi.org</inter-ref>. Implemented in Java, Apache-Derby and NetCDF-3, with all major operating systems supported. Contact: <inter-ref locator="gwaspi@upf.edu" locator-type="email">gwaspi@upf.edu</inter-ref>; <inter-ref locator="arcadi.navarro@upf.edu" locator-type="email">arcadi.navarro@upf.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1871 http://dx.doi.org/10.1093/bioinformatics/btr301 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18732015-07-29HighWireOUPbioinfo:27:13

famCNV: copy number variant association for quantitative traits in families Eleftherohorinou, Hariklia Andersson-Assarsson, Johanna C. Walters, Robin G. El-Sayed Moustafa, Julia S. Coin, Lachlan Jacobson, Peter Carlsson, Lena M. S. Blakemore, Alexandra I. F. Froguel, Philippe Walley, Andrew J. Falchi, Mario GENETICS AND POPULATION ANALYSIS Summary: A program package to enable genome-wide association of copy number variants (CNVs) with quantitative phenotypes in families of arbitrary size and complexity. Intensity signals that act as proxies for the number of copies are modeled in a variance component framework and association with traits is assessed through formal likelihood testing. Availability and implementation: The Java package is made available at <inter-ref locator="www.imperial.ac.uk/medicine/people/m.falchi/" locator-type="url">www.imperial.ac.uk/medicine/people/m.falchi/</inter-ref>. Contact: <inter-ref locator="m.falchi@imperial.ac.uk" locator-type="email">m.falchi@imperial.ac.uk</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1873 http://dx.doi.org/10.1093/bioinformatics/btr264 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18762015-07-29HighWireOUPbioinfo:27:13

parmigene--a parallel R package for mutual information estimation and gene network reconstruction Sales, Gabriele Romualdi, Chiara SYSTEMS BIOLOGY Motivation: Inferring large transcriptional networks using mutual information has been shown to be effective in several experimental setups. Unfortunately, this approach has two main drawbacks: (i) several mutual information estimators are prone to biases and (ii) available software still has large computational costs when processing thousands of genes. Results: Here, we present <it>parmigene</it> (PARallel Mutual Information estimation for GEne NEtwork reconstruction), an R package that tries to fill the above gaps. It implements a mutual information estimator based on <it>k</it>-nearest neighbor distances that is minimally biased with respect to the other methods and uses a parallel computing paradigm to reconstruct gene regulatory networks. We test <it>parmigene</it> on <it>in silico</it> and real data. We show that <it>parmigene</it> gives more precise results than existing software with strikingly lower computational costs. Availability and Implementation: The <it>parmigene</it> package is available on the CRAN network at <inter-ref locator="http://cran.r-project.org/web/packages/" locator-type="url">http://cran.r-project.org/web/packages/</inter-ref>. Contact: <inter-ref locator="chiara.romualdi@unipd.it" locator-type="email">chiara.romualdi@unipd.it</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr274/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1876 http://dx.doi.org/10.1093/bioinformatics/btr274 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18782015-07-29HighWireOUPbioinfo:27:13

MPEA--metabolite pathway enrichment analysis Kankainen, Matti Gopalacharyulu, Peddinti Holm, Liisa Oresic, Matej SYSTEMS BIOLOGY Summary: We present metabolite pathway enrichment analysis (MPEA) for the visualization and biological interpretation of metabolite data at the system level. Our tool follows the concept of gene set enrichment analysis (GSEA) and tests whether metabolites involved in some predefined pathway occur towards the top (or bottom) of a ranked query compound list. In particular, MPEA is designed to handle many-to-many relationships that may occur between the query compounds and metabolite annotations. For a demonstration, we analysed metabolite profiles of 14 twin pairs with differing body weights. MPEA found significant pathways from data that had no significant individual query compounds; its results were congruent with those discovered from transcriptomics data, and it detected more pathways than the competing metabolic pathway method did. Availability: The web server and source code of MPEA are available at <inter-ref locator="http://ekhidna.biocenter.helsinki.fi/poxo/mpea/" locator-type="url">http://ekhidna.biocenter.helsinki.fi/poxo/mpea/</inter-ref>. Contact: <inter-ref locator="matti.kankainen@helsinki.fi" locator-type="email">matti.kankainen@helsinki.fi</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr278/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1878 http://dx.doi.org/10.1093/bioinformatics/btr278 en Copyright (C) 2011, Oxford University Press
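
The GSEA-style test mentioned in the MPEA record above (do pathway members cluster towards the top or bottom of a ranked compound list?) can be illustrated with a minimal sketch. It assumes an unweighted Kolmogorov-Smirnov-like running sum and ignores the many-to-many annotation handling that MPEA adds; the function name `enrichment_score` and the toy compound labels are hypothetical and are not taken from the MPEA source code.

```python
def enrichment_score(ranked, pathway):
    """Unweighted running-sum enrichment score (illustrative sketch only).

    ranked  -- list of compound identifiers, best-ranked first
    pathway -- iterable of identifiers annotated to one pathway
    Returns the running-sum value with the largest absolute deviation from 0:
    positive if members sit near the top, negative if near the bottom.
    """
    members = set(pathway) & set(ranked)
    n_hit = len(members)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # no contrast between members and non-members

    running = 0.0
    best = 0.0
    for item in ranked:
        # Step up on a pathway member, step down otherwise.
        if item in members:
            running += 1.0 / n_hit
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]
    print(enrichment_score(ranked, {"a", "b"}))  # members at the top
    print(enrichment_score(ranked, {"e", "f"}))  # members at the bottom
```

In practice, significance would be assessed by comparing the observed score with scores from permuted rankings; that step is omitted here.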

oai:open-archive.highwire.org:bioinfo:27/13/18802015-07-29HighWireOUPbioinfo:27:13

TVNViewer: An interactive visualization tool for exploring networks that change over time or space Curtis, Ross E. Yuen, Amos Song, Le Goyal, Anuj Xing, Eric P. SYSTEMS BIOLOGY Summary: The relationship between genes and proteins is a dynamic relationship that changes across time and differs in different cells. The study of these differences can reveal various insights into biological processes and disease progression, especially with the aid of proper tools for network visualization. Toward this purpose, we have developed TVNViewer, a novel visualization tool, which is specifically designed to aid in the exploration and analysis of dynamic networks. Availability: TVNViewer is freely available with documentation and tutorials on the web at <inter-ref locator="http://sailing.cs.cmu.edu/tvnviewer" locator-type="url">http://sailing.cs.cmu.edu/tvnviewer</inter-ref>. Contact: <inter-ref locator="epxing@cs.cmu.edu" locator-type="email">epxing@cs.cmu.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1880 http://dx.doi.org/10.1093/bioinformatics/btr273 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/18822015-07-29HighWireOUPbioinfo:27:13

Model-based gene set analysis for Bioconductor Bauer, Sebastian Robinson, Peter N. Gagneur, Julien DATABASES AND ONTOLOGIES Summary: Gene Ontology and other forms of gene-category analysis play a major role in the evaluation of high-throughput experiments in molecular biology. Single-category enrichment analysis procedures such as Fisher's exact test tend to flag large numbers of redundant categories as significant, which can complicate interpretation. We have recently developed an approach called model-based gene set analysis (MGSA) that substantially reduces the number of redundant categories returned by the gene-category analysis. In this work, we present the Bioconductor package <it>mgsa</it>, which makes the MGSA algorithm available to users of the R language. Our package provides a simple and flexible application programming interface for applying the approach. Availability: The <it>mgsa</it> package has been made available as part of Bioconductor 2.8. It is released under the conditions of the Artistic license 2.0. Contact: <inter-ref locator="peter.robinson@charite.de" locator-type="email">peter.robinson@charite.de</inter-ref>; <inter-ref locator="julien.gagneur@embl.de" locator-type="email">julien.gagneur@embl.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/1882 http://dx.doi.org/10.1093/bioinformatics/btr296 en Copyright (C) 2011, Oxford University Press
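
For readers unfamiliar with the single-category baseline that the MGSA record above contrasts itself with, the following is a minimal sketch of a one-sided Fisher's exact (hypergeometric) enrichment test for one category. It does not implement MGSA and does not use the <it>mgsa</it> R API; the function name `fisher_enrichment_p` and its parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided enrichment p-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N -- number of genes in the background population
    K -- number of background genes annotated to the category
    n -- number of genes in the study set
    k -- number of study-set genes annotated to the category
    """
    upper = min(n, K)        # largest overlap that is possible
    total = comb(N, n)       # number of ways to draw the study set
    # Sum the probability of every overlap at least as large as observed.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # 10 background genes, 3 in the category, study set of 3, all 3 annotated.
    print(fisher_enrichment_p(3, 3, 3, 10))
```

Because such a test is applied to each category independently, overlapping categories that share the same annotated genes tend to be reported together, which is the redundancy the record describes.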

oai:open-archive.highwire.org:bioinfo:27/13/i12015-07-29HighWireOUPbioinfo:27:13

Editorial EDITORIAL Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i1 http://dx.doi.org/10.1093/bioinformatics/btr302 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1022015-07-29HighWireOUPbioinfo:27:13

A conditional random fields method for RNA sequence-structure relationship modeling and conformation sampling Wang, Zhiyong Xu, Jinbo SEQUENCE ANALYSIS Accurate tertiary structures are very important for the functional study of non-coding RNA molecules. However, predicting RNA tertiary structures is extremely challenging, because of a large conformation space to be explored and lack of an accurate scoring function differentiating the native structure from decoys. The fragment-based conformation sampling method (e.g. FARNA) has the shortcoming that the limited size of a fragment library makes it infeasible to represent all possible conformations well. A recent dynamic Bayesian network method, BARNACLE, overcomes the issue of fragment assembly. In addition, neither of these methods makes use of sequence information in sampling conformations. Here, we present a new probabilistic graphical model, conditional random fields (CRFs), to model RNA sequence–structure relationship, which enables us to accurately estimate the probability of an RNA conformation from sequence. Coupled with a novel tree-guided sampling scheme, our CRF model is then applied to RNA conformation sampling. Experimental results show that our CRF method can model RNA sequence–structure relationship well and sequence information is important for conformation sampling. Our method, named TreeFolder, generates a much higher percentage of native-like decoys than FARNA and BARNACLE, although we use the same simple energy function as BARNACLE. Contact: <inter-ref locator="zywang@ttic.edu" locator-type="email">zywang@ttic.edu</inter-ref>; <inter-ref locator="j3xu@ttic.edu" locator-type="email">j3xu@ttic.edu</inter-ref> Supplementary Information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr232/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i102 http://dx.doi.org/10.1093/bioinformatics/btr232 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1112015-07-29HighWireOUPbioinfo:27:13

Discovering and visualizing indirect associations between biomedical concepts Tsuruoka, Yoshimasa Miwa, Makoto Hamamoto, Kaisei Tsujii, Jun'ichi Ananiadou, Sophia TEXT MINING Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at <inter-ref locator="http://refine1-nactem.mc.man.ac.uk/facta/" locator-type="url">http://refine1-nactem.mc.man.ac.uk/facta/</inter-ref>, and its visualizer is available at <inter-ref locator="http://refine1-nactem.mc.man.ac.uk/facta-visualizer/" locator-type="url">http://refine1-nactem.mc.man.ac.uk/facta-visualizer/</inter-ref>. 
Contact: <inter-ref locator="tsuruoka@jaist.ac.jp" locator-type="email">tsuruoka@jaist.ac.jp</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i111 http://dx.doi.org/10.1093/bioinformatics/btr214 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1202015-07-29HighWireOUPbioinfo:27:13

MeSH: a window into full text for document summarization Bhattacharya, Sanmitra Ha–Thuc, Viet Srinivasan, Padmini TEXT MINING Motivation: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents. Results: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 <it>F</it>-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE <it>F</it>-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts. 
Contact: <inter-ref locator="sanmitra-bhattacharya@uiowa.edu" locator-type="email">sanmitra-bhattacharya@uiowa.edu</inter-ref>; <inter-ref locator="padmini-srinivasan@uiowa.edu" locator-type="email">padmini-srinivasan@uiowa.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i120 http://dx.doi.org/10.1093/bioinformatics/btr223 en Copyright (C) 2011, Oxford University Press
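
The ROUGE-1 <it>F</it>-score reported in the record above measures unigram overlap between a candidate summary and a reference text. The sketch below shows the basic arithmetic under simplifying assumptions: lowercased whitespace tokenization, a single reference, and no stemming or stop-word removal. It is not the ROUGE toolkit used in the study, and the function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F-score between two strings (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped unigram overlap: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f("the cat sat", "the cat sat on the mat"))
```

ROUGE-2 follows the same pattern with bigrams in place of unigrams, and ROUGE-SU4 additionally counts skip-bigrams with a gap of up to four words.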

oai:open-archive.highwire.org:bioinfo:27/13/i1292015-07-29HighWireOUPbioinfo:27:13

A folding algorithm for extended RNA secondary structures zu Siederdissen, Christian Höner Bernhart, Stephan H. Stadler, Peter F. Hofacker, Ivo L. SEQUENCE ANALYSIS Motivation: RNA secondary structure contains many non-canonical base pairs of different pair families. Successful prediction of these structural features leads to improved secondary structures with applications in tertiary structure prediction and simultaneous folding and alignment. Results: We present a theoretical model capturing both RNA pair families and extended secondary structure motifs with shared nucleotides using 2-diagrams. We accompany this model with a number of programs for parameter optimization and structure prediction. Availability: All sources (optimization routines, RNA folding, RNA evaluation, extended secondary structure visualization) are published under the GPLv3 and available at <inter-ref locator="www.tbi.univie.ac.at/software/rnawolf/" locator-type="url">www.tbi.univie.ac.at/software/rnawolf/</inter-ref>. Contact: <inter-ref locator="choener@tbi.univie.ac.at" locator-type="email">choener@tbi.univie.ac.at</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i129 http://dx.doi.org/10.1093/bioinformatics/btr220 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1372015-07-29HighWireOUPbioinfo:27:13

Error correction of high-throughput sequencing datasets with non-uniform coverage Medvedev, Paul Scott, Eric Kakaradov, Boyko Pevzner, Pavel SEQUENCE ANALYSIS Motivation: The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new applications. As a result, error correction of sequencing reads remains an important problem. Though several tools do an excellent job of correcting datasets where the reads are sampled close to uniformly, the problem of correcting reads coming from drastically non-uniform datasets, such as those from single-cell sequencing, remains open. Results: In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of a Hamming graph and a simple probabilistic model for sequencing errors. It is a simple and adaptable algorithm that improves on other tools on non-uniform single-cell data, while achieving comparable results on normal multi-cell data. Availability: <inter-ref locator="http://www.cs.toronto.edu/~pashadag" locator-type="url">http://www.cs.toronto.edu/~pashadag</inter-ref>. Contact: <inter-ref locator="pmedvedev@cs.ucsd.edu" locator-type="email">pmedvedev@cs.ucsd.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i137 http://dx.doi.org/10.1093/bioinformatics/btr208 en Copyright (C) 2011, Oxford University Press
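The Hamming-graph idea in the record above can be illustrated with a minimal sketch. The abstract does not specify Hammer's probabilistic model, so this sketch swaps in a plain majority vote within each connected component; that substitution, the function names and the threshold `tau` are assumptions made for illustration and are not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_graph_components(kmers, tau=1):
    """Connected components of the graph whose nodes are the distinct k-mers
    and whose edges join k-mers at Hamming distance <= tau (hypothetical
    threshold). Quadratic in the number of distinct k-mers."""
    nodes = sorted(set(kmers))
    parent = {k: k for k in nodes}

    def find(k):
        # Union-find lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= tau:
            parent[find(a)] = find(b)

    components = {}
    for k in nodes:
        components.setdefault(find(k), []).append(k)
    return list(components.values())


def correct_by_majority(kmers, tau=1):
    """Replace every k-mer by the most frequent k-mer in its component.
    Assumption: a majority vote stands in for a probabilistic error model."""
    counts = Counter(kmers)
    replacement = {}
    for comp in hamming_graph_components(kmers, tau):
        representative = max(comp, key=lambda k: (counts[k], k))
        for k in comp:
            replacement[k] = representative
    return [replacement[k] for k in kmers]


if __name__ == "__main__":
    reads = ["ACGT"] * 5 + ["ACGA"] + ["TTTT"] * 3
    print(correct_by_majority(reads))
```

Because the components are built from distances between k-mers rather than from coverage, the grouping step itself makes no uniformity assumption; only the vote inside each component depends on counts.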

oai:open-archive.highwire.org:bioinfo:27/13/i1422015-07-29HighWireOUPbioinfo:27:13

Generative probabilistic models for protein-protein interaction networks--the biclique perspective Schweiger, Regev Linial, Michal Linial, Nathan PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: Much of the large-scale molecular data from living cells can be represented in terms of networks. Such networks occupy a central position in cellular systems biology. In the protein–protein interaction (PPI) network, nodes represent proteins and edges represent connections between them, based on experimental evidence. As PPI networks are rich and complex, a mathematical model is sought to capture their properties and shed light on PPI evolution. The mathematical literature contains various generative models of random graphs. It is a major, still largely open question, which of these models (if any) can properly reproduce various biologically interesting networks. Here, we consider this problem where the graph at hand is the PPI network of <it>Saccharomyces cerevisiae</it>. We are trying to distinguish between a model family which performs a process of copying neighbors, represented by the duplication–divergence (DD) model, and models which do not copy neighbors, with the Barabási–Albert (BA) preferential attachment model as a leading example. Results: The observed property of the network is the distribution of maximal bicliques in the graph. This is a novel criterion to distinguish between models in this area. It is particularly appropriate for this purpose, since it reflects the graph's growth pattern under either model. This test clearly favors the DD model. In particular, for the BA model, the vast majority (92.9%) of the bicliques with both sides ≥4 must be already embedded in the model's seed graph, whereas the corresponding figure for the DD model is only 5.1%. Our results, based on the biclique perspective, conclusively show that a naïve unmodified DD model can capture a key aspect of PPI networks. 
Contact: <inter-ref locator="regevs01@cs.huji.ac.il" locator-type="email">regevs01@cs.huji.ac.il</inter-ref>; <inter-ref locator="michall@cc.huji.ac.il" locator-type="email">michall@cc.huji.ac.il</inter-ref>; <inter-ref locator="nati@cs.huji.ac.il" locator-type="email">nati@cs.huji.ac.il</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr201/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i142 http://dx.doi.org/10.1093/bioinformatics/btr201 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1492015-07-29HighWireOUPbioinfo:27:13

RINQ: Reference-based Indexing for Network Queries Gülsoy, Günhan Kahveci, Tamer PROTEIN INTERACTIONS AND MOLECULAR NETWORKS We consider the problem of similarity queries in biological network databases. Given a database of networks, similarity query returns all the database networks whose similarity (i.e. alignment score) to a given query network is at least a specified similarity cutoff value. Alignment of two networks is a very costly operation, which makes exhaustive comparison of all the database networks with a query impractical. To tackle this problem, we develop a novel indexing method, named RINQ (Reference-based Indexing for Biological Network Queries). Our method uses a set of reference networks to eliminate a large portion of the database quickly for each query. A reference network is a small biological network. We precompute and store the alignments of all the references with all the database networks. When our database is queried, we align the query network with all the reference networks. Using these alignments, we calculate a lower bound and an approximate upper bound to the alignment score of each database network with the query network. With the help of upper and lower bounds, we eliminate the majority of the database networks without aligning them to the query network. We also quickly identify a small portion of these as guaranteed to be similar to the query. We perform pairwise alignment only for the remaining networks. We also propose a supervised method to pick references that have a large chance of filtering the unpromising database networks. 
Extensive experimental evaluation suggests that (i) our method reduced the running time of a single query on a database of around 300 networks from over 2 days to only 8 h; (ii) our method outperformed the state-of-the-art methods Closure Tree and SAGA by a factor of three or more; and (iii) our method successfully identified statistically and biologically significant relationships across networks and organisms. Contact: <inter-ref locator="ggulsoy@cise.ufl.edu" locator-type="email">ggulsoy@cise.ufl.edu</inter-ref>; <inter-ref locator="tamer@cise.ufl.edu" locator-type="email">tamer@cise.ufl.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr203/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i149 http://dx.doi.org/10.1093/bioinformatics/btr203 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i152015-07-29HighWireOUPbioinfo:27:13

Environment specific substitution tables improve membrane protein alignment Hill, Jamie R. Kelm, Sebastian Shi, Jiye Deane, Charlotte M. PROTEIN STRUCTURE AND FUNCTION Motivation: Membrane proteins are both abundant and important in cells, but the small number of solved structures restricts our understanding of them. Here we consider whether membrane proteins undergo different substitutions from their soluble counterparts and whether these can be used to improve membrane protein alignments, and therefore improve prediction of their structure. Results: We construct substitution tables for different environments within membrane proteins. As data is scarce, we develop a general metric to assess the quality of these asymmetric tables. Membrane proteins show markedly different substitution preferences from soluble proteins. For example, substitution preferences in lipid tail-contacting parts of membrane proteins are found to be distinct from all environments in soluble proteins, including buried residues. A principal component analysis of the tables identifies the greatest variation in substitution preferences to be due to changes in hydrophobicity; the second largest variation relates to secondary structure. We demonstrate the use of our tables in pairwise sequence-to-structure alignments (also known as ‘threading’) of membrane proteins using the FUGUE alignment program. On average, in the 10–25% sequence identity range, alignments are improved by 28 correctly aligned residues compared with alignments made using FUGUE's default substitution tables. Our alignments also lead to improved structural models. Availability: Substitution tables are available at: <inter-ref locator="http://www.stats.ox.ac.uk/proteins/resources" locator-type="url">http://www.stats.ox.ac.uk/proteins/resources</inter-ref>. 
Contact: <inter-ref locator="deane@stats.ox.ac.uk" locator-type="email">deane@stats.ox.ac.uk</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i15 http://dx.doi.org/10.1093/bioinformatics/btr230 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1592015-07-29HighWireOUPbioinfo:27:13

Construction of co-complex score matrix for protein complex prediction from AP-MS data Xie, Zhipeng Kwoh, Chee Keong Li, Xiao-Li Wu, Min PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: Protein complexes are of great importance for unraveling the secrets of cellular organization and function. The AP-MS technique has provided an effective high-throughput screening to directly measure the co-complex relationship among multiple proteins, but its performance suffers from both false positives and false negatives. To computationally predict complexes from AP-MS data, most existing approaches either require additional knowledge from known complexes (supervised learning), or have numerous parameters to tune. Method: In this article, we propose a novel unsupervised approach that does not rely on the knowledge of existing complexes. Our method probabilistically calculates the affinity between two proteins, where the affinity score is evaluated by a co-complexed score, or C2S in brief. In particular, our method measures the log-likelihood ratio of two proteins being co-complexed to being drawn randomly, and we then predict protein complexes by applying a hierarchical clustering algorithm to the C2S score matrix. Results: Compared with existing approaches, our approach is computationally efficient and easy to implement. It has just one parameter to set and its value has little effect on the results. It can be applied to different species as long as the AP-MS data are available. Despite its simplicity, it is competitive or superior in performance over many aspects when compared with the state-of-the-art predictions performed by supervised or unsupervised approaches. 
Availability: The predicted complex sets in this article are available in the Supplementary information or by sending email to <inter-ref locator="asckkwoh@ntu.edu.sg" locator-type="email">asckkwoh@ntu.edu.sg</inter-ref> Contact: <inter-ref locator="xlli@i2r.a-star.edu.sg" locator-type="email">xlli@i2r.a-star.edu.sg</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr212/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i159 http://dx.doi.org/10.1093/bioinformatics/btr212 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1672015-07-29HighWireOUPbioinfo:27:13

Uncover disease genes by maximizing information flow in the phenome-interactome network Chen, Yong Jiang, Tao Jiang, Rui PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: Pinpointing genes that underlie human inherited diseases among candidate genes in susceptibility genetic regions is the primary step towards the understanding of pathogenesis of diseases. Although several probabilistic models have been proposed to prioritize candidate genes using phenotype similarities and protein–protein interactions, no combinatorial approaches have been proposed in the literature. Results: We propose the first combinatorial approach for prioritizing candidate genes. We first construct a phenome–interactome network by integrating the given phenotype similarity profile, protein–protein interaction network and associations between diseases and genes. Then, we introduce a computational method called MAXIF to maximize the information flow in this network for uncovering genes that underlie diseases. We demonstrate the effectiveness of this method in prioritizing candidate genes through a series of cross-validation experiments, and we show the possibility of using this method to identify diseases with which a query gene may be associated. We demonstrate the competitive performance of our method through a comparison with two existing state-of-the-art methods, and we analyze the robustness of our method with respect to the parameters involved. As an example application, we apply our method to predict driver genes in 50 copy number aberration regions of melanoma. Our method not only identifies several driver genes that have been reported in the literature, but also offers new biological insights into the modular property and transcriptional regulation scheme of these driver genes. 
Contact: <inter-ref locator="ruijiang@tsinghua.edu.cn" locator-type="email">ruijiang@tsinghua.edu.cn</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i167 http://dx.doi.org/10.1093/bioinformatics/btr213 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1772015-07-29HighWireOUPbioinfo:27:13

Physical Module Networks: an integrative approach for reconstructing transcription regulation Novershtern, Noa Regev, Aviv Friedman, Nir PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: Deciphering the complex mechanisms by which regulatory networks control gene expression remains a major challenge. While some studies infer regulation from dependencies between the expression levels of putative regulators and their targets, others focus on measured physical interactions. Results: Here, we present Physical Module Networks, a unified framework that combines a Bayesian model describing modules of co-expressed genes and their shared regulation programs, and a physical interaction graph, describing the protein–protein interactions and protein-DNA binding events that coherently underlie this regulation. Using synthetic data, we demonstrate that a Physical Module Network model has similar recall and improved precision compared to a simple Module Network, as it omits many false positive regulators. Finally, we show the power of Physical Module Networks to reconstruct meaningful regulatory pathways in the genetically perturbed yeast and during the yeast cell cycle, as well as during the response of primary epithelial human cells to infection with H1N1 influenza. Availability: The PMN software is available, free for academic use at <inter-ref locator="http://www.compbio.cs.huji.ac.il/PMN/" locator-type="url">http://www.compbio.cs.huji.ac.il/PMN/</inter-ref>. Contact: <inter-ref locator="aregev@broad.mit.edu" locator-type="email">aregev@broad.mit.edu</inter-ref>; <inter-ref locator="nirf@cs.huji.ac.il" locator-type="email">nirf@cs.huji.ac.il</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i177 http://dx.doi.org/10.1093/bioinformatics/btr222 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1862015-07-29HighWireOUPbioinfo:27:13

Identification of metabolic network models from incomplete high-throughput datasets Berthoumieux, Sara Brilli, Matteo de Jong, Hidde Kahn, Daniel Cinquemani, Eugenio PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: High-throughput measurement techniques for metabolism and gene expression provide a wealth of information for the identification of metabolic network models. Yet, missing observations scattered over the dataset restrict the number of effectively available datapoints and make classical regression techniques inaccurate or inapplicable. Thorough exploitation of the data by identification techniques that explicitly cope with missing observations is therefore of major importance. Results: We develop a maximum-likelihood approach for the estimation of unknown parameters of metabolic network models that relies on the integration of statistical priors to compensate for the missing data. In the context of the linlog metabolic modeling framework, we implement the identification method by an Expectation-Maximization (EM) algorithm and by a simpler direct numerical optimization method. We evaluate performance of our methods by comparison to existing approaches, and show that our EM method provides the best results over a variety of simulated scenarios. We then apply the EM algorithm to a real problem, the identification of a model for the <it>Escherichia coli</it> central carbon metabolism, based on challenging experimental data from the literature. This leads to promising results and allows us to highlight critical identification issues. 
Contact: <inter-ref locator="sara.berthoumieux@inria.fr" locator-type="email">sara.berthoumieux@inria.fr</inter-ref>; <inter-ref locator="eugenio.cinquemani@inria.fr" locator-type="email">eugenio.cinquemani@inria.fr</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr225/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i186 http://dx.doi.org/10.1093/bioinformatics/btr225 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i1962015-07-29HighWireOUPbioinfo:27:13

TREEGL: reverse engineering tree-evolving gene networks underlying developing biological lineages Parikh, Ankur P. Wu, Wei Curtis, Ross E. Xing, Eric P. PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: Estimating gene regulatory networks over biological lineages is central to a deeper understanding of how cells evolve during development and differentiation. However, one challenge in estimating such evolving networks is that their host cells not only contiguously evolve, but also branch over time. For example, a stem cell evolves into two more specialized daughter cells at each division, forming a tree of networks. Another example is in a laboratory setting: a biologist may apply several different drugs individually to malignant cancer cells to analyze the effects of each drug on the cells; the cells treated by one drug may not be intrinsically similar to those treated by another, but rather to the malignant cancer cells they were derived from. Results: We propose a novel algorithm, <it>Treegl</it>, an ℓ<inf>1</inf> plus total variation penalized linear regression method, to effectively estimate multiple gene networks corresponding to cell types related by a tree-genealogy, based on only a few samples from each cell type. <it>Treegl</it> takes advantage of the similarity between related networks along the biological lineage, while at the same time exposing sharp differences between the networks. We demonstrate that our algorithm performs significantly better than existing methods via simulation. Furthermore we explore an application to a breast cancer dataset, and show that our algorithm is able to produce biologically valid results that provide insight into the progression and reversion of breast cancer cells. Availability: Software will be available at <inter-ref locator="http://www.sailing.cs.cmu.edu/" locator-type="url">http://www.sailing.cs.cmu.edu/</inter-ref>. 
Contact: <inter-ref locator="epxing@cs.cmu.edu" locator-type="email">epxing@cs.cmu.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i196 http://dx.doi.org/10.1093/bioinformatics/btr239 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i22015-07-29HighWireOUPbioinfo:27:13

ISMB/ECCB 2011 PROCEEDINGS PAPERS COMMITTEE EDITORIAL Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i2 http://dx.doi.org/10.1093/bioinformatics/btr298 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2052015-07-29HighWireOUPbioinfo:27:13

Optimally discriminative subnetwork markers predict response to chemotherapy Dao, Phuong Wang, Kendric Collins, Colin Ester, Martin Lapuk, Anna Sahinalp, S. Cenk PROTEIN INTERACTIONS AND MOLECULAR NETWORKS Motivation: Molecular profiles of tumour samples have been widely and successfully used for classification problems. A number of algorithms have been proposed to predict classes of tumor samples based on expression profiles with relatively high performance. However, prediction of response to cancer treatment has proved to be more challenging and novel approaches with improved generalizability are still highly needed. Recent studies have clearly demonstrated the advantages of integrating protein–protein interaction (PPI) data with gene expression profiles for the development of subnetwork markers in classification problems. Results: We describe a novel network-based classification algorithm (OptDis) using color coding technique to identify optimally discriminative subnetwork markers. Focusing on PPI networks, we apply our algorithm to drug response studies: we evaluate our algorithm using published cohorts of breast cancer patients treated with combination chemotherapy. We show that our OptDis method improves over previously published subnetwork methods and provides better and more stable performance compared with other subnetwork and single gene methods. We also show that our subnetwork method produces predictive markers that are more reproducible across independent cohorts and offer valuable insight into biological processes underlying response to therapy. 
Availability: The implementation is available at: <inter-ref locator="http://www.cs.sfu.ca/~pdao/personal/OptDis.html" locator-type="url">http://www.cs.sfu.ca/~pdao/personal/OptDis.html</inter-ref> Contact: <inter-ref locator="cenk@cs.sfu.ca" locator-type="email">cenk@cs.sfu.ca</inter-ref>; <inter-ref locator="alapuk@prostatecentre.com" locator-type="email">alapuk@prostatecentre.com</inter-ref>; <inter-ref locator="ccollins@prostatecentre.com" locator-type="email">ccollins@prostatecentre.com</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i205 http://dx.doi.org/10.1093/bioinformatics/btr245 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2142015-07-29HighWireOUPbioinfo:27:13

Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs Kam-Thong, Tony Pütz, Benno Karbalai, Nazanin Müller–Myhsok, Bertram Borgwardt, Karsten DISEASE MODELS AND EPIDEMIOLOGY Motivation: In recent years, numerous genome-wide association studies have been conducted to identify the genetic makeup that explains phenotypic differences observed in human populations. Analytical tests on single loci are readily available and embedded in common genome analysis software toolsets. The search for significant epistasis (gene–gene interactions) still poses a computational challenge for modern-day computing systems, due to the large number of hypotheses that have to be tested. Results: In this article, we present an approach to epistasis detection by exhaustive testing of all possible SNP pairs. The search strategy based on the Hilbert–Schmidt Independence Criterion can help delineate various forms of statistical dependence between the genetic markers and the phenotype. The actual implementation of this search is done on the highly parallelized architecture available on graphics processing units, rendering the completion of the full search feasible within a day. Availability: The program is available at <inter-ref locator="http://www.mpipsykl.mpg.de/epigpuhsic/" locator-type="url">http://www.mpipsykl.mpg.de/epigpuhsic/</inter-ref>. Contact: <inter-ref locator="tony@mpipsykl.mpg.de" locator-type="email">tony@mpipsykl.mpg.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i214 http://dx.doi.org/10.1093/bioinformatics/btr218 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2222015-07-29HighWireOUPbioinfo:27:13

Detecting epistatic effects in association studies at a genomic level based on an ensemble approach Li, Jing Horstman, Benjamin Chen, Yixuan DISEASE MODELS AND EPIDEMIOLOGY Motivation: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes. Results: In this article, we propose an ensemble approach based on boosting to study gene–gene interactions. We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid and efficient for GWAS and that, on disease models with epistasis, it has more power than existing programs. Contact: <inter-ref locator="jingli@case.edu" locator-type="email">jingli@case.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i222 http://dx.doi.org/10.1093/bioinformatics/btr227 en Copyright (C) 2011, Oxford University Press
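The record above mentions an importance score based on Gini impurity but does not define it. As a hedged illustration only, the sketch below computes the standard Gini impurity of a set of case/control labels and the decrease in impurity obtained by splitting samples on one SNP's genotype. The function names and the use of impurity decrease as the score are assumptions, not the authors' definition.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 over the class frequencies p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(genotypes, labels):
    """Impurity of the whole sample minus the size-weighted impurity of the
    groups formed by splitting on genotype (assumed importance score)."""
    n = len(labels)
    groups = {}
    for genotype, label in zip(genotypes, labels):
        groups.setdefault(genotype, []).append(label)
    weighted = sum(len(ys) / n * gini_impurity(ys) for ys in groups.values())
    return gini_impurity(labels) - weighted


if __name__ == "__main__":
    phenotype = [1, 1, 0, 0]
    informative_snp = [0, 0, 1, 1]    # genotype separates cases from controls
    uninformative_snp = [0, 1, 0, 1]  # genotype unrelated to phenotype
    print(gini_decrease(informative_snp, phenotype))
    print(gini_decrease(uninformative_snp, phenotype))
```

A SNP whose genotypes separate cases from controls yields a large decrease, while an unrelated SNP yields a decrease near zero.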

oai:open-archive.highwire.org:bioinfo:27/13/i2302015-07-29HighWireOUPbioinfo:27:13

Efficient spatial segmentation of large imaging mass spectrometry datasets with spatially aware clustering Alexandrov, Theodore Kobarg, Jan Hendrik MASS SPECTROMETRY AND PROTEOMICS Motivation: Imaging mass spectrometry (IMS) is one of the few measurement technologies of biochemistry which, given a thin sample, is able to reveal its spatial chemical composition in the full molecular range. IMS produces a hyperspectral image, where for each pixel a high-dimensional mass spectrum is measured. Currently, the technology is mature enough and one of the major problems preventing its spreading is the under-development of computational methods for mining huge IMS datasets. This article proposes a novel approach for spatial segmentation of an IMS dataset, which is constructed considering the important issue of pixel-to-pixel variability. Methods: We segment pixels by clustering their mass spectra. Importantly, we incorporate spatial relations between pixels into clustering, so that pixels are clustered together with their neighbors. We propose two methods. One is non-adaptive, where pixel neighborhoods are selected in the same manner for all pixels. The second one respects the structure observable in the data. For a pixel, its neighborhood is defined taking into account similarity of its spectrum to the spectra of adjacent pixels. Both methods have linear complexity and require linear memory space (in the number of spectra). Results: The proposed segmentation methods are evaluated on two IMS datasets: a rat brain section and a section of a neuroendocrine tumor. They discover anatomical structure, discriminate the tumor region and highlight functionally similar regions. Moreover, our methods provide segmentation maps of similar or better quality compared with other state-of-the-art methods, but outperform them in runtime and/or required memory. 
Contact: <inter-ref locator="theodore@math.uni-bremen.de" locator-type="email">theodore@math.uni-bremen.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i230 http://dx.doi.org/10.1093/bioinformatics/btr246 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2392015-07-29HighWireOUPbioinfo:27:13

Automatic 3D neuron tracing using all-path pruning Peng, Hanchuan Long, Fuhui Myers, Gene BIOIMAGING Motivation: Digital reconstruction, or tracing, of 3D neuron structures is critical toward reverse engineering the wiring and functions of a brain. However, despite a number of existing studies, this task is still challenging, especially when a 3D microscopic image has low signal-to-noise ratio (SNR) and fragmented neuron segments. Published work can handle these hard situations only by introducing global prior information, such as where a neurite segment starts and terminates. However, manual incorporation of such global information can be very time consuming. Thus, a completely automatic approach for these hard situations is highly desirable. Results: We have developed an automatic graph algorithm, called the all-path pruning (APP), to trace the 3D structure of a neuron. To avoid potential mis-tracing of some parts of a neuron, an APP first produces an initial over-reconstruction, by tracing the optimal geodesic shortest path from the seed location to every possible destination voxel/pixel location in the image. Since the initial reconstruction contains all the possible paths and thus could contain redundant structural components (SC), we simplify the entire reconstruction without compromising its connectedness by pruning the redundant structural elements, using a new maximal-covering minimal-redundant (MCMR) subgraph algorithm. We show that MCMR has a linear computational complexity and will converge. We examined the performance of our method using challenging 3D neuronal image datasets of model organisms (e.g. fruit fly). Availability: The software is available upon request. We plan to eventually release the software as a plugin of the V3D-Neuron package at <inter-ref locator="http://penglab.janelia.org/proj/v3d" locator-type="url">http://penglab.janelia.org/proj/v3d</inter-ref>. 
Contact: <inter-ref locator="pengh@janelia.hhmi.org" locator-type="email">pengh@janelia.hhmi.org</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i239 http://dx.doi.org/10.1093/bioinformatics/btr237 en Copyright (C) 2011, Oxford University Press
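
Illustrative note on the all-path pruning record above: a minimal sketch of the over-reconstruction step, computing shortest paths from one seed to every pixel of a tiny 2D image. The 4-connected grid, the function names and the edge cost (inverse of target-pixel intensity) are assumptions made for illustration, not the APP cost function, and the MCMR pruning step is not shown.

```python
import heapq

def shortest_path_tree(image, seed):
    """Dijkstra from `seed` to every pixel; returns (dist, parent) dicts."""
    rows, cols = len(image), len(image[0])
    dist = {seed: 0.0}
    parent = {seed: None}
    heap = [(0.0, seed)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale heap entry, a shorter path was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Hypothetical cost: bright pixels are cheap to step onto.
                nd = d + 1.0 / (1.0 + image[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    parent[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist, parent

def trace_back(parent, target):
    """Follow parent links from `target` back to the seed."""
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

# Toy image: a bright L-shaped "neurite" on a dark background.
image = [[9, 9, 9],
         [0, 0, 9],
         [0, 0, 9]]
dist, parent = shortest_path_tree(image, (0, 0))
path = trace_back(parent, (2, 2))
```

Every pixel receives a path in this sketch, which is why a later pruning stage is needed to remove the redundant ones.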

oai:open-archive.highwire.org:bioinfo:27/13/i242015-07-29HighWireOUPbioinfo:27:13

Sequence-based prediction of protein crystallization, purification and production propensity Mizianty, Marcin J. Kurgan, Lukasz PROTEIN STRUCTURE AND FUNCTION Motivation: X-ray crystallography-based protein structure determination, which accounts for the majority of solved structures, is characterized by relatively low success rates. One solution is to build tools which support selection of targets that are more likely to crystallize. Several <it>in silico</it> methods that predict propensity of diffraction-quality crystallization from protein chains were developed. We show that the quality of their predictions drops when applied to more recent crystallization trials, which calls for new solutions. We propose a novel approach that alleviates drawbacks of the existing methods by using a recent dataset and improved protocol to annotate progress along the crystallization process, by predicting the success of the entire process and steps which result in the failed attempts, and by utilizing a compact and comprehensive set of sequence-derived inputs to generate accurate predictions. Results: The proposed PPCpred (predictor of protein Production, Purification and Crystallization) predicts the propensity for production of diffraction-quality crystals, production of crystals, purification and production of the protein material. PPCpred utilizes a comprehensive set of inputs based on energy and hydrophobicity indices, composition of certain amino acid types, predicted disorder, secondary structure and solvent accessibility, and content of certain buried and exposed residues. Our method significantly outperforms alignment-based predictions and several modern crystallization propensity predictors. Receiver operating characteristic (ROC) curves show that PPCpred is particularly useful for users who desire high true positive (TP) rates, i.e. a low rate of mispredictions for solvable chains. 
Our model reveals several intuitive factors that influence the success of individual steps and the entire crystallization process, including the content of Cys, buried His and Ser, hydrophobic/hydrophilic segments and the number of predicted disordered segments. Availability: <inter-ref locator="http://biomine.ece.ualberta.ca/PPCpred/" locator-type="url">http://biomine.ece.ualberta.ca/PPCpred/</inter-ref>. Contact: <inter-ref locator="lkurgan@ece.ualberta.ca" locator-type="email">lkurgan@ece.ualberta.ca</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr229/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i24 http://dx.doi.org/10.1093/bioinformatics/btr229 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2482015-07-29HighWireOUPbioinfo:27:13

Tanglegrams for rooted phylogenetic trees and networks Scornavacca, Celine Zickmann, Franziska Huson, Daniel H. EVOLUTION AND COMPARATIVE GENOMICS Motivation: In systematic biology, one is often faced with the task of comparing different phylogenetic trees, in particular in multi-gene analysis or cospeciation studies. One approach is to use a tanglegram in which two rooted phylogenetic trees are drawn opposite each other, using auxiliary lines to connect matching taxa. There is an increasing interest in using rooted phylogenetic networks to represent evolutionary history, so as to explicitly represent reticulate events, such as horizontal gene transfer, hybridization or reassortment. Thus, the question arises how to define and compute a tanglegram for such networks. Results: In this article, we present the first formal definition of a tanglegram for rooted phylogenetic networks and present a heuristic approach for computing one, called the <it>NN-tanglegram</it> method. We compare the performance of our method with existing tree tanglegram algorithms and also show a typical application to real biological datasets. For maximum usability, the algorithm does not require that the trees or networks are bifurcating or bicombining, or that they are on identical taxon sets. Availability: The algorithm is implemented in our program Dendroscope 3, which is freely available from <inter-ref locator="www.dendroscope.org" locator-type="url">www.dendroscope.org</inter-ref>. Contact: <inter-ref locator="scornava@informatik.uni-tuebingen.de" locator-type="email">scornava@informatik.uni-tuebingen.de</inter-ref>; <inter-ref locator="huson@informatik.uni-tuebingen.de" locator-type="email">huson@informatik.uni-tuebingen.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i248 http://dx.doi.org/10.1093/bioinformatics/btr210 en Copyright (C) 2011, Oxford University Press
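
Illustrative note on the tanglegram record above: for two fixed leaf orderings, the number of crossing connector lines equals the number of inverted pairs among the shared taxa. The sketch below counts them; it assumes unique taxon names and one connector per shared taxon, and it is not the NN-tanglegram heuristic, which searches over layouts rather than scoring a single one.

```python
def count_crossings(left_order, right_order):
    """Count crossing connector lines between two leaf orderings."""
    pos = {taxon: i for i, taxon in enumerate(right_order)}
    # Keep only taxa present on both sides; taxon sets need not be identical.
    ranks = [pos[t] for t in left_order if t in pos]
    crossings = 0
    for i in range(len(ranks)):
        for j in range(i + 1, len(ranks)):
            if ranks[i] > ranks[j]:
                crossings += 1  # connectors i and j cross
    return crossings

same = count_crossings(["a", "b", "c", "d"], ["a", "b", "c", "d"])
swapped = count_crossings(["a", "b", "c", "d"], ["b", "a", "d", "c"])
```

A layout method would typically try to rotate subtrees so that this count is as small as possible.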

oai:open-archive.highwire.org:bioinfo:27/13/i2572015-07-29HighWireOUPbioinfo:27:13

Mapping ancestral genomes with massive gene loss: A matrix sandwich problem Gavranovic, Haris Chauve, Cedric Salse, Jérôme Tannier, Eric EVOLUTION AND COMPARATIVE GENOMICS Motivation: Ancestral genomes provide a better way to understand the structural evolution of genomes than the simple comparison of extant genomes. Most ancestral genome reconstruction methods rely on universal markers, that is, homologous families of DNA segments present in exactly one exemplar in every considered species. Complex histories of genes or other markers, undergoing duplications and losses, are rarely taken into account. It follows that some ancestors are inaccessible by these methods, such as the proto–monocotyledon whose evolution involved massive gene loss following a whole genome duplication. Results: We propose a mapping approach based on the combinatorial notion of ‘sandwich consecutive ones matrix’, which explicitly takes gene losses into account. We introduce combinatorial optimization problems related to this concept, and propose a heuristic solver and a lower bound on the optimal solution. We use these results to propose a configuration for the proto-chromosomes of the monocot ancestor, and study the accuracy of this configuration. We also use our method to reconstruct the ancestral boreoeutherian genomes, which illustrates that the framework we propose is not specific to plant paleogenomics but is adapted to reconstruct any ancestral genome from extant genomes with heterogeneous marker content. Availability: Upon request to the authors. Contact: <inter-ref locator="haris.gavranovic@gmail.com" locator-type="email">haris.gavranovic@gmail.com</inter-ref>; <inter-ref locator="eric.tannier@inria.fr" locator-type="email">eric.tannier@inria.fr</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i257 http://dx.doi.org/10.1093/bioinformatics/btr224 en Copyright (C) 2011, Oxford University Press
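
Illustrative note on the matrix sandwich record above: a minimal sketch of the consecutive-ones idea for a fixed column order. It assumes the generic sandwich reading (find a 0/1 matrix M with lower <= M <= upper entrywise whose rows have contiguous 1s) and assumes lower <= upper on input; the function names are hypothetical. The hard part of the real problem, choosing the column order, is not attempted here.

```python
def rows_have_consecutive_ones(matrix):
    """True if, in the given column order, the 1s of every row are contiguous."""
    for row in matrix:
        ones = [j for j, v in enumerate(row) if v == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def sandwich_fill(lower, upper):
    """Return M with lower <= M <= upper and contiguous 1s per row, or None.

    For a fixed column order, the gap between the first and last required 1
    of a row must be filled, so every entry in that span must be allowed
    by `upper`.
    """
    result = []
    for lo_row, up_row in zip(lower, upper):
        ones = [j for j, v in enumerate(lo_row) if v == 1]
        row = [0] * len(lo_row)
        if ones:
            for j in range(ones[0], ones[-1] + 1):
                if up_row[j] != 1:
                    return None  # a forced fill-in is forbidden
                row[j] = 1
        result.append(row)
    return result

filled = sandwich_fill([[1, 0, 1, 0]], [[1, 1, 1, 0]])
blocked = sandwich_fill([[1, 0, 1]], [[1, 0, 1]])
```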

oai:open-archive.highwire.org:bioinfo:27/13/i2662015-07-29HighWireOUPbioinfo:27:13

Predicting site-specific human selective pressure using evolutionary signatures Sadri, Javad Diallo, Abdoulaye Banire Blanchette, Mathieu EVOLUTION AND COMPARATIVE GENOMICS Motivation: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available. Results: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in human based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches. Availability: The predictor and predictions made are available at <inter-ref locator="http://www.mcb.mcgill.ca/~blanchem/sadri" locator-type="url">http://www.mcb.mcgill.ca/~blanchem/sadri</inter-ref>. 
Contact: <inter-ref locator="blanchem@mcb.mcgill.ca" locator-type="email">blanchem@mcb.mcgill.ca</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i266 http://dx.doi.org/10.1093/bioinformatics/btr241 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2752015-07-29HighWireOUPbioinfo:27:13

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions Lin, Michael F. Jungreis, Irwin Kellis, Manolis EVOLUTION AND COMPARATIVE GENOMICS Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. Results: We show that PhyloCSF's classification performance in 12-species <it>Drosophila</it> genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures. Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at <inter-ref locator="http://compbio.mit.edu/PhyloCSF" locator-type="url">http://compbio.mit.edu/PhyloCSF</inter-ref> Contact: <inter-ref locator="mlin@mit.edu" locator-type="email">mlin@mit.edu</inter-ref>; <inter-ref locator="manoli@mit.edu" locator-type="email">manoli@mit.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i275 http://dx.doi.org/10.1093/bioinformatics/btr209 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2832015-07-29HighWireOUPbioinfo:27:13

The role of proteosome-mediated proteolysis in modulating potentially harmful transcription factor activity in Saccharomyces cerevisiae Bonzanni, Nicola Zhang, Nianshu Oliver, Stephen G. Fisher, Jasmin APPLIED BIOINFORMATICS Motivation: The appropriate modulation of the stress response to variable environmental conditions is necessary to maintain sustained viability in <it>Saccharomyces cerevisiae</it>. Particularly, controlling the abundance of proteins that may have detrimental effects on cell growth is crucial for rapid recovery from stress-induced quiescence. Results: Prompted by qualitative modeling of the nutrient starvation response in yeast, we investigated <it>in vivo</it> the effect of proteolysis after nutrient starvation showing that, for the Gis1 transcription factor at least, proteasome-mediated control is crucial for a rapid return to growth. Additional bioinformatics analyses show that potentially toxic transcriptional regulators have a significantly lower protein half-life, a higher fraction of unstructured regions and more potential PEST motifs than the non-detrimental ones. Furthermore, inhibiting proteasome activity tends to increase the expression of genes induced during the Environmental Stress Response more than those in the rest of the genome. Our combined results suggest that proteasome-mediated proteolysis of potentially toxic transcription factors tightly modulates the stress response in yeast. Contact: <inter-ref locator="jasmin.fisher@microsoft.com" locator-type="email">jasmin.fisher@microsoft.com</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr211/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. 
Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i283 http://dx.doi.org/10.1093/bioinformatics/btr211 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i2882015-07-29HighWireOUPbioinfo:27:13

Mixed-model coexpression: calculating gene coexpression while accounting for expression heterogeneity Furlotte, Nicholas A. Kang, Hyun Min Ye, Chun Eskin, Eleazar APPLIED BIOINFORMATICS Motivation: The analysis of gene coexpression is at the core of many types of genetic analysis. The coexpression between two genes can be calculated by using a traditional Pearson's correlation coefficient. However, unobserved confounding effects may cause inflation of the Pearson's correlation so that uncorrelated genes appear correlated. Many general methods have been suggested, which aim to remove the effects of confounding from gene expression data. However, the residual confounding which is not accounted for by these generic correction procedures has the potential to induce correlation between genes. Therefore, a method that specifically aims to calculate gene coexpression between gene expression arrays, while accounting for confounding effects, is desirable. Results: In this article, we present a statistical model for calculating gene coexpression called mixed model coexpression (MMC), which models coexpression within a mixed model framework. Confounding effects are expected to be encoded in the matrix representing the correlation between arrays, the inter-sample correlation matrix. By conditioning on the information in the inter-sample correlation matrix, MMC is able to produce gene coexpressions that are not influenced by global confounding effects and thus significantly reduce the number of spurious coexpressions observed. We applied MMC to both human and yeast datasets and show it is better able to effectively prioritize strong coexpressions when compared to a traditional Pearson's correlation and a Pearson's correlation applied to data corrected with surrogate variable analysis (SVA). 
Availability: The method is implemented in the R programming language and may be found at <inter-ref locator="http://genetics.cs.ucla.edu/mmc" locator-type="url">http://genetics.cs.ucla.edu/mmc</inter-ref>. Contact: <inter-ref locator="nfurlott@cs.ucla.edu" locator-type="email">nfurlott@cs.ucla.edu</inter-ref>; <inter-ref locator="eeskin@cs.ucla.edu" locator-type="email">eeskin@cs.ucla.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i288 http://dx.doi.org/10.1093/bioinformatics/btr221 en Copyright (C) 2011, Oxford University Press
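
Illustrative note on the mixed-model coexpression record above: a small deterministic sketch of how a shared confounder can inflate Pearson's correlation between two otherwise uncorrelated genes. The toy values, the batch labels and the correction used here (subtracting known per-batch means) are assumptions for illustration only; this is a simple stand-in and not the MMC mixed model, which conditions on the inter-sample correlation matrix instead of known labels.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def remove_group_means(values, groups):
    """Subtract each group's mean from its members (known-confounder case)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Two genes whose within-batch variation is uncorrelated, plus a batch
# effect of +5 on both genes in the second batch (hypothetical values).
batch = [0, 0, 0, 0, 1, 1, 1, 1]
gene_a = [1, -1, 1, -1, 6, 4, 6, 4]
gene_b = [1, 1, -1, -1, 6, 6, 4, 4]

raw_r = pearson(gene_a, gene_b)
corrected_r = pearson(remove_group_means(gene_a, batch),
                      remove_group_means(gene_b, batch))
```

In this toy case the raw correlation is high purely because of the batch shift, and it vanishes once the shift is removed.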

oai:open-archive.highwire.org:bioinfo:27/13/i2952015-07-29HighWireOUPbioinfo:27:13

A generalized model for multi-marker analysis of cell cycle progression in synchrony experiments Mayhew, Michael B. Robinson, Joshua W. Jung, Boyoun Haase, Steven B. Hartemink, Alexander J. APPLIED BIOINFORMATICS Motivation: To advance understanding of eukaryotic cell division, it is important to observe the process precisely. To this end, researchers monitor changes in dividing cells as they traverse the cell cycle, with the presence or absence of morphological or genetic markers indicating a cell's position in a particular interval of the cell cycle. A wide variety of marker data is available, including information-rich cellular imaging data. However, few formal statistical methods have been developed to use these valuable data sources in estimating how a population of cells progresses through the cell cycle. Furthermore, existing methods are designed to handle only a single binary marker of cell cycle progression at a time. Consequently, they cannot facilitate comparison of experiments involving different sets of markers. Results: Here, we develop a new sampling model to accommodate an arbitrary number of different binary markers that characterize the progression of a population of dividing cells along a branching process. We engineer a strain of <it>Saccharomyces cerevisiae</it> with fluorescently labeled markers of cell cycle progression, and apply our new model to two image datasets we collected from the strain, as well as an independent dataset of different markers. We use our model to estimate the duration of post-cytokinetic attachment between an <it>S. cerevisiae</it> mother and daughter cell. The Java implementation is fast and extensible, and includes a graphical user interface. Our model provides a powerful and flexible cell cycle analysis tool, suitable for any type or combination of binary markers. 
Availability: The software is available from: <inter-ref locator="http://www.cs.duke.edu/~amink/software/cloccs/" locator-type="url">http://www.cs.duke.edu/~amink/software/cloccs/</inter-ref>. Contact: <inter-ref locator="michael.mayhew@duke.edu" locator-type="email">michael.mayhew@duke.edu</inter-ref>; <inter-ref locator="amink@cs.duke.edu" locator-type="email">amink@cs.duke.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i295 http://dx.doi.org/10.1093/bioinformatics/btr244 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i3042015-07-29HighWireOUPbioinfo:27:13

Systematic exploration of error sources in pyrosequencing flowgram data Balzer, Susanne Malde, Ketil Jonassen, Inge APPLIED BIOINFORMATICS Motivation: 454 pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance and cost, but shows higher per-base error rates. Although there are several tools available for noise removal, targeting different application fields, data interpretation would benefit from a better understanding of the different error types. Results: By exploring 454 raw data, we quantify to what extent different factors account for sequencing errors. In addition to the well-known homopolymer length inaccuracies, we have identified errors likely to originate from other stages of the sequencing process. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim. Availability: The flowsim pipeline is freely available under the General Public License from <inter-ref locator="http://biohaskell.org/Applications/FlowSim" locator-type="url">http://biohaskell.org/Applications/FlowSim</inter-ref>. Contact: <inter-ref locator="susanne.balzer@imr.no" locator-type="email">susanne.balzer@imr.no</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i304 http://dx.doi.org/10.1093/bioinformatics/btr251 en Copyright (C) 2011, Oxford University Press
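
Illustrative note on the pyrosequencing record above: a minimal sketch of why homopolymer lengths are the classic error source in flowgram data. Each flow reports a signal roughly proportional to the number of identical bases incorporated, so a noisy signal rounds to the wrong run length. The cyclic flow order "TACG", the `max_flows` cap and the rounding rule are assumptions for illustration; this is not the flowsim error model.

```python
def ideal_flowgram(sequence, flow_order="TACG", max_flows=400):
    """Noise-free flow values: one homopolymer length per nucleotide flow."""
    flows = []
    pos = 0
    i = 0
    while pos < len(sequence) and i < max_flows:
        base = flow_order[i % len(flow_order)]
        run = 0
        # Consume the whole homopolymer run matching the current flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        flows.append(run)
        i += 1
    return flows

def call_bases(flow_values, flow_order="TACG"):
    """Round each flow value to the nearest non-negative run length."""
    called = []
    for i, value in enumerate(flow_values):
        run = max(0, int(value + 0.5))
        called.append(flow_order[i % len(flow_order)] * run)
    return "".join(called)

clean = ideal_flowgram("TTACGG")
noisy_call = call_bases([2.6, 1.0, 1.0, 2.0])  # first signal drifted upward
```

Here a signal of 2.6 for a true run of two T bases is called as three, which is the kind of over-call the abstract refers to as a homopolymer length inaccuracy.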

oai:open-archive.highwire.org:bioinfo:27/13/i3102015-07-29HighWireOUPbioinfo:27:13

An enhanced Petri-net model to predict synergistic effects of pairwise drug combinations from gene microarray data Jin, Guangxu Zhao, Hong Zhou, Xiaobo Wong, Stephen T. C. APPLIED BIOINFORMATICS Motivation: Prediction of synergistic effects of drug combinations has traditionally relied on phenotypic response data. However, such methods cannot be used to identify molecular signaling mechanisms of synergistic drug combinations. In this article, we propose an enhanced Petri-Net (EPN) model to recognize the synergistic effects of drug combinations from the molecular response profiles, i.e. drug-treated microarray data. Methods: We addressed the downstream signaling network of the targets for the two individual drugs used in the pairwise combinations and applied EPN to the identified targeted signaling network. In EPN, drugs and signaling molecules are assigned to different types of places, while drug doses and molecular expressions are denoted by color tokens. The changes of molecular expressions caused by treatments of drugs are simulated by two actions of EPN: firing and blasting. Firing is to transit the drug and molecule tokens from one node or place to another, and blasting is to reduce the number of molecule tokens by drug tokens in a molecule node. The goal of EPN is to mediate the state characterized by control condition without any treatment to that of treatment and to depict the drug effects on molecules by the drug tokens. Results: We applied EPN to our generated pairwise drug combination microarray data. The synergistic predictions using EPN are consistent with those predicted using phenotypic response data. The molecules responsible for the synergistic effects with their associated feedback loops display the mechanisms of synergism. Availability: The software, implemented in the Python 2.7 programming language, is available on request. 
Contact: <inter-ref locator="stwong@tmhs.org" locator-type="email">stwong@tmhs.org</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i310 http://dx.doi.org/10.1093/bioinformatics/btr202 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i3172015-07-29HighWireOUPbioinfo:27:13

Accurate estimation of heritability in genome wide studies using random effects models Golan, David Rosset, Saharon POPULATION GENOMICS Motivation: Random effects models have recently been introduced as an approach for analyzing genome wide association studies (GWASs), which allows estimation of overall heritability of traits without explicitly identifying the genetic loci responsible. Using this approach, <cross-ref type="bib" refid="B23">Yang <it>et al.</it> (2010</cross-ref>) have demonstrated that the heritability of height is much higher than the ~10% associated with identified genetic factors. However, <cross-ref type="bib" refid="B23">Yang <it>et al.</it> (2010</cross-ref>) relied on a heuristic for performing estimation in this model. Results: We adopt the model framework of <cross-ref type="bib" refid="B23">Yang <it>et al.</it> (2010</cross-ref>) and develop a method for maximum-likelihood (ML) estimation in this framework. Our method is based on Monte-Carlo expectation-maximization (MCEM; <cross-ref type="bib" refid="B19">Wei <it>et al.</it>, 1990</cross-ref>), an expectation-maximization algorithm wherein a Markov chain Monte Carlo approach is used in the E-step. We demonstrate that this method leads to more stable and accurate heritability estimation compared to the approach of <cross-ref type="bib" refid="B23">Yang <it>et al.</it> (2010</cross-ref>), and it also allows us to find ML estimates of the portion of markers which are causal, indicating whether the heritability stems from a small number of powerful genetic factors or a large number of less powerful ones. Contact: <inter-ref locator="saharon@post.tau.ac.il" locator-type="email">saharon@post.tau.ac.il</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i317 http://dx.doi.org/10.1093/bioinformatics/btr219 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i3242015-07-29HighWireOUPbioinfo:27:13

StructHDP: automatic inference of number of clusters and population structure from admixed genotype data Shringarpure, Suyash Won, Daegun Xing, Eric P. POPULATION GENOMICS Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user. Methods: We present StructHDP, which is a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, <it>Structure</it> and <it>Structurama</it>. Using a Hierarchical Dirichlet Process (HDP), we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data. Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with <it>Structurama</it> shows the utility of combining HDPs with the <it>Structure</it> model. We used StructHDP to analyze a dataset of 155 Taita thrush, <it>Turdus helleri</it>, which has been previously analyzed using <it>Structure</it> and <it>Structurama</it>. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using <it>Structure</it> for the optimal number of populations. We also analyzed data from 1048 individuals from the Human Genome Diversity project from 53 world populations. 
We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset. Availability: StructHDP is written in C++. The code will be available for download at <inter-ref locator="http://www.sailing.cs.cmu.edu/structhdp" locator-type="url">http://www.sailing.cs.cmu.edu/structhdp</inter-ref>. Contact: <inter-ref locator="suyash@cs.cmu.edu" locator-type="email">suyash@cs.cmu.edu</inter-ref>; <inter-ref locator="epxing@cs.cmu.edu" locator-type="email">epxing@cs.cmu.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i324 http://dx.doi.org/10.1093/bioinformatics/btr242 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i3332015-07-29HighWireOUPbioinfo:27:13

Reconstruction of genealogical relationships with applications to Phase III of HapMap Kyriazopoulou-Panagiotopoulou, Sofia Kashef Haghighi, Dorna Aerni, Sarah J. Sundquist, Andreas Bercovici, Sivan Batzoglou, Serafim POPULATION GENOMICS Motivation: Accurate inference of genealogical relationships between pairs of individuals is paramount in association studies, forensics and evolutionary analyses of wildlife populations. Current methods for relationship inference consider only a small set of close relationships and have limited to no power to distinguish between relationships with the same number of meioses separating the individuals under consideration (e.g. aunt–niece versus niece–aunt or first cousins versus great aunt–niece). Results: We present CARROT (ClAssification of Relationships with ROTations), a novel framework for relationship inference that leverages linkage information to differentiate between <it>rotated</it> relationships, that is, between relationships with the same number of common ancestors and the same number of meioses separating the individuals under consideration. We demonstrate that CARROT clearly outperforms existing methods on simulated data. We also applied CARROT on four populations from Phase III of the HapMap Project and detected previously unreported pairs of third- and fourth-degree relatives. Availability: Source code for CARROT is freely available at <inter-ref locator="http://carrot.stanford.edu" locator-type="url">http://carrot.stanford.edu</inter-ref>. Contact: <inter-ref locator="sofiakp@stanford.edu" locator-type="email">sofiakp@stanford.edu</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i333 http://dx.doi.org/10.1093/bioinformatics/btr243 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i342015-07-29HighWireOUPbioinfo:27:13

A method for probing the mutational landscape of amyloid structure O'Donnell, Charles W. Waldispühl, Jérôme Lis, Mieszko Halfmann, Randal Devadas, Srinivas Lindquist, Susan Berger, Bonnie PROTEIN STRUCTURE AND FUNCTION Motivation: Proteins of all kinds can self-assemble into highly ordered β-sheet aggregates known as amyloid fibrils, important both biologically and clinically. However, the specific molecular structure of a fibril can vary dramatically depending on sequence and environmental conditions, and mutations can drastically alter amyloid function and pathogenicity. Experimental structure determination has proven extremely difficult with only a handful of NMR-based models proposed, suggesting a need for computational methods. Results: We present AmyloidMutants, a statistical mechanics approach for <it>de novo</it> prediction and analysis of wild-type and mutant amyloid structures. Based on the premise of protein <it>mutational landscapes</it>, AmyloidMutants energetically quantifies the effects of sequence mutation on fibril conformation and stability. Tested on non-mutant, full-length amyloid structures with known chemical shift data, AmyloidMutants offers roughly 2-fold improvement in prediction accuracy over existing tools. Moreover, AmyloidMutants is the only method to predict complete super-secondary structures, enabling accurate discrimination of topologically dissimilar amyloid conformations that correspond to the same sequence locations. Applied to mutant prediction, AmyloidMutants identifies a global conformational switch between Aβ and its highly-toxic ‘Iowa’ mutant in agreement with a recent experimental model based on partial chemical shift data. Predictions on mutant, yeast-toxic strains of HET-s suggest similar alternate folds. 
When applied to HET-s and a HET-s mutant with core asparagines replaced by glutamines (both highly amyloidogenic chemically similar residues abundant in many amyloids), AmyloidMutants surprisingly predicts a greatly reduced capacity of the glutamine mutant to form amyloid. We confirm this finding by conducting mutagenesis experiments. Availability: Our tool is publicly available on the web at <inter-ref locator="http://amyloid.csail.mit.edu/" locator-type="url">http://amyloid.csail.mit.edu/</inter-ref>. Contact: <inter-ref locator="lindquist_admin@wi.mit.edu" locator-type="email">lindquist_admin@wi.mit.edu</inter-ref>; <inter-ref locator="bab@csail.mit.edu" locator-type="email">bab@csail.mit.edu</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr238/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i34 http://dx.doi.org/10.1093/bioinformatics/btr238 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i342 2015-07-29 HighWire OUP bioinfo:27:13

ccSVM: correcting Support Vector Machines for confounding factors in biological data classification Li, Limin Rakitsch, Barbara Borgwardt, Karsten DATABASES AND ONTOLOGIES Motivation: Classifying biological data into different groups is a central task of bioinformatics: for instance, to predict the function of a gene or protein, the disease state of a patient or the phenotype of an individual based on its genotype. Support Vector Machines are a widespread approach for classifying biological data, due to their high accuracy, their ability to deal with structured data such as strings, and the ease of integrating various types of data. However, it is unclear how to correct for confounding factors such as population structure, age, gender or experimental conditions in Support Vector Machine classification. Results: In this article, we present a Support Vector Machine classifier that can correct the prediction for observed confounding factors. This is achieved by minimizing the statistical dependence between the classifier and the confounding factors. We prove that this formulation can be transformed into a standard Support Vector Machine with rescaled input data. In our experiments, our confounder-correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity and gender, and phenotype prediction in the presence of population structure, and outperforms state-of-the-art methods in terms of prediction accuracy. Availability: A ccSVM implementation in MATLAB is available from <inter-ref locator="http://webdav.tuebingen.mpg.de/u/karsten/Forschung/ISMB11_ccSVM/" locator-type="url">http://webdav.tuebingen.mpg.de/u/karsten/Forschung/ISMB11_ccSVM/</inter-ref>. 
Contact: <inter-ref locator="limin.li@tuebingen.mpg.de" locator-type="email">limin.li@tuebingen.mpg.de</inter-ref>; <inter-ref locator="karsten.borgwardt@tuebingen.mpg.de" locator-type="email">karsten.borgwardt@tuebingen.mpg.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i342 http://dx.doi.org/10.1093/bioinformatics/btr204 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i349 2015-07-29 HighWire OUP bioinfo:27:13

Ontology patterns for tabular representations of biomedical knowledge on neglected tropical diseases Santana, Filipe Schober, Daniel Medeiros, Zulma Freitas, Fred Schulz, Stefan DATABASES AND ONTOLOGIES Motivation: Ontology-like domain knowledge is frequently published in a tabular format embedded in scientific publications. We explore the re-use of such tabular content in the process of building NTDO, an ontology of neglected tropical diseases (NTDs), where the representation of the interdependencies between hosts, pathogens and vectors plays a crucial role. Results: As a proof of concept, we analyzed a tabular compilation of knowledge about pathogens, vectors and geographic locations involved in the transmission of NTDs. After a thorough ontological analysis of the domain of interest, we formulated a comprehensive design pattern, rooted in the biomedical domain upper-level ontology BioTop. This pattern was implemented in a VBA script which takes cell contents of an Excel spreadsheet and transforms them into OWL-DL. After minor manual post-processing, the correctness and completeness of the ontology were tested using pre-formulated competence questions as description logics (DL) queries. The expected results could be reproduced by the ontology. The proposed approach is recommended for optimizing the acquisition of ontological domain knowledge from tabular representations. Availability and implementation: Domain examples, source code and ontology are freely available on the web at <inter-ref locator="http://www.cin.ufpe.br/~ntdo" locator-type="url">http://www.cin.ufpe.br/~ntdo</inter-ref>. Contact: <inter-ref locator="fss3@cin.ufpe.br" locator-type="email">fss3@cin.ufpe.br</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i349 http://dx.doi.org/10.1093/bioinformatics/btr226 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i357 2015-07-29 HighWire OUP bioinfo:27:13

Detection and interpretation of metabolite-transcript coresponses using combined profiling data Redestig, Henning Costa, Ivan G. GENE REGULATION AND TRANSCRIPTOMICS Motivation: Studying the interplay between gene expression and metabolite levels can yield important information on the physiology of stress responses and adaptation strategies. Performing transcriptomics and metabolomics in parallel during time-series experiments represents a systematic way to gain such information. Several combined profiling datasets have been added to the public domain and they form a valuable resource for hypothesis-generating studies. Unfortunately, detecting coresponses between transcript levels and metabolite abundances is non-trivial: they cannot be assumed to overlap directly with underlying biochemical pathways and they may be subject to time delays and obscured by considerable noise. Results: Our aim was to predict pathway comemberships between metabolites and genes based on their coresponses to applied stress. We found that in the presence of strong noise and time-shifted responses, a hidden Markov model-based similarity outperforms the simpler Pearson correlation but performs comparably or worse in their absence. Therefore, we propose a supervised method that applies pathway information to combine similarity statistics into a consensus statistic that is more informative than any of the single measures. Using four combined profiling datasets, we show that comembership between metabolites and genes can be predicted for numerous KEGG pathways; this opens opportunities for the detection of transcriptionally regulated pathways and novel metabolically related genes. Availability: A command-line software tool is available at <inter-ref locator="http://www.cin.ufpe.br/~igcf/Metabolites" locator-type="url">http://www.cin.ufpe.br/~igcf/Metabolites</inter-ref>. 
Contact: <inter-ref locator="henning@psc.riken.jp" locator-type="email">henning@psc.riken.jp</inter-ref>; <inter-ref locator="igcf@cin.ufpe.br" locator-type="email">igcf@cin.ufpe.br</inter-ref> Supplementary information: <inter-ref locator="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr231/DC1" locator-type="url">Supplementary data</inter-ref> are available at <it>Bioinformatics</it> online. Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i357 http://dx.doi.org/10.1093/bioinformatics/btr231 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i366 2015-07-29 HighWire OUP bioinfo:27:13

From sets to graphs: towards a realistic enrichment analysis of transcriptomic systems Geistlinger, Ludwig Csaba, Gergely Küffner, Robert Mulder, Nicola Zimmer, Ralf GENE REGULATION AND TRANSCRIPTOMICS Motivation: Current gene set enrichment approaches do not take interactions and associations between set members into account. Mutual activation and inhibition causing positive and negative correlation among set members are thus neglected. As a consequence, inconsistent regulations and contextless expression changes are reported and, thus, the biological interpretation of the result is impeded. Results: We analyzed established gene set enrichment methods and their result sets in a large-scale investigation of 1000 expression datasets. The reported statistically significant gene sets exhibit only average consistency between the observed patterns of differential expression and known regulatory interactions. We present <it>Gene Graph Enrichment Analysis</it> (GGEA) to detect consistently and coherently enriched gene sets, based on prior knowledge derived from directed gene regulatory networks. Firstly, GGEA improves the concordance of pairwise regulation with individual expression changes in respective pairs of regulating and regulated genes, compared with set enrichment methods. Secondly, GGEA yields result sets where a large fraction of relevant expression changes can be explained by nearby regulators, such as transcription factors, again improving on set-based methods. Thirdly, we demonstrate in additional case studies that GGEA can be applied to human regulatory pathways, where it sensitively detects very specific regulation processes, which are altered in tumors of the central nervous system. GGEA significantly increases the detection of gene sets where measured positively or negatively correlated expression patterns coincide with directed inducing or repressing relationships, thus facilitating further interpretation of gene expression data. 
Availability: The method and accompanying visualization capabilities have been bundled into an <ty>R</ty> package and tied to a graphical user interface, the <ty>Galaxy</ty> workflow environment, that is running as a web server. Contact: <inter-ref locator="Ludwig.Geistlinger@bio.ifi.lmu.de" locator-type="email">Ludwig.Geistlinger@bio.ifi.lmu.de</inter-ref>; <inter-ref locator="Ralf.Zimmer@bio.ifi.lmu.de" locator-type="email">Ralf.Zimmer@bio.ifi.lmu.de</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i366 http://dx.doi.org/10.1093/bioinformatics/btr228 en Copyright (C) 2011, Oxford University Press

oai:open-archive.highwire.org:bioinfo:27/13/i374 2015-07-29 HighWire OUP bioinfo:27:13

Small sets of interacting proteins suggest functional linkage mechanisms via Bayesian analogical reasoning Airoldi, Edoardo M. Heller, Katherine A. Silva, Ricardo GENE REGULATION AND TRANSCRIPTOMICS Motivation: Proteins and protein complexes coordinate their activity to execute cellular functions. In a number of experimental settings, including synthetic genetic arrays, genetic perturbations and RNAi screens, scientists identify a small set of protein interactions of interest. A working hypothesis is often that these interactions are the observable phenotypes of some functional process, which is not directly observable. Confirmatory analysis requires finding other pairs of proteins whose interaction may be additional phenotypical evidence about the same functional process. Extant methods for finding additional protein interactions rely heavily on the information in the newly identified set of interactions. For instance, these methods leverage the attributes of the individual proteins directly, in a supervised setting, in order to find relevant protein pairs. A small set of protein interactions provides a small sample to train parameters of prediction methods, thus leading to low confidence. Results: We develop RBSets, a computational approach to ranking protein interactions rooted in analogical reasoning; that is, the ability to learn and generalize relations between objects. Our approach is tailored to situations where the training set of protein interactions is small, and leverages the attributes of the individual proteins indirectly, in a Bayesian ranking setting that is perhaps closest to propensity scoring in mathematical psychology. We find that RBSets leads to good performance in identifying additional interactions starting from a small evidence set of interacting proteins, for which an underlying biological logic in terms of functional processes and signaling pathways can be established with some confidence. 
Our approach is scalable and can be applied to large databases with minimal computational overhead. Our results suggest that analogical reasoning within a Bayesian ranking problem is a promising new approach for real-time biological discovery. Availability: Java code is available at: <inter-ref locator="www.gatsby.ucl.ac.uk/~rbas" locator-type="url">www.gatsby.ucl.ac.uk/~rbas</inter-ref>. Contact: <inter-ref locator="airoldi@fas.harvard.edu" locator-type="email">airoldi@fas.harvard.edu</inter-ref>; <inter-ref locator="kheller@mit.edu" locator-type="email">kheller@mit.edu</inter-ref>; <inter-ref locator="ricardo@stats.ucl.ac.uk" locator-type="email">ricardo@stats.ucl.ac.uk</inter-ref> Oxford University Press 2011-07-01 00:00:00.0 TEXT text/html http://bioinformatics.oxfordjournals.org/cgi/content/short/27/13/i374 http://dx.doi.org/10.1093/bioinformatics/btr236 en Copyright (C) 2011, Oxford University Press