Version 2.5.2.0 CRISP Logo CRISP Homepage Help for CRISP Email Us

Abstract

Grant Number: 5R29LM005524-05
Project Title: CLASSIFICATION NEURAL NETWORKS FOR GENOME RESEARCH
PI Information:
Name: WU, CATHY H.
Email: wuc@georgetown.edu
Title: PROFESSOR

Abstract: The long-term objective is to develop the computer technology needed to accomplish the objectives of the Human Genome Project and to apply that technology to the analysis and management of sequencing data. Currently, a database search for sequence similarities is the most direct computational approach to the analysis of genomic information. However, such searches are becoming increasingly prohibitive because of the accelerating growth of sequencing data. The goal of the proposed research is to further develop and enhance a software tool for rapid classification of unknown sequences and to make it available to the genome community. The research will build upon a promising pilot system designed and developed by the principal investigator. The specific aims are (1) to enhance the tool for rapid identification of PIR superfamilies and ProSite patterns, (2) to develop a pilot DNA/RNA classification system, (3) to distribute the tool, and (4) to aid the organization of the PIR protein database and the RDP ribosomal RNA database. In contrast to other search methods, whose search time grows linearly with the number of entries in the database, the search time of the proposed tool grows with the number of families, which is likely to remain low. The tool would automate family assignment, which is especially important for managing the influx of new data in a timely manner. The proposed research applies neural network technology to the database search and organization problem. The major design principles are an encoding scheme to extract sequence information and a modular architecture to scale up backpropagation networks. The encoding algorithm is a hashing function similar to the k-tuple method. A pilot system has been implemented on a Cray supercomputer to classify electron transfer proteins and enzymes. The system achieves about 90% accuracy and runs about 50 times faster than other search methods.
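As a rough illustration of the k-tuple style encoding mentioned above, the sketch below counts overlapping k-letter windows in a sequence and maps each window to a slot in a fixed-length vector that could serve as network input. The function name `ktuple_encode`, the 20-letter amino acid alphabet, the choice k = 2, and the normalization by window count are all assumptions made for illustration; the abstract does not specify these details of the pilot system.

```python
from itertools import product

# Assumed alphabet: the 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ktuple_encode(sequence, k=2, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of normalized k-tuple counts.

    Hypothetical sketch: each possible k-letter word gets one slot,
    so the vector length is len(alphabet) ** k regardless of the
    sequence length.
    """
    # Map every possible k-tuple to a vector position.
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    total = 0
    # Slide a window of width k across the sequence.
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        pos = index.get(window)
        if pos is None:
            # Skip windows containing letters outside the alphabet.
            continue
        counts[pos] += 1
        total += 1
    if total == 0:
        return [0.0] * len(index)
    # Normalize so sequences of different lengths are comparable.
    return [c / total for c in counts]


vector = ktuple_encode("ACDACD", k=2)
print(len(vector))
```

Because the vector length depends only on the alphabet size and k, sequences of any length map to the same input size, which is what a fixed-architecture network needs.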
The tool may be 1000 times faster than other methods within a decade if the database continues to grow at the current rate. In the proposed research, the sensitivity of the tool would be improved and a full-scale system would be developed. The automated software tool would be portable at the source code, user interface, and hardware levels. The system would be updated in accordance with database releases and distributed to the research community via anonymous ftp. The tool would be used to classify PIR sequences according to superfamilies and to classify ribosomal RNA sequences according to phylogenetic relations.
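The projected speed figures can be checked with a back-of-the-envelope calculation. The sketch assumes, following the abstract's scaling argument, that the search time of other methods is proportional to database size while the tool's time stays roughly constant because the number of families stays low. Under that assumption, a rise from a 50-fold to a 1000-fold advantage over ten years implies a 20-fold growth in the database, or roughly 35% per year. The function name and the constant-family assumption are illustrative and do not come from the proposal.

```python
def implied_annual_growth(speedup_now, speedup_later, years):
    """Annual database growth rate implied by a change in relative speed.

    Assumes the speed advantage is proportional to database size,
    i.e. the tool's own search time does not grow over the period.
    """
    growth = speedup_later / speedup_now  # total growth factor
    return growth ** (1.0 / years) - 1.0  # compound annual rate


rate = implied_annual_growth(50, 1000, 10)
print(f"{rate:.1%}")
```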

Public Health Relevance:
Not available.

Thesaurus Terms:
artificial intelligence, computer assisted sequence analysis, computer program /software, genome, information system, computer system design /evaluation, electron transport, nucleic acid sequence, protein sequence

Institution: UNIVERSITY OF TEXAS HLTH CTR AT TYLER
11937 US HIGHWAY 271
TYLER, TX 75708
Fiscal Year: 1997
Department: EPIDEMIOLOGY & BIOMATHEMATICS
Project Start: 01-JUL-1993
Project End: 30-SEP-1999
ICD: NATIONAL LIBRARY OF MEDICINE
IRG: GNM

