Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


99. A Relational Database and Web/CGI Approach in the Analysis and Data Presentation of Large-Scale BAC-EST Hybridization Screens 

Robert Xuequn Xu, Chang-Su Lim, Bum-Chan Park, Mei Wang, Jonghyeob Lee, Aaron Rosin, Eunpyo Moon, Melvin Simon, and Ung-Jin Kim 
Division of Biology, Caltech, Pasadena, CA 91125 
xux@cco.caltech.edu 

Large-scale hybridization screens, in which both probes and targets are very numerous, are frequently used in genome projects to generate positive screening results in bulk. In our project, "Construction of a genome-wide human BAC-Unigene resource", probes (ESTs) are pooled in groups of 20 according to a pre-designed 20x20 matrix, and the pooled, labeled probes are hybridized to BAC library filters. 

We have developed a complete data management, analysis and presentation system for the deconvolution results of our massive BAC-EST screening, built on a relational database tool combined with a web server and several Perl scripts. It features screen-by-screen progress reporting, a detailed description of each probe, automatic generation of statistics reports, and some quality control functions (http://www.tree.caltech.edu/lib_D_Unigene.html). The methodology can be easily adapted to any other large-scale hybridization project that uses a similar probe pooling strategy. With the help of this tool, the size of the probe pooling matrix is virtually unlimited, so the efficiency of hybridization screening can be greatly improved. For example, 10,000 probes organized into a 100x100 matrix require only 200 hybridizations (100 row pools and 100 column pools) instead of 10,000 individual hybridizations. 
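The pooling arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; the function name hybridizations_needed is hypothetical and is not part of the system described here.

```python
def hybridizations_needed(n_probes: int, n_rows: int, n_cols: int) -> int:
    """Number of pooled hybridizations for an n_rows x n_cols probe matrix."""
    if n_rows * n_cols < n_probes:
        raise ValueError("matrix is too small to hold every probe")
    # One hybridization per row pool plus one per column pool.
    return n_rows + n_cols


for probes, side in [(400, 20), (10_000, 100)]:
    pooled = hybridizations_needed(probes, side, side)
    print(f"{probes} probes, {side}x{side} matrix: "
          f"{pooled} pooled vs {probes} individual hybridizations")
```

For a square matrix the number of pooled hybridizations grows as twice the square root of the probe count, which is why larger matrices give larger savings.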

The initial hybridization results are collections of BACs that are positive for particular probe pools, either row pools or column pools. Since there are 20 row pools and 20 column pools, 40 hybridizations are performed. To resolve the individual probe-BAC relationships, the pooled screening results are fed into a relational database scheme. First, a probe table PROBE_RC is constructed with the fields PROBE, ROW, and COL (PROBE is the IMAGE clone ID; ROW is the row number of the probe in the 20x20 matrix; and COL is the probe's column number). In a 20x20 matrix, ROW ranges from 1 to 20 and COL ranges from 21 to 40, so that row pools and column pools share a single numbering. Then, when the pool-wise hybridization results are available, they are entered into the positive BAC table as (BAC, POS) records, in which BAC is the positive BAC address identified from the filter screen and POS is the number of the probe pool for which that BAC is positive. POS therefore ranges from 1 to 40. Finally, after data entry, the relational database tool is used to create views that split the positive BAC table into a BAC_ROW view and a BAC_COL view, and these two views are joined on the BAC field (i.e. BAC_ROW.BAC = BAC_COL.BAC) to create a new view, BAC_ROW_COL, which contains the BACs that appear in both row-pool screens and column-pool screens. 
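The view construction described above might be sketched as follows, using Python's built-in sqlite3 module as a stand-in for the relational database tool (which the abstract does not name). The table name POSITIVE_BAC, the column aliases ROW_POS and COL_POS, the function name bac_row_col, and the BAC addresses in the sample data are all assumptions made for illustration.

```python
import sqlite3


def bac_row_col(hits, n=20):
    """Return (BAC, row pool, column pool) for BACs positive in both pool types.

    hits: iterable of (BAC, POS) records from the pooled screens.
    n:    matrix dimension; row pools are 1..n, column pools are n+1..2n.
    """
    n = int(n)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE POSITIVE_BAC (BAC TEXT, POS INTEGER)")
    conn.executemany("INSERT INTO POSITIVE_BAC VALUES (?, ?)", hits)
    # Split the positive BAC table by pool number.
    conn.execute(f"CREATE VIEW BAC_ROW AS SELECT BAC, POS FROM POSITIVE_BAC "
                 f"WHERE POS BETWEEN 1 AND {n}")
    conn.execute(f"CREATE VIEW BAC_COL AS SELECT BAC, POS FROM POSITIVE_BAC "
                 f"WHERE POS BETWEEN {n + 1} AND {2 * n}")
    # Keep only BACs seen in at least one row pool and one column pool.
    conn.execute("CREATE VIEW BAC_ROW_COL AS "
                 "SELECT BAC_ROW.BAC AS BAC, BAC_ROW.POS AS ROW_POS, "
                 "BAC_COL.POS AS COL_POS "
                 "FROM BAC_ROW JOIN BAC_COL ON BAC_ROW.BAC = BAC_COL.BAC")
    rows = conn.execute("SELECT BAC, ROW_POS, COL_POS FROM BAC_ROW_COL "
                        "ORDER BY BAC, ROW_POS, COL_POS").fetchall()
    conn.close()
    return rows


# Made-up screening records: only B1 is positive in both a row and a column pool.
sample_hits = [("B1", 3), ("B1", 27), ("B2", 5), ("B3", 31)]
print(bac_row_col(sample_hits))
```

BACs that are positive only in a row pool or only in a column pool drop out of the inner join, which is the filtering effect the BAC_ROW_COL view is meant to provide.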

The resulting view BAC_ROW_COL (which consists of the BAC, BAC_ROW.POS and BAC_COL.POS fields) is then joined with the probe table under the join condition "PROBE_RC.ROW = BAC_ROW.POS AND PROBE_RC.COL = BAC_COL.POS", and the final "deconvolution" table is generated by selecting BAC, ROW, COL, and PROBE. This table reveals every possible individual positive BAC-probe relation. It can be grouped, sorted, and reported either through the relational database tool's internal reporting function, which can publish to the Web, or through custom-designed Perl/CGI scripts. 
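As a rough end-to-end illustration, the deconvolution join can be written as a single query in SQLite through Python's sqlite3 module. This is a sketch under stated assumptions, not the project's actual implementation: the table name POSITIVE_BAC, the function name deconvolute, and the probe and BAC identifiers in the sample data are hypothetical, and the intermediate BAC_ROW_COL view is folded into the query for brevity. ROW and COL are quoted because ROW is an SQL keyword.

```python
import sqlite3

SCHEMA = """
CREATE TABLE PROBE_RC (PROBE TEXT, "ROW" INTEGER, "COL" INTEGER);
CREATE TABLE POSITIVE_BAC (BAC TEXT, POS INTEGER);
CREATE VIEW BAC_ROW AS
    SELECT BAC, POS FROM POSITIVE_BAC WHERE POS BETWEEN 1 AND 20;
CREATE VIEW BAC_COL AS
    SELECT BAC, POS FROM POSITIVE_BAC WHERE POS BETWEEN 21 AND 40;
"""

# Same join conditions as in the text, applied in one statement.
DECONVOLUTION = """
SELECT BAC_ROW.BAC, PROBE_RC."ROW", PROBE_RC."COL", PROBE_RC.PROBE
FROM BAC_ROW
JOIN BAC_COL ON BAC_ROW.BAC = BAC_COL.BAC
JOIN PROBE_RC ON PROBE_RC."ROW" = BAC_ROW.POS
             AND PROBE_RC."COL" = BAC_COL.POS
ORDER BY BAC_ROW.BAC, PROBE_RC."ROW", PROBE_RC."COL"
"""


def deconvolute(probes, hits):
    """Return (BAC, ROW, COL, PROBE) candidates for a 20x20 pooling matrix.

    probes: iterable of (PROBE, ROW, COL) with ROW in 1..20 and COL in 21..40.
    hits:   iterable of (BAC, POS) records from the 40 pooled screens.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO PROBE_RC VALUES (?, ?, ?)", probes)
    conn.executemany("INSERT INTO POSITIVE_BAC VALUES (?, ?)", hits)
    result = conn.execute(DECONVOLUTION).fetchall()
    conn.close()
    return result


# Made-up data: B2 is positive in two row pools, so it yields two candidates.
sample_probes = [("EST_A", 3, 27), ("EST_B", 4, 27), ("EST_C", 3, 28)]
sample_hits = [("B1", 3), ("B1", 27), ("B2", 3), ("B2", 4), ("B2", 27)]
for record in deconvolute(sample_probes, sample_hits):
    print(record)
```

A BAC that is positive in r row pools and c column pools produces up to r x c candidate probes, so the rows of this table are best read as possible relations rather than confirmed ones.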


 