
Our Science – Lee Website

Byungkook Lee, Ph.D.

Laboratory of Molecular Biology
Head, Molecular Modeling Section
Senior Investigator
National Cancer Institute
Building 37, Room 5120
37 Convent Dr. MSC 4262
Bethesda, MD 20892-4262
Phone: 301-496-6580
Fax: 301-402-1344
E-Mail: bk@nih.gov

Biography

Dr. Lee received his Ph.D. from Cornell University in 1967 and studied at Yale University. From 1970 to 1980 he taught at the University of Kansas. He has been at the NIH since 1981 and in his present position since 1991.

Research

We use theoretical and computational techniques to help solve biological and medical problems. The current research topics can be grouped into the following five categories:

Gene discovery: We analyze the mRNA and Expressed Sequence Tag (EST) DNA sequence databases to discover genes that are specifically expressed in a particular organ or tumor. The products of such genes can potentially be used as targets for delivery of antitumor agents, for anticancer vaccine development, and for tumor imaging. We also use these databases to discover novel fusion genes resulting from chromosomal rearrangements, which are frequently involved in carcinogenesis. This is a collaborative effort with Dr. Ira Pastan's Molecular Biology group, which experimentally verifies and characterizes these genes.

Comparative analysis of genes and genomes: Comparison of human genes with their evolutionarily related homologs provides invaluable clues to the biological function of the proteins they encode. We collect the homologs of the human genes identified by our Gene Discovery program and attempt to reconstruct their evolutionary history. We also perform a systematic search for human-specific mutations that occurred after the Homo-Pan divergence by comparing the human and chimpanzee genome sequences. These human-specific genetic alterations should be responsible for the generation of human-specific traits.

Immunotoxins: Immunotoxins are man-made molecules constructed by joining an anticancer antibody and a suitable toxin, in our case Pseudomonas exotoxin A. In all molecules under current active consideration, the antibody part is truncated to only the antigen-binding Fv portion of the molecule. The toxin part is modified to delete its own receptor-binding domain. Ideally, these molecules will bind only to the target cancer cells and kill them. Dr. Pastan's group has made many such molecules, each of which has a specific antibody for a particular cancer. Some of these have been or are being tested in phase I clinical trials. We study these molecules and attempt to find ways to improve their properties as effective drugs. This is also a collaborative effort with Dr. Pastan's laboratory.


Protein structure and modeling: We have a long-standing interest in surface areas, volumes, cavities, and stability of protein structures. We build models of the three-dimensional structure of protein molecules by homology modeling and by using both in-house and publicly available fold recognition programs. A predicted structure often helps determine the function of the new genes that we find through the gene discovery program. We also develop tools for these operations. Currently, we are working on (1) improving our automatic protein structure alignment/search algorithm; (2) comparing the results of two very different such algorithms to the manually curated world-standard protein classification database, SCOP, in collaboration with Dr. Peter Munson's group at the Center for Information Technology (CIT) of NIH and with Drs. Jean Garnier and Jean-Francois Gibrat of the Institut National de la Recherche Agronomique, Jouy-en-Josas, France; (3) developing simple graphical means of displaying secondary structures and their interaction patterns; and (4) developing a protein modeling workbench that will enable one to perform various tasks in one convenient package, including construction, prediction, viewing, comparison, modification, and manipulation of protein structures. We also occasionally collaborate with other scientists on protein structure modeling projects.


Hydrophobicity: We study the phenomenon of hydrophobicity by means of statistical thermodynamics. The hydrophobic effect is believed to be one of the main forces that determine the structure, stability, and interaction of proteins and other biologically important molecules. This research is done in collaboration with Prof. Giuseppe Graziano at Universita del Sannio, Benevento, Italy.


SUMMARY OF RESEARCH ACTIVITIES
Following are more specific descriptions of the research activity in the past four years:
1. Gene discovery

1.1. EST clustering
By clustering EST sequences, and working closely with Dr. Pastan's experimental group, we have found some dozen new genes that are specifically expressed in human prostate and breast. Products of some of these genes appear to be promising candidates for anticancer vaccine and immunotoxin delivery targets. We have developed a new procedure for clustering EST sequences that takes full advantage of the human genome sequence. We also developed a database and a local genome browser, patterned after the Human Genome Browser of UCSC, which are tailored for conveniently finding organ-specific EST clusters. We expect to be able to find new prostate- and breast-specific genes more easily using these tools.

1.2. MAPcL project
In order to find more breast-specific genes that code for membrane-bound and secreted proteins, Dr. Pastan's group built a new cDNA library from the membrane-associated polysomal RNA prepared from a set of breast cancer cell lines. This library is called MAPcL (membrane-associated polysomal cDNA library). They then subtracted cDNAs from a set of normal tissues and obtained some 27,000 sequences from the subtracted library. Our analysis of these sequences showed that the membrane-bound protein genes are indeed enriched by some 3.5-fold in this library, that there are over 5,000 sequence clusters that appear to be from genes for which there is no information in any publicly available database, and that nearly 1,200 of these never occur in the current EST or other sequence databases. We have identified two breast- and breast-cancer-specific genes from this library: BASE, which is soluble, and BPSR, which is membrane-bound.

1.3. Finding fusion genes
Most human cancer cells bear chromosomal aberrations, which often create chimeric fusion genes that are involved in cancer pathogenesis. Transcripts from such genes should be present in the expressed sequence tag (EST) database. We developed a semi-automatic procedure to recognize fusion gene transcripts among many chimeric sequences, most of which are artifacts generated during the library construction procedure. Using this procedure, we identified 314 transcript sequences that originated from 237 chimeric fusion genes. Sixty of these are known cases, including the ABL1-BCR fusion that is responsible for 90% of chronic myelogenous leukemia cases. The remaining 177 genes are new cases that were found for the first time. We have experimentally verified the presence of one of these, the IRA1-RGS17 fusion, in the breast cancer cell line MCF7. IRA1 is on chromosome 3q26.32 and encodes a subunit of the nuclear receptor corepressor/HDAC3 complex. RGS17 is on chromosome 6q25.2 and encodes a member of the family of GTPase-activating proteins (Regulator of G-protein Signaling). Nearly half of the new cases are from unremarked, presumably normal, tissues. This work demonstrates that important information on chromosomal aberrations can be obtained from careful analysis of the publicly available expressed sequence databases. It also provides a new tool for identifying the precise location of chromosomal fusion events and potentially tumorigenic fusion genes.
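The idea of separating recurrent fusion transcripts from one-off chimeric artifacts can be illustrated with a minimal sketch. The filtering rule used here (a gene pair must be seen in ESTs from at least two distinct libraries) is an assumption made for illustration only; the actual criteria of the semi-automatic procedure are not described on this page. The `Hit` record, the gene names, and the library names are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    est_id: str    # EST identifier (hypothetical)
    library: str   # cDNA library the EST came from (hypothetical)
    q_start: int   # start of the aligned segment on the EST
    q_end: int     # end of the aligned segment on the EST
    gene: str      # gene that the segment maps to


def chimeric_pairs(hits):
    """Map each EST whose segments hit exactly two genes to (5' gene, 3' gene, library)."""
    by_est = defaultdict(list)
    for h in hits:
        by_est[h.est_id].append(h)
    pairs = {}
    for est_id, segs in by_est.items():
        segs.sort(key=lambda h: h.q_start)
        genes = []
        for h in segs:
            if not genes or genes[-1] != h.gene:
                genes.append(h.gene)
        if len(genes) == 2:
            pairs[est_id] = (genes[0], genes[1], segs[0].library)
    return pairs


def fusion_candidates(hits, min_libraries=2):
    """Keep gene pairs seen in at least `min_libraries` distinct libraries (assumed rule)."""
    libs = defaultdict(set)
    for gene5, gene3, library in chimeric_pairs(hits).values():
        libs[(gene5, gene3)].add(library)
    return sorted(pair for pair, seen in libs.items() if len(seen) >= min_libraries)


hits = [
    Hit("est1", "libA", 1, 200, "GENE_A"), Hit("est1", "libA", 201, 400, "GENE_B"),
    Hit("est2", "libB", 1, 150, "GENE_A"), Hit("est2", "libB", 151, 380, "GENE_B"),
    Hit("est3", "libC", 1, 300, "GENE_C"), Hit("est3", "libC", 301, 500, "GENE_D"),
    Hit("est4", "libA", 1, 400, "GENE_A"),
]
print(fusion_candidates(hits))
```

In this toy input only the GENE_A/GENE_B pair survives, because the GENE_C/GENE_D chimera appears in a single library and est4 maps to one gene. A real pipeline would also need to examine the junction position and the alignment quality.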


2. Comparative analysis of genes and genomes
2.1. Identification of orthologs and paralogs of genes of interest
Identification of evolutionarily related genes within and across species is important because it may provide clues to their function. We identify homologs of the human genes of interest, compare their genomic structure and protein sequences, and collect literature on the homologs. The combined information is invaluable for designing experiments for functional study of the genes. We have collected information on non-human orthologs of human BASE, BPSR, and TEPP. We have also identified many paralogous NGEP and POTE genes in the human genome. In the case of POTE, we have found that the gene family experienced extensive remodeling involving complex rearrangements after the initial formation of the ancestral gene. POTE genes have been found to share an ancestor with a structurally similar gene, ANKRD26, whose mouse homolog can serve as a model gene for the functional study of POTE proteins.

2.2. Human-specific mutations
Humans have many distinct features compared to the great apes, for example, bipedalism, facilitated encephalization, and use of complex language. These must be the result of genetic alterations accumulated in the genome during human evolution. Possible alterations are modification of gene expression patterns, accelerated amino acid substitutions in existing genes, and creation of new genes. Another probable mechanism is gene loss. The recent release of the draft chimpanzee genome assembly provides an invaluable resource for identification of human-specific genetic changes. We developed a simple procedure to identify genes that bear human-specific frameshift and nonsense mutations. The procedure involves comparing human and orthologous chimpanzee sequences, using one or more non-human, non-chimpanzee homolog sequences as the jury. Using this procedure, we identified several genes that have been modified in humans. Two of these are likely to be decaying pseudogenes. The remainder would produce proteins with altered or prematurely terminated C-termini and have been reported or assumed to be functional at least to some degree. Some of these are polymorphic among human individuals.
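The jury logic can be sketched as follows. This is a deliberately crude illustration, not the procedure itself: it assumes each input is an unaligned coding sequence, treats a length that is not a multiple of three as a frameshift, and treats any in-frame stop before the final codon as a nonsense mutation. A real analysis would work on sequence alignments. The function names and the example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_disrupted(cds: str) -> bool:
    """True if the length is not a multiple of 3 or an in-frame stop precedes the last codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        return True
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def classify(human: str, chimp: str, jury: list) -> str:
    """Assign a disruption to a lineage using the outgroup (jury) sequences."""
    h, c = is_disrupted(human), is_disrupted(chimp)
    if not h and not c:
        return "no disruption"
    if not jury or any(is_disrupted(s) for s in jury):
        return "undetermined"
    if h and not c:
        return "human-specific"
    if c and not h:
        return "chimpanzee-specific"
    return "undetermined"


human = "ATGAAATAGCCCTAA"   # premature TAG at codon 3 (made-up sequence)
chimp = "ATGAAATGGCCCTAA"   # intact reading frame (made-up sequence)
jury = ["ATGAAGTGGCCCTAA"]  # intact outgroup (made-up sequence)
print(classify(human, chimp, jury))
```

The outgroup is what gives the call its direction: when the jury agrees with the chimpanzee sequence, the change is attributed to the human lineage.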


3. Immunotoxin
3.1. Protein engineering
We identified mutation sites on three different immunotoxins that, when altered, reduce the non-specific toxicity of the immunotoxin by lowering the isoelectric point (pI) of the molecule. Ira Pastan's group showed experimentally that the designed mutants were indeed significantly less toxic when tested in mice.
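To illustrate how a point mutation shifts the isoelectric point (pI), the sketch below estimates the net charge of a sequence from the Henderson-Hasselbalch relation and finds the pH of zero charge by bisection. The pKa values are approximate and are only one of several sets in common use, the peptide sequences are invented, and the calculation ignores structural effects such as disulfide-bonded cysteines. It is not the method used to choose the actual mutation sites.

```python
# Approximate pKa values (an assumption; published sets differ slightly).
POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM, C_TERM = 8.6, 3.6


def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide at a given pH, treating each group independently."""
    charge = 1 / (1 + 10 ** (ph - N_TERM)) - 1 / (1 + 10 ** (C_TERM - ph))
    for aa in seq:
        if aa in POSITIVE:
            charge += 1 / (1 + 10 ** (ph - POSITIVE[aa]))
        elif aa in NEGATIVE:
            charge -= 1 / (1 + 10 ** (NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisection on pH; the net charge decreases monotonically as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


original = "GKRKG"  # invented peptide with three basic residues
mutant = "GKEKG"    # the arginine replaced by glutamate
print(round(isoelectric_point(original), 2), round(isoelectric_point(mutant), 2))
```

Replacing a basic residue with an acidic one lowers the estimated pI, which is the direction of change described above.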

3.2. Modeling immunotoxin delivery
We have constructed a mathematical model of the process in which the immunotoxin (IT) molecules pass through the blood vessel, diffuse through the intercellular space, bind to the specific antigens, enter the cells that bear these antigens, and finally kill them. The model contains nearly 20 parameters. The values of some of these parameters are known from experiments, but many others are not. We are in the process of determining the values of these parameters by minimizing the difference between the calculated and experimental toxicity data. Once the parameter values are determined, the model should yield quantitative information on the amount of toxin lost during each step of the delivery process. For example, preliminary studies indicate that a surprisingly large amount of toxin is lost inside the cell after the IT-laden receptor is internalized. The model also shows that binding that is too strong can reduce the effectiveness of the drug, and it suggests the optimal binding constant for a given number of receptors per cell, diffusion constant, etc.
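Parameter determination by minimizing the difference between calculated and experimental data can be illustrated with a one-parameter toy. Everything here is an assumption made for illustration: the exponential survival model, the parameter `k`, and the synthetic data points stand in for the real delivery model, which has nearly 20 parameters and is not reproduced on this page.

```python
import math


def model_survival(dose: float, k: float) -> float:
    """Toy model: fraction of cells surviving a given dose (hypothetical form)."""
    return math.exp(-k * dose)


def sse(k: float, data) -> float:
    """Sum of squared differences between calculated and observed survival."""
    return sum((model_survival(dose, k) - observed) ** 2 for dose, observed in data)


def fit_k(data, lo: float = 0.0, hi: float = 5.0, tol: float = 1e-6) -> float:
    """Golden-section search for the k in [lo, hi] that minimizes sse."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if sse(c, data) < sse(d, data):
            b = d
        else:
            a = c
    return (a + b) / 2


# Synthetic "experimental" data generated from k = 0.5, so the fit can be checked.
data = [(dose, math.exp(-0.5 * dose)) for dose in (0.0, 1.0, 2.0, 4.0)]
print(round(fit_k(data), 4))
```

With many parameters the same objective would be handed to a multidimensional optimizer, and parameters that the data do not constrain would need to be fixed from independent experiments.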

4. Protein structure
4.1. Homology modeling
We built three-dimensional structural models of the protein products of a number of new genes that we found. Two genes, PATE and POTE, are of particular interest. The structural model of PATE enabled us to recognize that the 10 cysteines in this structure play structurally important roles. Using the conservation pattern of these key residues, we found that a stretch of about 200,000 base pairs in the chromosomal region 11q24.2 contains 6 genes, all of which are likely to be structurally related to PATE and were found experimentally to be expressed only in prostate and testis. Thus, the expression of the genes in this region of the chromosome appears to be controlled by organ-specific transcription factors or perhaps by other organ-specific epigenetic features. This is a collaborative effort with Dr. Wreschner, who is a visiting scholar in Dr. Pastan's laboratory. The model of POTE indicated that this protein is probably anchored to the inner aspect of the plasma membrane. This was later confirmed experimentally.
We also built models of a set of prokaryotic replication initiator proteins, including the E. coli protein repA. Dr. Sue Wickner of this laboratory made mutations in the repA on the basis of the model and obtained results that confirm the model.

4.2. Pair-to-pair (PTPSM) and Context-specific (CSSM) scoring matrices
All current sequence comparison algorithms use single-residue comparisons, i.e., the alignment score is the sum of the scores assigned to each aligned position of the two sequences being compared. In contrast, we devised a set of score matrices for aligning a pair of residues at a time from each sequence. We implemented these score matrices (PTPSM, for pair-to-pair substitution matrix) in FASTA and observed that their use significantly improved the ability to identify structurally similar proteins compared with the conventional single-residue substitution matrix. We are now working to implement the new matrices in BLAST, which is a more powerful and popular sequence alignment program than FASTA.
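The difference between the two scoring styles can be shown with a toy example for an ungapped alignment. All score values are invented, and the way the pair scores are combined (summing over overlapping consecutive residue pairs, with a fallback to single-residue scores when a pair is not tabulated) is an assumption for illustration; it is not the published PTPSM definition.

```python
def single_score(a: str, b: str) -> int:
    """Toy single-residue score: +2 for identity, -1 otherwise (invented values)."""
    return 2 if a == b else -1


def residue_alignment_score(x: str, y: str) -> int:
    """Conventional score: sum over aligned positions."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(single_score(a, b) for a, b in zip(x, y))


# Toy pair-to-pair table: the dipeptide LK replaced by IR scores well as a unit.
PAIR_TABLE = {("LK", "IR"): 3}  # invented value


def pair_score(p: str, q: str) -> int:
    """Score one residue pair against another, falling back to single-residue scores."""
    if (p, q) in PAIR_TABLE:
        return PAIR_TABLE[(p, q)]
    return single_score(p[0], q[0]) + single_score(p[1], q[1])


def pair_alignment_score(x: str, y: str) -> int:
    """Pair-to-pair score: sum over consecutive, overlapping residue pairs."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x[i:i + 2], y[i:i + 2]) for i in range(len(x) - 1))


print(residue_alignment_score("ALKD", "AIRD"), pair_alignment_score("ALKD", "AIRD"))
```

A pair table can reward a correlated double substitution that two independent single-residue scores would penalize, which is the kind of extra information such matrices are meant to capture.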

Most of the research on the sequence-structure relationship is focused on predicting the structure of a protein from its sequence. However, the reverse problem, that of identifying all sequences that are likely to have a known structure, is probably as important but easier to address, since the alignment can take full advantage of the known structure. To this end, we have devised a set of 4 context-specific score matrices (CSSM), each of which is a single-residue substitution matrix that reflects the frequency of substitution of residues that have a given level of solvent exposure. We have also designed a set of rational gap penalty functions by observing the frequency of gaps that occur in a large set of structurally aligned protein pairs. We have implemented these score matrices in BLAST and PSI-BLAST and observed that use of the CSSM score matrices improves the coverage and alignment of both programs.
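A minimal sketch of context-specific scoring follows: the matrix applied at each aligned position is selected by the solvent exposure of the residue in the known structure. The exposure cut-offs, the match and mismatch values, and the function names are all invented for illustration; the actual CSSM matrices are full substitution matrices derived from observed substitution frequencies.

```python
from bisect import bisect_right

EXPOSURE_CUTS = [0.1, 0.3, 0.6]  # invented relative-accessibility boundaries -> 4 levels
MISMATCH = [-4, -3, -2, -1]      # invented penalties, from most buried to most exposed
MATCH = 2                        # invented identity score


def exposure_level(rsa: float) -> int:
    """Map a relative solvent accessibility in [0, 1] to one of four levels (0 = buried)."""
    return bisect_right(EXPOSURE_CUTS, rsa)


def cssm_score(level: int, a: str, b: str) -> int:
    """Toy stand-in for looking up (a, b) in the matrix for this exposure level."""
    return MATCH if a == b else MISMATCH[level]


def score_against_structure(query: str, template: str, rsa: list) -> int:
    """Score an ungapped alignment of a query to a template of known structure."""
    if not (len(query) == len(template) == len(rsa)):
        raise ValueError("query, template and rsa must have the same length")
    return sum(
        cssm_score(exposure_level(r), q, t) for q, t, r in zip(query, template, rsa)
    )


template = "ALKD"
rsa = [0.05, 0.2, 0.7, 0.9]  # made-up exposure of each template residue
print(score_against_structure("GLKD", template, rsa),
      score_against_structure("ALKE", template, rsa))
```

In this toy, a substitution at the buried first position costs more than one at the exposed last position, reflecting the general observation that buried residues are more conserved.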

4.3. Domains and structure comparison
We designed two protein structure comparison procedures, one for global (SHEBA) and the other for local (LSHEBA) comparison. We used the former procedure to automatically cluster all protein structures and to find a surprisingly large number of protein structures that align after one sequence is circularly permuted. The second procedure finds structurally similar parts of two proteins that are otherwise unrelated. We are currently exploring the possibility of using the second procedure to automatically parse protein structures into domains.

We have compared two protein structure comparison programs, SHEBA and VAST, using SCOP as the gold standard. SCOP is a well-known manually curated protein domain structure database. We found interesting differences among the three. We could also identify some weaknesses in both VAST and SHEBA. We are currently working on SHEBA and LSHEBA to correct these weaknesses. This is a collaborative effort with Peter Munson's group at the Center for Information Technology (CIT) of NIH and Jean Garnier and Jean-Francois Gibrat of the Institut National de la Recherche Agronomique, Jouy-en-Josas, France. Dr. Gibrat is an author of VAST.

4.4. CASP experiment
We took part in the sixth CASP (Critical Assessment of Techniques for Protein Structure Prediction) world-wide protein structure prediction experiment as one of the three assessors. We assessed two categories of predictions: domain prediction and 3D structure prediction for targets with a new or nearly new fold. There were 1,011 domain predictions from 22 prediction groups for 63 target proteins. For the 17 target proteins that had a new or nearly new fold, we processed some 7,400 models from 165 prediction groups. In order to assess the quality of domain predictions, we devised a new scoring scheme, which could prove more generally useful whenever two partition/classification schemes are to be compared. From this exercise, we identified the best-scoring prediction groups and their methods, but also learned that even the best methods are far from perfect.
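Comparing two partitions of the same residues can be illustrated with a simple, generic overlap score: the fraction of residues that agree under the best one-to-one matching of domain labels. This is not the scoring scheme devised for the assessment, whose details are not given on this page; it is only a sketch of the general problem, and the brute-force search over label matchings is practical only for a handful of domains.

```python
from itertools import permutations


def domain_overlap_score(reference, predicted) -> float:
    """Fraction of residues that agree under the best one-to-one label matching."""
    if len(reference) != len(predicted):
        raise ValueError("both assignments must cover the same residues")
    ref_labels = sorted(set(reference))
    pred_labels = sorted(set(predicted))
    # Pad the shorter label list with None so that extra domains stay unmatched.
    n = max(len(ref_labels), len(pred_labels))
    ref_padded = ref_labels + [None] * (n - len(ref_labels))
    pred_padded = pred_labels + [None] * (n - len(pred_labels))
    best = 0
    for perm in permutations(pred_padded):
        mapping = {
            p: r for p, r in zip(perm, ref_padded) if p is not None and r is not None
        }
        agree = sum(1 for r, p in zip(reference, predicted) if mapping.get(p) == r)
        best = max(best, agree)
    return best / len(reference)


reference = [1, 1, 1, 1, 2, 2, 2, 2]  # per-residue domain labels (made up)
predicted = [2, 2, 2, 1, 1, 1, 1, 1]  # same split, different names, boundary off by one
print(domain_overlap_score(reference, predicted))
```

The score is insensitive to how the domains are named, so a prediction that finds the right boundary but numbers the domains differently is not penalized.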

CASP is a community-wide double-blind experiment of protein structure prediction. The sequences of proteins for which the structure determination is imminent are placed on a web site by the CASP organizers. Structure predictors worldwide take these sequences and submit their predicted structures to the organizers without knowing the true structure. The submitted structures are then evaluated by independent assessors, from whom the identity of the predictors is hidden. The experiment is performed every other year. The last one was the sixth in the series.

4.5. Protein modeling workbench
During evaluations of CASP experiments we needed an integrated protein modeling and fully featured molecular graphics program. We evaluated a number of software applications and found that none were suitable for our purposes. Hence, we are developing our own integrated system. The centerpiece of this system will be a graphical viewer, which is being built as an extension to UCSF Chimera. Our aim is to develop a simple system that incorporates many of the features of GEMM, our in-house graphics software, which is becoming obsolete, and that can be used to submit multiple modeling jobs to various third-party protein modeling servers, such as those participating in CASP. A key component of this system will be a single consistent interface to outside servers, enabling non-experts to quickly use a variety of different modeling servers while also saving time.

4.6. Secondary structure interaction map
Because protein 3D structures are difficult to view and understand, schematic drawings of the polypeptide backbone, such as cartoon drawings, are widely used as simplified representations of protein 3D structure. A diagonal plot (residue-wise contact map) has been used as a 2-dimensional representation of a 3D structure, and the signed distance map is an extension of the diagonal plot that shows the handedness of helices and the twist of beta-sheets. Currently, we are developing a secondary structure interaction map, which is similar to a diagonal plot but is applied to contacts between secondary structural elements. We find that an information-rich map can be made, which shows the symmetry and domain architecture of the protein structure.
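The step from a residue-wise contact map to a map over secondary structure elements can be sketched as below. The sketch simply counts residue contacts between each pair of elements; the 8 Å distance cut-off between C-alpha atoms, the segment names, and the coordinates are assumptions for illustration, and the map under development presumably carries richer information than a contact count.

```python
import math


def contact_map(coords, cutoff: float = 8.0):
    """Residue-wise contact map: True where two residues lie within the cut-off."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= cutoff for j in range(n)] for i in range(n)
    ]


def sse_interaction_map(coords, segments, cutoff: float = 8.0):
    """Count residue contacts between each pair of secondary structure elements.

    `segments` is a list of (name, first, last) with inclusive residue indices.
    """
    contacts = contact_map(coords, cutoff)
    result = {}
    for a in range(len(segments)):
        for b in range(a + 1, len(segments)):
            name_a, a0, a1 = segments[a]
            name_b, b0, b1 = segments[b]
            count = sum(
                contacts[i][j]
                for i in range(a0, a1 + 1)
                for j in range(b0, b1 + 1)
            )
            if count:
                result[(name_a, name_b)] = count
    return result


# Made-up coordinates: two antiparallel strands 5 units apart and a distant helix.
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),        # strand E1
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),        # strand E2
    (50.0, 50.0, 50.0), (51.5, 50.0, 50.0), (53.0, 50.0, 50.0),  # helix H1
]
segments = [("E1", 0, 2), ("E2", 3, 5), ("H1", 6, 8)]
print(sse_interaction_map(coords, segments))
```

Collapsing residues into elements turns a large, mostly empty contact matrix into a small one in which strand pairing and helix packing are visible at a glance.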


5. Hydrophobicity
5.1. Aromatics
We found by a theoretical analysis of the hydration thermodynamics of aromatic compounds that the substantial solubility of these compounds in water is due to the strong van der Waals interaction between water and the aromatic ring. The weak hydrogen bond between the aromatic ring and water molecules turns out to produce compensating enthalpy and entropy changes that result in little net free energy change. The dual hydrophobic and hydrophilic character of these molecules explains the abundance of aromatic amino acids in membrane proteins at the water-membrane interface and in globular proteins at protein-protein interfaces.

5.2. Entropy convergence
The entropies of hydration of a series of hydrophobic compounds often converge to the same value at an elevated temperature. This is one of a number of curious effects associated with hydrophobicity that must be explained if the hydrophobic effect is to be understood with confidence. We formulated a general condition for entropy convergence that explains why such a phenomenon is observed.
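A common textbook way to see how convergence can arise is sketched below. It assumes that the heat capacity change of hydration of each compound, written here as \Delta C_{p,i}, is independent of temperature, and that the hydration entropies at a reference temperature T_0 vary linearly with \Delta C_{p,i} across the series. The symbols a, b, T_0, and T_s are introduced only for this illustration, and this special case is not necessarily the general condition referred to above.

```latex
\Delta S_i(T) = \Delta S_i(T_0) + \Delta C_{p,i}\,\ln\frac{T}{T_0}
\qquad\text{with}\qquad
\Delta S_i(T_0) = a + b\,\Delta C_{p,i}
\\[6pt]
\Rightarrow\quad
\Delta S_i(T) = a + \Delta C_{p,i}\left[\,b + \ln\frac{T}{T_0}\right]
\\[6pt]
\Rightarrow\quad
\Delta S_i(T_s) = a \ \text{ for every compound } i
\quad\text{at}\quad
T_s = T_0\,e^{-b}
```

For hydrophobic solutes near room temperature the hydration entropy is negative and the heat capacity change is positive, so b is negative and the convergence temperature T_s lies above T_0, consistent with convergence at an elevated temperature.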

5.3. Structure of the hydration shell
Because of the large entropy decrease that is observed when a non-polar molecule is hydrated at room temperature, it has commonly been thought that water forms an ice-like structure, termed an 'iceberg', around the dissolved non-polar molecule. We dissected each thermodynamic function into several terms, each produced by a particular physical cause, and estimated the magnitude of each term by means of a hybrid calculation that combines computer simulation data and experimental calorimetric data. On the basis of these calculations, we proposed that the hydration shell is, if anything, slightly less structured than bulk water. Dr. Graziano and I have now shown that this observation can be proved if Muller's two-state mixture model is used.



This page was last updated on 6/11/2008.