caIntegrator homepage
Developers

“empowering translational research through informatics…”

 


Architecture | APIs | DTO's | Service Oriented API | Presentation | Analytical Server
webGenome Interface | Database design | Extending caIntegrator

Architecture

The overall goal of the caIntegrator project is to provide a framework with the infrastructural components needed to develop enterprise-level translational applications such as Rembrandt and I-SPY (see the “Applications” section). The framework leverages Java 2 Enterprise Edition, a hybrid star schema, and various open source technologies. The following are some of the high-level features of the caIntegrator framework:

  • A common set of interfaces and specification objects that define the clinical genomic analysis services. These act as templates for caIntegrator-based translational applications, which extend and implement them. The application’s user interface communicates with its caIntegrator-based middle-tier services via domain and business objects.
  • A generic real-time analytical service that currently supports class comparison analysis, principal component analysis, and hierarchical clustering analysis. It is designed to incorporate other types of analysis easily in the future and to scale for performance.
  • A hybrid data system consisting of a star schema database, which holds the clinical and annotation data as dimensions, pre-calculated gene expression and copy number data as facts, and CSM tables for user provisioning data. For performance reasons, the normalized gene expression data used by the real-time analysis module is stored as R binary files.
  • A generic interface that allows visualization of genomic data (copy number scatter plots and ideogram plots) via the WebGenome application.
  • The upcoming phases will incorporate a common set of ETL (data extraction, transformation and loading) utilities that interface with other transactional applications such as caArray and C3D and help populate the underlying robust hybrid data system.


    API


    caIntegrator’s service layer provides a clearly defined mechanism to access the service interfaces using available operations and input/output parameters.


    1. The service interfaces are based on clinical genomic use cases and not on study-specific requirements. The study-specific service providers and consumers (presentation tier) communicate via well-defined operations and strongly typed input and output parameters consisting of Data Transfer Objects (DTOs) and/or Domain Objects. You can think of these services as an interface contract: the contract defines the behavior of the service and the types of objects it accepts and returns.
    2. The caIntegrator framework supports flexible mappings between the Object Model and the Data Model, and allows for both fine-grained and coarse-grained access.
    3. The framework allows transient objects to be populated with run-time analysis computations.
    4. Each demonstrable application’s middle tier implements these interfaces. The study-specific implementation of these services allows the presentation tier to perform the necessary operations.
    5. The implemented services are self-contained. That is, the service maintains its own state.
    6. Using the caCORE SDK, caIntegrator will achieve semantic interconnection with the metadata registry by registering both the Domain Objects as well as the Data Transfer Objects.
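
    The interface contract described in the list above can be sketched roughly as follows. This is a minimal illustration only: every type name in it (`QueryDTO`, `ResultDTO`, `FindingsService`, `DiseaseQuery`, `SampleResult`, `DemoStudySampleService`) is hypothetical and is not taken from the caIntegrator code base, and the sample data is invented.

```java
import java.util.List;

// Marker types for the contract (hypothetical names, not caIntegrator classes).
interface QueryDTO {}
interface ResultDTO {}

// Use-case-level contract: strongly typed input and output, no study-specific details.
interface FindingsService<Q extends QueryDTO, R extends ResultDTO> {
    List<R> execute(Q query);
}

// Hypothetical DTOs for a "samples by disease type" use case.
record DiseaseQuery(String diseaseType) implements QueryDTO {}
record SampleResult(String sampleId, String diseaseType) implements ResultDTO {}

// A study-specific middle tier implements the contract against its own data.
final class DemoStudySampleService implements FindingsService<DiseaseQuery, SampleResult> {
    private final List<SampleResult> samples;

    DemoStudySampleService(List<SampleResult> samples) {
        this.samples = List.copyOf(samples);
    }

    @Override
    public List<SampleResult> execute(DiseaseQuery query) {
        // Keep only the samples whose disease type matches the query.
        return samples.stream()
                .filter(s -> s.diseaseType().equals(query.diseaseType()))
                .toList();
    }
}

final class ServiceContractDemo {
    public static void main(String[] args) {
        // The presentation tier only sees the FindingsService interface.
        FindingsService<DiseaseQuery, SampleResult> service = new DemoStudySampleService(List.of(
                new SampleResult("sample-1", "glioma"),
                new SampleResult("sample-2", "breast cancer")));
        System.out.println(service.execute(new DiseaseQuery("glioma")));
    }
}
```

    Because the presentation tier depends only on the interface and the DTO types, a different study could supply its own implementation without any change on the consumer side.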

    Domain Objects (DO) and Data Transfer Objects (DTO)

    One of the major goals of the current release of caIntegrator was to create a clinical genomic object model (CG-OM) and expose the domain model through caIntegrator’s Findings service tier. The purpose of the object model is to help capture the relationships between the clinical study and its associated experimental observations.

    The Clinical Genomic Object Model and caIntegrator need to deal not just with the clinical and experimental findings from the study, but also with the relationship of the findings to other annotations in both genomic and clinical spaces. caIntegrator’s Data Transfer Objects (DTOs) capture these annotations and allow the Findings service to execute queries based on the mappings between the DTOs and DOs.

    For example, to implement the use case “For a given gene, report neighboring SNPs and their genotype status”:

    1. The presentation tier passes a Query with a populated gov.nih.nci.caintegrator.dto.de.GeneIdentifierDE attribute to the service layer of the caIntegrator-derived application (CGEMS).
    2. The middle tier uses the Service Layer Mapping to find the SNP reporters associated with each gene (see Figure 1).
    3. Analysis is performed on the SNP reporters, and the results are returned as CG-OM domain objects, together with the associated annotations, in a Findings DTO object.

    Via the service-oriented architecture, the caIntegrator framework allows users to pass a complex translational query as an argument and get back translational analysis findings. The DTOs allow us to interface with other domain models or services such as caBIO, EVS and CT-OM without coupling their business rules to caIntegrator or the CG-OM.
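
    The three steps of the SNP use case above could look roughly like this in code. The sketch is illustrative only: `GeneIdentifier` is a hypothetical stand-in for a data element such as GeneIdentifierDE (whose real API is not shown here), `SnpReporter`, `SnpFindings` and `SnpQueryService` are invented names, the in-memory map stands in for the service-layer mapping, and the analysis step is omitted.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a gene identifier data element passed in the query.
record GeneIdentifier(String symbol) {}

// Hypothetical domain object: a SNP reporter with its genotype status.
record SnpReporter(String snpId, String genotypeStatus) {}

// Hypothetical Findings DTO: the domain objects plus the annotation that ties them to the query.
record SnpFindings(GeneIdentifier gene, List<SnpReporter> reporters) {}

final class SnpQueryService {
    // Stand-in for the service-layer mapping from genes to neighboring SNP reporters.
    private final Map<String, List<SnpReporter>> reportersByGene;

    SnpQueryService(Map<String, List<SnpReporter>> reportersByGene) {
        this.reportersByGene = Map.copyOf(reportersByGene);
    }

    // Steps 2 and 3: resolve the reporters for the gene and wrap them in a findings object.
    SnpFindings findNeighboringSnps(GeneIdentifier gene) {
        List<SnpReporter> reporters = reportersByGene.getOrDefault(gene.symbol(), List.of());
        return new SnpFindings(gene, reporters);
    }
}

final class SnpQueryDemo {
    public static void main(String[] args) {
        // Invented sample data; identifiers are placeholders.
        SnpQueryService service = new SnpQueryService(Map.of(
                "GENE1", List.of(new SnpReporter("snp-1", "AA"), new SnpReporter("snp-2", "AG"))));
        // Step 1: the presentation tier passes a populated gene identifier.
        System.out.println(service.findNeighboringSnps(new GeneIdentifier("GENE1")));
    }
}
```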

    Service Oriented API (Finding Factory)


    The presentation tier interacts with caIntegrator’s service-oriented API via the Finding Factory. The FindingFactory follows the abstract factory design pattern (see MSDN Factory Design Pattern). caIntegrator’s FindingFactory interface is implemented by each derived application, for example by REMBRANDTFindingFactory. The factory pattern completely abstracts the creation and initialization of the actual Findings implementation from the user interface (UI). This indirection enables the UI to focus on its discrete role in the application without concerning itself with the details of how the findings are created. Thus, as the product implementation changes over time, the client remains unchanged.

    While this indirection is a tangible benefit, the most important aspect of this pattern is the fact that the client is abstracted from both the type of Findings and the type of factory used to create the product. Presuming that the product interface is invariant, this enables the factory to create any product type it deems appropriate. Furthermore, presuming that the factory interface is invariant, the entire factory along with the associated products it creates can be replaced in a wholesale fashion. Both of these radical modifications can occur without any changes to the UI.
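
    A minimal sketch of this arrangement is shown below. It is not the caIntegrator implementation: the method names on `FindingFactory` and the classes `StudyAFindingFactory`, `StudyBFindingFactory` and `ReportScreen` are assumptions made for illustration. The point is that the UI class compiles against the two interfaces only.

```java
// Product interface seen by the UI (the method name is an assumption).
interface Finding {
    String summary();
}

// Abstract factory interface seen by the UI; one creation method per finding type.
interface FindingFactory {
    Finding createClassComparisonFinding(String query);
    Finding createClusteringFinding(String query);
}

// One study-specific factory (hypothetical).
final class StudyAFindingFactory implements FindingFactory {
    @Override
    public Finding createClassComparisonFinding(String query) {
        return () -> "Study A class comparison: " + query;
    }

    @Override
    public Finding createClusteringFinding(String query) {
        return () -> "Study A clustering: " + query;
    }
}

// A second factory that can replace the first one wholesale.
final class StudyBFindingFactory implements FindingFactory {
    @Override
    public Finding createClassComparisonFinding(String query) {
        return () -> "Study B class comparison: " + query;
    }

    @Override
    public Finding createClusteringFinding(String query) {
        return () -> "Study B clustering: " + query;
    }
}

// The UI depends only on the FindingFactory and Finding interfaces.
final class ReportScreen {
    private final FindingFactory factory;

    ReportScreen(FindingFactory factory) {
        this.factory = factory;
    }

    String show(String query) {
        return factory.createClassComparisonFinding(query).summary();
    }
}

final class FactoryDemo {
    public static void main(String[] args) {
        // Swapping the factory changes the products without touching ReportScreen.
        System.out.println(new ReportScreen(new StudyAFindingFactory()).show("q1"));
        System.out.println(new ReportScreen(new StudyBFindingFactory()).show("q1"));
    }
}
```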

    The FindingFactory assembles each Finding by calling the Strategy object associated with that Finding type. The Strategy objects follow the Strategy design pattern (see http://exciton.cs.rice.edu/javaresources/DesignPatterns/StrategyPattern.htm). The Strategy pattern decouples an algorithm from its host and encapsulates the algorithm in a separate class. More simply put, an object and its behavior are separated and put into two different classes. This allows you to switch the algorithm that you are using at any time.

    There are several advantages of using the strategy design pattern for each Finding type. First, if you have several different behaviors (how the actual translational Finding was formulated) that you want an object to perform, it is much simpler to keep track of them if each behavior is a separate class, and not buried in the body of some method. Should you ever want to add, remove, or change any of the behaviors, it is a much simpler task, since each one is its own class. Each such behavior or algorithm encapsulated into its own class is called a Strategy.
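
    As a rough illustration of the idea, the sketch below keeps each behavior in its own class and lets a factory pick one by finding type. All names (`FindingStrategy`, `ClassComparisonStrategy`, `ClusteringStrategy`, `StrategyBackedFactory`) and the string-keyed lookup are assumptions for this example, not the actual caIntegrator design.

```java
import java.util.Map;

// Each way of formulating a finding lives behind this interface (hypothetical name).
interface FindingStrategy {
    String execute(String query);
}

// One behavior per class, rather than one large method with many branches.
final class ClassComparisonStrategy implements FindingStrategy {
    @Override
    public String execute(String query) {
        return "class comparison for " + query;
    }
}

final class ClusteringStrategy implements FindingStrategy {
    @Override
    public String execute(String query) {
        return "clustering for " + query;
    }
}

// The factory delegates to whichever strategy is registered for the finding type.
final class StrategyBackedFactory {
    private final Map<String, FindingStrategy> strategies;

    StrategyBackedFactory(Map<String, FindingStrategy> strategies) {
        this.strategies = Map.copyOf(strategies);
    }

    String createFinding(String findingType, String query) {
        FindingStrategy strategy = strategies.get(findingType);
        if (strategy == null) {
            throw new IllegalArgumentException("No strategy for finding type: " + findingType);
        }
        return strategy.execute(query);
    }
}

final class StrategyDemo {
    public static void main(String[] args) {
        StrategyBackedFactory factory = new StrategyBackedFactory(Map.of(
                "classComparison", new ClassComparisonStrategy(),
                "clustering", new ClusteringStrategy()));
        System.out.println(factory.createFinding("clustering", "q1"));
    }
}
```

    Adding a new behavior then means adding one class and one map entry; the existing strategies and the factory logic stay as they are.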

    Presentation Layer


    The web-based graphical user interface is built using Apache’s Struts (version 1.1) application framework and runs in most of the currently available J2EE Application servers, though the intended platform is JBoss version 4.0.2.

    The Struts framework ultimately renders a web page composed of (X)HTML, images, JavaScript, and CSS. To support some of the application’s advanced features, we use AJAX for remote scripting: a client-side script (JavaScript) sends asynchronous calls to the server via the XMLHttpRequest object. The presentation layer allows users to formulate complex translational queries and presents the results either in a tabular report or plotted on a graph. caIntegrator uses the JFreeChart graphing package to render its 2D graphical plots.

    The reporting mechanism for the web application uses XSL(T) to transform XML into either XHTML or a CSV-formatted document. The presentation tier receives the data from the middle tier and encapsulates it in XML with a row/column-centric schema, using the Dom4J package. This XML is cached and then transformed into its presentation format by applying an XSL stylesheet chosen for the desired output. The same cached XML dataset can be transformed into XHTML and displayed in a web browser, or transformed into a CSV-formatted file and downloaded to the client’s local machine for offline viewing. This approach lets us reuse the cached XML dataset, applying different presentation templates without executing another database query to retrieve the same result set. We are also able to incorporate additional features such as pagination, filtering, and highlighting via XSL.
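
    The idea of rendering one cached XML dataset through different stylesheets can be sketched as follows. This is an assumption-laden illustration: it uses the JDK's built-in `javax.xml.transform` API rather than Dom4J so that it is self-contained, and the `report`/`row`/`cell` element names, the two stylesheets, and the `ReportRenderer` class are invented for the example rather than taken from caIntegrator.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ReportRenderer {
    // Stylesheet that turns the row/column XML into CSV text (one line per row).
    static final String TO_CSV = """
            <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
              <xsl:output method="text"/>
              <xsl:template match="/report">
                <xsl:for-each select="row">
                  <xsl:for-each select="cell">
                    <xsl:value-of select="."/>
                    <xsl:if test="position() != last()">,</xsl:if>
                  </xsl:for-each>
                  <xsl:text>&#10;</xsl:text>
                </xsl:for-each>
              </xsl:template>
            </xsl:stylesheet>
            """;

    // Stylesheet that turns the same XML into an XHTML-style table fragment.
    static final String TO_XHTML = """
            <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
              <xsl:output method="xml" omit-xml-declaration="yes"/>
              <xsl:template match="/report">
                <table>
                  <xsl:for-each select="row">
                    <tr><xsl:for-each select="cell"><td><xsl:value-of select="."/></td></xsl:for-each></tr>
                  </xsl:for-each>
                </table>
              </xsl:template>
            </xsl:stylesheet>
            """;

    // Applies one stylesheet to the cached XML; no new data query is involved.
    static String render(String cachedXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(cachedXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Report transformation failed", e);
        }
    }
}

final class ReportRendererDemo {
    public static void main(String[] args) {
        // Invented sample dataset standing in for the cached report XML.
        String cachedXml = "<report><row><cell>Sample</cell><cell>Value</cell></row>"
                + "<row><cell>S1</cell><cell>2.5</cell></row></report>";
        System.out.print(ReportRenderer.render(cachedXml, ReportRenderer.TO_CSV));
        System.out.println(ReportRenderer.render(cachedXml, ReportRenderer.TO_XHTML));
    }
}
```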

    Analytical Server


    The Analytical Server provides an on-the-fly computational analysis capability for caIntegrator-based applications. It communicates asynchronously with caIntegrator’s middle tier via the Java Message Service (JMS) and uses the Java 1.5 ThreadPoolExecutor to run analysis tasks on a pool of worker threads.

    JMS allows caIntegrator to abstract the statistical packages used for the heavy computational tasks. The current release uses R to implement the statistical methods. The Rserve package (see: http://stats.math.uni-augsburg.de/Rserve/ ) is used to interface the R system with Java. Rserve provides Java classes to execute R commands and to retrieve results as Java objects. However, the overall architecture of the Analytical Server allows any other statistical package, such as SAS, to be plugged in as long as it exposes an API.
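
    The combination of asynchronous task execution and a pluggable statistics back end might be sketched like this. The sketch makes several assumptions: the JMS transport and Rserve are left out entirely, `StatisticsEngine`, `PlainJavaEngine` and `AnalyticalServerSketch` are hypothetical names, and a simple mean stands in for a real analysis. Only `ThreadPoolExecutor` and the other `java.util.concurrent` types are real JDK classes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical abstraction over the statistical package (R, SAS, or anything else).
interface StatisticsEngine {
    double mean(double[] values);
}

// Trivial pure-Java engine standing in for a real back end.
final class PlainJavaEngine implements StatisticsEngine {
    @Override
    public double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}

final class AnalyticalServerSketch implements AutoCloseable {
    // Worker pool: 2 core threads, up to 4 threads, idle extras retire after 30 seconds.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 4, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    private final StatisticsEngine engine;

    AnalyticalServerSketch(StatisticsEngine engine) {
        this.engine = engine;
    }

    // Returns immediately; the computation runs on a worker thread.
    Future<Double> submit(double[] values) {
        Callable<Double> task = () -> engine.mean(values);
        return pool.submit(task);
    }

    // Convenience wrapper that blocks until the result is available.
    double submitAndWait(double[] values) {
        try {
            return submit(values).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}

final class AnalyticalServerDemo {
    public static void main(String[] args) {
        try (AnalyticalServerSketch server = new AnalyticalServerSketch(new PlainJavaEngine())) {
            System.out.println(server.submitAndWait(new double[] {1.0, 2.0, 3.0}));
        }
    }
}
```

    Swapping the statistical package then amounts to passing a different `StatisticsEngine` implementation to the constructor.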

    In the current release of caIntegrator, we processed gene expression data by assembling text files (the probe-level data is consolidated with the Affy MAS5 algorithm, with a target scaling value of 500) and also processed the same data with lpg cdf (as3p). We generated signals for all tumor samples and for the normal pool, computed sample versus normal pool and disease group versus normal comparisons (absolute fold change, p-value, standard deviation, etc.) using a two-sample t-test, and stored the results as R binary files on the R server.
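
    For readers unfamiliar with these comparison statistics, the following sketch computes a fold change and a two-sample t statistic for one reporter. It is only an illustration under stated assumptions: the source does not say which t-test variant is used, so the unequal-variance (Welch) form is assumed here; the fold change is taken as a plain ratio of group means; the p-value is omitted; and the class `TwoSampleStats` and the sample values are invented. In caIntegrator itself these calculations are done in R.

```java
final class TwoSampleStats {
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double sampleVariance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return ss / (x.length - 1);
    }

    // Fold change as the ratio of group means (an assumed definition).
    static double foldChange(double[] group, double[] reference) {
        return mean(group) / mean(reference);
    }

    // Welch two-sample t statistic: difference of means over its standard error.
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(
                sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }
}

final class TwoSampleStatsDemo {
    public static void main(String[] args) {
        // Invented signal values for one reporter in a disease group and in the normal pool.
        double[] disease = {2.0, 4.0, 6.0};
        double[] normal = {1.0, 2.0, 3.0};
        System.out.println("fold change = " + TwoSampleStats.foldChange(disease, normal));
        System.out.println("t statistic = " + TwoSampleStats.welchT(disease, normal));
    }
}
```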

    WebGenome Interface


    One of the main requirements of caIntegrator is to help researchers analyze and visualize results in a more user-friendly manner. Since webGenome (aka webCGH) was already plotting some of these graphs, the two projects decided to create an interface between caIntegrator and webGenome and to extend webGenome’s capabilities with new plots and functionality.

    To achieve this integration, an EJB architecture was proposed. WebGenome now exposes a set of interfaces to help populate these plots. The caIntegrator framework implements these interfaces as Enterprise JavaBeans (EJB). caIntegrator remotely invokes the WebGenome application via the Java Naming and Directory Interface (JNDI).

    Database Design

    caIntegrator employs a modified star schema for the study data warehouse design, which supports the integration of clinical and genomic data. It is a generic, query-optimized schema that contains fact tables such as “Differential_Gene_Expression_Fact” and “Genomic_Abnormality_Fact”. Lookup entities such as Genes, Biosample, and Disease type make up the dimensions in the schema. The schema provides a highly denormalized view of the data and a data-neutral framework against which queries can be executed with short retrieval times.

    The clinical genomic data warehouse schema supports scientific queries based on disease type, expression profile, gene or genomic region of interest, and any other clinical indicators, alone or in combination. It supports the classification of molecular signatures and provides information on clinical outcomes.

    In essence, it is a modified star schema suitable for clinical genomic research. There are three types of tables in this data warehouse star schema (see figure). In the center are the gene expression and genomic abnormality fact tables, the focal points where all clinical and genomic dimensions intersect. The fact tables contain the pre-calculated data points produced by various scientific algorithms. The dimension tables contain study-relevant data points, such as tumor histology, patient demographics, and genomic annotation. Each dimension is an axis that provides a distinct view of the clinical genomic data. Lookup tables and mapping tables provide additional information that explains the fact and dimension tables; they contain static general information, such as gender and study platform.
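
    To make the fact/dimension relationship concrete, the sketch below models a tiny star schema in memory and answers one query by filtering the dimensions first and then selecting the matching facts. Everything in it is hypothetical: the record names, fields, and sample rows are invented and do not reflect the real caIntegrator tables or columns, and a real deployment would run such a query in the database.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Dimension rows (hypothetical columns).
record GeneDim(int geneId, String symbol) {}
record DiseaseDim(int diseaseId, String name) {}

// Fact row: foreign keys into the dimensions plus a pre-calculated measure.
record ExpressionFact(int geneId, int diseaseId, double foldChange) {}

final class StarSchemaSketch {
    // "Fold changes for gene X in disease Y": restrict each dimension, then pick the facts.
    static List<Double> foldChangesFor(String geneSymbol, String diseaseName,
            List<GeneDim> genes, List<DiseaseDim> diseases, List<ExpressionFact> facts) {
        Set<Integer> geneIds = genes.stream()
                .filter(g -> g.symbol().equals(geneSymbol))
                .map(GeneDim::geneId)
                .collect(Collectors.toSet());
        Set<Integer> diseaseIds = diseases.stream()
                .filter(d -> d.name().equals(diseaseName))
                .map(DiseaseDim::diseaseId)
                .collect(Collectors.toSet());
        return facts.stream()
                .filter(f -> geneIds.contains(f.geneId()) && diseaseIds.contains(f.diseaseId()))
                .map(ExpressionFact::foldChange)
                .toList();
    }
}

final class StarSchemaDemo {
    public static void main(String[] args) {
        // Invented sample rows.
        List<GeneDim> genes = List.of(new GeneDim(1, "GENE1"), new GeneDim(2, "GENE2"));
        List<DiseaseDim> diseases = List.of(new DiseaseDim(10, "glioma"), new DiseaseDim(11, "other"));
        List<ExpressionFact> facts = List.of(
                new ExpressionFact(1, 10, 2.5),
                new ExpressionFact(1, 11, 0.8),
                new ExpressionFact(2, 10, 1.1));
        System.out.println(StarSchemaSketch.foldChangesFor("GENE1", "glioma", genes, diseases, facts));
    }
}
```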

    For real-time gene expression analysis, preprocessed datasets are stored as R binary files and used by the R server. This provides maximum performance when analyzing 54,000 reporters across more than 350 bioassays.

    The current schema is an initial phase of clinical genomic data warehouse schema design, which only supports the currently available data sets. It can be easily extended to support other data sets such as proteomics and tissue array data.

    The current schema also includes tables from NCICB’s Common Security Module (CSM) to allow single user sign-on and assignment of appropriate data access.

    Extending caIntegrator for other cancer studies

    • Perform Use Case Analysis
    • Map new objects/attributes to the Domain Model/Business Objects
    • Map new objects/attributes to the database schema
    • Tailor the ETL process for the new study data set
    • Implement the “FindingFactory” service layer interface for the new study
    • Extend the associated Strategy and Findings objects to meet the new study-specific use cases
    • More information coming soon…