MetamorphoSys Batch Run Documentation
Performing Batch MetamorphoSys Runs
This document is a guide to running MetamorphoSys in a programmatic
way rather than through the standard GUI interface. This can be
very useful if you need to make a variety of runs to produce multiple
subsets and want to reuse standard configuration files.
Getting Started
To perform a scripted run of MetamorphoSys you will need five
things:
- Input data files in a directory. The data files can be either an installed RRF or ORF subset, or the starting .nlm files that come with a UMLS distribution.
- Destination directory where the scripted subset is to be created.
- MetamorphoSys installation (like the MMSYS/ directory on the DVD, or an unpacked mmsys.zip file).
- JRE matching the version of the MetamorphoSys distribution.
- An unpacked mmsys.zip file will contain a JRE directory you can use for Linux, Solaris, or Windows.
- MetamorphoSys configuration file.
The easiest way to obtain all five pieces is to start with a UMLS
distribution: either the DVD or downloaded files from the Knowledge
Sources Server. Either the Metathesaurus .nlm files or an installed
.RRF subset can serve as the input data. You can choose any directory
to hold output data.
Unpacking the mmsys.zip file to a known location on your machine
provides both the MetamorphoSys installation and the JRE.
For the final piece, the easiest way to obtain a configuration file
is to run the MetamorphoSys application and work your way to the
configuration screens in the GUI. After selecting your source list
and other configuration options, save your configuration to a file in
a known place. The resulting configuration file may contain path
information that differs from the properties specified in the script.
This is not a problem: the -D switches on the Java call in the scripts
below override any path information in the file.
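Before putting the five pieces into a script, it can help to confirm that each one is actually in place. The following is a minimal sketch and not part of MetamorphoSys: check_batch_inputs is a hypothetical helper name, and the requirement that the destination directory already exists and is writable is an assumption made for this example.

```shell
#!/bin/sh
# Pre-flight check for a batch MetamorphoSys run (illustrative sketch).
# check_batch_inputs is a hypothetical helper, not shipped with MetamorphoSys.
# Arguments: input dir, output dir, MMSYS_HOME, JAVA_HOME, config file.
check_batch_inputs() {
    status=0
    # Input data and the MetamorphoSys installation must be directories.
    for dir in "$1" "$3"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            status=1
        fi
    done
    # Assumption: the output directory already exists and is writable.
    if [ ! -d "$2" ] || [ ! -w "$2" ]; then
        echo "output directory missing or not writable: $2" >&2
        status=1
    fi
    # The JRE must provide an executable bin/java.
    if [ ! -x "$4/bin/java" ]; then
        echo "no executable java under: $4/bin" >&2
        status=1
    fi
    # The configuration file must be readable.
    if [ ! -r "$5" ]; then
        echo "unreadable configuration file: $5" >&2
        status=1
    fi
    return $status
}
```

A script could call check_batch_inputs with its own five values and stop early if the function returns a non-zero status.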
Configuring and Running a Script
In the sections below, you can follow a sample series of steps for putting
together the pieces above into a script and actually generating a
subset.
Windows
@echo off
REM
REM Specify directory containing .RRF or .nlm files
REM
set METADIR=C:\UMLS
REM
REM Specify output directory
REM
set DESTDIR=C:\UMLS\METASUBSET
REM
REM Specify MetamorphoSys directory
REM
set MMSYS_HOME=C:\UMLS\MMSYS
REM
REM Specify CLASSPATH
REM
set CLASSPATH=%MMSYS_HOME%;%MMSYS_HOME%/lib/mms.jar;%MMSYS_HOME%/lib/objects.jar
REM
REM Specify JAVA_HOME
REM
set JAVA_HOME="C:\Program Files\Java\jre1.5.0_11"
REM
REM Specify configuration file
REM
set CONFIG_FILE=C:\config.properties
REM
REM Call Batch MetamorphoSys
REM
cd %MMSYS_HOME%
%JAVA_HOME%\bin\java -Dinput.dir=%METADIR% -Doutput.dir=%DESTDIR% ^
  -Dmmsys.dir=%MMSYS_HOME% -Dmmsys.config=%CONFIG_FILE% -Xms300M -Xmx1000M ^
  gov.nih.nlm.mms.BatchMetamorphoSys
The script defines the five needed pieces: input directory, output
directory, MetamorphoSys installation, JRE, and configuration file.
JAVA_HOME and the CLASSPATH are configured. The script makes the
required java call from the MMSYS_HOME directory.
This will produce an output subset based on the input data
specified. The subset will contain either ORF or RRF files (depending
upon which you indicated in the configuration file). The style of
input data must be correctly specified in the configuration file
(choose from .nlm files, RRF, or ORF files). A log of the progress
will also be generated as it runs.
Linux, Macintosh, or Solaris
Consider the following script:
#!/bin/sh -f
#
# Specify directory containing .RRF or .nlm files
#
METADIR=/d1/UMLS
#
# Specify output directory
#
DESTDIR=/d1/UMLS/METASUBSET
#
# Specify MetamorphoSys directory
#
MMSYS_HOME=/d1/UMLS/MMSYS
#
# Specify CLASSPATH
#
CLASSPATH="${MMSYS_HOME}:$MMSYS_HOME/lib/mms.jar:$MMSYS_HOME/lib/objects.jar"
#
# Specify JAVA_HOME
#
JAVA_HOME=$MMSYS_HOME/jre/linux
#
# Specify configuration file
#
CONFIG_FILE=/d1/umls/config.properties
#
# Run Batch MetamorphoSys
#
export METADIR
export DESTDIR
export MMSYS_HOME
export CLASSPATH
export JAVA_HOME
cd $MMSYS_HOME
$JAVA_HOME/bin/java -Dinput.dir=$METADIR -Doutput.dir=$DESTDIR \
  -Dmmsys.dir=$MMSYS_HOME \
  -Dmmsys.config=$CONFIG_FILE -Xms300M -Xmx1000M \
  gov.nih.nlm.mms.BatchMetamorphoSys
The script defines the five needed pieces: input directory, output
directory, MetamorphoSys installation, JRE, and configuration file.
JAVA_HOME and the CLASSPATH are configured. The script makes the
required java call from the MMSYS_HOME directory.
This will produce an output subset based on the input data
specified. The subset will contain either ORF or RRF files (depending
upon which you indicated in the configuration file). The style of
input data must be correctly specified in the configuration file
(choose from .nlm files, RRF, or ORF files). A log of the progress
will also be generated as it runs.
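Because the main reason for scripting is to produce several subsets from saved configuration files, the single call above can be wrapped in a function and repeated once per configuration. This is only a sketch under stated assumptions: the paths repeat the sample values from the script above, and SUBSET_ROOT, run_subset, and DRY_RUN are hypothetical names introduced for this example. Configuration files should be given as absolute paths, since a real run changes into MMSYS_HOME first.

```shell
#!/bin/sh
# Sketch: one batch run per saved configuration file (illustrative only).
# SUBSET_ROOT, run_subset and DRY_RUN are hypothetical names for this example.
METADIR=/d1/UMLS
MMSYS_HOME=/d1/UMLS/MMSYS
JAVA_HOME=$MMSYS_HOME/jre/linux
SUBSET_ROOT=/d1/UMLS/subsets
CLASSPATH="$MMSYS_HOME:$MMSYS_HOME/lib/mms.jar:$MMSYS_HOME/lib/objects.jar"
export CLASSPATH

# Run (or, with DRY_RUN=echo, only print) one batch call.
# $1 = absolute path of a configuration file ending in .properties.
# Output goes to $SUBSET_ROOT/<configuration base name>.
# The body runs in a subshell so the caller's working directory is kept.
run_subset() (
    name=$(basename "$1" .properties)
    dest="$SUBSET_ROOT/$name"
    if [ -z "${DRY_RUN:-}" ]; then
        # Real run: create the output directory, then work from MMSYS_HOME.
        mkdir -p "$dest" && cd "$MMSYS_HOME" || exit 1
    fi
    ${DRY_RUN:-} "$JAVA_HOME/bin/java" -Dinput.dir="$METADIR" \
        -Doutput.dir="$dest" -Dmmsys.dir="$MMSYS_HOME" \
        -Dmmsys.config="$1" -Xms300M -Xmx1000M \
        gov.nih.nlm.mms.BatchMetamorphoSys
)
```

Setting DRY_RUN=echo and calling run_subset once for each configuration file prints the java commands without running them, which is a cheap way to review the paths before starting a long run.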
Configuration File Notes
As indicated above, the configuration file you use is best generated
using the MMSYS GUI. There are a couple of things you may want to
consider when reusing a configuration file.
- Input data to a batch MetamorphoSys process can take the form of
either the .nlm files on the DVD or an installed RRF subset.
If you create your configuration file using the GUI, this will be
managed for you. If you change your mind after the fact,
you can edit a few settings in the properties file to fix this.
- First, choose one of the following two settings:
- mmsys_input_stream=gov.nih.nlm.mms.RichMRMetamorphoSysInputStream
- mmsys_input_stream=gov.nih.nlm.mms.NLMFileMetamorphoSysInputStream
- Now, if you chose, say, RichMRMetamorphoSysInputStream,
make sure you set the relevant properties for this input
stream. For example (in this case we assume the path /d1/UMLS
contains RRF files):
- gov.nih.nlm.mms.RichMRMetamorphoSysInputStream.source_path=/d1/UMLS/
- Output data can take the form of either RRF or ORF data. If
you create your configuration file using the GUI, this will be
managed for you. If you change your mind after the fact,
you can edit a property file setting to fix this. Choose one of
the following two settings:
- mmsys_output_stream=gov.nih.nlm.mms.RichMRMetamorphoSysOutputStream
- mmsys_output_stream=gov.nih.nlm.mms.NLMFileMetamorphoSysOutputStream
- Each time the UMLS Metathesaurus is updated, some of the various
"default" data sets may change, for example the source list, the
default SAB,TTY list, and the list of suppressible (CUI,AUI) pairs.
If your configurations rely on these properties (e.g. the sources or
termgroups properties), make sure you compare the value list from the
previous version to the value list from the current version. To avoid
this kind of problem, it is often better to express your
configurations in terms of things to include instead of things to
exclude. Furthermore, you can always re-open your configuration file
in the GUI for the latest release of MetamorphoSys and see a report of
changes that may affect your configuration. Then you can make desired
changes, save it again, and reuse it in your batch environment.
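The comparison of value lists between releases can be scripted. The sketch below assumes, as the grep-based snippet later in this section does, that the "sources" property is a single line of ';'-separated entries. The names list_sources and diff_sources are hypothetical helpers, not part of MetamorphoSys.

```shell
#!/bin/sh
# Sketch: report source entries that differ between the configuration
# files of two releases. Assumes the "sources" property is one line of
# ';'-separated entries. list_sources and diff_sources are hypothetical.

# Print one source entry per line, sorted, duplicates removed.
list_sources() {
    grep '^sources=' "$1" | sed 's/^sources=//' | tr ';' '\n' |
        sed '/^$/d' | LC_ALL=C sort -u
}

# $1 = previous release config file, $2 = current release config file.
diff_sources() {
    old=$(mktemp) && new=$(mktemp) || return 1
    list_sources "$1" > "$old"
    list_sources "$2" > "$new"
    echo "Only in $1:"
    LC_ALL=C comm -23 "$old" "$new"
    echo "Only in $2:"
    LC_ALL=C comm -13 "$old" "$new"
    rm -f "$old" "$new"
}
```

Entries listed under either heading are the ones to review before reusing an older configuration with a newer release. The same approach should carry over to other list-valued properties if they use the same separator, which would need to be checked against the actual file.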
Instead of using the MetamorphoSys GUI to create your configuration
file, you may want to consider a programmatic approach to editing the
default mmsys.a.prop config file that comes with a MetamorphoSys
distribution. Consider this code snippet:
% grep ^sources $MMSYS_HOME/config/mmsys.a.prop | /usr/local/bin/perl -pe 's/sources=//; s/;/\n/g' | \
  awk -F\| '{print $1"|"$1}' | /usr/local/bin/perl -pe 's/\n/;/g' >! /tmp/sab_list.txt
% /usr/local/bin/perl -pe 'open(SOURCES,"/tmp/sab_list.txt"); \
  $sources = <SOURCES>; \
  chop($sources); \
  s/(gov.nih.nlm.mms.filters.SourcesToRemoveFilter.selected_sources).*/$1=$sources/; \
  s/^(.*)\.remove_utf8=true/$1.remove_utf8=false/; \
  s/^(mmsys_input_stream)=.*/$1=gov.nih.nlm.mms.NLMFileMetamorphoSysInputStream/; \
  s/^(mmsys_output_stream)=.*/$1=gov.nih.nlm.mms.RichMRMetamorphoSysOutputStream/; \
  s/^(.*)\.remove_selected_sources=true/$1.remove_selected_sources=false/; \
  ' \
  $INIT_CONFIG_FILE >! my_config.prop
In this example, we are starting by looking up the complete list of
sources in the "sources" property in the default configuration file and
compiling a SAB list. The second command makes five modifications
to the default configuration file and writes a new configuration file.
- The selected_sources property of the source list filter is set to
the complete source list (taken from the prior command).
- The remove_utf8 property of any of the output streams is set to
false (in case it was true).
- The mmsys_input_stream property is set to
NLMFileMetamorphoSysInputStream (.nlm files).
- The mmsys_output_stream property is set to
RichMRMetamorphoSysOutputStream (RRF files).
- The remove_selected_sources property (of the source list filter) is
set to false (causing the source list filter to operate in include
mode).
The result is the output file my_config.prop, which is now configured
(correctly for this version of the data) to be a "keep everything"
subset of the NLM data files. This config file can now be passed,
along with the other parameters, to the BatchMetamorphoSys call to
make the desired subset.
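Before passing a generated file to a long batch run, it may be worth confirming that the edits took effect. The following sketch checks the stream settings and the two boolean properties described above. verify_config is a hypothetical helper name, and the check assumes each property sits on its own line with no trailing whitespace.

```shell
#!/bin/sh
# Sketch: confirm that a generated configuration file carries the settings
# described above. verify_config is a hypothetical helper; the property
# names and values are the ones edited by the snippet in this section.
verify_config() {
    file=$1
    status=0
    for expected in \
        'mmsys_input_stream=gov.nih.nlm.mms.NLMFileMetamorphoSysInputStream' \
        'mmsys_output_stream=gov.nih.nlm.mms.RichMRMetamorphoSysOutputStream'
    do
        # -x matches whole lines only, -F treats the pattern as plain text.
        if ! grep -qxF "$expected" "$file"; then
            echo "missing or different: $expected" >&2
            status=1
        fi
    done
    # No property should still ask for UTF-8 removal or exclude mode.
    if grep -Eq '\.(remove_utf8|remove_selected_sources)=true$' "$file"; then
        echo "remove_utf8 or remove_selected_sources is still true" >&2
        status=1
    fi
    return $status
}
```

Calling verify_config my_config.prop returns zero only when all of these checks pass, so a batch script can stop before the java call if the configuration is not what was intended.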