MMTx Frequently Asked Questions (FAQ)

Welcome to the MMTx FAQ. This FAQ answers specific questions about the MMTx program. We have also included some relevant questions on related programs such as jlvg and dfBuilder where we felt the questions and answers were important to MMTx users.

Last Modified: March 30, 2007


Questions

  1. General Information
    1. What exactly is MMTx?
    2. Does MMTx run on my platform?
    3. How do I report a bug or other trouble I am having with MMTx?
    4. How do I monitor your progress on a submitted Trouble Report?
    5. How do I access the Download page? I have (or don't have) an UMLS KS account.
    6. Is there an interface to the Brill POS Tagger?
    7. Is there any documentation on how I might build a tagger client for my MMTx?
    8. Is there a way to update MMTx databases short of creating new databases and reloading the entire model?
    9. How do I load in the optional models after the installation process?
    10. Is there an uninstall program for MMTx?
    11. How do I get additional information on MetaMap and SKR?
    12. Which UMLS Sources are included with MMTx?
    13. What do I do if I'm having problems downloading the large MMTx jar file?

  2. Installation
    1. Nothing yet

  3. Running MMTx
    1. How do I get pipe delimited output?
    2. Is there any way to get XML output?
    3. What is up with these toggle options?
    4. Can you explain what the machine output format is and what information is included?
    5. How can I tell what version of MMTx I'm running?
    6. I just installed MMTx and when I invoke it, I get the error: "Main method is not public". How do I resolve this?
    7. When I invoke MMTx with a large file, I get the error: "Exception in thread "main" java.lang.OutOfMemoryError". How do I resolve this?

  4. Data File Builder (dfBuilder)
    1. How can I filter/skip some obviously non-medical concepts/abbreviations (although they are listed in UMLS)?
    2. What are the "C......." items that appear in the word files?
    3. In FilterMRCONSO.filterNmStrDups, what is the criterion for which record is kept when duplicates occur?
    4. Is there any particular reason for using another unique name for semantic types instead of just using the Tcodes?
    5. Why isn't jlvg used for normalization? NLSStrings looks to be more lightweight, was it just easier to include?
    6. When making a custom UMLS, what MRSO term types get filtered out?
    7. If I have a custom data set for Version 2.0 do I have to regenerate it to use it with version 2.2?


General Information

What exactly is MMTx?

MMTx was designed to provide the general public with access to the MetaMap algorithms and capabilities. The original MetaMap program was written in Prolog, which limited the platforms on which MetaMap could run. With this in mind, we decided to build MMTx in Java so that it could run on as many platforms as possible.

We have worked with the author of the MetaMap program to ensure a faithful reproduction of the original algorithms and list of available options. In most cases you should receive identical output from both MetaMap and MMTx. There are known differences because MetaMap uses the Xerox PARC POS Tagger, while MMTx uses (by default) built-in algorithms for tagging text. This difference in tagging may on occasion produce different results in the two systems.


Does MMTx run on my platform?

MMTx is a Java-based application and theoretically is platform independent. Having said that, we are officially supporting the following platforms:

Please reference the following list of items required to be installed and configured for MMTx to run properly:

How do I report a bug or other trouble I am having with MMTx?

We accept trouble/bug reports emailed to the mmtx@nlm.nih.gov address, but we would REALLY, REALLY prefer that you use our "Trouble Reporter" system, which is accessible via the link in the left sidebar or from this link: Trouble Reporter


How do I monitor your progress on a submitted Trouble Report?

Once we receive a bug/trouble report, it is entered into our tracking system, which is available via the "Review Status of Trouble Reports" link in the left sidebar or from the following link: Review Status of Trouble Reports

Reports submitted via the "Trouble Reporter" are automatically entered into the tracking system after they have been reviewed by our Trouble Report moderator.

Bug/Trouble reports submitted via email are entered into the tracking system by the Trouble Report moderator as time permits.

Once your Trouble Report is submitted and reviewed, it is assigned to one of our MMTx Team Members to work on. You will be contacted directly via email when further clarification is required or when the status of your Trouble Report has changed.

Trouble Reports or TRs are only closed once we have received feedback from the submitter that the problem has in fact been resolved. One caveat is that if we have asked for feedback on a fix and have not received any response for a week, we will consider the TR closed.

How do I access the Download page? I have (or don't have) an UMLS KS account.

Currently, we are bundling an initial dataset for your use with MMTx. This dataset is based on information contained in the Unified Medical Language System (UMLS) and must comply with all of the UMLS copyright restrictions.

So that we can honor these restrictions, you must satisfy the following criteria before receiving access to the MMTx Download page:

  1. You must either have signed the UMLS agreement, or be sponsored by someone who has signed the UMLS agreement.

  2. You must have created an account under the UMLS Knowledge Source Server (UMLS KS).

We are looking into creating a default dataset that does not require compliance with the UMLS copyright restrictions.


Is there an interface to the Brill POS Tagger?

There is currently no interface to the Brill POS Tagger from MMTx. We have been looking into implementing this, but no decision has been made on whether to support the tagger.

The main hurdle here is supporting the TreeBank tags that are produced via the Brill POS Tagger versus the current suite of tags we are using internally with the Xerox PARC Tagger.

Please see Notes on Tagger Integration for more information on how you can integrate the Brill Tagger into the MMTx.


Is there any documentation on how I might build a tagger client for my MMTx?

We are currently in the process of completing this and should have a web page available soon.


Is there a way to update MMTx databases short of creating new databases and reloading the entire model?

There is currently no defined way of doing this within MMTx. If you know SQL and MySQL, you will be able to modify the tables very easily using the MySQL provided tools.


How do I load in the optional models after the installation process?

If you need to install the Moderate and/or Relaxed Data Model databases, either because you downloaded them later or because of an error on the part of the install script, please see the following link for details: Loading Optional Models.


Is there an uninstall program for MMTx?

Currently there is no formal uninstall program available for MMTx. We are looking to incorporate one in an upcoming release. For now, we recommend the following steps:

  1. Remove the entire mmtx directory where MMTx was installed.

  2. Go into MySQL as a root user and drop all of the MMTx related tables including any of the following tables:
    • DB_00_moderate
    • DB_00_relaxed
    • DB_00_strict
    • DB_01_moderate
    • DB_01_relaxed
    • DB_01_strict
    • DB_MMTxDBTest_01_strict
    • J_Static2001MMTxLexicon
    • Static2001MMTxLexicon
    • mmtx

  3. If you want to completely remove any vestiges, you also need to remove any entries in the MySQL database user table referring to either user "mmtxUser" or "lvg" by using the following set of commands:
                      use mysql;
                      delete from user where user='mmtxUser';
                      delete from user where user='lvg';
                      flush privileges;

You are now rid of MMTx and can move on to something else or reinstall MMTx.


How do I get additional information on MetaMap and SKR?

You can reference additional information on MetaMap and SKR from the following link: MetaMap and SKR Research Information


Which UMLS Sources are included with MMTx?

The following UMLS Sources are excluded from MMTx distribution:

  CDT5    Current Dental Terminology 2005 (CDT-5), 5
  CPT01SP Physicians' Current Procedural Terminology, Spanish Translation, 2001 
  CPT2005 Physicians' Current Procedural Terminology, 2005
  HCDT5   HCPCS Version of Current Dental Terminology 2005 (CDT-5), 5
  HCPCS05 Healthcare Common Procedure Coding System, 2005
  HCPT05  HCPCS Version of Current Procedural Terminology (CPT), 2005
  MTHCH05 Metathesaurus CPT Hierarchical Terms, 2005
  MTHHH05 Metathesaurus HCPCS Hierarchical Terms, 2005
       
All other sources are included.


What do I do if I'm having problems downloading the large MMTx jar file?

The following information comes from one of our MMTx users (Leon F.), who was experiencing problems downloading the MMTx jar file because of its size. They had success with GNU's Wget program, which is available from the following link: http://www.gnu.org/software/wget/ I have been able to replicate their success with the software, so I can highly recommend it.

The following is from the GNU Wget page:

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

GNU Wget has many features to make retrieving large files or mirroring entire web or FTP sites easy, including:

* Can resume aborted downloads, using REST and RANGE
* Can use filename wild cards and recursively mirror directories
* NLS-based message files for many different languages
* Optionally converts absolute links in downloaded documents to
   relative, so that downloaded documents may link to each other locally
* Runs on most UNIX-like operating systems as well as Microsoft Windows
* Supports HTTP and SOCKS proxies
* Supports HTTP cookies
* Supports persistent HTTP connections
* Unattended / background operation
* Uses local file timestamps to determine whether documents need to be re-downloaded when mirroring

GNU Wget is distributed under the GNU General Public License.

So, once you have wget downloaded and installed, you can download the MMTx software by using the following command:

    wget --http-user=USERNAME --http-passwd=PASSWORD http://mmtx.nlm.nih.gov/Download/mmtx_V2.4.B_data.jar

Note that this leaves your username and password visible on the command line during the download, as well as in the process information, so care should be taken.


Installation

Nothing Yet

Running MMTx

How do I get pipe delimited output?

The machine output format (-q) is the only currently supported machine readable/parsable output format we offer in MMTx. See the FAQ section entitled "Can you explain what the machine output format is and what information is included?" for more details on machine output.

Creating piped output is easy except for the issue of repeating information (e.g., phrases, candidates, mappings). A modified version of machine output might work. And like machine output, piped output of an utterance will be extended over several lines with each line beginning with the utterance id and output type (and maybe other identifying information).

This is being reviewed for inclusion in an upcoming release.


Is there any way to get XML output?

MMTx currently does not support XML output of any kind. It should be fairly easy to modify the output routines to include XML output if you download and modify the sources.

XML output is not currently scheduled to be included in any future release of MMTx.


What is up with these toggle options?

MMTx has several options that appear to work backwards when specified on the command line. This is a feature designed into the MMTx system to provide conformance with the MetaMap program (or in other words - for historical reasons). The following options are considered toggle options because when you specify them on the command line, they actually turn the option OFF instead of ON as you would expect.

These options are turned ON by default because they are the options typically used by everyone; setting them to ON as a default saves time and typing.

In a future release of MMTx we are looking to support a more explicit way of expressing these options as well as supporting the historical method. In the future, you will be able to make specifications like "--candidates=true" or "-b=false" on the command line. These specifications should show more explicitly which way the toggle is to be set and used in the program.
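
The toggle behavior described above can be sketched in a few lines of Python. This is only an illustration, not MMTx code: the option names "--candidates" and "-b" are borrowed from the examples in this answer, and the assumption that both default to ON, along with the function name resolve_options, is hypothetical.

```python
# Hypothetical defaults: every toggle option in this sketch starts ON.
DEFAULTS = {"--candidates": True, "-b": True}


def resolve_options(argv, defaults=DEFAULTS):
    """Return the effective ON/OFF state of each toggle option."""
    state = dict(defaults)
    for arg in argv:
        name, sep, value = arg.partition("=")
        if name not in state:
            continue  # not a toggle option; ignored in this sketch
        if sep:
            # Explicit form, e.g. --candidates=true or -b=false
            state[name] = value.lower() == "true"
        else:
            # Historical form: naming the option flips it away from its default
            state[name] = not defaults[name]
    return state


print(resolve_options(["--candidates"]))
print(resolve_options(["-b=false", "--candidates=true"]))
```

In the historical form, naming an option switches it away from its default, which is why an option that defaults to ON ends up OFF. The explicit form states the desired value directly.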


Can you explain what the machine output format is and what information is included?

The machine output format (-q) is the only currently supported machine readable/parsable output format we offer in MMTx.

There is a lot of information contained in the machine output and the following documents outline the contents in great depth:


How can I tell what version of MMTx I'm running?

Well ..., from Version 2.0.C on, you will be able to tell simply by running mmtx --version from the command line.

With previous versions, you will need to look at the date of the mmtx/classes/programs/MMTx.class file.


I just installed MMTx and when I invoke it, I get the error: "Main method is not public". How do I resolve this?

This error message occurs when you are using the new version of Java (1.4). We have a work-around that allows you to run MMTx in Java 1.4 without getting this error. The work-around is contained in V2.0.C of MMTx.

Currently in MMTx, if you make changes to the source code and try to recompile in Java 1.4, you will receive errors and be unable to complete the compile process. We are working on bringing MMTx into conformance with Java 1.4 and should have something out soon. Currently, MMTx ONLY supports compilation under Java 1.3.


When I invoke MMTx with a large file, I get the error: "Exception in thread "main" java.lang.OutOfMemoryError". How do I resolve this?

This error message occurs when you are using large input files with MMTx. To resolve it, add memory sizing options to the MMTx script in the mmtx/bin directory.

Add the following options to the java call:

    -ms100m   <default is 4MB, this changes it to 100MB>
    -mx100m   <default is 16MB, this changes it to 100MB>

So, the beginning of the java line in the script should look like the following:

    java -ms100m -mx100m -cp

You can play around with the 100s to see what works for you. The only caveat is that -ms must be less than or equal to -mx; it can't be greater than -mx.
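
If you generate the java line from a script, the -ms/-mx caveat can be checked before launching. The following is a minimal sketch; the helper names heap_bytes and java_heap_args are made up for this example, and it assumes sizes are written as a number with an optional k, m or g suffix.

```python
import re

# Multipliers for the size suffixes assumed by this sketch.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_bytes(size):
    """Convert a size such as '100m' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", size.lower())
    if match is None:
        raise ValueError("unrecognized heap size: " + size)
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def java_heap_args(initial, maximum):
    """Build the start of the java command, enforcing -ms <= -mx."""
    if heap_bytes(initial) > heap_bytes(maximum):
        raise ValueError("-ms must be less than or equal to -mx")
    return ["java", "-ms" + initial, "-mx" + maximum]


print(" ".join(java_heap_args("100m", "100m")))
```

With both sizes set to 100m, this yields the same "java -ms100m -mx100m" prefix as the line shown above. Later JVMs spell these options -Xms and -Xmx.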
        


Data File Builder (dfBuilder)

How can I filter/skip some obviously non-medical concepts/abbreviations (although they are listed in UMLS)?

For example:

        (1) "Fig." inside this text should not be marked as "Fig [Food]":
            This condition is called a succenturiate lobe (Fig. 6-5 ) and may be
            problematic if that lobe of placenta is inadvertently left within
            the uterus at the time of delivery.

        (2) Likewise, "al" shouldn't be "Aluminum [Element, Ion, or Isotope]"
            These studies provide strong evidence of the interconnectiveness of
            maternal and fetal fluid spaces across the membranes and placenta
            (Kilpatrick et al, 1991).
           
The quick answer to your question about filtering is that currently there is no way to do so. You can specify the -u (--unique_acros_abbrs_only) or -a (--no_acros_abbrs) options, but these options only prevent the generation of some variants before accessing the Metathesaurus. If the abbreviation is in the Metathesaurus itself (as is the case with "al"), then the options don't help. MMTx doesn't get your first example right either because it doesn't realize that "Fig." *is* an abbreviation. (BTW, I normally use both -D (--an_derivational_variants) and -a (--no_acros_abbrs) in my processing; the -D option allows derivational variants only between adjectives and nouns, and the -a option filters out all abbreviatory variants.)

There are plans to improve MetaMap's (and MMTx's) accuracy by incorporating word sense disambiguation and by recognizing higher-order tokens such as numbers, chemicals, bibliographic information, etc. Either of these might help with your problem, but neither will be available in the foreseeable future. If you were really desperate, you could modify the Metathesaurus data files to filter out selected concepts or abbreviations, but that would entail a tremendous amount of work that would need to be repeated with each new release of the UMLS knowledge sources. In addition, as in the "fig" case, absolute filtering probably isn't appropriate anyway.


What are the "C......." items that appear in the word files?

C....... occurs in some data for optimization purposes. As an example, the line

    abdomen|S0003328|C.......

in a word index file means that the word "abdomen" occurs in the string S0003328 ('Palpation of abdomen'). The presence of C....... signals that the given string is also the concept name. No further searching is necessary to obtain the concept name. On the other hand, the entry

    abdomen|S0288461|C0000735

means that "abdomen" occurs in S0288461 ('abdomen neoplasm'), a string for concept C0000735 with preferred name 'Abdominal Neoplasms'.
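
For anyone reading these files from their own scripts, here is a small Python sketch of how the two kinds of entries could be told apart. It assumes each line has exactly three pipe-separated fields, as in the examples above; the function name and the dictionary keys are invented for illustration.

```python
PLACEHOLDER = "C......."  # marks a string that is itself the concept name


def parse_word_index_line(line):
    """Split one word index line into its word, string id and concept id."""
    word, sui, cui = line.rstrip("\n").split("|")
    is_concept_name = cui == PLACEHOLDER
    return {
        "word": word,
        "sui": sui,
        # No concept id is stored when the placeholder is present.
        "cui": None if is_concept_name else cui,
        "string_is_concept_name": is_concept_name,
    }


for entry in ("abdomen|S0003328|C.......", "abdomen|S0288461|C0000735"):
    print(parse_word_index_line(entry))
```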


In FilterMRCONSO.filterNmStrDups, what is the criterion for which record is kept when duplicates occur?

We keep the first occurrence of a string in the mrconso file (a join of MRCON and MRSO) that is indistinguishable from other strings within a concept. Since preferred forms occur in this file before non-preferred forms, this has the effect of keeping the preferred forms.
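
The keep-the-first-occurrence rule can be sketched as follows. This is not the actual FilterMRCONSO code: the (concept id, string) record layout is simplified, and treating strings that differ only in case as "indistinguishable" is an assumption made purely for the example.

```python
def filter_duplicate_strings(records, normalize=str.lower):
    """Keep the first record for each distinct string within a concept.

    records is an iterable of (cui, string) pairs in file order.
    normalize decides when two strings count as indistinguishable;
    lowercasing is only a stand-in for the real comparison.
    """
    seen = set()
    kept = []
    for cui, string in records:
        key = (cui, normalize(string))
        if key in seen:
            continue  # a matching string for this concept was already kept
        seen.add(key)
        kept.append((cui, string))
    return kept


rows = [
    ("C0000735", "Abdominal Neoplasms"),
    ("C0000735", "abdominal neoplasms"),
    ("C0000735", "abdomen neoplasm"),
]
print(filter_duplicate_strings(rows))
```

Because the input order is preserved, a preferred form that appears first in the file is the one that survives.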


Is there any particular reason for using another unique name for semantic types instead of just using the Tcodes?

Yes, we find the mnemonics much easier to understand when we manually review output.


Why isn't jlvg used for normalization? NLSStrings looks to be more lightweight, was it just easier to include?

NLSStrings is a descendant of a corresponding Prolog module used by MetaMap and similar programs developed in the Natural Language Systems program. It predates lvg, is tailored to our specific needs, and is more efficient than lvg (the last time I checked). I'm also fairly certain that lvg doesn't have the full functionality required for the specific kind of normalization we're doing; in particular, it doesn't respect word order, and the kinds of normalization involving NOS and NEC are different (MetaMap continues to filter out NOS but no longer does so with NEC).


When making a custom UMLS, what MRSO term types get filtered out?

For information on what filtering is done in the MMTx and MetaMap programs, please review the following document, which describes what filtering was done for 2001: Filtering the UMLS Metathesaurus for MetaMap, 2001 (PDF)

You can also find more information on Filtering the UMLS Metathesaurus and Ambiguity in the UMLS Metathesaurus from the SKR Reference Information page.


If I have a custom data set for Version 2.0 do I have to regenerate it to use it with version 2.2?

The data files themselves did not change, so you do not have to regenerate. If you want your processing based on the 2002 UMLS, then you will want to create new source files (sourceData/<your custom>/umls) from the 2002 UMLS data files and run the Data File Builder. To migrate your data to V2.2, do the following:

  1. Download 2001 Common Files from the download page.

  2. Unjar the files as described for the installation package into your nls/mmtx/data directory.

  3. Create a new directory with the name of your custom data set. (mmtx/data/<yourCustom>/)

  4. Create two subdirectories: mmtx and lexicon. (Data Set Creation and Load section of the Data File Builder manual shows the data set structure.)

  5. Copy the files from the matching directories in the 2001 directory into these new mmtx and lexicon directories.

  6. From your v2.0 data directory, copy the treecodes.txt file from the meta directory into the mmtx directory of your new data set.

  7. If you modified the semdef.txt file, copy it into the new mmtx directory.

  8. From data/meta/DB_<yourCustom>_moderate/, copy the .txt files to a new data/<yourCustom>/moderate directory.

  9. Do the same for relaxed or strict models that you may have created.

  10. You should now be able to run the install program and it will load your custom dataset.

  11. To run MMTx with your custom data set, see the Data File Builder manual for examples.