UMLS Metathesaurus Loader for Apache Derby

Posted 1/9/2008

MetamorphoSys, the UMLS Metathesaurus installation tool, can generate database schema and load scripts for Oracle and MySQL (and maybe others).

Here's a simple utility for loading the Metathesaurus into Apache Derby. Since the utility is pure JDBC and the SQL is quite basic, about the only thing you need to do to use it with another database is change the driver and maybe tweak the data types. In hindsight, it would have been a really good idea to put the driver connection into a config file!

Derby is very nice for experimental programming since you only need a jar file to use the embedded driver.
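To illustrate how little setup the embedded driver needs, here is a minimal sketch of opening an embedded Derby database over JDBC. It is not the loader's own code: the class name DerbyConnect and the database name used in the usage note are made up for the example, and it assumes the Derby embedded jar is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper for illustration; not part of the loader.
final class DerbyConnect {

    // Embedded Derby URL; ";create=true" creates the database on first use.
    static String embeddedUrl(String dbName) {
        return "jdbc:derby:" + dbName + ";create=true";
    }

    // derby.system.home tells Derby where to keep its databases.
    // It has to be set before the embedded driver is first loaded.
    static Connection open(String derbyHome, String dbName) throws SQLException {
        System.setProperty("derby.system.home", derbyHome);
        return DriverManager.getConnection(embeddedUrl(dbName));
    }
}
```

Calling DerbyConnect.open("c:/ddb", "umls") would create or open a database named umls under c:/ddb. Older setups without JDBC 4 driver auto-loading also need an explicit Class.forName call on the driver class first.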

You can browse the Javadoc here.

This software is made available under the terms of the GNU General Public License v3.0 available at http://www.gnu.org/licenses/gpl.html
The download is here. It includes a sample config file, the Derby JDBC driver, and a sample bat file. The source is in the jar file.


The schema generator and data loader read MRFILES and MRCOLS to derive the data structures, and simply skip any files not present in the data directory. So if you want to load only some of the files generated by MMSYS, you have two choices: either remove the files from the data directory, or edit MRFILES and delete the line for each file you wish to omit.
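The idea of deriving table definitions from MRFILES and MRCOLS can be sketched roughly as follows. This is an illustration, not the loader's source: the class and method names are hypothetical, the field layouts (FIL|DES|FMT|CLS|RWS|BTS for MRFILES and COL|DES|REF|MIN|AV|MAX|FIL|DTY for MRCOLS) are assumed from the RRF documentation, and the varchar(4000) fallback type is an arbitrary choice for the example.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the loader's actual classes.
final class SchemaSketch {

    // MRCOLS line (assumed layout): COL|DES|REF|MIN|AV|MAX|FIL|DTY|
    // Returns a map from "FILE|COLUMN" to the declared SQL type.
    static Map<String, String> columnTypes(List<String> mrcolsLines) {
        Map<String, String> types = new LinkedHashMap<>();
        for (String line : mrcolsLines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 8) {
                continue; // skip blank or malformed lines
            }
            types.put(f[6] + "|" + f[0], f[7]);
        }
        return types;
    }

    // MRFILES line (assumed layout): FIL|DES|FMT|CLS|RWS|BTS|
    // FMT is the comma delimited column list for the file.
    static String createTable(String schema, String mrfilesLine, Map<String, String> types) {
        String[] f = mrfilesLine.split("\\|", -1);
        String file = f[0];
        List<String> cols = new ArrayList<>();
        for (String col : f[2].split(",")) {
            // The fallback type is an assumption for columns missing from MRCOLS.
            cols.add(col + " " + types.getOrDefault(file + "|" + col, "varchar(4000)"));
        }
        return "CREATE TABLE " + schema + "." + tableName(file)
                + " (" + String.join(", ", cols) + ")";
    }

    // Strips the extension: "MRCONSO.RRF" becomes "MRCONSO".
    static String tableName(String file) {
        int dot = file.lastIndexOf('.');
        return dot < 0 ? file : file.substring(0, dot);
    }

    // Files listed in MRFILES but absent from the data directory are skipped.
    static boolean isPresent(String dataDir, String file) {
        return Files.exists(Paths.get(dataDir, file));
    }
}
```

A caller would loop over the MRFILES lines, check isPresent for each file name, and execute the generated statement only for the files that exist.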

The Loader main takes one argument - the name of a config file.

This config file contains the following properties:

  • data_dir - the directory containing the RRF files
  • schema - the Derby schema name
  • derby_home - the directory in which to store the Derby db
  • mrfiles_indexes - a pipe-delimited file where each line contains an MRFILES file name and a comma-delimited list of columns to index, similar to the MRFILES format

Say you've created a SNOMED and ICD9 subset using MMSys, and say those files are in C:/data/umls/2007AC/META snomed/.

Your UmlsLoader.properties file might look like this:

data_dir = c:/data/umls/2007AC/META snomed/
schema = umls-snomed
derby_home = c:/ddb
mrfiles_indexes = c:/data/umls/2007AC/META snomed/indexes.txt
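As a rough sketch of how such a config and index file could be consumed (again, not the loader's own code), the following reads the properties with java.util.Properties and turns one line of the index file into CREATE INDEX statements. The class name, the X_ index naming scheme, and the choice of one index per listed column are assumptions made for the example; an index file line is assumed to look like MRCONSO.RRF|CUI,SAB.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch; not the loader's actual classes.
final class ConfigSketch {

    // Loads key = value pairs such as data_dir, schema, derby_home.
    static Properties load(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Properties keeps trailing whitespace in values, so trim before use.
    static String get(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value.trim();
    }

    // Index file line (assumed format): MRCONSO.RRF|CUI,SAB
    // Emits one single-column index per listed column (an assumption).
    static List<String> createIndexes(String schema, String line) {
        List<String> ddl = new ArrayList<>();
        String[] f = line.split("\\|", -1);
        if (f.length < 2) {
            return ddl; // nothing to index on a blank or malformed line
        }
        String file = f[0].trim();
        int dot = file.lastIndexOf('.');
        String table = dot < 0 ? file : file.substring(0, dot);
        for (String col : f[1].split(",")) {
            String c = col.trim();
            if (!c.isEmpty()) {
                ddl.add("CREATE INDEX X_" + table + "_" + c
                        + " ON " + schema + "." + table + " (" + c + ")");
            }
        }
        return ddl;
    }
}
```

Forward slashes in the Windows paths matter here: backslashes are escape characters in a properties file and would need to be doubled.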

This was developed for use on another project and has not been tested beyond the scope of that project. In particular, the focus was on SNOMED and ICD9, limited to the MRCON and MRRELS files.

The loader reads the data files as UTF-8 and assumes that is their character set.

It uses the row counts in MRFILES to validate the integrity of the load.
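A row-count check along these lines might look like the sketch below. It is an illustration under assumptions, not the loader's code: the class name is hypothetical, the row count is assumed to be the fifth MRFILES field (RWS), and the sketch counts lines in the data file, whereas a real check could equally compare against a SELECT COUNT(*) on the loaded table.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; not the loader's actual classes.
final class RowCountCheck {

    // MRFILES line (assumed layout): FIL|DES|FMT|CLS|RWS|BTS|
    static long expectedRows(String mrfilesLine) {
        return Long.parseLong(mrfilesLine.split("\\|", -1)[4].trim());
    }

    // Counts the lines available from the reader; one line is one row.
    static long countRows(BufferedReader reader) {
        try {
            long rows = 0;
            while (reader.readLine() != null) {
                rows++;
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file as UTF-8; malformed byte sequences raise an error
    // instead of being silently replaced.
    static long countRows(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return countRows(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fails loudly when the actual count differs from the MRFILES count.
    static void verify(String mrfilesLine, long actualRows) {
        long expected = expectedRows(mrfilesLine);
        if (expected != actualRows) {
            throw new IllegalStateException(
                    "expected " + expected + " rows but found " + actualRows);
        }
    }
}
```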

I started out by extending RichMRMetamorphoSysOutputStream in the MMSYS source, but that required tweaking the visibility of some members of that class, and the indexing was hard-coded. So I switched to reading MRFILES and MRCOLS directly and put the index specification in a config file.

Release notes:

Version 0.01

Initial release.