Software

Overview

We do not typically maintain releases of our software. Most of what is available comes directly from the current version of the repository, which changes daily for many projects. Please do not count on any sort of compatibility with future versions.

Note: all of the svn checkout URLs provided below are intended to be checked out using a Subversion client. There are many clients available for Windows (e.g., TortoiseSVN), Linux (check your package repository), and Mac.

Also note: if you are having trouble accessing our repositories via TortoiseSVN, try an older version (e.g., 1.16 for 32- and 64-bit systems). There is an issue with new versions that prevents anonymous access to our server.

Spatiotemporal Prediction

Machine learning

Statistical classification servers
We work with a number of supervised classification models trained on thousands to millions of instances, where each instance has as many as 100K features. The resulting model files can be as large as 3GB and take a couple of minutes to load. This is a problem in any situation where instances need to be classified on demand by an application: it's not feasible to wait a couple of minutes each time an instance comes in. We have modified a few existing packages, allowing them to run in "server mode". This means they load a model once and serve classifications from the loaded model without having to reload it for each classification request. This can dramatically reduce processing time. We have made this modification for a couple of popular, large-scale packages.
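
The load-once idea behind "server mode" can be sketched as follows. This is a minimal illustration in Python, not the code of the modified packages: the `ClassificationServer` class, the toy linear model returned by `load_toy_model`, and the `feature:value` request format are all assumptions made for the example.

```python
import io


def load_toy_model():
    """Stand-in for the expensive step (minutes for a multi-GB model file)."""
    # Hypothetical linear model: label -> {feature id -> weight}.
    return {"pos": {"1": 1.0, "2": 1.0}, "neg": {"3": 1.0}}


def parse_instance(line):
    """Parse one sparse request line of the form 'feature:value feature:value'."""
    features = {}
    for token in line.split():
        name, _, value = token.partition(":")
        features[name] = float(value)
    return features


class ClassificationServer:
    """Loads the model once, then answers any number of requests from memory."""

    def __init__(self, load_model):
        # The only call to the loader; every later request reuses the result.
        self._model = load_model()

    def classify(self, features):
        # Score each label with a sparse dot product and return the best one.
        scores = {
            label: sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
            for label, weights in self._model.items()
        }
        return max(scores, key=scores.get)

    def serve(self, requests, responses):
        # One instance per input line, one predicted label per output line.
        for line in requests:
            if line.strip():
                responses.write(self.classify(parse_instance(line)) + "\n")


server = ClassificationServer(load_toy_model)  # slow step, paid once
replies = io.StringIO()
server.serve(io.StringIO("1:1.0 2:0.5\n3:2.0\n"), replies)
print(replies.getvalue(), end="")
```

In a real deployment the request and response streams would typically be a socket or pipe rather than in-memory strings; the point is only that the loader runs once, however many instances arrive.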

Natural Language Processing

Systems

  • Nominal semantic role labeling. Matthew Gerber has packaged up the nominal semantic role labeling system described in his dissertation. The system performs end-to-end nominal SRL over completely unstructured text, achieving an F1 score of approximately 70% on the testing section of the Penn TreeBank. You can obtain the system by first checking out this directory to a Windows machine using a Subversion client. Once you have the directory, follow the README.txt in the directory for further instructions.

Resource APIs

  • FrameNet (svn checkout). This is a C# .NET API for the FrameNet 1.3/1.5 semantic frame resource. The API captures most of the content of the FrameNet project, including all frame definitions, frame and frame element relations, lexical unit annotations, and frame element bindings within those annotations.
  • NomBank (svn checkout). This is a C# .NET API for the NomBank resource. The API captures, in addition to everything captured by the TreeBank API (described below), all nominalization argument information, including split and co-referential arguments. The API also includes all information from the NomLex resource, which is distributed with NomBank. A sample application is included.
  • Penn TreeBank, PropBank, and DiscourseBank (svn checkout). These are C# .NET APIs for the Penn TreeBank, PropBank, and DiscourseBank resources. The TreeBank portion of the API captures all annotated parse trees, including syntactic constituent labels, grammatical function labels, and null element instantiations. The PropBank portion of the API captures (in addition to everything captured by the TreeBank API above) all verbal argumentation information, including split and co-referential arguments. The DiscourseBank portion of the API is rather preliminary and only captures the argument nodes for each discourse connective; other information, such as features, is currently left out. The TreeBank and PropBank APIs are demonstrated with a sample application. We haven't gotten around to writing sample code for the DiscourseBank API, but the code is well documented, so you should be able to figure out how it works. The software also includes a handy GUI for generating nicely laid out parse tree images in a variety of formats (e.g., PNG, JPG, and EPS); this relies on GraphViz.
  • SemLink (svn checkout) and updated mapping data (svn checkout). This is a C# .NET API for the SemLink resource. The API allows one to map between PropBank, VerbNet, and FrameNet verb argument structures.

    The original SemLink 1.1 mapping is very out of date. We have updated the mapping data to be in agreement with PropBank 1, VerbNet 3.1, and FrameNet 1.5. This required around 1000 modifications to the original SemLink mapping. The data format of the new files is identical to the original SemLink mapping.

  • Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) (svn checkout). This is a C# .NET API for the SNOMED-CT resource. The API assumes that you have already loaded the SNOMED-CT data into a MySQL database. Table schemata and load scripts are provided. We've also included functionality for producing graphs of the SNOMED-CT hierarchy using GraphViz.
  • Unified Medical Language System (UMLS) (svn checkout). This is a C# .NET API for the UMLS resource. The API assumes that you have already loaded (a subset of) the UMLS data into a MySQL database. The API provides the following functionality:
    • Construction of the semantic network
    • Concept retrieval, including lexical units/variants and semantic types
    • Identification of inter-concept relationships
    • Others under development
  • VerbNet (svn checkout). This is a C# .NET API for the VerbNet 3.1 resource. The API captures most of the content of the VerbNet project, including all classes and verb members. The API includes a sample application.
  • WordNet (svn checkout). This is a C# .NET API for the WordNet 3.0 lexical semantics resource. The API captures most of the content of the WordNet project, including all synset definitions (words and glosses) and synset relations (both semantic and lexical). The API offers two access methods: in-memory and disk-based. The former requires quite a bit of memory (~200MB), but is extremely fast. The latter requires very little memory but is slower due to on-disk searching of the WordNet data. Also included are some methods for shortest path searching between synsets. The API includes a sample application.

    WARNING: This API will modify the index.* files that are distributed with WordNet. These files will be re-sorted for use by the .NET runtime, whose string sort order differs from that of the Java runtime. As a result, the Java (and other) APIs/applications might not function correctly when used with the re-sorted index.* files. You should create multiple copies of the WordNet data (one for each runtime) to avoid such problems.

    LASTLY: Please take a look at the README.txt files for each project before emailing us with questions. The most common issues (e.g., how to compile and where to find referenced DLLs) are addressed there.
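
The shortest-path search between synsets mentioned for the WordNet API can be illustrated with a breadth-first search. The sketch below is in Python rather than C#, runs over a small hand-built graph instead of real WordNet data, and uses hypothetical names throughout (`undirected`, `shortest_path`, `TOY_EDGES`); it does not reproduce the API's actual methods.

```python
from collections import deque

# Toy hypernym-style links between synset labels (illustrative only).
TOY_EDGES = [
    ("dog", "canine"),
    ("canine", "carnivore"),
    ("cat", "feline"),
    ("feline", "carnivore"),
    ("carnivore", "mammal"),
    ("rock", "natural_object"),
]


def undirected(edges):
    """Build an adjacency map in which every relation can be walked both ways."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start, then reverse.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph = undirected(TOY_EDGES)
print(shortest_path(graph, "dog", "cat"))
```

Because breadth-first search explores nodes in order of distance from the start, the first time it reaches the goal it has found a path with the fewest relation links.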

Social Media

  • Twitter API (svn checkout). This is a C# .NET API for Twitter that has native authentication support (i.e., no third-party OAuth libraries) and partial support for the API itself. Basic functionality is finished (e.g., status updates, searching, and stream- and keyword-based filters); however, many functions are missing. We use this primarily to obtain streams of tweets for data mining purposes. This is a good place to start if you want a native .NET Twitter API and are willing to do some coding to flesh out the API calls that are missing. A sample GUI application is included.
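
To give a sense of what authentication without a third-party OAuth library involves, here is a minimal sketch of OAuth 1.0a request signing with HMAC-SHA1 as specified in RFC 5849. It is written in Python using only the standard library and is an assumption about the kind of work required, not the project's C# code; the function names, the example.com URL, and the credential values are placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """RFC 3986 encoding: only letters, digits, '-', '.', '_' and '~' stay unescaped."""
    return quote(str(value), safe="")


def signature_base_string(method, url, params):
    """Build the signature base string: METHOD & encoded URL & encoded parameters."""
    # Parameters are encoded first, then sorted by encoded name and value.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in encoded)
    return "&".join([method.upper(), percent_encode(url), percent_encode(param_string)])


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for the request."""
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("ascii"), base.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder request and credentials, for illustration only.
params = {
    "status": "Hello world",
    "oauth_consumer_key": "example-consumer-key",
    "oauth_token": "example-token",
    "oauth_nonce": "example-nonce",
    "oauth_timestamp": "1300000000",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_version": "1.0",
}
signature = sign_request("POST", "https://api.example.com/statuses/update.json",
                         params, "example-consumer-secret", "example-token-secret")
print(signature)
```

The resulting value would be sent as the `oauth_signature` parameter alongside the other `oauth_*` fields; a real client also needs a fresh nonce and timestamp for every request.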

Other

MS Word / JAMIA style file
This is a Microsoft Word 2010 style file for the Journal of the American Medical Informatics Association (JAMIA). It's not perfect, but it's close. For example, it won't sort and abbreviate citation numbers as required. It also won't bold the volume number within the bibliography. To correct these problems, we suggest turning all citation/bibliographic text into plain text before submission and making the corrections manually. Note: to use this file, place it in "C:\Program Files\Microsoft Office\Office14\Bibliography\Style" (or the equivalent location on your machine). Restart Word if it's running, and "JAMIA" should appear in the list of available bibliographic styles.