Overview
I don't maintain releases of my software. Anything you download comes directly from the current version of the repository, which changes daily for many projects. Most of the code is very clean and well documented, but I am given to making sweeping changes on a whim, so don't count on any sort of compatibility with future versions. Note: all of the download URLs provided below are designed for use with a Subversion client. There are many clients available for Windows (e.g., TortoiseSVN) and Linux (check your package repository).
Natural Language Processing
Systems
- Nominal semantic role labeling. I have packaged up the nominal semantic role labeling system described in my dissertation. The system performs end-to-end nominal SRL over completely unstructured text, achieving an F1 score of approximately 70% on the testing section of the Penn TreeBank. You can obtain the system by first checking out this directory to a Windows machine using Subversion (or by painstakingly downloading each individual file in the directory). Once you have the directory, follow the README.txt in the directory for further instructions.
Resource APIs
- FrameNet (download). This is a C# .NET API for the FrameNet 1.3/1.5 semantic frame resource. The API captures most of the content of the FrameNet project, including all frame definitions, frame and frame element relations, lexical unit annotations, and frame element bindings within those annotations.
- NomBank (download). This is a C# .NET API for the NomBank resource. In addition to everything captured by the TreeBank API (described below), the API captures all nominalization argument information, including split and co-referential arguments. The API also includes all information from the NomLex resource, which is distributed with NomBank. A sample application is included.
- Penn TreeBank, PropBank, and DiscourseBank (download). These are C# .NET APIs for the Penn TreeBank, PropBank, and DiscourseBank resources. The TreeBank portion of the API captures all annotated parse trees, including syntactic constituent labels, grammatical function labels, and null element instantiations. The PropBank portion of the API captures (in addition to everything captured by the TreeBank portion) all verbal argumentation information, including split and co-referential arguments. The DiscourseBank portion of the API is preliminary and captures only the argument nodes for each discourse connective; other information, such as features, is currently left out. The TreeBank and PropBank APIs are demonstrated with a sample application. I haven't gotten around to writing sample code for the DiscourseBank API, but, as usual, my code is meticulously commented, so you should be able to figure out how it works. The software also includes a handy GUI for generating nicely laid out parse tree images in a variety of formats (e.g., PNG, JPG, and EPS); this relies on GraphViz.
- SemLink (download) and updated mapping data (download). This is a C# .NET API for the SemLink resource. The API allows one to map between PropBank, VerbNet, and FrameNet verb argument structures.
The original SemLink 1.1 mapping is very out of date. I have updated the mapping data to be in agreement with PropBank 1, VerbNet 3.1, and FrameNet 1.5. This required around 1000 modifications to the original SemLink mapping. The data format of the new files is identical to the original SemLink mapping.
- Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) (download). This is a C# .NET API for the SNOMED-CT resource. The API assumes that you have already loaded the SNOMED-CT data into a MySQL database. Table schemata and load scripts are provided. I've also included functionality for producing graphs of the SNOMED-CT hierarchy using GraphViz. Here's an example.
- Unified Medical Language System (UMLS) (download). This is a C# .NET API for the UMLS resource. The API assumes that you have already loaded (a subset of) the UMLS data into a MySQL database. The API provides the following functionality:
  - Construction of the semantic network
  - Concept retrieval, including lexical units/variants and semantic types
  - Identification of inter-concept relationships
  - Others under development
- VerbNet (download). This is a C# .NET API for the VerbNet 3.1 resource. The API captures most of the content of the VerbNet project, including all classes and verb members. The API includes a sample application.
- WordNet (download). This is a C# .NET API for the WordNet 3.0 lexical semantics resource. The API captures most of the content of the WordNet project, including all synset definitions (words and glosses) and synset relations (both semantic and lexical). The API offers two access methods: in-memory and disk-based. The former requires quite a bit of memory (~200MB), but is extremely fast. The latter requires essentially no memory, but is slower due to on-disk searching of the WordNet data. Also included are some methods for shortest path searching between synsets. The API includes a sample application.
WARNING: This API will modify the index.* files that are distributed with WordNet. These files will be re-sorted for use by the .NET runtime, whose string sort order differs from that of the Java runtime. As a result, the Java (and other) APIs/applications might not function correctly when used with the re-sorted index.* files. You should create multiple copies of the WordNet data (one for each runtime) to avoid such problems.
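To see why the sort order of the index files matters, consider how a disk-based lookup typically works: it binary-searches a sorted file, so the file must be sorted in exactly the order the searching code compares strings. The following is a minimal Python sketch of that failure mode, not the API's actual code. The `contains` and `strip_punct` helpers and the word list are made up for illustration, and the punctuation-ignoring ordering is only a stand-in for a second runtime's collation, not the real .NET or Java rules.

```python
import bisect

def contains(sorted_keys, key):
    """Binary search for key. Only reliable when sorted_keys is sorted
    in the same order that Python's string comparison uses."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

def strip_punct(word):
    """Hypothetical sort key: compare words with punctuation ignored."""
    return "".join(ch for ch in word if ch.isalnum())

# Illustrative index entries (not real WordNet index lines).
words = ["aardvark", "a_la_carte", "a.m.", "a-one"]

# Index sorted by plain code-point comparison, matching contains().
ordinal_index = sorted(words)

# The same entries re-sorted under a different ordering (assumption:
# punctuation is ignored), as another runtime might produce.
linguistic_index = sorted(words, key=strip_punct)

found_ordinal = [contains(ordinal_index, w) for w in words]
found_linguistic = [contains(linguistic_index, w) for w in words]

print(found_ordinal)     # every entry is found
print(found_linguistic)  # entries are missed although they are present
```

Every entry is present in both lists, but the binary search only finds them reliably in the list whose order matches its own comparison. This is the reason for keeping one copy of the data per runtime.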
LASTLY: Please take a look at the README.txt file for each project before emailing me with questions. The most common issues (e.g., how to compile and where to find referenced DLLs) are addressed there.
Machine learning
Statistical classification servers
I work with a number of supervised classification models trained on thousands to millions of instances, where each instance can have as many as 100K features. The resulting model files can be as large as 3GB and take a couple of minutes to load. This is a problem whenever an application needs instances classified on demand: it's not feasible to wait a couple of minutes each time an instance comes in. My solution is to modify existing packages so that they can run in "server mode": they load a model once and then serve classifications from the loaded model, without reloading it for each classification request. This can dramatically reduce processing time. I have made this modification for a couple of popular, large-scale packages.
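Independently of those packages, the server-mode idea itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyModel`, `ClassificationServer`, the `model.bin` path, and the whitespace-separated request format are all hypothetical, and the toy loader stands in for a load that would really take minutes.

```python
class ToyModel:
    """Hypothetical stand-in for a large trained model (not a real package)."""

    load_count = 0  # counts how often the expensive load step runs

    def __init__(self, weights):
        self.weights = weights

    @classmethod
    def load(cls, path):
        # A real loader would parse a multi-gigabyte file at `path`;
        # here we just return fixed toy weights and record the call.
        cls.load_count += 1
        return cls({"good": 1.0, "bad": -1.0})

    def classify(self, features):
        # Sum the weights of known features; unknown features count as 0.
        score = sum(self.weights.get(f, 0.0) for f in features)
        return "positive" if score >= 0 else "negative"


class ClassificationServer:
    """Loads the model once, then answers any number of requests."""

    def __init__(self, model_path):
        self.model = ToyModel.load(model_path)  # loading cost paid once

    def handle(self, request_line):
        # Assumed request format: whitespace-separated feature names.
        return self.model.classify(request_line.split())

    def serve(self, requests):
        # `requests` can be any iterable of lines, e.g. a socket or pipe
        # wrapped as a file object; a plain list is used below.
        for line in requests:
            yield self.handle(line)


server = ClassificationServer("model.bin")  # hypothetical path
answers = list(server.serve(["good good bad", "bad", "unknown"]))
print(answers)
print(ToyModel.load_count)
```

The point of the design is that `ToyModel.load` runs in the constructor rather than in `handle`, so the per-request cost is just the classification itself.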
Other
MetaPixel GUI
MetaPixel is a program for creating photomosaics, which are pictures formed from many smaller pictures. I wrote a GUI front-end for MetaPixel using Mono/C# that supports image preprocessing and mosaic creation. The GUI helps streamline mosaic creation by allowing you to specify the desired dimensions (in inches) of the sub-images and final mosaic. It also allows you to create a batch of mosaics for a range of parameters, after which you can open them in EoG or GIMP directly from the GUI. The GUI has been tested on Debian 6 using MetaPixel 1.0.2 and Mono 2.10. If all you want is the binary, download it here. You can get the source code here. Use the following command to start the GUI:
mono MetaPixelGUI.exe
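The GUI's sizing options are given in inches, while the images themselves are measured in pixels. As a rough sketch of the arithmetic such a conversion involves (this is not taken from the GUI's source; the `mosaic_grid` function and the 300 DPI default are assumptions for illustration):

```python
def mosaic_grid(mosaic_in, tile_in, dpi=300):
    """Convert physical sizes in inches to a tile grid and pixel sizes.

    mosaic_in, tile_in: (width, height) in inches.
    dpi: assumed print resolution in dots per inch.
    Returns (columns, rows, tile_px, mosaic_px).
    """
    mosaic_w, mosaic_h = mosaic_in
    tile_w, tile_h = tile_in

    # Number of whole tiles that fit across and down the mosaic.
    columns = round(mosaic_w / tile_w)
    rows = round(mosaic_h / tile_h)

    # Pixel size of one tile at the assumed resolution.
    tile_px = (round(tile_w * dpi), round(tile_h * dpi))

    # The final mosaic is an exact multiple of the tile size.
    mosaic_px = (columns * tile_px[0], rows * tile_px[1])
    return columns, rows, tile_px, mosaic_px


# A 20 x 16 inch mosaic built from half-inch square tiles.
columns, rows, tile_px, mosaic_px = mosaic_grid((20, 16), (0.5, 0.5))
print(columns, rows, tile_px, mosaic_px)
```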
MS Word / JAMIA style file
This is a Microsoft Word 2010 style file for the Journal of the American Medical Informatics Association (JAMIA). It's not perfect, but it's close. For example, it won't sort and abbreviate citation numbers as required, and it won't bold the volume number within the bibliography. To correct these problems, I suggest turning all citation/bibliographic text into plain text before submission and making the corrections manually. Note: to use this file, place it at "C:\Program Files\Microsoft Office\Office14\Bibliography\Style" (at least, that's the path for my installation of Microsoft Word 2010). Restart Word if it's running, and "JAMIA" should appear in the list of available bibliographic styles.