Version 1.0.0
This is a C++11 implementation of "A Unified Bayesian Model of Scripts, Frames and Language" (Ferraro and Van Durme, 2016). It has been developed and tested on Linux x86_64, under G++ >= 4.8.4. It should also compile on a Mac, though a couple of Makefile changes may be needed.
The following are required to compile both the UPF and baseline models.
- boost (recent, works with >= 1.56)
- Thrift >= 0.9.3
- gsl == 1.16
- cblas
- libarchive == 3.1.2
- redis >= 3.0.0
- hiredis == 0.13.3
- Google Log (glog) == 0.3.3
As this project shares code with other projects, there are some optional dependencies.
A number of dependencies are included in this repo directly. They are compiled on demand.
- Google test (included in repo)
- Eigen (included in repo)
- Concrete >= 4.8, < 5
  - Concrete is a data schema, described in Ferraro et al., 2014.
Depending on where you installed the dependencies, you may need to update `Makefile.config`.
- If the headers are installed in `/usr/local/include`, and the shared objects are installed in `/usr/local/lib`, then you do not have to change anything.
- If the headers and shared objects are installed under the same prefix (but not `/usr/local/{include,lib}`), then change lines 5 and 6 of `Makefile.config`.
- If the headers and shared objects are installed in separate directories, then change lines 11-61 as appropriate.
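To decide which of the cases above applies, it can help to see which dependency headers are present under a given prefix. The following is a minimal sketch, not part of this repo: `check_headers` is a hypothetical helper, and the header paths are the conventional install locations for each library, which may differ on your system.

```shell
# Sketch: report which dependency headers exist under PREFIX/include.
# The header paths are assumed conventional locations, not taken from
# Makefile.config; adjust them if your packages use a different layout.
check_headers() {
    prefix="${1:-/usr/local}"
    for header in \
        boost/version.hpp \
        thrift/Thrift.h \
        gsl/gsl_version.h \
        archive.h \
        hiredis/hiredis.h \
        glog/logging.h
    do
        if [ -f "$prefix/include/$header" ]; then
            echo "found   $header"
        else
            echo "missing $header"
        fi
    done
}

# Check the default location; pass another prefix to check elsewhere.
check_headers /usr/local
```

If every header is reported as found under `/usr/local`, the first case likely applies. If they are all found under some other single prefix, the second case likely applies.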
`make help` will display all known targets.
To build the UPF model, run `make models/upf_cgibbs_driver`.
To build the baseline model, run `make models/crtlda_cgibbs_basic_driver`.
There are a number of Makefile ENV variables that can be set to change compilation. Some major ones are:
- `DEBUG={0,1}`: turn on debug compilation. `DEBUG=1` turns ON debugging (i.e., `-g -O0`). To run quickly, use `DEBUG=0`. Default: 0
- `LINK_HOW={dynamic,static}`: use dynamic or static linking for certain libraries. Default: dynamic
- `LOG_AS_COUT`: This is meant to change what logging is used. It is by default undefined. Set it to anything to use stdout logging instead of Google Log.
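The variables above can be combined with the two build targets. The following is a minimal sketch: `build_drivers` is a hypothetical wrapper, not something provided by this repo, and it assumes the variables are given on the `make` command line (exporting them in the environment first may work equally well).

```shell
# Sketch: build both drivers, forwarding any VAR=value settings to make.
# Run from the repository root. build_drivers is a hypothetical helper.
build_drivers() {
    # Stop at the first failing build so errors are not buried.
    make models/upf_cgibbs_driver "$@" &&
    make models/crtlda_cgibbs_basic_driver "$@"
}
```

For example, `build_drivers DEBUG=1` would request a debug build of both models, and `build_drivers LINK_HOW=static LOG_AS_COUT=1` would request static linking with stdout logging.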
Due to licensing issues, I cannot release the fully annotated input files. They are Concrete Communications with:
- a Stanford dependency parse (collapsed-cc)
- Stanford part-of-speech and lemmatization tags
- Stanford entity coreference
- Semafor frame semantic parses.
The list of training IDs is in `data/nyt10k.id_list.txt`.
This code is released under GPL v3.0. Please contact me (ferraro [at] cs [dot] jhu [dot] edu) with any questions.