Reimplementation of the authorship attribution approach described in "Olivier de Vel, Alison M. Anderson, Malcolm W. Corney, and George M. Mohay. Mining e-mail content for author identification forensics. SIGMOD Record, 30(4), 55-64, 2001" as part of the ECIR 2016 reproducibility study "Who Wrote the Web?"

pan-webis-de/devel01

devel01 - An Approach to Authorship Attribution

This is a reimplementation of the approach to authorship attribution originally described in

Olivier de Vel, Alison M. Anderson, Malcolm W. Corney, and George M. Mohay. Mining e-mail content for author identification forensics. SIGMOD Record, 30(4), 55-64, 2001 [paper]

It was reimplemented as part of a scientific reproducibility study alongside 14 other authorship attribution approaches. The results of the reproducibility study can be found in

Martin Potthast, Sarah Braun, Tolga Buz, Fabian Duffhauss, Florian Friedrich, Jörg Marvin Gülzow, Jakob Köhler, Winfried Lötzsch, Fabian Müller, Maike Elisa Müller, Robert Paßmann, Bernhard Reinke, Lucas Rettenmeier, Thomas Rometsch, Timo Sommer, Michael Träger, Sebastian Wilhelm, Benno Stein, Efstathios Stamatatos, and Matthias Hagen. Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval. In Advances in Information Retrieval. 38th European Conference on IR Research (ECIR 16), volume 9626 of Lecture Notes in Computer Science, Berlin Heidelberg New York, March 2016. Springer. [paper] [bib]

If you use this reimplementation in your own research, please make sure to cite both of the above papers.

Usage

To run the software, install it together with all of its dependencies, then invoke it with the following command:

devel01.exe <path-to-input-data> <output-path>
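For scripted or batch runs, the command above can be wrapped in a small Python helper. The following is a minimal sketch and not part of the repository: the function names `build_command` and `run_devel01` are hypothetical, and it assumes that `devel01.exe` can be found on the PATH or in the working directory. Creating the output directory beforehand is also an assumption; the tool may well do this itself.

```python
import subprocess
from pathlib import Path


def build_command(input_dir, output_dir, executable="devel01.exe"):
    """Return the argument list for one devel01 run (hypothetical helper)."""
    return [executable, str(Path(input_dir)), str(Path(output_dir))]


def run_devel01(input_dir, output_dir, executable="devel01.exe"):
    """Run the tool once on a dataset directory."""
    # Assumption: creating the output directory up front is harmless.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_command(input_dir, output_dir, executable), check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when a dataset path contains spaces.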

Input and Output Formats

The software accepts authorship attribution datasets that are formatted according to the corresponding PAN shared task on authorship attribution. A number of datasets can be found there, and all of them are formatted as follows.

A dataset's TOP_DIRECTORY contains a file meta-file.json, which specifies

  • the language of the texts within (e.g., EN, GR, etc.),
  • the names of the subdirectories that contain texts from candidate authors,
  • the name of the subdirectory that contains texts of unknown authorship, and
  • the name of each file of unknown authorship that is to be attributed to one of the candidate authors.
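Reading these four pieces of information could look roughly like the sketch below. The JSON key names used here (`language`, `candidate-authors`, `author-name`, `folder`, `unknown-texts`, `unknown-text`) are assumptions based on the usual PAN layout and are not confirmed by this README; check them against the meta-file.json of the dataset at hand. The function name `load_meta` is likewise hypothetical.

```python
import json
from pathlib import Path


def load_meta(top_directory):
    """Return (language, candidate dirs, unknown dir, unknown files).

    All key names below are assumptions; adjust them to the actual file.
    """
    meta_path = Path(top_directory) / "meta-file.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    language = meta["language"]                                  # e.g. "EN"
    candidates = [c["author-name"] for c in meta["candidate-authors"]]
    unknown_dir = meta["folder"]                                 # texts of unknown authorship
    unknown_files = [u["unknown-text"] for u in meta["unknown-texts"]]
    return language, candidates, unknown_dir, unknown_files
```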

The software takes as input the path to the TOP_DIRECTORY of an unpacked dataset and starts the authorship attribution process from there. It writes a file answers.json to the OUTPUT_PATH, formatted as follows:

{
  "answers": [
    {"unknown_text": "unknown00001.txt", "author": "candidate00001", "score": 0.8},
    {"unknown_text": "unknown00002.txt", "author": "candidate00002", "score": 0.9}
  ]
}

where unknown_text is the name of an unknown text as per meta-file.json, author is the name of a candidate author as per meta-file.json, and score is a real value in the range [0,1] that indicates the software's confidence in its attribution (0 means completely uncertain, 1 means completely sure).
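For anyone producing or checking such a file from their own code, the format can be sketched in Python as follows. The keys `answers`, `unknown_text`, `author`, and `score` come from the description above; the helper names `make_answer` and `write_answers` are hypothetical and not part of the software.

```python
import json
from pathlib import Path


def make_answer(unknown_text, author, score):
    """Build one answer entry; the score must lie in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0, 1]")
    return {"unknown_text": unknown_text, "author": author, "score": score}


def write_answers(answers, output_path):
    """Write the entries to <output_path>/answers.json and return the file path."""
    out_file = Path(output_path) / "answers.json"
    out_file.write_text(json.dumps({"answers": answers}, indent=2), encoding="utf-8")
    return out_file
```

Validating the score when an entry is built catches out-of-range values before they reach the output file.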

License

Copyright (c) 2015 Fabian Duffhauss

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Old; remove later

Implementation

Build FeatureExtracter.exe and JsonWriter.exe from the source code in the corresponding folders. Use svm-train.exe and svm-predict.exe from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to run the classification.

Example

  1. Run FeatureExtracter (creates training and test files): ./FeatureExtracter.exe "./NEW CORPORA/pan11small" "train.dat" "test.dat"

  2. Run svm-train.exe (creates a model file): svm-train.exe -b 1 -c 20 -t 1 -d 3 train.dat model

  3. Run svm-predict.exe (creates a prediction file): svm-predict.exe -b 1 test.dat model predict

  4. Run JsonWriter (creates a JSON file): ./JsonWriter.exe "predict" "devel01.json"
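The four steps above can be chained in a short Python script. This is a sketch under stated assumptions, not part of the repository: the function names `pipeline_commands` and `run_pipeline` are hypothetical, and the executables are assumed to be reachable exactly as written in the example. In LIBSVM, `-b 1` enables probability estimates, `-c 20` sets the cost parameter, `-t 1` selects the polynomial kernel, and `-d 3` sets its degree.

```python
import subprocess


def pipeline_commands(corpus, train="train.dat", test="test.dat",
                      model="model", predict="predict", answers="devel01.json"):
    """Return the four commands of the example, in execution order."""
    return [
        # Step 1: extract features into training and test files.
        ["./FeatureExtracter.exe", corpus, train, test],
        # Step 2: train a polynomial-kernel SVM with probability estimates.
        ["svm-train.exe", "-b", "1", "-c", "20", "-t", "1", "-d", "3", train, model],
        # Step 3: predict labels and probabilities for the test file.
        ["svm-predict.exe", "-b", "1", test, model, predict],
        # Step 4: convert the prediction file to JSON.
        ["./JsonWriter.exe", predict, answers],
    ]


def run_pipeline(corpus):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in pipeline_commands(corpus):
        subprocess.run(command, check=True)
```

For instance, `run_pipeline("./NEW CORPORA/pan11small")` would reproduce the example; because each command is an argument list, the space in the corpus path needs no extra quoting.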
