iesl/rexa1-pstotext

Rexa version 1 pstotext, adapted from the DEC utility pstotext.

Installing

Run bin/setup to compile pstotext; the executable is placed in bin/. Make sure Ghostscript is installed, either on your path or supplied via the -gs argument (see below). The file 'ligatures.txt', also in the bin directory, must either be in the same directory as pstotext or be specified as a parameter:

pstotext -ligatures /path/to/ligatures.txt

Provided utilities:

pstotext:

Extract text and positional information from a PDF file.

Usage: pstotext input-file.pdf > output.xml

    pstotext --help:
      Usage: bin/pstotext [option|file]...
      Options:
        -cork               assume Cork encoding for dvips output
        -landscape          rotate 270 degrees
        -landscapeOther     rotate 90 degrees
        -portrait           don't rotate (default)
        -bboxes             output one word per line with bounding box
        -ligatures file     *attempt* to fix words with missing ligatures
                            using the specified ligature dictionary
        -debug              show Ghostscript output and error messages
        -gs "command"       Ghostscript command
        -                   read from stdin (default if no files specified)
        -output file        output results to "file" (default is stdout)
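
As an illustration of how the options above combine, here is a minimal Python sketch that assembles a pstotext command line. The function name `build_pstotext_argv` and its parameters are hypothetical and not part of this repository; only flags that appear in the help text above are used.

```python
def build_pstotext_argv(input_file, output_file=None, ligatures=None,
                        gs_command=None, bboxes=False, debug=False,
                        pstotext="bin/pstotext"):
    """Assemble an argv list for pstotext (hypothetical helper).

    Only flags listed in the --help output are emitted. Options are
    placed before the input file.
    """
    argv = [pstotext]
    if bboxes:
        argv.append("-bboxes")                 # one word per line with bounding box
    if ligatures:
        argv += ["-ligatures", ligatures]      # path to the ligature dictionary
    if debug:
        argv.append("-debug")                  # show Ghostscript output and errors
    if gs_command:
        argv += ["-gs", gs_command]            # Ghostscript command, as one argument
    if output_file:
        argv += ["-output", output_file]       # otherwise results go to stdout
    argv.append(input_file)
    return argv
```

The resulting list could be handed to `subprocess.run(argv, check=True)`, assuming pstotext has been built with bin/setup and Ghostscript is available.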

run-pstotext.sh:

Wrapper for pstotext.

Usage: run-pstotext.sh --file test-data/test.pdf --nogzip --timeout 30 --debug

  • Runs pstotext with a few extra features:
    • Enforces a timeout (some PDFs can cause pstotext to hang indefinitely) and kills pstotext if necessary.
    • Writes the results to the specified file, and optionally gzips the output file.
    • Runs a simple test to determine whether the input is likely to be a research paper, and outputs the result of the test.
    • Creates log files with the results of the process.
    Options:
        --file      input file, e.g. somefile.pdf
        --nogzip    do not gzip the output file
        --pstotext  path to pstotext (if not on the standard path)
        --timeout   timeout before killing pstotext subprocess
        --debug     print extra debugging info to stdout
        --log       name of logfile to append
        --logprefix string that will be prepended to all logging output for this process
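
The timeout-and-compress behaviour described above can be sketched in Python as follows. This is not the wrapper script's actual implementation; `run_with_timeout` and its parameters are hypothetical names, and the sketch assumes the child process writes its results to stdout.

```python
import gzip
import shutil
import subprocess


def run_with_timeout(argv, output_path, timeout_seconds=30, compress=True):
    """Run argv with stdout redirected to output_path (hypothetical helper).

    Returns "ok", "timeout", or "failed".
    """
    try:
        with open(output_path, "wb") as out:
            # subprocess.run kills the child and raises TimeoutExpired
            # once timeout_seconds has elapsed.
            result = subprocess.run(argv, stdout=out, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "timeout"
    if result.returncode != 0:
        return "failed"
    if compress:
        # Write a gzip copy next to the original; this sketch leaves the
        # uncompressed file in place.
        with open(output_path, "rb") as src, \
                gzip.open(output_path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
    return "ok"
```

A caller would pass the pstotext command as `argv` and could append the returned status to a log file, in the spirit of the --log option.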

idftype:

Try to guess the file type of an unknown file, then rename the file with an appropriate extension. If the file is compressed, uncompress and identify the newly expanded file. The input file will be overwritten if any renaming or decompression takes place.

Usage: idftype -file /some/unknown/file.xxx
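
One common way to guess a file type is to compare the first bytes against known signatures. The Python sketch below illustrates that idea only; it is an assumption about the general technique, not a description of how idftype works, and the signature table and function names are hypothetical. Unlike idftype, it does not rename or overwrite anything.

```python
import gzip

# Illustrative signature table (not idftype's own): leading bytes -> extension.
MAGIC = [
    (b"%PDF-", ".pdf"),     # PDF header
    (b"%!PS", ".ps"),       # PostScript header
    (b"\x1f\x8b", ".gz"),   # gzip magic number
]


def guess_extension(data):
    """Return an extension for the given leading bytes, or None if unknown."""
    for prefix, ext in MAGIC:
        if data.startswith(prefix):
            return ext
    return None


def identify(data):
    """Guess the type of data, looking inside gzip-compressed content."""
    ext = guess_extension(data)
    if ext == ".gz":
        # Decompress in memory and identify the expanded content instead.
        return guess_extension(gzip.decompress(data))
    return ext
```

For example, `identify(open(path, "rb").read())` would return ".pdf" for a gzip-compressed PDF, and None for content that matches no signature in the table.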
