path-index-graph-kernel

An implementation of an efficient graph kernel using a compressed path index. The original implementation is by Markus Heinonen and Niko Välimäki (M. Heinonen, N. Välimäki, V. Mäkinen, J. Rousu, "Efficient Path Kernels for Reaction Function Prediction", 2012).

Contents

  • algorithm - The implementation of the algorithm to be used with .sdf or .mol files as input
  • build.sh - A script that builds / compiles all the used scripts for the current system
  • examples - Example use cases for the algorithm
    • keggReactionPrediction - Prediction of reaction EC numbers using the KEGG dataset
  • README.md - This readme

Usage

Prerequisites

  • Python
  • Java
  • a C++ compiler
  • MATLAB (for some of the examples)

Building

Some of the code must be compiled for your system before first use. The script build.sh does this for you; run it once and everything should be in place.

Running the algorithm

The algorithm currently supports only .sdf files (e.g. from PubChem) and .mol files (e.g. from KEGG) as input graphs. To run it, call the wrapper script from the algorithm directory: python runAlgorithm.py
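The two input formats are closely related: an .sdf file is a sequence of molfile-style records, each terminated by a line containing only `$$$$`. The following is a minimal sketch of how such a file could be split into per-molecule records. The function name `split_sdf_records` is hypothetical and is not part of this repository; how runAlgorithm.py actually reads its input is not shown here.

```python
from typing import Iterator, List


def split_sdf_records(lines: List[str]) -> Iterator[List[str]]:
    """Yield one list of lines per molecule record.

    Hypothetical helper, not part of this repository. Records in an
    .sdf file end with a line containing only "$$$$"; a plain .mol
    file has no terminator and is returned as a single record.
    """
    record: List[str] = []
    for line in lines:
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            # Keep blank lines: the molfile header block may contain them.
            record.append(line.rstrip("\n"))
    # Trailing record without a terminator (e.g. a single .mol file).
    if any(entry.strip() for entry in record):
        yield record
```

Each yielded record can then be treated like the contents of one .mol file.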

Examples

keggReactionPrediction

Prediction of reaction EC numbers using the KEGG dataset. The KEGG data first needs to be preprocessed: reaction graphs are built from the KEGG molecule and reaction data, and these graphs can then be used as input for the algorithm.

The included preprocessing script does all of this automatically: run preprocessing.py [-h] -k KEGGPATH [-v]. The results will be stored in the results directory.

For the sake of completeness, here are the same steps as you would run them manually:

  1. reaction-list: Create KEGG reaction listing with python reaction-list/extract-reactions.py -k KEGGPATH > RESULTS/kegg-reactions.txt.
  2. feature-generator: Create atom features from KEGG mol files using python generator2011.py KEGGPATH/mol/* -k all --output-dir RESULTS/mol-features/. The optional parameter k specifies the context size, default is all. There are other optional parameters for this script.
  3. atommapper: Create reaction graph mol files with atommapper: java Mapper2000 -rgraphs -moldir KEGGPATH/mol/ -featdir RESULTS/mol-features/ -reacfile RESULTS/kegg-reactions.txt -output RESULTS/reaction-graphs. There are further optional parameters for this script.
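The three manual steps above can be sketched as a small Python driver. This is only an illustration under stated assumptions: the function names `manual_preprocessing_steps` and `run_steps` are hypothetical, the commands are copied from the list above, and the working directory each tool expects (for example, where generator2011.py and the Mapper2000 class live) is not handled. The shell glob `KEGGPATH/mol/*` is expanded with `glob`, since no shell is involved.

```python
import glob
import os
import subprocess
from typing import List, Optional, Tuple

# (argv, stdout_path); stdout_path is None when the tool writes its own output.
Step = Tuple[List[str], Optional[str]]


def manual_preprocessing_steps(kegg_path: str, results: str) -> List[Step]:
    """Build the three commands from the list above (hypothetical helper)."""
    mol_dir = os.path.join(kegg_path, "mol")
    feat_dir = os.path.join(results, "mol-features")
    reac_file = os.path.join(results, "kegg-reactions.txt")
    mol_files = sorted(glob.glob(os.path.join(mol_dir, "*")))
    return [
        # 1. reaction-list: stdout is redirected into kegg-reactions.txt
        (["python", "reaction-list/extract-reactions.py", "-k", kegg_path],
         reac_file),
        # 2. feature-generator: here -k is the context size, not the KEGG path
        (["python", "generator2011.py", *mol_files,
          "-k", "all", "--output-dir", feat_dir + "/"], None),
        # 3. atommapper: builds the reaction graph mol files
        (["java", "Mapper2000", "-rgraphs",
          "-moldir", mol_dir + "/", "-featdir", feat_dir + "/",
          "-reacfile", reac_file,
          "-output", os.path.join(results, "reaction-graphs")], None),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each step in order and stop at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling `run_steps(manual_preprocessing_steps(KEGGPATH, RESULTS))` would then mirror the manual sequence, provided the tools are reachable from the current directory.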

Old stuff below this line


keggPreProcessing/4-trie-generator

Java code to extract all paths from each node as tries.

  • TrieGenerator.java converts reaction graph .mol files into .seqs files listing all the paths up to a specified maximum depth: java TrieGenerator <maxdepth> dir/with/reactiongraph-mol-files/*.mol
  • Concatenate those files into a single file (e.g. cat-rgraphs.trees): cat dir/with/seqs-files/*.seqs > cat-rgraphs.trees
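To illustrate what "all paths from a node as a trie" means, here is a minimal Python sketch. It is not the Java TrieGenerator and does not produce the .seqs format. It assumes an undirected graph given as adjacency lists, node labels only (no edge labels), and simple paths that never revisit a node. All names are hypothetical.

```python
from typing import Dict, Iterator, List, Set, Tuple

Graph = Dict[int, List[int]]  # node id -> neighbour ids
Trie = Dict[str, "Trie"]      # label -> sub-trie


def merge(dst: Trie, src: Trie) -> None:
    """Merge the branches of src into dst in place."""
    for label, sub in src.items():
        merge(dst.setdefault(label, {}), sub)


def path_trie(graph: Graph, labels: Dict[int, str],
              root: int, max_depth: int) -> Trie:
    """Trie of label sequences of simple paths starting at root.

    Paths have at most max_depth edges. Assumption: no node is
    visited twice on the same path.
    """
    def extend(node: int, depth: int, visited: Set[int]) -> Trie:
        children: Trie = {}
        if depth == max_depth:
            return children
        for nxt in graph[node]:
            if nxt in visited:
                continue
            sub = extend(nxt, depth + 1, visited | {nxt})
            # Neighbours with the same label share one trie branch.
            merge(children.setdefault(labels[nxt], {}), sub)
        return children

    return {labels[root]: extend(root, 0, {root})}


def trie_paths(trie: Trie,
               prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """List every label path stored in the trie, shortest prefixes first."""
    for label, sub in sorted(trie.items()):
        path = prefix + (label,)
        yield path
        yield from trie_paths(sub, path)
```

For the chain C-C-O with `graph = {0: [1], 1: [0, 2], 2: [1]}` and `labels = {0: "C", 1: "C", 2: "O"}`, `path_trie(graph, labels, 0, 2)` gives `{"C": {"C": {"O": {}}}}`.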

treeBurrowsWheelerTransform

C++ code to traverse all these trees and count the path frequencies of each graph.

  • tconvert converts the encoding of the graph to a Unicode encoding: ./tconvert < dir/with/cat-rgraphs.trees > output.graph 2>encoding.txt. Each edge label and multi-character label is encoded as a single character, which makes such labels usable. The encoding key is written to stderr.
  • builder creates an index (output.graph.tbwt): ./builder output.graph
  • traverse computes the resulting path frequencies from the index generated by builder: ./traverse output.graph.tbwt > output.freqs
  • resultconvert translates the encoding produced by tconvert back to the original labels: ./resultconvert cat-rgraphs.trees encoding.txt < output.freqs > result-kegg.freqs
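The idea behind the tconvert / resultconvert pair, mapping every label to a single character and back, can be sketched in Python as follows. This is an illustration only: the function names, the starting code point `0x100`, and the in-memory key are assumptions, and the real file formats of output.graph and encoding.txt are not reproduced.

```python
from typing import Dict, List, Tuple


def encode_labels(paths: List[List[str]],
                  first_code_point: int = 0x100
                  ) -> Tuple[List[str], Dict[str, str]]:
    """Replace each label by one Unicode character.

    Returns the encoded paths and the encoding key (label -> character).
    The starting code point is an arbitrary choice for this sketch.
    """
    key: Dict[str, str] = {}
    encoded: List[str] = []
    for path in paths:
        chars = []
        for label in path:
            if label not in key:
                # The next unused code point becomes this label's symbol.
                key[label] = chr(first_code_point + len(key))
            chars.append(key[label])
        encoded.append("".join(chars))
    return encoded, key


def decode_labels(encoded: List[str], key: Dict[str, str]) -> List[List[str]]:
    """Invert encode_labels using the same key."""
    reverse = {char: label for label, char in key.items()}
    return [[reverse[char] for char in text] for text in encoded]
```

With this, a path such as `["Cl", "-", "C"]` becomes a three-character string, so an index built over single characters can handle multi-character atom labels and edge labels alike.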

kernel

  • Create a listing of all .mol files in the reaction graph directory (results of keggPreProcessing/3-atommapper): cd resultdir/of/3-atommapper and ls -1 *.mol > rgraphlist.txt
  • Convert the .freqs result to a MATLAB sparse matrix file with python freq2mtl.py path/to/result-kegg.freqs path/to/rgraphlist.txt
  • Uncomment the kernels you want to use in genkernelmatrices.m, adjust the paths and run the script
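As a rough illustration of how a kernel matrix can be derived from per-graph path frequencies, the sketch below computes a plain linear kernel (the dot product of path-count vectors) and a cosine-normalized variant. It is written in Python for consistency with the other sketches. It is not a port of genkernelmatrices.m, whose kernels are not described here, and it works on in-memory dictionaries rather than the .freqs file format. All names are hypothetical.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def linear_kernel(freqs: List[Dict[str, int]]) -> Matrix:
    """K[i][j] = sum over shared paths of count_i(path) * count_j(path)."""
    n = len(freqs)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Iterate over the smaller dictionary, look up in the larger one.
            small, large = sorted((freqs[i], freqs[j]), key=len)
            value = float(sum(count * large.get(path, 0)
                              for path, count in small.items()))
            kernel[i][j] = kernel[j][i] = value
    return kernel


def normalize(kernel: Matrix) -> Matrix:
    """Cosine normalization: K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(kernel)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(kernel[i][i] * kernel[j][j])
            out[i][j] = kernel[i][j] / denom if denom else 0.0
    return out
```

Graphs that share no paths get a kernel value of zero, and after normalization every graph with at least one path has a self-similarity of one.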
