Skip to content

Utility program for extracting sequences from a fasta/fastq file by size or by header name

License

Notifications You must be signed in to change notification settings

ilovefungi/pullseq

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This software includes two applications: pullseq and seqdiff.

Pullseq Summary:
  pullseq - extract sequences from a fasta/fastq file.  This program is
  fast, and can be useful in a variety of situations.  You can use it to
  extract sequences from one fasta/fastq file into a new file, given
  either a list of header ids to include / exclude or a size minimum /
  maximum sequence lengths.

  Additionally, it can convert from fastq to fasta or visa-versa and
  can change the length of the output sequence lines.

  NOTE: pullseq prints to standard out, so you need to use redirection
  (e.g. pullseq input.fasta -m 10 *>* output.fasta ) to create output files.

Synopsis:
  # general extraction with a list of names
  pullseq --input=<input fasta/fastq file> --names=<fasta header ids file>
  
  # general extraction with a minimum size requirement
  pullseq --input=<input fasta/fastq file> --min=<minimum size sequence to extract>
  
  # only sequences with min 200 and max 500
  pullseq -i input.fasta -m 200 -a 500 > new.fasta

  Options:
    -i, --input,     Input fasta/fastq file (required)
    -n, --names,     File of header id names to select
    -m, --min,       Minimum sequence length
    -a, --max,       Maximum sequence length
    -l, --length,    Sequence characters per line (default 50)
    -c, --convert,   Convert input to fastq/fasta (e.g. if input is fastq, output will be fasta)
    -q, --quality,   ASCII code to use for fasta->fastq quality conversions
    -e, --excluded,  Exclude the header id names in the list (-n)
    -t, --count,     Just count the possible output, but don't write it
    -h, --help,      Display this help and exit
    -v, --verbose,   Print extra details during the run
    --version,       Output version information and exit


Seqdiff Summary:
  seqdiff - compare two fasta (or fastq) files to determine overlap of
  sequences.  This overlap can be at the sequence level (are two
  sequences exactly the same in both files?) or at the header name
  level (do two sequences contain the same header name between the two
  files?).

Synopsis:
  seqdiff -1 first_file.fa -2 second_file.fa

Usage:
  seqdiff -1 <first input fasta/fastq file> -2 <second fasta/fastq file>

  Options:
    -1, --first,      First sequence file (required)
    -2, --second,     Second sequence file (required)
    -a, --a_output,   File name for uniques from first file
    -b, --b_output,   File name for uniques from second file
    -c, --c_output,   File name for common entries
    -d, --headers,    Compare headers instead of sequences (default: false)
    -s, --summary,    Just show summary stats? (default: false)
    -h, --help,       Display this help and exit
    -v, --verbose,    Print extra details during the run
    --version,        Output version information and exit

REQUIREMENTS:
  Pullseq/Seqdiff require a C compiler and has been tested to work with
  either GCC or clang. They also require (and include) kseq.h (Heng
  Li) and uthash.h (http://troydhanson.github.com/uthash/).

  kseq.h also requires Zlib (so your linker should be able to handle
  the '-lz' option).  You can obtain zlib from http://www.zlib.net/
  or commonly from your OS package manager (e.g. apt-get zlib or
  emerge zlib).

INSTALL:
  To install, do the following in a shell on your system...

  From Git:
  git clone https://github.com/bcthomas/pullseq.git # checkout the code using git
  cd pullseq
  ./bootstrap  # get set up for config/build after cloning
  ./configure  # configure the application based on your system
  make         # will build the application
  make install # will install in /usr/local by default

  From a Tar file:
  tar xvf pullseq_version.tar.gz
  cd pullseq_version
  ./configure  # configure the application based on your system
  make         # will build the application
  make install # will install in /usr/local by default

About

Utility program for extracting sequences from a fasta/fastq file by size or by header name

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published