GitHub - sean-mccorkle/GenomeUtils: Programs and scripts of general use for nucleotide and protein sequence bioinformatics

Github repo version of Genome/src

This is a collection of unix command line program and script utilities mostly for processing DNA, RNA and protein sequences. Many of these read naturally from files specified as command line arguments or from stdin when no arguments or given, and typically write output on standard out. They are intended to be usable in unix pipes.

Most of these read and write sequence data in fasta format.

Perl scripts

codon_freqs - reports codon frequences of DNA & RNA sequences
connected_subgraphs Reads pairwise edges (node node) and outputs connected subgraphs
extract_sequences_list Read list of ids, extract those fasta sequences with those ids in header
fasta_chunker - split file of many fasta sequences into n files (for parallel processing)
fasta2md5 - read fasta sequences and write the md5 hex hashcode along with the sequence header (DNA, RNA, protein)
kmers_by_frame - print out kmer counts in each reading frame (DNA, RNA - can work on proteins too).
new_overlapper - reads two (sorted) tables of chromosome coordinates and reports overlaps between them
ngrams - reads fasta-format sequences, prints counts of ngrams (DNA, RNA, protein)
numseqs - report count of numbers of fasta sequences (DNA, RNA, protein)
one_line_fasta - remove newlines from fasta sequences (ouput sequence on one line only, regardless of length)
orfs.pl - find open reading frames in DNA sequences
poly_a - report long polynucleotide streches in DNA and RNA sequences
pp - fasta format sequence pretty-printer (DNA, RNA, protein)
rc - reverse complement DNA fasta sequences
sam_overlapper - finds SAM file mapped reads which overlap (or not) a set of transcript (or other) coordinates
sel - (select) reads fasta sequences and filters according to various criteria (DNA, RNA, protein)
seq_len - (sequence length) reads fasta sequences and prints out their lengths (DNA, RNA, protein)
seq_range - (range of sequences filter) reads fasta sequences, prints out only those in specified numeric range
splt - breaks files of multiple fasta sequences into separate files (DNA, RNA, protein)
trans.pl - (translate) translates DNA or RNA sequences into protein sequences (fasta format)
truncseq - (truncate sequence) - read DNA, RNA or protein sequences and print subsequence (fasta format)

compiled C programs

intervals - extract multiple sequence intervals from a large DNA sequence (chromosome sized)
kmers - nucleotide k-mer counts
nt - nucleotide frequencies of DNA or RNA sequences
prosearch - search DNA for binding motifs specified by patterns

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
Scripts		Scripts
src		src
testdata		testdata
.gitignore		.gitignore
README.md		README.md
TODO		TODO

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scripts

Scripts

src

src

testdata

testdata

.gitignore

.gitignore

README.md

README.md

TODO

TODO

Repository files navigation

Github repo version of Genome/src

Perl scripts

compiled C programs

About

Releases

Packages

Languages

sean-mccorkle/GenomeUtils

Folders and files

Latest commit

History

Repository files navigation

Github repo version of Genome/src

Perl scripts

compiled C programs

About

Resources

Stars

Watchers

Forks

Languages