First Steps with dinopy
Installation
There are three ways to install dinopy:
With pip:
$ pip install dinopy
With conda:
$ conda install -c bioconda dinopy
From the source code:
$ python setup.py install
Using the Library
Depending on your use case, different parts of dinopy might be of interest to you.
For this tutorial, we’ll be interested in parsing FASTA files, so let’s start
by importing the dinopy main module and creating a FastaReader for a file
located at "foobar.fasta":
import dinopy
far = dinopy.FastaReader("foobar.fasta")
Having done that, let’s count its bigrams by reading the whole genome
and then using the dinopy.processors.qgrams processor to generate bigrams:
counts = {}
for bigram in dinopy.qgrams(far.genome().sequence, 2):
    counts[bigram] = counts.get(bigram, 0) + 1
Or, in a more Pythonic way:
from collections import Counter
counts = Counter(dinopy.qgrams(far.genome().sequence, 2))
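To make concrete what is being counted, here is a minimal pure-Python sketch of contiguous q-gram generation on a toy string. The function contiguous_qgrams is a hypothetical stand-in written for illustration; it is not dinopy’s implementation and only mirrors the behaviour described above for an integer shape.

```python
from collections import Counter


def contiguous_qgrams(sequence, q):
    """Yield every window of length q, moving one position at a time.

    Hypothetical stand-in for illustration only, not dinopy's code.
    """
    for start in range(len(sequence) - q + 1):
        yield sequence[start:start + q]


# A toy sequence stands in for a real genome.
toy_sequence = "ACGTAC"
bigram_counts = Counter(contiguous_qgrams(toy_sequence, 2))
print(bigram_counts)
```

A sequence of length n contains n - q + 1 overlapping q-grams, so the counts for the toy sequence sum to 5.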
Note
genome is a named tuple consisting of the elements sequence (which contains the genome data) and info (which is a collection of named tuples of the following structure: (name, length, interval), where interval is a (start_index, end_index) tuple). That means that you can use both tuple unpacking, as in seq, info = far.genome(), and member-like access, as in seq = far.genome().sequence.
genome reads the complete genome into memory. If you do not want this (because it might not fit into memory), supply chromosomes or even lines instead, as these provide an iterator over each chromosome (or line, respectively) in the file. However, you will also have to enable ‘qgram-wrapping’ to achieve the same result:
dinopy.qgrams(far.reads(), 2, wrap=True)
The above expression is then equivalent to
dinopy.qgrams(far.genome().sequence, 2)
but does not read the whole genome into memory.
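The following pure-Python sketch illustrates the two ideas in this note on toy data: the two access styles of a named tuple, and why wrapping is needed when the sequence arrives in chunks. Genome, qgrams_whole and qgrams_wrapped are hypothetical names introduced here. The sketch assumes that wrapping means q-grams spanning chunk boundaries are generated as well, which is what the equivalence stated above implies; it does not reproduce dinopy’s internals.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the (sequence, info) structure described above.
Genome = namedtuple("Genome", ["sequence", "info"])

# Toy values only; the info entries follow the (name, length, interval) layout.
genome = Genome(sequence="ACGTAC", info=[("chr1", 3, (0, 3)), ("chr2", 3, (3, 6))])
seq, info = genome        # tuple unpacking
same_seq = genome.sequence  # member-like access


def qgrams_whole(sequence, q):
    """All contiguous q-grams of a sequence held completely in memory."""
    return [sequence[i:i + q] for i in range(len(sequence) - q + 1)]


def qgrams_wrapped(chunks, q):
    """Yield q-grams over an iterable of chunks, including q-grams that
    span a chunk boundary, while keeping only q - 1 characters of history."""
    carry = ""
    for chunk in chunks:
        window = carry + chunk
        for i in range(len(window) - q + 1):
            yield window[i:i + q]
        # Remember the tail that can still start a q-gram in the next chunk.
        carry = window[-(q - 1):] if q > 1 else ""


chunked = list(qgrams_wrapped(["ACG", "TAC"], 2))
whole = qgrams_whole("ACGTAC", 2)
print(chunked == whole)
```

Without the carried-over tail, the bigram "GT" that straddles the two chunks would be lost, and the chunked result would differ from the whole-genome result.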
Note
qgrams accepts a multitude of different shape representations,
most commonly integers (as used above) or strings consisting of “#” (care)
and “_” (don’t care).
For more information on shapes, see dinopy.shape and dinopy.shaping.
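As a rough illustration of what a string shape expresses, the sketch below applies a care/don’t-care pattern to a toy sequence in plain Python. The function apply_shape is hypothetical, and the sketch assumes that a gapped q-gram is formed by concatenating the characters at the “#” positions; consult dinopy.shape and dinopy.shaping for the library’s actual behaviour.

```python
def apply_shape(sequence, shape):
    """Yield gapped q-grams: keep characters under '#', skip those under '_'.

    Hypothetical illustration, not dinopy's implementation.
    """
    care_positions = [i for i, symbol in enumerate(shape) if symbol == "#"]
    span = len(shape)  # number of sequence positions one shape covers
    for start in range(len(sequence) - span + 1):
        yield "".join(sequence[start + i] for i in care_positions)


gapped = list(apply_shape("ACGTAC", "#_#"))
contiguous = list(apply_shape("ACGT", "##"))
print(gapped)
print(contiguous)
```

Under this assumption, a shape made only of “#” characters behaves like the integer shape of the same length.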
To write reads to disk in FASTQ format, you can use the FastqWriter class. All writer classes of dinopy can be used in with-environments to ensure the output file is closed properly in case something goes wrong.
Let’s suppose we want to synthesize reads to test an algorithm. We want to use the genome
that we read using the FastaReader in the above example and have a function
random_read in the module helper which randomly generates a read from a sequence:
import dinopy
from helper import random_read
far = dinopy.FastaReader("foobar.fasta")
genome = far.genome()
with dinopy.FastqWriter("foobar_reads.fastq") as fqw:
    for _ in range(10000):
        seq, name, quality = random_read(genome.sequence)
        fqw.write(seq, name, quality)
Note that the output file will automatically be closed once you leave the environment.
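The helper module is not part of dinopy, so here is one possible sketch of what random_read could look like. Everything in it is an assumption made for illustration: the read_length and rng parameters, the naming scheme, the constant placeholder quality string, and the use of bytes for the name and quality values. Check the FastqWriter documentation for the types write actually expects.

```python
import random


def random_read(sequence, read_length=50, rng=random):
    """Cut a read of read_length from a random position of sequence.

    Hypothetical helper; returns a (seq, name, quality) triple.
    """
    read_length = min(read_length, len(sequence))
    start = rng.randrange(len(sequence) - read_length + 1)
    seq = sequence[start:start + read_length]
    name = "read_at_{}".format(start).encode("ascii")
    # Constant placeholder quality: 'I' is Q40 in Phred+33 encoding.
    quality = b"I" * len(seq)
    return seq, name, quality


# Toy genome and a seeded generator so the example is repeatable.
toy_genome = b"ACGT" * 100
seq, name, quality = random_read(toy_genome, read_length=20, rng=random.Random(42))
print(name, seq, quality)
```

A realistic read simulator would also model sequencing errors and varying quality values; this sketch only samples error-free substrings.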
In case you would like to have direct control over the opening and closing of the file,
or you need a variable number of writers, you can directly use the open() and close()
methods of the FastqWriter. Our code from above would then look like this:
import dinopy
from helper import random_read
far = dinopy.FastaReader("foobar.fasta")
genome = far.genome()
fqw = dinopy.FastqWriter("foobar_reads.fastq")
fqw.open()
for _ in range(10000):
    seq, name, quality = random_read(genome.sequence)
    fqw.write(seq, name, quality)
fqw.close()
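When writers are opened and closed by hand, an exception between open() and close() would leave the file open. One way to handle a variable number of writers safely is contextlib.ExitStack from the standard library, sketched below. StandInWriter is a hypothetical in-memory class that only imitates the open(), write() and close() methods mentioned above, so that the example runs without dinopy or any files on disk.

```python
from contextlib import ExitStack


class StandInWriter:
    """Hypothetical minimal writer exposing open(), write() and close()."""

    def __init__(self, path):
        self.path = path
        self.is_open = False
        self.records = []

    def open(self):
        self.is_open = True

    def write(self, seq, name, quality):
        if not self.is_open:
            raise ValueError("writer is not open")
        self.records.append((seq, name, quality))

    def close(self):
        self.is_open = False


# The number of writers is only known at runtime.
writers = [StandInWriter("sample_{}.fastq".format(i)) for i in range(3)]

with ExitStack() as stack:
    for writer in writers:
        writer.open()
        # Register close() right after open(), so it runs even on errors.
        stack.callback(writer.close)
    for i, writer in enumerate(writers):
        writer.write(b"ACGT", "read_{}".format(i).encode("ascii"), b"IIII")

print([w.is_open for w in writers])
```

The same pattern should carry over to any object with explicit open() and close() methods; for a single writer, a plain try/finally block achieves the same guarantee.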
And that’s basically it. If you wish to read FASTQ files, simply replace
Fasta with Fastq (dinopy.fastq_reader).
Similarly, if you wish to write files, replace Reader
with Writer (dinopy.fasta_writer and dinopy.fastq_writer).