Examples¶
Example: Operation on all 5-grams of a FASTA file:
import dinopy from dinopy import qgrams far = dinopy.FastaReader("ecoli.fasta") sequence, info = far.genome() for qgram in qgrams(sequence, 5): print(qgram)
Example: Perform an analysis on each read of a file:
import dinopy import numpy as np fqr = dinopy.FastqReader("ecoli_reads.fastq") for sequence, name, quality_values in fqr.reads(quality_values=True): qvs = np.frombuffer(quality_values) if np.mean(qvs) > arbitrary_threshold: important_analysis(sequence) else: low_quality(sequence)
Example: Work on all lines of a zipped FASTQ file:
import dinopy import numpy as np fqr = dinopy.FastqReader("ecoli_reads.fastq.gz") for line_type, line_value in fqr.lines(): if line_type == "name": print("Found a read with the name", line_value) elif line_type == "quality values": print("with the mean quality value", np.mean(np.frombuffer(line_value, dtype=np.uint8)))
Example: Create a 5-gram counter for all reads of a FASTA file:
import dinopy from collections import Counter fp = dinopy.FastaReader("ecoli.fasta") counts = Counter(dinopy.qgrams(fp.reads(), 5))
Example: Create a naïve 4-gram index for a genome:
import dinopy far = dinopy.FastaReader("ecoli.fasta") index = {} for i, qgram in enumerate(dinopy.qgrams(far.genome().sequence, 4, encoding=dinopy.two_bit)): if qgram in index: index[qgram].append(i) else: index[qgram] = [i]
Example: Read a genome, write a genome:
import dinopy far = dinopy.FastaReader("ecoli.fasta") with dinopy.FastaWriter("baz.fasta") as faw: faw.write_genome(far.genome())
Example: Calculate suffix array for a given sequence:
import dinopy seq = "mississippi" sar = dinopy.processors.suffix_array(seq) # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
Example: Count unique gapped-qgram occurences of a given sequence [1]:
import dinopy from dinopy import qgrams, reverse_complement, FastaReader, two_bit from collections import Counter from itertools import islice q = 16 shape = '#' * q shapes = [shape[:i] + '_' + shape[i:] for i in range(1, q)] # shapes for 16-grams with 1 gap far = FastaReader("human_g1k_v37.fasta.gz") iterator = far.lines(skip_name_lines=True) iterator = islice(iterator, 500, 500 + 100000) # skip first few hundred lines of "NNNN...". Note that you could also use filter(lambda x: b'N' not in x, ...) seq_fwd = b''.join(list(iterator)) # for this example, we'll keep the data in memory seq_rev = reverse_complement(seq_fwd) for shape in shapes: counts = Counter(qgrams(seq_fwd, shape, encoding=two_bit, dtype=bytes)) # make sure to explicitly specify dtype for better performance counts += Counter(qgrams(seq_rev, list(reversed(shape)), encoding=two_bit, dtype=bytes)) num_unique = sum([1 for v in counts.values() if v == 1]) print("{}: {}".format(shape, num_unique))
Footnotes