Examples

  • Example: Operation on all 5-grams of a FASTA file:

    import dinopy
    from dinopy import qgrams
    
    far = dinopy.FastaReader("ecoli.fasta")
    sequence, info = far.genome()
    for qgram in qgrams(sequence, 5):
        print(qgram)
    
  • Example: Perform an analysis on each read of a file:

    import dinopy
    import numpy as np
    
    fqr = dinopy.FastqReader("ecoli_reads.fastq")
    for sequence, name, quality_values in fqr.reads(quality_values=True):
        qvs = np.frombuffer(quality_values)
        if np.mean(qvs) > arbitrary_threshold:
            important_analysis(sequence)
        else:
            low_quality(sequence)
    
  • Example: Work on all lines of a zipped FASTQ file:

    import dinopy
    import numpy as np
    
    fqr = dinopy.FastqReader("ecoli_reads.fastq.gz")
    for line_type, line_value in fqr.lines():
        if line_type == "name":
            print("Found a read with the name", line_value)
        elif line_type == "quality values":
            print("with the mean quality value", np.mean(np.frombuffer(line_value, dtype=np.uint8)))
    
  • Example: Create a 5-gram counter for all reads of a FASTA file:

    import dinopy
    from collections import Counter
    
    fp = dinopy.FastaReader("ecoli.fasta")
    counts = Counter(dinopy.qgrams(fp.reads(), 5))
    
  • Example: Create a naïve 4-gram index for a genome:

    import dinopy
    
    far = dinopy.FastaReader("ecoli.fasta")
    index = {}
    for i, qgram in enumerate(dinopy.qgrams(far.genome().sequence, 4, encoding=dinopy.two_bit)):
        if qgram in index:
            index[qgram].append(i)
        else:
            index[qgram] = [i]
    
  • Example: Read a genome, write a genome:

    import dinopy
    
    far = dinopy.FastaReader("ecoli.fasta")
    with dinopy.FastaWriter("baz.fasta") as faw:
        faw.write_genome(far.genome())
    
  • Example: Calculate suffix array for a given sequence:

    import dinopy
    
    seq = "mississippi"
    sar = dinopy.processors.suffix_array(seq)  # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
    
  • Example: Count unique gapped-qgram occurences of a given sequence [1]:

    import dinopy
    from dinopy import qgrams, reverse_complement, FastaReader, two_bit
    from collections import Counter
    from itertools import islice
    
    q = 16
    shape = '#' * q
    shapes = [shape[:i] + '_' + shape[i:] for i in range(1, q)]  # shapes for 16-grams with 1 gap
    
    far = FastaReader("human_g1k_v37.fasta.gz")
    iterator = far.lines(skip_name_lines=True)
    iterator = islice(iterator, 500, 500 + 100000)  # skip first few hundred lines of "NNNN...". Note that you could also use filter(lambda x: b'N' not in x, ...)
    seq_fwd = b''.join(list(iterator))  # for this example, we'll keep the data in memory
    seq_rev = reverse_complement(seq_fwd)
    
    for shape in shapes:
        counts = Counter(qgrams(seq_fwd, shape, encoding=two_bit, dtype=bytes))  # make sure to explicitly specify dtype for better performance
        counts += Counter(qgrams(seq_rev, list(reversed(shape)), encoding=two_bit, dtype=bytes))
        num_unique = sum([1 for v in counts.values() if v == 1])
        print("{}: {}".format(shape, num_unique))
    

Footnotes