dinopy.processors module

The processors module offers functions that operate on iterators (for example, as provided by dinopy.FastaReader.lines()). At the moment, the following processors are provided:

Work in progress processors include:

  • suffix_array (using the induced sorting algorithm)

  • qgram_index (using a two-pass algorithm)

dinopy.processors.complement(seq)

Compute the complement of a sequence.

Parameters:

seq (Iterable) – sequence of which the complement is to be computed.

Returns:

The complement of the sequence.
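To illustrate what complementing a sequence means, here is a minimal pure-Python sketch. It does not call dinopy; the helper name `complement_str` and the restriction to the bases A, C, G, T and N are assumptions made for this example only.

```python
# Illustrative sketch only; dinopy's own implementation may differ.
# Hypothetical helper: complements a str made of A/C/G/T/N bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement_str(seq: str) -> str:
    """Return the base-wise complement of seq (the order is preserved)."""
    return seq.translate(_COMPLEMENT)


print(complement_str("ACGTN"))  # TGCAN
```

Note that the order of the bases is unchanged; only each base is swapped for its partner.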

dinopy.processors.complement_2bit(uint64 seq, unsigned int seq_length=0, bool sentinel=False) uint64

Return the complement of a 2bit encoded sequence.

Parameters:

seq (uint64) – 2bit encoded sequence.

Returns:

The 2bit encoded complement of the input sequence.

Return type:

uint64
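The idea behind complementing a 2bit encoded sequence can be sketched with plain Python integers. The code below assumes the encoding A=0b00, C=0b01, G=0b10, T=0b11 with the first base in the highest bits; that mapping and the helper names `encode_2bit` and `complement_2bit_sketch` are assumptions for illustration, not taken from dinopy.

```python
# Assumed 2bit code for this sketch: A=0b00, C=0b01, G=0b10, T=0b11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(seq: str) -> int:
    """Pack bases into an integer, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def complement_2bit_sketch(encoded: int, length: int) -> int:
    """Flip both bits of each of the `length` bases (A<->T, C<->G)."""
    mask = (1 << (2 * length)) - 1  # two set bits per base
    return encoded ^ mask


assert complement_2bit_sketch(encode_2bit("ACGT"), 4) == encode_2bit("TGCA")
```

The length has to be passed explicitly in this sketch, because leading A bases are all-zero bits and cannot be told apart from unused bits.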

dinopy.processors.complement_4bit(uint64 seq, unsigned int seq_length=0, bool sentinel=False) uint64

Return the complement of a 4bit encoded sequence.

Parameters:

seq (uint64) – 4bit encoded sequence.

Returns:

The 4bit encoded complement of the input sequence.

Return type:

uint64

dinopy.processors.qgrams(source, qgram_shape, dtype=None, type encoding=None, bool wrap=False, bool sentinel=False)

Construct an iterator of qgrams of any given iterable source, using the specified shape.

Parameters:
  • source (Iterable) – Any iterable source or iterable of iterable sources. So "ACGT", ["ACGT"] and ["ACGT", "ACGT"] would all be valid.

  • qgram_shape (object) – Any input that describes a q-gram shape as defined in dinopy.shape; for example 5 or "##_#".

  • dtype (type) – Data type of the source. One of None, bytes, bytearray, basenumbers, str (see Note about dtype, Default: None). This is mainly a performance hint; if for example your source iterable supplies str items, specifying str will result in dinopy picking the respective low level function for generating str-q-grams, instead of having to guess each time a new item is processed (which is the case with the default dtype value of None).

  • encoding (type) – One of None, two_bit, four_bit (Default: None).

  • wrap (bool) – Whether to compute q-grams across consecutive source-item-boundaries. For example the 3-grams for the source ["ACGT", "TTTT"] would be ["ACG", "CGT", "TTT", "TTT"] (wrap = False), but with wrap = True the q-grams would be ["ACG", "CGT", "GTT", "TTT", "TTT", "TTT"]. (Default: False)

Yields:

dtype – All q-grams of the source, either in the same dtype as the source or encoded using either 2bit or 4bit encoding.

Raises:

Examples

  • Example 1:

    import dinopy
    # Iterate over all contiguous 7-grams of each read, as str.
    far = dinopy.FastaReader("files/testgenome_IUPAC.fasta")
    shp = dinopy.shape.Shape("#######")
    for qgram in dinopy.qgrams(far.reads(dtype=str), shp, dtype=str, encoding=None):
        pass
    
  • Example 2:

    import dinopy
    # Iterate over all contiguous 5-grams of each read, 2bit encoded.
    far = dinopy.FastaReader("files/testgenome_IUPAC.fasta")
    shp = dinopy.shape.Shape(5)
    for qgram in dinopy.qgrams(far.reads(dtype=bytes), shp, dtype=bytes, encoding=dinopy.two_bit):
        pass
    
  • Example 3:

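    import dinopy
    # Iterate over gapped q-grams of the file's lines; wrap=True also yields
    # the q-grams that span two consecutive lines.
    far = dinopy.FastaReader("files/testgenome_IUPAC.fasta")
    shp = dinopy.shape.Shape("##_#_#")
    for qgram in dinopy.qgrams(far.lines(dtype=str), shp, dtype=str, encoding=None, wrap=True):
        pass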
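
The effect of qgram_shape and wrap can also be reproduced without any input file. The following pure-Python sketch is not dinopy's implementation; the name `qgrams_sketch` is hypothetical, and it assumes that a '#' in a shape keeps a position while '_' skips it.

```python
from typing import Iterable, Iterator


def qgrams_sketch(items: Iterable[str], shape: str, wrap: bool = False) -> Iterator[str]:
    """Yield q-grams of str items for a shape such as "###" or "##_#".

    Assumption: '#' keeps a position, '_' skips it.
    """
    keep = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    # With wrap=True the items are treated as one continuous sequence.
    # (Joining everything in memory keeps the sketch short; it is not streaming.)
    sources = ["".join(items)] if wrap else items
    for text in sources:
        for start in range(len(text) - span + 1):
            yield "".join(text[start + i] for i in keep)


source = ["ACGT", "TTTT"]
print(list(qgrams_sketch(source, "###")))             # ['ACG', 'CGT', 'TTT', 'TTT']
print(list(qgrams_sketch(source, "###", wrap=True)))  # ['ACG', 'CGT', 'GTT', 'TTT', 'TTT', 'TTT']
print(list(qgrams_sketch(["ACGT"], "##_#")))          # ['ACT']
```

The first two calls reproduce the 3-gram lists given in the description of the wrap parameter.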
    
dinopy.processors.replace_ambiguities(seq, random_choice=choice)

Replace each occurrence of an IUPAC code in seq randomly by one of its corresponding bases. For bytearrays, this happens in place, while for str, bytes or int/long a new copy (with all ambiguities resolved) is created.

Parameters:
  • seq – The sequence in which each IUPAC code is to be replaced by one of the bases A, C, G or T.

  • random_choice (Collection → Item) – A function which for a given collection of bases selects exactly one of those bases. Defaults to random.choice which will sample uniformly from the collection.

Returns:

(A copy of) the sequence in which all IUPAC codes have been randomly replaced with one of their corresponding bases.
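
The role of the random_choice parameter can be illustrated with a small pure-Python sketch for str input. The function name `replace_ambiguities_sketch` and the table name `IUPAC` are hypothetical; the table lists the standard IUPAC nucleotide ambiguity codes, and the sketch makes no claim about how dinopy implements this internally.

```python
import random

# Standard IUPAC ambiguity codes and the bases each one stands for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def replace_ambiguities_sketch(seq: str, random_choice=random.choice) -> str:
    """Return a copy of seq with every IUPAC code replaced by one candidate base."""
    return "".join(
        random_choice(IUPAC[c]) if c in IUPAC else c
        for c in seq
    )


def first(bases):
    """Deterministic chooser: always pick the first candidate base."""
    return bases[0]


print(replace_ambiguities_sketch("ACNRT", random_choice=first))  # ACAAT
```

Passing a deterministic chooser such as `first` makes the result reproducible, which can be useful in tests; the default samples uniformly.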

dinopy.processors.reverse_complement(seq)

Compute the reverse complement of a sequence.

Parameters:

seq (Iterable) – sequence of which the reverse complement is to be computed.

Returns:

The reverse complement of the sequence.
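
As a rough illustration for bytes input, the reverse complement is the complement read back to front. The helper name `reverse_complement_bytes` is hypothetical, and the sketch only handles the upper-case bases A, C, G and T; it is not dinopy's implementation.

```python
# Translation table for the four unambiguous bases (assumption: upper case only).
_TABLE = bytes.maketrans(b"ACGT", b"TGCA")


def reverse_complement_bytes(seq: bytes) -> bytes:
    """Complement every base, then reverse the order."""
    return seq.translate(_TABLE)[::-1]


print(reverse_complement_bytes(b"AACG"))  # b'CGTT'
```

Applying the function twice returns the original sequence.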

dinopy.processors.reverse_complement_2bit(uint64 seq, unsigned int seq_length=0, bool sentinel=False) uint64

Return the reverse complement of a 2bit encoded sequence.

dinopy.processors.reverse_complement_4bit(uint64 seq, unsigned int seq_length=0, bool sentinel=False) uint64

Return the reverse complement of a 4bit encoded sequence.

Parameters:

seq (uint64) – 4bit encoded sequence.

Returns:

The 4bit encoded reverse complement of the input sequence.

Return type:

uint64

dinopy.processors.suffix_array(sequence) uint64_t[:]

Calculates the suffix array of the given sequence using the induced sorting algorithm as specified in this paper.
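
To show what a suffix array contains, here is a deliberately naive pure-Python sketch. It sorts the suffixes by direct comparison and is therefore not the induced sorting algorithm; the function name `naive_suffix_array` is hypothetical. It appends a null byte as sentinel when the input does not already end with one.

```python
def naive_suffix_array(seq: bytes) -> list:
    """Sort suffix start positions by comparing the suffixes directly.

    Simple but slow; this is not the induced sorting algorithm.
    """
    if not seq.endswith(b"\x00"):
        seq += b"\x00"  # sentinel: sorts before every other byte value
    return sorted(range(len(seq)), key=lambda i: seq[i:])


print(naive_suffix_array(b"banana"))  # [6, 5, 3, 1, 0, 4, 2]
```

Because the sentinel is smaller than every other value, the suffix that consists only of the sentinel always sorts first.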

Parameters:

sequence (str, bytes, bytearray) – Currently only supports sequences whose values can be interpreted as integers in the range of 0-255. String arguments will be coerced to bytes using string.encode('ascii'). Important: The sequence must be terminated with a so-called sentinel (a unique value smaller than any other value of the sequence). If the sequence does not end with a null byte (i.e. b'\x00'), it will be added automatically. That means your sequence’s length is increased by one, which in turn means that the very first item of the resulting suffix array will be equal to the number of items in the sequence + 1 (as ``b’