dinopy.conversion module

This module provides functions for the conversion between the data types used by dinopy.

This includes:

  • str

  • bytes

  • bytearray

  • dinopy.basenumbers

  • 2bit and 4bit encoded sequences

For a full list of the data types used by dinopy, see Note about dtype.

dinopy.conversion.basenumbers_to_bytes(basenumbers) bytes

Translate a sequence from basenumbers to bytes.

Parameters:

basenumbers (Iterable) – Containing basenumbers (0 → A, 1 → C, …)

Returns:

A bytes object containing the sequence encoded as ASCII bytes.

Return type:

bytes
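
The mapping described above can be illustrated with a small pure-Python sketch. It is not dinopy's implementation. The function name and lookup table are hypothetical, and the entries 2 → G and 3 → T are an assumed continuation of the "0 → A, 1 → C, …" pattern.

```python
# Illustrative sketch only; not dinopy's implementation.
# Assumed mapping: 0 -> A, 1 -> C, 2 -> G, 3 -> T (2 and 3 are extrapolated).
ASSUMED_BASENUMBER_TO_ASCII = {0: ord("A"), 1: ord("C"), 2: ord("G"), 3: ord("T")}


def basenumbers_to_bytes_sketch(basenumbers):
    """Map each basenumber to the ASCII code of its base letter."""
    return bytes(ASSUMED_BASENUMBER_TO_ASCII[n] for n in basenumbers)


print(basenumbers_to_bytes_sketch([0, 1, 2, 3]))  # b'ACGT'
```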

dinopy.conversion.basenumbers_to_string(basenumbers, bool as_list=False)

Translate a sequence from basenumbers to a string.

Parameters:
  • basenumbers (iterable) – Containing basenumbers (0 → A, 1 → C, …)

  • as_list (bool) – If True, return a list of characters instead of a joined string. (Default: False)

Returns:

A string containing the sequence. If as_list has been set to True, the string will be returned as a list of characters.

Return type:

str or list

dinopy.conversion.bytes_to_basenumbers(bytes byte_sequence, bool suppress_iupac=False) bytes

Translate a sequence of bytes into a basenumber sequence.

Parameters:
  • byte_sequence (bytes) – Bases encoded as ASCII values (65 → A, 67 → C, …). Works for upper- and lower-case characters.

  • suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)

Returns:

A bytes object containing the sequence encoded as basenumbers.

Return type:

bytes
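
The effect of suppress_iupac can be sketched in pure Python. This is not dinopy's implementation: the function name is hypothetical, and the basenumbers for G, T and N (2, 3 and 4) are assumptions. The sketch does not model basenumbers for other IUPAC codes, so it raises an error for them unless suppress_iupac is set.

```python
# Illustrative sketch only; not dinopy's implementation.
# Assumed basenumbers: A=0, C=1, G=2, T=3, N=4 (G, T and N are assumptions).
ASSUMED_BASENUMBERS = {ord("A"): 0, ord("C"): 1, ord("G"): 2, ord("T"): 3, ord("N"): 4}


def bytes_to_basenumbers_sketch(byte_sequence, suppress_iupac=False):
    """Translate ASCII-encoded bases (either case) to assumed basenumbers."""
    result = bytearray()
    for code in byte_sequence.upper():
        if code not in ASSUMED_BASENUMBERS:
            if not suppress_iupac:
                # Other IUPAC codes are outside the scope of this sketch.
                raise ValueError(f"unsupported character: {chr(code)!r}")
            code = ord("N")  # replace every non-ACGT character with N
        result.append(ASSUMED_BASENUMBERS[code])
    return bytes(result)


print(list(bytes_to_basenumbers_sketch(b"acgtRN", suppress_iupac=True)))  # [0, 1, 2, 3, 4, 4]
```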

dinopy.conversion.bytes_to_string(bytes byte_sequence, bool as_list=False) unicode

Translate bytes into a string.

Parameters:
  • byte_sequence (bytes) – Bases encoded as ASCII values (65 → A, 67 → C, …). Works for upper- and lower-case characters.

  • as_list (bool) – If True, return a list of characters instead of a joined string. (Default: False)

Returns:

A string containing the DNA sequence encoded as unicode text, or a list containing the unjoined characters.

Return type:

str or list
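
The two return shapes controlled by as_list can be shown with plain Python. The function below is a hypothetical stand-in, not dinopy's implementation, and it assumes the input holds only ASCII characters.

```python
# Illustrative sketch only; not dinopy's implementation.
def bytes_to_string_sketch(byte_sequence, as_list=False):
    """Decode ASCII bytes to text, optionally as a list of single characters."""
    text = byte_sequence.decode("ascii")
    return list(text) if as_list else text


print(bytes_to_string_sketch(b"ACGT"))                # ACGT
print(bytes_to_string_sketch(b"ACGT", as_list=True))  # ['A', 'C', 'G', 'T']
```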

dinopy.conversion.decode(unsigned long bit_seq, int length, dict inv_codemap) list

Decode a bit encoded sequence. Inverse of dinopy.auxiliary.encode().

Parameters:
  • bit_seq (unsigned long) – A bit representation of a sequence, as produced by encode().

  • length (int) – The length of the original (‘decoded’) sequence. If length == -1, assume leading sentinel bits, i.e. n ones where n is the number of bits needed to represent a single item of the codemap.

  • inv_codemap (dict) – The inverse map of the map used in encode().

Returns:

The sequence obtained by applying the inverse of encode() to a bit representation of said sequence.

Return type:

list

Note

This is a generic decode function and requires a codemap (a dictionary containing a translation for each possible item of the sequence).

Codemaps to translate from 2bit and 4bit to all dtypes are available in the dinopy.definitions module. For these translations you can use the specialized functions decode_twobit() and decode_fourbit().
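
To make the role of length and the sentinel bits concrete, here is a minimal pure-Python sketch of decoding with an inverse codemap. It is not dinopy's implementation. The function name, the example codemap and the bit layout (first item in the most significant position) are assumptions made for illustration.

```python
# Illustrative sketch only; not dinopy's implementation.
# Assumption: the first item of the sequence sits in the most significant bits.
def decode_sketch(bit_seq, length, inv_codemap):
    """Decode an integer into a list of items using an inverse codemap."""
    # Number of bits needed to represent a single item of the codemap.
    bits = max(1, max(inv_codemap).bit_length())
    mask = (1 << bits) - 1
    if length == -1:
        # Leading sentinel: `bits` ones on top, so the highest bit is set
        # and the total bit length equals bits * (length + 1).
        length = bit_seq.bit_length() // bits - 1
    return [
        inv_codemap[(bit_seq >> (bits * (length - 1 - i))) & mask]
        for i in range(length)
    ]


# Hypothetical inverse 2bit codemap (G and T entries are assumed).
INV_TWOBIT = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}

print(decode_sketch(0b00011011, 4, INV_TWOBIT))      # ['A', 'C', 'G', 'T']
print(decode_sketch(0b1100011011, -1, INV_TWOBIT))   # ['A', 'C', 'G', 'T']
```

Without sentinel bits the length has to be supplied, because leading items that encode to zero bits (here A) leave no trace in the integer.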

dinopy.conversion.decode_fourbit(unsigned long fourbit_seq, int length, type dtype=bytes)

Decode a 4bit encoded sequence into dtype. Inverse of encode_fourbit().

Parameters:
  • fourbit_seq (unsigned long) – A 4bit representation of a sequence, as produced by encode_fourbit().

  • length (int) – The length of the original (‘decoded’) sequence. If length == -1, assume leading sentinel bits, i.e. n ones where n is the number of bits needed to represent an item of the codemap.

  • dtype (type) – Target dtype to decode the sequence to (see Note about dtype).

Returns:

The decoded sequence of type dtype (see Note about dtype).

Raises:

InvalidDtypeError – If an unrecognized dtype has been passed (see Note about dtype).

dinopy.conversion.decode_twobit(unsigned long twobit_seq, int length, type dtype=bytes)

Decode a 2bit encoded sequence into a sequence of type dtype (see Note about dtype). Inverse of encode_twobit().

Parameters:
  • twobit_seq (unsigned long) – A 2bit representation of a sequence, as produced by encode_twobit().

  • length (int) – The length of the original (‘decoded’) sequence. If length == -1, assume leading sentinel bits, i.e. n ones where n is the number of bits needed to represent an item of the codemap.

  • dtype (type) – Target dtype to decode the sequence to (see Note about dtype).

Returns:

The sequence obtained by applying the inverse of encode_twobit() to a bit representation of said sequence.

Return type:

dtype

Raises:

InvalidDtypeError – If an unrecognized dtype has been passed.

dinopy.conversion.encode(seq, codemap, bool sentinel=False) unsigned long

Translate an iterable sequence to a long representation using the given codemap.

Parameters:
  • seq (object) – The iterable sequence to be encoded.

  • codemap (dict) – A dictionary specifying a mapping from an item in seq to some bits.

  • sentinel (bool) – Whether to prepend sentinel bits to the resulting sequence; this is useful if encoding sequences of variable lengths (but not really useful for sequences with fixed and known lengths, such as q-grams). If set to True, prepend n ones to the resulting sequence, where n is the number of bits needed to represent an item of the codemap. (Default: False).

Returns:

Value encoding the given sequence in a more compact bitstring version.

Return type:

unsigned long

Note

This is a generic encode function and requires a codemap (a dictionary containing a translation for each possible item of the sequence).

Codemaps to translate all dtypes to 2bit and 4bit are available in the dinopy.definitions module. For these translations you can use the specialized functions encode_twobit() and encode_fourbit().
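
The interplay of codemap and sentinel can be sketched in a few lines of pure Python. This is not dinopy's implementation. The function name, the example codemap and the bit layout (first item in the most significant position) are assumptions. Python integers are unbounded, whereas the documented return type is an unsigned long, so the sketch says nothing about overflow behaviour.

```python
# Illustrative sketch only; not dinopy's implementation.
# Assumption: items are appended from the left, each using a fixed bit width.
def encode_sketch(seq, codemap, sentinel=False):
    """Pack the items of seq into one integer using the given codemap."""
    # Number of bits needed to represent a single item of the codemap.
    bits = max(1, max(codemap.values()).bit_length())
    # With a sentinel, start with `bits` ones so the length can be recovered.
    value = (1 << bits) - 1 if sentinel else 0
    for item in seq:
        value = (value << bits) | codemap[item]
    return value


# Hypothetical 2bit codemap (G and T entries are assumed).
TWOBIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

print(bin(encode_sketch("ACGT", TWOBIT)))                 # 0b11011
print(bin(encode_sketch("ACGT", TWOBIT, sentinel=True)))  # 0b1100011011
```

In this sketch "AACG" and "CG" encode to the same integer when no sentinel is used, which is why the sentinel matters for sequences of variable length.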

dinopy.conversion.encode_fourbit(seq, bool sentinel=False) unsigned long

Encodes the given sequence using a four bit encoding. Note that all four bit encoded sequences share the common prefix 0b1111 for easier decoding.

Parameters:
  • seq (object) – The sequence to be 4bit encoded. Supports any dtype (see: Note about dtype).

  • sentinel (bool) – Whether to prepend sentinel bits to the resulting sequence; this is useful if encoding sequences of variable lengths (but not really useful for sequences with fixed and known lengths, such as q-grams). If set to True, prepend n ones to the resulting sequence, where n is the number of bits needed to represent an item of the codemap. (Default: False).

Returns:

The four bit encoded sequence as a single unsigned long integer.

Return type:

unsigned long

dinopy.conversion.encode_twobit(seq, bool sentinel=False) unsigned long

Encodes the given sequence using a two bit encoding. Note that all two bit encoded sequences share the common prefix 0b11 for easier decoding. If the sequence contains items other than 'A', 'C', 'G', 'T', they will be replaced randomly according to the usual IUPAC mapping.

Parameters:
  • seq (object) – The sequence to be 2bit encoded. Supports any dtype (see: Note about dtype).

  • sentinel (bool) – Whether to prepend sentinel bits to the resulting sequence; this is useful if encoding sequences of variable lengths (but not really useful for sequences with fixed and known lengths, such as q-grams). If set to True, prepend n ones to the resulting sequence, where n is the number of bits needed to represent an item of the codemap. (Default: False).

Returns:

The two bit encoded sequence as a single unsigned long integer.

Return type:

unsigned long
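
One plausible reading of "replaced randomly according to the usual IUPAC mapping" is that each ambiguity code is replaced by a randomly chosen base from the set it stands for. The sketch below illustrates that reading only. The function name and the rng parameter are hypothetical, and dinopy's actual replacement strategy may differ.

```python
# Illustrative sketch only; not dinopy's implementation.
import random

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_BASES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_iupac_sketch(seq, rng=random):
    """Replace each ambiguity code with a random base from its IUPAC set."""
    # Characters without an entry (including A, C, G, T) are kept as they are.
    return "".join(rng.choice(IUPAC_BASES.get(c, c)) for c in seq.upper())


print(resolve_iupac_sketch("ACGT"))   # ACGT (nothing to replace)
print(resolve_iupac_sketch("ACRNT"))  # e.g. ACGAT; the result varies between runs
```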

dinopy.conversion.get_inverse_codemap_from_twobit(type dtype) dict

Return the translation map from a two bit encoding to the respective dtype, e.g. from two bit to string: 0b00 → A, 0b01 → C, etc.

Parameters:

dtype (type) – Target dtype for the codemap.

Returns:

Translation dict (codemap) from 2bit encoding to the given dtype.

Return type:

dict

Raises:

InvalidDtypeError – If the dtype is not supported / recognized.
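
Conceptually, an inverse codemap swaps the keys and values of an encoding codemap. The sketch below shows this with a plain dictionary. It is not dinopy's implementation, and the function name and example codemap (including the G and T entries) are assumptions.

```python
# Illustrative sketch only; not dinopy's implementation.
def invert_codemap_sketch(codemap):
    """Swap keys and values of a codemap, rejecting ambiguous encodings."""
    inverse = {code: item for item, code in codemap.items()}
    if len(inverse) != len(codemap):
        raise ValueError("codemap is not invertible: duplicate bit codes")
    return inverse


# Hypothetical 2bit codemap for str (G and T entries are assumed).
TWOBIT_STR = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

print(invert_codemap_sketch(TWOBIT_STR))  # {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
```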

dinopy.conversion.illumina13_to_phred(illumina13_qvs) list

Translate Illumina 1.3 quality values to PHRED quality scores.

Parameters:

illumina13_qvs (int buffer) – Illumina 1.3 quality values.

Returns:

PHRED quality scores.

Return type:

list

dinopy.conversion.illumina15_to_phred(illumina15_qvs) list

Translate Illumina 1.5 quality values to PHRED quality scores.

Parameters:

illumina15_qvs (int buffer) – Illumina 1.5 quality values.

Returns:

PHRED quality scores.

Return type:

list

dinopy.conversion.illumina18_to_phred(illumina18_qvs) list

Translate Illumina 1.8 quality values to PHRED quality scores.

Parameters:

illumina18_qvs (int buffer) – Illumina 1.8 quality values.

Returns:

PHRED quality scores.

Return type:

list
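
The Illumina variants differ in the ASCII offset used in FASTQ quality lines: Illumina 1.3 and 1.5 use an offset of 64, Illumina 1.8 uses 33. The sketch below applies these offsets in pure Python. It is not dinopy's implementation. The function name and variant labels are hypothetical, and it assumes the quality values are the ASCII codes of a FASTQ quality line.

```python
# Illustrative sketch only; not dinopy's implementation.
# Assumption: quality values are ASCII codes taken from a FASTQ quality line.
ASSUMED_OFFSETS = {"illumina1.3": 64, "illumina1.5": 64, "illumina1.8": 33}


def ascii_to_phred_sketch(quality_values, variant):
    """Subtract the variant's ASCII offset from each quality value."""
    offset = ASSUMED_OFFSETS[variant]
    return [q - offset for q in quality_values]


print(ascii_to_phred_sketch(b"hh", "illumina1.3"))  # [40, 40]
print(ascii_to_phred_sketch(b"II", "illumina1.8"))  # [40, 40]
```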

dinopy.conversion.phred_to_illumina13(phred_qs) bytes

Translate PHRED quality scores to Illumina 1.3 quality values.

Parameters:

phred_qs (int buffer) – PHRED quality scores.

Returns:

Illumina 1.3 quality values.

Return type:

bytes

dinopy.conversion.phred_to_illumina15(phred_qs) bytes

Translate PHRED quality scores to Illumina 1.5 quality values.

Parameters:

phred_qs (int buffer) – PHRED quality scores.

Returns:

Illumina 1.5 quality values.

Return type:

bytes

dinopy.conversion.phred_to_illumina18(phred_qs) bytes

Translate PHRED quality scores to Illumina 1.8 quality values.

Parameters:

phred_qs (int buffer) – PHRED quality scores.

Returns:

Illumina 1.8 quality values.

Return type:

bytes

dinopy.conversion.phred_to_sanger(phred_qs) bytes

Translate PHRED quality scores to Sanger quality values.

Parameters:

phred_qs (int buffer) – PHRED quality scores.

Returns:

Sanger quality values.

Return type:

bytes

dinopy.conversion.phred_to_solexa(phred_qs) bytes

Translate PHRED quality scores to Solexa quality values.

Parameters:

phred_qs (int buffer) – PHRED quality scores.

Returns:

Solexa quality values.

Return type:

bytes

dinopy.conversion.sanger_to_phred(sanger_qvs) list

Translate Sanger quality values to PHRED quality scores.

Parameters:

sanger_qvs (int buffer) – Sanger quality values.

Returns:

PHRED quality scores.

Return type:

list

dinopy.conversion.solexa_to_phred(solexa_qvs) list

Translate Solexa quality values to PHRED quality scores.

Parameters:

solexa_qvs (int buffer) – Solexa quality values.

Returns:

PHRED quality scores.

Return type:

list
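
Solexa scores are not a shifted version of PHRED scores; the two scales are related by a logarithmic formula. The sketch below implements that standard relation on plain numeric scores. It is not dinopy's implementation: the function names are hypothetical, and whether dinopy rounds the result or uses lookup tables is not stated here.

```python
# Illustrative sketch only; not dinopy's implementation.
import math


def solexa_to_phred_score(q_solexa):
    """Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa_score(q_phred):
    """Q_solexa = 10 * log10(10 ** (Q_phred / 10) - 1); requires q_phred > 0."""
    return 10 * math.log10(10 ** (q_phred / 10) - 1)


print(round(solexa_to_phred_score(0), 2))  # 3.01
```

The two scales nearly coincide for high scores and diverge for low ones, where Solexa scores can become negative.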

dinopy.conversion.string_list_to_basenumbers(list sequence_list, bool suppress_iupac=False) list

Translate a list of sequences from string to basenumbers.

Can be used to translate whole reads with a single call.

Parameters:
  • sequence_list (list of strings) – List of strings of IUPAC-characters (upper or lower case).

  • suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)

Returns:

A list of bytes objects, each containing one sequence encoded as basenumbers.

Return type:

list of bytes

dinopy.conversion.string_sublists_to_basenumbers(list list_of_lists, bool suppress_iupac=False) list

Translate a list of lists of sequences from string to basenumbers.

Can be used to translate whole lists of reads.

Warning: this might consume a lot of memory.

Parameters:
  • list_of_lists (nested list of strings) – List of lists of strings of IUPAC-characters (upper or lower case).

  • suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)

Returns:

A list of lists of bytes objects, each containing one sequence encoded as basenumbers.

Return type:

list of list of bytes

dinopy.conversion.string_to_basenumbers(unicode sequence, bool suppress_iupac=False) bytes

Translate a string to a sequence of basenumbers.

Parameters:
  • sequence (str) – String of IUPAC-characters (upper or lower case).

  • suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)

Returns:

A bytes object containing the sequence encoded as basenumbers.

Return type:

bytes

dinopy.conversion.string_to_bytes(unicode sequence, bool suppress_iupac=False) bytes

Translate a string to a byte sequence.

Parameters:
  • sequence (str) – String of IUPAC-characters (upper or lower case).

  • suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)

Returns:

A bytes object containing the sequence encoded as ASCII bytes.

Return type:

bytes