dinopy.conversion module
This module provides functions for the conversion between the data types used by dinopy.
This includes:
str
bytes
bytearray
dinopy.basenumbers
2bit and 4bit encoded sequences
For a full list of the data types used by dinopy, see Note about dtype.
- dinopy.conversion.basenumbers_to_bytes(basenumbers) → bytes
Translate a sequence from basenumbers to bytes.
- Parameters:
basenumbers (Iterable) – Containing basenumbers (0 → A, 1 → C, …)
- Returns:
A bytes object containing the sequence encoded as ASCII bytes.
- Return type:
bytes
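As an illustration of the mapping described above, the following pure-Python sketch mimics the conversion without calling dinopy. Only 0 → A and 1 → C are stated in the text. The remaining table entries (2 → G, 3 → T, 4 → N) and the helper name `sketch_basenumbers_to_bytes` are assumptions made for this example.

```python
# Assumed basenumber table; only 0 -> A and 1 -> C are given in the text above.
BASENUMBER_TO_ASCII = {
    0: ord("A"),
    1: ord("C"),
    2: ord("G"),  # assumption
    3: ord("T"),  # assumption
    4: ord("N"),  # assumption
}


def sketch_basenumbers_to_bytes(basenumbers):
    """Map each basenumber to its ASCII code and pack the result into bytes."""
    return bytes(BASENUMBER_TO_ASCII[b] for b in basenumbers)


print(sketch_basenumbers_to_bytes([0, 1, 2, 3]))  # b'ACGT'
```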
- dinopy.conversion.basenumbers_to_string(basenumbers, bool as_list=False)
Translate a sequence from basenumbers to a string.
- Parameters:
basenumbers (iterable) – Containing basenumbers (0 → A, 1 → C, …)
as_list (bool) – Return the string as a list of characters instead of a (joined) string. (Default: False)
- Returns:
A string containing the sequence. If as_list has been set to True, the string will be returned as a list of characters.
- Return type:
str or list
- dinopy.conversion.bytes_to_basenumbers(bytes byte_sequence, bool suppress_iupac=False) → bytes
Translate a sequence of bytes into a basenumber sequence.
- Parameters:
byte_sequence (bytes) – Bases encoded as ASCII values (65 → A, 67 → C, …). Works for upper and lower case characters.
suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)
- Returns:
A bytes object containing the sequence encoded as basenumbers.
- Return type:
bytes
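The effect of suppress_iupac can be pictured with a small stand-alone sketch. It does not call dinopy. The basenumbers for G, T and N (2, 3, 4) and the helper name `sketch_suppressing_conversion` are assumptions, and the sketch only covers the case where every non-ACGT character is replaced with N.

```python
# Assumed basenumber table (only 0 -> A and 1 -> C are stated in the text).
ASCII_TO_BASENUMBER = {ord("A"): 0, ord("C"): 1, ord("G"): 2, ord("T"): 3}
N_BASENUMBER = 4  # assumed basenumber for N


def sketch_suppressing_conversion(byte_sequence):
    """Convert ASCII bases to basenumbers, replacing non-ACGT bytes with N."""
    upper = byte_sequence.upper()  # accept lower case input as well
    return bytes(ASCII_TO_BASENUMBER.get(b, N_BASENUMBER) for b in upper)


print(list(sketch_suppressing_conversion(b"acgtRY")))  # [0, 1, 2, 3, 4, 4]
```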
- dinopy.conversion.bytes_to_string(bytes byte_sequence, bool as_list=False) → unicode
Translate bytes into a string.
- Parameters:
byte_sequence (bytes) – Bases encoded as ASCII values (65 → A, 67 → C, …). Works for upper and lower case characters.
as_list (bool) – Return the string as a list of characters instead of a (joined) string. (Default: False)
- Returns:
A string containing the DNA sequence encoded as unicode text, or a list containing the unjoined characters.
- Return type:
str or list
- dinopy.conversion.decode(unsigned long bit_seq, int length, dict inv_codemap) → list
Decode a bit encoded sequence. Inverse of dinopy.auxiliary.encode().
- Parameters:
bit_seq (unsigned long) – A bit representation of a sequence, as produced by encode().
length (int) – The length of the original (‘decoded’) sequence. If length == -1, assume leading sentinel bits, i.e. n ones where n is the number of bits needed to represent a single item of the codemap.
inv_codemap (dict) – The inverse map of the map used in encode().
- Returns:
The sequence obtained by applying the inverse of encode() to a bit representation of said sequence.
- Return type:
list
Note
This is a generic decode function and requires a codemap (a dictionary containing a translation for each possible item of the sequence).
Codemaps to translate from 2bit and 4bit to all dtypes are available in the dinopy.definitions module. For these translations you can use the specialized functions decode_twobit() and decode_fourbit().
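The decoding rule, including the length == -1 sentinel case, can be sketched in plain Python. This is an illustration under stated assumptions, not dinopy's implementation: the first item is assumed to occupy the most significant bits, the extra `bits_per_item` argument is hypothetical (the real function takes no such parameter), and the codemap entries for G and T are assumed.

```python
def sketch_decode(bit_seq, length, inv_codemap, bits_per_item):
    """Decode an integer into a list of items, first item in the highest bits."""
    if length == -1:
        # Sentinel mode: the top bits_per_item bits are all ones, so the
        # total bit length is (length + 1) * bits_per_item.
        length = bit_seq.bit_length() // bits_per_item - 1
    mask = (1 << bits_per_item) - 1
    items = []
    for i in range(length):
        shift = (length - 1 - i) * bits_per_item
        items.append(inv_codemap[(bit_seq >> shift) & mask])
    return items


# Hypothetical inverse 2bit codemap for strings (G and T entries are assumed).
inv_codemap = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}

print(sketch_decode(0b00011011, 4, inv_codemap, 2))     # ['A', 'C', 'G', 'T']
print(sketch_decode(0b1100011011, -1, inv_codemap, 2))  # ['A', 'C', 'G', 'T']
```

Without the sentinel, the caller has to supply the length, because leading items that encode to zero bits leave no trace in the integer.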
- dinopy.conversion.decode_fourbit(unsigned long fourbit_seq, int length, type dtype=bytes)
Decode a 4bit encoded sequence into dtype. Inverse of encode_fourbit().
- Parameters:
fourbit_seq (unsigned long) – A 4bit representation of a sequence, as produced by encode_fourbit().
length (int) – The length of the original (‘decoded’) sequence. If length == -1, assume leading sentinel bits, i.e. n ones where n is the number of bits needed to represent an item of the codemap.
dtype (type) – Target dtype to decode the sequence to (see Note about dtype).
- Returns:
The decoded sequence of type dtype (see Note about dtype).
- Raises:
InvalidDtypeError – If an unrecognized dtype has been passed (see Note about dtype).
- dinopy.conversion.decode_twobit(unsigned long twobit_seq, int length, type dtype=bytes)
Decode a 2bit encoded sequence into a sequence of type dtype (see Note about dtype). Inverse of encode_twobit().
- Parameters:
twobit_seq (unsigned long) – A 2bit representation of a sequence, as produced by encode_twobit().
length (int) – The length of the original (‘decoded’) sequence. If length == -1, assume leading sentinel bits, i.e. n ones where n is the number of bits needed to represent an item of the codemap.
dtype (type) – Target dtype to decode the sequence to (see Note about dtype).
- Returns:
The sequence obtained by applying the inverse of encode() to a bit representation of said sequence.
- Return type:
dtype
- Raises:
InvalidDtypeError – If an unrecognized dtype has been passed.
- dinopy.conversion.encode(seq, codemap, bool sentinel=False) → unsigned long
Translate an iterable sequence into an unsigned long representation using the given codemap.
- Parameters:
seq (object) – The iterable sequence to be encoded.
codemap (dict) – A dictionary specifying a mapping from an item in seq to some bits.
sentinel (bool) – Whether to prepend sentinel bits to the resulting sequence. This is useful when encoding sequences of variable length, but not for sequences with fixed and known lengths, such as q-grams. If set to True, prepend n ones to the resulting sequence, where n is the number of bits needed to represent an item of the codemap. (Default: False)
- Returns:
An integer encoding the given sequence as a compact bit string.
- Return type:
unsigned long
Note
This is a generic encode function and requires a codemap (a dictionary containing a translation for each possible item of the sequence).
Codemaps to translate all dtypes to 2bit and 4bit are available in the dinopy.definitions module. For these translations you can use the specialized functions encode_twobit() and encode_fourbit().
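Why the sentinel matters for variable-length sequences can be shown with a short stand-alone sketch. It is not dinopy's implementation: the bit order (first item in the most significant bits), the hypothetical `bits_per_item` argument, and the codemap entries for G and T are assumptions made for illustration.

```python
def sketch_encode(seq, codemap, bits_per_item, sentinel=False):
    """Pack the items of seq into one integer, first item in the highest bits."""
    # With sentinel=True the value starts with bits_per_item one-bits.
    value = (1 << bits_per_item) - 1 if sentinel else 0
    for item in seq:
        value = (value << bits_per_item) | codemap[item]
    return value


# Hypothetical 2bit codemap for strings (G and T entries are assumed).
codemap = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# Without a sentinel, leading zero-valued items are lost: both encode to 1.
print(sketch_encode("AAC", codemap, 2), sketch_encode("C", codemap, 2))

# With a sentinel, the two sequences stay distinguishable: 193 and 13.
print(sketch_encode("AAC", codemap, 2, sentinel=True),
      sketch_encode("C", codemap, 2, sentinel=True))
```

For q-grams the length q is known in advance, so the ambiguity does not arise and the sentinel bits can be saved.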
- dinopy.conversion.encode_fourbit(seq, bool sentinel=False) → unsigned long
Encode the given sequence using a four bit encoding. Note that all four bit encoded sequences share the common prefix 0b1111 for easier decoding.
- Parameters:
seq (object) – The sequence to be 4bit encoded. Supports any dtype (see: Note about dtype).
sentinel (bool) – Whether to prepend sentinel bits to the resulting sequence. This is useful when encoding sequences of variable length, but not for sequences with fixed and known lengths, such as q-grams. If set to True, prepend n ones to the resulting sequence, where n is the number of bits needed to represent an item of the codemap. (Default: False)
- Returns:
The four bit encoded sequence as a single unsigned long integer.
- Return type:
unsigned long
- dinopy.conversion.encode_twobit(seq, bool sentinel=False) → unsigned long
Encode the given sequence using a two bit encoding. Note that all two bit encoded sequences share the common prefix 0b11 for easier decoding. If the sequence contains items other than 'A', 'C', 'G', 'T', they will be replaced randomly according to the usual IUPAC mapping.
- Parameters:
seq (object) – The sequence to be 2bit encoded. Supports any dtype (see: Note about dtype).
sentinel (bool) – Whether to prepend sentinel bits to the resulting sequence. This is useful when encoding sequences of variable length, but not for sequences with fixed and known lengths, such as q-grams. If set to True, prepend n ones to the resulting sequence, where n is the number of bits needed to represent an item of the codemap. (Default: False)
- Returns:
The two bit encoded sequence as a single unsigned long integer.
- Return type:
unsigned long
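The random replacement of ambiguity codes mentioned above might look roughly like the following sketch. The table holds the standard IUPAC nucleotide codes. The helper name `sketch_resolve_iupac` is hypothetical, and how dinopy actually draws the replacement is not specified here, so this is only one plausible reading.

```python
import random

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def sketch_resolve_iupac(seq, rng=random):
    """Replace every ambiguity code with one randomly chosen compatible base."""
    return "".join(
        rng.choice(IUPAC[c]) if c in IUPAC else c for c in seq.upper()
    )


resolved = sketch_resolve_iupac("ACRN")
print(resolved)  # e.g. 'ACGT'; the last two characters vary between runs
```

Because the replacement is random, encoding the same ambiguous sequence twice can yield different integers.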
- dinopy.conversion.get_inverse_codemap_from_twobit(type dtype) → dict
Return the translation map from a two bit encoding to the respective dtype, e.g. from two bit to string: 0b00 → A, 0b01 → C, etc.
- Parameters:
dtype (type) – Target dtype for the codemap.
- Returns:
Translation dict (codemap) from 2bit encoding to the given dtype.
- Return type:
dict
- Raises:
InvalidDtypeError – If the dtype is not supported / recognized.
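Conceptually, an inverse codemap is the forward codemap with keys and values swapped. The sketch below builds one by hand for strings. The entries for G and T are assumed (the text only gives 0b00 → A and 0b01 → C), and the dictionaries are illustrative rather than the ones shipped in dinopy.definitions.

```python
# Hypothetical forward codemap from string items to 2bit values.
codemap = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# Swap keys and values to obtain the inverse map used for decoding.
inverse_codemap = {bits: base for base, bits in codemap.items()}

print(inverse_codemap)  # {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
```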
- dinopy.conversion.illumina13_to_phred(illumina13_qvs) → list
Translate Illumina 1.3 quality values to PHRED quality scores.
- Parameters:
illumina13_qvs (int buffer) – Illumina 1.3 quality values.
- Returns:
PHRED quality scores.
- Return type:
list
- dinopy.conversion.illumina15_to_phred(illumina15_qvs) → list
Translate Illumina 1.5 quality values to PHRED quality scores.
- Parameters:
illumina15_qvs (int buffer) – Illumina 1.5 quality values.
- Returns:
PHRED quality scores.
- Return type:
list
- dinopy.conversion.illumina18_to_phred(illumina18_qvs) → list
Translate Illumina 1.8 quality values to PHRED quality scores.
- Parameters:
illumina18_qvs (int buffer) – Illumina 1.8 quality values.
- Returns:
PHRED quality scores.
- Return type:
list
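The Illumina conversions above come down to removing an ASCII offset. The sketch below uses the widely documented FASTQ conventions (offset 64 for Illumina 1.3 and 1.5, offset 33 for Illumina 1.8). Whether dinopy applies exactly these offsets, and how it treats special values such as the Illumina 1.5 'B' indicator, is an assumption here. The helper name `sketch_to_phred` is hypothetical.

```python
# Commonly documented ASCII offsets; their use by dinopy is assumed.
ILLUMINA_13_OFFSET = 64  # also commonly used for Illumina 1.5
ILLUMINA_18_OFFSET = 33  # same offset as Sanger


def sketch_to_phred(quality_values, offset):
    """Subtract the ASCII offset from every quality value."""
    return [value - offset for value in quality_values]


print(sketch_to_phred(b"hhB", ILLUMINA_13_OFFSET))  # [40, 40, 2]
print(sketch_to_phred(b"II#", ILLUMINA_18_OFFSET))  # [40, 40, 2]
```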
- dinopy.conversion.phred_to_illumina13(phred_qs) → bytes
Translate PHRED quality scores to Illumina 1.3 quality values.
- Parameters:
phred_qs (int buffer) – PHRED quality scores.
- Returns:
Illumina 1.3 quality values.
- Return type:
bytes
- dinopy.conversion.phred_to_illumina15(phred_qs) → bytes
Translate PHRED quality scores to Illumina 1.5 quality values.
- Parameters:
phred_qs (int buffer) – PHRED quality scores.
- Returns:
Illumina 1.5 quality values.
- Return type:
bytes
- dinopy.conversion.phred_to_illumina18(phred_qs) → bytes
Translate PHRED quality scores to Illumina 1.8 quality values.
- Parameters:
phred_qs (int buffer) – PHRED quality scores.
- Returns:
Illumina 1.8 quality values.
- Return type:
bytes
- dinopy.conversion.phred_to_sanger(phred_qs) → bytes
Translate PHRED quality scores to Sanger quality values.
- Parameters:
phred_qs (int buffer) – PHRED quality scores.
- Returns:
Sanger quality values.
- Return type:
bytes
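Going from PHRED scores back to an encoded quality string is the reverse operation: add the offset and pack the result into bytes. The sketch assumes the usual Sanger offset of 33 and uses a hypothetical helper name. It is not a copy of dinopy's code.

```python
SANGER_OFFSET = 33  # conventional Sanger / Phred+33 offset (assumed)


def sketch_from_phred(phred_scores, offset):
    """Add the ASCII offset to every PHRED score and return bytes."""
    return bytes(score + offset for score in phred_scores)


encoded = sketch_from_phred([40, 30, 2], SANGER_OFFSET)
print(encoded)  # b'I?#'

# Round trip: subtracting the offset restores the original scores.
print([value - SANGER_OFFSET for value in encoded])  # [40, 30, 2]
```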
- dinopy.conversion.phred_to_solexa(phred_qs) → bytes
Translate PHRED quality scores to Solexa quality values.
- Parameters:
phred_qs (int buffer) – PHRED quality scores.
- Returns:
Solexa quality values.
- Return type:
bytes
- dinopy.conversion.sanger_to_phred(sanger_qvs) → list
Translate Sanger quality values to PHRED quality scores.
- Parameters:
sanger_qvs (int buffer) – Sanger quality values.
- Returns:
PHRED quality scores.
- Return type:
list
- dinopy.conversion.solexa_to_phred(solexa_qvs) → list
Translate Solexa quality values to PHRED quality scores.
- Parameters:
solexa_qvs (int buffer) – Solexa quality values.
- Returns:
PHRED quality scores.
- Return type:
list
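Solexa scores differ from PHRED scores in more than the offset, because they are defined on the odds p/(1-p) rather than on the error probability p. The commonly cited conversion is Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The sketch below applies it to already offset-corrected integer Solexa scores. The rounding step, the input form and the helper name are assumptions and may differ from what dinopy does.

```python
import math


def sketch_solexa_to_phred(solexa_scores):
    """Convert integer Solexa scores to rounded PHRED scores."""
    return [
        round(10 * math.log10(10 ** (q / 10) + 1)) for q in solexa_scores
    ]


print(sketch_solexa_to_phred([40, 0, -5]))  # [40, 3, 1]
```

For high qualities the two scales nearly coincide, and they diverge only for low scores, where Solexa values can become negative.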
- dinopy.conversion.string_list_to_basenumbers(list sequence_list, bool suppress_iupac=False) → list
Translate a list of sequences from string to basenumbers.
Can be used to translate whole reads with a single call.
- Parameters:
sequence_list (list of strings) – List of strings of IUPAC-characters (upper or lower case).
suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)
- Returns:
A list of bytes objects, each containing one sequence encoded as basenumbers.
- Return type:
list of bytes
- dinopy.conversion.string_sublists_to_basenumbers(list list_of_lists, bool suppress_iupac=False) → list
Translate a list of lists of sequences from string to basenumbers.
Can be used to translate whole lists of reads.
Warning: this might consume a lot of memory.
- Parameters:
list_of_lists (nested list of strings) – List of lists of strings of IUPAC-characters (upper or lower case).
suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)
- Returns:
A list of lists of bytes objects, each containing one sequence encoded as basenumbers.
- Return type:
list of list of bytes
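The shape of the nested result can be pictured with a plain-Python sketch that converts every string in every sublist. It does not call dinopy. The basenumbers for G and T and the helper name are assumptions, and only plain ACGT input is handled.

```python
# Assumed basenumber table (only 0 -> A and 1 -> C are stated in the text).
BASENUMBERS = {"A": 0, "C": 1, "G": 2, "T": 3}


def sketch_string_to_basenumbers(sequence):
    """Convert one ACGT string (upper or lower case) to basenumber bytes."""
    return bytes(BASENUMBERS[base] for base in sequence.upper())


reads = [["ACGT", "ggca"], ["TTAC"]]

# One bytes object per string, keeping the nesting of the input.
converted = [[sketch_string_to_basenumbers(s) for s in sub] for sub in reads]
print(converted)
```

Since the whole nested list is materialised at once, very large inputs may be better processed one sublist at a time.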
- dinopy.conversion.string_to_basenumbers(unicode sequence, bool suppress_iupac=False) → bytes
Translate a string to a sequence of basenumbers.
- Parameters:
sequence (str) – String of IUPAC-characters (upper or lower case).
suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)
- Returns:
A bytes object containing the sequence encoded as basenumbers.
- Return type:
bytes
- dinopy.conversion.string_to_bytes(unicode sequence, bool suppress_iupac=False) → bytes
Translate a string to a byte sequence.
- Parameters:
sequence (str) – String of IUPAC-characters (upper or lower case).
suppress_iupac (bool) – If this is set, all non-ACGT-characters will be replaced with N. (Default: False)
- Returns:
A bytes object containing the sequence encoded as bytes.
- Return type:
bytes