dinopy.fasta_reader module
- class dinopy.fasta_reader.FastaReader(source, write_fai=False)
Reader for any FASTA-like input.
- Parameters:
source (str, file, stdin, or list) – FASTA input source. Can be a path to a .fa / .fasta or .fa.gz / .fasta.gz file (as str), an open file handle, the stdin stream (sys.stdin), or a list of bytes.
write_fai (bool) – If True, a fai (FASTA annotation information) file is generated automatically and stored at source + '.fai' (for a path) or source.name + '.fai' (for a file handle). fai files can speed up random access to FASTA file contents. Default: False.
Works with gzip-compressed files, which are identified by the ‘.gz’ suffix.
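A suffix-based check of this kind can be sketched in a few lines. This is only an illustration of the rule, not dinopy's implementation; the helper names `open_fasta` and `roundtrip_demo` are hypothetical.

```python
import gzip
import os
import tempfile


def open_fasta(path):
    """Open a FASTA file for binary reading, handling gzip transparently.

    Hypothetical helper: the compression is chosen purely from the
    '.gz' suffix, mirroring the rule described above.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def roundtrip_demo():
    """Write the same records plain and gzipped, then read both back."""
    records = b">chr1\nACGT\n>chr2\nGGCC\n"
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "demo.fasta")
        zipped = os.path.join(tmp, "demo.fasta.gz")
        with open(plain, "wb") as handle:
            handle.write(records)
        with gzip.open(zipped, "wb") as handle:
            handle.write(records)
        for path in (plain, zipped):
            with open_fasta(path) as handle:
                results.append(handle.read())
    return results
```

Both paths yield the same bytes, so code downstream of the open call does not need to know whether the input was compressed.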
Supports different iterators for different levels of analysis:
entries(): Yields all entries of the file and provides information about length and position in the genome for each entry.
genome(): Concatenates and returns all data (i.e. all lines except name lines) as a single sequence. Also returns an annotation containing chromosome boundaries and lengths.
reads(): Interprets the entries of the file as reads and yields them either as FastaRead named tuples or only as sequences.
chromosomes(): Interprets the entries of the FASTA file as chromosomes. If only a subset of chromosomes is to be analyzed, those can be specified via the selected_chromosomes parameter. Chromosomes that have not been selected are skipped.
lines(): Yields all lines of the file without interpretation.
Also supports array-like access if a fai file is present:
    from dinopy import FastaReader
    far = FastaReader("testgenome.fasta")  # or supply write_fai=True to create a fai file implicitly
    print(list(far['chromosome_II']))  # prints the entry with name 'chromosome_II'
    print(list(far[1]))  # same as the above iff 'chromosome_II' really is the second chromosome in the file
You can even specify a list of items, mixing names and indices:
    far[[0, 'chromosome_III']]
This returns the first chromosome in the file and the chromosome named ‘chromosome_III’. Note the double brackets: the outer pair is the indexing operator, the inner pair is the list of names/indices.
Random access using bracket-style indexing is supported as well. Supply a tuple consisting of chromosome name, start index and end index, e.g.
    far[('chromosome_I', 2, 36)]
Since lists are handled, you can also supply multiple tuples:
    sequences = far[[('chromosome_I', 2, 36), ('chromosome_II', 24, 26)]]
Note
A FASTA annotation file (fai file) can speed up the reading of a FASTA file. If a corresponding fai file with the same filename / path and the suffix .fai is found, it is used automatically. If not, such a file is created iff write_fai=True.
Note
If you use the array-like (bracket-style) access, dtype is fixed to bytes. If you wish to use a different dtype for the sequence, use chromosomes() or random_access() with the appropriate dtype instead (or use one of the conversion methods afterwards).
Warning
This feature is still experimental.
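To make the benefit of an index file concrete, here is a minimal sketch of index-based random access. It assumes the samtools faidx column layout (length, byte offset of the first base, bases per line, bytes per line); whether dinopy's fai files use exactly this layout is not stated here, and the helpers `build_index` and `fetch` are hypothetical.

```python
import io


def build_index(fasta_bytes):
    """Map each record name to (length, offset, linebases, linewidth).

    Assumes all sequence lines of a record, except possibly the last,
    have the same width, as the faidx layout requires.
    """
    index = {}
    name = None
    pos = 0  # byte position of the current line in the file
    for line in io.BytesIO(fasta_bytes):
        stripped = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            name = stripped[1:].split()[0].decode()
            # The first base sits directly after the name line.
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None and stripped:
            entry = index[name]
            if entry[2] == 0:  # first sequence line fixes the line geometry
                entry[2] = len(stripped)
                entry[3] = len(line)
            entry[0] += len(stripped)
        pos += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_bytes, index, name, start, end):
    """Return bases [start, end) of one record by seeking, not scanning."""
    length, offset, linebases, linewidth = index[name]
    start = max(start, 0)
    end = min(end, length)
    if start >= end:
        return b""
    # Translate base coordinates into byte coordinates, skipping newlines.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + ((end - 1) // linebases) * linewidth + (end - 1) % linebases + 1
    handle = io.BytesIO(fasta_bytes)
    handle.seek(first)
    chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"")


FASTA = b">chromosome_I\nACGTAC\nGTTTGA\nCC\n>chromosome_II\nTTGGCC\nAA\n"
INDEX = build_index(FASTA)
```

With the index in hand, `fetch(FASTA, INDEX, "chromosome_II", 5, 8)` only touches the bytes that hold the requested bases, which is why an index pays off on large files.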
- chromosomes(self, selected_chromosomes=None, dtype=bytes)
Yield the selected chromosomes from the FASTA file.
This iterator is intended to extract a subset of chromosomes from the genome. If all chromosomes are needed, please use entries() instead.
- Parameters:
selected_chromosomes (obj) – Names or indices of the chromosomes that should be returned. Accepts a list of integers giving the positions of the chromosomes to read, a single integer if only one chromosome should be read, chromosome names (of type str or bytes), or None (to return all chromosomes). (Default: None)
dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).
- Yields:
named tuple –
A named tuple per entry in the FASTA file, each containing:
‘sequence’: The actual sequence (without linebreaks)
‘name’: The chromosome’s name
‘length’: The length of the sequence
‘interval’: A tuple (start, end) of start and end position of the chromosome in the genome (zero-based).
Note
All chromosome indices are zero-based. To get the first and third chromosome of a genome you have to pass selected_chromosomes=[0,2].
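The selection rules above (None, a single value, or a list mixing zero-based indices and names) can be illustrated with a small pure-Python filter. This is a sketch of the behaviour as documented, not dinopy's code; `select_chromosomes` and the sample records are hypothetical.

```python
def select_chromosomes(records, selected=None):
    """Yield (index, name, sequence) for the selected records.

    'records' is any iterable of (name, sequence) pairs with bytes names.
    'selected' may be None (everything), a single int / str / bytes,
    or a list mixing zero-based indices and names.
    """
    if selected is None:
        wanted = None
    elif isinstance(selected, (int, str, bytes)):
        wanted = [selected]
    else:
        wanted = list(selected)
    if wanted is not None:
        # Normalize str names to bytes so they compare equal to record names.
        wanted = {w.encode() if isinstance(w, str) else w for w in wanted}
    for index, (name, sequence) in enumerate(records):
        if wanted is None or index in wanted or name in wanted:
            yield index, name, sequence


RECORDS = [(b"chrI", b"AA"), (b"chrII", b"CC"), (b"chrIII", b"GG")]
first_and_third = list(select_chromosomes(RECORDS, [0, 2]))
```

Records that are not selected are skipped without being returned, matching the description of chromosomes().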
- entries(self, dtype=bytes)
Iterate over each entry in a FASTA file.
An entry is the combination of a name and a sequence. Typically entries represent chromosomes, but sometimes reads are stored in FASTA format, too. Note that there are dedicated methods for working with chromosomes and reads (chromosomes() and reads()), which cater to the specific needs of either format and allow for more intuitive code.
- Parameters:
dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).
- Yields:
named tuple –
A named tuple per entry in the FASTA file, each containing:
‘sequence’: The actual sequence (without linebreaks)
‘name’: The chromosome’s name
‘length’: The length of the sequence
‘interval’: A tuple (start, end) of start and end position of the chromosome in the genome (not in the file)
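The four fields of each yielded tuple can be illustrated with a small stand-alone parser. This is a sketch, not dinopy's implementation: `Entry`, `iter_entries` and `_make_entry` are hypothetical names, and treating the interval's end as exclusive is an assumption of the sketch.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["sequence", "name", "length", "interval"])


def _make_entry(name, chunks, genome_pos):
    """Join the collected sequence lines and attach the genome interval."""
    sequence = b"".join(chunks)
    # Assumption: 'end' is exclusive, so end - start == length.
    return Entry(sequence, name, len(sequence),
                 (genome_pos, genome_pos + len(sequence)))


def iter_entries(lines):
    """Yield one Entry per FASTA record from an iterable of bytes lines.

    The interval refers to positions in the concatenation of all
    sequences (the "genome"), not to byte positions in the file.
    """
    name, chunks, genome_pos = None, [], 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(b">"):
            if name is not None:
                entry = _make_entry(name, chunks, genome_pos)
                genome_pos = entry.interval[1]
                yield entry
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield _make_entry(name, chunks, genome_pos)


entries = list(iter_entries([b">chr1\n", b"ACGT\n", b"AC\n", b">chr2\n", b"GG\n"]))
```

Line breaks inside a record do not count towards the interval, which is why the second record starts directly where the first one ends.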
- genome(self, dtype=bytes) → FastaGenomeC
Read in the complete genome sequence.
This method returns the whole genome in a single data structure. Please be aware that this can consume a large amount of memory.
- Parameters:
dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).
- Returns:
A named tuple containing the complete genome sequence encoded as the given dtype and a list of namedtuples (one for each chromosome):
name: Name line of the chromosome
length: Length of the chromosome
interval: Begin and end index of the chromosome in the genome.
- Return type:
FastaGenome
- lines(self, skip_name_lines=False, dtype=bytes)
Iterate over all lines in the given FASTA file.
- Parameters:
skip_name_lines (boolean) – If True, name lines will not be returned. (Default: False)
dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).
- Yields:
dtype – A line from the FASTA file. Sequence lines are encoded as dtype. Name lines (if skip_name_lines=False) are always returned as bytes.
Examples
Print the content of the FASTA file:
    import dinopy
    far = dinopy.FastaReader(filepath)  # filepath: path to your FASTA file
    for line in far.lines():
        print(line)
Print only the sequences from the FASTA file:
    import dinopy
    far = dinopy.FastaReader(filepath)
    for line in far.lines(skip_name_lines=True):
        print(line)
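What skipping name lines amounts to can be shown without a file at all. The following is a minimal sketch on an in-memory list of lines; `iter_lines` is a hypothetical stand-in, and stripping the trailing newline is an assumption of the sketch rather than documented dinopy behaviour.

```python
def iter_lines(lines, skip_name_lines=False):
    """Yield FASTA lines without their line ending.

    Name lines are recognised by the leading '>' and are dropped
    when skip_name_lines is True.
    """
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if skip_name_lines and line.startswith(b">"):
            continue
        yield line


SAMPLE = [b">r1\n", b"ACGT\n", b">r2\n", b"GG\n"]
all_lines = list(iter_lines(SAMPLE))
sequence_lines = list(iter_lines(SAMPLE, skip_name_lines=True))
```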
- random_access(self, selected_chromosome, int start, int end, dtype=bytes)
Provides random access to FASTA files if a fai file is available.
- Parameters:
selected_chromosome (str, bytes, int) – Either a str or bytes giving a chromosome’s name, or an int referring to the position of the chromosome in the file.
start (int) – Start index in the chromosome (i.e. relative to the chromosome start). Inclusive.
end (int) – End index in the chromosome (i.e. relative to the chromosome start). Exclusive.
dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).
- Returns:
The subsequence of selected_chromosome from start to end (with the type given by dtype).
Examples
Access a subsequence of length 34 of ‘chromosome_II’ starting from the third base (start 2 is inclusive, end 36 is exclusive):
    import dinopy
    far = dinopy.FastaReader(filepath)
    far.random_access('chromosome_II', 2, 36)
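With an inclusive start and an exclusive end, as documented for the parameters above, a request for (2, 36) covers 36 - 2 = 34 bases. The sketch below mirrors that convention with an ordinary bytes slice; `subsequence` and the sample data are hypothetical and only illustrate the index arithmetic.

```python
def subsequence(sequence, start, end):
    """Return sequence[start:end]: 'start' inclusive, 'end' exclusive."""
    if not 0 <= start <= end <= len(sequence):
        raise IndexError("need 0 <= start <= end <= len(sequence)")
    return sequence[start:end]


chromosome = b"ACGT" * 10  # 40 bases of made-up data
piece = subsequence(chromosome, 2, 36)
piece_length = len(piece)  # equals end - start
```

Index 2 is the third base because counting is zero-based, and the base at index 36 is not part of the result.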
- reads(self, read_names=False, dtype=bytes)
Yield all reads in the opened FASTA file as an iterator.
- Parameters:
read_names (bool, optional) – If True, returns (read, read_name) named tuples. Otherwise only plain reads will be returned.
dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).
- Yields:
dtype or named tuple – The sequences of the reads are yielded, encoded according to the given dtype. If read_names is set, a named tuple of sequence (dtype) and name (bytes) is yielded instead.
Examples
Print all sequences of reads in the FASTA file:
    import dinopy
    far = dinopy.FastaReader(filepath)
    for read in far.reads():
        print(read)
Iterate over all reads in the FASTA file. Print the names of the reads and do something magical with the read sequences.
    import dinopy
    far = dinopy.FastaReader(filepath)
    for seq, name in far.reads(read_names=True):
        print(name)
        magic(seq)  # magic() is a placeholder for your own processing