dinopy.fasta_reader module

class dinopy.fasta_reader.FastaReader(source, write_fai=False)

Reader for any FASTA-like input.

Parameters:
  • source (str, file, stdin, or list) – FASTA input source. Can be a path to a .fa / .fasta or .fa.gz / .fasta.gz file (as str), an open file handle, the stdin stream (sys.stdin), or a list of bytes.

  • write_fai (bool) – If True, a fai (FASTA annotation information) file will be generated and stored at source + '.fai' / source.name + '.fai' automatically. fai files can speed up random access of FASTA file contents. Default: False.

Works with gzip-compressed files, which are identified by the ‘.gz’ suffix.

Supports different iterators for different levels of analysis:

  • entries(): Yield all entries of the file and provide information about length and position in the genome for all entries.

  • genome(): Concatenates and returns all data (i.e. all lines except name-lines) as a single sequence. Also returns an annotation containing chromosome boundaries and lengths.

  • reads(): Interpret the entries of the file as reads and yield them either as FastaRead named tuples or only as sequences.

  • chromosomes(): Interpret the entries of the FASTA file as chromosomes. If only a subset of chromosomes is to be analyzed, those can be specified via the selected_chromosomes parameter. Chromosomes that have not been selected will be skipped.

  • lines(): Yield all lines of the file without interpretation.

Also supports array-like access if a fai file is present:

far = FastaReader("testgenome.fasta")  # or supply write_fai=True to create a fai file implicitly
print(list(far['chromosome_II']))  # prints the entry with name 'chromosome_II'
print(list(far[1]))  # same as the above iff 'chromosome_II' really is the second chromosome in the file

You can also specify a list of items (in mixed mode!): far[[0, 'chromosome_III']] will return the first chromosome in the file and the chromosome named ‘chromosome_III’. Note the double brackets: the outer pair is the indexing operator, the inner pair is the list of names/indices.

Random access using bracket-style indexing is supported as well. Supply a tuple consisting of chromosome name, start index and end index, i.e. far[('chromosome_I', 2, 36)]. Since lists are handled, you can also supply multiple tuples:

sequences = far[[('chromosome_I', 2, 36), ('chromosome_II', 24, 26)]]
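
One way to picture the accepted key shapes is a small dispatcher that turns every key into a list of requests. This is a hypothetical sketch of the semantics described above, not dinopy's actual bracket-access code; describe_key is an invented name.

```python
def describe_key(key):
    """Classify a bracket-style key into a list of (target, start, end) requests."""
    if isinstance(key, list):
        # A list may mix indices, names and tuples; handle each item in turn.
        return [request for item in key for request in describe_key(item)]
    if isinstance(key, tuple):
        # (chromosome_name, start, end) asks for a subsequence.
        name, start, end = key
        return [(name, start, end)]
    if isinstance(key, (int, str, bytes)):
        # None, None stands for "the whole entry" in this sketch.
        return [(key, None, None)]
    raise TypeError(f"unsupported key type: {type(key).__name__}")


requests = describe_key([0, "chromosome_III", ("chromosome_I", 2, 36)])
```

Each resulting request names one entry (by position or by name) and, optionally, a range inside it.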

Note

A FASTA annotation file (fai-file) can speed up the reading of a FASTA file. If a corresponding fai-file with the same filename / path and the suffix .fai is found it is automatically used. If not, such a file is created iff write_fai=True.
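
To see why an index helps, the sketch below builds index records in the samtools faidx layout (name, length, byte offset, bases per line, bytes per line) for an in-memory FASTA and uses them to jump straight to a subsequence. It is an illustration only: that dinopy's fai files use exactly this layout is an assumption, and build_index and fetch are hypothetical helpers, not part of dinopy.

```python
import io


def build_index(data: bytes) -> dict:
    """Build faidx-style records: name -> (length, offset, linebases, linewidth)."""
    index = {}
    name = None
    pos = 0  # byte position where the current line starts
    for line in io.BytesIO(data):
        stripped = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            name = stripped[1:].split()[0].decode()
            # Sequence data starts right after the name line.
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None and stripped:
            rec = index[name]
            if rec[2] == 0:  # the first sequence line fixes the line layout
                rec[2], rec[3] = len(stripped), len(line)
            rec[0] += len(stripped)
        pos += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(data: bytes, index: dict, name: str, start: int, end: int) -> bytes:
    """Return bases [start, end) of one entry without scanning earlier entries."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    out = bytearray()
    pos = start
    while pos < end:
        row, col = divmod(pos, linebases)
        take = min(linebases - col, end - pos)
        byte_pos = offset + row * linewidth + col  # skips the newline bytes
        out += data[byte_pos:byte_pos + take]
        pos += take
    return bytes(out)


fasta = b">chr_I first\nACGTAC\nGTTT\n>chr_II\nGGGCCC\nAA\n"
index = build_index(fasta)
subseq = fetch(fasta, index, "chr_II", 4, 8)
```

With the index in hand, the byte position of any base follows from simple arithmetic, so a reader can seek directly instead of parsing every preceding line.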

Note

If you use the array-like (bracket-style) access, dtype is fixed to bytes. If you wish to use a different dtype for the sequence, use chromosomes() or random_access() with the appropriate dtype instead (or use one of the conversion methods afterwards).

Warning

This feature is still experimental.

chromosomes(self, selected_chromosomes=None, dtype=bytes)

Yield the selected chromosomes from the FASTA file.

This iterator is intended to extract a subset of chromosomes from the genome. If all chromosomes are needed, please use entries() instead.

Parameters:
  • selected_chromosomes (obj) – Names or indices of the chromosomes that should be returned. These can either come as a list of integers, giving the positions of chromosomes that are to be read, or a single integer if only one chromosome should be read. Also accepts chromosome_names (of type str or bytes), or None (to return all chromosomes). (Default: None)

  • dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).

Yields:

named tuple

A named tuple per entry in the FASTA file, each containing:

  1. ‘sequence’: The actual sequence (without linebreaks)

  2. ‘name’: The chromosome’s name

  3. ‘length’: The length of the sequence

  4. ‘interval’: A tuple (start, end) of start- and end-position of the chromosome in the genome (zero based).

Note

All chromosome indexes are zero-based. To get the first and third chromosome of a genome you have to pass selected_chromosomes=[0, 2].
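
The yielded fields and the zero-based selection can be mimicked in plain Python. The sketch below is not dinopy's implementation: Entry and iter_entries are hypothetical names, and treating interval as half-open (end exclusive) is an assumption that the text above does not confirm.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["sequence", "name", "length", "interval"])


def iter_entries(lines):
    """Yield one Entry per FASTA record from an iterable of bytes lines."""
    name, chunks, genome_pos = None, [], 0
    for line in list(lines) + [b">"]:  # the sentinel flushes the last record
        line = line.strip()
        if line.startswith(b">"):
            if name is not None:
                seq = b"".join(chunks)
                # Assumed half-open interval within the concatenated genome.
                yield Entry(seq, name, len(seq), (genome_pos, genome_pos + len(seq)))
                genome_pos += len(seq)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)


lines = [b">chr_I", b"ACGT", b"AC", b">chr_II", b"GG", b">chr_III", b"TTTAA"]
wanted = {0, 2}  # zero-based: first and third chromosome
selected = [entry for i, entry in enumerate(iter_entries(lines)) if i in wanted]
```

Skipped chromosomes still advance the genome position, so the interval of a selected chromosome refers to the full genome rather than to the selected subset.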

entries(self, dtype=bytes)

Iterate over each entry in a FASTA file.

An entry is the combination of a name and a sequence. Typically entries represent chromosomes, but sometimes reads get stored in FASTA format, too. Please note that there are special methods to work with reads and chromosomes (chromosomes() and reads()), which cater to the specific needs of either format and allow for more intuitive code.

Parameters:

dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).

Yields:

named tuple

A named tuple per entry in the FASTA file, each containing:

  1. ‘sequence’: The actual sequence (without linebreaks)

  2. ‘name’: The chromosome’s name

  3. ‘length’: The length of the sequence

  4. ‘interval’: A tuple (start, end) of start- and end-position of the chromosome in the genome (not in the file)

genome(self, dtype=bytes) FastaGenomeC

Read in the complete genome sequence.

This method returns the whole genome in a single data structure. Please be aware that this can consume a large amount of memory.

Parameters:

dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).

Returns:

A named tuple containing the complete genome sequence encoded as the given dtype and a list of namedtuples (one for each chromosome):

  • name: Name line of the chromosome

  • length: Length of the chromosome

  • interval: Begin and end index of the chromosome in the genome.

Return type:

FastaGenome
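
How the per-chromosome interval relates to the concatenated sequence can be sketched as follows. Genome, ChromosomeInfo and build_genome are hypothetical stand-ins rather than dinopy's FastaGenome, and the half-open interval (end exclusive) is an assumption.

```python
from collections import namedtuple

ChromosomeInfo = namedtuple("ChromosomeInfo", ["name", "length", "interval"])
Genome = namedtuple("Genome", ["sequence", "info"])


def build_genome(records):
    """Concatenate (name, sequence) pairs and record where each one lands."""
    parts, info, start = [], [], 0
    for name, seq in records:
        info.append(ChromosomeInfo(name, len(seq), (start, start + len(seq))))
        parts.append(seq)
        start += len(seq)
    return Genome(b"".join(parts), info)


genome = build_genome(
    [(b"chr_I", b"ACGTAC"), (b"chr_II", b"GG"), (b"chr_III", b"TTTAA")]
)
# Slice a single chromosome back out of the concatenated sequence.
begin, end = genome.info[1].interval
chr_two = genome.sequence[begin:end]
```

Keeping the boundaries alongside the sequence lets downstream code work on one flat sequence and still map any position back to its chromosome.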

lines(self, skip_name_lines=False, dtype=bytes)

Iterate over all lines in the given FASTA file.

Parameters:
  • skip_name_lines (boolean) – If True, name lines will not be returned. (Default: False)

  • dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).

Yields:

dtype – A line from the FASTA file. Sequence lines are encoded as dtype. Name lines (if skip_name_lines=False) are always returned as bytes.

Examples

Print the content of the FASTA file:

far = dinopy.FastaReader(filepath)
for line in far.lines():
    print(line)

Print only the sequences from the FASTA file:

far = dinopy.FastaReader(filepath)
for line in far.lines(skip_name_lines=True):
    print(line)

random_access(self, selected_chromosome, int start, int end, dtype=bytes)

Provides random access to FASTA files if a fai file is available.

Parameters:
  • selected_chromosome (str, bytes, int) – either str or bytes giving a chromosome’s name or an int referring to the number of the chromosome in the file.

  • start (int) – start-index in the chromosome (i.e. relative!). Inclusive.

  • end (int) – end-index in the chromosome (i.e. relative!). Exclusive.

  • dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).

Returns:

The subsequence of selected_chromosome from start to end, encoded as the given dtype.

Examples

Access a subsequence of length 34 of ‘chromosome_II’ starting from the third base (start index 2 inclusive, end index 36 exclusive):

far = dinopy.FastaReader(filepath)
far.random_access('chromosome_II', 2, 36)
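
Because start is inclusive and end is exclusive, the result has end - start bases, which is the same convention as Python slicing. A plain-Python check on a made-up sequence (no dinopy involved; the sequence is invented for illustration):

```python
# A made-up 40-base stand-in for a chromosome; not read from any file.
chromosome = b"ACGT" * 10

start, end = 2, 36  # same indices as in the call above
subsequence = chromosome[start:end]

# Inclusive start, exclusive end: the length is simply end - start.
length = len(subsequence)
first_base = subsequence[:1]  # the third base of the chromosome (index 2)
```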

reads(self, read_names=False, dtype=bytes)

Yield all reads in the opened FASTA file as an iterator.

Parameters:
  • read_names (bool, optional) – If True, returns (read, read_name) named tuples. Otherwise only plain reads will be returned.

  • dtype (type) – Desired type for the sequence(s) (see dtype, default: bytes).

Yields:

dtype or named tuple – The sequences of the reads are yielded, encoded according to the given dtype. If read_names is set, a named tuple of sequence (dtype) and name (bytes) is yielded instead.

Examples

Print all sequences of reads in the FASTA file:

far = dinopy.FastaReader(filepath)
for read in far.reads():
    print(read)

Iterate over all reads in the FASTA file. Print the names of the reads and do something magical with the read sequences.

far = dinopy.FastaReader(filepath)
for seq, name in far.reads(read_names=True):
    print(name)
    magic(seq)