dinopy.fasta_writer module

class dinopy.fasta_writer.FastaWriter(target, write_fai=False, force_overwrite=False, append=False)

FastaWriter for writing genomes (or reads) to disk in FASTA format.

Parameters:
  • target (str, bytes, file or sys.stdout) – Path, open file, or sys.stdout that the output is written to. If a path ends with the suffix .gz, a gzipped file will be created.

  • force_overwrite (bool) – If set to True overwrites existing FASTA files with the same name. (Default: False)

  • append (bool) – If set to True, existing files will not be overwritten. Reads will be appended to the end of the file. (Default: False)

  • write_fai (bool, bytes or string) – If write_fai denotes a path (i.e. is an instance of str or bytes), the FASTA annotation information file is written to that path. If write_fai is True, annotation information is written to filepath + '.fai'. Note that force_overwrite and append apply to fai files as well. (Default: False)

  • line_width (int) – The maximum number of characters per line, excluding newlines. (Default: 80)
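The write_fai rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not dinopy's actual code; the helper name resolve_fai_path is hypothetical and exists only for illustration.

```python
def resolve_fai_path(target_path, write_fai):
    """Hypothetical helper: return the fai path implied by write_fai, or None."""
    # A str or bytes value is taken as an explicit path.
    if isinstance(write_fai, (str, bytes)):
        return write_fai
    # True means: derive the path from the FASTA path by appending '.fai'.
    if write_fai is True:
        suffix = b".fai" if isinstance(target_path, bytes) else ".fai"
        return target_path + suffix
    # False (the default): no fai file is written.
    return None


print(resolve_fai_path("genome.fasta", True))
print(resolve_fai_path("genome.fasta", "index/genome.fai"))
```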

Raises:
  • ValueError – If the filename is invalid.

  • ValueError – If contradicting parameters are passed (force_overwrite=True and append=True).

  • TypeError – If target is neither a file, nor a path nor stdout.

  • IOError – If target is a file opened in the wrong mode.

  • FileExistsError – If the target file (FASTA or fai) already exists and neither force_overwrite nor append has been specified.
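How force_overwrite and append interact with an existing file can be sketched as a pre-flight check. This is an illustration of the documented error conditions only, assuming a path target; check_target is a hypothetical name, and the returned file modes are an assumption rather than dinopy's internals.

```python
import os


def check_target(path, force_overwrite=False, append=False):
    """Hypothetical check mirroring the documented ValueError / FileExistsError cases."""
    # The two flags contradict each other.
    if force_overwrite and append:
        raise ValueError("force_overwrite and append are mutually exclusive")
    # An existing file needs an explicit decision: overwrite it or append to it.
    if os.path.exists(path) and not (force_overwrite or append):
        raise FileExistsError(path)
    # Assumed mapping to file modes: append -> 'a', otherwise -> 'w'.
    return "a" if append else "w"
```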

Methods intended for public use are:

  • write_genome(): Write a whole genome to file.

  • write_entries(): Writes a list of tuples containing entry names and entry sequences. The tuples have to be in the format: (entry_sequence, entry_name)

    An entry can be a chromosome or a read.

  • write_entry(): Write a single entry to the opened file.

  • write_chromosomes(): Writes a list of chromosomes to file. Each chromosome must consist of a tuple containing: (chromosome_sequence, chromosome_name)

  • write_chromosome(): Writes a single chromosome to file.

Examples

Write a genome of three chromosomes from a single sequence:

import dinopy

seq = b"ACGTAACCGGTTAAACCCGGGTTT"
chr_info = [
    (b"single", 4, (0,4)),
    (b"double", 8, (4,12)),
    (b"triple", 12, (12,24)),
]
with dinopy.FastaWriter('somefile.fasta') as faw:
    faw.write_genome(seq, chr_info)

Write a genome of three chromosomes from separate chromosomes:

chromosomes = [
    ('ACGTACGT', b'chr1'),
    ('GCGTAGGATGGGCCTATCGA', b'chr2'),
    ('CCATAGGATAGACCANNACAGATCAN', b'chr3'),
]
with dinopy.FastaWriter('somefile.fasta') as faw:
    faw.write_chromosomes(chromosomes, dtype=str)

close(self)

Close the file (after writing).

Note

This should only be used if the exact number of files is not known at development time. Otherwise, use the with statement (as in the examples above), as it makes it much harder to ‘forget’ to close an opened file.

write_chromosome(self, chromosome, type dtype=bytes)

Write a single chromosome to the opened FASTA file.

Note: Alias for write_entry().

Parameters:
  • chromosome (tuple) – Containing chromosome sequence (as dtype) and chromosome name (bytes).

  • dtype (type) – Type of the sequence. (See dtype; Default: bytes)

write_chromosomes(self, chromosomes, type dtype=bytes)

Write chromosomes to the specified filepath.

Note: Alias for write_entries(entries, dtype).

Parameters:
  • chromosomes (iterable) – Iterable of (seq, name) tuples, where seq is the sequence of the chromosome (as dtype) and name is the chromosome name as bytes.

  • dtype (type) – Type of the sequence. (See dtype; Default: bytes)

Raises:

IOError – If no output FASTA file has been opened.

Note

This method is used to write a list of separate chromosomes to file. To split up a long sequence into chromosomes, please use write_genome(genome, chromosome_info, ...), where chromosome_info is a list of tuples that each contain the name, the length and the (start, stop) interval of a chromosome, or just a name (as str/bytes) if the organism has only one chromosome.

write_entries(self, entries, type dtype=bytes)

Write entries to the specified filepath.

Parameters:
  • entries (iterable) – Iterable of (seq, name) tuples, where seq is the sequence of the entry (as dtype) and name is the entry’s name as bytes.

  • dtype (type) – Type of the sequence. (See dtype; Default: bytes)

Raises:

IOError – If no output FASTA file has been opened.

Note

This method is used to write a list of separate entries to file. To split up a long sequence into chromosomes, please use write_genome(genome, chromosome_info, ...), where chromosome_info is a list of tuples that each contain the name, the length and the (start, stop) interval of a chromosome, or just a name (as str/bytes) if the organism has only one chromosome.

write_entry(self, entry, type dtype=bytes)

Write a single entry to the opened FASTA file.

Parameters:
  • entry (tuple) – Containing entry sequence (as dtype) and entry name (bytes).

  • dtype (type) – Type of the sequence. (See dtype; Default: bytes)

Raises:

IOError – If no output FASTA file has been opened.
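To make the (sequence, name) tuple layout and the line_width parameter concrete, here is a minimal sketch of how one entry could be rendered as FASTA text. It assumes bytes sequences and the conventional '>' header line; format_fasta_entry is a hypothetical helper and not part of dinopy.

```python
def format_fasta_entry(sequence, name, line_width=80):
    """Hypothetical helper: render one (sequence, name) entry as FASTA bytes."""
    # Header line: '>' followed by the entry name.
    lines = [b">" + name]
    # Sequence lines: at most line_width characters each, newline excluded.
    for start in range(0, len(sequence), line_width):
        lines.append(sequence[start:start + line_width])
    return b"\n".join(lines) + b"\n"


entry = (b"ACGTACGTAC", b"chr1")  # (sequence, name), as expected by write_entry()
print(format_fasta_entry(*entry, line_width=4).decode())
```

With line_width=4 the ten-base sequence is split into the lines ACGT, ACGT and AC below the >chr1 header.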

write_genome(self, genome, chromosome_info=None, type dtype=bytes)

Write a genome to the specified filepath.

Parameters:
  • genome (dtype) – Genome sequence to be written to file as a single iterable of dtype.

  • chromosome_info (list of tuples, str or bytes) – Chromosome names and borders, one tuple per chromosome, in the format: (chr_name[str], length[int], chr_interval[tuple of two ints]), or a single (byte)string. If a single (byte)string is encountered, it will be used as the genome name and the whole genome sequence will be written as a “single chromosome”.

  • dtype (type) – Type of the sequence. (See dtype; Default: bytes)

Raises:

IOError – If no output FASTA file has been opened.

Note

The genome is split according to the given chromosome info. If the sequence is already split up into chromosomes, please use write_chromosome / write_entry, which do not need chromosome info to be specified.

If chromosome_info is a string or bytes, the genome is treated as a single chromosome with the string as name. If multiple chromosomes are to be written chromosome_info has to be a list of tuples in the format: (chr_name[str], length[int], chr_interval[tuple of two ints])
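The relationship between chromosome_info and the per-chromosome (sequence, name) tuples can be sketched as follows. This is an illustration only: split_genome is a hypothetical helper, and it assumes that chr_interval is a half-open (start, stop) range, which matches the intervals and lengths used in the example at the top of this page.

```python
def split_genome(genome, chromosome_info):
    """Hypothetical helper: turn a genome plus chromosome_info into (sequence, name) tuples."""
    # A single str/bytes name: the whole genome is one chromosome.
    if isinstance(chromosome_info, (str, bytes)):
        return [(genome, chromosome_info)]
    # Otherwise: one (name, length, (start, stop)) tuple per chromosome.
    return [
        (genome[start:stop], name)
        for name, _length, (start, stop) in chromosome_info
    ]


seq = b"ACGTAACCGGTTAAACCCGGGTTT"
chr_info = [
    (b"single", 4, (0, 4)),
    (b"double", 8, (4, 12)),
    (b"triple", 12, (12, 24)),
]
for chromosome, name in split_genome(seq, chr_info):
    print(name, chromosome)
```

The resulting list has the shape that write_chromosomes() and write_entries() expect.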