dinopy.fastq_reader module

class dinopy.fastq_reader.FastqReader(source)

Reader for FASTQ files.

Parameters:

source (str, file, stdin, or list) – FASTQ input source. Can be a path to a .fastq or .fastq.gz file (as str), an open file handle, the stdin stream (sys.stdin), or a list of bytes.

Works with gzipped files which are identified by the usual ‘.gz’ suffix.

The data type of the sequence can be specified as one of bytes, bytearray, str, or basenumbers. See Note about dtype for more information.

The name and plus lines, as well as the quality values are returned as bytes (without @ or +).

The source from which the FASTQ data is read can be a filepath to a .fastq or .fastq.gz file (gzipped files are identified by the .gz suffix), sys.stdin, or a file/IOBase object.

Supports two iterators for different levels of analysis:

  • lines(): Return an iterator over all lines in the FASTQ file. Lines are classified and returned as a named tuple (type, value), where type is one of ‘name’, ‘sequence’, ‘plus’, or ‘quality values’. The value is of type bytes for ‘name’, ‘plus’, and ‘quality values’, and of the requested dtype for ‘sequence’ (see Note about dtype; Default: bytes).

  • reads(): Return an iterator over all reads in the FASTQ file. Reads are returned as a named tuple containing:

    • sequence (dtype): Read sequence (always).

    • name (bytes): Name line (only if read_names=True).

    • quality (bytes): Quality values (only if quality_values=True).

    or just as a sequence, if both the read_names and the quality_values parameters are set to False. By default, both read names and quality values are returned.

Example

Iterate over all reads in the file path/to/fastq_file.fastq and pass all reads that match a certain condition (depending on the read name) to a computation function:

import dinopy

# some_condition and computation stand for user-defined functions.
fqr = dinopy.FastqReader("path/to/fastq_file.fastq")
for seq, name, qvs in fqr.reads():
    if some_condition(name):
        computation(seq, qvs)

guess_quality_format(self, int max_reads=10000) → tuple

Guess which format was used to encode the quality values.

The quality values are defined as described here.

Parameters:

max_reads (int) – Number of reads after which the scan is terminated. (Default: 10000)

Raises:

ValueError – For quality values lower than 33 or greater than 104. These values are not part of any standard quality value format.

Returns:

A tuple containing the most probable format for the quality values as a string and a list of all possible formats for the observed quality values.

Possible formats are:

  • ‘Sanger’ (ASCII 33-73) PHRED 0-40

  • ‘Solexa, Illumina <=1.2’ (ASCII 59-104) Solexa -5-40

  • ‘Illumina 1.3+’ (ASCII 64-104) PHRED 0-40

  • ‘Illumina 1.5+’ (ASCII 66-104) PHRED 3-40

  • ‘Illumina 1.8+’ (ASCII 33-74) PHRED 0-41

If several formats could explain the observed values, the most conservative guess is returned.

Return type:

tuple
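
The range check behind this kind of guess can be sketched in plain Python. The following is an illustration only, not dinopy's implementation: the ASCII ranges are copied from the list above, the helper name compatible_formats is hypothetical, and the sketch only lists compatible formats without choosing a most probable one.

```python
# Illustrative sketch only; this is NOT dinopy's implementation.
# ASCII ranges are copied from the list of possible formats above.
FORMAT_RANGES = {
    "Sanger": (33, 73),
    "Solexa, Illumina <=1.2": (59, 104),
    "Illumina 1.3+": (64, 104),
    "Illumina 1.5+": (66, 104),
    "Illumina 1.8+": (33, 74),
}


def compatible_formats(quality_lines):
    """Return the names of all formats whose ASCII range covers the input.

    quality_lines is an iterable of bytes objects, one per read.
    (Hypothetical helper, named for this example only.)
    """
    # Iterating over a bytes object yields integer ASCII codes.
    observed = [value for line in quality_lines for value in line]
    if not observed:
        return []
    lowest, highest = min(observed), max(observed)
    if lowest < 33 or highest > 104:
        raise ValueError(f"quality values outside 33-104: {lowest}-{highest}")
    return [
        name
        for name, (low, high) in FORMAT_RANGES.items()
        if low <= lowest and highest <= high
    ]


print(compatible_formats([b"!!5I"]))
print(compatible_formats([b"@Ihh"]))
```

Several formats usually remain compatible when only a few reads are inspected, which is why a tie-breaking rule such as the conservative guess described above is needed.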

lines(self, type dtype=bytes)

Returns an iterator over all lines in the FASTQ file.

Parameters:

dtype (type) – Desired type for the sequences (see dtype; Default: bytes). Note that name, plus, and quality value lines are always returned as bytes.

Yields:

(str, bytes or dtype)-tuples – An iterator over all lines in the FASTQ file. Lines are classified and returned as a named tuple containing type and value of the line, where type can be:

  1. ‘name’

  2. ‘sequence’

  3. ‘plus’

  4. ‘quality values’

and value contains the content of the respective line.

Raises:
  • FileNotFoundError – If the FASTQ source is a filepath that does not point to a valid input file.

  • ValueError – If the input source does not specify a valid source type.
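
As a rough illustration of consuming the (type, value) pairs described above, the sketch below counts line types and sums sequence lengths. The helper summarize_lines is hypothetical, and the input is a hand-written stand-in for what lines() is documented to yield, so the example runs without dinopy or a FASTQ file.

```python
from collections import Counter


def summarize_lines(lines):
    """Count line types and sum sequence lengths from (type, value) pairs.

    Hypothetical helper for illustration; not part of dinopy.
    """
    type_counts = Counter()
    total_bases = 0
    for line_type, value in lines:
        type_counts[line_type] += 1
        if line_type == "sequence":
            total_bases += len(value)
    return type_counts, total_bases


# Hand-written stand-in for the items lines() is documented to yield.
example_lines = [
    ("name", b"read1"),
    ("sequence", b"ACGT"),
    ("plus", b""),
    ("quality values", b"IIII"),
]
counts, bases = summarize_lines(example_lines)
print(counts["sequence"], bases)

# With dinopy installed, the same helper could be fed directly (not run here):
#     import dinopy
#     fqr = dinopy.FastqReader("path/to/fastq_file.fastq")
#     counts, bases = summarize_lines(fqr.lines())
```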

print_quality_tally(self, file=sys.stdout)

Print a crude ASCII histogram of the quality values if there are any.

Note

To use this, the reads method must be called with the quality_tally parameter set to True. Creating a tally by using the lines method is not yet supported.

Parameters:

file (filelike) – Where the histogram will be printed to. (Default: sys.stdout)

Raises:
  • ValueError – If no data for a quality tally has been found. This happens either because the quality_tally parameter of the reads method was set to False (i.e. no tally was created) or because no file has been parsed yet.

quality_tally

Return a tally of the quality values, if there is one.

Returns:

A collections.Counter containing all encountered quality values. If no quality tally has been created, return None.

Return type:

Counter or None
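
To make the idea of a quality tally and its crude histogram concrete, here is a minimal standalone sketch. It is not dinopy's code: the function names are hypothetical, and keying the Counter by integer ASCII code is an assumption, since the text above does not state the key type.

```python
import sys
from collections import Counter


def tally_quality_values(quality_lines):
    """Count every quality character over an iterable of bytes lines.

    Assumption: keys are integer ASCII codes (iterating bytes yields ints).
    """
    tally = Counter()
    for line in quality_lines:
        tally.update(line)
    return tally


def print_histogram(tally, file=sys.stdout, width=40):
    """Print one bar per observed quality value, scaled to `width` columns."""
    if not tally:
        raise ValueError("no quality tally available")
    largest = max(tally.values())
    for value in sorted(tally):
        bar = "#" * max(1, round(width * tally[value] / largest))
        print(f"{value:3d} {chr(value)} {bar} {tally[value]}", file=file)


tally = tally_quality_values([b"II5!", b"I5"])
print_histogram(tally)
```

Passing a different file object, for example sys.stderr or an open text file, redirects the histogram in the same way the file parameter above is described to do.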

reads(self, bool read_names=True, bool quality_values=True, bool quality_tally=False, type dtype=bytes)

Returns an iterator over all reads in the file.

Parameters:
  • read_names (bint) – Return the name line with each read. (Default: True)

  • quality_values (bint) – Return the quality values with each read. (Default: True)

  • quality_tally (bint) – Tally up quality values of the whole file. (Default: False)

  • dtype (type) – Desired type for the sequences (see dtype; Default: bytes). Note that name, plus, and quality value lines are always returned as bytes.

Yields:

named tuple or dtype – An iterator over all reads in the FASTQ file. The type of the items depends on the input parameters. If both read_names and quality_values are set to False, only the sequence is yielded, encoded with the given dtype. Otherwise a named tuple is yielded that contains the following entries:

read_names   quality_values   return type
False        False            sequence
True         False            NamedTuple(sequence, name)
False        True             NamedTuple(sequence, quality)
True         True             NamedTuple(sequence, name, quality)

The type of sequence depends on the dtype parameter; name and quality are always returned as bytes.

Raises:

FileNotFoundError – If the source is an invalid filepath.
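
The way the two flags select the shape of each yielded item can be mimicked with a small standalone generator. This is a sketch under stated assumptions, not dinopy's parser: iter_reads and the tuple types are hypothetical names, every read is assumed to span exactly four lines, and sequences are always returned as bytes.

```python
from collections import namedtuple

# Hypothetical tuple types mirroring the documented return shapes.
ReadNameQuality = namedtuple("Read", ["sequence", "name", "quality"])
ReadName = namedtuple("Read", ["sequence", "name"])
ReadQuality = namedtuple("Read", ["sequence", "quality"])


def iter_reads(lines, read_names=True, quality_values=True):
    """Yield reads from FASTQ lines given as bytes, four lines per read.

    Assumption: no multi-line sequences; trailing incomplete reads are ignored.
    """
    stripped = [line.rstrip(b"\r\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        name, sequence, _plus, quality = stripped[i:i + 4]
        name = name[1:]  # drop the leading '@'
        if read_names and quality_values:
            yield ReadNameQuality(sequence, name, quality)
        elif read_names:
            yield ReadName(sequence, name)
        elif quality_values:
            yield ReadQuality(sequence, quality)
        else:
            yield sequence  # bare sequence, no tuple


fastq = [b"@r1\n", b"ACGT\n", b"+\n", b"IIII\n"]
for read in iter_reads(fastq):
    print(read.name, read.sequence, read.quality)
for sequence in iter_reads(fastq, read_names=False, quality_values=False):
    print(sequence)
```

Because the item shape changes with the flags, loops that unpack three values, as in the class-level example, only work with the default settings.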