dinopy.fastq_reader module
- class dinopy.fastq_reader.FastqReader(source)
Reader for FASTQ files.
- Parameters:
source (str, file, stdin, or list) – FASTQ input source; can be a path to a .fastq or .fastq.gz file (as str), an open file handle, the stdin stream (sys.stdin), or a list of bytes.
Works with gzipped files which are identified by the usual ‘.gz’ suffix.
The data type of the sequence can be specified as one of bytes, bytearray, str, or basenumbers. See Note about dtype for more information.
The name and plus lines, as well as the quality values are returned as bytes (without @ or +).
The source from which the FASTQ data is read can be a filepath to a .fastq or .fastq.gz file (gzipped files are identified by the .gz suffix), sys.stdin, or a file/IOBase object.
Supports two iterators for different levels of analysis:
- lines(): Return an iterator over all lines in the FASTQ file. Lines are classified and returned as a named tuple (type, value), where type can be ‘name’, ‘sequence’, ‘plus’, or ‘quality values’. The value is of type bytes for ‘name’, ‘plus’ and ‘quality values’, and of the type described in Note about dtype for ‘sequence’ (Default: bytes).
- reads(): Return an iterator over all reads in the FASTQ file. Reads are returned as a named tuple containing:
  - sequence (dtype): Read sequence (always).
  - name (bytes): Name line (only if read_names=True).
  - quality (bytes): Quality values (only if quality_values=True).
  or just as a sequence, if both the read_names and the quality_values parameters are set to False. By default, both read names and quality values are returned.
Example
Iterate over all reads in the file path/to/fastq_file.fastq and pass all reads that match a certain condition (depending on the read name) to a computation function:
    import dinopy

    fqr = dinopy.FastqReader("path/to/fastq_file.fastq")
    for seq, name, qvs in fqr.reads():
        if some_condition(name):
            computation(seq, qvs)
- guess_quality_format(self, int max_reads=10000) → tuple
Guess which format was used to encode the quality values.
The quality values are defined as described here.
- Parameters:
max_reads (int) – Number of reads after which the scan is terminated. (Default: 10000)
- Raises:
ValueError – for quality values lower than 33 or bigger than 104. These values are not part of any standard quality value format.
- Returns:
A tuple containing the most probable format for the quality values as a string and a list of all possible formats for the observed quality values.
Possible formats are:
‘Sanger’ (ASCII 33-73) PHRED 0-40
‘Solexa, Illumina <=1.2’ (ASCII 59-104) Solexa -5-40
‘Illumina 1.3+’ (ASCII 64-104) PHRED 0-40
‘Illumina 1.5+’ (ASCII 66-104) PHRED 3-40
‘Illumina 1.8+’ (ASCII 33-74) PHRED 0-41
If several formats could explain the observed values, the most conservative guess is returned.
- Return type:
tuple
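The ASCII ranges listed above suggest how such a guess can work: record the smallest and largest quality byte that was observed and keep every format whose range covers both. The following is a minimal sketch of that idea and not dinopy's actual implementation; the names `FORMAT_RANGES` and `possible_formats` are hypothetical, and the ranges are simply copied from the list in this section.

```python
# Illustrative sketch only; ranges are taken from the list above.
FORMAT_RANGES = {
    "Sanger": (33, 73),
    "Solexa, Illumina <=1.2": (59, 104),
    "Illumina 1.3+": (64, 104),
    "Illumina 1.5+": (66, 104),
    "Illumina 1.8+": (33, 74),
}


def possible_formats(quality_lines):
    """Return all formats whose ASCII range covers the observed values.

    quality_lines is an iterable of bytes objects, one per read.
    """
    lowest, highest = 255, 0
    for line in quality_lines:
        if not line:
            continue
        # Iterating over bytes yields integers (ASCII codes).
        lowest = min(lowest, min(line))
        highest = max(highest, max(line))
    if lowest > highest:
        raise ValueError("no quality values observed")
    if lowest < 33 or highest > 104:
        raise ValueError("quality values outside of the range 33..104")
    return [
        name
        for name, (low, high) in FORMAT_RANGES.items()
        if low <= lowest and highest <= high
    ]


print(possible_formats([b"IIIIHHGG", b"5555@@@@"]))
# → ['Sanger', 'Illumina 1.8+']
```

Several formats often remain possible, which is why the method documented here returns both a single most probable format and the list of all candidates. How the single best guess is chosen is not modelled in this sketch.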
- lines(self, type dtype=bytes)
Returns an iterator over all lines in the FASTQ file.
- Parameters:
dtype (type) – Desired type for the sequences. (see dtype; Default: bytes) Note that name, plus and quality values lines are always returned as bytes.
- Yields:
(str, bytes or dtype)-tuples – An iterator over all lines in the FASTQ file. Lines are classified and returned as a named tuple containing type and value of the line, where type can be:
‘name’
‘sequence’
‘plus’
‘quality values’
and value contains the content of the respective line.
- Raises:
FileNotFoundError – If the FASTQ source is a filepath that does not point to a valid input file.
ValueError – If the input source does not specify a valid source type.
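To illustrate the shape of the yielded items, the sketch below classifies the lines of a FASTQ stream by their position within a record. It is a simplified stand-in, not dinopy's parser: it assumes strict four-line records (no wrapped sequences), always returns bytes, and the names `FastqLine` and `classify_lines` are hypothetical.

```python
import io
from collections import namedtuple

# Hypothetical named tuple mirroring the (type, value) pairs described above.
FastqLine = namedtuple("FastqLine", ["type", "value"])
LINE_TYPES = ("name", "sequence", "plus", "quality values")


def classify_lines(handle):
    """Yield a FastqLine for each line of a binary FASTQ stream."""
    for index, raw in enumerate(handle):
        line = raw.rstrip(b"\r\n")
        line_type = LINE_TYPES[index % 4]
        if line_type in ("name", "plus"):
            # Drop the leading '@' or '+' marker.
            line = line[1:]
        yield FastqLine(line_type, line)


data = io.BytesIO(b"@read1\nACGT\n+\nIIII\n")
for line in classify_lines(data):
    print(line.type, line.value)
```

Because each item carries its own type, a caller can filter for, say, only the ‘sequence’ lines without tracking record boundaries.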
- print_quality_tally(self, file=sys.stdout)
Print a crude ASCII histogram of the quality values if there are any.
Note
To use this, the reads method must be called with the quality_tally parameter set to True. Creating a tally by using the lines method is not yet supported.
- Parameters:
file (filelike) – Where the histogram will be printed to. (Default: sys.stdout)
- Raises:
ValueError – If no data for a quality tally has been found. This can happen either because the quality_tally parameter of the reads method was set to False (i.e. no tally was created) or because no file has been parsed yet.
- quality_tally
Return a tally of the quality values, if there is any.
- Returns:
A collections.Counter containing all encountered quality values. If no quality tally has been created, return None.
- Return type:
Counter or None
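As a rough illustration of what a quality tally and a crude ASCII histogram can look like, the following sketch builds a collections.Counter from quality lines and renders one bar per value. It does not reproduce dinopy's output format; the helper names `tally_quality_values` and `histogram_lines` are hypothetical, and keying the Counter by ASCII code is an assumption.

```python
from collections import Counter


def tally_quality_values(quality_lines):
    """Count every quality byte in an iterable of bytes objects."""
    tally = Counter()
    for line in quality_lines:
        # Iterating over bytes yields integers, so keys are ASCII codes.
        tally.update(line)
    return tally


def histogram_lines(tally, width=40):
    """Return one text line per quality value, scaled to `width` columns."""
    if not tally:
        raise ValueError("no quality tally available")
    largest = max(tally.values())
    lines = []
    for value in sorted(tally):
        bar = "#" * max(1, round(width * tally[value] / largest))
        lines.append(f"{value:3d} {chr(value)} {bar} {tally[value]}")
    return lines


tally = tally_quality_values([b"IIIIHHGG", b"IIII5555"])
print("\n".join(histogram_lines(tally)))
```

Raising ValueError for an empty tally mirrors the behaviour documented for print_quality_tally when no tally data exists.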
- reads(self, bool read_names=True, bool quality_values=True, bool quality_tally=False, type dtype=bytes)
Returns an iterator over all reads in the file.
- Parameters:
read_names (bint) – Return the name line with each read. (Default: True)
quality_values (bint) – Return the quality values with each read. (Default: True)
quality_tally (bint) – Tally up quality values of the whole file. (Default: False)
dtype (type) – Desired type for the sequences. (see dtype; Default: bytes) Note that name, plus and quality values lines are always returned as bytes.
- Yields:
named tuple or dtype – An iterator over all reads in the FASTQ file. The type of the items depends on the input parameters. If both read_names and quality_values are set to False, only the sequence is yielded, encoded with the given dtype. Otherwise a named tuple is yielded that contains the following entries:

| read_names | quality_values | return type |
| --- | --- | --- |
| False | False | sequence |
| True | False | NamedTuple(sequence, name) |
| False | True | NamedTuple(sequence, quality) |
| True | True | NamedTuple(sequence, name, quality) |

The type of sequence depends on the dtype parameter; name and quality are always returned as bytes.
- Raises:
FileNotFoundError – If the source is an invalid filepath.
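The dependence of the item type on read_names and quality_values can be sketched in plain Python. This is an illustration of the documented return shapes only, not dinopy's implementation: it assumes strict four-line records, always yields bytes, and `iter_reads` is a hypothetical name.

```python
import io
from collections import namedtuple


def iter_reads(handle, read_names=True, quality_values=True):
    """Yield reads from a binary FASTQ stream in the shapes described above."""
    fields = ["sequence"]
    if read_names:
        fields.append("name")
    if quality_values:
        fields.append("quality")
    # With both flags off, bare sequences are yielded instead of tuples.
    Read = namedtuple("Read", fields) if len(fields) > 1 else None

    while True:
        record = [handle.readline().rstrip(b"\r\n") for _ in range(4)]
        if not record[0]:
            break  # end of input
        name, sequence, _plus, quality = record
        if Read is None:
            yield sequence
            continue
        values = {"sequence": sequence, "name": name[1:], "quality": quality}
        yield Read(*(values[field] for field in fields))


data = b"@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n"
for read in iter_reads(io.BytesIO(data)):
    print(read.name, read.sequence, read.quality)
print(list(iter_reads(io.BytesIO(data), read_names=False, quality_values=False)))
```

Since the tuple layout changes with the flags, unpacking such as `for seq, name, qvs in ...` only works when both flags are True; accessing fields by name is more robust when the flags vary.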