dinopy.nameline_parser module
This module handles the parsing of name lines.
You can either use an instance of the NamelineParser
class to handle picking
the correct parsing function for you or choose a specific parsing function yourself:
Let NamelineParser do the work for you:

    from dinopy import NamelineParser

    parser = NamelineParser()
    # this is a Casava 1.8+ style nameline
    line1 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is yet another Casava 1.8+ style nameline
    line2 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`
    # line3 = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")

Pick a specific parsing function yourself:
    from dinopy.nameline_parser import parse_casava_18_line, parse_casava_pre18_line, parse_ncbi_line

    # this is a Casava 1.8+ style nameline
    line1 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is yet another Casava 1.8+ style nameline
    line2 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`
    # line3 = parse_casava_18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
    # this however will work
    line4 = parse_casava_pre18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")

Workflow:
    from dinopy import FastqReader, NamelineParser

    nameline_parser = NamelineParser()
    with FastqReader("file.fastq") as fqr:
        for seq, name, _ in fqr.reads():
            nameline = nameline_parser.parse(name)
            print(nameline.tile)
            # do_stuff(nameline)
            # do_other_stuff(seq)
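A typical next step in this workflow is to aggregate over one of the parsed fields. The sketch below counts reads per tile. `count_reads_per_tile` is a hypothetical helper, not part of dinopy; it only assumes that `parse` returns an object with a `tile` attribute, as the namelines in the workflow above do.

```python
from collections import Counter


def count_reads_per_tile(names, parse):
    """Count how many namelines fall on each tile.

    names: iterable of bytes namelines.
    parse: callable mapping a nameline to an object with a `tile`
           attribute (assumption: e.g. NamelineParser().parse).
    Hypothetical helper for illustration, not part of dinopy.
    """
    counts = Counter()
    for name in names:
        counts[parse(name).tile] += 1
    return counts
```

With dinopy, the names could come from `(name for _, name, _ in fqr.reads())` and `parse` could be `NamelineParser().parse`. Passing the parse function in as an argument keeps the helper independent of which parsing function is chosen.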
The following nameline-conventions are supported:
Casava < 1.8:
@instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member
Casava ≥ 1.8:
@unique_instrument_name:run_id:flowcell_id:flowcell_lane:tile:cluster_x:cluster_y pair_member:filtered:control_number:index_sequence
454: 14 character string, encoding
plate|region|xy
Helicos:
flowcell-channel-field-camera-position
IonTorrent:
run_id chip_row chip_column
NCBI-SRA:
SRA-id anyoftheabove length=n
Unknown formats: will be wrapped in a _DummyLine, which only holds the line as a bytes reference. The line can either be accessed via some_dummy_line.line or some_dummy_line[0], or retrieved as a str via str(some_dummy_line).
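To make the two Casava layouts listed above concrete, here is a minimal sketch that tells them apart with regular expressions. The patterns and the `guess_style` function are assumptions derived from the field layouts in this list; they are not the expressions dinopy uses internally, and they make no attempt to recognize the 454, Helicos, IonTorrent, or NCBI-SRA conventions.

```python
import re

# Illustrative patterns built from the field layouts listed above
# (assumption: not what dinopy uses internally).
CASAVA_18 = re.compile(
    rb"^@[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:\S*$"
)
CASAVA_PRE18 = re.compile(
    rb"^@[^:\s]+:\d+:\d+:\d+:\d+#[^/\s]+/[12]$"
)


def guess_style(line: bytes) -> str:
    """Return a rough label for the nameline convention of `line`."""
    if CASAVA_18.match(line):
        return "casava>=1.8"
    if CASAVA_PRE18.match(line):
        return "casava<1.8"
    return "unknown"
```

Applied to the example namelines used on this page, the first pattern matches the `@NS500639:...` line and the second matches the `@HWUSI-EAS100R:...` line; anything else falls through to `"unknown"`.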
- class dinopy.nameline_parser.NamelineParser
Automatically detects and parses Casava 1.8+, Casava <1.8, or NCBI-style namelines.
Examples
Parse Casava <1.8 style namelines:
    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
    print(line)
    # "instrument: b'HWUSI-EAS100R', flowcell_lane: 6, tile: 73, cluster_x: 941, cluster_y: 1973, index_sequence: b'TODO', pair_member: -1, additional_info: []"
    print(line.tile)  # 73
Parse Casava 1.8+ style namelines:
    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    print(line)
    # "instrument: b'NS500639', run: 6, flowcell_id: b'H3MYMAFXX', lane: 1, tile: 11101, cluster_x: 9262, cluster_y: 1124, pair_member: 1, filtered: False, control_number: 0, index_sequence: b'TAATGC', additional_info: []"
    print(line.instrument)  # b'NS500639'
Parse NCBI style namelines (defunct):
    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
    print(line)  # TODO
Unknown nameline styles:
    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@Something:Entirely Different")
    print(line)  # "@Something:Entirely Different"
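The module-level example indicates that a parser which has already parsed Casava 1.8+ namelines raises a `ValueError` when handed a Casava <1.8 line. For input that may mix styles, one option is to collect the offending lines instead of aborting. The following is a minimal sketch under that assumption; `parse_all` is a hypothetical wrapper, not part of dinopy, and works with any callable that raises `ValueError` on lines it cannot handle.

```python
def parse_all(names, parse):
    """Parse every nameline, separating the ones that raise ValueError.

    names: iterable of bytes namelines.
    parse: callable such as NamelineParser().parse (assumption).
    Returns (parsed, rejected): parsed results and the raw lines
    that could not be parsed, both in input order.
    Hypothetical convenience wrapper, not part of dinopy.
    """
    parsed, rejected = [], []
    for name in names:
        try:
            parsed.append(parse(name))
        except ValueError:
            rejected.append(name)
    return parsed, rejected
```

The rejected lines can then be inspected or passed to a specific parsing function such as `parse_casava_pre18_line`.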
- parse(self, bytes line) → _NameLine
- dinopy.nameline_parser.parse_casava_18_line(bytes line) → _NameLine
Split a Casava 1.8+ style Illumina FASTQ header. Documentation p. 50.
- Parameters:
line (bytes) – A Casava 1.8+ style header line.
- Returns:
An object containing all information of the Casava line. Its fields are:
instrument (bytes)
run (int)
flowcell_id (bytes)
lane (int)
tile (int)
cluster_x (int)
cluster_y (int)
pair_member (int)
filtered (bool)
control_number (int)
index_sequence (bytes)
additional_info (list) – additional information after the Casava line (empty most of the time)
- Return type:
_NameLine
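The field list above can be illustrated with a pure-Python sketch that splits a Casava 1.8+ nameline into typed fields. `Casava18Fields` and `split_casava_18` are hypothetical names, not dinopy's implementation. The field names follow the printed representation shown in the NamelineParser examples, and trailing additional information is ignored for brevity.

```python
from typing import NamedTuple


class Casava18Fields(NamedTuple):
    """Typed fields of a Casava 1.8+ nameline (illustrative only)."""
    instrument: bytes
    run: int
    flowcell_id: bytes
    lane: int
    tile: int
    cluster_x: int
    cluster_y: int
    pair_member: int
    filtered: bool
    control_number: int
    index_sequence: bytes


def split_casava_18(line: bytes) -> Casava18Fields:
    """Split a Casava 1.8+ nameline; raises ValueError on a wrong field count."""
    # The part before the first space locates the read, the part after holds flags.
    location, _, flags = line.lstrip(b"@").partition(b" ")
    instrument, run, flowcell_id, lane, tile, x, y = location.split(b":")
    pair_member, filtered, control_number, index_sequence = flags.split(b":")
    return Casava18Fields(
        instrument, int(run), flowcell_id, int(lane), int(tile), int(x), int(y),
        int(pair_member), filtered == b"Y", int(control_number), index_sequence,
    )
```

Note that `flowcell_id` stays a bytes value because flowcell identifiers such as `H3MYMAFXX` are not numeric, while `filtered` is derived from the `Y`/`N` flag.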
- dinopy.nameline_parser.parse_casava_pre18_line(bytes line) → _NameLine
Split a Casava <1.8 style Illumina FASTQ header. Documentation p. 50.
- Parameters:
line (bytes) – A Casava <1.8 style header line.
- Returns:
An object containing all information of the Casava line. Its fields are:
instrument (bytes)
flowcell_lane (int)
tile (int)
cluster_x (int)
cluster_y (int)
index_sequence (bytes)
pair_member (int)
additional_info (list) – additional information after the Casava line (empty most of the time)
- Return type:
_NameLine
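As a rough illustration of the Casava <1.8 layout, the sketch below extracts the listed fields with a single regular expression. `CasavaPre18Fields` and `split_casava_pre18` are hypothetical names and the pattern is an assumption based on the format string given earlier on this page; the values dinopy itself reports for a given line may differ.

```python
import re
from typing import NamedTuple


class CasavaPre18Fields(NamedTuple):
    """Typed fields of a Casava <1.8 nameline (illustrative only)."""
    instrument: bytes
    flowcell_lane: int
    tile: int
    cluster_x: int
    cluster_y: int
    index_sequence: bytes
    pair_member: int


# @instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member
_PRE18 = re.compile(rb"^@([^:]+):(\d+):(\d+):(\d+):(\d+)#([^/]+)/(\d)$")


def split_casava_pre18(line: bytes) -> CasavaPre18Fields:
    """Split a Casava <1.8 nameline; raises ValueError if it does not match."""
    match = _PRE18.match(line)
    if match is None:
        raise ValueError(f"not a Casava <1.8 nameline: {line!r}")
    instrument, lane, tile, x, y, index, member = match.groups()
    return CasavaPre18Fields(
        instrument, int(lane), int(tile), int(x), int(y), index, int(member)
    )
```

Raising `ValueError` on a mismatch mirrors the behaviour the module-level examples describe for lines of the wrong style.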
- dinopy.nameline_parser.parse_ncbi_line(bytes line) → _NameLine
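This function is not documented further here. For orientation only, the NCBI-SRA layout listed earlier (`SRA-id anyoftheabove length=n`) could be taken apart as in the sketch below. `split_ncbi` is a hypothetical helper and makes no claim about what `parse_ncbi_line` actually returns.

```python
def split_ncbi(line: bytes):
    """Split an NCBI-SRA style nameline into (sra_id, original_name, length).

    Hypothetical sketch, not dinopy's parse_ncbi_line.
    Raises ValueError if the trailing ' length=n' part is missing.
    """
    sra_id, _, rest = line.lstrip(b"@").partition(b" ")
    # The original nameline may itself contain spaces, so split from the right.
    original_name, sep, length = rest.rpartition(b" length=")
    if not sep:
        raise ValueError(f"missing 'length=' field: {line!r}")
    return sra_id, original_name, int(length)
```

The middle element is the embedded original nameline, which could in turn be handed to one of the convention-specific parsers.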