dinopy.nameline_parser module

This module handles the parsing of name lines. You can either let an instance of the NamelineParser class pick the correct parsing function for you, or choose a specific parsing function yourself.

Let NamelineParser do the work for you:

from dinopy import NamelineParser
parser = NamelineParser()
line1 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is a Casava 1.8+ style nameline
line2 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is yet another Casava 1.8+ style nameline
# line3 = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")  # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`.

Pick a specific parsing function yourself:

from dinopy.nameline_parser import parse_casava_18_line, parse_casava_pre18_line, parse_ncbi_line
line1 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is a Casava 1.8+ style nameline
line2 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is yet another Casava 1.8+ style nameline
# line3 = parse_casava_18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")  # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`
line4 = parse_casava_pre18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")  # this however will work.

Workflow:

from dinopy import FastqReader, NamelineParser
nameline_parser = NamelineParser()
with FastqReader("file.fastq") as fqr:
    for seq, name, _ in fqr.reads():
        nameline = nameline_parser.parse(name)
        print(nameline.tile)
        # do_stuff(nameline)
        # do_other_stuff(seq)

The following nameline-conventions are supported:

Casava < 1.8: @instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member

Casava ≥ 1.8: @unique_instrument_name:run_id:flowcell_id:flowcell_lane:tile:cluster_x:cluster_y pair_member:filtered:control_number:index_sequence

454: 14 character string, encoding plate|region|xy

Helicos: flowcell-channel-field-camera-position

IonTorrent: run_id chip_row chip_column

NCBI-SRA: SRA-id anyoftheabove length=n

Unknown formats: wrapped in a _DummyLine, which only holds a reference to the line as bytes. The line can be accessed via some_dummy_line.line or some_dummy_line[0], or retrieved as a str via str(some_dummy_line).
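The two Casava layouts listed above can be told apart with a pair of regular expressions. The following is a minimal sketch under that assumption, not dinopy's own detection logic: the names CASAVA_PRE18, CASAVA_18 and guess_style are hypothetical, and the patterns are written only from the field lists in this section.

```python
import re

# Hypothetical patterns derived from the field layouts listed above.
# They are an illustration, NOT dinopy's implementation.
CASAVA_PRE18 = re.compile(
    rb"^@(?P<instrument>[^:\s]+):(?P<flowcell_lane>\d+):(?P<tile>\d+)"
    rb":(?P<cluster_x>\d+):(?P<cluster_y>\d+)"
    rb"#(?P<index_sequence>[^/\s]+)/(?P<pair_member>\d+)"
)
CASAVA_18 = re.compile(
    rb"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell_id>[^:\s]+)"
    rb":(?P<lane>\d+):(?P<tile>\d+):(?P<cluster_x>\d+):(?P<cluster_y>\d+)"
    rb" (?P<pair_member>\d+):(?P<filtered>[YN]):(?P<control_number>\d+)"
    rb":(?P<index_sequence>\S+)"
)


def guess_style(line: bytes):
    """Return (style_name, raw_fields); field values stay as bytes."""
    for style, pattern in (("casava_18", CASAVA_18), ("casava_pre18", CASAVA_PRE18)):
        match = pattern.match(line)
        if match:
            return style, match.groupdict()
    # Anything else is reported as unknown, with no fields.
    return "unknown", {}


style, fields = guess_style(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
print(style, fields["tile"])  # casava_pre18 b'73'
```

Trying the stricter Casava ≥ 1.8 pattern first keeps the two styles from being confused; a line that matches neither falls through to "unknown", mirroring the _DummyLine fallback described above.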

class dinopy.nameline_parser.NamelineParser

Used for automagically parsing Casava 1.8+, Casava <1.8, or NCBI-style namelines.

Examples

Parse Casava <1.8 style namelines:

from dinopy import NamelineParser
parser = NamelineParser()
line = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
print(line)  # "instrument: b'HWUSI-EAS100R', flowcell_lane: 6, tile: 73, cluster_x: 941, cluster_y: 1973, index_sequence: b'TODO', pair_member: -1, additional_info: []"
print(line.tile)  # 73

Parse Casava 1.8+ style namelines:

from dinopy import NamelineParser
parser = NamelineParser()
line = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
print(line)  # "instrument: b'NS500639', run: 6, flowcell_id: b'H3MYMAFXX', lane: 1, tile: 11101, cluster_x: 9262, cluster_y: 1124, pair_member: 1, filtered: False, control_number: 0, index_sequence: b'TAATGC', additional_info: []"
print(line.instrument)  # b'NS500639'

Parse NCBI style namelines (defunct):

from dinopy import NamelineParser
parser = NamelineParser()
line = parser.parse(b"@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
print(line)  # TODO

Unknown nameline styles:

from dinopy import NamelineParser
parser = NamelineParser()
line = parser.parse(b"@Something:Entirely Different")
print(line)  # "@Something:Entirely Different"

parse(self, bytes line) → _NameLine

dinopy.nameline_parser.parse_casava_18_line(bytes line) → _NameLine

Split a Casava 1.8+ style Illumina FASTQ header. Documentation p. 50.

Parameters:

line (bytes) – A Casava-style header line.

Returns:

An object containing all information from the Casava line. The fields are:

instrument (bytes)

run (int)

flowcell_id (bytes)

lane (int)

tile (int)

cluster_x (int)

cluster_y (int)

pair_member (int)

filtered (bool)

control_number (int)

index_sequence (bytes)

additional_info (list) – additional information after the Casava line (empty most of the time)

Return type:

_NameLine
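To make the field list above concrete, here is a pure-Python sketch of how a Casava 1.8+ nameline breaks down into those typed fields. It does not use dinopy and is not its implementation: Casava18Fields and split_casava_18 are hypothetical names, and the field names follow the printed output in the NamelineParser example.

```python
from typing import List, NamedTuple


class Casava18Fields(NamedTuple):
    """Hypothetical stand-in for the object dinopy returns."""
    instrument: bytes
    run: int
    flowcell_id: bytes
    lane: int
    tile: int
    cluster_x: int
    cluster_y: int
    pair_member: int
    filtered: bool
    control_number: int
    index_sequence: bytes
    additional_info: List[bytes]


def split_casava_18(line: bytes) -> Casava18Fields:
    """Split a Casava 1.8+ nameline; raises ValueError for other layouts."""
    head, *rest = line.lstrip(b"@").split(b" ")
    if not rest:
        raise ValueError("missing read description after the space")
    # Unpacking raises ValueError if the number of ':'-separated parts is wrong.
    instrument, run, flowcell_id, lane, tile, x, y = head.split(b":")
    pair_member, filtered, control_number, index_sequence = rest[0].split(b":")
    return Casava18Fields(
        instrument, int(run), flowcell_id, int(lane), int(tile), int(x), int(y),
        int(pair_member), filtered == b"Y", int(control_number), index_sequence,
        rest[1:],  # anything after the Casava part, usually empty
    )


parsed = split_casava_18(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
print(parsed.tile, parsed.filtered)  # 11101 False
```

A Casava <1.8 nameline such as b"@HWUSI-EAS100R:6:73:941:1973#0/1" contains no space, so this sketch rejects it with a ValueError, in line with the behaviour the module examples describe for parse_casava_18_line.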

dinopy.nameline_parser.parse_casava_pre18_line(bytes line) → _NameLine

Split a Casava <1.8 style Illumina FASTQ header. Documentation p. 50.