dinopy.nameline_parser module

This module parses FASTQ name lines. Either let an instance of the NamelineParser class pick the correct parsing function for you, or call a specific parsing function yourself:

  1. Let NamelineParser do the work for you:

    from dinopy import NamelineParser
    parser = NamelineParser()
    line1 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is a Casava 1.8+ style nameline
    line2 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is yet another Casava 1.8+ style nameline
    # line3 = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")  # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`.
    
  2. Pick a specific parsing function yourself:

    from dinopy.nameline_parser import parse_casava_18_line, parse_casava_pre18_line, parse_ncbi_line
    line1 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is a Casava 1.8+ style nameline
    line2 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")  # this is yet another Casava 1.8+ style nameline
    # line3 = parse_casava_18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")  # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`
    line4 = parse_casava_pre18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")  # this however will work.
    
  3. Workflow:

    from dinopy import FastqReader, NamelineParser
    nameline_parser = NamelineParser()
    with FastqReader("file.fastq") as fqr:
        for seq, name, _ in fqr.reads():
            nameline = nameline_parser.parse(name)
            print(nameline.tile)
            # do_stuff(nameline)
            # do_other_stuff(seq)
    

The following nameline conventions are supported:

  1. Casava < 1.8: @instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member

  2. Casava ≥ 1.8: @unique_instrument_name:run_id:flowcell_id:flowcell_lane:tile:cluster_x:cluster_y pair_member:filtered:control_number:index_sequence

  3. 454: 14 character string, encoding plate|region|xy

  4. Helicos: flowcell-channel-field-camera-position

  5. IonTorrent: run_id chip_row chip_column

  6. NCBI-SRA: SRA-id anyoftheabove length=n

  7. Unknown formats: wrapped in a _DummyLine, which only holds the line as a bytes reference. The line can be accessed via some_dummy_line.line or some_dummy_line[0], or retrieved as a str via str(some_dummy_line).
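
As a rough illustration of how the two Casava conventions listed above differ, the sketch below tells them apart with regular expressions. The patterns and the guess_convention helper are hypothetical and are not how dinopy picks a parser; every other convention is simply reported as unknown.

```python
import re

# Illustrative patterns only; these are assumptions, not dinopy's internals.
CASAVA_18 = re.compile(
    rb"^@[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:\S*"
)
CASAVA_PRE18 = re.compile(rb"^@[^:\s]+:\d+:\d+:\d+:\d+#[^/\s]+/[12]")


def guess_convention(line: bytes) -> str:
    """Return a label for the convention a header line appears to follow."""
    if CASAVA_18.match(line):
        return "casava_18"
    if CASAVA_PRE18.match(line):
        return "casava_pre18"
    return "unknown"
```

The Casava 1.8+ pattern is tried first because it is the more specific of the two: it requires seven colon-separated fields followed by a space and the read information.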

class dinopy.nameline_parser.NamelineParser

Automatically parses Casava 1.8+, Casava <1.8, or NCBI-style namelines.

Examples

  1. Parse Casava <1.8 style namelines:

    from dinopy import NamelineParser
    parser = NamelineParser()
    line = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
    print(line)  # "instrument: b'HWUSI-EAS100R', flowcell_lane: 6, tile: 73, cluster_x: 941, cluster_y: 1973, index_sequence: b'TODO', pair_member: -1, additional_info: []"
    print(line.tile)  # 73
    
  2. Parse Casava 1.8+ style namelines:

    from dinopy import NamelineParser
    parser = NamelineParser()
    line = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    print(line)  # "instrument: b'NS500639', run: 6, flowcell_id: b'H3MYMAFXX', lane: 1, tile: 11101, cluster_x: 9262, cluster_y: 1124, pair_member: 1, filtered: False, control_number: 0, index_sequence: b'TAATGC', additional_info: []"
    print(line.instrument)  # b'NS500639'
    
  3. Parse NCBI style namelines (defunct):

    from dinopy import NamelineParser
    parser = NamelineParser()
    line = parser.parse(b"@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
    print(line)  # TODO
    
  4. Unknown nameline styles:

    from dinopy import NamelineParser
    parser = NamelineParser()
    line = parser.parse(b"@Something:Entirely Different")
    print(line)  # "@Something:Entirely Different"
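
The fallback for unknown styles can be pictured with a small stand-in class. DummyLineSketch is a hypothetical name; this is a minimal sketch of the three access paths described for _DummyLine, assuming ASCII-compatible headers, and is not dinopy's own class.

```python
class DummyLineSketch:  # hypothetical stand-in, not dinopy's _DummyLine
    """Hold an unparsed nameline as bytes and expose it in three ways."""

    def __init__(self, line: bytes):
        self.line = line  # attribute access: wrapper.line

    def __getitem__(self, index: int) -> bytes:
        # Index access: only position 0 exists, since one item is stored.
        if index != 0:
            raise IndexError("a dummy line holds exactly one item")
        return self.line

    def __str__(self) -> str:
        # str() access: decode the raw bytes for display.
        return self.line.decode("ascii", errors="replace")


wrapper = DummyLineSketch(b"@Something:Entirely Different")
print(wrapper.line == wrapper[0])
print(str(wrapper))
```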
    
parse(self, bytes line) → _NameLine

dinopy.nameline_parser.parse_casava_18_line(bytes line) → _NameLine

Split a Casava 1.8+ style Illumina FASTQ header. Documentation p. 50.

Parameters:

line (bytes) – A Casava 1.8+ style header line.

Returns:

An object containing all information of the Casava line. Its fields are:

  • instrument (bytes)

  • run (int)

  • flowcell_id (bytes)

  • lane (int)

  • tile (int)

  • cluster_x (int)

  • cluster_y (int)

  • pair_member (int)

  • filtered (bool)

  • control_number (int)

  • index_sequence (bytes)

  • additional_info (list) – additional information after the Casava line (empty most of the time)

Return type:

_NameLine
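
To make the field list concrete, here is a minimal pure-Python sketch that splits a Casava 1.8+ header into the typed fields listed above. Casava18Fields and split_casava_18 are hypothetical names, and the sketch ignores any additional information after the Casava line; it illustrates the layout and is not dinopy's implementation.

```python
from typing import NamedTuple


class Casava18Fields(NamedTuple):  # hypothetical container, not dinopy's _NameLine
    instrument: bytes
    run: int
    flowcell_id: bytes
    lane: int
    tile: int
    cluster_x: int
    cluster_y: int
    pair_member: int
    filtered: bool
    control_number: int
    index_sequence: bytes


def split_casava_18(line: bytes) -> Casava18Fields:
    """Split a Casava 1.8+ header into typed fields; raise ValueError on mismatch."""
    if not line.startswith(b"@"):
        raise ValueError("nameline must start with '@'")
    try:
        # The header has two parts separated by a single space.
        location, read_info = line[1:].split(b" ", 1)
        instrument, run, flowcell_id, lane, tile, x, y = location.split(b":")
        pair_member, filtered, control_number, index_sequence = read_info.split(b":")
        return Casava18Fields(
            instrument, int(run), flowcell_id, int(lane), int(tile),
            int(x), int(y), int(pair_member), filtered == b"Y",
            int(control_number), index_sequence,
        )
    except ValueError as err:
        # Raised by a failed unpacking or by int() on a non-numeric field.
        raise ValueError("not a Casava 1.8+ style nameline") from err
```

A Casava <1.8 header contains no space, so the first unpacking fails and the sketch raises a ValueError.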

dinopy.nameline_parser.parse_casava_pre18_line(bytes line) → _NameLine

Split a Casava <1.8 style Illumina FASTQ header. Documentation p. 50.

Parameters:

line (bytes) – A Casava <1.8 style header line.

Returns:

An object containing all information of the Casava line. Its fields are:

  • instrument (bytes)

  • flowcell_lane (int)

  • tile (int)

  • cluster_x (int)

  • cluster_y (int)

  • index_sequence (bytes)

  • pair_member (int)

  • additional_info (list) – additional information after the Casava line (empty most of the time)

Return type:

_NameLine
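
The Casava <1.8 layout can be sketched the same way with a single regular expression. PRE18, INT_FIELDS and split_casava_pre18 are hypothetical names; the sketch returns a plain dict and takes the index sequence and pair member literally from the header, so its values need not match what dinopy reports for the same line.

```python
import re

# Assumed layout:
# @instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member
PRE18 = re.compile(
    rb"^@(?P<instrument>[^:\s]+):(?P<flowcell_lane>\d+):(?P<tile>\d+)"
    rb":(?P<cluster_x>\d+):(?P<cluster_y>\d+)"
    rb"#(?P<index_sequence>[^/\s]+)/(?P<pair_member>\d+)"
)
INT_FIELDS = ("flowcell_lane", "tile", "cluster_x", "cluster_y", "pair_member")


def split_casava_pre18(line: bytes) -> dict:
    """Split a Casava <1.8 header into a dict; raise ValueError on mismatch."""
    match = PRE18.match(line)
    if match is None:
        raise ValueError("not a Casava <1.8 style nameline")
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])  # numeric fields become int
    return fields
```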

dinopy.nameline_parser.parse_ncbi_line(bytes line) → _NameLine