biohazard-1.0.0: bioinformatics support library

Safe Haskell: None
Language: Haskell2010

Bio.Bam.Fastq

Description

An iteratee-style parser for FastA and FastQ, based on Data.Attoparsec and written to be compatible with module Bam. It imports FastA/FastQ data while respecting some conventions local to MPI EVAN.

Documentation

parseFastq :: Monad m => Enumeratee Bytes [BamRec] m a

Reader for DNA (not protein) sequences in FastA and FastQ. We read everything vaguely looking like FastA or FastQ, then shoehorn it into a BAM record. We strive to extract information following more or less established conventions from the header, but don't aim for completeness. The recognized syntactical warts are converted into appropriate flags and removed. Only the canonical variant of FastQ is supported (qualities stored as raw bytes with offset 33).
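As an illustration of the "offset 33" quality encoding mentioned above, here is a minimal sketch of decoding a quality string into Phred scores. The name decodePhred33 is hypothetical and not part of this module, and the sketch works on String rather than Bytes to stay self-contained.

```haskell
import Data.Char (ord)

-- Hypothetical helper, not part of Bio.Bam.Fastq.
-- Decodes qualities stored as raw bytes with offset 33:
-- each character c stands for the Phred score (ord c - 33).
-- Returns Nothing if any character lies below '!' (ASCII 33).
decodePhred33 :: String -> Maybe [Int]
decodePhred33 = traverse step
  where
    step c
      | q >= 0    = Just q
      | otherwise = Nothing
      where
        q = ord c - 33
```

For example, decodePhred33 "!I5" yields Just [0,40,20].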

Supported additional conventions:

  • A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired.
  • Same for name prefixes of F_ or R_, respectively.
  • A name prefix of M_ flags the sequence as unpaired and merged.
  • A name prefix of T_ flags the sequence as unpaired and trimmed.
  • A name prefix of C_, optionally before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
  • A collection of tags separated from the name by an octothorpe is removed and put into the fields XI and XJ as text.
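
The conventions listed above can be sketched as a pure function over read names. This is only an illustration: the types Mate and NameInfo and the function interpretName are hypothetical, the order in which suffix, tags and prefixes are stripped is an assumption, and the tag text is kept whole rather than split into the XI and XJ fields.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical types for illustration; the real parser sets BAM flags
-- and extension fields on a BamRec instead.
data Mate = FirstMate | SecondMate | Merged | Trimmed
  deriving (Eq, Show)

data NameInfo = NameInfo
  { cleanName :: String        -- name with all recognized warts removed
  , mate      :: Maybe Mate    -- from /1, /2, F_, R_, M_ or T_
  , duplicate :: Bool          -- C_ seen (would become XP:i:-1)
  , indexTags :: Maybe String  -- text after the octothorpe, unsplit
  } deriving (Eq, Show)

-- Assumed order: strip the /1 or /2 suffix, split off the tags at '#',
-- then strip prefixes repeatedly.
interpretName :: String -> NameInfo
interpretName raw = stripPrefixes (NameInfo base suffixMate False tags)
  where
    (noSuffix, suffixMate)
      | "/1" `isSuffixOf` raw = (dropEnd2 raw, Just FirstMate)
      | "/2" `isSuffixOf` raw = (dropEnd2 raw, Just SecondMate)
      | otherwise             = (raw, Nothing)
    dropEnd2 s = take (length s - 2) s
    (base, rest) = break (== '#') noSuffix
    tags = case rest of
      '#' : t -> Just t
      _       -> Nothing

-- C_ may appear before or after the other prefixes, so keep looping
-- until no known prefix remains.
stripPrefixes :: NameInfo -> NameInfo
stripPrefixes ni = case cleanName ni of
  'C' : '_' : r -> stripPrefixes (ni { cleanName = r, duplicate = True })
  'F' : '_' : r -> stripPrefixes (ni { cleanName = r, mate = Just FirstMate })
  'R' : '_' : r -> stripPrefixes (ni { cleanName = r, mate = Just SecondMate })
  'M' : '_' : r -> stripPrefixes (ni { cleanName = r, mate = Just Merged })
  'T' : '_' : r -> stripPrefixes (ni { cleanName = r, mate = Just Trimmed })
  _             -> ni
```

Under these assumptions, interpretName "C_M_read7#ACGT" gives a record with name "read7", mate Just Merged, the duplicate marker set, and tags Just "ACGT".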

Everything before the first sequence header is ignored. Headers can start with > or @; both are treated equally. The first word of the header becomes the read name; the remainder of the header is ignored. The sequence can be split across multiple lines. Whitespace, dashes and dots are ignored, IUPAC-IUB ambiguity codes are accepted as bases, and anything else causes an error. The sequence ends at a line that is either a header or starts with +; in the latter case, that line is ignored and must be followed by quality scores. There must be exactly as many Q-scores as there are bases, followed immediately by a header or end-of-file. Whitespace is ignored.
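
The sequence rules above (skip whitespace, dashes and dots; accept IUPAC-IUB codes; reject everything else) can be sketched as follows. The function cleanSequence is hypothetical, and the exact set of accepted letters, including U and lower-case letters, is an assumption rather than something this documentation states.

```haskell
import Data.Char (isSpace, toUpper)

-- Hypothetical helper, not part of Bio.Bam.Fastq.
-- Joins sequence text that may span several lines: whitespace, dashes
-- and dots are dropped, IUPAC-IUB nucleotide codes are kept as they are,
-- and any other character is reported as an error.
cleanSequence :: String -> Either String String
cleanSequence = traverse check . filter (not . ignored)
  where
    ignored c = isSpace c || c == '-' || c == '.'
    check c
      -- Assumed alphabet: the IUPAC-IUB nucleotide codes, either case.
      | toUpper c `elem` "ACGTUNRYSWKMBDHV" = Right c
      | otherwise                           = Left ("invalid base: " ++ [c])
```

For example, cleanSequence "AC-G T.\nNNry" yields Right "ACGTNNry", while cleanSequence "ACGJ" yields Left "invalid base: J".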

parseFastq' :: Monad m => (Bytes -> BamRec -> BamRec) -> Enumeratee Bytes [BamRec] m a

Same as parseFastq, but a custom function can be applied to the description string (the part of the header after the sequence name), which can modify the parsed record. Note that the quality field can end up empty.
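
To show the shape of such a callback without depending on the library, the sketch below uses a stand-in record type. Rec, fromHeader and keepNote are all hypothetical; the real callback receives Bytes and a BamRec, and the header is assumed here to have its leading > or @ already removed.

```haskell
import Data.Char (isSpace)

-- Hypothetical stand-in for BamRec with just two fields.
data Rec = Rec
  { recName :: String
  , recNote :: Maybe String
  } deriving (Eq, Show)

-- Splits a header into name (first word) and description (the rest),
-- builds a record from the name, then lets the callback modify it.
fromHeader :: (String -> Rec -> Rec) -> String -> Rec
fromHeader f header = f (dropWhile isSpace descr) (Rec name Nothing)
  where
    (name, descr) = break isSpace header

-- Example callback: keep a non-empty description as a note.
keepNote :: String -> Rec -> Rec
keepNote d r
  | null d    = r
  | otherwise = r { recNote = Just d }
```

With these definitions, fromHeader keepNote "read1 some text" yields Rec "read1" (Just "some text"), and a header without a description leaves the note empty.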

parseFastqCassava :: Monad m => Enumeratee Bytes [BamRec] m a

Like parseFastq, but also

  • If the first word of the description has at least four colon-separated subfields, the first is used to flag first/second mate, the second is the "QC failed" flag, and the fourth is the index sequence.
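
The subfield rule above can be sketched as a small pure parser. The names Cassava, splitOn and parseCassava are hypothetical, and treating the literal value Y as "QC failed" follows the usual Casava 1.8 header layout (read number, filter flag, control number, index); it is an assumption about how this function interprets the field.

```haskell
import Data.Char (isSpace)

-- Hypothetical result type for illustration.
data Cassava = Cassava
  { readNumber :: Maybe Int  -- Just 1 or Just 2 when recognized
  , qcFailed   :: Bool       -- assumed: second subfield equals "Y"
  , indexSeq   :: String     -- fourth subfield
  } deriving (Eq, Show)

-- Split a string on a separator character, keeping empty fields.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (a, [])     -> [a]
  (a, _ : bs) -> a : splitOn sep bs

-- Looks only at the first word of the description and requires at
-- least four colon-separated subfields; the third one is skipped.
parseCassava :: String -> Maybe Cassava
parseCassava descr =
  case splitOn ':' (takeWhile (not . isSpace) descr) of
    (r : q : _ : i : _) -> Just (Cassava (mateOf r) (q == "Y") i)
    _                   -> Nothing
  where
    mateOf "1" = Just 1
    mateOf "2" = Just 2
    mateOf _   = Nothing
```

For example, parseCassava "1:N:0:ATCACG" yields Just (Cassava (Just 1) False "ATCACG"), and a description with fewer than four subfields yields Nothing.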