Safe Haskell | None |
---|---|
Language | Haskell2010 |
Parser for FastA/FastQ, Iteratee style, based on Data.Attoparsec, and written such that it is compatible with module Bam. This allows importing FastA/FastQ while respecting some conventions local to MPI EVAN.
Synopsis
- parseFastq :: Monad m => Enumeratee Bytes [BamRec] m a
- parseFastq' :: Monad m => (Bytes -> BamRec -> BamRec) -> Enumeratee Bytes [BamRec] m a
- parseFastqCassava :: Monad m => Enumeratee Bytes [BamRec] m a
Documentation
parseFastq :: Monad m => Enumeratee Bytes [BamRec] m a
Reader for DNA (not protein) sequences in FastA and FastQ. We read everything vaguely looking like FastA or FastQ, then shoehorn it into a BAM record. We strive to extract information following more or less established conventions from the header, but don't aim for completeness. The recognized syntactical warts are converted into appropriate flags and removed. Only the canonical variant of FastQ is supported (qualities stored as raw bytes with offset 33).
Supported additional conventions:
- A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired.
- Same for name prefixes of F_ or R_, respectively.
- A name prefix of M_ flags the sequence as unpaired and merged.
- A name prefix of T_ flags the sequence as unpaired and trimmed.
- A name prefix of C_, optionally before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
- A collection of tags separated from the name by an octothorpe is removed and put into the fields XI and XJ as text.
Everything before the first sequence header is ignored. Headers can start with > or @; both are treated equally. The first word of the header becomes the read name, the remainder of the header is ignored. The sequence can be split across multiple lines; whitespace, dashes and dots are ignored, IUPAC-IUB ambiguity codes are accepted as bases, anything else causes an error. The sequence ends at a line that is either a header or starts with +; in the latter case, that line is ignored and must be followed by quality scores. There must be exactly as many Q-scores as there are bases, followed immediately by a header or end-of-file. Whitespace is ignored.
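The rules for the sequence and quality lines can be sketched as two pure functions. This is an illustration under assumptions, not the parser itself: cleanSeq and decodeQuals are hypothetical names, the exact alphabet (including U and lowercase letters) is assumed, and plain String stands in for Bytes.

```haskell
import Data.Char (isSpace, ord, toUpper)

-- IUPAC-IUB nucleotide codes, including ambiguity codes.
-- Assumption: U and lowercase letters are accepted as well.
iupac :: String
iupac = "ACGTUMRWSYKVHDBN"

-- Drop whitespace, dashes and dots; anything else that is not an
-- IUPAC code is reported as an error.
cleanSeq :: String -> Either String String
cleanSeq = traverse check . filter (not . ignored)
  where
    ignored c = isSpace c || c == '-' || c == '.'
    check c
      | toUpper c `elem` iupac = Right c
      | otherwise              = Left ("unexpected character in sequence: " ++ show c)

-- Decode qualities stored as raw bytes with offset 33, ignoring
-- whitespace, and insist on exactly one Q-score per base.
decodeQuals :: String -> String -> Either String [Int]
decodeQuals bases quals
  | length qs == length bases = Right qs
  | otherwise                 = Left "number of Q-scores does not match number of bases"
  where
    qs = [ ord c - 33 | c <- quals, not (isSpace c) ]
```

For example, cleanSeq "AC-G T.n" gives Right "ACGTn", and decodeQuals "ACGT" "II5!" gives Right [40,40,20,0].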
parseFastq' :: Monad m => (Bytes -> BamRec -> BamRec) -> Enumeratee Bytes [BamRec] m a
Same as parseFastq, but a custom function can be applied to the description string (the part of the header after the sequence name), which can modify the parsed record. Note that the quality field can end up empty.
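To give an idea of the kind of function that fits the first argument, here is a sketch with stand-in types. Rec is a hypothetical, much simplified record and not the real BamRec, String stands in for Bytes, and storing the description under a CO tag is only an illustrative choice.

```haskell
-- Hypothetical, simplified stand-in for a BAM record; the real BamRec
-- of module Bam has a different and larger set of fields.
data Rec = Rec
  { recName  :: String
  , recQual  :: [Int]               -- may be empty, e.g. for FastA input
  , recExtra :: [(String, String)]  -- extra tagged fields as text
  } deriving (Eq, Show)

-- Has the shape (description -> record -> record): keep a non-empty
-- description as an extra field and leave the record alone otherwise.
keepDescription :: String -> Rec -> Rec
keepDescription desc r
  | null desc = r
  | otherwise = r { recExtra = ("CO", desc) : recExtra r }
```

Because the quality field can end up empty, a function of this shape should not assume that recQual has one entry per base.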
parseFastqCassava :: Monad m => Enumeratee Bytes [BamRec] m a
Like parseFastq, but also:
- If the first word of the description has at least four colon separated subfields, the first is used to flag first/second mate, the second is the "QC failed" flag, and the fourth is the index sequence.
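The subfield rule can be sketched as follows. This is not the library's code: Cassava, splitOn and parseCassava are hypothetical names, String stands in for Bytes, and treating the value Y in the second subfield as "QC failed" is an assumption based on the usual Illumina header convention.

```haskell
-- Hypothetical stand-in for the values taken from the description.
data Cassava = Cassava
  { csMate     :: Int     -- 1 or 2, from the first subfield
  , csQcFailed :: Bool    -- from the second subfield
  , csIndex    :: String  -- from the fourth subfield
  } deriving (Eq, Show)

-- Split a string at every occurrence of a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (a, _ : rest) -> a : splitOn sep rest
  (a, [])       -> [a]

-- Look only at the first word of the description and require at
-- least four colon separated subfields.
parseCassava :: String -> Maybe Cassava
parseCassava desc = case words desc of
  []    -> Nothing
  w : _ -> case splitOn ':' w of
    (m : qc : _ : idx : _)
      | m `elem` ["1", "2"] -> Just (Cassava (read m) (qc == "Y") idx)
    _ -> Nothing
```

For example, parseCassava "1:N:0:ATCACG" gives Just (Cassava 1 False "ATCACG"), while a description whose first word has fewer than four subfields gives Nothing.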