Safe Haskell | None |
---|---|
Language | Haskell2010 |
Trimming and fusing of reads as found in BAM files.
This API is remarkably ugly because the core loop is implemented in
C. This requires the adapters to be in storable vectors, and since
they shouldn't be constantly copied around, the ugly withADSeqs
function is needed. The performance gain seems to be worth it,
though.
Synopsis
- trim_3 :: Int -> BamRec -> BamRec
- trim_3' :: ([Nucleotides] -> [Qual] -> Bool) -> BamRec -> BamRec
- trim_low_quality :: Qual -> a -> [Qual] -> Bool
- data AD_Seqs
- withADSeqs :: MonadBaseControl IO m => [Vector Nucleotides] -> (AD_Seqs -> m r) -> m r
- default_fwd_adapters :: [Vector Nucleotides]
- default_rev_adapters :: [Vector Nucleotides]
- find_merge :: AD_Seqs -> AD_Seqs -> Vector Nucleotides -> Vector Qual -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int)
- mergeBam :: Int -> Int -> AD_Seqs -> AD_Seqs -> BamRec -> BamRec -> IO [BamRec]
- find_trim :: AD_Seqs -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int)
- trimBam :: Int -> Int -> AD_Seqs -> BamRec -> IO [BamRec]
- merged_seq :: (Vector v Nucleotides, Vector v Qual) => Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Nucleotides
- merged_qual :: (Vector v Nucleotides, Vector v Qual) => Word8 -> Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Qual
Documentation
trim_3' :: ([Nucleotides] -> [Qual] -> Bool) -> BamRec -> BamRec Source #
Trims from the 3' end of a sequence.
trim_3' p b
trims the 3' end of the sequence in b
at the
earliest position such that p
evaluates to true on every suffix
that was trimmed off. Note that the 3' end may be the beginning of
the sequence if it happens to be stored in reverse-complemented form.
Also note that trimming from the 3' end may not make sense for reads
that were constructed by merging paired end data (but we cannot take
care of that here). Further note that trimming may break dependent
information, notably the "mate" information and many optional fields.
Since the intention is to trim based on quality scores, reads without
qualities are passed along unchanged.
trim_low_quality :: Qual -> a -> [Qual] -> Bool Source #
Trim predicate to get rid of low quality sequence.
trim_low_quality q ns qs
evaluates to true if all qualities in qs
are smaller (i.e. worse) than q
.
withADSeqs :: MonadBaseControl IO m => [Vector Nucleotides] -> (AD_Seqs -> m r) -> m r Source #
default_fwd_adapters :: [Vector Nucleotides] Source #
For merging, we don't need the complete adapters (length around 70!), only a sufficient prefix. Taking only the more-or-less constant part (length around 30), there aren't all that many different adapters in the world. To deal with pretty much every library, we only need the following forward adapters, which will be the default (defined here in the direction they would be sequenced in): Genomic R2, Multiplex R2, Fraft P7.
default_rev_adapters :: [Vector Nucleotides] Source #
Like default_rev_adapters
, these are the few adapters needed for
the reverse read (defined in the direction they would be sequenced in
as part of the second read): Genomic R1, CL 72.
find_merge :: AD_Seqs -> AD_Seqs -> Vector Nucleotides -> Vector Qual -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int) Source #
Finds the merge point. Input is list of forward adapters, list of reverse adapters, sequence1, quality1, sequence2, quality2; output is merge point and two qualities (YM, YN).
mergeBam :: Int -> Int -> AD_Seqs -> AD_Seqs -> BamRec -> BamRec -> IO [BamRec] Source #
Overlap-merging of read pairs. We shall compute the likelihood for every possible overlap, then select the most likely one (unless it looks completely random), compute a quality from the second best merge, then merge and clamp the quality accordingly. (We could try looking for chimaera after completing the merge, if only we knew which ones to expect?)
Two reads go in, with two adapter lists. We return Nothing
if all
merges looked mostly random. Else we return the two original reads,
flagged as eflagVestigial
*and* the merged version, flagged as
eflagMerged
and optionally eflagTrimmed
. All reads contain the
computed qualities (in YM and YN), which we also return.
The merging automatically limits quality scores some of the time. We additionally impose a hard limit of 63 to avoid difficulties representing the result, and even that is ridiculous. Sane people would further limit the returned quality! (In practice, map quality later imposes a limit anyway, so no worries...)
find_trim :: AD_Seqs -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int) Source #
Finds the trimming point. Input is list of forward adapters, sequence, quality; output is trim point and two qualities (YM, YN).
trimBam :: Int -> Int -> AD_Seqs -> BamRec -> IO [BamRec] Source #
Trimming for a single read: we need one adapter only (the one coming after the read), here provided as a list of options, and then we merge with an empty second read. Results in up to two reads (the original, possibly flagged, and the trimmed one, definitely flagged, and two qualities).
merged_seq :: (Vector v Nucleotides, Vector v Qual) => Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Nucleotides Source #
merged_qual :: (Vector v Nucleotides, Vector v Qual) => Word8 -> Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Qual Source #