Safe Haskell	None
Language	Haskell2010

Bio.Bam.Trim

Description

Trimming and fusing of reads as found in BAM files.

This API is remarkably ugly because the core loop is implemented in C. This requires the adapters to be in storable vectors, and since they shouldn't be constantly copied around, the ugly withADSeqs function is needed. The performance gain seems to be worth it, though.

Synopsis

trim_3 :: Int -> BamRec -> BamRec
trim_3' :: ([Nucleotides] -> [Qual] -> Bool) -> BamRec -> BamRec
trim_low_quality :: Qual -> a -> [Qual] -> Bool
data AD_Seqs
withADSeqs :: MonadBaseControl IO m => [Vector Nucleotides] -> (AD_Seqs -> m r) -> m r
default_fwd_adapters :: [Vector Nucleotides]
default_rev_adapters :: [Vector Nucleotides]
find_merge :: AD_Seqs -> AD_Seqs -> Vector Nucleotides -> Vector Qual -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int)
mergeBam :: Int -> Int -> AD_Seqs -> AD_Seqs -> BamRec -> BamRec -> IO [BamRec]
find_trim :: AD_Seqs -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int)
trimBam :: Int -> Int -> AD_Seqs -> BamRec -> IO [BamRec]
merged_seq :: (Vector v Nucleotides, Vector v Qual) => Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Nucleotides
merged_qual :: (Vector v Nucleotides, Vector v Qual) => Word8 -> Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Qual

Documentation

trim_3 :: Int -> BamRec -> BamRec Source #

trim_3' :: ([Nucleotides] -> [Qual] -> Bool) -> BamRec -> BamRec Source #

Trims from the 3' end of a sequence. trim_3' p b trims the 3' end of the sequence in b at the earliest position such that p evaluates to true on every suffix that was trimmed off. Note that the 3' end may be the beginning of the sequence if it happens to be stored in reverse-complemented form. Also note that trimming from the 3' end may not make sense for reads that were constructed by merging paired end data (but we cannot take care of that here). Further note that trimming may break dependent information, notably the "mate" information and many optional fields. Since the intention is to trim based on quality scores, reads without qualities are passed along unchanged.

trim_low_quality :: Qual -> a -> [Qual] -> Bool Source #

Trim predicate to get rid of low quality sequence. trim_low_quality q ns qs evaluates to true if all qualities in qs are smaller (i.e. worse) than q.

data AD_Seqs Source #

withADSeqs :: MonadBaseControl IO m => [Vector Nucleotides] -> (AD_Seqs -> m r) -> m r Source #

default_fwd_adapters :: [Vector Nucleotides] Source #

For merging, we don't need the complete adapters (length around 70!), only a sufficient prefix. Taking only the more-or-less constant part (length around 30), there aren't all that many different adapters in the world. To deal with pretty much every library, we only need the following forward adapters, which will be the default (defined here in the direction they would be sequenced in): Genomic R2, Multiplex R2, Fraft P7.

default_rev_adapters :: [Vector Nucleotides] Source #

Like default_rev_adapters, these are the few adapters needed for the reverse read (defined in the direction they would be sequenced in as part of the second read): Genomic R1, CL 72.

find_merge :: AD_Seqs -> AD_Seqs -> Vector Nucleotides -> Vector Qual -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int) Source #

Finds the merge point. Input is list of forward adapters, list of reverse adapters, sequence1, quality1, sequence2, quality2; output is merge point and two qualities (YM, YN).

mergeBam :: Int -> Int -> AD_Seqs -> AD_Seqs -> BamRec -> BamRec -> IO [BamRec] Source #

Overlap-merging of read pairs. We shall compute the likelihood for every possible overlap, then select the most likely one (unless it looks completely random), compute a quality from the second best merge, then merge and clamp the quality accordingly. (We could try looking for chimaera after completing the merge, if only we knew which ones to expect?)

Two reads go in, with two adapter lists. We return Nothing if all merges looked mostly random. Else we return the two original reads, flagged as eflagVestigial *and* the merged version, flagged as eflagMerged and optionally eflagTrimmed. All reads contain the computed qualities (in YM and YN), which we also return.

The merging automatically limits quality scores some of the time. We additionally impose a hard limit of 63 to avoid difficulties representing the result, and even that is ridiculous. Sane people would further limit the returned quality! (In practice, map quality later imposes a limit anyway, so no worries...)

find_trim :: AD_Seqs -> Vector Nucleotides -> Vector Qual -> IO (Int, Int, Int) Source #

Finds the trimming point. Input is list of forward adapters, sequence, quality; output is trim point and two qualities (YM, YN).

trimBam :: Int -> Int -> AD_Seqs -> BamRec -> IO [BamRec] Source #

Trimming for a single read: we need one adapter only (the one coming after the read), here provided as a list of options, and then we merge with an empty second read. Results in up to two reads (the original, possibly flagged, and the trimmed one, definitely flagged, and two qualities).

merged_seq :: (Vector v Nucleotides, Vector v Qual) => Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Nucleotides Source #

merged_qual :: (Vector v Nucleotides, Vector v Qual) => Word8 -> Int -> v Nucleotides -> v Qual -> v Nucleotides -> v Qual -> v Qual Source #