| Safe Haskell | None |
| --- | --- |
| Language | Haskell2010 |
Multiversion segmented reduction.
Synopsis
- regularSegmentedRedomap :: (HasScope Kernels m, MonadBinder m, Lore m ~ Kernels) => SubExp -> SubExp -> [SubExp] -> Pattern Kernels -> Pattern Kernels -> SubExp -> Commutativity -> Lambda InKernel -> Lambda InKernel -> [(VName, SubExp)] -> [KernelInput] -> [SubExp] -> [VName] -> m ()
- regularSegmentedScan :: (MonadBinder m, Lore m ~ Kernels) => SubExp -> Pattern Kernels -> SubExp -> Lambda InKernel -> Lambda InKernel -> [(VName, SubExp)] -> [KernelInput] -> [SubExp] -> [VName] -> m ()
Documentation
regularSegmentedRedomap :: (HasScope Kernels m, MonadBinder m, Lore m ~ Kernels) => SubExp -> SubExp -> [SubExp] -> Pattern Kernels -> Pattern Kernels -> SubExp -> Commutativity -> Lambda InKernel -> Lambda InKernel -> [(VName, SubExp)] -> [KernelInput] -> [SubExp] -> [VName] -> m ()
regularSegmentedRedomap generates code for a segmented redomap using two different strategies, deciding dynamically which one to use based on the number of segments and the segment size. The (static) group_size is used to choose between the following two strategies:
- Large: uses one or more groups to process a single segment. If multiple groups are used per segment, the intermediate reduction results must be reduced recursively until only a single value per segment remains.
  Each thread can read multiple elements, which greatly increases performance; however, if the reduction is non-commutative, the input array is transposed (by the KernelBabysitter) to enable memory-coalesced accesses. Currently each thread always reads as many elements as it can, but this can be unfavorable because of the transpose: when each thread can only read two elements, the cost of the transpose may not be worth the performance gained by letting each thread read multiple elements. This could be investigated in more depth in the future (TODO).
- Small: lets each group process *multiple* segments. We use this approach only when at least two segments fit within a single group. In such cases the large strategy would allocate a whole group per segment, yet at most 50% of the threads in the group would have any element to read, which is highly inefficient.
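The selection logic described above can be sketched in plain Haskell. This is a hypothetical illustration, not the compiler's actual implementation: the names `SegStrategy` and `chooseStrategy` are invented here, and the only rule encoded is the one stated in the text, namely that the small strategy is chosen exactly when at least two whole segments fit in one group.

```haskell
-- Illustrative sketch (assumed names, not the real compiler internals):
-- pick between the two segmented-redomap strategies from the static
-- group size and the segment size.
data SegStrategy = Large | Small
  deriving (Eq, Show)

-- Small only when at least two whole segments fit in a single group;
-- otherwise one or more groups are devoted to each segment (Large).
chooseStrategy :: Int -> Int -> SegStrategy
chooseStrategy groupSize segmentSize
  | segmentSize > 0 && groupSize `div` segmentSize >= 2 = Small
  | otherwise                                           = Large

main :: IO ()
main = do
  print (chooseStrategy 256 100)  -- two segments of 100 fit in 256: Small
  print (chooseStrategy 256 200)  -- only one segment of 200 fits: Large
```

In the real pass the decision is made dynamically at run time, since the number of segments and the segment size are `SubExp`s rather than compile-time constants; this sketch only shows the shape of the predicate.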