Safe Haskell | None |
---|---|
Language | Haskell2010 |
We generate code for non-segmented/single-segment SegRed using the basic approach outlined in the paper "Design and GPGPU Performance of Futhark’s Redomap Construct" (ARRAY '16). The main deviations are:
- While we still use two-phase reduction, we use only a single kernel: the last workgroup to write its result (tracked via an atomic counter) also performs the final reduction.
- Instead of depending on storage layout transformations to handle non-commutative reductions efficiently, we slide a `groupsize`-sized window over the input, and perform a parallel reduction for each window. This sacrifices the notion of efficient sequentialisation, but is sometimes faster and definitely simpler and more predictable (and uses less auxiliary storage).
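
The windowed traversal for non-commutative operators can be illustrated with a small pure model. This is only a sketch under stated assumptions, not the code the compiler generates: it models a single group, the names `chunksOf`, `treeReduce` and `windowedReduce` are hypothetical, and the tree-shaped `treeReduce` merely stands in for the parallel group-wide reduction. It shows that reducing each `groupsize`-sized window and combining the window results in order requires only associativity, not commutativity.

```haskell
-- Pure model of the windowed traversal (illustrative only; all names
-- here are hypothetical and do not come from the Futhark compiler).

-- Split a list into consecutive windows of at most n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0 = error "chunksOf: window size must be positive"
  | null xs = []
  | otherwise = let (w, rest) = splitAt n xs in w : chunksOf n rest

-- Tree-shaped reduction of one window, standing in for the parallel
-- group-wide reduction. Only adjacent elements are combined, so the
-- operand order is preserved and associativity is sufficient.
treeReduce :: (a -> a -> a) -> a -> [a] -> a
treeReduce _ ne [] = ne
treeReduce _ _ [x] = x
treeReduce op ne xs = treeReduce op ne (pairUp xs)
  where
    pairUp (a : b : rest) = op a b : pairUp rest
    pairUp rest = rest

-- Reduce each window, then fold the per-window results left to right.
windowedReduce :: Int -> (a -> a -> a) -> a -> [a] -> a
windowedReduce groupSize op ne =
  foldl op ne . map (treeReduce op ne) . chunksOf groupSize
```

For instance, `windowedReduce 4 (++) "" (map show [1 .. 10])` uses string concatenation, which is associative but not commutative, and still yields the elements in their original order.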
For segmented reductions we use the approach from "Strategies for Regular Segmented Reductions on GPU" (FHPC '17). This involves having two different strategies, and dynamically deciding which one to use based on the number of segments and segment size. We use the (static) `group_size` to decide which of the following two strategies to choose:
- Large: uses one or more groups to process a single segment. If multiple groups are used per segment, the intermediate reduction results must be recursively reduced, until there is only a single value per segment.
Each thread can read multiple elements, which will greatly increase performance; however, if the reduction is non-commutative we will have to use a less efficient traversal (with interim group-wide reductions) to enable coalesced memory accesses, just as in the non-segmented case.
- Small: each group processes *multiple* segments. We only use this approach when at least two segments fit within a single group. In those cases the large strategy would allocate a whole group per segment, but at most 50% of the threads in the group would have any element to read, which is highly inefficient.
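
The choice between the two strategies can be sketched as a simple predicate on the group size and segment size. This is a minimal illustration derived only from the prose above ("at least two segments within a single group"). The names `Strategy`, `chooseStrategy` and `segmentsPerGroup` are hypothetical, and the exact comparison used by the compiler (for example, how the boundary case is handled) is an assumption.

```haskell
-- Illustrative model of the strategy choice; all names are hypothetical
-- and the exact predicate used by the compiler may differ.

data Strategy = Small | Large
  deriving (Eq, Show)

-- Pick Small when at least two whole segments fit in one group
-- (assumed boundary: exactly two segments still counts as Small).
chooseStrategy :: Int -> Int -> Strategy
chooseStrategy groupSize segmentSize
  | segmentSize * 2 <= groupSize = Small
  | otherwise = Large

-- Number of whole segments one group can cover under the Small strategy.
-- The max guards against division by zero for empty segments.
segmentsPerGroup :: Int -> Int -> Int
segmentsPerGroup groupSize segmentSize = groupSize `div` max 1 segmentSize
```

With a group size of 256, a segment size of 100 would select `Small` under this model (two segments per group), while a segment size of 200 would select `Large`, since a second segment no longer fits and the small strategy would leave no benefit over one group per segment.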
Documentation
compileSegRed :: Pattern ExplicitMemory -> KernelSpace -> Commutativity -> Lambda InKernel -> [SubExp] -> Body InKernel -> CallKernelGen ()