futhark-0.9.1: An optimising compiler for a functional, array-oriented language.

Safe HaskellNone
LanguageHaskell2010

Futhark.CodeGen.ImpGen.Kernels.SegRed

Description

We generate code for non-segmented/single-segment SegRed using the basic approach outlined in the paper "Design and GPGPU Performance of Futhark’s Redomap Construct" (ARRAY '16). The main deviations are:

  • While we still use two-phase reduction, we use only a single kernel, with the final workgroup to write a result (tracked via an atomic counter) performing the final reduction as well.
  • Instead of depending on storage layout transformations to handle non-commutative reductions efficiently, we slide a groupsize-sized window over the input, and perform a parallel reduction for each window. This sacrifices the notion of efficient sequentialisation, but is sometimes faster and definitely simpler and more predictable (and uses less auxiliary storage).

For segmented reductions we use the approach from "Strategies for Regular Segmented Reductions on GPU" (FHPC '17). This involves having two different strategies, and dynamically deciding which one to use based on the number of segments and segment size. We use the (static) group_size to decide which of the following two strategies to choose:

  • Large: uses one or more groups to process a single segment. If multiple groups are used per segment, the intermediate reduction results must be recursively reduced, until there is only a single value per segment.

Each thread can read multiple elements, which will greatly increase performance; however, if the reduction is non-commutative we will have to use a less efficient traversal (with interim group-wide reductions) to enable coalesced memory accesses, just as in the non-segmented case.

  • Small: is used to let each group process *multiple* segments within a group. We will only use this approach when we can process at least two segments within a single group. In those cases, we would allocate a whole group per segment with the large strategy, but at most 50% of the threads in the group would have any element to read, which becomes highly inefficient.

Documentation