Haskell Distributed Parallel Haskell ==================================== **Haskell distributed parallel Haskell (HdpH)** is a Haskell DSL for parallel computation on distributed-memory architectures. HdpH is implemented entirely in Haskell but does make use of a few GHC extensions, most notably TemplateHaskell. HdpH is described in some detail in the paper [Implementing a High-level Distributed-Memory Parallel Haskell in Haskell](http://www.macs.hw.ac.uk/~pm175/papers/Maier_Trinder_IFL2011_XT.pdf). The paper on [The Design of SymGridPar2](http://www.macs.hw.ac.uk/~pm175/papers/Maier_Stewart_Trinder_SAC2013.pdf) paints the bigger picture of what HdpH was designed for and where future developments will lead. This release is considered alpha stage. Building HdpH ------------- Should be straightforward from the cabalised package `hdph`. Note that `hdph` depends on `hdph-closure`, an independent package that factors out the closure representation of HdpH. Running HdpH ------------ ### Launch via `ssh` HdpH comes with a few sample applications. The simplest one, a distributed _Hello World_, is good for testing whether HdpH works. For example, the following Bourne shell command will run _Hello World_ on three nodes, distributed over two hosts, `bwlf01` and `bwlf02`. $> for host in bwlf01 bwlf02 bwlf01; do ssh $host hdph/dist/build/hello/hello -numProcs=3 & done This will launch three instances of the _Hello World_ binary `hello` (in directory `hdph/dist/build/hello` relative to the user's home directory), two on `bwlf01` and one on `bwlf02`. Note that the obligatory option `-numProcs` must match the number of instances; HdpH will likely hang otherwise. Assuming that `ssh` has been set up to enable password-less logins, the output should be something like this: Master Node:137.195.143.101:19754:0 wants to know: Who is here? Hello from Node:137.195.143.101:19754:0 Hello from Node:137.195.143.102:38939:0 Hello from Node:137.195.143.101:30062:0 ### Launch via `mpiexec` Shell scripts and `ssh` are quite a cumbersome way of launching HdpH. More convenient are dedicated parallel job launchers such as `mpiexec`, which comes with any recent MPI distribution. Even though this version of HdpH does not use MPI, it can still be launched by `mpiexec`. An MPICH installation, for example, might launch the _Hello World_ application with the following command: $> mpiexec -hosts bwlf01,bwlf02,bwlf01 ./hello -numProcs=3 This will launch the _Hello World_ binary `hello` (in the current working directory) 3 times, twice on `bwlf01` and once on `bwlf02`. The expected output should look somewhat like this: Hello from Node:137.195.143.102:12701:0 Hello from Node:137.195.143.101:15353:0 Master Node:137.195.143.101:14247:0 wants to know: Who is here? Hello from Node:137.195.143.101:14247:0 ### Parallel launch via `mpiexec` To test parallelism, there is a sample application computing the `n`th number of the _Fibonacci_ sequence using a naive parallel divide-and-conquer algorithm. For example, the following command will compute the 45th Fibonacci number in parallel on 3 nodes, switching to sequential execution for subproblems below a threshold of 30; a wall clock runtime of about 16 seconds is reported here. $> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 v2 45 30 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=16.157814s} When passed the switch `-d1` HdpH will produce some summary output per node relating to parallelism, eg. number of sparks generated per node, maximum number of sparks residing on a node, number of times a node requested work, and number of times work was scheduled in response. An example run with `-d1` might look like this: $> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -d1 v2 45 30 137.195.143.101:9759:0 #SPARK=1017 max_SPARK=154 max_THREAD=[3,1] 137.195.143.101:9759:0 #FISH_sent=1 #SCHED_rcvd=0 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=16.449183s} 137.195.143.102:12551:0 #SPARK=308 max_SPARK=4 max_THREAD=[1,1] 137.195.143.102:12551:0 #FISH_sent=215 #SCHED_rcvd=214 137.195.143.103:27402:0 #SPARK=271 max_SPARK=4 max_THREAD=[0,1] 137.195.143.103:27402:0 #FISH_sent=225 #SCHED_rcvd=224 This shows that `bwlf01` (IP address `137.195.143.101`) was the master node, generating 1017 sparks in total, requesting work once (probably at the end of the computation) and not receiving any. The other nodes did generate some sparks themselves (308 and 271, respectively) but they also both received a substantial number of sparks (214 and 224, respectively). Most importantly, each node requested work exactly once more often than receiving it, which means no node was ever idling, except probably at the very end of the computation. ### Parallel launch via `mpiexec` on homogeneous multicores So far HdpH ran in a single thread on each node. To make use of multicores `mpiexec` can place several single-threaded nodes on the same host, eg. the following examples will launch two and four nodes per host, respectively; note that the number following the `-n` switch of `mpiexec` must match the number following the `-numProcs` switch of `fib`. $> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 6 ./fib -numProcs=6 v2 45 30 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=8.026417s} $> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 12 ./fib -numProcs=12 v2 45 30 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=4.845796s} The other possibility is to run HdpH itself in a multi-threaded mode. The following two examples run HdpH on two and four threads per node, respectively, expecting that the OS will bind each thread to a core. $> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=2 v2 45 30 +RTS -N2 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=10.648305s} $> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=4 v2 45 30 +RTS -N4 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=6.507969s} Note that the GHC RTS switch `-N` determines the number of threads (or HECs in GHC terminology) whereas the `fib` switch `-scheds` determines how many of these threads run the HdpH scheduler loop. Whether there should be as many schedulers as threads or less depends on the GHC version and probably on the OS. Using one scheduler less than the number of threads has been observed to reduce variability, and sometimes even bring performance gains: $> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=3 v2 45 30 +RTS -N4 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=5.831932s} ### Parallel launch via `mpiexec` on heterogeneous multicores So far all hosts were assumed to have the same number of cores. However, `mpiexec` can be used to launch on heterogeneous clusters. For instance, the following call will launch HdpH with two threads each on `bwlf01` and `bwlf02`, and with four threads on `bwlf03`; please see the man page of `mpiexec` for an explanation of the command line format. $> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 2 ./fib -numProcs=3 -scheds=2 -d1 v2 45 30 +RTS -N2 : -n 1 ./fib -numProcs=3 -scheds=4 -d1 v2 45 30 +RTS -N4 137.195.143.101:21394:0 #SPARK=1114 max_SPARK=183 max_THREAD=[3,1,1] 137.195.143.101:21394:0 #FISH_sent=3 #SCHED_rcvd=1 137.195.143.103:22027:0 #SPARK=261 max_SPARK=2 max_THREAD=[0,1,1,1,1] 137.195.143.103:22027:0 #FISH_sent=209 #SCHED_rcvd=208 137.195.143.102:22374:0 #SPARK=221 max_SPARK=3 max_THREAD=[1,1,1] 137.195.143.102:22374:0 #FISH_sent=173 #SCHED_rcvd=172 {v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=10.502205s} The switch `-d1` reveals that `bwlf03` (IP address `137.195.143.103`) did indeed run 4 schedulers because its `max_THREAD` list was length 5, and the length of the `max_THREAD` list is always 1 plus the number of schedulers. ### Launch caveats At startup HdpH nodes open random ports and rely on UDP multicasts to discover each other. This results in a number of limitations: + UDP multicasts must be routed between all nodes (which may imply that all hosts must reside in the same subnet). + The obligatory `-numProcs` switch, which tells an application how many nodes to expect, must match exactly the number of nodes launched. + Node discovery must be completed within a certain time frame (typically 10 seconds). + No second HdpH application may be launched during the node discovery phase. Any deviation from the above limitations is likely to cause HdpH to hang forever. Related Projects ---------------- * [Cloud Haskell](http://haskell-distributed.github.com) * [Par Monad and Friends](https://github.com/simonmar/monad-par) References ---------- 1. Patrick Maier, Phil Trinder. [Implementing a High-level Distributed-Memory Parallel Haskell in Haskell](http://www.macs.hw.ac.uk/~pm175/papers/Maier_Trinder_IFL2011_XT.pdf). Proc. 2011 Symposium on Implementation and Application of Functional Languages (IFL 2011), Springer LNCS 7257, pp. 35-50. 2. Patrick Maier, Rob Stewart, Phil Trinder. [Reliable Scalable Symbolic Computation: The Design of SymGridPar2](http://www.macs.hw.ac.uk/~pm175/papers/Maier_Stewart_Trinder_SAC2013.pdf). Proc. 28th ACM Symposium On Applied Computing (SAC 2013), pp. 1677-1684. 3. [HdpH development repository](https://github.com/PatrickMaier/HdpH) on github.