i using following simple code solve n-queens problem:
#lang racket ; following returns true if queens on diagonals: (define (check-diagonals bd) (for/or ((r1 (length bd))) (for/or ((r2 (range (add1 r1) (length bd)))) (= (abs (- r1 r2)) (abs(- (list-ref bd r1) (list-ref bd r2))))))) ; set board size: (define n 8) ; start searching: (for ((brd (in-permutations (range n)))) (when (not (check-diagonals brd)) (displayln brd)))
it working fine takes long time larger values of n. uses in-permutations
function stream of permutations. see uses 25% of cpu power (1 of 4 cores being used). how can modify code uses parallel testing of permutations in-permutations stream , uses 4 cpu cores? help.
edit: modified check-diagonals
function suggested in comments. older code was:
(define (check-diagonals bd) (define res #f) (for ((r1 (length bd)) #:break res) (for ((r2 (range (add1 r1) (length bd))) #:break res) (when (= (abs (- r1 r2)) (abs(- (list-ref bd r1) (list-ref bd r2)))) (set! res #t)))) res)
to start, before parallelizing anything, can improve original program quite bit. here changes can make:
- use
for/or
instead of mutating binding , using#:break
. - use
for*/or
instead of nestingfor/or
loops inside of each other. - use
in-range
possible ensurefor
loops specialize simple loops upon expansion. - use square brackets instead of parentheses in relevant places make code more readable racketeers (especially since code isn’t remotely portable scheme, anyway).
with changes, code looks this:
#lang racket ; following returns true if queens on diagonals: (define (check-diagonals bd) (for*/or ([r1 (in-range (length bd))] [r2 (in-range (add1 r1) (length bd))]) (= (abs (- r1 r2)) (abs (- (list-ref bd r1) (list-ref bd r2)))))) ; set board size: (define n 8) ; start searching: (for ([brd (in-permutations (range n))]) (unless (check-diagonals brd) (displayln brd)))
now can turn parallelizing. here’s thing: parallelizing things tends tricky. scheduling work in parallel tends have overhead, , overhead can outweigh benefits more might think. spent long time trying parallelize code, , ultimately, not able produce program ran faster original, sequential program.
still, interested did. maybe (or else) able come better can. relevant tool job here racket’s futures, designed fine-grained parallelism. futures restrictive due way racket’s runtime designed (which way historical reasons), not can parallelized, , large number of operations can cause futures block. fortunately, racket ships future visualizer, graphical tool understanding runtime behavior of futures.
before can begin, ran sequential version of program n=11 , recorded results.
$ time racket n-queens.rkt [program output elided] 14.44 real 13.92 user 0.32 sys
i use these numbers point of comparison remainder of answer.
to start, can try replacing primary for
loop for/asnyc
, runs bodies in parallel. extremely simple transformation, , leaves following loop:
(for/async ([brd (in-permutations (range n))]) (unless (check-diagonals brd) (displayln brd)))
however, making change not improve performance @ all; in fact, reduces it. merely running version n=9 takes ~6.5 seconds; n=10, takes ~55.
this is, however, not surprising. running code futures visualizer (using n=7) indicates displayln
not legal within future, preventing parallel execution ever taking place. presumably, can fix creating futures compute results, print them serially:
(define results-futures (for/list ([brd (in-permutations (range n))]) (future (λ () (cons (check-diagonals brd) brd))))) (for ([result-future (in-list results-futures)]) (let ([result (touch result-future)]) (unless (car result) (displayln (cdr result)))))
with change, small speedup on naïve attempt for/async
, we’re still far slower sequential version. now, n=9, takes ~4.58 seconds, , n=10, takes ~44.
taking @ futures visualizer program (again, n=7), there no blocks, there syncs (on jit_on_demand , allocate memory). however, after time spent jitting, execution seems going, , runs lot of futures in parallel!
after little bit of that, however, parallel execution seems run out of steam, , things seem start running relatively sequentially again.
i wasn’t sure going on here, thought maybe because of futures small. seems possible overhead of scheduling thousands of futures (or, in case of n=10, millions) far outweighing actual runtime of work being done in futures. fortunately, seems solved grouping work chunks, that’s relatively doable using in-slice
:
(define results-futures (for/list ([brds (in-slice 10000 (in-permutations (range n)))]) (future (λ () (for/list ([brd (in-list brds)]) (cons (check-diagonals brd) brd)))))) (for* ([results-future (in-list results-futures)] [result (in-list (touch results-future))]) (unless (car result) (displayln (cdr result))))
it seems suspicion correct, because change helps whole lot. now, running parallel version of program takes ~3.9 seconds n=10, speedup of more factor of 10 on previous version using futures. however, unfortunately, that’s still slower sequential version, takes ~1.4 seconds. increasing n 11 makes parallel version take ~44 seconds, , if chunk size provided in-slice
increased 100,000, takes longer, ~55 seconds.
taking @ future visualizer version of program, n=11 , chunk size of 1,000,000, see there seem periods of extended parallelism, futures blocked lot on memory allocation:
this makes sense, since each future handling more work, means futures synchronizing constantly, leading significant performance overhead i’m seeing.
at point, i’m not sure there’s else know how tweak improve future performance. tried cutting down allocation using vectors instead of lists , specialized fixnum operations, whatever reason, seemed destroy parallelism. thought maybe because futures never starting in parallel, replaced future would-be-future, results confusing me, , didn’t understand meant.
my conclusion racket’s futures fragile work problem, simple may be. give up.
now, little bonus, decided try , emulate same thing in haskell, since haskell has particularly robust story fine-grained parallel evaluation. if couldn’t performance boost in haskell, wouldn’t expect able 1 in racket, either.
i’ll skip details different things tried, eventually, ended following program:
import data.list (permutations) import data.maybe (catmaybes) checkdiagonals :: [int] -> bool checkdiagonals bd = or $ flip map [0 .. length bd - 1] $ \r1 -> or $ flip map [r1 + 1 .. length bd - 1] $ \r2 -> abs (r1 - r2) == abs ((bd !! r1) - (bd !! r2)) n :: int n = 11 main :: io () main = let results = flip map (permutations [0 .. n-1]) $ \brd -> if checkdiagonals brd nothing else brd in mapm_ print (catmaybes results)
i able add parallelism using control.parallel.strategies
library. added line main function introduced parallel evaluation:
import control.parallel.strategies import data.list.split (chunksof) main :: io () main = let results = concat . withstrategy (parbuffer 10 rdeepseq) . chunksof 100 $ flip map (permutations [0 .. n-1]) $ \brd -> if checkdiagonals brd nothing else brd in mapm_ print (catmaybes results)
it took time figure out right chunk , rolling buffer sizes, these values gave me consistent 30-40% speedup on original, sequential program.
now, obviously, haskell’s runtime considerably more suited parallel programming racket’s, comparison hardly fair. helped me see myself that, despite having 4 cores (8 hyperthreading) @ disposal, wasn’t able 50% speedup. keep in mind.
as matthias noted in mailing list thread wrote on subject:
a word of caution on parallelism in general. not long ago, in cs @ ivy league school studied use of parallelism across different uses , applications. goal find out how ‘hype’ parallelism affected people. recollection found close 20 projects professors (in ce, ee, cs, bio, econ, etc) told grad students/phds use parallelism run programs faster. checked of them , n - 1 or 2, projects ran faster once parallelism removed. faster.
people routinely underestimate communication cost parallelism introduces.
be careful not make same mistake.
Comments
Post a Comment