## Paths to Fast Barrier Synchronization on the Node

Conor Hetland, Georgios Tziantzioulis, Brian Suchy, Michael Leonard, Jin Han, John Albers, Nikos Hardavellas, and Peter Dinda

- Software barrier latency is on the order of tens of thousands of cycles
- This is really slow
- A barrier is functionally the logic of an AND gate plus communication; it should take on the order of hundreds of cycles to see if all threads have arrived at the barrier

Each FPGA cycle, the FPGA adds all reported barrier arrivals to an internal register. When this register holds the number of threads allocated to the barrier, it is reset, and all polled cache lines are written to signal the waiting threads to continue. The hardware can then be reused. The system alternates between two copies of this logic.

## **Barriers Matter**

[Figure: NESL (VCODE) Interpreter. Data labels as extracted: dissemination 18451913, pthread 8229673, pool 8379959, ticket 8175458, counting (value missing), tournament 268371, ideal 5045.]

[Figure: Streamcluster. Data labels as extracted: pthread 1448.78, ticket 97.06, pool 95.17, counting 11.74.]

Runtimes and applications with fine-grained parallelism rely heavily on barrier performance
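
The FPGA counting logic described above (accumulate arrivals each cycle, release and reset when the count reaches the thread total, alternate between two copies) can be sketched as a small cycle-level software model. This is only an illustration under stated assumptions, not the authors' hardware design: the names `BarrierUnit`, `AlternatingBarrier`, and `release_flags` are hypothetical, and a per-thread integer flag stands in for the cache line each thread polls.

```python
from dataclasses import dataclass


@dataclass
class BarrierUnit:
    """One copy of the counting logic (hypothetical software model)."""

    num_threads: int
    count: int = 0  # models the internal arrival register

    def cycle(self, arrivals: int) -> bool:
        """Advance one cycle; return True if the barrier released this cycle."""
        self.count += arrivals  # add all arrivals reported this cycle
        if self.count == self.num_threads:
            self.count = 0  # reset so this copy can be reused
            return True
        return False


class AlternatingBarrier:
    """Two BarrierUnit copies used in alternation, one per barrier episode."""

    def __init__(self, num_threads: int) -> None:
        self.units = [BarrierUnit(num_threads), BarrierUnit(num_threads)]
        self.active = 0  # index of the copy serving the current episode
        # Assumption: one flag per thread stands in for its polled cache line.
        self.release_flags = [0] * num_threads

    def cycle(self, arrivals: int) -> bool:
        released = self.units[self.active].cycle(arrivals)
        if released:
            # Model "write all polled cache lines" by bumping every flag.
            self.release_flags = [flag + 1 for flag in self.release_flags]
            self.active ^= 1  # switch to the other copy for the next episode
        return released


if __name__ == "__main__":
    barrier = AlternatingBarrier(num_threads=4)
    # Arrivals reported per cycle: 1, 0, 2, 1 completes the first episode;
    # all 4 threads arriving in a single cycle completes the second.
    print([barrier.cycle(a) for a in (1, 0, 2, 1, 4)])
    # prints [False, False, False, True, True]
```

In this model a waiting thread would poll its own entry in `release_flags` and continue once the value changes. The model ignores how arrivals reach the FPGA and how long the cache-line writes take, since the text above does not specify either.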