SpikeFun v0.65 contains memory and work-scheduling optimizations for NUMA architectures, which are becoming the de facto standard in the x86 server and workstation markets.
NUMA systems (such as the Intel Xeon E5-2600, previously known as the "Romley" platform) consist of separate nodes of CPUs and memory. This effectively means that memory access is not uniform: latency and bandwidth depend on whether the addressed memory is in the local or a remote NUMA node. Memory-intensive applications (such as biological neural network simulations) that ignore the NUMA layout run slower than optimal due to memory-access penalties.
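The effect of non-uniform access can be illustrated with a toy model of average memory latency. This is a minimal sketch, not a description of how SpikeFun measures or models anything: the function name `average_latency_ns` and both latency constants are hypothetical and chosen only for illustration.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Mean latency when `local_fraction` of accesses hit the local NUMA node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies for illustration only (not measured values).
LOCAL_NS = 100.0
REMOTE_NS = 150.0

# NUMA-unaware: data ends up spread evenly across two nodes.
naive = average_latency_ns(0.5, LOCAL_NS, REMOTE_NS)

# NUMA-aware: most of a thread's data is kept on its own node.
aware = average_latency_ns(0.9, LOCAL_NS, REMOTE_NS)

print(f"NUMA-unaware: {naive:.1f} ns, NUMA-aware: {aware:.1f} ns")
```

Under these assumed numbers, raising the share of local accesses lowers the average latency. The real penalty depends on the hardware and on bandwidth contention, which this model ignores.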
To ensure optimal performance on NUMA systems, SpikeFun 0.65 employs several optimization strategies:
Results on the reference platform:
For performance comparison, we used a dual-CPU Intel(R) Xeon E5-2600 system:
The system architecture can be described with the following diagram:
The following table shows the performance difference between SpikeFun 0.64 and 0.65 on the NUMA platform:
Test description | CPUs | NUMA mode enabled in BIOS | Simulation performance |
---|---|---|---|
SpikeFun 0.64 (ref) | 1 (8 cores total) | N/A | 0.245x Real Time |
SpikeFun 0.64 | 2 (16 cores total) | NO | 0.321x Real Time |
SpikeFun 0.64 | 2 (16 cores total) | YES | 0.276x Real Time |
SpikeFun 0.65 | 2 (16 cores total) | YES | 0.409x Real Time |
SpikeFun 0.67 | 2 (16 cores total) | YES | 0.510x Real Time |
As can be seen, SpikeFun 0.65 (and, further, SpikeFun 0.67) delivers a significant simulation speed improvement when running on a NUMA system. Another interesting observation is that the NUMA-unoptimized SpikeFun 0.64 runs slower when the NUMA partitioning is known to the Windows scheduler (0.276x real time) than when the NUMA hardware layout is not published by the firmware (BIOS/EFI) (0.321x real time). However, when memory allocation and thread management are done in accordance with the NUMA system layout, performance starts to approach the theoretical 2x speedup that could be gained from doubling the number of cores in the system.
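To make the comparison with the theoretical 2x figure concrete, the speedups can be computed directly from the table, using the single-CPU SpikeFun 0.64 run as the baseline. The sketch below uses only the figures reported above; the names `REFERENCE`, `results`, and `speedup` are introduced here for illustration.

```python
# Single-CPU baseline from the table: SpikeFun 0.64, 1 CPU (8 cores).
REFERENCE = 0.245  # fraction of real time

# Dual-CPU (16 cores) results from the table, as (label, fraction of real time).
results = [
    ("SpikeFun 0.64, NUMA mode off", 0.321),
    ("SpikeFun 0.64, NUMA mode on", 0.276),
    ("SpikeFun 0.65, NUMA mode on", 0.409),
    ("SpikeFun 0.67, NUMA mode on", 0.510),
]


def speedup(performance, reference=REFERENCE):
    """Speedup of a run relative to the single-CPU reference run."""
    return performance / reference


for label, performance in results:
    print(f"{label}: {speedup(performance):.2f}x")
# SpikeFun 0.64, NUMA mode off: 1.31x
# SpikeFun 0.64, NUMA mode on: 1.13x
# SpikeFun 0.65, NUMA mode on: 1.67x
# SpikeFun 0.67, NUMA mode on: 2.08x
```

Because the baseline is a SpikeFun 0.64 run, the ratios for 0.65 and 0.67 compare different software versions as well as different core counts, so they are not a measure of hardware scaling alone.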