I decided to do some micro-benchmarks.
The benchmark spawns threads that each lock/unlock a private mutex
and increment a counter.
The tests are:
- mutex_aligned, pthread_mutex lock/unlock, each mutex in a separate cache line
- mutex_non_aligned, same as above but the mutexes are packed together, hence sharing cache lines
- spin_aligned, home-made spinlock (x86 only), each spinlock in a separate cache line; the lock is an atomic operation and the unlock is a full barrier followed by a plain assignment
- spin_non_aligned, same as above but the spinlocks are packed together, hence sharing cache lines
- lock_xadd, atomic increment (on SPARC implemented using atomic.h, which uses CAS I think)
- xadd (x86 only), the non-SMP-safe (but IRQ-safe) add variant for x86
- gcc_sync_fetch_and_add, GCC intrinsic for atomic add
- add_mb, "normal" add (on a volatile variable) followed by a full barrier
- add, "normal" add (on a volatile variable)
- nop, a no-op, just to verify that thread start/stop does not noticeably affect the test outcome
Conclusions:
- atomic operations are very expensive
- false sharing is a true disaster
- it might be worth the effort to implement both spinlocks and atomic-inc for SPARC
Sorry for the lousy HTML formatting :(
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (1 socket, 4 cores)

mops vs threads:

op | ns/op | 1 | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---|---|---
mutex_align | 42 | 23 | 44 | 66 | 88 | 78 | 79 | 89 |
mutex_non_align | 42 | 23 | 7 | 10 | 8 | 19 | 18 | 22 |
spin_align | 16 | 60 | 121 | 182 | 234 | 184 | 196 | 212 |
spin_non_align | 16 | 60 | 16 | 24 | 32 | 39 | 40 | 57 |
lock_xadd | 8 | 117 | 235 | 352 | 470 | 357 | 352 | 411 |
xadd | 2 | 342 | 684 | 1026 | 1368 | 855 | 1026 | 1196 |
gcc_sync_fetch_and_add | 8 | 119 | 239 | 359 | 479 | 371 | 359 | 419 |
add_mb | 5 | 171 | 342 | 513 | 684 | 455 | 513 | 598 |
add | 2 | 398 | 797 | 1195 | 1594 | 996 | 1196 | 1394 |
nop | 0 | 6357142 | 2870967 | 2870967 | 2119047 | 1390625 | 898989 | 687258 |
2 x Intel(R) Xeon(R) CPU X5355 @ 2.66GHz (2 sockets, 4 cores each)

mops vs threads:

op | ns/op | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
mutex_align | 43 | 22 | 42 | 63 | 84 | 105 | 126 | 145 | 162 | 126 | 134 | 132 | 141 | 139 | 155 |
mutex_non_align | 43 | 22 | 10 | 15 | 18 | 18 | 22 | 29 | 27 | 27 | 33 | 36 | 44 | 49 | 57 |
spin_align | 17 | 56 | 112 | 166 | 210 | 275 | 292 | 273 | 260 | 318 | 270 | 345 | 312 | 354 | 346 |
spin_non_align | 17 | 56 | 17 | 36 | 38 | 51 | 49 | 50 | 55 | 72 | 91 | 93 | 123 | 87 | 120 |
lock_xadd | 10 | 98 | 195 | 289 | 377 | 467 | 513 | 504 | 504 | 442 | 420 | 525 | 512 | 582 | 605 |
xadd | 2 | 380 | 742 | 1060 | 1189 | 1490 | 1610 | 1829 | 1350 | 1687 | 1760 | 1954 | 1714 | 1425 | 2006 |
gcc_sync_fetch_and_add | 10 | 98 | 195 | 287 | 375 | 466 | 560 | 488 | 680 | 523 | 583 | 556 | 622 | 587 | 597 |
add_mb | 7 | 126 | 252 | 369 | 479 | 587 | 589 | 770 | 719 | 598 | 650 | 639 | 686 | 649 | 602 |
add | 2 | 443 | 861 | 974 | 1393 | 1775 | 1807 | 1904 | 1740 | 1986 | 2139 | 1679 | 1555 | 2307 | 2116 |
nop | 0 | 4114457 | 3283653 | 2355172 | 1366000 | 903439 | 803529 | 617540 | 640712 | 532761 | 466530 | 410950 | 382418 | 351699 | 320356 |
SUNW,T5240, 2 x SUNW,UltraSPARC-T2+ @ 1415MHz (2 sockets, 8 cores each, 8 threads/core)

mops vs threads:

op | ns/op | 1 | 9 | 17 | 25 | 33 | 41 | 49 | 57 | 65 | 73 | 81 | 89 | 97 | 105 | 113 | 121
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
mutex_align | 299 | 3 | 29 | 55 | 78 | 98 | 115 | 132 | 141 | 153 | 161 | 179 | 181 | 191 | 200 | 208 | 209 |
mutex_non_align | 299 | 3 | 7 | 13 | 22 | 29 | 37 | 44 | 51 | 57 | 63 | 68 | 73 | 78 | 83 | 88 | 92 |
lock_xadd | 70 | 14 | 125 | 232 | 326 | 408 | 472 | 506 | 538 | 536 | 520 | 512 | 503 | 499 | 492 | 487 | 477 |
add_mb | 34 | 28 | 258 | 469 | 637 | 759 | 853 | 909 | 937 | 947 | 926 | 915 | 897 | 881 | 893 | 892 | 870 |
add | 13 | 74 | 637 | 1020 | 1257 | 1398 | 1412 | 1356 | 1295 | 1222 | 1247 | 1273 | 1246 | 1265 | 1272 | 1286 | 1287 |
nop | 0 | 184331 | 46367 | 24849 | 17520 | 12893 | 10676 | 9051 | 7616 | 6697 | 6196 | 5492 | 4960 | 4069 | 3739 | 3486 | 3289 |
5 comments:
Why does it seem that spin_lock starts fast but gets a lot slower as you increase the thread count, whereas the mutexes do not slow down as much as the thread count goes up?
Also, do you think you could share the code you used for these tests?
Regards,
Ivan Novick
Hi,
1) I don't understand your comment about spinlocks/mutexes.
Both spinlocks and mutexes run/"scale" well in the "aligned" case (i.e. no false sharing) and scale really badly when false sharing occurs ("non_aligned").
Or?
2) sharing the code
Sure. I don't know where to put it though... can you host it?
/Jonas
I can put it on novickscode.com and attribute it to you; also, make sure to put your name and a copyright notice in the code if you want me to do that.
I would really like to see the code, so I can see exactly what you are talking about. I did a similar test, but not as thorough as yours, and I really want to understand this issue fully.
My gmail id is novickivan if you want to correspond offline.
Regards,
Ivan Novick
Created github repository:
http://github.com/jonasoreland/micro-benchmarks
thanks Jonas