Thursday, April 23, 2009

distributed pushed-down join

just managed to run the first join inside the data-nodes.
the query corresponds to

SELECT t1.*, t2.*
FROM T1 t1
LEFT OUTER JOIN T1 t2 on t1.pk = t2.pk
WHERE t1.pk = avalue

- the code inside the data-nodes is general...but incomplete (e.g. it leaks memory and doesn't handle errors correctly...)
- there is no ndbapi and no SQL; the test program is a hard-coded C program sending messages using the ndb-cluster wire-protocol
- the result-set is not managed (the C program just hangs)

Summary:
- there are *many* things left to do; when this will actually hit a release is very uncertain (or if it ever will...)
- this is the coolest thing i have implemented in a long, long time

Details:
- the code is written so that the query and the parameters are sent separately, so that in the future the queries could be stored permanently in the data nodes (see the sketch after this list)
- the query is:
SELECT t1.?1, t2.?2
FROM T1 t1
LEFT OUTER JOIN T1 t2 on t1.pk = t2.pk
WHERE t1.pk = ?3
i.e. both values and projection are parameterized

- the serialized form of this request is 44 bytes of query and 32 bytes of parameters (which matches the 19-word SECTION 1 in the api signal log below: 11 words of query followed by 8 words of parameters)
- i could do an N-way (key-lookup) join with arbitrary tables and join-conditions using the code inside the data-node
- the code *only* supports left-outer-joins and only lookups
- the next step is doing scan+lookup
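
to make the query/parameter split concrete, here is a tiny C sketch of the idea - this is *not* the real SPJ serialization format (the opcodes and layout are invented for illustration) - where the query definition and the parameter values are built as two independent word-streams:

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t words[32]; uint32_t len; } WordStream;

static void put(WordStream *s, uint32_t w) { s->words[s->len++] = w; }

enum { OP_LOOKUP = 1 };                   /* hypothetical opcode */

int main(void)
{
  WordStream query  = { {0}, 0 };
  WordStream params = { {0}, 0 };

  /* query definition: two linked lookup nodes on table 4 (T1),
     the second keyed from its parent (t1.pk = t2.pk) */
  put(&query, OP_LOOKUP); put(&query, 4);  /* node 0: lookup on T1 */
  put(&query, OP_LOOKUP); put(&query, 4);  /* node 1: lookup on T1 */
  put(&query, 0);                          /* node 1's parent = node 0 */

  /* parameters: the key value (?3) plus one projection word per
     node (?1, ?2) - cf. the H'fff00005 words in the logs below */
  put(&params, 37);                        /* avalue */
  put(&params, 0xfff00005);                /* projection for node 0 */
  put(&params, 0xfff00005);                /* projection for node 1 */

  /* the query stream could be stored once in the data nodes and
     only the (much smaller) parameter stream sent per execution */
  printf("query: %u words, params: %u words\n", query.len, params.len);
  return 0;
}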

Hard-core details:
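(to orient yourself in the logs: the first one is from data node 1, where the new DBSPJ block builds the two linked lookups, registers the second as a child of the first, and fires one LQHKEYREQ per lookup; the second is the api-side signal log, where a single TCKEYREQ with the spj flag carries the serialized query+parameters and is answered by TCKEYCONF plus the TRANSID_AI result rows)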

sh> cat mysql-cluster/ndb_1.out.log
DBSPJ: ::build()
DBSPJ: - loop 0 pos: 1
DBSPJ: getOpInfo(1)
DBSPJ: createNode - seize -> ptrI: 57344
DBSPJ: lookup_build: len=5
DBSPJ: attrCnt: 1 -> len: 0
DBSPJ: param len: 4
DBSPJ: attrCnt: 1
DBSPJ: - loop 1 pos: 6
DBSPJ: getOpInfo(1)
DBSPJ: createNode - seize -> ptrI: 57376
DBSPJ: lookup_build: len=5
DBSPJ: attrCnt: 1 -> len: 0
DBSPJ: added 57376 as child of 57344
DBSPJ: param len: 4
DBSPJ: attrCnt: 1
LQHKEYREQ to 6f70002
ClientPtr = H'0000e000 hashValue = H'87c3aa01 tcBlockRef = H'01080002
transId1 = H'00000000 transId2 = H'00000301 savePointId = H'00000000
Op: 0 Lock: 0 Flags: Simple Dirty NoDisk ScanInfo/noFiredTriggers: H'0
AttrLen: 0 (0 in this) KeyLen: 0 TableId: 4 SchemaVer: 1
FragId: 0 ReplicaNo: 0 LastReplica: 0 NextNodeId: 0
ApiRef: H'80000003 ApiOpRef: H'00000008
AttrInfo:
KEYINFO: ptr.i = 7(0xac3ee800) ptr.sz = 1(1)
H'0x00000000
ATTRINFO: ptr.i = 8(0xac3ee900) ptr.sz = 5(5)
H'0x00000000 H'0xffee0000 H'0x01080002 H'0x0000e000 H'0xfff00005
DBSPJ: execTRANSID_AI
execTRANSID_AI: ptr.i = 3(0xac3ee400) ptr.sz = 2(2)
H'0x00000004 H'0x00000000
LQHKEYREQ to 6f70002
ClientPtr = H'0000e020 hashValue = H'87c3aa01 tcBlockRef = H'01080002
transId1 = H'00000000 transId2 = H'00000301 savePointId = H'00000000
Op: 0 Lock: 0 Flags: Simple Dirty ScanInfo/noFiredTriggers: H'0
AttrLen: 0 (0 in this) KeyLen: 0 TableId: 4 SchemaVer: 1
FragId: 0 ReplicaNo: 0 LastReplica: 0 NextNodeId: 0
ApiRef: H'80000003 ApiOpRef: H'00000008
AttrInfo:
KEYINFO: ptr.i = 8(0xac3ee900) ptr.sz = 1(1)
H'0x00000000
ATTRINFO: ptr.i = 9(0xac3eea00) ptr.sz = 1(1)
H'0xfff00005

sh> cat api-signal-log.txt
---- Send ----- Signal ----------------
r.bn: 245 "DBTC", r.proc: 2, gsn: 12 "TCKEYREQ" prio: 1
s.bn: 32768 "API", s.proc: 3, s.sigId: 0 length: 8 trace: 1 #sec: 2 fragInf: 0
apiConnectPtr: H'00000020, apiOperationPtr: H'00000008
Operation: Read, Flags: Dirty Start Execute NoDisk IgnoreError Simple spj
keyLen: 0, attrLen: 0, AI in this: 0, tableId: 4, tableSchemaVer: 1, API Ver: 5
transId(1, 2): (H'00000000, H'00000300)
-- Variable Data --
SECTION 0 type=generic size=1
H'00000000
SECTION 1 type=generic size=19
H'000b0002 H'00050001 H'00000000 H'00000004 H'00000001 H'00000001 H'00050001
H'00000001 H'00000004 H'00000001 H'00000001 H'00040001 H'00000000 H'00000008
H'fff00005 H'00040001 H'00000000 H'00000008 H'fff00005
---- Received - Signal ----------------
r.bn: 2047 "API", r.proc: 3, r.sigId: -1 gsn: 10 "TCKEYCONF" prio: 1
s.bn: 245 "DBTC", s.proc: 2, s.sigId: 241503 length: 9 trace: 1 #sec: 0 fragInf: 0
H'80000005 H'00000000 H'00000000 H'00000001 H'00000000 H'00000300 H'00000008
H'80000002 H'00000000
---- Received - Signal ----------------
r.bn: 2047 "API", r.proc: 3, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 241503 length: 22 trace: 1 #sec: 0 fragInf: 0
H'80000007 H'00000008 H'00000000 H'00000301 H'fff30004 H'0000001f H'00000000
H'00000005 H'ef76733c H'deece672 H'ffffffff H'80000007 H'00000008 H'00000000
H'00000301 H'fff30004 H'0000001f H'00000000 H'00000005 H'ef76733c H'deece672
H'ffffffff
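
(note that the TRANSID_AI payload contains the same 11-word row twice - one copy per query node, as you would expect from a self-join on the same pk)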

Sunday, April 12, 2009

mutex micro benchmark

when working on mutex contention for ndbapi (mysql-cluster)
i decided to do some micro benchmarks.

The benchmark runs threads that each lock/unlock a private mutex
and increment a counter.

The tests are:
  • mutex_aligned, pthread_mutex lock/unlock, each mutex in a separate cache-line
  • mutex_non_aligned, same as above but the mutexes are packed together, hence sharing cache-lines
  • spin_aligned, home-made spinlock (x86 only), each spinlock in a separate cache-line; the spinlock is an atomic operation for lock and a full-barrier+assign for unlock (see the sketch after this list)
  • spin_non_aligned, same as above but the spinlocks are packed together, hence sharing cache-lines
  • lock_xadd, atomic-inc (on sparc implemented using atomic.h, which i think uses cas)
  • xadd (x86 only), the non-smp (but irq) safe add variant for x86
  • gcc_sync_fetch_and_add, the gcc intrinsic for atomic add
  • add_mb, "normal" add (on a volatile variable) followed by a full barrier
  • add, "normal" add (on a volatile variable)
  • nop, a nop, just to check that thread start/stop does not affect the test outcome noticeably
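
for reference, a minimal sketch of the spin_aligned test, assuming gcc: the __sync builtins stand in for the original hand-rolled x86 asm (__sync_lock_test_and_set compiles to an atomic xchg on x86), and the loop count, array sizes and padding layout are made up for the sketch; removing the alignment/padding reproduces the *_non_aligned (false-sharing) variants:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define CACHE_LINE 64
#define LOOPS 10000000UL

/* one lock+counter per thread, padded to a full cache line */
struct slot {
  volatile int lock;
  unsigned long count;
  char pad[CACHE_LINE - sizeof(int) - sizeof(unsigned long)];
} __attribute__((aligned(CACHE_LINE)));

static struct slot slots[64];

static void spin_lock(volatile int *l)
{
  while (__sync_lock_test_and_set(l, 1))  /* atomic xchg */
    ;
}

static void spin_unlock(volatile int *l)
{
  __sync_synchronize();                   /* full barrier */
  *l = 0;                                 /* plain assign */
}

static void *worker(void *arg)
{
  struct slot *s = arg;
  unsigned long i;
  for (i = 0; i < LOOPS; i++) {           /* lock, increment, unlock */
    spin_lock(&s->lock);
    s->count++;
    spin_unlock(&s->lock);
  }
  return 0;
}

int main(int argc, char **argv)
{
  int nthreads = argc > 1 ? atoi(argv[1]) : 1;
  pthread_t tid[64];
  struct timeval start, stop;
  int i;

  gettimeofday(&start, 0);
  for (i = 0; i < nthreads; i++)
    pthread_create(&tid[i], 0, worker, &slots[i]);
  for (i = 0; i < nthreads; i++)
    pthread_join(tid[i], 0);
  gettimeofday(&stop, 0);

  double sec = (stop.tv_sec - start.tv_sec) +
               (stop.tv_usec - start.tv_usec) / 1e6;
  printf("%d threads: %.0f mops\n", nthreads,
         nthreads * (LOOPS / 1e6) / sec);
  return 0;
}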

The conclusions are:
  • atomic operations are very expensive
  • false sharing is a true disaster
  • it might be worth the effort to implement both spinlocks and atomic-inc for sparc


Sorry for lousy html formatting :(
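(how to read the tables: ns/op is the measured single-thread cost per operation, and the numbered columns are aggregate throughput in mops - million operations per second - at that thread count; e.g. 42 ns/op for pthread_mutex works out to the ~23 mops shown for one thread)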
Intel(R) Core(TM)2 Quad CPU Q6600@2.40GHz (1-socket 4-cores)
mops vs threads

op                     ns/op        1        2        3        4        5       6       7
mutex_align               42       23       44       66       88       78      79      89
mutex_non_align           42       23        7       10        8       19      18      22
spin_align                16       60      121      182      234      184     196     212
spin_non_align            16       60       16       24       32       39      40      57
lock_xadd                  8      117      235      352      470      357     352     411
xadd                       2      342      684     1026     1368      855    1026    1196
gcc_sync_fetch_and_add     8      119      239      359      479      371     359     419
add_mb                     5      171      342      513      684      455     513     598
add                        2      398      797     1195     1594      996    1196    1394
nop                        0  6357142  2870967  2870967  2119047  1390625  898989  687258
2 x Intel(R) Xeon(R) CPU X5355 @2.66GHz (2-socket 4 cores each)
mops vs threads

op                     ns/op        1        2        3        4       5       6       7       8       9      10      11      12      13      14
mutex_align               43       22       42       63       84     105     126     145     162     126     134     132     141     139     155
mutex_non_align           43       22       10       15       18      18      22      29      27      27      33      36      44      49      57
spin_align                17       56      112      166      210     275     292     273     260     318     270     345     312     354     346
spin_non_align            17       56       17       36       38      51      49      50      55      72      91      93     123      87     120
lock_xadd                 10       98      195      289      377     467     513     504     504     442     420     525     512     582     605
xadd                       2      380      742     1060     1189    1490    1610    1829    1350    1687    1760    1954    1714    1425    2006
gcc_sync_fetch_and_add    10       98      195      287      375     466     560     488     680     523     583     556     622     587     597
add_mb                     7      126      252      369      479     587     589     770     719     598     650     639     686     649     602
add                        2      443      861      974     1393    1775    1807    1904    1740    1986    2139    1679    1555    2307    2116
nop                        0  4114457  3283653  2355172  1366000  903439  803529  617540  640712  532761  466530  410950  382418  351699  320356
SUNW,T5240, 2*(HT-64) 1415MHz SUNW,UltraSPARC-T2+ (2-socket 8-cores each, 8 threads/core)
mops vs threads

op               ns/op    1    9   17   25   33   41   49   57   65   73   81   89   97  105  113  121
mutex_align        299    3   29   55   78   98  115  132  141  153  161  179  181  191  200  208  209
mutex_non_align    299    3    7   13   22   29   37   44   51   57   63   68   73   78   83   88   92
lock_xadd           70   14  125  232  326  408  472  506  538  536  520  512  503  499  492  487  477
add_mb              34   28  258  469  637  759  853  909  937  947  926  915  897  881  893  892  870
add                 13   74  637 1020 1257 1398 1412 1356 1295 1222 1247 1273 1246 1265 1272 1286 1287
nop                  0   18433146367248491752012893106769051761666976196549249604069373934863289

Thursday, April 9, 2009

bugs and more bugs

is what i have worked on so far this year...
but 7.0 is soon to be released...and
things are starting to look reasonable