jonas blog: 2009

Thursday, October 22, 2009

Dbspj + mysqld, first trials


mysql> create table T1 (a int, b int, primary key(a)) engine = ndb;
mysql> select x.a,x.b,z.a,z.b from T1 x, T1 y, T1 z where x.b=y.a and y.b=z.a;
+----+------+----+------+
| a  | b    | a  | b    |
+----+------+----+------+
| 31 |   47 | 63 |   31 |
| 63 |   31 | 47 |   63 |
| 47 |   63 | 31 |   47 |
|  5 |   47 | 63 |   31 |
+----+------+----+------+
4 rows in set (0.01 sec)

- entire query is executed with one request to data nodes
- code is only a hack (that bluntly examines mysqld's internal structures).
- list of limitations is so long that i can't write it here due to bandwidth restrictions
But still super cool!

Quote from Jan, that implemented experiment: "You can write queries that return correct result"

Wednesday, October 14, 2009

Dbspj preliminary numbers

So some 5 month later...
- Dbspj has an ndbapi
- Dbspj works enough for simple benchmarks!

Reminder, what is Dbspj:
- It's a new feature for Ndb
- It gives the possibility to push-down linked operations (e.g in SQL terminology: joins)
- It currently only supports left-outer-joins, and only some kinds of joins
- It is currently *not* in anyway integrated with mysqld (for accelerating SQL access)

Anyway so here is the benchmark setup
2 computers
- ndbapi running on one
- 2 datanodes running on other

On images below:
- red is new code, blue is corresponding "current" code
- Y-axis is run-time, so lower is better
- X-axis is "depth", i.e no of tables joined

Note: this is debug-compiled, so the actually absolute numbers are
not that interesting...rather the comparison...

Query 1:
depth 1: select * from T t1, T t2 where T1.pk = constant and T2.pk = T1.pk
depth 2: select * from T t1, T t2, T t3 where T1.pk = constant and T2.pk = T1.pk and T3.pk = T2.pk
etc...

Query 2:
depth 1: select * from T t1, T t2 where T2.pk = T1.pk
depth 2: select * from T t1, T t2, T t3 where T2.pk = T1.pk and T3.pk = T2.pk
etc...

Wednesday, May 6, 2009

ArenaAllocator complete (Dbspj)

and i created a new DataBuffer2, which has pool-type as template argument.
it was "impossible" to alter DataBuffer to do this...it was simply to hard-wired to ArrayPool. too bad.

---

so now everything in Dbspj can/will be using the ArenaAllocator...
rationale for ArenaAllocator is really not to have a "quick" release (the code still releases each object individually...to reset magic values) but that it's a kind of variable length allocation strategy that provide great locality of data...

---

also pushed infrastructure for "result-set correlation".
Dbspj will (at least for now) when joining only return rows once so in the case of a 2-way join, a SQL result set will return the rows from the first table one time *for each* match in second table.
Dbspj will not...(but ha_ndbcluster will have to do this...when presenting things for mysqld) and since it will not, there has to be a way to correlate rows from the 2 tables.
this we call "result-set correlation".

Saturday, May 2, 2009

memory allocation in ndb(mt)d

inside the data nodes a number of different techniques for allocating memory is used.
there are 2 main variants
- page allocations
- object allocations (e.i slab allocator)

and for the object allocations there are a number of different variants
- RWPool, "ordinary" slab, only objects of same type on 1 page, free-list per page and object
- WOPool, allocations which are short-lived, no free-list just ref-count
- ArrayPool, fixed array of packed object

and I'm just now introducing a ArenaAllocator (which will be used for spj)

---

- RWPool, WOPool and ArenaAllocator/Pool are "new", meaning that they are build from start with support for more dynamic memory handling in mind.
- ArrayPool is very fixed, and exists both in the structured form, and various hard-coded variants.

---

one thing in common for all the "new" allocators is that they use magic numbers per object, so that every time a pointer (once per signal) that validity of the object is checked, and a trap is generated on a invalid memory access.

another thing in common is that they are all used by 1 thread at a time, and only pages are allocated/freed to a global free-list.

---

- to support variable length data, linked list of fixed size objects are used (DataBuffer)
- for input data (from e.g sockets) a special DataBuffer with additional thread-safety code is used so that data can be passed between threads wo/ copying.

---

but there is no malloc/free new/delete :-)

Friday, May 1, 2009

distributed pushed-down join - part II

now also support table/index scan as root node, i.e tree = [ scan ] ? [ lookup ] *
still with the limitations:
- a child may only be dependent on immediate parent
- still only left outer join.

i also completed the parameterized scan-filter inside tup but does not yet support this in spj or ndbapi.

currently working on
- the parameterized scan-filer in spj (and a tiny bit in the ndbapi).
- result set correlation

currently thinking on
- arena based allocator for spj-requests, (compare mysqld mem-root concept)

discussed with frazer what spj "really is", concluded that it probably is some kind of "data-flow-engine"...should probably find some hip acronym/term for it...

note:
current mental capacity is limiting spj to only one scan in a tree.
this limitation can of course be lifted...but not now, or my brain will fry.

---

wonder how i should make people giving me votes on planetmysql??

Thursday, April 23, 2009

distributed pushed-down join

just managed to run the first join inside the data-nodes.
the query correspondce to

SELECT t1.*, t2.*
FROM T1 t1
LEFT OUTER JOIN T1 t2 on t1.pk = t2.pk
WHERE t1.pk = avalue

- the code inside the data-nodes is general...but incomplete (e.g leaks memory, doesnt handle error correctly...)
- there is no ndbapi and no SQL,the program testing is a hard-coded c-program sending messages using the ndb-cluster wire-protocol
- the result-set is not managed (the c-program just hangs)

Summary:
- there is *many* things left to do, when this will actually hit a release is very uncertain (or if it ever will...)
- this is the coolest thing i implemented in a long long time

Details:
- the code is written so that the query and the parameters are sent separately so that we in the future could have the queries permanently stored in the data nodes.
- the query is:
SELECT t1.?1, t2.?2
FROM T1 t1
LEFT OUTER JOIN T1 t2 on t1.pk = t2.pk
WHERE t1.pk = ?3
i.e both values and projection is parameterized

- the serialized form of this request is 44bytes query and 32 byte parameters
- i could do a N-way (key-lookup) join with aribitrary tables and join-conditions using the code inside the data-node
- code *only* supports left-outer-joins and only lookups
- next step is doing scan+lookup

Hard-core details:


sh> cat mysql-cluster/ndb_1.out.log
DBSPJ: ::build()
DBSPJ:  - loop 0 pos: 1
DBSPJ: getOpInfo(1)
DBSPJ: createNode - seize -> ptrI: 57344
DBSPJ: lookup_build: len=5
DBSPJ: attrCnt: 1 -> len: 0
DBSPJ: param len: 4
DBSPJ: attrCnt: 1
DBSPJ:  - loop 1 pos: 6
DBSPJ: getOpInfo(1)
DBSPJ: createNode - seize -> ptrI: 57376
DBSPJ: lookup_build: len=5
DBSPJ: attrCnt: 1 -> len: 0
DBSPJ: added 57376 as child of 57344
DBSPJ: param len: 4
DBSPJ: attrCnt: 1
LQHKEYREQ to 6f70002
ClientPtr = H'0000e000 hashValue = H'87c3aa01 tcBlockRef = H'01080002
transId1 = H'00000000 transId2 = H'00000301 savePointId = H'00000000
Op: 0 Lock: 0 Flags: Simple Dirty NoDisk ScanInfo/noFiredTriggers: H'0
AttrLen: 0 (0 in this) KeyLen: 0 TableId: 4 SchemaVer: 1
FragId: 0 ReplicaNo: 0 LastReplica: 0 NextNodeId: 0
ApiRef: H'80000003 ApiOpRef: H'00000008
AttrInfo:
KEYINFO: ptr.i = 7(0xac3ee800) ptr.sz = 1(1)
H'0x00000000
ATTRINFO: ptr.i = 8(0xac3ee900) ptr.sz = 5(5)
H'0x00000000 H'0xffee0000 H'0x01080002 H'0x0000e000 H'0xfff00005
DBSPJ: execTRANSID_AI
execTRANSID_AI: ptr.i = 3(0xac3ee400) ptr.sz = 2(2)
H'0x00000004 H'0x00000000
LQHKEYREQ to 6f70002
ClientPtr = H'0000e020 hashValue = H'87c3aa01 tcBlockRef = H'01080002
transId1 = H'00000000 transId2 = H'00000301 savePointId = H'00000000
Op: 0 Lock: 0 Flags: Simple Dirty ScanInfo/noFiredTriggers: H'0
AttrLen: 0 (0 in this) KeyLen: 0 TableId: 4 SchemaVer: 1
FragId: 0 ReplicaNo: 0 LastReplica: 0 NextNodeId: 0
ApiRef: H'80000003 ApiOpRef: H'00000008
AttrInfo:
KEYINFO: ptr.i = 8(0xac3ee900) ptr.sz = 1(1)
H'0x00000000
ATTRINFO: ptr.i = 9(0xac3eea00) ptr.sz = 1(1)
H'0xfff00005

sh> cat api-signal-log.txt
---- Send ----- Signal ----------------
r.bn: 245 "DBTC", r.proc: 2, gsn: 12 "TCKEYREQ" prio: 1
s.bn: 32768 "API", s.proc: 3, s.sigId: 0 length: 8 trace: 1 #sec: 2 fragInf: 0
apiConnectPtr: H'00000020, apiOperationPtr: H'00000008
Operation: Read, Flags: Dirty Start Execute NoDisk IgnoreError Simple  spj
keyLen: 0, attrLen: 0, AI in this: 0, tableId: 4, tableSchemaVer: 1, API Ver: 5
transId(1, 2): (H'00000000, H'00000300)
-- Variable Data --
SECTION 0 type=generic size=1
H'00000000
SECTION 1 type=generic size=19
H'000b0002 H'00050001 H'00000000 H'00000004 H'00000001 H'00000001 H'00050001
H'00000001 H'00000004 H'00000001 H'00000001 H'00040001 H'00000000 H'00000008
H'fff00005 H'00040001 H'00000000 H'00000008 H'fff00005
---- Received - Signal ----------------
r.bn: 2047 "API", r.proc: 3, r.sigId: -1 gsn: 10 "TCKEYCONF" prio: 1
s.bn: 245 "DBTC", s.proc: 2, s.sigId: 241503 length: 9 trace: 1 #sec: 0 fragInf:
0
H'80000005 H'00000000 H'00000000 H'00000001 H'00000000 H'00000300 H'00000008
H'80000002 H'00000000
---- Received - Signal ----------------
r.bn: 2047 "API", r.proc: 3, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 241503 length: 22 trace: 1 #sec: 0 fragIn
f: 0
H'80000007 H'00000008 H'00000000 H'00000301 H'fff30004 H'0000001f H'00000000
H'00000005 H'ef76733c H'deece672 H'ffffffff H'80000007 H'00000008 H'00000000
H'00000301 H'fff30004 H'0000001f H'00000000 H'00000005 H'ef76733c H'deece672
H'ffffffff

Sunday, April 12, 2009

mutex micro benchmark

when working on mutex contention for ndbapi (mysql-cluster)
i decided to do some micro benchmarks.

The benchmark is threads that locks/unlocks an private mutex
and increment a counter.

The tests are:

mutex_aligned, pthread_mutex, lock/unlock, each mutex in separate cache-line
mutex_non_aligned, same as above but mutexes are packed together hence sharing cache-lines
spin_aligned, home-made spinlock (only x86), each spinlock in separate cache-line, the spinlock is an atomic operation for lock, and a full-barrier+assign for unlock
spin_non_aligned, same as above but spinlocks are packed together hence sharing cache-lines
lock_xadd, atomic-inc, (on sparc impl. using atomic.h, which uses cas i think)
xadd, (only x86), the non-smp (but irq) safe add variant for x86
gcc_sync_fetch_and_add, gcc intrinsic for atomic add
add_mb, "normal" add (on volatile variable) followed by a full-barrier
add, "normal" add (on volatile variable)
nop, a nop, just to see that thread start/stop does not affect test outcome noticable

The conclusions are:

atomic operations are very expensive
false sharing is a true disaster
it might be worth the effort to impl. both spinlocks and atomic-inc for sparc

Sorry for lousy html formatting :(

Intel(R) Core(TM)2 Quad CPU Q6600@2.40GHz (1-socket 4-cores)
		mops vs threads
op	ns/op	1	2	3	4	5	6	7
mutex_align	42	23	44	66	88	78	79	89
mutex_non_align	42	23	7	10	8	19	18	22
spin_align	16	60	121	182	234	184	196	212
spin_non_align	16	60	16	24	32	39	40	57
lock_xadd	8	117	235	352	470	357	352	411
xadd	2	342	684	1026	1368	855	1026	1196
gcc_sync_fetch_and_add	8	119	239	359	479	371	359	419
add_mb	5	171	342	513	684	455	513	598
add	2	398	797	1195	1594	996	1196	1394
nop	0	6357142	2870967	2870967	2119047	1390625	898989	687258

2 x Intel(R) Xeon(R) CPU X5355 @2.66GHz (2-socket 4 cores each)
		mops vs threads
op	ns/op	1	2	3	4	5	6	7	8	9	10	11	12	13	14
mutex_align	43	22	42	63	84	105	126	145	162	126	134	132	141	139	155
mutex_non_align	43	22	10	15	18	18	22	29	27	27	33	36	44	49	57
spin_align	17	56	112	166	210	275	292	273	260	318	270	345	312	354	346
spin_non_align	17	56	17	36	38	51	49	50	55	72	91	93	123	87	120
lock_xadd	10	98	195	289	377	467	513	504	504	442	420	525	512	582	605
xadd	2	380	742	1060	1189	1490	1610	1829	1350	1687	1760	1954	1714	1425	2006
gcc_sync_fetch_and_add	10	98	195	287	375	466	560	488	680	523	583	556	622	587	597
add_mb	7	126	252	369	479	587	589	770	719	598	650	639	686	649	602
add	2	443	861	974	1393	1775	1807	1904	1740	1986	2139	1679	1555	2307	2116
nop	0	4114457	3283653	2355172	1366000	903439	803529	617540	640712	532761	466530	410950	382418	351699	320356

SUNW,T5240, 2*(HT-64) 1415MHz SUNW,UltraSPARC-T2+ (2-socket 8-cores each, 8 threads/core)
		mops vs threads
op	ns/op	1	9	17	25	33	41	49	57	65	73	81	89	97	105	113	121
mutex_align	299	3	29	55	78	98	115	132	141	153	161	179	181	191	200	208	209
mutex_non_align	299	3	7	13	22	29	37	44	51	57	63	68	73	78	83	88	92
lock_xadd	70	14	125	232	326	408	472	506	538	536	520	512	503	499	492	487	477
add_mb	34	28	258	469	637	759	853	909	937	947	926	915	897	881	893	892	870
add	13	74	637	1020	1257	1398	1412	1356	1295	1222	1247	1273	1246	1265	1272	1286	1287
nop	0	184331	46367	24849	17520	12893	10676	9051	7616	6697	6196	5492	4960	4069	3739	3486	3289

Thursday, April 9, 2009

bugs and more bugs

have i worked on so far this year...
but 7.0 is soon to be released...and
things are starting to look reasonable

Tuesday, January 13, 2009

api-thread-contention TODOs

currently working on a refactoring ndbapi for mysql-cluster, in order to decrease
mutex contention allowing to better utilize many threads on multi-core machines.
while doing so, i implement a "vanilla" ndbapi-backend which lacks features
current in there.

once this is working, i'll start adding optimizations...

reintroduce:
- zero copy receive
- optimizations for single threaded performance (less ctx switches)

introduce:
- split mutex(es)
- zero copy send (or actually 1-copy instead of 2, but zero sounds better)

jonas blog