Thursday, November 4, 2010

distributed pushed down joins - webinar and new preview available

as part of tonight's webinar we made a new preview release available, both src and binary (linux x86-64, glibc 2.3).

this includes pushdown of REF accesses, which was not supported in the last preview release, as well as the latest and greatest new optimizations.

note that this is still not production-ready code.
but...we welcome any feedback on it!

---

Nov 5: And feedback can be sent to spj-feedback@sun.com

Wednesday, August 25, 2010

distributed pushed down joins - progress and attitude

we're now at a stage where we can test/benchmark scan/scan queries, i.e. push most queries. there are still known bugs and things that we can't push.

to see how we perform, we tested a very old query/dataset that a customer tried back in the 4.1 days. the result at that time was a disaster, and they decided to use something else.

it's an 11-way join with mainly REF accesses (i.e. not many lookups)

i tested it on the same hardware/configuration as for my UC presentation.
without join pushdown the average query time is 3.6s.
with join pushdown it's ~500ms, i.e. a ~7x improvement.
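The speedup figures above are easy to sanity-check (a quick arithmetic sketch; the ~500ms figure is approximate):

```python
baseline_s = 3.6  # average query time without join pushdown, in seconds
pushed_s = 0.5    # ~500 ms with join pushdown

speedup = baseline_s / pushed_s
print(f"speedup: ~{speedup:.1f}x")  # ~7.2x, i.e. the ~7x quoted above
```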

the knights of ni are deeply disappointed, and that's an attitude i like :-)

FYI: we also optimized TPCW::getBestSeller some more and are now at 42x on that hardware/configuration.

---

stay tuned

---

micro update: found a bug when using ndbmtd...fixing that gave 200ms (i.e. 18x)...also did some more optimizations, and the record is now 120ms (i.e. 29x)

Thursday, May 20, 2010

distributed pushed down joins - node-failure handling done

node-failure handling was, as expected, lots of small changes to various pieces of the code.
some incompatible protocol changes were made (nice not to have any existing GA release to care about :)

also, while doing this, I re-implemented the entire "abort of join" handling.
it should now also be capable of aborting more complex joins
(but node failure is still a special case)

this means that "testSpj -n NF_Join" now passes!!

---

this is an important step towards a releasable version.

note: this has only been implemented in the branch which also contains the extensions for scan-vs-scan joins, i.e. not in the branch released as the preview.

and, OleJohn integrated support for filters on child-joins with mysqld.

---

And as always, please provide feedback on the preview release that we made!

Wednesday, May 12, 2010

distributed pushed down joins - more testing

we have now written and pushed ndbapi testing of our pushdown-join implementation (~2k LOC)

the following classes/programs were added
  • HugoQueryBuilder (class)
    Constructs a pseudo-random query.
    Inputs: a list of tables, and options specifying which features should be used in the query
  • HugoQueries (class)
    Runs an arbitrary query.
    Note: only verifies individual rows, not relations/result sets. For verifying result sets we use the random query generator
  • testSpj (program)
    Ndbapi test program using the above classes to run various tests for our cluster-specific automatic testing framework
  • hugoJoin (program)
    Stand-alone test program that takes a list of tables/options as command-line arguments
    and, using the above classes, constructs/runs random queries
Quite happy with this, as it's very much needed to get good coverage of the implementation,
especially of error cases (e.g. node failure) which are hard to test from/with SQL.
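The idea behind HugoQueryBuilder can be sketched roughly like this (a hypothetical Python sketch of the concept only — the real class is C++ ndbapi code, and all names and fields here are made up):

```python
import random

def build_random_join(tables, options, depth=3, seed=None):
    """Build a pseudo-random linked-join spec: a root scan plus a chain of
    child operations, each joined against some earlier operation."""
    rng = random.Random(seed)
    ops = [{"table": rng.choice(tables), "access": "scan"}]
    for _ in range(depth - 1):
        ops.append({
            "table": rng.choice(tables),
            "access": rng.choice(options.get("access_types", ["lookup", "ref"])),
            "parent": rng.randrange(len(ops)),  # may link to any earlier op
        })
    return ops

# deterministic for a given seed, so a failing query can be reproduced
q = build_random_join(["t1", "t2", "t3"], {"access_types": ["ref"]}, seed=42)
print(q)
```

The seeding is the important part for automatic testing: a query that triggers a bug (e.g. during node failure) can be regenerated exactly.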

---

And please provide feedback on the preview release that we made!

Friday, May 7, 2010

distributed pushed down joins - more features

update on latest accomplishments:

  1. added support for filters (compare engine_condition_pushdown) on non-root operations, in 2 flavors:
    • constant/immediate filters, which are provided after the NdbQuery has been defined; these may not contain filter conditions referencing other NdbQueryOperations
    • parameterized/linked filters; these types of programs must be provided when building the NdbQuery object

    The constant filters should be "easy" to integrate with ndbapi/mysqld, but the parameterized/linked ones are harder, as we need to add new features to NdbScanFilter.

    Once this is integrated into mysqld, it will provide better performance for already-pushable queries; see page 43 in my UC presentation (ref: filter).

  2. added support for an NdbQueryOperation referencing a non-direct parent NdbQueryOperation. This should also be quite easy to integrate into ndbapi/mysqld, and will mean that new types of queries become pushable.

  3. support for "to-many" joins is now as complete as it will get until we have an ndbapi for more elaborate testing. This will mean that lots of new queries become pushable.

Note: all the additions have so far only been made inside ndb(mt)d; nothing new has (yet) been added to ndbapi/mysqld.
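The constant-vs-linked distinction in point 1 can be illustrated like this (a hypothetical Python model, not the ndbapi; column names are made up):

```python
# a constant/immediate filter only uses values known up front, so it can be
# attached after the query has been defined
def constant_filter(row):
    return row["b"] > 10              # 10 is an immediate value

# a parameterized/linked filter compares against a column of a parent
# operation; the compared value differs per parent row, so this kind of
# filter program has to be part of the query definition itself
def linked_filter(parent_row):
    return lambda row: row["b"] > parent_row["pk"]

parent = {"pk": 3}
children = [{"b": 2}, {"b": 5}, {"b": 20}]
print([r for r in children if constant_filter(r)])        # [{'b': 20}]
print([r for r in children if linked_filter(parent)(r)])  # [{'b': 5}, {'b': 20}]
```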

---

And please provide feedback on the preview release that we made!

Saturday, April 24, 2010

distributed push down joins - first "to many" join

just managed to run the first "to-many" join inside the data nodes.
the query corresponds to

SELECT t1.*, t2.*
FROM T1 t1
LEFT OUTER JOIN T1 as t2 on t2.pk >= t1.b

- the code inside the data nodes is general...but incomplete (the data node later crashed)
- there is no ndbapi and no SQL; the test program is a hard-coded c-program sending messages using the ndb-cluster wire protocol
- the result set is not managed (the c-program just hangs)
- hopefully converting this into usable code will go faster than after last year's post, as the ndbapi is now quite complete and this is "only" an add-on.

for a bit more background look at my presentation from MySQL Conference 2010.

hard-core details:

r.bn: 245 "DBTC", r.proc: 2, gsn: 32 "SCAN_TABREQ" prio: 1
s.bn: 32768 "API", s.proc: 4, s.sigId: 0 length: 11 trace: 1 #sec: 2 fragInf: 0
apiConnectPtr: H'00000027 requestInfo: H'08400a01:
Parallellism: 1 Batch: 64 LockMode: 0 Keyinfo: 0 Holdlock: 0 RangeScan: 0 Descending: 0 TupScan: 0
ReadCommitted: 1 DistributionKeyFlag: 0 NoDisk: 1 spj: 1 attrLen: 0, keyLen: 0
tableId: 7, tableSchemaVer: 1
transId(1, 2): (H'00000000, H'00000400) storedProcId: H'0000ffff
batch_byte_size: 32768, first_batch_size: 64
H'00000027 H'00000000 H'08400a01 H'00000007 H'00000001 H'0000ffff H'00000000
H'00000400 H'00000026 H'00008000 H'00000040
SECTION 0 type=generic size=1
H'00000014
SECTION 1 type=generic size=33
H'000f0002 H'00050002 H'00000010 H'00000007 H'00000001 H'00010001 H'00090003
H'00000003 H'00000008 H'00000001 H'00000001 H'00000003 H'00010001 H'00000002
H'00060000 H'00080002 H'00000009 H'10000020 H'00000001 H'00000012 H'00000002
H'fff00002 H'ffe90000 H'000a0003 H'00000009 H'ffff0100 H'10000024 H'00000001
H'00000012 H'00000003 H'fff00002 H'ffe90000 H'fffb0000
---- Received - Signal ----------------
r.bn: 32768 "API", r.proc: 4, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 166035 length: 3 trace: 1 #sec: 1 fragInf: 0
H'10000020 H'00000000 H'00000400
SECTION 0 type=linear size=7
H'fff30004 H'00000003 H'00000001 H'00000002 H'ffe90008 H'00000014 H'00000000
---- Received - Signal ----------------
r.bn: 32768 "API", r.proc: 4, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 166092 length: 3 trace: 1 #sec: 1 fragInf: 0
H'10000020 H'00000000 H'00000400
SECTION 0 type=linear size=7
H'fff30004 H'00000003 H'00000002 H'00000003 H'ffe90008 H'00000014 H'00000001
---- Received - Signal ----------------
r.bn: 32768 "API", r.proc: 4, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 166202 length: 3 trace: 1 #sec: 1 fragInf: 0
H'10000024 H'00000000 H'00000400
SECTION 0 type=linear size=9
H'fff30004 H'00000003 H'00000001 H'00000002 H'ffe90008 H'00000014 H'00000000
H'fffb0004 H'00000000
---- Received - Signal ----------------
r.bn: 32768 "API", r.proc: 4, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 166203 length: 3 trace: 1 #sec: 1 fragInf: 0
H'10000024 H'00000000 H'00000400
SECTION 0 type=linear size=9
H'fff30004 H'00000003 H'00000002 H'00000003 H'ffe90008 H'00000014 H'00000001
H'fffb0004 H'00000000
---- Received - Signal ----------------
r.bn: 32768 "API", r.proc: 4, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 166227 length: 3 trace: 1 #sec: 1 fragInf: 0
H'10000024 H'00000000 H'00000400
SECTION 0 type=linear size=9
H'fff30004 H'00000003 H'00000001 H'00000002 H'ffe90008 H'00000014 H'00010002
H'fffb0004 H'00000001
---- Received - Signal ----------------
r.bn: 32768 "API", r.proc: 4, r.sigId: -1 gsn: 5 "TRANSID_AI" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 166234 length: 3 trace: 1 #sec: 1 fragInf: 0
H'10000024 H'00000000 H'00000400
SECTION 0 type=linear size=9
H'fff30004 H'00000003 H'00000002 H'00000003 H'ffe90008 H'00000014 H'00010003
H'fffb0004 H'00000001

Thursday, April 15, 2010

uploaded cluster pushdown join presentation...and preview-release

the presentation is here
the src tarball, which does contain bugs if node failures occur, is here

thanks to those of you who attended
feedback on the preview is welcome, e.g. here

maybe we'll do a binary version...if so i'll make a new post

Thursday, March 25, 2010

fast IO and ndb

i read Mark's blog entry about fast IO for PBXT, InnoDB (1.0.6 plugin) and MyISAM with some interest.

then there was the durable, not durable and really not durable blog, in which Mark and LinuxJedi discussed cluster, and Mark wondered how much IO one could get from 1 data node with ndb.

so I decided to try.
the setup is similar:
an ndb cluster using disk tables, with a tablespace stored in /dev/shm.

Data occupied 7.3G, which is slightly less than (some of) the others; this is because we don't support having indexed columns stored on disk, i.e. any column that has an index will be stored in memory instead.

ndb does, however, not support "handler" statements, so I used oltp-point-select-all-cols instead.

sysbench --test=oltp --mysql-host=foobar --mysql-user=root \
--mysql-password=pw \
--mysql-db=test \
--oltp-table-size=40000000 \
--max-time=60 --max-requests=0 \
--mysql-table-engine=ndb \
--db-ps-mode=disable --mysql-engine-trx=yes \
--oltp-read-only --oltp-skip-trx --oltp-test-mode=simple \
--oltp-point-select-all-cols \
--oltp-dist-type=uniform \
--oltp-range-size=1000 \
--num-threads=1 --seed-rng=1 run

the results are not super great...
so i wrote an ndbapi program (which is roughly the equivalent of handler statements)

these numbers look a bit better, but the data node (ndbmtd) was almost idle when running this...

so i made another experiment. instead of retrieving 1 row at a time (set @r = rand() % 40000000; select * from sbtest where id = @r) i changed to retrieving 16 rows at a time (set @r1 = rand(); set @r2 = rand(); ...; select * from sbtest where id in (@r1...@r16)).
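Why fetching 16 rows per statement helps can be seen from a simple cost model (illustrative numbers only, not measurements): each round trip pays a fixed overhead, and batching amortizes it.

```python
def rows_per_second(batch, roundtrip_cost, per_row_cost):
    """Throughput when every client/data-node round trip pays a fixed
    overhead plus a small per-row cost (both in seconds)."""
    return batch / (roundtrip_cost + batch * per_row_cost)

# hypothetical costs: 100 us per round trip, 5 us of real work per row
single  = rows_per_second(1, 100e-6, 5e-6)
batched = rows_per_second(16, 100e-6, 5e-6)
print(f"~{batched / single:.1f}x")  # ~9.3x: batching amortizes the round-trip overhead
```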

i believe these results are still relevant given that mark's aim was to test fast IO,
and i think that this rewrite won't affect other storage engines as much as it does ndb.
and those numbers were quite ok.

as an extra bonus, i also tried using our in-memory tables (alter table sbtest storage memory), also with ndbapi and 16 rows at a time.

these tests were executed on a 16-core machine with an Intel(R) Xeon(R) CPU E7420@2.13GHz,
and in my config.ini I had

DiskPageBufferMemory = 1GB
DiskIOThreadPool=8
FileSystemPath=/dev/shm


note: I first created the tables in myisam, then converted them to ndb disk tables by issuing "alter table sbtest storage disk tablespace TS engine = ndb", and all tests were using ndb-cluster-connection-pool=4

all results (each row: throughput for 1 through 16 threads):

SQL 2756 4998 7133 9130 10720 12222 13305 14190 14626 15287 15547 15955 16143 16334 16507 16757
ndbapi 4177 7581 10319 13162 15245 17064 18874 20652 20850 24131 25976 24910 29832 30666 32625 34841
b=16 1520 11374 35454 53396 55601 63248 71819 78468 103324 97330 97572 111099 125564 126790 133873 141588
mm b=16 73004 128296 172207 207361 246907 270783 293753 312435 327006 345085 346924 374837 360747 372192 376887 394862

Monday, March 15, 2010

Dbspj update, teaser for UC

5 months more...

a reminder, what Dbspj is:
- It's a new feature for MySQL Cluster
- It gives the possibility to push down SQL joins, i.e. to evaluate the joins inside the data nodes.

latest and greatest:
- the last 5 months were spent on SQL integration
- a big effort went into making sure it is used only for cases that we actually support
- lots of testing using RQG
- a significant part of the bugs found in the last couple of weeks have not been related to spj, but were in fact "ordinary" optimizer bugs
- we think that it's quite usable now.

and the numbers:
- my plan was to present TPC-W as "realistic" numbers
- now they look more like fantastic numbers
- if i could, i would add a "25x" to my title
- visit my UC presentation (link) to learn more

and fyi: we also plan to provide a feature preview source (or binary) release for
O'Reilly MySQL Conference & Expo 2010


(i will of course also disclose all information provided in UC presentation after the event)