Sunday, October 7, 2012

small men with big titles

I used the phrase "small men with big titles" yesterday while having a cup of coffee with friends. I don't know if it's an established phrase; Mr. Google doesn't seem to know about it.

Tuesday, June 26, 2012

Foreign keys for MySQL Cluster

Supported features are:
  • on update/delete restrict
  • on update/delete no action
  • on update/delete set null
  • on update/delete cascade
  • online add/drop of foreign keys
The foreign keys are implemented inside the data nodes, and will hence work regardless of which type of client you use: SQL, memcache, Cluster/J, ndbapi, {insert your favourite connector}.

Somewhat interesting differences between ndb and innodb are:
  1. I haven't implemented the innodb "extension" that allows the parent reference to be non-unique. That seems OK, I think; wl#148 describes not having this extension as a feature.
  2. For innodb, "no action" is identical to "restrict".
    For ndb, no action means a deferred check, i.e. the constraint is checked before commit.
    (I think it should be at end of statement, but ndb currently doesn't have any statement concept.)
  3. Ndb does not "natively" support update of primary keys (i.e. not inside the data nodes); an update of a primary key from SQL is emulated by a delete+insert.
    The implication of this is that on-update actions (such as cascade/set null) don't work if the parent reference uses the primary key.
    Fixing this means first adding native support for update of primary keys, and then adding FK support on top of it. But I think (hope) that this is an acceptable limitation.
  4. Ndb today supports checking unique keys in a deferred fashion.
    This feature is used by the slave applier to avoid false conflicts with replication.
    Foreign keys support the same.
  5. The ndb implementation does not (today) support the foreign_key_checks variable.
  6. Ndb supports online adding and dropping of foreign keys (i.e. no table copy; clients can read/write while the FK is added/dropped).
    Hopefully this fact can decrease the need for the foreign_key_checks variable (although we should consider how to process a mysqldump file without hassle).
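As a sketch of what this looks like from SQL (table and constraint names here are made up for illustration; note that, per difference 3, I use on update restrict since the parent reference is a primary key):

```sql
CREATE TABLE parent (
  id   INT PRIMARY KEY,
  data INT
) ENGINE=ndbcluster;

CREATE TABLE child (
  id        INT PRIMARY KEY,
  parent_id INT,
  CONSTRAINT fk_parent FOREIGN KEY (parent_id) REFERENCES parent (id)
    ON UPDATE RESTRICT ON DELETE CASCADE
) ENGINE=ndbcluster;

-- online add/drop: no table copy, clients keep reading/writing meanwhile
ALTER TABLE child DROP FOREIGN KEY fk_parent;
ALTER TABLE child ADD CONSTRAINT fk_parent
  FOREIGN KEY (parent_id) REFERENCES parent (id) ON DELETE CASCADE;
```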

Implementation note 1: the implementation is based on ndb's internal trigger mechanism, the same mechanism that is used to maintain unique indexes. One implication of this is that the cost (both in terms of single-threaded latency and increase in load) should be roughly comparable to that of unique indexes. Note that (to my knowledge) no benchmarks have been performed. Stay tuned for that.

Implementation note 2: a somewhat interesting implementation discovery is that mysql actually features a capable foreign key parser. My original plan was to "copy" the innodb hand-crafted parser, but this wasn't needed: the server already parsed the definitions and created nice structures, it just didn't expose them to the storage engine. So instead of adding a hand-crafted parser, I modified the server to expose them, and the code in our handler came out much simpler. Happy with that!

Implementation note 3: a (deliberate) artefact of the current implementation is that when using on {update/delete} {set null/cascade}, the set-null or cascade performed on the child table will be put into the binlog (if it is turned on). That means the foreign key action will be performed on the slave side too, even if foreign keys are not present (or enabled) on that site.

A general reflection from adding this: ndb internals have by now gained quite a lot of infrastructure that makes adding complex features like this a lot easier.


And finally, the diffstat for the original quilt patchset looks like this:
[]:jonas@eel:~/src/ndb-fk> diffstat patches/*
102 files changed, 14813 insertions(+), 842 deletions(-) 


STANDARD DISCLAIMER
  • the exact format of this feature might (will) change before reaching a (GA) release
  • it's currently unknown when (if?) it will reach a (GA) release near you

Wednesday, February 15, 2012

international man of mystery

I can't help thinking of Austin Powers when seeing one billion queries per minute.

Thursday, October 13, 2011

Brewing in MySQL Cluster 7.2.x

Admittedly MySQL Cluster has some way to go before monitoring becomes best in class. But we are progressing!
In 7.1 we introduced NDBINFO, an infrastructure that enables presenting information from within the cluster in SQL format.
And here are 4 new tables that are currently brewing:

ndbinfo.transactions and ndbinfo.operations
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$transactions';
+----------------+-----------+---------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                  |
+----------------+-----------+---------------------------------+
| node_id        | int       | node id                         |
| block_instance | int       | TC instance no                  |
| objid          | int       | Object id of transaction object |
| apiref         | int       | API reference                   |
| transid        | varchar   | Transaction id                  |
| state          | int       | Transaction state               |
| flags          | int       | Transaction flags               |
| c_ops          | int       | No of operations in transaction |
| outstanding    | int       | Currently outstanding request   |
| timer          | int       | Timer (seconds)                 |
+----------------+-----------+---------------------------------+
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$operations';
+----------------+-----------+-------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                |
+----------------+-----------+-------------------------------+
| node_id        | int       | node id                       |
| block_instance | int       | LQH instance no               |
| objid          | int       | Object id of operation object |
| tcref          | int       | TC reference                  |
| apiref         | int       | API reference                 |
| transid        | varchar   | Transaction id                |
| tableid        | int       | Table id                      |
| fragmentid     | int       | Fragment id                   |
| op             | int       | Operation type                |
| state          | int       | Operation state               |
| flags          | int       | Operation flags               |
+----------------+-----------+-------------------------------+
  • these two tables show currently ongoing transactions and currently ongoing operations, respectively
  • ndbinfo.transactions roughly corresponds to information_schema.INNODB_TRX
  • ndbinfo.operations roughly corresponds to information_schema.INNODB_LOCKS
  • the information provided is collected without any kind of locks
  • the information provided is collected by iterating internal data structures. Hence the output does not necessarily represent a state that has ever existed (i.e. it is not a snapshot)
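A minimal sketch of how these tables could be used (state/op/flags are raw numbers in these tables, so the output takes some decoding; the WHERE conditions here are just examples):

```sql
-- transactions that have been running for more than 10 seconds
SELECT node_id, transid, c_ops, outstanding, timer
FROM ndbinfo.transactions
WHERE timer > 10
ORDER BY timer DESC;

-- which tables/fragments the operations of one transaction touch
SELECT node_id, tableid, fragmentid, op, state
FROM ndbinfo.operations
WHERE transid = '...';  -- a transid taken from the query above
```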


One missing piece of this puzzle is how to map an ndb transaction id to a mysql connection id.
When (if?) this information becomes available, one could e.g. join information_schema.processlist with ndbinfo.operations to see which locks are being held by a certain connection. (Suggestions on how to gather/expose this are welcome.)
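Just to illustrate what that join could look like if such a mapping existed (the table ndb_transid_mysql_conn_map below is purely hypothetical; it stands in for whatever mechanism ends up exposing the mapping):

```sql
-- hypothetical: ndb_transid_mysql_conn_map does not exist today
SELECT p.id AS connection_id, p.user,
       o.tableid, o.fragmentid, o.op, o.state
FROM information_schema.processlist p
JOIN ndb_transid_mysql_conn_map m ON m.mysql_connection_id = p.id
JOIN ndbinfo.operations o ON o.transid = m.ndb_transid;
```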

ndbinfo.threadblocks and ndbinfo.threadstat
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$threadblocks';
+----------------+-----------+----------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT |
+----------------+-----------+----------------+
| node_id        | int       | node id        |
| thr_no         | int       | thread number  |
| block_number   | int       | block number   |
| block_instance | int       | block instance |
+----------------+-----------+----------------+

mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$threadstat';
+----------------+-----------+------------------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                           |
+----------------+-----------+------------------------------------------+
| node_id        | int       | node id                                  |
| thr_no         | int       | thread number                            |
| thr_nm         | varchar   | thread name                              |
| c_loop         | bigint    | No of loops in main loop                 |
| c_exec         | bigint    | No of signals executed                   |
| c_wait         | bigint    | No of times waited for more input        |
| c_l_sent_prioa | bigint    | No of prio A signals sent to own node    |
| c_l_sent_priob | bigint    | No of prio B signals sent to own node    |
| c_r_sent_prioa | bigint    | No of prio A signals sent to remote node |
| c_r_sent_priob | bigint    | No of prio B signals sent to remote node |
| os_tid         | bigint    | OS thread id                             |
| os_now         | bigint    | OS gettimeofday (millis)                 |
| os_ru_utime    | bigint    | OS user CPU time (micros)                |
| os_ru_stime    | bigint    | OS system CPU time (micros)              |
| os_ru_minflt   | bigint    | OS page reclaims (soft page faults)      |
| os_ru_majflt   | bigint    | OS page faults (hard page faults)        |
| os_ru_nvcsw    | bigint    | OS voluntary context switches            |
| os_ru_nivcsw   | bigint    | OS involuntary context switches          |
+----------------+-----------+------------------------------------------+
These two tables show which blocks currently run in which thread, and statistics per thread, respectively.
  • the statistics are counted from data-node start, so to see a trend one needs to snapshot the table and compare with an earlier snapshot
  • the fields starting with os_ru_ are gathered with getrusage(RUSAGE_THREAD) (or equivalent)
Lots of numbers! And I'm honestly not quite sure how to interpret them all.
A few simple rules might be that (for a non-idle cluster):
  • user time should be high and system should be low
  • involuntary context switches should be low
  • page faults should be low
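Since the counters are cumulative from node start, a delta over an interval can be computed by snapshotting the table and joining against the snapshot, roughly like this (sketch; ts0 is just a throwaway table name):

```sql
CREATE TABLE ts0 AS SELECT * FROM ndbinfo.threadstat;
-- ... let the cluster run under load for a while ...
SELECT t.node_id, t.thr_no, t.thr_nm,
       t.os_ru_utime  - s.os_ru_utime  AS utime_micros,
       t.os_ru_stime  - s.os_ru_stime  AS stime_micros,
       t.os_ru_nivcsw - s.os_ru_nivcsw AS involuntary_cs
FROM ndbinfo.threadstat t
JOIN ts0 s USING (node_id, thr_no)
ORDER BY utime_micros DESC;
DROP TABLE ts0;
```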

STANDARD DISCLAIMER
  • the exact format of the tables might (will) change before reaching a release
  • it's currently unknown when(if?) they will reach a release near you

Monday, October 10, 2011

new features in MySQL Cluster 7.2.1

  • AQL (aka push down join)
    Further improvements and refinements compared to 7.2.0 from April
  • Index statistics
    A long overdue feature that aims to reduce (minimize) the need for manual query tuning, which has previously been essential for efficient SQL usage with ndb.
  • memcache access support
  • Active-Active replication enhancements
  • Various internal limits have been increased
    - Max row-size now 14k (previously 8k)
    - Max no of columns in table now 512 (previously 128)
  • Rebase to mysql-5.5 (7.2.1 is based on mysql-5.5.15)
  • Improved support for geographically separated clusters
    (note: a single cluster, i.e. not using asynchronous replication)

Brief introduction to AQL (aka join pushdown)

The basic concept is to evaluate joins down in the data nodes instead of (or rather, in addition to) in mysqld.
Ndb examines the query plan created by mysqld, constructs a serialized definition of the join, and ships it down to the data nodes.
The join is then evaluated in the data nodes in parallel (if appropriate), and the result set is sent back to mysqld using a streaming interface.
The performance gain (latency reduction) is normally in the range of 20x for a 3-way join.
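Whether a join was pushed can be seen from EXPLAIN. A sketch (table names are made up, and the exact wording of the Extra column may differ between builds):

```sql
SET ndb_join_pushdown = ON;   -- default in 7.2

EXPLAIN
SELECT *
FROM t1
JOIN t2 ON t2.a = t1.b
JOIN t3 ON t3.a = t2.b;
-- pushed tables show up in the Extra column, roughly:
--   t1: Parent of 3 pushed join@1
--   t2: Child of 't1' in pushed join@1
--   t3: Child of 't2' in pushed join@1
```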

Brief introduction to Index statistics

The index statistics work a lot like innodb persistent statistics.
When you execute ANALYZE TABLE T, the data nodes scan the indexes of T and produce a histogram for each index.
The histograms are stored in tables in ndb (mysql.ndb_index_stat_head and mysql.ndb_index_stat_sample) and can then be used by any mysqld connected to the cluster. A histogram will not be regenerated until a new ANALYZE TABLE T is requested.
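The flow from SQL looks roughly like this (t1 is a placeholder; the exact layout of the stat tables is an internal detail and may change):

```sql
ANALYZE TABLE t1;   -- data nodes scan t1's indexes and build histograms

-- the histograms land here, shared by every mysqld in the cluster
SELECT * FROM mysql.ndb_index_stat_head;
SELECT * FROM mysql.ndb_index_stat_sample LIMIT 10;
```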

Brief introduction to Active-Active enhancements

MySQL Cluster has supported active-active asynchronous replication with conflict detection and conflict resolution since 6.3.
In prior versions, the schema had to be modified, adding a timestamp column to each table, and the application had to be modified to maintain this column.
In this new version, no schema modification and no application modification are needed.
In previous versions, conflict detection/resolution was performed on a row-by-row basis.
In this new version, transaction boundaries are respected.
E.g. if a row R is determined to be in conflict, not only that row change will be resolved,
but the entire transaction T that modified the row, and all transactions depending on T transitively.
Longer descriptions can be found here and here
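As a sketch, enabling the new transaction-aware conflict handling is a matter of configuring the conflict function for a table in the mysql.ndb_replication control table (db/table names below are placeholders, and the exact configuration details may change):

```sql
-- NDB$EPOCH_TRANS() is the transaction-aware conflict function;
-- note: no timestamp column in the user table is needed
INSERT INTO mysql.ndb_replication
  (db, table_name, server_id, binlog_type, conflict_fn)
VALUES ('mydb', 'mytable', 0, 0, 'NDB$EPOCH_TRANS()');
```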

Sorry for omitting hex dumps and/or unformatted numbers

Mandatory late update: the join described here has now gained an extra 2x (but this improvement did not make 7.2.1)

Tuesday, May 10, 2011

Star schema benchmark on MySQL Cluster 7.2

I decided to try the star schema benchmark on our latest 7.2 release (link). Star schema benchmark is an analytics oriented benchmark, and MySQL Cluster has not been developed to address this kind of workload. Nevertheless I couldn't resist trying...

Setup

  • 2 data-nodes, each running on a 4-way Xeon E7420 @ 2.13GHz (total 16 cores), 256GB RAM
  • The mysqld was co-located with one of the data-nodes
  • I used memory tables

Results


Queries: link
Query   sf10   sf100
Q1.1       5      62
Q1.2     0.4     4.6
Q1.3     0.1     1.1
Q2.1       9      95
Q2.2      48     495
Q2.3      51     517
Q3.1      47     481
Q3.2      47     476
Q3.3       2      20
Q3.4       2      19
Q4.1      57     572
Q4.2      57     574
Q4.3      12     120

(time in sec, less time is better)

Facts

  • I have done no analysis of why some queries are faster than others
  • This type of query is currently not the focus of our efforts to improve join performance (nor has it been so far)
  • for lineorder table I specified max_rows=500000000
  • ndb_join_pushdown=on (default in 7.2)

Thoughts and guesses

  • sf100 is almost exactly 10x slower than sf10, so I think a reasonable guess is that sf1000 would be 10x slower than sf100
  • Star schema benchmark is an easy benchmark (compared to e.g TPC-H)
  • I think results looks quite good
  • My guess is that scaling out (e.g to 4 nodes) would scale well (although I didn't try) (e.g scale out by running 2 ndbmtd on each machine)
  • Running a high write load in parallel doesn't increase query times significantly (as tested in the presentation, pages 45-46), which is quite cool (although I didn't try it this time)

Load procedure for sf100

  • I manually split the 61GB file (lineorder) into 10 files of 6GB each (using split(1))
  • Then I loaded these 10 in parallel, with 10 clients, using LOAD DATA INFILE
  • I disabled indexes while loading using ndb_restore --disable-indexes (from a backup made on an empty db containing just the schema definition)
  • I then rebuilt the indexes using ndb_restore --rebuild-indexes, with BuildIndexThreads=15 in config.ini
(For sf10 I did nothing special.)
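The split step can be sketched like this (self-contained: "lineorder.tbl" here is a small generated stand-in for the real 61GB dump, and the database/table names in the comments are placeholders):

```shell
# generate a small stand-in for the real lineorder dump
seq 1 1000 > lineorder.tbl

# split into 10 roughly equal pieces by line count (split(1))
lines=$(wc -l < lineorder.tbl)
split -l $(( (lines + 9) / 10 )) lineorder.tbl lineorder.part.
ls lineorder.part.* | wc -l    # 10 pieces

# the real run then fed each piece to its own mysql client, roughly:
#   for f in lineorder.part.*; do
#     mysql ssb -e "LOAD DATA INFILE '$PWD/$f' INTO TABLE lineorder
#                   FIELDS TERMINATED BY '|'" &
#   done
#   wait
```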

Load time for sf100

  • 4h including rebuilding of indexes


Last minute update

  • OleJohn reports that Q2.2 and Q2.3 should be 10x faster when forcing a different join order (e.g. by using straight_join or force index), but I haven't tried this