Thursday, October 13, 2011

Brewing in MySQL Cluster 7.2.x

Admittedly, MySQL Cluster has some way to go before monitoring becomes best in class. But we are progressing!
In 7.1 we introduced NDBINFO, an infrastructure for presenting information from within the cluster in SQL format.
And here are 4 new tables that are currently brewing:

ndbinfo.transactions and ndbinfo.operations
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$transactions';
+----------------+-----------+---------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                  |
+----------------+-----------+---------------------------------+
| node_id        | int       | node id                         |
| block_instance | int       | TC instance no                  |
| objid          | int       | Object id of transaction object |
| apiref         | int       | API reference                   |
| transid        | varchar   | Transaction id                  |
| state          | int       | Transaction state               |
| flags          | int       | Transaction flags               |
| c_ops          | int       | No of operations in transaction |
| outstanding    | int       | Currently outstanding request   |
| timer          | int       | Timer (seconds)                 |
+----------------+-----------+---------------------------------+
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$operations';
+----------------+-----------+-------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                |
+----------------+-----------+-------------------------------+
| node_id        | int       | node id                       |
| block_instance | int       | LQH instance no               |
| objid          | int       | Object id of operation object |
| tcref          | int       | TC reference                  |
| apiref         | int       | API reference                 |
| transid        | varchar   | Transaction id                |
| tableid        | int       | Table id                      |
| fragmentid     | int       | Fragment id                   |
| op             | int       | Operation type                |
| state          | int       | Operation state               |
| flags          | int       | Operation flags               |
+----------------+-----------+-------------------------------+
  • these two tables show currently ongoing transactions and currently ongoing operations, respectively
  • ndbinfo.transactions roughly corresponds to information_schema.INNODB_TRX
  • ndbinfo.operations roughly corresponds to information_schema.INNODB_LOCKS
  • the information provided is collected without taking any kind of locks
  • the information provided is collected by iterating internal data-structures. Hence the output does not necessarily represent a state that has ever existed (i.e. it is not a snapshot)
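As a hedged example (assuming the tables end up exposed under the names used above, ndbinfo.transactions and ndbinfo.operations, and that the columns survive as shown; see the disclaimer below), one could list the operations belonging to long-running transactions like this:

-- a minimal sketch; column names are taken from the listings above,
-- table names are the working names and may change before release
SELECT t.transid, t.state, t.c_ops, o.tableid, o.fragmentid, o.op
FROM ndbinfo.transactions t
JOIN ndbinfo.operations o USING (transid)
WHERE t.timer > 10;  -- transactions running for more than ~10 seconds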


One missing piece of this puzzle is how to map an ndb transaction id to a mysql connection id.
When (if?) this information becomes available, one could e.g. join information_schema.PROCESSLIST with ndbinfo.operations to see which locks are held by a certain connection. (Suggestions on how to gather/expose this are welcome.)

ndbinfo.threadblocks and ndbinfo.threadstat
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$threadblocks';
+----------------+-----------+----------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT |
+----------------+-----------+----------------+
| node_id        | int       | node id        |
| thr_no         | int       | thread number  |
| block_number   | int       | block number   |
| block_instance | int       | block instance |
+----------------+-----------+----------------+

mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$threadstat';
+----------------+-----------+------------------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                           |
+----------------+-----------+------------------------------------------+
| node_id        | int       | node id                                  |
| thr_no         | int       | thread number                            |
| thr_nm         | varchar   | thread name                              |
| c_loop         | bigint    | No of loops in main loop                 |
| c_exec         | bigint    | No of signals executed                   |
| c_wait         | bigint    | No of times waited for more input        |
| c_l_sent_prioa | bigint    | No of prio A signals sent to own node    |
| c_l_sent_priob | bigint    | No of prio B signals sent to own node    |
| c_r_sent_prioa | bigint    | No of prio A signals sent to remote node |
| c_r_sent_priob | bigint    | No of prio B signals sent to remote node |
| os_tid         | bigint    | OS thread id                             |
| os_now         | bigint    | OS gettimeofday (millis)                 |
| os_ru_utime    | bigint    | OS user CPU time (micros)                |
| os_ru_stime    | bigint    | OS system CPU time (micros)              |
| os_ru_minflt   | bigint    | OS page reclaims (soft page faults)      |
| os_ru_majflt   | bigint    | OS page faults (hard page faults)        |
| os_ru_nvcsw    | bigint    | OS voluntary context switches            |
| os_ru_nivcsw   | bigint    | OS involuntary context switches          |
+----------------+-----------+------------------------------------------+
These two tables show which blocks currently run in which thread, and per-thread statistics, respectively.
  • the statistics are accumulated from data-node start, so to see a trend one needs to snapshot the table and compare with a later snapshot (a sketch follows below)
  • the fields starting with os_ru_ are gathered with getrusage(RUSAGE_THREAD) (or equivalent)
Lots of numbers! And I'm honestly not quite sure how to interpret them all.
A few simple rules might be that (for a non-idle cluster):
  • user time should be high and system time should be low
  • involuntary context switches should be low
  • page faults should be low
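To put the snapshot-and-compare advice above into practice, here is a minimal sketch, assuming the table ends up exposed as ndbinfo.threadstat (see the disclaimer below):

-- take a baseline snapshot of the accumulated counters
CREATE TEMPORARY TABLE threadstat_snap AS
  SELECT * FROM ndbinfo.threadstat;
-- ...wait for the interval of interest, then diff against the baseline:
SELECT n.node_id, n.thr_nm,
       n.os_ru_utime  - s.os_ru_utime  AS utime_delta,  -- user CPU (micros)
       n.os_ru_stime  - s.os_ru_stime  AS stime_delta,  -- system CPU (micros)
       n.os_ru_nivcsw - s.os_ru_nivcsw AS nivcsw_delta, -- involuntary switches
       n.os_ru_majflt - s.os_ru_majflt AS majflt_delta  -- hard page faults
FROM ndbinfo.threadstat n
JOIN threadstat_snap s USING (node_id, thr_no);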

STANDARD DISCLAIMER
  • the exact format of the tables might (will) change before reaching a release
  • it's currently unknown when (if?) they will reach a release near you

Monday, October 10, 2011

New features in MySQL Cluster 7.2.1

  • AQL (aka push down join)
    Further improvements and refinements compared to 7.2.0 from April
  • Index statistics
    A long overdue feature that aims to reduce (minimize) the need for manual query tuning, which has previously been essential for efficient SQL usage with ndb.
  • memcache access support
  • Active-Active replication enhancements
  • Various internal limits have been increased
    - Max row-size now 14k (previously 8k)
    - Max no of columns in table now 512 (previously 128)
  • Rebase to mysql-5.5 (7.2.1 is based on mysql-5.5.15)
  • Improved support for geographically separated clusters
    (note: a single cluster... i.e. not using asynchronous replication)

Brief introduction to AQL (aka join pushdown)

The basic concept is to evaluate joins down in the data-nodes instead of (or in addition to) in mysqld.
Ndb examines the query plan created by mysqld, constructs a serialized definition of the join, and ships it down to the data-nodes.
There the join is evaluated in parallel (if appropriate), and the result set is sent back to mysqld using a streaming interface.
Performance gain (latency reduction) is normally in the range of 20x for a 3-way join.
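To see whether a particular join actually gets pushed down, EXPLAIN can be used; a minimal sketch (the tables are placeholders, and the exact Extra-column wording is version dependent):

-- placeholder tables; joins via primary-key lookups are push-down candidates
CREATE TABLE t1 (pk INT PRIMARY KEY, a INT) ENGINE=ndbcluster;
CREATE TABLE t2 (pk INT PRIMARY KEY, b INT) ENGINE=ndbcluster;
CREATE TABLE t3 (pk INT PRIMARY KEY, c INT) ENGINE=ndbcluster;
SET ndb_join_pushdown = ON;  -- the switch controlling pushdown (default on in 7.2)
EXPLAIN
SELECT t1.a, t2.b, t3.c
FROM t1
JOIN t2 ON t2.pk = t1.a
JOIN t3 ON t3.pk = t2.b;
-- pushed tables are marked in the Extra column with a note along the
-- lines of "Child of 't1' in pushed join" (wording is version dependent)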

Brief introduction to Index statistics

Index statistics work a lot like InnoDB persistent statistics.
When you execute ANALYZE TABLE t, the data nodes scan the indexes of t and produce a histogram for each index.
The histograms are stored in tables in ndb (mysql.ndb_index_stat_head and mysql.ndb_index_stat_sample) and can then be used by any mysqld connected to the cluster. A histogram will not be refreshed until a new ANALYZE TABLE t is requested.
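A minimal sketch of the workflow (the stat table names are from above; the placeholder table t is assumed to have an ordered index):

CREATE TABLE t (pk INT PRIMARY KEY, a INT, KEY(a)) ENGINE=ndbcluster;
ANALYZE TABLE t;  -- data nodes scan t's indexes and build the histograms
SELECT * FROM mysql.ndb_index_stat_head;             -- histogram metadata
SELECT * FROM mysql.ndb_index_stat_sample LIMIT 10;  -- the samples themselves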

Brief introduction to Active-Active enhancements

MySQL Cluster has supported active-active asynchronous replication with conflict detection and conflict resolution since 6.3.
In prior versions, the schema had to be modified by adding a timestamp column to each table, and the application had to be modified to maintain this column.
In this new version, no schema modification and no application modification is needed.
In previous versions, conflict detection/resolution was performed on a row-by-row basis.
In this new version, transaction boundaries are respected.
E.g. if a row R is determined to be in conflict, not only will this row-change be resolved,
but the entire transaction T that modified the row will be resolved, as well as all transactions depending on T, transitively.
Longer descriptions can be found here and here
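For the curious, a hedged sketch of how this is configured in 7.2: conflict handling is declared per table through the mysql.ndb_replication table (which must first be created as described in the manual), e.g. with the transactional conflict function. Names and values below are as in the 7.2 documentation, so verify against your version:

-- use NDB$EPOCH_TRANS() for table mydb.t1 on all servers (server_id = 0);
-- binlog_type 7 = full row images with updates logged as updates,
-- which the epoch-based conflict functions require
INSERT INTO mysql.ndb_replication
  (db, table_name, server_id, binlog_type, conflict_fn)
VALUES
  ('mydb', 't1', 0, 7, 'NDB$EPOCH_TRANS()');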

Sorry for omitting hex dumps and/or unformatted numbers

Mandatory late update: the join described here has now gained an extra 2x speedup (but this improvement did not make 7.2.1)

Tuesday, May 10, 2011

Star schema benchmark on MySQL Cluster 7.2

I decided to try the star schema benchmark on our latest 7.2 release (link). Star schema benchmark is an analytics oriented benchmark, and MySQL Cluster has not been developed to address this kind of workload. Nevertheless I couldn't resist trying...

Setup

  • 2 data-nodes, each running on a 4-way Xeon E7420 @ 2.13GHz (total 16 cores) with 256Gb RAM
  • The mysqld was co-located with one of the data-nodes
  • I used memory tables

Results


Queries: link
+-------+------+-------+
| Query | sf10 | sf100 |
+-------+------+-------+
| Q1.1  |    5 |    62 |
| Q1.2  |  0.4 |   4.6 |
| Q1.3  |  0.1 |   1.1 |
| Q2.1  |    9 |    95 |
| Q2.2  |   48 |   495 |
| Q2.3  |   51 |   517 |
| Q3.1  |   47 |   481 |
| Q3.2  |   47 |   476 |
| Q3.3  |    2 |    20 |
| Q3.4  |    2 |    19 |
| Q4.1  |   57 |   572 |
| Q4.2  |   57 |   574 |
| Q4.3  |   12 |   120 |
+-------+------+-------+

(time in sec, less time is better)

Facts

  • I have done no analysis to why some queries are faster than others
  • This type of query is currently not the focus of our efforts to improve join performance (nor has it been so far)
  • for the lineorder table I specified max_rows=500000000
  • ndb_join_pushdown=on (default in 7.2); a sketch of both settings follows below
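For reference, a sketch of the two settings from the list above (the column list is a placeholder subset; the real SSB lineorder table has many more columns):

CREATE TABLE lineorder (
  lo_orderkey   BIGINT NOT NULL,  -- placeholder subset of the SSB columns
  lo_linenumber INT    NOT NULL,
  PRIMARY KEY (lo_orderkey, lo_linenumber)
) ENGINE=ndbcluster
  MAX_ROWS=500000000;  -- hint used by ndb when partitioning the table
SET GLOBAL ndb_join_pushdown = ON;  -- already the default in 7.2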

Thoughts and guesses

  • sf100 is almost exactly 10x slower than sf10, so I think a reasonable guess is that sf1000 would be 10x slower than sf100
  • Star schema benchmark is an easy benchmark (compared to e.g. TPC-H)
  • I think the results look quite good
  • My guess is that scaling out (e.g. to 4 nodes, by running 2 ndbmtd on each machine) would scale well (although I didn't try)
  • Running a high write load in parallel doesn't increase query times significantly (as tested in the presentation, pages 45-46), which is quite cool (although I didn't try this time)

Load procedure for sf100

  • I manually split the 61Gb file (lineorder) into ten 6Gb files (using split(1))
  • Then I loaded these 10 files in parallel with 10 clients using load data infile (see the sketch below)
  • I disabled indexes while loading using ndb_restore --disable-indexes (from a backup made on an empty db containing just the schema definition)
  • I then rebuilt the indexes using ndb_restore --rebuild-indexes with BuildIndexThreads=15 in config.ini
(for sf10 I did nothing special)
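A hedged sketch of what one of the 10 parallel load clients ran (the file name is a placeholder; SSB's generator emits '|'-delimited rows, and depending on how the files were produced a trailing delimiter may need extra handling):

LOAD DATA INFILE '/data/lineorder.aa'  -- one of the split(1) output chunks
INTO TABLE lineorder
FIELDS TERMINATED BY '|';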

Load time for sf100

  • 4h including rebuilding of indexes

Last minute update

  • OleJohn reports that Q2.2 and Q2.3 should be 10x faster when forcing a different join order (e.g. by using STRAIGHT_JOIN or FORCE INDEX), but I haven't tried this