jonas blog: October 2011

Thursday, October 13, 2011

Brewing in MySQL Cluster 7.2.x

Admittedly, MySQL Cluster has some way to go before its monitoring becomes best in class. But we are progressing!
In 7.1 we introduced NDBINFO, an infrastructure that enables presenting information from within the cluster in SQL format.
And here are 4 new tables that are currently brewing:

ndbinfo.transactions and ndbinfo.operations

mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$transactions';
+----------------+-----------+---------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                  |
+----------------+-----------+---------------------------------+
| node_id        | int       | node id                         |
| block_instance | int       | TC instance no                  |
| objid          | int       | Object id of transaction object |
| apiref         | int       | API reference                   |
| transid        | varchar   | Transaction id                  |
| state          | int       | Transaction state               |
| flags          | int       | Transaction flags               |
| c_ops          | int       | No of operations in transaction |
| outstanding    | int       | Currently outstanding request   |
| timer          | int       | Timer (seconds)                 |
+----------------+-----------+---------------------------------+
mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$operations';
+----------------+-----------+-------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                |
+----------------+-----------+-------------------------------+
| node_id        | int       | node id                       |
| block_instance | int       | LQH instance no               |
| objid          | int       | Object id of operation object |
| tcref          | int       | TC reference                  |
| apiref         | int       | API reference                 |
| transid        | varchar   | Transaction id                |
| tableid        | int       | Table id                      |
| fragmentid     | int       | Fragment id                   |
| op             | int       | Operation type                |
| state          | int       | Operation state               |
| flags          | int       | Operation flags               |
+----------------+-----------+-------------------------------+

These two tables show currently ongoing transactions and currently ongoing operations, respectively.
ndbinfo.transactions roughly corresponds to information_schema.INNODB_TRX
ndbinfo.operations roughly corresponds to information_schema.INNODB_LOCKS
The information provided is collected without any kind of locks.
The information provided is collected by iterating internal data structures. Hence the output does not necessarily represent a state that has ever existed (i.e. it is not a snapshot).
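
Since both tables carry a transid column, a client could group operations under their transaction once the rows have been fetched. The following is only a sketch of that idea in Python: the function name, the sample rows and the transid format are made up for illustration, and the rows are assumed to arrive as dicts keyed by the column names shown above. Because the tables are not a snapshot, the sketch keeps operations whose transaction was not seen in a separate result instead of dropping them.

```python
from collections import defaultdict


def ops_per_transaction(transactions, operations):
    """Group operation rows under their transaction, keyed by transid.

    Both inputs are lists of dicts using the column names shown above.
    The two tables are read without locks and are not a snapshot, so an
    operation may refer to a transid that is missing from `transactions`.
    Such operations are returned separately as "orphans".
    """
    ops_by_trans = defaultdict(list)
    for op in operations:
        ops_by_trans[op["transid"]].append(op)

    matched = {}
    for trx in transactions:
        # pop() so that whatever remains afterwards had no matching transaction
        matched[trx["transid"]] = ops_by_trans.pop(trx["transid"], [])
    orphans = dict(ops_by_trans)
    return matched, orphans


# Made-up sample rows (hypothetical values, only a few columns filled in).
transactions = [
    {"node_id": 1, "transid": "0001.0002", "c_ops": 2},
    {"node_id": 1, "transid": "0001.0003", "c_ops": 0},
]
operations = [
    {"node_id": 2, "transid": "0001.0002", "tableid": 7, "fragmentid": 0},
    {"node_id": 3, "transid": "0001.0002", "tableid": 7, "fragmentid": 1},
    {"node_id": 2, "transid": "0001.0009", "tableid": 9, "fragmentid": 0},
]
matched, orphans = ops_per_transaction(transactions, operations)
```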

One missing piece of this puzzle is how to map an ndb transaction id to a mysql connection id.
When (if?) this information is available, one could e.g. join information_schema.processlist with ndbinfo.operations to see which locks are being held by a certain connection. (Suggestions on how to gather/expose this are welcome.)

ndbinfo.threadblocks and ndbinfo.threadstat

mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$threadblocks';
+----------------+-----------+----------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT |
+----------------+-----------+----------------+
| node_id        | int       | node id        |
| thr_no         | int       | thread number  |
| block_number   | int       | block number   |
| block_instance | int       | block instance |
+----------------+-----------+----------------+

mysql> select COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT from information_schema.columns where TABLE_NAME = 'ndb$threadstat';
+----------------+-----------+------------------------------------------+
| COLUMN_NAME    | DATA_TYPE | COLUMN_COMMENT                           |
+----------------+-----------+------------------------------------------+
| node_id        | int       | node id                                  |
| thr_no         | int       | thread number                            |
| thr_nm         | varchar   | thread name                              |
| c_loop         | bigint    | No of loops in main loop                 |
| c_exec         | bigint    | No of signals executed                   |
| c_wait         | bigint    | No of times waited for more input        |
| c_l_sent_prioa | bigint    | No of prio A signals sent to own node    |
| c_l_sent_priob | bigint    | No of prio B signals sent to own node    |
| c_r_sent_prioa | bigint    | No of prio A signals sent to remote node |
| c_r_sent_priob | bigint    | No of prio B signals sent to remote node |
| os_tid         | bigint    | OS thread id                             |
| os_now         | bigint    | OS gettimeofday (millis)                 |
| os_ru_utime    | bigint    | OS user CPU time (micros)                |
| os_ru_stime    | bigint    | OS system CPU time (micros)              |
| os_ru_minflt   | bigint    | OS page reclaims (soft page faults       |
| os_ru_majflt   | bigint    | OS page faults (hard page faults)        |
| os_ru_nvcsw    | bigint    | OS voluntary context switches            |
| os_ru_nivcsw   | bigint    | OS involuntary context switches          |
+----------------+-----------+------------------------------------------+

These two tables show which blocks currently run in which thread, and statistics per thread, respectively.

The statistics are accumulated since the data node started, so to see a trend one needs to snapshot the table and compare it with a later snapshot.
The fields starting with os_ru_ are gathered with getrusage(RUSAGE_THREAD) (or equivalent).
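
Comparing two snapshots of ndbinfo.threadstat could look roughly like the Python sketch below. It assumes each snapshot has already been fetched as a list of row dicts keyed by the column names above; the function name and the sample values are made up. Note that a data node restart between the snapshots would reset the counters and give negative deltas, which this sketch does not handle.

```python
# Cumulative counters worth diffing (column names from the table above).
COUNTERS = ["c_loop", "c_exec", "c_wait", "os_ru_utime", "os_ru_stime",
            "os_ru_minflt", "os_ru_majflt", "os_ru_nvcsw", "os_ru_nivcsw"]


def threadstat_delta(before, after):
    """Return per-thread counter deltas between two snapshots.

    Threads are matched on (node_id, thr_no). os_now is in milliseconds,
    so elapsed_ms is the wall-clock time between the two snapshots.
    """
    old = {(r["node_id"], r["thr_no"]): r for r in before}
    deltas = {}
    for row in after:
        key = (row["node_id"], row["thr_no"])
        prev = old.get(key)
        if prev is None:
            continue  # thread was not present in the first snapshot
        d = {c: row[c] - prev[c] for c in COUNTERS}
        d["elapsed_ms"] = row["os_now"] - prev["os_now"]
        deltas[key] = d
    return deltas


# Made-up sample snapshots for one thread, taken two seconds apart.
before = [dict(node_id=1, thr_no=0, os_now=1000, c_loop=10, c_exec=100,
               c_wait=5, os_ru_utime=200000, os_ru_stime=50000,
               os_ru_minflt=3, os_ru_majflt=0, os_ru_nvcsw=40, os_ru_nivcsw=2)]
after = [dict(node_id=1, thr_no=0, os_now=3000, c_loop=30, c_exec=900,
              c_wait=9, os_ru_utime=1200000, os_ru_stime=150000,
              os_ru_minflt=3, os_ru_majflt=0, os_ru_nvcsw=90, os_ru_nivcsw=3)]
delta = threadstat_delta(before, after)[(1, 0)]
```

Dividing a delta by elapsed_ms then gives a rate, e.g. signals executed per millisecond.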

Lots of numbers! And I'm honestly not quite sure how to interpret them.
A few simple rules might be that (for a non-idle cluster):

user time should be high and system should be low
involuntary context switches should be low
page faults should be low
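
The rules of thumb above could be turned into a small check, sketched here in Python. The thresholds are arbitrary placeholders and not recommendations, the function name is made up, and the input is assumed to be the difference between two readings of one ndbinfo.threadstat row. "Page faults" is interpreted here as hard page faults (os_ru_majflt), which is also an assumption.

```python
def check_thread(delta, max_sys_share=0.25, max_nivcsw_per_s=100.0,
                 max_majflt_per_s=1.0):
    """Apply the three rules of thumb to counter deltas for one thread.

    `delta` holds differences between two readings of a threadstat row:
    os_ru_utime and os_ru_stime (microseconds), os_ru_nivcsw, os_ru_majflt,
    and elapsed_ms (difference of os_now). All thresholds are illustrative.
    """
    warnings = []
    seconds = delta["elapsed_ms"] / 1000.0
    cpu = delta["os_ru_utime"] + delta["os_ru_stime"]

    # Rule 1: user time should dominate system time.
    if cpu > 0 and delta["os_ru_stime"] / cpu > max_sys_share:
        warnings.append("system time share is high")
    if seconds > 0:
        # Rule 2: involuntary context switches should be rare.
        if delta["os_ru_nivcsw"] / seconds > max_nivcsw_per_s:
            warnings.append("many involuntary context switches")
        # Rule 3: (hard) page faults should be rare.
        if delta["os_ru_majflt"] / seconds > max_majflt_per_s:
            warnings.append("many hard page faults")
    return warnings


# Made-up deltas over a two second interval.
healthy = dict(elapsed_ms=2000, os_ru_utime=1000000, os_ru_stime=100000,
               os_ru_nivcsw=1, os_ru_majflt=0)
noisy = dict(elapsed_ms=2000, os_ru_utime=100000, os_ru_stime=900000,
             os_ru_nivcsw=5000, os_ru_majflt=10)
healthy_warnings = check_thread(healthy)
noisy_warnings = check_thread(noisy)
```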

STANDARD DISCLAIMER

The exact format of the tables might (will) change before reaching a release.
It's currently unknown when (if?) they will reach a release near you.

Monday, October 10, 2011

new features in MySQL Cluster 7.2.1

AQL (aka push down join)
Further improvements and refinements compared to 7.2.0 from April
Index statistics
A long overdue feature that aims to reduce (minimize) the need for manual query tuning, which previously has been essential for efficient SQL usage with ndb.
memcache access support
Active-Active replication enhancements
Various internal limits have been increased
- Max row-size now 14k (previously 8k)
- Max no of columns in table now 512 (previously 128)
Rebase to mysql-5.5 (7.2.1 is based on mysql-5.5.15)
Improved support for a geographically separated cluster
(note: a single cluster, i.e. not using asynchronous replication)

Brief introduction to AQL (aka join pushdown)

The basic concept is to evaluate joins down in the data nodes instead of (or in addition to) in mysqld.
Ndb will examine the query plan created by mysqld, construct a serialized definition of the join, and ship it down to the data nodes.
The join is then evaluated in the data nodes, in parallel if appropriate, and the result set is sent back to mysqld using a streaming interface.
Performance gain (latency reduction) is normally in the range of 20x for a 3-way join.
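
To give a feeling for where the latency goes, here is a toy model in Python. It is not how ndb is implemented: the DataNodes class, its methods and the table layout are all invented for illustration, and the only thing it models is the number of requests mysqld has to send. Joining in mysqld costs one request per lookup, while the pushed join is shipped down once.

```python
class DataNodes:
    """Toy stand-in for the data nodes that counts requests from mysqld."""

    def __init__(self, tables):
        # "t1" is a list of rows; "t2" and "t3" map a join key to rows.
        self.tables = tables
        self.requests = 0

    def scan(self, table):
        self.requests += 1  # one round trip for the scan
        return list(self.tables[table])

    def lookup(self, table, key):
        self.requests += 1  # one round trip per lookup
        return self.tables[table].get(key, [])

    def pushed_join(self):
        self.requests += 1  # the whole join definition is shipped once
        t1, t2, t3 = self.tables["t1"], self.tables["t2"], self.tables["t3"]
        # Evaluated "inside" the data nodes: no further round trips.
        return [(r1, r2, r3)
                for r1 in t1
                for r2 in t2.get(r1["t2_key"], [])
                for r3 in t3.get(r2["t3_key"], [])]


def join_in_mysqld(nodes):
    """3-way nested-loop join driven from mysqld, one lookup at a time."""
    return [(r1, r2, r3)
            for r1 in nodes.scan("t1")
            for r2 in nodes.lookup("t2", r1["t2_key"])
            for r3 in nodes.lookup("t3", r2["t3_key"])]


# Made-up data: three rows in t1, each matching one row in t2 and t3.
tables = {
    "t1": [{"id": i, "t2_key": i} for i in (1, 2, 3)],
    "t2": {i: [{"id": i, "t3_key": i * 10}] for i in (1, 2, 3)},
    "t3": {i * 10: [{"id": i * 10}] for i in (1, 2, 3)},
}
plain = DataNodes(tables)
rows_plain = join_in_mysqld(plain)
pushed = DataNodes(tables)
rows_pushed = pushed.pushed_join()
```

Both variants return the same rows; only the request count differs (plain.requests versus pushed.requests). Real savings depend on batching, parallelism and data distribution, none of which this toy captures.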

Brief introduction to Index statistics

Index statistics work a lot like InnoDB persistent statistics.
When you execute analyze table T, data nodes will scan the indexes of T and produce a histogram of each index.
This histogram is stored in tables in ndb (mysql.ndb_index_stat_head and mysql.ndb_index_stat_sample). The histogram can then be used by any mysqld connected to this cluster. The histogram will not be regenerated until a new analyze table T is requested.
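
For readers unfamiliar with index histograms, the general idea can be sketched in a few lines of Python. This is a generic equi-depth histogram and says nothing about the actual layout of mysql.ndb_index_stat_sample; the function names are made up. The optimizer-side use is to estimate what fraction of an index falls in a range without touching the data.

```python
from bisect import bisect_right


def build_histogram(keys, buckets):
    """Equi-depth histogram: the largest key in each of `buckets` slices.

    Each slice holds (roughly) the same number of index entries.
    """
    ordered = sorted(keys)
    n = len(ordered)
    if n == 0:
        raise ValueError("cannot build a histogram from an empty index")
    buckets = min(buckets, n)  # never more buckets than entries
    return [ordered[(i * n) // buckets - 1] for i in range(1, buckets + 1)]


def estimate_fraction_le(bounds, value):
    """Estimate the fraction of index entries with key <= value.

    Only whole buckets are counted, so the estimate is coarse and can
    be low by up to one bucket.
    """
    full_buckets = bisect_right(bounds, value)
    return full_buckets / len(bounds)


# Made-up index with keys 1..100, summarized into 4 buckets.
bounds = build_histogram(range(1, 101), 4)
half = estimate_fraction_le(bounds, 50)
everything = estimate_fraction_le(bounds, 100)
```

Multiplying such a fraction by the table's row count gives the kind of row estimate a query optimizer needs when choosing between indexes.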

Brief introduction to Active-Active enhancements

MySQL Cluster has supported active-active asynchronous replication with conflict detection and conflict resolution since 6.3.
In prior versions, the schema had to be modified by adding a timestamp column to each table, and the application had to be modified to maintain this timestamp column.
In this new version, no schema modification is required and no application modification is needed.
In previous versions, conflict detection/resolution was performed on a row-by-row basis.
In this new version, transaction boundaries are respected.
E.g. if a row R is determined to be in conflict, not only this row change will be resolved,
but the entire transaction T that modified the row will be resolved, as will all transactions depending on T transitively.
Longer descriptions can be found here and here
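
The "transitively" part of the transaction-level resolution above amounts to a graph walk, sketched here in Python. How dependencies between transactions are actually tracked is not covered in this post, so the depends_on mapping, the function name and the transaction ids are purely hypothetical inputs for illustration.

```python
from collections import deque


def transactions_to_resolve(conflicting, depends_on):
    """Return the conflicting transactions plus all transitive dependents.

    `depends_on` maps a transaction id to the set of transaction ids it
    depends on (a hypothetical input format).
    """
    # Invert the mapping: transaction -> transactions that depend on it.
    dependents = {}
    for trx, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(trx)

    # Breadth-first walk from the transactions found to be in conflict.
    resolved = set(conflicting)
    queue = deque(conflicting)
    while queue:
        trx = queue.popleft()
        for child in dependents.get(trx, ()):
            if child not in resolved:
                resolved.add(child)
                queue.append(child)
    return resolved


# Made-up example: T2 depends on T1, T3 on T2; T4 and T5 are unrelated.
depends_on = {"T2": {"T1"}, "T3": {"T2"}, "T4": set(), "T5": {"T4"}}
to_resolve = transactions_to_resolve(["T1"], depends_on)
```

Here a conflict on a row modified by T1 pulls in T2 and T3 as well, while T4 and T5 are left alone.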

Sorry for omitting hex dumps and/or unformatted numbers

Mandatory late update: the join described here has now gained an extra 2x (but this improvement did not make 7.2.1)
