jonas blog: 2008

Friday, December 5, 2008

shortstat

i ran some simple stats on mysql releases


*** mysql-4.1..mysql-5.0
commits: 7790
diffstat: 4424 files changed, 1271855 insertions(+), 199555 deletions(-)
 /sql      255 files changed,  129725 insertions(+),  41964 deletions(-)
 /test    1509 files changed,  954053 insertions(+),  15940 deletions(-)

*** mysql-5.0..mysql-5.1
commits: 6411
diffstat: 10244 files changed, 2172077 insertions(+), 1349098 deletions(-)
 /sql       243 files changed,  151483 insertions(+),   69799 deletions(-)
 /test     3862 files changed, 1258333 insertions(+),  206729 deletions(-)

*** mysql-5.1..mysql-6.0
commits: 3546
diffstat: 3574 files changed, 679669 insertions(+), 82131 deletions(-)
 /sql      226 files changed,  63619 insertions(+), 16469 deletions(-)
 /test    1772 files changed, 292884 insertions(+), 33553 deletions(-)
 /storage 1184 files changed, 281298 insertions(+),  9179 deletions(-)

*** mysql-5.1..mysql-5.1-telco-6.2
commits: 761
diffstat: 841 files changed, 75105 insertions(+), 37063 deletions(-)
 /sql      72 files changed,  8673 insertions(+),  5544 deletions(-)
 /test    326 files changed, 12269 insertions(+), 18376 deletions(-)
 /storage 396 files changed, 52983 insertions(+), 12434 deletions(-)

*** mysql-5.1-telco-6.2..mysql-5.1-telco-6.3
commits: 347
diffstat:  455 files changed, 27372 insertions(+), 10033 deletions(-)
 /sql       39 files changed,  3471 insertions(+),   740 deletions(-)
 /test     215 files changed,  8735 insertions(+),  2251 deletions(-)
 /storage  182 files changed, 14990 insertions(+),  7031 deletions(-)

*** mysql-5.1-telco-6.3..mysql-5.1-telco-6.4
commits: 582
diffstat: 733 files changed, 73161 insertions(+), 30912 deletions(-)
 /sql      12 files changed,   472 insertions(+),   250 deletions(-)
 /test     48 files changed,  1151 insertions(+),   218 deletions(-)
 /storage 622 files changed, 70267 insertions(+), 30010 deletions(-)

--- conclusions

none

--- how to get git copy of mysql repository

git clone git://ndb.mysql.com/mysql.git

--- script that produces stats


#!/bin/sh

R="mysql-4.1..mysql-5.0 mysql-5.0..mysql-5.1 mysql-5.1..mysql-6.0 mysql-5.1..mysql-5.1-telco-6.2 mysql-5.1-telco-6.2..mysql-5.1-telco-6.3 mysql-5.1-telco-6.3..mysql-5.1-telco-6.4"

for i in $R
do
        echo "*** $i"
        # count non-merge commits via their "Author:" header line
        echo "commits: `git log --no-merges $i | grep -c '^Author:'`"
        echo "diffstat: `git diff --shortstat $i`"
        echo " /sql     `git diff --shortstat $i -- sql/`"
        echo " /test    `git diff --shortstat $i -- mysql-test/`"
        # ranges involving mysql-5.0 are printed without a /storage line
        if [ -z "`echo $i | grep -F mysql-5.0`" ]
        then
        echo " /storage `git diff --shortstat $i -- storage/`"
        fi
        fi
        echo
done

Tuesday, November 25, 2008

950k reads per second on 1 datanode

i spent last night adding 75% of the next step for our multi-threaded datanode.
and got new numbers...
the config is the same as earlier post, with the exception that
MaxNoOfExecutionThreads=8

flexAsynch -ndbrecord -temp -con 4 -t 16 -p 312 -a 2 -l 3 -r 2
insert average: 461584/s min: 451928/s max: 474254/s stddev: 2%
update average: 533083/s min: 530950/s max: 537351/s stddev: 0%
delete average: 564388/s min: 559265/s max: 567143/s stddev: 0%
read average: 948954/s min: 937288/s max: 959262/s stddev: 0%

also tried using SCI instead of gigabit ethernet
flexAsynch -ndbrecord -temp -con 4 -t 16 -p 256 -a 2 -l 3 -r 2
insert average: 568012/s min: 550389/s max: 578367/s stddev: 2%
update average: 599828/s min: 598480/s max: 602175/s stddev: 0%
delete average: 614036/s min: 612440/s max: 616496/s stddev: 0%
read average: 1012472/s min: 1003429/s max: 1024000/s stddev: 0%

i.e with SCI the 1M reads/sec limit is reached! (on 1 datanode)
i think this should also be achievable on ethernet by adding some
more optimizations (let api-application start transactions directly
on correct TC-thread)

---

comments:
1) the new "feature" is multi threading the transaction coordinator
aka MT-TC

2) this part will likely not make the mysql cluster 6.4.0-release

3) our multi-threading architecture seems promising,
in less than a month i managed to double the throughput
(in an admittedly unrealistic benchmark, but still)

4) the 25% missing from the current patch is node-failure handling
and a "rcu-like" lock which will be used for reading/updating distribution
(it's read for each operation, and updated during node-failure, node-recovery and
online table repartitioning)
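
for comment 3, a small sketch that puts a number on the speedup. it divides the read averages of this post (948954/s on ethernet, 1012472/s on SCI) by the read average of the Nov 3 post (537178/s). the function name speedup is made up, and the runs used different flexAsynch options and thread settings, so this is only a rough comparison.

```sh
#!/bin/sh
# speedup OLD NEW: print NEW/OLD with two decimals
speedup() {
        LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

# read averages: Nov 3 post vs this post
echo "ethernet: $(speedup 537178 948954)x"
echo "sci:      $(speedup 537178 1012472)x"
```

this prints 1.77x for ethernet and 1.88x for SCI.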

Wednesday, November 5, 2008

700k reads per second on 1 datanode

added multi connect to flexAsynch, got new numbers
everything else same as previous post

[jonas@n1 run]$ flexAsynch -ndbrecord -temp -con 2 -t 16 -p 512 -l 3 -a 2 -r 2
insert average: 360679/s min: 346150/s max: 370075/s stddev: 2%
update average: 373349/s min: 372465/s max: 374132/s stddev: 0%
delete average: 371014/s min: 357043/s max: 378523/s stddev: 2%
read average: 731042/s min: 702211/s max: 760631/s stddev: 2%

Monday, November 3, 2008

500k reads per second on 1 datanode

just did some benchmarking on multi-threaded ndbd (binary called ndbmtd)
that is in the coming 6.4 release.

quite happy with results

--- results

[jonas@n1 run]$ flexAsynch -ndbrecord -temp -t 8 -p 512 -r 5 -a 2
insert average: 374200/s min: 374200/s max: 374200/s stddev: 0%
update average: 370947/s min: 370947/s max: 370947/s stddev: 0%
delete average: 395061/s min: 395061/s max: 395061/s stddev: 0%
read average: 537178/s min: 531948/s max: 543092/s stddev: 0%

---

this flexAsynch command will run with
- 8 threads
- 512 parallel transactions per thread
- 8 byte records.
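
a minimal sketch of what these flags mean for the load: threads times parallel transactions per thread gives an upper bound on transactions in flight. it is compared with MaxNoOfConcurrentTransactions from the configuration in this post, assuming each parallel transaction counts once against that limit.

```sh
#!/bin/sh
# numbers taken from the flexAsynch command line and the configuration in this post
threads=8        # -t 8
parallel=512     # -p 512
limit=16384      # MaxNoOfConcurrentTransactions

# upper bound on transactions in flight at the same time
inflight=$((threads * parallel))
echo "in flight: $inflight of $limit"
```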

note: during the reads, the datanode was *not* maxed out.

---

this was run on two identical computers,
2-socket, 4 cores per socket Intel(R) Xeon(R) CPU X5355 @ 2.66GHz

api-program was running on computer 1 (n1)
datanode was running on computer 2 (n2)

--- configuration

[cluster_config]
DataMemory=2000M
IndexMemory=150M

SendBufferMemory=8M
ReceiveBufferMemory=8M
LongMessageBuffer=64M

NoOfReplicas=1
ndb_mgmd=n1
ndbd=n2
mysqld=n1,n1,n1,n1
Diskless=1
MaxNoOfExecutionThreads=6
MaxNoOfConcurrentTransactions=16384

Thursday, October 16, 2008

forks, add-on patch-sets and features

so far little is happening in this area with MySQL Cluster.
would be interesting to get patches to cluster from a(ny) (huge-web) company...
wonder if that will ever happen...
maybe we don't use enough buzz-words

---

it could also be that we add features in a high enough pace ourselves,
preliminary benchmarks of our multi-threaded ndbmtd (4 threads)
show up to 3.7 times better throughput than single-threaded ndbd.

Monday, September 15, 2008

create(drop) node(group) pushed

this friday.

no (currently :) known bugs

Tuesday, September 2, 2008

end of think-period

today, I think I finally cracked how to create(drop) a nodegroup.
basic concept is to
- temporarily block gcp
- create(drop) the node group
- unblock gcp

(the same concept is btw used for adding a starting node to gcp)
the block should last for microseconds

now it only remains to implement it...

---

very happy that I now know how to proceed,
I've spent quite a lot of time trying to figure out a 100% safe
way of doing it...(wo/ blocking gcp)
but this solution will be efficient and fairly easy to implement.
(if any protocol dealing with (multi)node-failures can be considered easy)

Saturday, August 30, 2008

status of create/drop node(group)

status:
create/drop nodegroup now works, with one noticeable exception:
replication can't be connected while the nodegroup is added.
i'll try to find time to fix this next week.

howto:
- start a 2 node cluster
- create table T1
- stop ndb_mgmd, add 2 nodes, start ndb_mgmd
- either stop the 2 running nodes and restart all 4
or rolling restart the 2 running nodes, and then start the 2 new nodes
- ndb_mgm> create nodegroup n1,n2
- alter table T1 add partition partitions 2

Tata! fully online scaling of the cluster

howto backwards:
- drop table T1
- ndb_mgm> drop nodegroup X
- ndb_mgm> n1,n2 stop -a
- stop ndb_mgmd, remove the 2 nodes, start ndb_mgmd
- either stop the 2 running nodes and restart them
or rolling restart them

(A nodegroup is allowed to be dropped if it does not contain any data)

side effect:
- I added the possibility to specify nodegroups per node in the config-file
(this I intend to use for testing, but maybe someone might find it interesting)

future:
- magnus is working on "online configuration change" in the ndb_mgmd
once this is complete/functional, we can add the "add node"-command
so that the entire procedure can be done wo/ node restarts.

---

Friday, August 15, 2008

In pain

I miss bk

Friday, July 4, 2008

summer months

june: customer issues and bugs
july: vacation
august: must complete add node (I haven't started, but work has been done by stewart)

----

No need to comment this...
if you want to make me happy, choose one of the earlier posts

Sunday, June 8, 2008

boom-tjackalack! table-reorg is pushed

so...now table-reorg is in 6.4.
pushbuild found a few problems...that are fixed.

what is left:
1) detailed test-prg (which will check consistency after each step, by pausing schema-trans)
2) handling of cluster-crash during reorg
only way right now is to restore a backup if you get a crash during reorg
3) node failure during reorg might cause SUMA to not scan some fragments
(this bug is an old one, existing in 4.1, that also affects unique index build)
4) reorg-abort (in a certain state) leaves REORG_MOVED bit on records,
causing subsequent reorgs (to different partitioning) to create inconsistent data.

Not too bad...
I do however think it's quite testable (although maybe not extremely interesting wo/ add node)

Will start on add-node...and fix problems above in parallel

Thursday, June 5, 2008

almost push-time

I've now:
- fixed error handling (although testing is still not 100%)
- pushed the grand unified table state patch
- pushed a few patches in the series...

No one commented asking for a snapshot,
so i decided to push into 6.4 instead.

Will just spend some more time testing/cleaning up...

response to comment with questions

1) Which operations can I perform during a table reorg?
everything except DDL and node restart
ndb does currently only allow one DDL at a time, and the reorg is a DDL
ndb does currently prevent node restart while DDL is ongoing

2) What happens to an ongoing table reorg during
2a) node failure
reorg will be completed or aborted depending on how far it has progressed
(i.e if commit has been started)
2b) cluster failure, and recovery?
reorg will be completed or aborted depending on how far it has progressed
(i.e if commit has been written)

The reorg is committed after rows have been copied, but before rows have been
deleted/cleaned up

3) How do my a) SQL b) NDBAPI applications have to be changed to cope with table reorg?

Not at all, but
- your application can "hint" incorrectly if it does not check table state
and refresh it after reorg has been committed
- your application might encounter temporary errors due to the reorg,
this error is the same that you can get during a node restart, so no special
handling of this is needed.
And hopefully the temporary errors should be rare (testing will show...)
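
A minimal sketch of the kind of retry loop that already covers this. The function name retry is made up, and it treats every non-zero exit as a temporary error, which is an assumption for illustration; a real ndbapi program would look at the error it got back before retrying.

```sh
#!/bin/sh
# retry TRIES CMD...: run CMD until it succeeds, at most TRIES times
retry() {
        tries=$1
        shift
        attempt=1
        while [ "$attempt" -le "$tries" ]
        do
                "$@" && return 0        # success, stop retrying
                attempt=$((attempt + 1))
        done
        return 1                        # still failing after TRIES attempts
}

# "true" stands in for the real operation
retry 3 true && echo "ok"
```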

4) How can I trade off the duration of a reorganisation against its resource impact (CPU, Memory, Bandwidth etc.)

Currently you can't. speed is hard-coded. this will maybe be a future feature

5) What performance impact does re-org have on ongoing DML and query operations?

Don't know yet, not enough testing. debug-compiled versions that I tested gave maybe 5-10% impact. (there is also another optimization that I want to do...which will reduce the impact)

6) What impact does re-org have on DDL operations?
Ongoing none, cause we only support one at a time.
And the re-org will prevent other DDL from starting while it's running

7) Will there be some easy way to re-org all cluster tables to balance across all available nodes?

write a stored procedure that lists all tables, and reorgs them one by one.
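
The same loop can also live outside the server. A minimal sketch as a shell function that only prints the statements, using the ALTER ONLINE TABLE ... ADD PARTITION PARTITIONS N syntax from the May 23 post. The function name reorg_statements and the table names t1 and t2 are made up for illustration.

```sh
#!/bin/sh
# reorg_statements N TABLE...: print one ALTER per table, adding N partitions
reorg_statements() {
        n=$1
        shift
        for t in "$@"
        do
                echo "ALTER ONLINE TABLE $t ADD PARTITION PARTITIONS $n;"
        done
}

# t1 and t2 are made-up table names
reorg_statements 2 t1 t2
```

The output can be reviewed first and then piped into the mysql client, which runs the statements one after the other.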

8) How are indexes modified during table re-org?
ordered indexes are reorganised together with base table
unique indexes are currently untouched (this should probably change)

9) Which parts of the re-org are serial, and which are parallel?
Same as all other schema-transactions after wl3600.
I.e each operation-step is run parallel on each node,
but only one operation-step is run at a time.

This means that e.g copy and "cleanup" is run in parallel on all nodes.
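
A minimal sketch of that execution pattern, with background jobs standing in for nodes and wait acting as the barrier between steps. The function names run_step and run_all and the node count of 4 are assumptions for illustration; this is not how the kernel does it, only the ordering it gives.

```sh
#!/bin/sh
# run_step STEP NODE: stand-in for the work one node does in one step
run_step() {
        echo "$1 on node $2"
}

# run_all: each step runs on all nodes at once, steps run one at a time
run_all() {
        for step in copy cleanup
        do
                for node in 1 2 3 4
                do
                        run_step "$step" "$node" &
                done
                wait    # barrier: all nodes finish this step before the next starts
        done
}

run_all
```

The order of the nodes within a step can vary between runs, but every copy line is printed before the first cleanup line.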

10) Can I perform an online upgrade to a version of MySQL Cluster that supports re-org?

yes,

11) Can I restore a backup from an old version of MySQL Cluster and get online re-org features?

yes,

12) What are the down sides of this table re-org implementation?

none :-)
but there are some areas for improvement

13) Can re-org cope with heterogeneous NDBD nodes with different DataMemory capacities?

In the kernel, yes, but there is no SQL interface currently to expose this

14) How can I look at hash result to fragment id mapping tables?

Using a hand-written ndbapi program
(maybe will add this to ndb_desc)

---

Puh...
that comment held so many questions...
that maybe i should not be asking for more comments...

Friday, May 23, 2008

alter online table T add partition partitions N

I now enabled SQL-interface to table-reorg.

The syntax (which is the same for other partition mgm) is
ALTER ONLINE TABLE T ADD PARTITION PARTITIONS N;

Also switched so that hashmap partitioning is used for all tables created using SQL.
And mysql-test-run works (including a new ndb_add_partition-test)
(except for some range/list partition testcases)

it's still kind of fragile. Error handling is sparse...

there are 3 known things which are easy to fix
- ndbapi transaction hinting/pruning does not work after/during a reorg
- unique indexes will not work after/during a reorg
- only 1 reorg per table is possible (SUMA caches distribution information incorrectly)

and one quite hard
- cluster crash *during* table-reorg

current plan is
1) fix 3 easy known problems
2) fix error handling
3) write detailed test program (pushed back again!)

Tuesday, May 20, 2008

wl3600++ complete

I've coded and pushed wl3600++ to telco-6.4
- lots of code simplification
- lots of "duplicate" code removal

Also merged it into table-reorg clone.
And now system restart just started magically working.
So I can again run mysql-test-run

Will now fix any problems found by mysql-test-run.pl

---

Wednesday, May 7, 2008

wl3600++ clarification

Just one thing...

it's obvious that the only long term correct solution
is to add a schema-log instead of the schema-file
but the rules/framework that i'm developing will be
so that a change like that is only (almost) on transaction level,
i.e operations will be almost unchanged...

however,
the schema-log will likely not happen this year, next year or the year after next...(my guess)

Just wanted to clarify...
(that i'm not an idiot, well at least not a complete one :-)

wl3600++

Have now agreed on a way forward with pekka on wl3600
How to handle SchemaFile, batching & completeness needed for table-reorg.
What operations/transaction shall (not) do in different stages

Having a consistent model feels great!
(compared to the evolutionary mess that is present today)

Now I only need to implement it...

---

I now have at least two distinct readers (comments from 2 persons)...
I'm blogging my way into fame

Wednesday, April 30, 2008

Core functionality(reorg) complete!

transactions + scan + replication works!

Now remaining:
- error handling
- detailed unit testing
- durability of HashMap
- fix schema trans restart on SR
- fix unique index
- ndbapi interface to HashMap
- optimize COPY by using new operation(ZMOVE)
which creates less load and interacts better with replication
(currently the COPY produces events which is "correct" but not optimal)
- sql-support

And of course add-node :)

---

Personally I think this post deserves at least one comment...

Tuesday, April 29, 2008

Core functionality(reorg) almost done

Core functionality is almost complete,
- all operations in place, running in correct order
- all transactions work correctly (with all synchronization)

Remaining is "only" 2 local functions, related to SUMA switchover.

---

Need to spend serious time on a contest to add to this blog,
to get more comments!

Thursday, April 24, 2008

Eureka - SUMA switch over for table-reorg

First it struck me that
- starting to double send events can be done wo/ synchronization, cause
1) this is basic technique for node-failure handling
2) it does not matter if new-fragment does not contain full epoch, as it will contain
last part

Then (today) it struck me that
- turning off double send on old "home" for row does not require synchronization either
(except doing it on epoch-boundary) but different fragments can do it at different times, cause
- there is already double send ongoing, so no events will be lost

This makes the task relatively straightforward, but the following is still needed,
- replication triggers must be turned off at epoch boundary
- replication triggers must be turned off 3 epochs after "turn off" has been initiated

- all of above means that it can be handled per node...(or maybe node group...)
- have to think more about potential per node group synchronization though...

Hope I'm right! Will discuss @office, to see if anyone can find any holes
(including me, as I got the idea today)

---

Still only one comment...I think I need to add a contest or something

Monday, April 21, 2008

What is table-reorg

Table-reorg is the procedure which will be executed on "alter table X add partitions Y" .
This you would typically do when you have added Y new nodes to your cluster.

The procedure is online, i.e transactions can run during the operation
and no extra memory will be needed on the "old" nodes.

The reorg is based on linear hashing (but wo/ the normal skewness in distribution)
E.g when going from 2 to 4 partitions 50% of the rows will be moved.

The copying is done in parallel on the "old" nodes,
and consistency is kept using triggers.
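
The 2 to 4 case can be checked with a small sketch. It assumes a row hashes to one of 8 buckets and that a map sends each bucket to partition "bucket modulo number of partitions". The bucket count, the modulo map and the function name show_moves are assumptions for illustration, not necessarily what ndb uses.

```sh
#!/bin/sh
BUCKETS=8       # assumed number of hash buckets

# show_moves OLD NEW: print every bucket whose partition changes
# when going from OLD to NEW partitions
show_moves() {
        b=0
        while [ "$b" -lt "$BUCKETS" ]
        do
                from=$((b % $1))
                to=$((b % $2))
                if [ "$from" -ne "$to" ]
                then
                        echo "bucket $b: partition $from -> $to"
                fi
                b=$((b + 1))
        done
}

show_moves 2 4
moved=$(show_moves 2 4 | grep -c bucket)
echo "moved: $moved of $BUCKETS buckets"
```

Half of the buckets move, and every move goes from an old partition (0 or 1) to a new one (2 or 3), which fits with the old nodes not needing extra memory.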

---

So I have at least 1 reader...

Wednesday, April 16, 2008

Transactions now work!

- Transactions now work correctly, both pk/uk and table/index scan
- I have decided how to do testing (single-step through reorg)

Fixing schema-transaction seems like a must now (for 6.4)
- schema-file flushing
- complete phase
Maybe I can come up with something else to do first...

---

Still no comments on my blog...wonder if I have any readers

Friday, March 28, 2008

Assorted notes

- transaction consistency (big part is testing)
- fix schema trans complete phase (to 6.4 directly)
- durability of objects
- fix schema trans restart on SR
- fix unique index
- sql
- automagic HashMap creation

---

Will likely next build testing framework...for transaction consistency

table-reorg plan/progress II

1) hashmap (done)
2) add partitions (done)
3) reorg-triggers (done)
4) reorg-copy (done)
5) reorg-delete (done)
6) consistent scan (partially done)

---

Held live-demo for office audience

Thursday, March 27, 2008

table-reorg plan/progress

1) hashmap (done)
2) add partitions (done)
3) reorg-triggers (done)
4) reorg-copy (done)
5) reorg-delete
6) consistent scan

done = runnable != complete

---

just "finished" reorg-copy, still needs polishing...

Wednesday, March 26, 2008

Announce

Just started...we'll see if I use it