MySQLBlog

John David Duncan on Data & Databases

My Letter to Congress

To Congresswoman Speier, Senator Feinstein, and Senator Boxer:

The NSA is out of control. It has finally become clear how little its leaders respect the society they are supposed to be securing. It needs new leaders and a clear new mandate.

In 1993, I was a kid with a degree in linguistics, a pile of student loans, and very few job skills. I often couldn’t pay my rent. I had been to some of this country’s best schools, but I was not sure its economy would ever have a place for me. Whenever I read about young men who have gone seriously astray (even violent ones), I think back to these days. Luckily for me, the internet boom came along when it did, and brought me a chance to use my wonderful education, have fun, and become successful. 20 years later, I have a wonderful wife and daughter, we own a house in San Mateo County, and I can look forward to more adventures in the second half of my career.

But that career only exists because the public trusts the privacy of messages and the security of commerce on the internet. As an industry, we promise that SSL and HTTPS are private and secure. Your own congressional web sites use HTTPS to maintain privacy for constituent messages, and your campaign web sites use HTTPS to secure credit card numbers. But now everyone can see that those promises are hollow. When the NSA insists on stealing the keys, the foundation crumbles, and the whole edifice begins to fall.

On open networks, cryptography is like an envelope; without it, every message is visible to all, as if written on a post card. Would you run our postal service without envelopes?

National security, of course, is also vital. But our constitution defines the limits on the pursuit of security. Rather than respect those constitutional limits, the NSA has chosen to operate in secret, knowing that some day the secrets might come out and come before a judge.

General Alexander must resign as NSA director and the integrity of our private communications and commerce must be restored. Please join me in insisting on this.

Sincerely,

John David Duncan

MySQL Cluster and Node.JS

Here are some things to know about our new Node.JS connector:

  • It provides an easy-to-use JavaScript API for data management. Developers write JavaScript, not SQL.

  • If the data is stored in MySQL Cluster, that JavaScript is turned into fast NDB API calls sent directly to the NDB data nodes.

  • You can also use the connector with data stored in InnoDB or some other engine. In this case, our implementation will generate SQL statements from your API calls and send them to MySQL using node-mysql, the all-JavaScript MySQL client built and maintained by Felix Geisendörfer. But we don’t just generate arbitrary SQL and trust the optimizer to make it work. We know where the indexes are, and we require you to use them.

  • Because node.js is still a rapidly moving target, we don’t build a binary connector tied to any particular version of node into the MySQL Cluster 7.3 binary package. Instead, from an installed 7.3 tree, you cd to share/nodejs, run “node configure” there, and type in the pathname to the root of your install tree (i.e. two levels up from share/nodejs).

  • We also actively maintain the code on github, at github.com/mysql/mysql-js.

  • Cluster 7.3 is not required; the code on github works with any recent release of MySQL Cluster, and the SQL path works with any recent MySQL.

Blog Migration

The machine that ran this blog gave up the ghost. It was a “mini tower” sitting under a desk, and the boot disk failed. I don’t really need the machine any more, and I was too busy working to get mysql-js out the door to spend any time rebuilding it, so the site disappeared off the internet for a while.

This week we officially released MySQL Cluster 7.3, and I decided to revive my blog. Which brought up some decisions: keep it on the MySQL + Apache + PHP + Serendipity stack that it was running on? Keep hosting it from my house? Move it to a cloud provider? (Which one?) Just use Tumblr?

So, I migrated off of the old platform. I’m writing this post in Markdown and then using Octopress to build the web pages. For now, the blog is still being served from my own hardware — a low-powered Soekris machine with only flash for storage.

Moving from a dynamic content system like Wordpress or S9Y to a static one means giving up pings and trackbacks. That’s not a big deal. But it also means outsourcing your comments, and I haven’t settled on how to do that yet.

A Sequence Generator

The memcache protocol has an INCR command, which atomically increments a stored value. MySQL Cluster is able to take simple operations (like incrementing a counter) and “push them down” to run on a data node with a lock held. Acquire the lock, increment the value, read it, and release the lock.

Originally in memcached, the incrementing sequence number was local to one memcache server, and was also volatile. Whenever the memcached server restarted, the sequence would be reset. Today, though, we can put cluster and memcached together, and instead of a volatile local sequence number, we get a durable and shared one.

This is a very simple application, and I’m going to walk through it from start to finish. In my example I’ll use MySQL Cluster 7.2.4 and the memcached ASCII protocol. There is a minor (but annoying) bug, already fixed for 7.2.6, that prevents this from working with the binary protocol in 7.2.4. So keep that in mind, and grab 7.2.6 or later once it is released.

I’ll start with a newly-installed MySQL Cluster 7.2.4, and use the sandbox script to get a cluster running quickly.

% cd share/memcache-api/
% ./sandbox.sh start 

Then I’ll make sure everything really started. (The commands to bring up a new cluster are slow, and sometimes in the sandbox script one of them will time out. If it does, you can just run “sandbox.sh start” a second time.)

%  ./sandbox.sh status
[ndb_mgmd]   running
[ndbd]       running
[mysqld]     running
[memcached]  running

It all looks good. Now I’ll go into mysql and create a sequence table. The table is just two columns: a sequence name, and the 64-bit unsigned sequence value.

% mysql -u root ndbmemcache
   CREATE TABLE test_sequence (   
     name varchar(12) PRIMARY KEY NOT NULL,
     value bigint unsigned NOT NULL DEFAULT 0
   ) engine = ndbcluster ;
    Query OK, 0 rows affected (0.05 sec)

I created that table in the “ndbmemcache” database for convenience, but if I were actually building a production system, I wouldn’t put it there. In a minute I’ll create a containers record to tell memcached the database name and table name. But first, while I’m here, I’ll create a sequence called “docId”:

INSERT into test_sequence (name) values ("docId") ;

In order for memcached to use this table, I will have to create a container record for it, and then assign the container to a key prefix. That takes these two statements.

INSERT INTO containers 
 SET name = 'seq_tab', 
     db_schema = 'ndbmemcache', 
     db_table = 'test_sequence', 
     key_columns = 'name', 
     value_columns = 'value',
     increment_column = 'value' ;
INSERT INTO key_prefixes   
        VALUES(0, 'seq:', 0, 'ndb-test', 'seq_tab');

This completes the configuration. Now if I touch the timestamp on the server role, the memcached server will reload the config tables. So in my mysql window, I do this:

UPDATE memcache_server_roles
   SET update_timestamp = now() where role_id = 0 ;

And over at the memcached window, I see some log messages confirming that this did the trick:

Received update to server role default_role
Retrieved 5 key prefixes for server role "default_role".
The default behavior is that: 
  GET uses NDB only
  SET uses NDB only
  DELETE uses NDB only.
The 4 explicitly defined key prefixes are "b:" (demo_table_large), 
    "mc:" (), "seq:" (test_sequence) and "t:" (demo_table_tabs)
ONLINE RECONFIGURATION COMPLETE

Notice that one of the 4 key prefixes is now “seq:”, and its container table is test_sequence.

Now it’s simple enough to telnet to the memcache server and verify that the sequence really works. I’ll open a connection:

% telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.

Then I type an INCR command:

incr seq:docId 1

and memcached responds with a new value, 1. We could go on like this all day …

incr seq:docId 1
2
incr seq:docId 1
3
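
In SQL terms, each of those INCR commands behaves like the small transaction sketched below, except that the NDB engine executes it as a single operation pushed down to the data node that holds the row, acquiring the lock, incrementing, reading, and releasing in one round trip (this is only a sketch of the semantics; no SQL is actually generated):

BEGIN;
UPDATE test_sequence SET value = value + 1 WHERE name = 'docId';
SELECT value FROM test_sequence WHERE name = 'docId';   -- the number returned to the memcache client
COMMIT;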

But I’d rather see this work under some concurrent load. I’ll use a Ruby script. I don’t actually use Ruby very much, but I got a concise and readable Ruby script from a customer with a bug report a few weeks ago, and started adapting it for other things. So this is a descendant of that same script. It’s basically a ten-second fork bomb: it forks off 8 processes (enough to keep my laptop busy) and pounds the memcached server with INCR requests for ten seconds. After it runs, the value of the counter will tell us how many requests it made.

First, though, there’s a dependency (a memcache client).

% gem install memcache-client
Fetching: memcache-client-1.8.5.gem (100%)
Successfully installed memcache-client-1.8.5
1 gem installed

Once that’s taken care of, here’s the script. I saved this as seqtest.rb.

child_procs = 8
end_time = Time.new + 10            # run for ten seconds

child_procs.times do |i|
  Process.fork do
    require 'memcache'
    mc = MemCache.new 'localhost:11211'
    while mc
      # Each INCR is an atomic increment, pushed down to the NDB data node.
      v = mc.incr('seq:docId', 1)
      exit 0 if end_time - Time.new <= 0
    end
    exit 1
  end
end

# Wait for all eight children and report whether any of them failed.
if Process.waitall.any?{ |pid,stat| stat.exitstatus != 0 }
  puts "There was an error."
else
  puts "All OK."
end

Let’s run it and see how many increments happened.

% ruby seqtest.rb 
All OK.

mysql> select * from test_sequence;
+-------+-------+
| name  | value |
+-------+-------+
| docId | 37671 |
+-------+-------+

So my little laptop was serving about 3,700 INCR requests per second.

That’s all there is to it. If your app happens to want two document IDs at the same time, it can do mc.incr(“seq:docId”, 2) and use both the return value n and n-1.

Again, in 7.2.4 you need to stick to the ASCII protocol, but the binary protocol works starting in 7.2.6. In the Ruby example, this means you can substitute the high-performance binary client, Dalli, in place of memcache-client.

Also, in 7.2.4, with the sequence table created as shown here, you cannot read the current value without incrementing it: a plain GET command does not work. As a workaround, you have to use an “INCR by 0” to get the current value. That bug has also been fixed for 7.2.6.
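
For example, continuing the telnet session above, an increment of zero simply returns the current counter value:

incr seq:docId 0
37671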

Storing Long Values With NDB and Memcache

With MySQL Cluster 7.2 we introduced native support for the Memcache protocol. From the very start of the project we knew there would be some challenges in storing large values — “blobs” that are bigger than the maximum NDB row size — from the Memcached server. Coincidentally, in 7.2 we also increased this maximum row size from 8,000 bytes to 14,000 bytes. That helped a little bit when the first preview release became available back in September, when there was still no way to access blobs through memcached.

We had always wanted the memcached server to be event-driven and to use asynchronous APIs on the back end, but we had no asynchronous API for blob access. Reading a blob is a multi-step operation: it requires first reading a header row, and then a series of blob parts from a (hidden) parts table. The existing blob implementation, known as NdbBlob, encapsulates all of this into a single synchronous, blocking API call.

What’s slightly worse about NdbBlob, though, is the way it constructs the primary key of the parts table. That key has two parts: first the primary key from the main table, and then the part number. This decision optimizes in favor of main tables whose primary keys are small integers rather than, say, lengthy VARCHARs. Unfortunately, with memcached, a lengthy VARCHAR primary key is probably the most common case.

As of release 7.2.3 the NDB/memcached combination does allow you to store larger values. It’s done in a way that looks perfectly normal from the Memcache perspective — the client sends a long value, and the server stores it. But rather than adapt NdbBlob, we created a new implementation, which from the SQL perspective might look a bit unusual. So I’ll describe how we designed it, and why. Compared to NdbBlob it has several distinctive features:

  • Blob parts are not stored in a special hidden table, but in a normal, visible table that you, the user, create yourself.
  • Similarly, the blob header in the main table is not an opaque hidden column, but a pair of integer columns — an id number and a size — that are normal, visible SQL columns.
  • The primary key of the parts table is two integers: an id number and a part number.
  • If a value is small enough to fit entirely in the main table row, it is stored there (with the external size set to NULL). But if it is large enough to be externalized, then the whole value is externalized. This is again unlike NdbBlob, which stores the leftmost part of every value inside the blob header.
  • Finally, unlike NdbBlob, you cannot directly read a blob as a SQL BLOB from the MySQL server (though you can read it using group_concat and write it using a stored procedure).

This new design has two particular advantages. It allows us to have a new async API without any risk, and it conserves memory. The risk, if we had extended NdbBlob, was the possibility of ugly, difficult, data-corrupting bugs when memcached and mysqld accessed the same BLOB object concurrently. The memory benefits can result in an 80% savings in IndexMemory when keys are 64 bytes long, as I will illustrate here.

How to configure it

For a quick walkthrough of what it takes to use the external values API, I’ll describe the tables that are a standard part of the “ndbmemcache” configuration schema.

First, let’s look at the external values table. Its payload is a VARBINARY(13950) content column that can hold a single “stripe” of a large value, and its two-part primary key is simply an INT UNSIGNED id number plus a SMALLINT part number.

CREATE TABLE `external_values` (
  `id` INT UNSIGNED AUTO_INCREMENT NOT NULL,
  `part` SMALLINT NOT NULL,
  `content` VARBINARY(13950) NOT NULL,
  PRIMARY KEY (id,part)
 ) ENGINE = ndbcluster;

Then we see the main table, which has a string column for the memcache key, an optional column for a CAS ID, and a column for the string value if that value can fit inside 2000 bytes. Values larger than 2KB will be externalized: the ext_id will reference an “id” in the external_values table, and ext_size will record the total size of the long value.

CREATE TABLE IF NOT EXISTS `demo_table_large` (
  `mkey` VARCHAR(250) NOT NULL,
  `cas_value` BIGINT UNSIGNED,
  `string_value` VARBINARY(2000), 
  `ext_id` INT UNSIGNED,
  `ext_size` INT UNSIGNED,
  PRIMARY KEY USING HASH (mkey)
) ENGINE = ndbcluster;
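
As noted above, mysqld cannot read an externalized value back as an ordinary BLOB column, but you can reassemble one with group_concat. Here is a sketch of such a query (the literal key is illustrative, and whether the “b:” prefix is stored as part of mkey depends on your key_prefix configuration; note also that group_concat_max_len defaults to only 1024 bytes, so it must be raised first):

SET SESSION group_concat_max_len = 1024 * 1024;
SELECT t.mkey, t.ext_size,
       GROUP_CONCAT(p.content ORDER BY p.part SEPARATOR '') AS full_value
  FROM demo_table_large t
  JOIN external_values p ON p.id = t.ext_id
 WHERE t.mkey = 'my_big_key'
 GROUP BY t.mkey, t.ext_size;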

To configure this table for use in Memcached, first we create a record in ndbmemcache.containers. The table itself is called “demo_table_large,” but we call the container “demo_ext”. Note how “string_value” is defined as the value column, and the fully-qualified table name “ndbmemcache.external_values” is used for the large_values_table.

INSERT INTO containers
  SET name = "demo_ext", db_schema = "ndbmemcache", 
      db_table = "demo_table_large",
      key_columns = "mkey", 
      value_columns = "string_value", 
      cas_column = "cas_value", 
      large_values_table = "ndbmemcache.external_values";

The final step is to define some memcache key prefixes that use the new container. Here are two. The first of these says that in server role 0 (the default role), any memcache key that begins with “b:” will use the demo_ext container. The second one says that in server role 4 (the “large” server), all memcache keys, by default, will use demo_ext (in other words, it defines the empty prefix for that role).

INSERT INTO key_prefixes (server_role_id, key_prefix, cluster_id, policy, container)
  VALUES
          (0, "b:",  0, "ndb-test", "demo_ext"),
          (4, ""  ,  0, "ndb-test", "demo_ext");

How it saves memory

Finally I’ll walk through an example that illustrates the memory efficiency of the new design. To begin, we will look at ndbinfo.memoryusage to take a baseline measurement.

$ mysql -u root -e 'select * from ndbinfo.memoryusage' 
+---------+--------------+---------+------------+-----------+-------------+
| node_id | memory_type  | used    | used_pages | total     | total_pages |
+---------+--------------+---------+------------+-----------+-------------+
|       2 | Data memory  | 1015808 |         31 | 188743680 |        5760 |
|       2 | Index memory |  303104 |         37 |  63176704 |        7712 |
+---------+--------------+---------+------------+-----------+-------------+

This one-node cluster contains just some system tables and the standard ndbmemcache configuration schema, and I’m paying particular attention to used pages of IndexMemory. The baseline starts at 37 pages used.

Next I’ll look at the memory used by NdbBlob. I’ll create a simple table like this:

CREATE TABLE blob1 (
  mkey varchar(250) PRIMARY KEY NOT NULL,
  bval blob
) ENGINE=ndbcluster;

Then I’ll use a simple Perl script to insert 1,000 rows, each with a 64-character key and a 30,000-byte value, and look at the memory usage.

$ perl blob.pl --size=30000 --sql  
$ mysql -u root -e 'select * from ndbinfo.memoryusage'
+---------+--------------+----------+------------+-----------+-------------+
| node_id | memory_type  | used     | used_pages | total     | total_pages |
+---------+--------------+----------+------------+-----------+-------------+
|       2 | Data memory  | 34701312 |       1059 | 188743680 |        5760 |
|       2 | Index memory |   532480 |         65 |  63176704 |        7712 |
+---------+--------------+----------+------------+-----------+-------------+

IndexMemory use increased from 37 pages to 65 pages. That means that every 1,000 rows of this 30KB blob will require 28 pages of IndexMemory.

Let’s delete the rows and verify that memory usage returns to the baseline level.

$ perl blob.pl --flush 
$ mysql -u root -e 'select * from ndbinfo.memoryusage' 
+---------+--------------+---------+------------+-----------+-------------+
| node_id | memory_type  | used    | used_pages | total     | total_pages |
+---------+--------------+---------+------------+-----------+-------------+
|       2 | Data memory  | 1015808 |         31 | 188743680 |        5760 |
|       2 | Index memory |  303104 |         37 |  63176704 |        7712 |
+---------+--------------+---------+------------+-----------+-------------+

IndexMemory is back down to 37 pages. Next we can use the same script to store 1000 rows with a 64-byte key and 30,000-byte value using memcached, and compare.

$ perl blob.pl --size=30000 --mc   
$ mysql -u root -e 'select * from ndbinfo.memoryusage' 
+---------+--------------+----------+------------+-----------+-------------+
| node_id | memory_type  | used     | used_pages | total     | total_pages |
+---------+--------------+----------+------------+-----------+-------------+
|       2 | Data memory  | 34111488 |       1041 | 188743680 |        5760 |
|       2 | Index memory |   344064 |         42 |  63176704 |        7712 |
+---------+--------------+----------+------------+-----------+-------------+

Storing 1,000 rows in the external_values table increases IndexMemory from 37 pages to 42 pages, so the cost is only 5 pages per 1,000 rows. Compared to 28 pages with NdbBlob, this is an 80% improvement.
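
For the record, here is the arithmetic behind those figures; the subtractions remove the 37 baseline pages that both tests share, and the final expression is the percentage I am rounding to 80%:

SELECT 65 - 37 AS ndbblob_pages_per_1000_rows,       -- 28
       42 - 37 AS memcache_pages_per_1000_rows,      -- 5
       ROUND(100 * (1 - 5/28)) AS pct_improvement;   -- about 82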

Page Efficiency

You might also notice a difference in DataMemory, which is an entirely different matter. In this example, NdbBlob happened to use 1028 pages of DataMemory (1059 – 31), while memcached used only 1010 pages. But this is entirely an artifact of the object size, the stripe size, and NDB’s 32KB pages. Because rows always fit entirely within a single page, it’s a matter of how efficiently the page space happens to get filled. In a test with objects larger or smaller than 30,000 bytes, or with realistically variable sizes, the results would have come out differently. In many of these tests, NdbBlob’s system (with 4,000-byte stripes) would have used less DataMemory than memcached with 13,950-byte stripes.

In fact, the 13,950-byte stripe size in ndbmemcache.external_values appears to have been a poor choice (on my part). I would encourage people to use VARBINARY(10800) rather than VARBINARY(13950). This should squeeze three rows onto a page at 99% efficiency, rather than two rows at an efficiency of 85%.
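
Here is the arithmetic behind those efficiency figures, measured against NDB’s 32KB (32,768-byte) pages and ignoring per-row overhead:

SELECT ROUND(100 * 2 * 13950 / 32768) AS two_wide_stripes_pct,     -- about 85
       ROUND(100 * 3 * 10800 / 32768) AS three_narrow_stripes_pct; -- about 99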

Finally, I should note that it’s easy to test this out yourself; you don’t even need to paste in the SQL examples above. Out of the box with MySQL Cluster 7.2.4 and above, you have two choices: you can start a memcached server with “-e role=large”, where every item you store will use the demo_ext container; or you can start the server with the default role and prefix memcache keys with “b:” when you want to store large values.

The NeXT

Sitting on a shelf in my garage is a reminder that Steve Jobs’ product launches didn’t always go the way he hoped for. It’s my NeXT station and my NeXT laser printer. The printer has a perfectly straight paper path; it can print on cardboard. As for the computer, I learned a few years ago in NERT class that if the house were ever on fire, the firemen would probably want to be alerted about its magnesium case.

I want to take a minute here to try to reconstruct the scene in the early 1990’s. In 1989, Sun Microsystems introduced the first SPARC-based “pizza box” workstation, the SPARCstation 1, and sold 35,000 of them. My own NeXT machine is not the original NeXT cube, but the NeXTStation, from 1990, which copied Sun’s form factor. It would soon be noted that Sun sold more workstations every month than NeXT would sell in an entire year. Sun probably sold more computers in 1992 than NeXT did in its whole history. NeXT left the hardware business in 1993.

The Sun machines weren’t exactly easy to use. (Granted, they were so powerful, and there was so much innovation going on, that “using them” was a continually moving target). To manage them, you hired systems administrators, and, wherever there were systems administrators, there was a shelf full of books published by O’Reilly and Associates. It’s not that Systems Administration didn’t exist, or that Tim O’Reilly didn’t publish books, before there were SPARCstations, but that the SPARCstation era was when business started getting big. O’Reilly published the first edition of AEleen Frisch’s Essential System Administration in 1991. I’m not sure how big it was — maybe 400-something pages? — but my second edition, from 1995, is 760. Unix Systems Administration is what I think of as my first real job, something I dropped out of graduate school to do. You could say it began to be recognized as a field with the first USENIX Large Installation Systems Administration workshop, in 1987, now known as LISA.

I’m not sure that the NeXT was that much easier to run than the Sun (netinfo was weird), but it at least attempted to create a world where one person could use a GUI tool to manage every machine on the network. Microsoft surely had the same idea when they finally introduced a 32-bit OS with TCP/IP support, Windows NT 3.1, in 1993. And — here’s where I’ll do some speculating — that looming specter of Microsoft is one of the forces that brought Sun and NeXT together to start putting NeXT software on Sun hardware.

We move on to the mid-90s. Sun acquired Lighthouse Design, who made an awesome application called Diagram!, which you can run today on your iPad as OmniGraffle. But the next year, as we know, Apple acquired NeXT, its savior. Sun, for its part, would enter into bizarre alliances with AOL and Netscape, be blinded by its own successes, and miss the boat on Linux. Lighthouse Design’s co-founder, Jonathan Schwartz, would eventually become CEO of Sun, see the stock price drop to $3, and sell the company to Oracle.

I don’t mean to be nostalgic. The NeXT was too good in the wrong ways, and not good enough in the right ways. It lost a battle but won a war. Sun’s blindness, and collapse, is harder to understand. Tim O’Reilly did pretty well through the whole story, and I think he might tell you that the guiding force, the trend that continues through the narrative, that sometimes swerved but never lost the road, is increasing openness.

Last week, Larry Ellison cited Apple as an example of how integrated systems — hardware and software built to work together — are more compelling than “pieced together” ones. But it’s not an either/or choice, “integrated” vs. “open”. A pragmatist, whether he’s a pragmatic buyer or a pragmatic seller, chooses some of one and some of the other. So Oracle sells an “appliance,” but it comes with Hadoop and R.

Open World and a Bunch of Camps

It’s quite a week here! Craig Russell and I will be giving a talk about the NoSQL APIs for MySQL Cluster (mod_ndb, memcached, and ClusterJ) twice: first at Oracle Open World at 9:00 AM on Thursday morning, and then again at Silicon Valley Codecamp on Saturday afternoon. Tomas Ulin’s keynote address yesterday morning unveiled some of the goodness in MySQL 5.6 and MySQL Cluster 7.2 (like Batch Key Access, the great optimizer feature from the ill-fated MySQL 6.0 tree).

And last night I dropped by Dave Neilsen’s NoSQL Camp, where I saw Andy Twigg ask a very interesting question in a 5-minute lightning talk. Andy simply pointed out that with hard disk drives, random I/O is slow and capacity is cheap — so of course it makes sense to denormalize your data (for instance, into a document database like MongoDB). But on solid state storage, random I/O is fast and capacity is expensive. Do you want to denormalize your data on an SSD? Maybe not.

NoSQL Now!

I’m headed to NoSQL Now! in San Jose this week. I’ll be speaking on Thursday morning about our NoSQL APIs for MySQL Cluster — memcached, ClusterJ, and mod_ndb. Our upcoming MySQL Cluster release introduces a Memcache API with near-native NDB performance, and, from the SQL side, includes major algorithmic changes to the MySQL optimizer so that, at long last, MySQL Cluster handles multi-way joins by “bringing the query to the data.”

It’s been a long, long road from the very first MySQL Cluster releases in 2004 to this point — but this release really brings me back to the excitement I remember when I first heard of MySQL Cluster, with its combination of low-level (NDBAPI) and high-level (SQL) access to the same distributed database. In 2004, the NDBAPI was hard to use, the SQL access turned out to have severe limitations, and managing the database was more complex than I’d expected. Managing distributed databases might never be easy, but today there are tools built around MySQL Cluster that take away a lot of complexity and shorten the learning curve — both from Oracle (MySQL Cluster Manager) and from the larger user community (Severalnines Configurator). And I think we’ve gotten the APIs right: SQL with adaptive query localization, Memcache for LAMP apps, and the Hibernate-style high performance Java interface, ClusterJ.

I’m also excited to spend two days at NoSQL Now! and get some fresh insight on the latest buzz. I’m still inclined to think schemas are good — after all, if a developer misspells the name of an attribute, do you want to implicitly introduce a new misspelled attribute into the data set? But people are out there solving big problems, and choosing the tools to solve them with. Some of those tool choices are turning out to be unorthodox, surprising, and quite insightful. It will be a good week.

NDB & Memcache: When to Cache?

I had a great time at this year’s MySQL Conference and Expo, introducing the new memcached API for MySQL Cluster (now available on launchpad). There is an interesting question about how you might use the new API: should you turn the cache off, so that the memcached server is just an easy-to-use, high performance NoSQL gateway to the data stored in the cluster? Or should you enable caching in memcached? The memcached server – actually the NDB engine plugin in the server – allows either. In fact you can tune this on the basis of a key prefix, so that some keys are cached and others are not. But how would you make the choice?

I look at it this way. Not all MySQL Cluster databases are the same. Some may store all the data in memory, while others store some tables on disk. Some may live in ideal, low-latency environments, while others have higher network latencies, maybe because of the distance between nodes or maybe for some other reason.

Our goal is that with the ideal low-latency cluster, and data that is entirely in memory, the memcache API should perform very well for most workloads without caching. Going directly to NDB for each operation, it should deliver at least tens of thousands of operations per second, with sub-millisecond response times.

But if the cluster latency is higher than optimal, that may not be possible. All of the MySQL Cluster APIs – SQL, ClusterJ, and memcached – will suffer from the higher round trip times between nodes. At some point, it starts making sense to turn on the cache. The improved performance of a memcache response becomes worth the extra work of managing cache consistency.

When data is on disk, things look quite different. A “sub-millisecond” response is no longer in the cards. Spinning disks have seek times of 3 to 6 ms., so the cached response from memcached (in a few hundred microseconds) will be ten times faster than the response from the database. When a table is on disk, you should probably turn the cache on.
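
In configuration terms, that choice is made per key prefix, in the same ndbmemcache.key_prefixes table that appears in the configuration examples elsewhere on this blog. Here is a sketch; the policy names and the container are placeholders for illustration, so check the cache_policies table in your release for the exact spellings:

INSERT INTO key_prefixes (server_role_id, key_prefix, cluster_id, policy, container)
  VALUES (0, 'hot:',  0, 'caching',  'my_container'),   -- cache these keys in memcached
         (0, 'cold:', 0, 'ndb-only', 'my_container');   -- always go to NDB for these keys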

There’s another way to look at this. The amount of memory inside an NDB data node is limited, and you have several pools competing for it. When MySQL Cluster stores data rows on disk, the indexes are still stored in-memory. This is an important design compromise that guarantees “real-time” performance and yet still allows data to get bigger. But along with the index memory, you also have a disk page buffer pool. Like the buffer pool in InnoDB or any other disk database, this is a cache of recently accessed disk rows, designed to limit the number of total disk reads. But I suspect it will turn out that memory in a memcached server is actually less expensive than memory in an NDB data node. (Why? For all the same reasons that memcached has become as popular as it is today. After all, it’s also more cost effective to use memcached than to have a big MySQL query cache, or a huge InnoDB buffer pool). Rather than pushing the limits to get a bigger disk page buffer pool in NDB, it will make sense to have a distributed cache in memcached.

In summary, with the Memcache API, you will be able to have higher-latency clusters with data stored on disk and still get phenomenally low latency and high throughput when you turn on the cache.

MySQL Cluster and Memcached: Together at Last

Memcached is a simple solution for tough scalability problems. MySQL Cluster is a telco-grade network data store. Both are capable of performance far beyond what a relational database (even one as good as MySQL) usually delivers. But memcached always relies on some other technology to persistently store data, and while MySQL Cluster can deliver exceptional throughput and response times, making it do so has not always been easy.

So, people have long discussed the idea of putting the two together, and I’ve been lucky enough to spend the past six months working on it.

There are many different ways they might fit together. For instance, memcached could merely be used as a protocol – as a path that gets you to the data in cluster – or it could be a fully functional cache on top of that data. I have tried to leave that decision (and many others) up to you, and build something that can easily be configured to work either way. The design allows many memcached servers to each connect to several different clusters, and to behave in customized ways based on memcache key prefixes. So, keys that begin with “session:” could be handled in one way and keys that begin with “user:” in another. I’ll describe configuration in more detail in my talk this week at the O’Reilly MySQL Conference and Expo, and you can see an overview of it in my slides.

The basis of all this is the great work that has been going into Memcached 1.6. Memcached has been modularized into a pluggable chassis, and the NDB storage layer loads into that chassis as a storage engine. It then relies on memcached’s traditional caching layer – now known as the default engine – for local caching. 1.6 is not quite officially finished yet, but you can read Dustin’s mailing list post about it, and fetch the code from github:

 git clone git://github.com/memcached/memcached
 cd memcached
 git checkout -t origin/engine-pu 

Once you install memcached 1.6, you can fetch a MySQL Cluster 7.2 beta from launchpad to see the NDB engine in action. Get the tree from here:

 bzr clone lp:~mysql/mysql-server/mysql-cluster-7.2-labs-memcached

And then configure it:

 sh BUILD/autorun.sh 
 ./configure --with-plugins=ndbcluster \
 --with-memcache=/path/to/new/memcached 

The next step is to look at the README file in the storage/ndb/memcache/ directory. Starting a sample cluster and trying out memcache is as simple as running the “sandbox.sh” script you will find there.

None of this is released yet for general use. There are some significant limitations, including the fact that memcached values are limited to about 8KB, due to NDB’s maximum row size. We have only just started taking a real look at throughput under load, and we expect to find bottlenecks that can only be resolved by new work at the NDB API layer. And we need to do some work to help manage cache consistency when memcached and MySQL access the same data. But we’re ready for you to try it out, and looking forward to your feedback.