Quick links:

LINBIT Blogs: “read-balancing” with 8.4.1+

Florian's blog: More details on OSCON 2012, and your chance to get in cheaper!

Florian's blog: Coming to New Zealand!

Florian's blog: A look back at my first OpenStack Design Summit & Conference

LINBIT Blogs: LINBIT participates in the German Cloud (“Deutsche Wolke”)

Anchor Web Hosting Blog » drbd: Answers for DRBD time-travel issues

Florian's blog: Speaking at the Percona Live MySQL Conference and Expo

Florian's blog: Speaking at OSCON 2012

LINBIT Blogs: Monitoring: better safe than sorry…

Florian's blog: Feature article on Pacemaker in this month’s Linux Journal

LINBIT Blogs: Maximum volume size on DRBD

Anchor Web Hosting Blog » drbd: Holy time-travellin’ DRBD, batman!

Florian's blog: Presentation accepted for OpenStack Spring 2012 Conference

Florian's blog: Announcing the High Performance High Availability Guide documentation project

Florian's blog: On my (ex-)maintainership of the DRBD User’s Guide

Florian's blog: Lots of new stuff on our web site

Florian's blog: Ceph: tickling my geek genes

Florian's blog: Announcing Cloud Jumpstart for OpenStack™ – and your chance to get into LinuxTag for free!

Florian's blog: OpenStack Spring 2012 Design Summit & Conference

Florian's blog: This blog is about to move!

Florian's blog: Speaking at the 2012 Percona Live MySQL Conference

LINBIT Blogs: Trust, but verify

Florian's blog: Last Minute discount now available for High Availability Expert training in Berlin

LINBIT Blogs: DRBD and the sync rate controller (8.3.9 and above)

Florian's blog: Speaking at linux.conf.au, meet us in Ballarat!

LINBIT Blogs: DRBD causes too much CPU-load

Florian's blog: Now available: Slides from Percona Live and Linuxcon Europe

Florian's blog: Ready to roll for Percona Live UK

Florian's blog: Twitter

Florian's blog: Busy weeks ahead!

“read-balancing” with 8.4.1+

Posted in LINBIT Blogs by Flip at May 09, 2012 06:45 AM

DRBD 8.4.1 introduces a new feature: read-balancing, which is configured in the disk section of the configuration file(s). This feature enables DRBD to balance read requests between the Primary/Secondary nodes.

While writes occur on both sides of the cluster, by default the reads are served locally (i.e., the value is prefer-local). This might not be optimal if you’ve got a big pipe to the other node and a heavily loaded IO subsystem.

read-balancing has several options to choose from:

  • 32K-striping up to 1M-striping chooses the node to read from via the block address – e.g. for 512K-striping the first half of each MiByte would be read from one machine, and the second half from the other.
    This is a simple, static load-balancing.
  • round-robin just passes the request to alternating nodes.
    This might go wrong if your application reads 4kiB, 1MiB, 4kiB, 1MiB, and so on – but this is fairly unlikely.
  • least-pending chooses the node with the smallest number of open requests.
  • when-congested-remote uses the remote node if there are local requests.
  • prefer-remote is implemented for completeness, however as of this writing there is no viable use case.
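
The selection policies in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not DRBD's kernel code: the function and class names are made up, node 0 stands for the local node and node 1 for the peer, which half of a stripe goes to which machine is an arbitrary choice here, and sending ties in least_pending to the local node is an assumption.

```python
KIB = 1024


def striping_node(offset_bytes, stripe_bytes):
    """Static striping: pick node 0 or 1 from the block address alone."""
    return (offset_bytes // stripe_bytes) % 2


class RoundRobin:
    """Alternate between the two nodes, one request at a time."""

    def __init__(self):
        self._next = 0

    def pick(self):
        node = self._next
        self._next = 1 - node
        return node


def least_pending(pending_local, pending_remote):
    """Pick the node with fewer open requests (ties go to node 0: an assumption)."""
    return 0 if pending_local <= pending_remote else 1


def bytes_per_node(request_sizes):
    """Total bytes each node would serve if requests were handed out round-robin."""
    rr = RoundRobin()
    totals = [0, 0]
    for size in request_sizes:
        totals[rr.pick()] += size
    return totals


if __name__ == "__main__":
    # 512K-striping: first half of each MiB on one node, second half on the other.
    for offset in (0, 512 * KIB, 1024 * KIB):
        print(offset, striping_node(offset, 512 * KIB))
    # The unlucky 4KiB / 1MiB alternation mentioned for round-robin.
    print(bytes_per_node([4 * KIB, 1024 * KIB] * 3))
```

The last call shows why the alternating 4KiB/1MiB pattern is a bad fit for round-robin: one node ends up serving all the large reads.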

Please note that all this is still below the filesystem layer – so even if the secondary is used for reading, this won’t speed up a failover, as the pages read are not kept anywhere.

More details on OSCON 2012, and your chance to get in cheaper!

Posted in Florian's blog by Florian Haas at May 01, 2012 06:29 PM

A few more details on my speaking slot at this year’s OSCON, titled Highly Available Cloud: OpenStack Integration with Pacemaker.

Read more…


Coming to New Zealand!

Posted in Florian's blog by Florian Haas at April 25, 2012 05:37 AM

hastexo is offering Cloud Bootcamp for OpenStack™ in Wellington. Another fine example of the global OpenStack community at work.

Read more…


A look back at my first OpenStack Design Summit & Conference

Posted in Florian's blog by Florian Haas at April 24, 2012 09:35 AM

I’ve just returned from the OpenStack Folsom Design Summit and Spring 2012 Conference, and am finally getting rid of my jet lag. Here’s a summary of what’s been a mind-blowing conference experience for me.

Read more…


LINBIT participates in the German Cloud (“Deutsche Wolke”)

Posted in LINBIT Blogs by Flip at April 23, 2012 06:30 PM

Deutsche Wolke (“German Cloud”) was founded to establish Federal Cloud Infrastructure in Germany.

This infrastructure will provide additional legal and security protections for hosted data.  No longer will small businesses be exposed to the legal risk of losing their website presence without a trial (an unfortunate reality when doing business on transatlantic clouds).

The natural partner for backend storage infrastructure is LINBIT; as authors and maintainers of DRBD, we are best suited to provide the technical expertise to achieve High Availability.  Also, DRBD Proxy is the obvious choice for off-site or disaster recovery replication (from the office into the cloud).

We at LINBIT look forward to seeing this project grow and prosper!

Answers for DRBD time-travel issues

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at April 18, 2012 04:49 AM

A little update on a DRBD problem we wrote about at the start of April, in which we lost a few months of data during a cluster failover.

Linbit got in touch with us to offer assistance, and we were happy to be enlightened. We had a good idea of what had happened, but no idea why.

It seems that a race condition was introduced in version 8.3.9, when the fence-peer script was changed to run asynchronously. The engineering team explained that if the connection is reestablished while the script runs, it may happen that the peer’s disk-state gets overwritten with stale information.

This was fixed in 8.3.11, and of course we’re running version 8.3.10 on the cluster in question. We’d like to thank Linbit for their assistance and expertise in sorting this out; we’ve already started testing our plans for an upgrade.

Speaking at the Percona Live MySQL Conference and Expo

Posted in Florian's blog by Florian Haas at April 14, 2012 04:09 PM

This week, I had the pleasure of speaking at the Percona Live MySQL Conference & Expo. This was the first year it was not the O’Reilly MySQL Conference & Expo, and also the first time Oracle was not involved in any way. And what can I say, Terry Erisman and his team at Percona have put together an awe-inspiring conference.

Read more…


Speaking at OSCON 2012

Posted in Florian's blog by Florian Haas at April 03, 2012 09:28 AM

I’ll be speaking at OSCON 2012 in Portland, on high availability in OpenStack.

Read more…


Monitoring: better safe than sorry…

Posted in LINBIT Blogs by Flip at April 03, 2012 09:07 AM

Stumbling upon the Holy time-travellin’ DRBD, batman! blog post there’s only one thing to be said …

“Be strict in what you emit, liberal in what you accept”

is simply not true when dealing with mission-critical systems.

It’s ok to be alerted on upgrading a machine because the “old, working” RegEx that did the parsing doesn’t match anymore; it’s not a problem to get an email when someone adds the 100th DRBD resource and causes the grep to fail; and so on.

Better to have a few false positives when you’re actively changing things than to get a false negative that costs you months of data; that’s what an assert (and monitoring isn’t that different) is for, after all.

Keep monitoring strict, and let it fail loudly on unexpected things – after the first few occurrences they’re not unexpected anymore and can be dealt with.

Feature article on Pacemaker in this month’s Linux Journal

Posted in Florian's blog by Florian Haas at April 02, 2012 12:45 PM

I’ve written an article on the Pacemaker stack that’s being featured in this month’s Issue 216 of Linux Journal.

Read more…


Maximum volume size on DRBD

Posted in LINBIT Blogs by Flip at April 02, 2012 12:25 PM

From time to time we get asked things like this:

“I want to use a 10TiB volume with DRBD, is that supported?”

The easiest way to answer things like that is to say: look for yourself on the public DRBD usage page – the biggest public device size is ~220TiB, so go figure ;)

The current maximum device size is 1PiB (1 PebiByte = 1024 TebiByte), so there’s a bit of room left.

DRBD needs about 32MiB RAM per TiB storage, so for 1PiB storage you’ll need 32GiB of RAM just for the DRBD bitmap. Having a bit more for the OS, userspace and buffer cache is left as an exercise for the reader.
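
The RAM figure can be checked with a little arithmetic. The sketch below assumes the bitmap holds one bit per 4KiB block of storage; that granularity is an assumption chosen because it reproduces the 32MB-per-TB figure quoted above, and the function name is made up for illustration.

```python
KIB = 2**10
MIB = 2**20
GIB = 2**30
TIB = 2**40

# Assumption: one bitmap bit tracks one 4 KiB block of the device.
BLOCK_BYTES = 4 * KIB


def bitmap_bytes(device_bytes):
    """RAM needed for a one-bit-per-block bitmap covering the whole device."""
    bits = device_bytes // BLOCK_BYTES
    return bits // 8


if __name__ == "__main__":
    for size_tib in (1, 10, 220, 1024):
        ram = bitmap_bytes(size_tib * TIB)
        print(f"{size_tib:5d} TiB device -> {ram / MIB:8.0f} MiB bitmap")
```

Under that assumption a 10TiB volume needs 320MiB for the bitmap, and 1024TiB needs 32GiB.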

If you’ve got questions, ask the DRBD experts at LINBIT – we wrote the code, after all!

Holy time-travellin’ DRBD, batman!

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at March 31, 2012 04:52 PM

Here at Anchor we’ve developed High-Availability (HA) systems for our customers to ensure they remain online in the event of catastrophic hardware failure. Most of our HA systems involve the use of DRBD, the Distributed Replicated Block Device. DRBD is like RAID-1 across a network.

We’d like to share some notes on a recent issue that involved a DRBD volume jumping into a time-warp and rolling back four months. If you run your own DRBD setup, you’ll want to know about this. The chances that you hit the same problem are slim, but it’s not hard to avoid.


We have a script for Nagios that checks the health of your DRBD volumes; it was basically the go-to default check_drbd script on Nagios Exchange. The script is meant to ensure that both ends are in-sync, and that the connection is up.

The volume in question is the backing store for a virtual machine (VM) guest. One day, after an otherwise-ordinary cluster failover event, it was noticed that the VM’s disks had reverted to a state from last year in November. The monitoring had never tripped, so what the heck was going on?

Our sysadmins started digging. Pacemaker generates a lot of logging output, and this was one time it came in useful. The timeline below assumes you have some familiarity with DRBD and (ideally) Pacemaker’s cluster management functions:

  1. Everything was working fine
  2. A blip on the cluster caused the active server (server A) to attempt a fence action on the volume on the standby (server B)
  3. The fence action failed for some reason
  4. Server A says “Hmm, okay, whatever”, and stops sending DRBD updates to server B
  5. The DRBD connection remains up and running
  6. Server A’s monitoring script says “I’m the Primary node so I’m up-to-date, and the connection is up: OK”
  7. Server B’s monitoring script says “I’m the Secondary node, my data is ‘consistent’ (not half-synced), and the connection is up: OK”
  8. Everything looks okay and no one is aware that server B’s copy of the data is slipping further and further out of date
  9. Eventually a full-on cluster failover occurs, server B receives the call to action, and goes right ahead as it knows its data is consistent (represents a known point-in-time) but not that it’s very outdated

In short, a corner case in DRBD’s workings got the volume into a bad state that went undetected by the monitoring script. This allowed old data to replace new data during a failover.

When server A attempted the fencing action and failed, it knew something was wrong. It couldn’t tell what, but it didn’t trust it any more, so it stopped sending new data to server B.

Each server knows a little bit about the disk at the other end, thanks to the DRBD connection working just fine. Server A knew its disk was good but noted server B’s disk as “DUnknown” – something dodgy going on. Server B thought its own disk was fine (correct: it hadn’t received the fencing request) and knew server A’s disk was fine (server A is automatically trusted as the Primary node).

Server B’s “DUnknown” state is what Nagios didn’t see, and it should’ve been a warning bell. Server B willingly took over after the failover because everything looked fine, just that server A had been really, really quiet for the last few months. As the new Primary node it promptly pushed its copy of the volume back to server A, steamrolling 4 months of changes in the process.


Our immediate fix for this was to improve the monitoring script. The remote peer’s disk state is now taken into account, and the script was heavily restructured to improve readability and aggregate data in a more structured manner. We’ll be able to push the improvements to GitHub once we’ve cleaned it up a little further.

EDIT: It’s been published now: https://github.com/anchor/nagios-plugin-drbd
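
To illustrate the idea of checking the peer’s disk state as well as the local one, here is a minimal sketch. It is not the published plugin: the function name and return values are hypothetical, and it assumes an 8.3-style /proc/drbd status line of the form `0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----`. Anything it cannot parse is reported rather than silently accepted.

```python
import re

# Assumed 8.3-style /proc/drbd status line: minor, connection state,
# local/peer roles, local/peer disk states.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<ro_local>[^/\s]+)/(?P<ro_peer>\S+)"
    r"\s+ds:(?P<ds_local>[^/\s]+)/(?P<ds_peer>\S+)"
)


def check_status_line(line):
    """Return (state, message) for one DRBD status line.

    OK only if the connection is up AND both disks are UpToDate.
    """
    m = STATUS_RE.match(line)
    if m is None:
        # Fail loudly on anything unexpected instead of guessing.
        return ("UNKNOWN", "unparseable status line: %r" % line)

    problems = []
    if m.group("cs") != "Connected":
        problems.append("connection state is %s" % m.group("cs"))
    for side in ("local", "peer"):
        disk_state = m.group("ds_" + side)
        if disk_state != "UpToDate":
            problems.append("%s disk is %s" % (side, disk_state))

    if problems:
        return ("CRITICAL", "drbd%s: %s" % (m.group("minor"), "; ".join(problems)))
    return ("OK", "drbd%s: connected, both disks UpToDate" % m.group("minor"))


if __name__ == "__main__":
    print(check_status_line(" 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----"))
    print(check_status_line(" 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r----"))
```

With a check like this, the situation described above (connection up, local disk fine, peer disk “DUnknown”) raises an alert instead of passing.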

We’re also further investigating the fencing actions for DRBD. Building fault-tolerant systems is hard, which is why you employ defense-in-depth strategies – it may be that the fencing actions also need defensive measures.

Announcing the High Performance High Availability Guide documentation project

Posted in Florian's blog by Florian Haas at March 23, 2012 01:06 PM

A bit of updated information on my limited involvement in the DRBD User’s Guide, and something new that came out of it.

Read more…


On my (ex-)maintainership of the DRBD User’s Guide

Posted in Florian's blog by Florian Haas at March 20, 2012 01:19 PM

Here’s a quick summary of my past and current relationship with the DRBD User’s Guide.

Read more…


Lots of new stuff on our web site

Posted in Florian's blog by Florian Haas at March 13, 2012 05:13 PM

Over the past couple of weeks, we’ve quietly rolled out new content, new functionality, and Hangouts on our web site. Here’s a summary of these nifty little changes.

Read more…


Ceph: tickling my geek genes

Posted in Florian's blog by Florian Haas at March 08, 2012 07:12 PM

Haven’t heard of Ceph, the open-source distributed petascale storage stack? Well, you’ve really been missing out. It’s not just a filesystem. It’s a filesystem, and a striped/replicated block device provider, and a virtualization storage backend, and a cloud object store, and then some.

Read more…


Announcing Cloud Jumpstart for OpenStack™ – and your chance to get into LinuxTag for free!

Posted in Florian's blog by Florian Haas at March 06, 2012 10:15 AM

Yesterday, we announced Cloud Jumpstart for OpenStack™ – our brand new training offering with 2 full days of deep-diving into OpenStack. If you have little or no experience with OpenStack, and you want to get your feet wet and your hands dirty real quick, then Cloud Jumpstart for OpenStack is for you. And there’s an extra sweet deal on our first incarnation of this awesome class.

Read more


OpenStack Spring 2012 Design Summit & Conference

Posted in Florian's blog by Florian Haas at February 29, 2012 09:13 AM

This April, right after the MySQL Conference & Expo, I’ll stay around the San Francisco Bay Area for another week, as I’ve been invited to attend the OpenStack Spring 2012 Design Summit & Conference.

Read more


This blog is about to move!

Posted in Florian's blog by Florian Haas at February 28, 2012 01:57 PM

My dear and faithful readers, this will cease to be my primary blog site.

From now on, I’ll be blogging over on the hastexo web site, where you can find my blog at http://www.hastexo.com/blogs/florian. An RSS feed, for those of you who want to update their readers, is at http://www.hastexo.com/blogs/florian/feed.

This is something I’ve been meaning to do for a while, and I’ve finally had the breathing room to do so.

The statements I make on my blog will continue to be my own, rather than “official” hastexo company statements. The same is true for Martin, whose blog is also moving to our web site (RSS).

To ease the transition, I plan to post the opening paragraphs of blog posts popping up over there on this site. It’s just that the “Read More” links will now point to the new primary blog site. I have no intention of taking the WordPress site down anytime soon, so anything recorded here will remain for reference purposes.


Speaking at the 2012 Percona Live MySQL Conference

Posted in Florian's blog by Florian Haas at February 27, 2012 12:39 PM

This year, I have the pleasure of returning to the MySQL Conference & Expo as a speaker. Percona have picked up the torch that O’Reilly had held as the conference organizers, and they’re putting together a 3-day conference this year. I am co-presenting a tutorial with Yves Trudeau from Percona.

Read more


Trust, but verify

Posted in LINBIT Blogs by Flip at February 25, 2012 08:16 PM

DRBD tries to ensure data integrity across different computers, and it’s quite good at it.

But, as per the old saying “Trust, but verify”, it might be a good idea to periodically test whether the nodes really have identical data, similar to the checks that are done for RAID sets.

The verify-alg digest is used to save bandwidth during online verification: without this setting the whole data has to be transferred, whereas with a digest such as md5 (16 bytes) or sha1 (20 bytes) only the digest is needed for each 4KiByte block, resulting in bandwidth savings of about 99.5%.
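
A quick way to check the savings figure is to compare digest sizes against the 4KiByte block size. The sketch below takes the digest sizes from Python’s hashlib, which is only a stand-in for the kernel’s digest implementations, and it ignores any per-packet protocol overhead; the function name is made up.

```python
import hashlib

BLOCK_BYTES = 4096  # one 4 KiByte block, as in the text


def verify_savings(algorithm, block_bytes=BLOCK_BYTES):
    """Fraction of bandwidth saved by sending a digest instead of the block."""
    digest_bytes = hashlib.new(algorithm).digest_size
    return 1 - digest_bytes / block_bytes


if __name__ == "__main__":
    for name in ("md5", "sha1"):
        size = hashlib.new(name).digest_size
        print(f"{name}: {size} bytes per block, {verify_savings(name):.2%} saved")
```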

The DRBD Users’ Guide has a nice chapter describing the configuration and usage, so I won’t get into this topic here.


If the volume you’re checking is actively used, you might see a few false positives in the log messages:

kernel: block drbd0: Out of sync: start=56079768, size=8 (sectors)

This is because some data block writes might have just been on the wire while the two nodes calculated their checksums, and so would see different generations of the data. If you do this check e.g. every week and get different block numbers every time, you’re fine. If you get the same block number(s), your storage might have stuck bits, and be unable to correctly write data in these blocks!
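
Comparing block numbers between runs is easy to automate. The following sketch parses kernel log lines of the form shown above and reports the blocks that turn up in more than one verification run; how the per-run log excerpts are collected is left open, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines like:
# kernel: block drbd0: Out of sync: start=56079768, size=8 (sectors)
OOS_RE = re.compile(
    r"block (?P<dev>drbd\d+): Out of sync: "
    r"start=(?P<start>\d+), size=(?P<size>\d+) \(sectors\)"
)


def out_of_sync_blocks(log_text):
    """Set of (device, start sector, size) tuples reported in one verify run."""
    return {
        (m.group("dev"), int(m.group("start")), int(m.group("size")))
        for m in OOS_RE.finditer(log_text)
    }


def recurring_blocks(runs, min_runs=2):
    """Blocks reported out of sync in at least `min_runs` separate runs."""
    counts = Counter()
    for log_text in runs:
        counts.update(out_of_sync_blocks(log_text))
    return sorted(block for block, n in counts.items() if n >= min_runs)


if __name__ == "__main__":
    week1 = (
        "kernel: block drbd0: Out of sync: start=56079768, size=8 (sectors)\n"
        "kernel: block drbd0: Out of sync: start=1024, size=8 (sectors)\n"
    )
    week2 = "kernel: block drbd0: Out of sync: start=56079768, size=8 (sectors)\n"
    print(recurring_blocks([week1, week2]))
```

An empty result means the reported blocks differ from run to run, which is the harmless case; a block that keeps coming back deserves a closer look at the storage.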


Please note that the needed verify-alg setting here sounds similar to the data-integrity-alg option, but serves a different purpose. data-integrity-alg means more CPU-usage for every write; but, similar to verify-alg, it is subject to false-positives, see here for details on both of these points.

Last Minute discount now available for High Availability Expert training in Berlin

Posted in Florian's blog by Florian Haas at February 07, 2012 05:10 AM

We have exactly one seat still left open in our hastexo High Availability Expert class coming up in Berlin next week. So if you want to learn about GFS2, OCFS2, advanced Pacemaker, GlusterFS and Ceph in one of Europe’s most beautiful cities, now is your chance!

And, we have a Last Minute discount available so you can get in for cheap! You’ll just have to be really, really fast before someone else grabs it. Our web site has the details.


DRBD and the sync rate controller (8.3.9 and above)

Posted in LINBIT Blogs by Flip at January 05, 2012 08:26 PM

The sync-rate controller limits the bandwidth used during resynchronization (not normal replication); it runs in the SyncTarget state, i.e. on the (inconsistent) receiver side.

It’s configured as follows:

  • Set c-plan-ahead to approximately 10 times the RTT; so if ping from one node to the other says 200msec, configure 2 seconds (i.e. a value of 20, as the unit is tenths of a second).
    Please note that the controller only polls every 100msec, so c-plan-ahead values below 5 don’t make sense: the controller hasn’t collected enough information to decide whether to request more data. We recommend using at least 1 second (configured value is 10).
    This value specifies the “thinking ahead” time of the controller, i.e. the time period within which the controller has to achieve the sync rate.
  • Configure minimum and maximum values via c-min-rate and c-max-rate; these depend mostly on the available bandwidth per resource.
    The c-min-rate is the minimum bandwidth that will be used during a resync, whereas c-max-rate is the most bandwidth that can be used by a resync.
  • Now decide whether to use c-fill-target or c-delay-target – you can choose only one.

Difference between delay and fill based control

If you set c-fill-target to a non-zero value, DRBD will try to keep that much data on the wire; if application IO gets in, it will temporarily displace the synchronization traffic. This means that application data will have only a limited amount of synchronization data in the buffers before it, which helps latency a bit.
The data still has to fit into the socket buffers, along with the application IO, so using multi-MB sizes here doesn’t make sense; 100kByte is a good starting value.

With a proxy you should use c-delay-target, so set the c-fill-target value to zero. This way the time interval that the synchronization data is on the wire is measured; if application IO gets in, this triggers the controller, and it will turn back the synchronization speed, to keep the communication latency at the specified value. Use 5 times the RTT as a starting point.
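
The rules of thumb above can be turned into a small calculator. This is a sketch of the arithmetic only: the function names are made up, the plan-ahead value is expressed in tenths of a second as described above, and the delay target is left in milliseconds because the text gives no unit for that setting.

```python
def c_plan_ahead(rtt_ms, minimum=10):
    """Suggested c-plan-ahead in tenths of a second: 10 x RTT, at least 1 second."""
    tenths = round(rtt_ms * 10 / 100)  # 10 x RTT in msec, divided by 100 msec
    return max(minimum, tenths)


def delay_target_ms(rtt_ms):
    """Starting point for a delay-based target: 5 x RTT, in milliseconds."""
    return 5 * rtt_ms


if __name__ == "__main__":
    for rtt in (2, 50, 200):
        print(f"RTT {rtt:3d} msec -> c-plan-ahead {c_plan_ahead(rtt):2d}, "
              f"delay target about {delay_target_ms(rtt)} msec")
```

For the 200msec example this yields a c-plan-ahead of 20; on low-latency links the recommended minimum of 10 applies.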

Speaking at linux.conf.au, meet us in Ballarat!

Posted in Florian's blog by Florian Haas at January 04, 2012 10:35 PM

After last year’s talk in Brisbane, where I greatly enjoyed co-presenting with Tim Serong, I have the privilege of returning to Australia for this year’s linux.conf.au in Ballarat, Victoria.

This time I have a brief talk opening up the High Availability and Distributed Storage miniconf on Monday, January 16, and a tutorial entitled High Availability Sprint in the morning on Thursday, January 19. Tim Serong is again joining me for the tutorial, and Pacemaker author Andrew Beekhof will be chiming in too. See you all in Ballarat!


DRBD causes too much CPU-load

Posted in LINBIT Blogs by Flip at December 20, 2011 11:36 AM

The TL;DR version: don’t use data-integrity-alg in a production setup.

The users guide (8.3 version) describes the data-integrity-alg as

DRBD can ensure the data integrity of the user’s data on the network by comparing hash values. [...]

Too many people think this is a must-have setting – but are sadly wrong.

During initial installation and testing it does make sense to use this – it’s an easy way to find out whether the hardware (CPU, memory, network card, etc.) works as it should. If you get the famous “Digest integrity check FAILED” message you can be worried (but not too much, since you found that during testing ;).

But in production this should not be set: apart from causing a lot of CPU load, it might cause frequent connection aborts, and each of those means a short period (the re-sync) during which the secondary is inconsistent.

So: don’t use this in production.

Now available: Slides from Percona Live and Linuxcon Europe

Posted in Florian's blog by Florian Haas at November 01, 2011 08:53 PM

The slides from last week’s talks I (co-)presented at Percona Live and Linuxcon Europe are now available from our web site.

All slides are available entirely free of charge for logged-in users on our web site. To log in, you don’t even need to register — just use your Google Profile, or Google Apps account, or your WordPress account, or anything else that uses OpenID, and you’ll be good to go.

Comments on our slides are, of course, always highly appreciated.


Ready to roll for Percona Live UK

Posted in Florian's blog by Florian Haas at October 23, 2011 06:17 PM

Percona Live MySQL Conference, London, Oct 24th and 25th, 2011

All slides are done, all virtual images are completed and we’re ready to roll for tomorrow’s MySQL High Availability Sprint: Launch the Pacemaker tutorial at Percona Live UK 2011.

This is probably your very last chance to register for PLUK as there are only a handful of tickets left. You can still use my discount code, HaasPLUK11. See you tomorrow!


Twitter

Posted in Florian's blog by Florian Haas at October 19, 2011 10:22 AM

Henceforth, you can find and follow us on Twitter. See you there!


Busy weeks ahead!

Posted in Florian's blog by Florian Haas at October 17, 2011 03:17 PM

I’m speaking at Percona Live, LinuxCon Europe, and linux.conf.au. And I just co-founded a new company.

I have a few busy weeks behind me, and even busier weeks ahead. If you’ve been wondering why recently I haven’t been updating this space too frequently, here’s why:

Yours truly and fellow ex-Linbiters Martin Loschwitz and Andreas Kurz have recently founded hastexo, an independent professional services organization focused on open-source high availability and disaster recovery. We are already offering both on-site and remote consultancy, custom training, and our Availability Checkup package, with more services lined up to be added to our offering.

We’re able to offer direct, 24/7 access to high availability experts with dial-in numbers in Europe, North America and Australia. We’re offering our services under an extremely flexible, versatile payments scheme with an attractive volume discount model. We’re experts in an array of high availability and disaster recovery technologies, such as Pacemaker, Corosync, Heartbeat, DRBD, highly available virtualization (a.k.a. “enterprise cloud”), and cluster file systems.

And we’ve got a unique, free offering. Have you ever considered hiring a high availability consultant to review your setup or provide expert advice, but were unsure as to the expected cost involved? At hastexo, we can help. You simply go to our Help page (free-of-charge registration required), collect information as instructed, and then just create a ticket in our support system. And we’ll make a qualified estimate as to the amount of effort (and cost) required to fix your issue, or improve your uptime, or both.

And, just in case one of us has previously helped you on a mailing list, on IRC, or at a conference, as we frequently do, then please leave us a message in our Shoutbox. We love to support the high availability community, and we’re thrilled to hear about it when we can help.

Speaking of conferences: next week, I’m doing back-to-back conferences in Europe.

And, for those of you making plans for Ballarat in January: I’ll return to linux.conf.au as a tutorial speaker, together with Andrew Beekhof and Tim Serong. I have also submitted a talk for the High Availability and Distributed Storage miniconf, preceding the main conference. See you there!