“umount is too slow”

Posted in LINBIT Blogs by flip at May 27, 2013 07:31 AM

A question we see over and over again is

Why is umount so slow? Why does it take so long?

Part of the answer was already given in an earlier blog post; here’s some more explanation.

The write() syscall typically writes into RAM only. In Linux we call that “page cache” or “buffer cache”, depending on what exactly the actual target of the write() system call was.

From that RAM (cache inside the operating system, high in the IO stack) the operating system periodically writes data out, at its leisure, unless it is urged to write out particular pieces (or all of it) now.

A sync (or fsync(), or fdatasync(), or …) does exactly that: it urges the operating system to do the write out.
A umount also causes a write out of all not yet written data of the affected file system.

Note:

  • Of course the “performance” of writes that go into volatile RAM only will be much better than anything that goes to stable, persistent, storage. All things that have only been written to cache but not yet synced (written out to the block layer) will be lost if you have a power outage or server crash.
    The Linux block layer has never seen these changes and DRBD has never seen them, so they cannot possibly be replicated anywhere.
    Data will be lost.

There are also controller caches which may or may not be volatile, and disk caches, which typically are volatile. These are below and outside the operating system, and not part of this discussion. Just make sure you disable all volatile caches on that level.

Now, for a moment, assume

  • you don’t have DRBD in the stack, and
  • a moderately capable IO backend that writes, say, 300 MByte/s, and
  • around 3 GiByte of dirty data around at the time you trigger the umount, and
  • you are not seek-bound, so your backend can actually reach that 300 MB/s,

you get a umount time of around 10 seconds.


Still with me?

Ok. Now, introduce DRBD to your IO stack, and add a long distance replication link. Just for the sake of me trying to explain it here, assume that because it is long distance and you have a limited budget, you can only afford 100 MBit/s. And “long distance” implies larger round trip times, so let’s assume we have an RTT of 100 ms.

Of course that would introduce a single IO request latency of > 100 ms for anything but DRBD protocol A, so you opt for protocol A. (In other words, using protocol A “masks” the RTT of the replication link from the application-visible latency.)

That was latency.

But the limited bandwidth of that replication link also limits your average sustained write throughput, in the given example to about 11 MiByte/s.
The same 3 GiByte of dirty data would now drain much more slowly: that same umount would now take not 10 seconds, but about 5 minutes.
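
The arithmetic behind those two umount times can be checked with a small sketch. It assumes decimal units for the 300 MByte/s backend and the 100 MBit/s link, and it ignores protocol overhead on the link, so the replicated figure comes out slightly below the "about 5 minutes" quoted above.

```python
# Back-of-the-envelope umount time: dirty data divided by sustained write throughput.
MIB = 1024 ** 2
GIB = 1024 ** 3


def drain_seconds(dirty_bytes: float, throughput_bytes_per_s: float) -> float:
    """Seconds needed to write out all dirty data at a sustained throughput."""
    return dirty_bytes / throughput_bytes_per_s


dirty = 3 * GIB

# Local backend: 300 MByte/s (decimal megabytes assumed).
local = drain_seconds(dirty, 300 * 10**6)

# Replication link: 100 MBit/s, protocol overhead ignored (assumption).
link_bytes_per_s = 100 * 10**6 / 8
replicated = drain_seconds(dirty, link_bytes_per_s)

print(f"local backend:   {local:.1f} s")
print(f"100 MBit/s link: {replicated / 60:.1f} min "
      f"(raw link rate {link_bytes_per_s / MIB:.1f} MiByte/s)")
```

The point is the ratio rather than the exact numbers: the same amount of dirty data takes roughly 25 times longer to drain once the replication link is the bottleneck.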

You can also take a look at a drbd-user mailing list post.


So, concluding: try to avoid having much unsaved data in RAM; it might bite you. For example, you want your cluster to do a switchover, but the umount takes too long and a timeout hits: the node (should) get fenced, and the data not written to stable storage will be lost.

Please follow the advice about setting some sysctls to start write-out earlier!
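
The advice referred to is not part of this post, so the following is only a sketch of one way to pick values. It assumes the sysctls in question are the Linux writeback thresholds `vm.dirty_background_bytes` and `vm.dirty_bytes`, and it sizes them from the slowest link in the stack and the longest umount you are prepared to wait for. The 1:4 ratio between the two thresholds is an arbitrary illustration, not a recommendation, and the thresholds are system-wide, so they only roughly bound the dirty data of a single file system.

```python
def dirty_limits(throughput_bytes_per_s: float, max_drain_s: float) -> dict:
    """Suggest writeback thresholds so that dirty data drains in about max_drain_s."""
    hard = int(throughput_bytes_per_s * max_drain_s)
    background = hard // 4  # start background write-out well before the hard limit
    return {
        "vm.dirty_background_bytes": background,
        "vm.dirty_bytes": hard,
    }


# 100 MBit/s replication link, umount should finish within about 20 seconds.
limits = dirty_limits(100 * 10**6 / 8, max_drain_s=20)
for key, value in limits.items():
    print(f"{key} = {value}")
```

The printed lines have the form used in a sysctl configuration file; check them against your own timeouts (for example the cluster manager's stop timeout for the file system resource) before applying anything.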

Rolling updates with Ansible and Apache reverse proxies

Posted in Arrfab's Blog » Cluster by fabian.arrotin at May 23, 2013 04:36 PM

It's not a secret anymore that I use Ansible to do a lot of things. That goes from simple "one shot" actions with ansible on multiple nodes to "configuration management and deployment tasks" with ansible-playbook. One of the things I also really like about Ansible is that it's also a great orchestration tool.

For example, in some WSOA flows you can have a bunch of servers behind load balancer nodes. When you want to put a backend node/web server node in maintenance mode (to change configuration/update package/update app/whatever), you just "remove" that node from the production flow, do what you need to do, verify it's up again and put that node back in production. The principle of "rolling updates" is then interesting as you still have 24/7 flows in production.

But what if you're not in charge of the whole infrastructure ? AKA for example you're in charge of some servers, but not the load balancers in front of your infrastructure. Let's consider the following situation, and how we'll use ansible to still disable/enable a backend server behind Apache reverse proxies.

So here is the (simplified) situation : two Apache reverse proxies (using the mod_proxy_balancer module) are used to load balance traffic to four backend nodes (Jboss in our simplified case). We can't directly touch those upstream Apache nodes, but we can still interact with them, thanks to the fact that "balancer manager support" is active (and protected !)

Let's have a look at a (simplified) ansible inventory file :

[jboss-cluster]
jboss-1
jboss-2
jboss-3
jboss-4

[apache-group-1]
apache-node-1
apache-node-2

Let's now create a generic (write once/use it many) task to disable a backend node from apache ! :

---
##############################################################################
#
# This task can be included in a playbook to pause a backend node
# being load balanced by Apache Reverse Proxies
# Several variables need to be defined :
#   - ${apache_rp_backend_url} : the URL of the backend server, as known by Apache server
#   - ${apache_rp_backend_cluster} : the name of the cluster as defined on the Apache RP (the group the node is member of)
#   - ${apache_rp_group} : the name of the group declared in hosts.cfg containing Apache Reverse Proxies
#   - ${apache_rp_user}: the username used to authenticate against the Apache balancer-manager
#   - ${apache_rp_password}: the password used to authenticate against the Apache balancer-manager
#   - ${apache_rp_balancer_manager_uri}: the URI where to find the balancer-manager Apache mod
#
##############################################################################
- name: Disabling the worker in Apache Reverse Proxies
  local_action: shell /usr/bin/curl -k --user ${apache_rp_user}:${apache_rp_password} "https://${item}/${apache_rp_balancer_manager_uri}?b=${apache_rp_backend_cluster}&w=${apache_rp_backend_url}&nonce=$(curl -k --user ${apache_rp_user}:${apache_rp_password} https://${item}/${apache_rp_balancer_manager_uri} |grep nonce|tail -n 1|cut -f 3 -d '&'|cut -f 2 -d '='|cut -f 1 -d '"')&dw=Disable"
  with_items: ${groups.${apache_rp_group}}

- name: Waiting 20 seconds to be sure no traffic is being sent anymore to that worker backend node
  pause: seconds=20

The interesting bit is the with_items one : it will use the apache_rp_group variable to know which apache servers are used upstream (assuming you can have multiple nodes/clusters) and will play that command for every host in the list obtained from the inventory !

We can now, in the "rolling-updates" playbook, just call the previous tasks (assuming we saved it as ../tasks/apache-disable-worker.yml) :

---
- hosts: jboss-cluster
  serial: 1
  user: root
  tasks:
    - include: ../tasks/apache-disable-worker.yml
    - etc/etc ...
    - wait_for: port=8443 state=started
    - include: ../tasks/apache-enable-worker.yml

But Wait ! As you've seen, we still need to declare some variables : let's do that in the inventory, under group_vars and host_vars !

group_vars/jboss-cluster :

# Apache reverse proxies settings
apache_rp_group: apache-group-1
apache_rp_user: my-admin-account
apache_rp_password: my-beautiful-pass
apache_rp_balancer_manager_uri: balancer-manager-hidden-and-redirected

host_vars/jboss-1 :

apache_rp_backend_url : 'https://jboss1.myinternal.domain.org:8443'
apache_rp_backend_cluster : nameofmyclusterdefinedinapache

Now when we use that playbook, a local action will interact with the balancer manager to disable that backend node while we do maintenance.

I'll let you imagine (and create) a ../tasks/apache-enable-worker.yml file to enable it (which you'll call at the end of your playbook).
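
To make the curl one-liner easier to follow, here is a minimal Python sketch of the same two steps: pull the nonce out of the balancer-manager page, then build the URL that switches a worker. The query parameters (b, w, nonce, dw) are the ones used in the task above. The sample HTML, the helper names and the `dw=Enable` value are assumptions for illustration (the post only shows `dw=Disable`), so check them against what your own balancer-manager page emits.

```python
import re
from urllib.parse import urlencode


def extract_nonce(balancer_manager_html: str) -> str:
    """Return the last nonce found in the page (the shell version uses tail -n 1)."""
    nonces = re.findall(r'nonce=([^&"]+)', balancer_manager_html)
    if not nonces:
        raise ValueError("no nonce found in balancer-manager page")
    return nonces[-1]


def worker_action_url(proxy: str, manager_uri: str, cluster: str,
                      backend_url: str, nonce: str, action: str) -> str:
    """Build the balancer-manager URL; action is 'Enable' or 'Disable'."""
    if action not in ("Enable", "Disable"):
        raise ValueError(f"unsupported action: {action}")
    query = urlencode({"b": cluster, "w": backend_url, "nonce": nonce, "dw": action})
    return f"https://{proxy}/{manager_uri}?{query}"


# Hypothetical fragment of a balancer-manager page, for illustration only.
sample_html = (
    '<a href="/balancer-manager-hidden-and-redirected?b=nameofmyclusterdefinedinapache'
    '&w=https://jboss1.myinternal.domain.org:8443&nonce=1f3c-demo">worker</a>'
)

nonce = extract_nonce(sample_html)
url = worker_action_url(
    "apache-node-1",
    "balancer-manager-hidden-and-redirected",
    "nameofmyclusterdefinedinapache",
    "https://jboss1.myinternal.domain.org:8443",
    nonce,
    "Enable",
)
print(url)
```

An enable task would then be the disable task with the last parameter flipped, run against every host in the Apache group just like the original.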

DRBD 8.4.3: faster than ever

Posted in LINBIT Blogs by flip at February 22, 2013 08:17 AM

For the people who don’t already have DRBD 8.4.3 deployed: here’s another good reason — Performance.

As you know DRBD marks the to-be-changed disk areas in the Activity Log.

Until now, that meant a DRBD speed penalty of up to 50% for random-write workloads, ie. each application-issued write request translated to two write requests on storage.


With DRBD 8.4.3 Lars managed to reduce that overhead, from 1:2 down to 64:65, ie. to about 1.6%. (In sales speak “up to 64 times faster” ;) )

Here are two graphics showing the difference on one of our test clusters; both using 10GigE and synchronous replication (protocol C):

Random Writes Benchmark, Spinning Disk
The raw LVM line shows the hardware limit of 350 IOPS; while 8.4.2 and 8.3.15 are quickly limited by harddisk seeks, the 8.4.3 bars go up much further – in this hardware setup we get 4 times the randwrite performance!


When using SSDs the difference is even more visible: the 8.4.2 to 8.4.3 speedup is a factor of ~16.7.

Random Writes Benchmark, SSD
Again, the raw LVM line shows the hardware limit of 50k IOPS; 8.4.2 needs to wait for the synchronous writes (at 1.5k IOPS), but 8.4.3 gives 25k IOPS, at least half the pure SSD speed.
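
The ratios quoted above can be reproduced with a few lines of arithmetic. This sketch reads "1:2" and "64:65" as application writes versus total writes on storage, which is an interpretation of the post rather than a statement about DRBD internals.

```python
def al_overhead(app_writes_per_al_update: int) -> float:
    """Extra storage writes per application write, as a fraction."""
    return 1 / app_writes_per_al_update


before = al_overhead(1)   # 1:2, every application write costs one extra write
after = al_overhead(64)   # 64:65, one extra write per 64 application writes

# SSD benchmark figures from the text: 1.5k IOPS (8.4.2) versus 25k IOPS (8.4.3).
ssd_speedup = 25_000 / 1_500

print(f"overhead before: {before:.0%}, after: {after:.1%}")
print(f"SSD speedup: {ssd_speedup:.1f}x")
```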


Please note that every setup is different, and storage subsystems are very complex beasts with many non-linear, interacting parts. During our tests we found many “interesting” (but reproducible) behaviours, so you’ll have to tune your specific setup.


Furthermore, the activity log can now be much bigger; but, as the impact on performance of leaving the “hot” area is now very much reduced, you may even want to lower the al-extents – ie. tune the AL-size to the working set, to reduce re-sync times after a failed Primary.

And, last but not least, the AL can be striped – this might help for some hardware setups, too.
Please see the documentation for more details.

BTW: these changes are in the DRBD 9 branch too, so you won’t lose the benefits.


Change the cluster distribution without downtime

Posted in LINBIT Blogs by flip at February 11, 2013 01:26 PM

Recently we’ve upgraded one of our virtualization clusters (more RAM), and in the course of this did an upgrade of the virtualization hosts from Ubuntu Lucid to RHEL 6.3 — without any service interruption.

That was not that complicated, really; as our core product DRBD works on (nearly) every Linux distribution, we simply

  1. live-migrated all VMs to one of the nodes;
  2. reinstalled the root filesystem on the other node with RHEL 6.3 and configured GRUB to boot into that one;
  3. installed matching DRBD modules;
  4. waited a few seconds for the resync to complete (which was really that fast, because we didn’t touch the existing logical volumes, and so the changed data were only a few GiB);
  5. and then let Pacemaker take control over the cluster again, allowing us to migrate the VMs to the newly installed node. Without any service interruption.

The key to this was that DRBD and Pacemaker are available in compatible versions on most current distributions — and that’s not a big problem, because we make such packages available for our customers in our repositories.

Upgrading DRBD from 8.3 to 8.4 at the same time is only a small, secondary change; after all, its network code can talk to different versions by design.

Raspberry Tau: a Pi cluster

Posted in LINBIT Blogs by flip at November 28, 2012 01:20 AM

The Raspberry Pi is a small ARM computer (hardware specifications in wiki, outline and FAQs). Of course, you can build a cluster with it!

As 2π is proposed to be named τ we’ve chosen the name “Raspberry Tau” for this proof-of-concept.

We’ve connected two Raspberry Pis via their on-board ethernet interfaces (via a switch, so we can simply SSH into them), booted via 2GB SD-cards with a Raspbian image on them. After upgrading to a kernel that has kernel-headers available we built DRBD modules, and voilà! A Raspberry Tau cluster is born.

We’re replicating the data on the USB-Sticks; their performance nicely matches the available network. Here’s /proc/drbd (shortened and line-wrapped for readability):

root@raspberry-alice:~# cat /proc/version 
Linux version 3.2.0-3-rpi (Debian 3.2.21-1+rpi1) 
  (debian-kernel@lists.debian.org) (… Debian 4.6.3-1.1+rpi2)…)
root@raspberry-alice:~# cat /proc/drbd 
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by
    root@raspberry-bob, 2012-09-18 12:58:08
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:805304 nr:0 dw:348628 dr:818596 al:127 bm:70 lo:0 pe:0
      ua:0 ap:0 ep:1 wo:d oos:0
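
For monitoring, the key:value fields of that status block are easy to pick apart. The following is a minimal sketch that parses the single-device sample shown above and checks the three fields that matter most for a quick health check; the function name and the notion of "healthy" used here are assumptions for illustration, and output with several devices would need to be split per device first.

```python
import re

SAMPLE = """\
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:805304 nr:0 dw:348628 dr:818596 al:127 bm:70 lo:0 pe:0
      ua:0 ap:0 ep:1 wo:d oos:0
"""


def parse_drbd_status(text: str) -> dict:
    """Collect the key:value fields of one /proc/drbd device entry into a dict."""
    fields = dict(re.findall(r"(\w+):(\S+)", text))
    # Counters become integers; states such as 'Connected' stay strings.
    return {key: int(value) if value.isdigit() else value
            for key, value in fields.items()}


status = parse_drbd_status(SAMPLE)
healthy = (
    status["cs"] == "Connected"
    and status["ds"] == "UpToDate/UpToDate"
    and status["oos"] == 0  # nothing out of sync
)
print(status["ro"], "healthy" if healthy else "needs attention")
```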

As Raspbian is Debian-based, there are Pacemaker (and Heartbeat or Corosync) packages available … so a cheap, low-power, High-Availability cluster is easily built.

Disclaimer: for a real HA-cluster you’d need a few more things.

  • a STONITH device (if power is supplied via a Linux-PC, you could turn off the USB port by software), and
  • redundant network connectivity (USB ethernet adapter).

Of course, if you’re just clustering your media library these things might not be mandatory.

Packages are available for everyone – just drop an email to sales@linbit.com, and we will be happy to provide them.

Ansible as an alternative to puppet/chef/cfengine and others …

Posted in Arrfab's Blog » Cluster by fabian.arrotin at October 26, 2012 02:02 PM

I already know that I'll be criticized for this post, but I don't care :-) . Strangely my last blog post (which is *very* old ...) was about a puppet dashboard, so why speak about another tool ? Well, first I got a new job and some prerequisites have changed. I still like puppet (and I'd even want to be able to use puppet but that's another story ...) but I was faced with some constraints when starting a new project. For that specific project, I had to configure a bunch of new Virtual Machines (RHEL6) coming as OVF files. Problem number one was that I can't alter or modify the base image so I can't push packages (from the distro or third-party repositories). The second issue is that I can't install or run a daemon/agent on those machines.

I had a look at the different config tools available but they all require either a daemon to be started, or at least extra packages to be installed on each managed node (so it's not possible to have puppetd or puppetrun, or to invoke puppet directly through ssh, as puppet can't even be installed; same for saltstack). That's why I decided to give Ansible a try. It had been on my "TO-test" list for a long time, and it really fit the bill for that specific project and its constraints : using the 'already-in-place' ssh authorization, no packages to be installed on the managed nodes, and last-but-not-least, a learning curve that is really gentle (compared to puppet and others, but that's my personal opinion/experience).

The other good thing with Ansible is that you can start very easily and then slowly add 'complexity' to your playbooks/tasks. I'm still using for example a flat inventory file, but already organized to reflect what we can do in the future (hostnames included in groups, themselves included in parent groups - aka nested groups). Same for variable inheritance : at the group level and down to the host level, host variables overriding those defined at the group level, etc ...

The YAML syntax is really easy to understand so you can quickly have your first playbook being played on a bunch of machines simultaneously (thanks to paramiko/parallel ssh). The number of modules is lower than the number of puppet resources, but is quickly growing. I also just tested tying the execution of ansible playbooks to Jenkins so that people not having access to the ansible inventory/playbooks/tasks (stored in a vcs, subversion in my case) can use it from a GUI. More to come on Ansible in the future.

Backup ideas: using a double-stacked setup

Posted in LINBIT Blogs by flip at October 01, 2012 08:17 AM

Have you ever wanted to do a file based backup of your data without impacting
your application, and without stopping your HA replication? Here is one
possible method.

For DRBD 8.4.2 we’ve removed the double-stacked check in the userspace tools; now it’s easier to do a multi-stacked setup.

A picture says more than a thousand words; please see here:

Here is

  • a HA cluster, consisting of nodes A and B;
  • a DR node, attached via DRBD proxy; and
  • a Backup node.

The special thing here is that the dashed DRBD backup connection is not always connected; via a cronjob the backup node disconnects itself, does a (file-based) backup of the data to another storage (tape, harddisk, DVD-RW, paper tape…), and reconnects afterwards – to get the newer data, by a standard synchronization.

The advantage is that the IO load on the backup node has no impact on the primary node – you can even start a local database, and do some CPU- and IO-intensive evaluations. As the backup node is not connected to the HA cluster it cannot slow down the normal operation.

The backup itself can be taken from automatic LVM snapshots, which prevents the split-brain situation that mounting the backup DRBD resource would otherwise cause; without snapshots you need to do a drbdadm connect --discard-my-data before re-connecting.
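
The cronjob described above can be sketched as an ordered list of commands. This is only an outline under stated assumptions: the resource name, volume names, snapshot size, mount point and the use of tar are all hypothetical, the backing device is assumed to be an LVM volume whose file system starts at the beginning of the device, and the function only builds the command list; it does not execute anything.

```python
def backup_cycle(resource: str, backing_lv: str, snapshot_name: str,
                 mountpoint: str, archive: str) -> list:
    """Return the commands for one backup run, in order (nothing is executed)."""
    volume_group = backing_lv.rsplit("/", 1)[0]
    snapshot_dev = f"{volume_group}/{snapshot_name}"
    return [
        # 1. Stop receiving updates so the backup sees a fixed point in time.
        ["drbdadm", "disconnect", resource],
        # 2. Snapshot the backing volume instead of mounting the DRBD device.
        ["lvcreate", "--snapshot", "--size", "1G", "--name", snapshot_name, backing_lv],
        # 3. File-based backup from the snapshot.
        ["mount", "-o", "ro", snapshot_dev, mountpoint],
        ["tar", "-czf", archive, "-C", mountpoint, "."],
        ["umount", mountpoint],
        ["lvremove", "-f", snapshot_dev],
        # 4. Reconnect; a normal resync brings in the newer data.
        ["drbdadm", "connect", resource],
    ]


commands = backup_cycle("r0-backup", "/dev/vg0/backup", "backup_snap",
                        "/mnt/backup_snap", "/srv/backups/data.tar.gz")
for command in commands:
    print(" ".join(command))
```

Because the DRBD device itself is never mounted, its data does not diverge from the rest of the cluster and a plain `drbdadm connect` is enough at the end.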

And: all of that is possible while having a continuous connection to the disaster-recovery site.

Credit for this also goes to Mark Olliver from Thermeon, who was involved in testing and developing this setup.

Mirrored SAN vs. DRBD

Posted in LINBIT Blogs by flip at September 11, 2012 12:42 PM

Every now and then we get asked “why not simply use a mirrored SAN instead of DRBD”? This post shows some important differences.

Basically, the first setup has two servers, one of them actively driving a DM-mirror (RAID1) over (eg.) two iSCSI volumes that are exported by two SANs; the alternative is using a standard DRBD setup. Please note that both setups need some kind of cluster manager (like Pacemaker).

Here are the two setups visualized:
The main differences are:

| #  | SAN | DRBD |
|----|-----|------|
| 1. | High cost, single supplier | Lower cost, commercial-off-the-shelf parts |
| 2. | At least 4 boxes (2 application servers, 2 SANs) | 2 servers are sufficient |
| 3. | DM-Mirror has only recently got a write-intent-bitmap, and at least had performance problems (needed if active node crashes) | Optimized Activity Log |
| 4. | Maintenance needs multiple commands | Single userspace command: drbdadm |
| 5. | Split-Brain not automatically handled | Automatic Split-Brain detection, policies via DRBD configuration |
| 6. | Data Verification needs to get all data over the network – twice | Online-Verify transports (optionally) only checksums over the wire |
| 7. | Asynchronous mode (via WAN) not in standard product | Protocol A available, optional proxy for compression and buffering |
| 8. | Black Box | GPL solution, integrated in standard Linux Kernel since 2.6.33 |

So the Open-Source solution via DRBD has some clear technical advantages — not just the price.

And, if that’s not enough — with LINBIT you get world-class support, too!

Highly available infrastructure for your own website

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at August 07, 2012 05:47 AM

Every site is different, so this isn’t so much a tutorial as some things to watch out for. We’ll take a reasonably representative database-backed site and talk about what changes when we make it highly available.

The site

For the purposes of demonstration we’ll use Magento, an e-commerce website written in PHP with a MySQL backend. As well as exemplifying the popular LAMP pattern, Magento allows for extensions that use extra software components, which also need to be taken into consideration in a highly available setup.

It’s worth noting that these notes apply even to vastly different systems. Taking some big customers that we’ve worked on as examples, Github is effectively a Rails app and Testflight’s core is mostly Django – the problem is approached in the same way.

Types of problems you’ll face

The approach we’re taking is to separate the moving parts and make each one highly available. This has the benefit of making the system more scalable in the process.

The parts we’re dealing with are:

  • Webserver frontend
  • Database backend
  • Add-on components
  • Load balancer, necessary for running the frontends

Webserver

The webserver tier is generally the simplest to scale – just deploy more webservers and put them behind a load balancer. The catch here lies in keeping everything in sync, and sharing state between the servers.

Rolling out your codebase

Your site will need periodic updates for bugfixes and the occasional new feature, and this applies to any site. Content systems like Magento and WordPress generally have a one-click method to apply these, while something written in-house might use Capistrano, or something as simple as a subversion/git checkout from a repo.

When these occur, you’ll generally want to do some testing and allow for a clean changeover, and having a load balancer makes this very straightforward. To perform a change/upgrade on each frontend: manually remove it from the load balancer, apply the update, then reinsert it into the load balancer.

This way the end user never sees a half-ready server, and you can perform some testing before reinsertion if needed. If you like, you can extend this to do “blue-green deployments”.

Session state

Magento, along with pretty much every substantial website, uses sessions to remember users and provide a continuous browsing/shopping experience. This data is commonly stored on the server in files, tied to a cookie that the client keeps.

This breaks if you start using multiple servers as requests will tend to be spread over all the servers, resulting in inconsistent state depending on which server the user happens to reach, and a broken experience overall. Some apps can store state in the client itself, but this tends to be inefficient as more data is transferred with every single request.

The solution to this is to share session data between all the servers. The most common approach is to store sessions in a shared database, in effect turning a “file problem” into a “database problem”, which we’ll deal with in the next section.

Memcached is a simple, high-performance and widely-used database used for sharing session data between frontends. You run a single instance of memcached on the network and have all the frontends connect to it. The primary downside of memcached is that it’s purely in-memory – if your memcached server ever crashes, you lose all session data. It’s not the end of the world, but it makes for a poor experience if a customer is in the middle of a payment transaction.

Specifically for Magento, we’ve found these extensions for storing data in Redis, Cm_RedisSession and Cm_Cache_Backend_Redis. We love Redis because it behaves well and stores data to disk, unlike memcached. That’s a win in our books, and is ideal for HA Magento.

An alternative offered by load balancers is “sticky sessions”, which ensures that a given client always hits the same frontend server. We’re not fans of shifting persistence to the load balancer as it doesn’t actually make for proper seamless HA, and can have problems scaling up. Sticky sessions will also need to be expired from the load balancer (it has a finite amount of memory), and you’ll run into mismatches with the website’s idea of sessions.

Generated files and user uploads

This is usually a lesser concern, but some Magento extensions create files on the server itself. A classic case is the use of minification tools to consolidate and cache CSS and Javascript files, allowing for faster page delivery. If written correctly, this should Just Work on each frontend, although the cache won’t be warmed up until requests hit each server.

A related issue is handling uploaded user content, most commonly image files. These need to go to some form of shared storage, along with any thumbnails that the site is likely to generate. A shared filesystem, such as NFS, is an easy way to do this. Another option is a clustered filesystem such as GFS or OCFS2. However you do this, the shared storage also needs to be highly available.

Webserver configuration

As a final point, you’ll want to keep the webserver (apache, nginx, etc.) config consistent across all the servers. We use Puppet for config management and automation, which makes things super simple. If you’re not doing something similar, you’re going to have a bad time.

Database

So now you’ve got your frontends scaling out nicely and storing data in a database or shared filesystem. Now you need to make the storage layer highly available.

If you’ve been reading our HA articles, you’ll know that Corosync and Pacemaker are the way to go. Databases and filesystems are backed by DRBD storage, with a standby server ready to take over if the active server goes up in smoke.

This is our general formula for anything that needs on-disk storage; everything above the block device is just a service, which can be stopped and started on another node to effect a failover. This works great for NFS, MySQL, PostgreSQL and Redis, as well as more exotic things like AMQP servers.

Additional components

The above sections cover almost all the problems you’re likely to run into. Even so, we went hunting to look for other problems you might face. Something we thought was interesting was integration with frontend caches and CDNs.

One particular piece of software we’ve worked with is Varnish, a high-performance caching proxy designed to reduce the load on webservers, which tend to serve large amounts of static content. Caching can be hard to get right, especially so in a distributed environment. Care needs to be taken to ensure that content is correctly cached, without inadvertently leaking sessions between users.

Dealing with CDNs is more closely related to handling server-generated files. If using such an extension for Magento, you’ll want to test that things behave properly when it comes to pushing content to the CDNs, with particular attention paid to any versioning that the extension performs.

Load balancer

The load balancer isn’t too special on its own, beyond the necessary HA-ification. Our preference is to run a pair of load balancers, each one a virtual machine to allow for easy scaling, with ldirector on each, and Pacemaker-managed virtual-IP for each service.

There’s a particular caveat when it comes to dealing with load balancers and source IP addresses: if your load balancer performs NAT before forwarding traffic to the frontend servers, the application will see the load balancer as the source-address instead of the client’s real address. This can be worked around if the load balancer adds an X-Forwarded-For header to the incoming request, but we prefer to use LVS’ Direct Routing mechanism and avoid the problem entirely.
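
If you do end up behind a NATing load balancer, the application side of the X-Forwarded-For workaround looks roughly like this. It is a minimal sketch: the function name, the plain-dict headers and the set of trusted balancer addresses are assumptions for illustration, and it only covers a single layer of load balancers.

```python
def client_ip(remote_addr: str, headers: dict, trusted_proxies: set) -> str:
    """Pick the client address, honouring X-Forwarded-For only from our own balancers."""
    forwarded = headers.get("X-Forwarded-For", "")
    if remote_addr in trusted_proxies and forwarded:
        # The header is a comma-separated chain and each proxy appends the address
        # it saw, so the last entry is the one our own load balancer vouches for.
        return forwarded.split(",")[-1].strip()
    # Direct connection, or a header sent by someone we do not trust.
    return remote_addr


balancers = {"10.0.0.5", "10.0.0.6"}
print(client_ip("10.0.0.5", {"X-Forwarded-For": "203.0.113.7"}, balancers))
print(client_ip("198.51.100.9", {"X-Forwarded-For": "192.0.2.1"}, balancers))
```

Ignoring the header unless the connection comes from a known balancer matters, because any client can send an X-Forwarded-For header of its own. With Direct Routing none of this is needed, since the frontend sees the real source address.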


There’s a lot to take into consideration if you’re planning to deploy a highly available website. Some sites can be patched up fairly easily, while for others it’ll be a lot of work. A well-written and architected site can make this easier, but it tends to come at the cost of added complexity, and can be harder to maintain in future.

Next time we’ll talk about some of the realities of an HA deployment. High availability is a great concept, but it doesn’t come for free, and in many cases it’s hard to argue that it’s worth it. What’s your website worth to you, and how much disruption would you really be willing to tolerate?

Interested in building state-of-the-art high availability websites and infrastructure? We’re hiring.

The post Highly available infrastructure for your own website appeared first on Anchor Web Hosting Blog.

Pacemaker and Corosync for virtual machines

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at July 20, 2012 07:23 AM

In the previous post we talked about using Corosync and Pacemaker to create highly available services. Subject to a couple of caveats, this is a good all-round approach.

The caveats are what we’ll deal with today. Sometimes you’re dealing with software that won’t play nice when moved between systems, like a certain Enterprise™ database solution. Sometimes you can’t feasibly decompose an existing system into neat resource tiers to HA-ify it. And sometimes, you just want HA virtual machines! This can be done.

The solution

If the solution to our problem is to run everything on a single server, so be it. We then virtualise that server, and make it highly available.

Once again, it’s important to remember that we’re guarding against a physical server going up in smoke. There’s no magic scalability here, and ideally the HA subsystem never actually does anything, except when there’s a major problem.

As per our standard setup, we’re using KVM for virtualisation as it’s been mainlined into the Linux kernel.

Pacemaker resources

A highly-available VM is really simple: it comprises just two pacemaker resources:

  • DRBD for replicated storage
  • Running of the VM itself

After this, everything else is pretty standard – DRBD needs to start before the VM, and stop after it. The VM management is a new type of pacemaker resource, and that’s it. The start/stop/monitor actions in the resource agent script call out to the libvirt library, and let it handle the hard work.

How failure is handled

At this point, the VM is an opaque black box. As long as libvirt reports that the VM is running, pacemaker won’t do anything. This is good because it means you can treat the VM as a normal machine; apply all your usual monitoring for the Enterprise database app, and kick it as usual when it breaks. A BSoD or kernel panic is nothing special either: the VM is still running.

The failure case that we do care about is if one of the KVM hosts stops working. If the VM monitor action times out or the host stops responding, the standby node in the clustered pair will notice, possibly STONITH the bad node, and take over the running of your VM.

It’s important to know what this means for your VM: Pacemaker will attempt to cleanly shut down the VM, then yank the virtual power cord if that fails. This means that when the VM comes up on the standby node it will have to deal with an unclean shutdown, which can take a long time if a fsck/chkdsk is needed. HA cannot help you in this scenario!

Things to note

Pacemaker adds an extra layer of fun if you forget (or don’t know) that it’s keeping an eye on things: it’ll keep restarting a VM that you’re trying to shut down, unless you tell it to stop managing it. This doesn’t happen on an ordinary “services deployment” because pacemaker will hand off resources when it shuts down. Watching an unwitting sysadmin deal with this is like playing with a roly-poly toy. :)

Summary and evaluation

While not without limitations, HA for whole VMs can be very convenient. At its extreme, it allows you to offer high availability for servers that you don’t even have login access for.

One catch is that it can be expensive – each KVM host needs enough RAM and disk space to run all of the VMs in the event of a failure. If you have many VMs on an HA pair there’s a lot of unused computing capacity, which tends to have a large capital cost upfront.
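As a rough illustration of that sizing rule, here is a minimal Python sketch. The function name, the figures and the `reserve_gib` headroom for the host itself are all made up for the example; the only rule taken from the text is that each host must be able to carry every VM on its own.

```python
def survives_host_failure(host_ram_gib, vm_ram_gib, reserve_gib=2.0):
    """Return True if every host could run all of the VMs by itself.

    reserve_gib is a hypothetical allowance for the host OS, not a
    figure from the article.
    """
    needed = sum(vm_ram_gib) + reserve_gib
    return all(ram >= needed for ram in host_ram_gib)


vms = [8, 16, 4]  # hypothetical guest sizes in GiB (28 GiB in total)

print(survives_host_failure([32, 32], vms))  # both hosts fit 28 + 2 GiB
print(survives_host_failure([32, 24], vms))  # the 24 GiB host does not
```

The same check applies to disk space; only the units change.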

The Linux HA suite offers robust solutions, but doesn’t always ensure the best utilisation of resources. Next time we’ll start talking about high-availability through load-balancing and redundancy, which can be a very nice way to get the scalability and availability that you need if you’re willing to make substantial changes to your application architecture.

The post Pacemaker and Corosync for virtual machines appeared first on Anchor Web Hosting Blog.

Pacemaker and Corosync for HA services

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at July 19, 2012 05:20 AM

Now that we’ve got our terminology sorted out, we can talk about real deployments. Our most common HA deployments use the Linux HA suite, with multiple services managed by pacemaker. This is roughly the “stack” that we referred to in the first post in the series.

We’ve already covered the resources involved, so we’ll focus on the important bit:
What happens when something goes wrong?

Normal operation

Recall that on our hypothetical HA database server, we’ve got the following managed resources:

  • DRBD storage
  • The filesystem
  • Floating IP address for the service
  • The DB service itself

Each resource has its own monitor action, specified by the Resource Agent (RA). Roughly speaking, an RA is a script that implements a common interface between pacemaker and the resources it can manage. It looks a lot like an initscript, but more rigorously defined.

The monitor action is straightforward: pacemaker runs it regularly (20sec is a normal interval), and it either says the resource is running fine, is not running, or the action times out. So long as pacemaker keeps hearing good news, nothing exciting happens.

Before we go too much further, let’s quickly discuss what “monitor” means.

Monitoring cluster resources

Each resource needs some sort of monitoring to be useful. Pacemaker doesn’t care how it works, so long as it happens. “Success” in the monitor action means:

  • For a DRBD device we check that the kernel module is loaded, and that the local node is in either the DRBD Primary or Secondary role
  • A filesystem must be mounted (check /proc/mounts). We can optionally also check that the filesystem is writeable
  • An IP address is bound to an interface
  • A database must answer a basic SELECT query over a standard client connection

Monitoring is pretty straightforward, but it’s important (and sometimes difficult) to write monitoring actions that accurately reflect the state of the resource, without depending on correct functionality of an unrelated component.

An example of this would be a network fault causing problems for an NFS mount, which affects your ability to read the status of a local (pacemaker-managed) filesystem.
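To make the filesystem check from the list above concrete, here is a minimal Python sketch. It is not the code of any real resource agent; `is_mounted` and `read_proc_mounts` are names invented for this example. It only reads the kernel's mount table instead of touching the mounted filesystems themselves.

```python
def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is a mount target in /proc/mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, ...
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def read_proc_mounts(path="/proc/mounts"):
    """Read the mount table (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Illustrative sample; on a real node you would pass read_proc_mounts().
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/drbd0 /var/lib/pgsql ext4 rw,relatime 0 0\n"
)
print(is_mounted("/var/lib/pgsql", sample))
```

This sketch compares the mount point literally, so paths containing spaces (which /proc/mounts escapes) are not handled.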

Recovering from monitoring failures

So what happens if a monitor action fails? If the resource isn’t running, pacemaker will try to run the start action to bring it up. If the monitor action times out, it will try to cleanly stop the resource and then start it again, possibly on the other cluster node.

This makes for a resilient system that tries to repair itself in the face of failure. Things get more interesting if recovery also fails, and that’s where STONITH steps in.

When recovery fails

All of the stop/start/monitor actions have timeouts built-in, and pacemaker will attempt to handle a timeout condition as well. We’ve already seen that a monitor-timeout translates to a stop and start. A timeout on start isn’t a big deal, we can try to start it again. A failure or timeout on stop is considered critical.

A broken resource that can’t be stopped gracefully needs to be taken by force. We’ve already covered this pretty well in the first article so we won’t dwell on it, but it suffices to say that you’ll incur a bit of downtime as the cluster sorts things out and brings services up again.
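The recovery rules described in the last two sections can be summarised as a small lookup table. The Python below is only a sketch of the behaviour as described in this post, not Pacemaker's actual decision logic; the outcome names and the `recovery_for` function are invented for the example, and combinations the post doesn't mention are deliberately rejected.

```python
# (action, outcome) -> what the cluster does next, per the text above
RECOVERY = {
    ("monitor", "not_running"): "start",
    ("monitor", "timeout"): "stop, then start (possibly on the other node)",
    ("start", "timeout"): "retry start",
    ("stop", "failed"): "fence the node (STONITH)",
    ("stop", "timeout"): "fence the node (STONITH)",
}


def recovery_for(action, outcome):
    """Look up the recovery step for a resource action and its outcome."""
    if outcome == "ok":
        return "nothing to do"
    try:
        return RECOVERY[(action, outcome)]
    except KeyError:
        # Fail loudly rather than guess at cases the post doesn't cover.
        raise ValueError(
            f"case not covered by this sketch: {action}/{outcome}"
        ) from None


print(recovery_for("monitor", "timeout"))
print(recovery_for("stop", "timeout"))
```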

Summary and evaluation

The deployment we’ve described is reliable and well-behaved. Because each resource is self-contained and independent, any problems are usually straightforward to diagnose and repair.

We’ve found that most services can be decomposed into a similar stack of resources – in the end it’s just a daemon being started-up on a server, and sometimes the server it runs on changes.

There are some services that don’t play nicely this way though, and sometimes you want to manage something bigger, like a VM. We’ll cover this in our next post on high availability deployments.

The post Pacemaker and Corosync for HA services appeared first on Anchor Web Hosting Blog.

Anatomy of an HA stack

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at July 09, 2012 05:23 AM

In what we plan to be a small series of articles about our high availability deployments, we thought we’d start by defining the key components in the stack and how they work together.

In future we’ll cover some of the more specific details and things that need to be taken into consideration when deploying such a system. For now we’ll talk about the bits that we use, and why we use them.

Type of deployment

A highly available system is also highly complex, so it’s important to know just what problem you’re trying to solve when you take on that burden.

Our systems are designed to deal with the total failure of a server chassis. This is very low-level and was chosen because it provides the greatest flexibility when dealing with various software stacks.

To be clear, this is not at all like a clustered application, which is written to run on multiple servers at once. In our setup the active server can fail, and the standby server will step in to take the load.

A high-availability deployment can be as large and complex as you want, but we like to keep things simple. Some nomenclature:

  • We’ll only talk about two-server deployments, which covers almost every system we manage
  • In a two-server setup, one is “active” and the other is “standby”
  • Together, these form an “HA cluster”
  • Each server in the cluster is a “node”

Hardware

Because we’re dealing with whole-machine failure scenarios, we use servers with identical specifications to build the cluster.

Each chassis must be powerful enough to shoulder the full load on its own, as there’s no expectation to share the load within the cluster.

The scenario

Now that we’ve got the basics out of the way, we’ll present a fairly common use case for such a setup: a highly-available PostgreSQL database server.

Note that we’re not trying to use replication here, that’s for solving a different problem. Replication could be used to effect the same outcome in this scenario, but it introduces a different sort of complexity and more work to repair things when the active server fails.

Total hardware failures aren’t terribly common. The point of HA here is to mitigate the risk of extended downtime if things go bad, and squeeze out an improved uptime figure. As a bonus, routine maintenance can be carried out on the cluster servers with minimal disruption to services.

Corosync

At its most basic, running a cluster is a matter of ensuring all the members are talking to each other, on the same page, and then sending messages to negotiate who should be running a particular cluster-managed service.

Corosync is the messaging layer of the cluster, effectively holding everything together. It handles membership of the cluster and ensures that problems are detected very quickly. This information is communicated up the stack to the Cluster Resource Manager (crm) in Pacemaker, whose job it is to actually do something about it.

Pacemaker

Making use of the Corosync cluster engine, it’s Pacemaker’s job to actually take care of the managed resources in the cluster.

While we tell Corosync about the nodes in the cluster, we tell Pacemaker about what resources to run in the cluster, and how it should be done.

A resource is just anything that Pacemaker can manage. While it can be almost anything you like, typical examples are DRBD devices, filesystem mounts, IP addresses, etc.

Just starting resources isn’t enough though – we need to make sure that resources are started in the correct order, and on the right node. This is where constraints come in (eg. start A before B, and B before C). For example, we can’t mount a filesystem until the underlying DRBD block device is up and running. Similarly, we can’t start a daemon that listens on the network until its IP address is brought up on the same machine.

Resources

Now that we have the management components out of the way, we can talk about the building blocks of actually running an HA database on it.

Without constraints, HA resources are effectively independent. To be useful to us, we build them into a stack. Resources higher in the stack necessarily depend on resources further down the stack, as described in the previous section.

The stack isn’t part of Pacemaker’s config, it’s purely conceptual. In action, we’ll push the whole stack of resources between cluster nodes.

In rough order that they appear in the stack, we’ll look at the DRBD storage, filesystem, IP addresses, and the database daemon.

DRBD

DRBD stands for Distributed Replicated Block Device, and can be thought of as RAID-1 over a network. DRBD provides us with a block device that is guaranteed to be identical at both ends, giving us a form of shared storage between two cluster nodes.

Because DRBD presents a generic block device to the system, it can be formatted with a filesystem and used exactly as you would any other storage medium.

A DRBD device is a Pacemaker-managed resource, with the constraint that it can only be used on one cluster node at a time (the one that will run the database daemon).

DRBD must be prepared before we can mount the resident filesystem, which is the next step.

Filesystem

Before using a DRBD device for the first time we create a filesystem, usually a vanilla ext3 or ext4. Once prepared, we can then have Pacemaker manage the mounting/unmounting at /var/lib/pgsql.

The filesystem can only be mounted on one cluster node at a time, which Pacemaker will guarantee. A cluster-filesystem can be multi-mounted, but would provide no benefit in this scenario.

The filesystem must be mounted after DRBD is started, and before we attempt to start the Postgres daemon.

IP address

To provide a consistent entry point to the database, we create a special “floating” HA IP address that will always be present on the active cluster node.

Like the other resources, the IP address can only be used on one cluster node at a time. Pacemaker will handle this for us.

The IP address can be brought online at any time (eg. while the DRBD device and filesystem are being prepared), but it must be before Postgres is started.
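The ordering rules for these resources form a small dependency graph. As a sketch only (this is not how Pacemaker constraints are written or evaluated), the Python below uses the standard library's `graphlib` (Python 3.9+) to derive a valid start order. The resource names and the `START_AFTER` mapping are made up for the example, and reversing the order for shutdown is an assumption.

```python
from graphlib import TopologicalSorter

# resource -> resources that must already be running before it starts
START_AFTER = {
    "drbd": set(),
    "filesystem": {"drbd"},
    "ip_address": set(),
    "postgres": {"filesystem", "ip_address"},
}


def start_order(deps):
    """Return one valid start order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps):
    """Assumption: stopping happens in the reverse of the start order."""
    return list(reversed(start_order(deps)))


print(start_order(START_AFTER))
print(stop_order(START_AFTER))
```

Several orders are valid here because the IP address is independent of the storage; the only fixed points are that DRBD precedes the filesystem and that Postgres starts last.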

STONITH

The last component in the cluster is STONITH, which stands for “Shoot The Other Node In The Head” (this page has a fairly accurate graphical depiction of the concept).

Things will sometimes break or malfunction in an HA cluster; this is expected. Some types of failure are tolerable (eg. by retrying), while others are more critical. A communications failure is the latter.

STONITH exists to solve a problem called “split-brain”. If the two nodes can’t talk to each other, they can’t be sure who’s at fault. Because it’s their job to make sure all the resources are running, they’ll both want to take the “active” role in the cluster. This is the split-brain.

A split-brain situation is dangerous because both nodes will attempt to use resources that can’t be shared. If a communications failure is preventing us from asking the other node to gracefully let go of (“stop”) a resource, we use the nuclear option and switch off power to the other node.

As an example, assume we have two nodes Alpha and Beta that manage an ext3 filesystem.

  1. The filesystem is currently mounted on Alpha
  2. A clumsy datacentre technician is moving some cabling and inadvertently unplugs the switch carrying the cluster traffic
  3. Alpha thinks that Beta has crashed. This is no big deal, the filesystem is still mounted
  4. Beta thinks that Alpha has crashed, oh no! We need to unmount the filesystem on Alpha and mount it locally on Beta
  5. The network is down, so Beta can’t ask Alpha to unmount the filesystem
  6. Beta invokes a STONITH action on Alpha, it’s the only way to be sure! Alpha’s DRAC receives a poweroff command and promptly shuts down, hard
  7. Beta now mounts the filesystem. It needs a fsck because it wasn’t cleanly unmounted, but we’re up and running again shortly
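Seen from Beta's side, the steps above boil down to one rule: never mount the filesystem until the peer is known to have let go of it. The Python below is a hedged sketch of that rule. `take_over` and `fence_peer` are hypothetical names, and refusing to mount when fencing fails is an assumption about sensible behaviour rather than something spelled out in this post.

```python
def take_over(peer_reachable, fence_peer):
    """Return the actions the surviving node performs, in order.

    fence_peer is a hypothetical callable that powers the peer off and
    returns True only once that has been confirmed.
    """
    if peer_reachable:
        # Normal case: the peer can be asked to stop the resource cleanly.
        return ["ask peer to unmount", "mount locally"]
    if not fence_peer():
        # Assumption: without confirmed fencing, mounting would risk
        # both nodes using the filesystem at once.
        return ["fence peer (failed)", "do not mount"]
    return ["fence peer", "mount locally"]


# The unplugged-switch scenario: the peer is unreachable, fencing works.
print(take_over(peer_reachable=False, fence_peer=lambda: True))
```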

STONITH is obviously a very violent operation, so we want to make sure it only kicks in when things have really gone bad and we’re out of options to get the resources started. We guard against this possibility by having redundant links for our cluster traffic.


That wraps up our introduction to HA. In the near future we’ll talk more about how HA clusters are used as part of a larger system, and the kinds of considerations you need to make when adding one to your architecture. If anything is unclear or you just have a burning question, feel free to leave a comment.

The post Anatomy of an HA stack appeared first on Anchor Web Hosting Blog.

Best Practice: Use Backup with your DRBD cluster!

Posted in LINBIT Blogs by devin at May 30, 2012 08:16 PM

We want to take an opportunity to explain LINBIT’s best practices in regards to DRBD and backup procedures.

DRBD is designed as a storage solution to provide High Availability, Disaster Recovery and Cross Site High Availability to your systems.  As developers of DRBD, we sometimes get community feedback that some folks are using DRBD as a “pseudo” backup solution, and in response to this we wanted to share some abstract guidelines on utilizing DRBD properly by following some key best practice methodologies.

Although DRBD is not backup software, it doesn’t mean you can’t use it in your backup procedures. Utilizing DRBD with LVM as a backing device, one can create backups with minimal to no interference to performance. This is done by utilizing LVM snapshotting as outlined in LINBIT’s DRBD User’s Guide.  Although this page outlines how to do snapshots before and after a resync, these could easily be adapted to a cron job.  Essentially one would disconnect the Secondary, snapshot the backing device, mount the snapshot, perform the backups, umount the snapshot, reconnect the Secondary.  These point in time backups are great for technology such as iSCSI targets, Virtual Machine storage or Databases such as MySQL and PostgreSQL.  As you can imagine, this methodology is quite popular in the Linux HA and DRBD communities.
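The sequence just described could be scripted along the following lines. This Python sketch only builds the list of commands to run on the Secondary and does not execute anything. The function name, snapshot name, mount point, snapshot size and the `tar` invocation are all placeholders, and the final `lvremove` (so the snapshot doesn't linger) is an addition that the paragraph above doesn't mention.

```python
def snapshot_backup_commands(resource, vg, lv, snap_size="10G",
                             mount_point="/mnt/drbd-snap", backup_cmd=None):
    """Return the ordered commands for one snapshot-based backup run."""
    snap = f"{lv}-backup"            # hypothetical snapshot name
    snap_dev = f"/dev/{vg}/{snap}"
    backup = backup_cmd or [         # placeholder backup step
        "tar", "-czf", "/backup/drbd.tar.gz", "-C", mount_point, ".",
    ]
    return [
        ["drbdadm", "disconnect", resource],
        ["lvcreate", "--snapshot", "--name", snap,
         "--size", snap_size, f"/dev/{vg}/{lv}"],
        ["mount", snap_dev, mount_point],
        backup,
        ["umount", mount_point],
        ["lvremove", "-f", snap_dev],   # assumption: clean up the snapshot
        ["drbdadm", "connect", resource],
    ]


for command in snapshot_backup_commands("r0", "vg0", "pgdata"):
    print(" ".join(command))
```

Building the list separately from running it makes it easy to review or log the exact steps before handing them to `subprocess.run` from a cron job.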

LINBIT advises systems administrators to:

  1. Utilize DRBD for High Availability, Disaster Recovery and Cross Site High Availability (business continuity) purposes.
  2. Plan, review and execute a full backup strategy that makes sense for your organization and data.   Be sure to keep in mind how much data you’re planning on storing, backing up and at what intervals.  It is important to choose the point in time to make your backups to minimize things such as user error.  In many cases, backing up every day is the appropriate strategy.
  3. Test, test, test.  We cannot say this enough. We develop software that is designed to prevent loss from failure, so you could say we’re experts on this topic.  It’s very important that you not only test DRBD’s configuration, but the components that make up your backup system as well.  Then, on a scheduled basis, you should be reviewing your data to ensure its completeness and correctness.  As well, on an annual basis it would be wise to review your top level strategy and make updates if your requirements have changed.  In summation, it is advised to routinely test your backup procedure and also verify (checksum) your backups to ensure their completeness.

In closing, DRBD is designed to prevent loss of service as the result of equipment failure.  LINBIT strongly advises systems administrators to implement a strategy that incorporates “point in time” backups so administrators can restore, rewind and rejoice knowing that they’re not only backed by the best open source replication technology: DRBD, but a comprehensive backup solution that is designed for the organization’s needs in mind.

How do you backup your DRBD cluster?

Share your thoughts or comments below! :)

Imitation is the sincerest form of flattery

Posted in Florian's blog by Florian Haas at May 29, 2012 10:37 AM

Know how they say that you don’t know you’re doing something right until someone starts imitating you? Well, this is a great time for us.

Someone evidently took a good long read of our hastexo High Availability Expert class agenda, and made this. It’s wonderful for us to see that it’s such an inspiration to others (even the acronym!), and we hope to see more folks doing this in the future.

Meanwhile, I should say this since I haven’t yet mentioned it here on my blog: there’s a new feature to our own HHAX class that we’ve added to our upcoming class in Berlin, and that is the booth arbitration daemon and ticket manager.

Read more…


4 extra seats available in Cloud Bootcamp in Wellington!

Posted in Florian's blog by Florian Haas at May 21, 2012 09:00 PM

Our Cloud Bootcamp for OpenStack™ in Wellington next month just got 4 extra seats! If you’re in New Zealand, Australia, or the Pacific region, and want to learn about OpenStack, now is your chance!

Read more…


Returning to Paris for OpenStack in Action 2: Production Ready

Posted in Florian's blog by Florian Haas at May 10, 2012 11:30 AM

This month, I’m thrilled to go to Paris to talk about highly-available OpenStack. The event I’m speaking at is OpenStack in Action 2: Production Ready, and it’s being organized by French hosting & cloud services provider eNovance.

Read more…


“read-balancing” with 8.4.1+

Posted in LINBIT Blogs by flip at May 09, 2012 06:45 AM

DRBD 8.4.1 introduces a new feature: read-balancing, which is configured in the disk section of the configuration file(s). This feature enables DRBD to balance read requests between the Primary/Secondary nodes.

While writes occur on both sides of the cluster, by default the reads are served locally (ie., the value is prefer-local). This might not be optimal if you’ve got a big pipe to the other node and a heavily loaded IO subsystem.

read-balancing has several options to choose from:

  • 32K-striping up to 1M-striping chooses the node to read from via the block address – eg. for 512K-striping the first half of each MiByte would be read from one machine, and the second half from the other.
    This is a simple, static load-balancing.
  • round-robin just passes the request to alternating nodes.
    This might go wrong if your application reads 4kiB, 1MiB, 4kiB, 1MiB, and so on – but this is fairly unlikely.
  • least-pending chooses the node with the smallest number of open requests.
  • when-congested-remote uses the remote node if there are local requests.
  • prefer-remote is implemented for completeness, however as of this writing there is no viable use case.

Please note that all this is still below the filesystem layer – so even if the secondary is used for reading, this won’t speed up a failover, as the pages read are not kept anywhere.
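To illustrate how three of these policies choose a node, here is a small Python sketch. It mirrors the descriptions above and is not DRBD's in-kernel implementation. The function names are invented, nodes are simply numbered 0 and 1 (which one is local is left open), and breaking ties towards node 0 is an assumption.

```python
KIB = 1024
MIB = 1024 * KIB


def striping_node(offset_bytes, stripe_bytes):
    """N-striping: alternate nodes every stripe_bytes of block address."""
    return (offset_bytes // stripe_bytes) % 2


def least_pending_node(pending):
    """least-pending: the node with the fewest open requests.

    Ties go to the lower-numbered node (an assumption).
    """
    return min(range(len(pending)), key=lambda node: pending[node])


def round_robin_bytes(read_sizes):
    """round-robin: total bytes each node serves for a read sequence."""
    totals = [0, 0]
    for index, size in enumerate(read_sizes):
        totals[index % 2] += size
    return totals


# 512K-striping: first half of each MiByte on one node, second on the other
print(striping_node(0, 512 * KIB), striping_node(512 * KIB, 512 * KIB))

# The unlucky 4kiB, 1MiB, 4kiB, 1MiB pattern mentioned above
print(round_robin_bytes([4 * KIB, 1 * MIB] * 4))
```

The last call shows why the alternating pattern is a problem for round-robin: one node ends up serving all of the large reads.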

Our first Cloud Bootcamp is now Sold Out

Posted in Florian's blog by Florian Haas at May 08, 2012 09:44 PM

Less than two weeks after it’s been announced, our inaugural Cloud Bootcamp for OpenStack™ in Wellington, New Zealand is now sold out. Our friends at Catalyst IT have put up a wait list, and we’re currently working on tacking on extra days to fill the excess demand.

This will be fun.

Read more…


An exciting day for the Ceph community

Posted in Florian's blog by Florian Haas at May 03, 2012 11:10 AM

Today, as you’ve probably noticed if you’re following the development of the Ceph stack, something mighty cool has been happening. The ceph.com web site received a major makeover with a slick new design, and the people behind Ceph have announced the launch of a brand new company to drive the Ceph stack, Inktank.

Read more…


More details on OSCON 2012, and your chance to get in cheaper!

Posted in Florian's blog by Florian Haas at May 01, 2012 06:29 PM

A few more details on my speaking slot at this year’s OSCON, titled Highly Available Cloud: OpenStack Integration with Pacemaker.

Read more…


Coming to New Zealand!

Posted in Florian's blog by Florian Haas at April 25, 2012 05:37 AM

hastexo is offering Cloud Bootcamp for OpenStack™ in Wellington. Another fine example of the global OpenStack community at work.

Read more…


A look back at my first OpenStack Design Summit & Conference

Posted in Florian's blog by Florian Haas at April 24, 2012 09:35 AM

I’ve just returned from the OpenStack Folsom Design Summit and Spring 2012 Conference, and am finally getting rid of my jet lag. Here’s a summary of what’s been a mind-blowing conference experience for me.

Read more…


LINBIT participates in the German Cloud (“Deutsche Wolke”)

Posted in LINBIT Blogs by flip at April 23, 2012 06:30 PM

Deutsche Wolke (“German Cloud”) was founded to establish Federal Cloud Infrastructure in Germany.

This infrastructure will provide additional legal and security protections for hosted data.  No longer will small businesses be exposed to the legal risk of losing their website presence without a trial (an unfortunate reality when doing business on transatlantic clouds).

The natural partner for backend storage infrastructure is LINBIT; as authors and maintainers of DRBD, we are best suited to provide the technical expertise to achieve High Availability.  Also, DRBD Proxy is the obvious choice for off-site or disaster recovery replication (from the office into the cloud).

We at LINBIT look forward to seeing this project grow and prosper!

Answers for DRBD time-travel issues

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at April 18, 2012 04:49 AM

A little update on a DRBD problem we wrote about at the start of April, in which we lost a few months of data during a cluster failover.

Linbit got in touch with us to offer assistance, and we were happy to be enlightened. We had a good idea of what had happened, but no idea why.

It seems that a race condition was introduced in version 8.3.9, when the fence-peer script was changed to run asynchronously. The engineering team explained that if the connection is reestablished while the script runs, it may happen that the peer’s disk-state gets overwritten with stale information.

This was fixed in 8.3.11, and of course we’re running version 8.3.10 on the cluster in question. We’d like to thank Linbit for their assistance and expertise in sorting this out, we’ve already started testing our plans for an upgrade.

The post Answers for DRBD time-travel issues appeared first on Anchor Web Hosting Blog.

Speaking at the Percona Live MySQL Conference and Expo

Posted in Florian's blog by Florian Haas at April 14, 2012 04:09 PM

This week, I had the pleasure of speaking at the Percona Live MySQL Conference & Expo. This was the first year it was not the O’Reilly MySQL Conference & Expo, and also the first time Oracle was not involved in any way. And what can I say, Terry Erisman and his team at Percona have put together an awe-inspiring conference.

Read more…


Speaking at OSCON 2012

Posted in Florian's blog by Florian Haas at April 03, 2012 09:28 AM

I’ll be speaking at OSCON 2012 in Portland, on high availability in OpenStack.

Read more…


Monitoring: better safe than sorry…

Posted in LINBIT Blogs by flip at April 03, 2012 09:07 AM

Stumbling upon the Holy time-travellin’ DRBD, batman! blog post there’s only one thing to be said …

Be strict in what you emit, liberal in what you accept

is simply not true when dealing with mission-critical systems.

It’s ok to be alerted on upgrading a machine because the “old, working” RegEx that did the parsing doesn’t match anymore; it’s not a problem to get an email when someone adds the 100th DRBD resource and causes the grep to fail; and so on.

Better to have a few false positives when you’re actively changing things than to get a false negative that costs you months of data; that’s what an assert (and monitoring isn’t that different) is for, after all.

Keep monitoring strict, and let it fail loudly on unexpected things – after the first few occurrences they’re not unexpected anymore and can be dealt with.

Feature article on Pacemaker in this month’s Linux Journal

Posted in Florian's blog by Florian Haas at April 02, 2012 12:45 PM

I’ve written an article on the Pacemaker stack that’s being featured in this month’s Issue 216 of Linux Journal.

Read more…


Maximum volume size on DRBD

Posted in LINBIT Blogs by flip at April 02, 2012 12:25 PM

From time to time we get asked things like this:

“I want to use a 10TiB volume with DRBD, is that supported?”

The easiest way to answer things like that is to say look for yourself on the public DRBD usage page – the biggest public device size is ~220TiB, so go figure ;)

The current maximum device size is 1EiB (1 exbibyte = 1024 pebibytes, 1 pebibyte = 1024 tebibytes), so there’s a bit of room left.

DRBD needs about 32MiB RAM per TiB storage; so for 1PiB storage you are at 32GiB, and for 1 EiB storage at 32TiB of RAM just for the DRBD bitmap. Having a bit more for the OS, userspace and buffer cache is left as an exercise for the reader.
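The arithmetic is easy to check. The short Python sketch below takes only the 32MiB-per-TiB rule of thumb from the paragraph above; the function and constant names are made up for the example.

```python
BITMAP_MIB_PER_TIB = 32   # rule of thumb quoted above
TIB_PER_PIB = 1024
PIB_PER_EIB = 1024


def bitmap_ram_mib(storage_tib):
    """Approximate RAM needed for the DRBD bitmap, in MiB."""
    return storage_tib * BITMAP_MIB_PER_TIB


one_pib = TIB_PER_PIB                 # 1 PiB expressed in TiB
one_eib = PIB_PER_EIB * TIB_PER_PIB   # 1 EiB expressed in TiB

print(bitmap_ram_mib(10), "MiB for a 10TiB volume")
print(bitmap_ram_mib(one_pib) / 1024, "GiB for 1PiB")
print(bitmap_ram_mib(one_eib) / 1024 / 1024, "TiB for 1EiB")
```

So the 10TiB volume from the question needs roughly 320MiB of RAM for the bitmap.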

If you’ve got questions, ask the DRBD experts at LINBIT – we wrote the code, after all!

Holy time-travellin’ DRBD, batman!

Posted in Anchor Web Hosting Blog » drbd by Barney Desmond at March 31, 2012 04:52 PM

Here at Anchor we’ve developed High-Availability (HA) systems for our customers to ensure they remain online in the event of catastrophic hardware failure. Most of our HA systems involve the use of DRBD, the Distributed Replicated Block Device. DRBD is like RAID-1 across a network.

We’d like to share some notes on a recent issue that involved a DRBD volume jumping into a time-warp and rolling back four months. If you run your own DRBD setup, you’ll want to know about this. The chances that you hit the same problem are slim, but it’s not hard to avoid.


We have a script for Nagios that checks the health of your DRBD volumes; it was basically the go-to default check_drbd script on Nagios Exchange. The script is meant to ensure that both ends are in-sync, and that the connection is up.

The volume in question is the backing store for a virtual machine (VM) guest. One day, after an otherwise-ordinary cluster failover event, it was noticed that the VM’s disks had reverted to a state from last year in November. The monitoring had never tripped, what the heck was going on?

Our sysadmins started digging. Pacemaker generates a lot of logging output, and this was one time it came in useful. The timeline that follows assumes you have some familiarity with DRBD and (ideally) Pacemaker’s cluster management functions:

  1. Everything was working fine
  2. A blip on the cluster caused the active server (server A) to attempt a fence action on the volume on the standby (server B)
  3. The fence action failed for some reason
  4. Server A says “Hmm, okay, whatever”, and stops sending DRBD updates to server B
  5. The DRBD connection remains up and running
  6. Server A’s monitoring script says “I’m the Primary node so I’m up-to-date, and the connection is up: OK”
  7. Server B’s monitoring script says “I’m the Secondary node, my data is ‘consistent’ (not half-synced), and the connection is up: OK”
  8. Everything looks okay and no one is aware that server B’s copy of the data is slipping further and further out of date
  9. Eventually a full-on cluster failover occurs, server B receives the call to action, and goes right ahead as it knows its data is consistent (represents a known point-in-time) but not that it’s very outdated

In short, a corner case in DRBD’s workings got the volume into a bad state that went undetected by the monitoring script. This allowed old data to replace new data during a failover.

When server A attempted the fencing action and failed, it knew something was wrong. It couldn’t tell what, but it didn’t trust it any more, so it stopped sending new data to server B.

Each server knows a little bit about the disk at the other end, thanks to the DRBD connection working just fine. Server A knew its disk was good but noted server B’s disk as “DUnknown” – something dodgy going on. Server B thought its own disk was fine (correct: it hadn’t received the fencing request) and knew server A’s disk was fine (server A is automatically trusted as the Primary node).

Server B’s “DUnknown” state is what Nagios didn’t see, and it should’ve been a warning bell. Server B willingly took over after the failover because everything looked fine, just that server A had been really, really quiet for the last few months. As the new Primary node it promptly pushed its copy of the volume back to server A, steamrolling 4 months of changes in the process.


Our immediate fix for this was to improve the monitoring script. The remote peer’s disk state is now taken into account, and the script was heavily restructured to improve readability and aggregate data in a more structured manner. We’ll be able to push the improvements to Github once we’ve cleaned it up a little further.

EDIT: It’s been published now: https://github.com/anchor/nagios-plugin-drbd
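For readers who want the gist without reading the plugin, here is a minimal Python sketch of the idea. It is not the code from the repository above. The sample line imitates the `cs:`/`ro:`/`ds:` fields of /proc/drbd on DRBD 8.3 and is constructed for illustration, and `check_drbd_line` is a name invented here. The key point is that the peer's disk state is checked as well, and that anything the parser doesn't recognise raises an error instead of passing silently.

```python
import re

STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<ro_local>[^/\s]+)/(?P<ro_peer>\S+)"
    r"\s+ds:(?P<ds_local>[^/\s]+)/(?P<ds_peer>\S+)"
)


def check_drbd_line(line):
    """Return a list of problems found in one DRBD status line."""
    match = STATUS_RE.match(line)
    if match is None:
        # Fail loudly: an unparseable line must never count as healthy.
        raise ValueError(f"unrecognised DRBD status line: {line!r}")
    problems = []
    if match["cs"] != "Connected":
        problems.append(f"connection state is {match['cs']}")
    if match["ds_local"] != "UpToDate":
        problems.append(f"local disk state is {match['ds_local']}")
    if match["ds_peer"] != "UpToDate":
        problems.append(f"peer disk state is {match['ds_peer']}")
    return problems


# Server A's view in the incident: connected, but the peer is DUnknown
sample = " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r----"
print(check_drbd_line(sample))
```

A check that looks only at the connection state and the local disk would report this line as healthy, which is how the stale Secondary went unnoticed.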

We’re also further investigating the fencing actions for DRBD. Building fault-tolerant systems is hard, which is why you employ defense-in-depth strategies – it may be that the fencing actions also need defensive measures.

The post Holy time-travellin’ DRBD, batman! appeared first on Anchor Web Hosting Blog.