Quick links:

LINBIT Blogs: Albireo Virtual Data Optimizer (VDO) on DRBD

LINBIT Blogs: DRBD 9 over RDMA with Micron SSDs

That Cluster Guy: Working with OpenStack Images

That Cluster Guy: Evolving the OpenStack HA Architecture

LINBIT Blogs: Persistent and Replicated Docker Volumes with DRBD9 and DRBD Manage

LINBIT Blogs: RDMA Performance with Real Storage

LINBIT Blogs: RDMA Performance

LINBIT Blogs: Having Fun with the DRBD Manage Control Volume

LINBIT Blogs: Testing SSD Drives with DRBD: SanDisk Optimus Ascend

LINBIT Blogs: Testing SSD Drives with DRBD: Intel DC 3700 Series

LINBIT Blogs: What is RDMA, and why should we care?

That Cluster Guy: Minimum Viable Cluster

That Cluster Guy: Receiving Reliable Notification of Cluster Events

That Cluster Guy: Fencing for Fun and Profit with SBD

LINBIT Blogs: Benchmarking DRBD

That Cluster Guy: Double Failure - Get out of Jail Free? Not so Fast

That Cluster Guy: Life at the Intersection of Pets and Cattle

That Cluster Guy: Adding Managed Compute Nodes to a Highly Available Openstack Control Plane

Arrfab's Blog » Cluster: Provisioning quickly nodes in a SeaMicro chassis with Ansible

Arrfab's Blog » Cluster: Switching from Ethernet to Infiniband for Gluster access (or why we had to …)

That Cluster Guy: Feature Spotlight - Smart Resource Restart from the Command Line

That Cluster Guy: Feature Spotlight - Controllable Resource Discovery

LINBIT Blogs: DRBD and SSD: I was made for loving you

LINBIT Blogs: Root-on-DRBD followup: Pre-production staging servers

That Cluster Guy: Release Candidate: 1.1.12-rc1

LINBIT Blogs: DRBDmanage installation is now easier!

That Cluster Guy: Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

LINBIT Blogs: DRBDManage release 0.10

LINBIT Blogs: DRBD-Manager

That Cluster Guy: Announcing 1.1.11 Beta Testing

Albireo Virtual Data Optimizer (VDO) on DRBD

Posted in LINBIT Blogs by matt at July 21, 2016 05:43 PM

TL;DR: Pairing DRBD with VDO reduces the replication network and storage utilization by ~85% while increasing load by ~0.8.

VDO (Virtual Data Optimizer)[1] is a ready-to-run software package that delivers block-level deduplication, compression, and thin provisioning capabilities to Linux. VDO operates inline at a 4 KB granularity, delivering the best possible balance of performance and data reduction rates.

This sounds like great software to pair with DRBD, right?! We were most interested in adding deduplication to the storage stack above DRBD for more efficient replicated writes. So we ran some tests doing just that, measuring the IO on our backing disks as well as the traffic on our replication network; we did the same on a vanilla DRBD device for comparison.

We decided that deploying and cloning numerous CentOS 7 virtual machines on both of our devices was a good way to test VDO’s deduplication and its effects on replication. We chose this test because VMs of the same distro will have identical blocks where their binaries, libraries, etc (no, not that /etc) are stored. They will also have some blank space at the end of their virtual disks (zeros) that VDO should be able to compress/dedup down to almost nothing. Lastly, and more importantly, replicating VMs’ virtual disks is a real-life use case and not just an academic experiment.

The Setup: First, we set up two DRBD devices and followed the appropriate steps[2] to use them as our LVM physical volumes (VDO requires LVM), created PV and VG signatures for both devices, and created the VDO device on one of our DRBD disks. We then formatted the resulting block devices with XFS filesystems, mounted them, and created CentOS 7 VMs with fully allocated 20GiB qcow2 virtual disks in each mount point. Since we also recorded system load during our testing, I should mention that the test systems each had 16 CPUs (Intel Xeon E7520s).
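
The post doesn’t list the exact commands, but the stack described above would be assembled roughly like this (a sketch; device, volume group, and size names are assumptions, and the VDO syntax shown is that of the vdo manager utility):

# DRBD device becomes an LVM physical volume (VDO requires LVM)
pvcreate /dev/drbd0
vgcreate vg_r0 /dev/drbd0
lvcreate -l 100%FREE -n lv_r0 vg_r0

# On one of the two stacks, layer a VDO volume on top of the logical volume
vdo create --name=vdo0 --device=/dev/vg_r0/lv_r0 --vdoLogicalSize=100G

# XFS on top, mounted, with the VM images stored in the mount point
mkfs.xfs -K /dev/mapper/vdo0
mount /dev/mapper/vdo0 /mnt/vdo-test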

The Testing: We wanted to see how much data would be pushed to disk and onto the replication network when we created a clone of the virtual machines; we used ‘iostat’ to measure the IO on DRBD’s backing disks and ‘iptraf’ to measure the replicated data on DRBD’s replication ports during the cloning. We also recorded the peak load during each iteration of our testing to see how expensive the dedup and compression were on our dataset.
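
The measurements themselves come from standard tools; roughly (the backing device and replication interface names here are assumptions):

# Per-device throughput on DRBD's backing disk, reported in MB once per second
iostat -dm sdb 1

# Traffic counters on the replication interface (iptraf-ng runs interactively)
iptraf-ng -d eth1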

The Results: On a vanilla DRBD device we saw system load climb to 2.62, the backing disk saw 20506MB of writes, and the replication network transferred 21607MB of data. On the DRBD backed VDO device we saw system load climb to 3.45, the backing disk saw 2833MB of writes, and the replication network transferred 3234MB of data. Look at those savings!!  VDO reduced the replication network and storage utilization by about 85% while only increasing load by ~0.8.

This is a small dataset which was intended as a POC, but it isn’t hard to imagine other datasets that could benefit from deduplication. It’s also not hard to see the benefit of deduplication when considering replication over slow networks (WAN) where every byte of bandwidth counts. Pairing VDO with DRBD and DRBD Proxy seems like a win to us!

Check out the official press release: HERE

[1] See http://permabit.com/products-overview/albireo-virtual-data-optimizer-vdo/
[2] See https://www.drbd.org/en/doc/users-guide-84/s-lvm-drbd-as-pv; note that on RHEL/CentOS 7 you also need to disable lvmetad in lvm.conf and systemd.

DRBD 9 over RDMA with Micron SSDs

Posted in LINBIT Blogs by matt at June 22, 2016 06:08 PM

We have been testing out some 240GB Micron M500DC SSDs with DRBD 9 and DRBD’s RDMA transport layer. Micron, based in Boise, Idaho, is a leader in NAND flash production and storage. We found that their M500DC SSDs are write-optimized for data center use cases and in some cases exceeded the expected performance.

For those who are just joining us, leveraging RDMA as a transport protocol is relatively new to DRBD and is only possible with DRBD 9. You can find some background on RDMA and how DRBD benefits from it in one of our past blog posts, “What is RDMA, and why should we care?”. Also, check out our technical guide on benchmarking DRBD 9 on Ultrastar SN150 NVMe SSDs if you are interested in seeing some of the numbers we were able to achieve with DRBD 9.0.1-1 and RDMA on very fast storage.

Back to the matter at hand.

In our test environment we used two 240GB Micron M500DC SSDs in RAID0 in each of our two nodes. We connected the two peers using Infiniband ConnectX-4 10GbE. We then ran a series of tests to compare the performance of DRBD disconnected (not replicating), DRBD connected using TCP over Infiniband, and DRBD connected using RDMA over Infiniband, all against the performance of the backing disks without DRBD.

For testing random read/write IOPs we used fio with 4K blocksize and 16 parallel jobs. For testing sequential writes we used dd with 4M blocks. Both tests used the appropriate flag for direct IO in order to remove any caching that might skew the results.
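
The exact invocations aren’t reproduced here, but they were along these lines (a sketch; the device name, run length, and write size are assumptions):

# Random read/write IOPs: 4K blocks, 16 parallel jobs, direct IO
fio --name=randrw --filename=/dev/drbd0 --rw=randrw --bs=4k \
    --ioengine=libaio --direct=1 --numjobs=16 --group_reporting \
    --time_based --runtime=60

# Sequential writes: 4M blocks, direct IO
dd if=/dev/zero of=/dev/drbd0 bs=4M count=1024 oflag=direct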

We also leveraged DRBD’s “when-congested-remote” read-balancing option to pull reads from the peer if the IO subsystem is congested on the Primary node. We will see that this produces dramatic increases in the performance of our random reads, especially when combined with RDMA’s extremely low latency.
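
Read balancing is enabled in the disk section of the DRBD resource configuration; a minimal sketch (the resource name and remaining options are placeholders):

resource r0 {
  disk {
    read-balancing when-congested-remote;
  }
  # ... devices, nodes, and net options as usual ...
}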

Here are the results from our Random Read/Write IOPs testing:
[Chart and graph: random read/write IOPs results]

As you can see from the numbers and graphs we achieve huge gains in read performance when using DRBD with read-balancing; roughly a 26% increase when using TCP and 62% with RDMA.

We also see that using the RDMA transport protocol results in less than 1% of overhead when synchronously replicating the writes to our DRBD device; that’s pretty sweet. :)

Sequential reads cannot benefit from DRBD’s read-balancing at all, and large sequential writes are going to be heavily segmented by the TCP stack, so our numbers for sequential writes better represent the impact a transport protocol has on synchronous replication.

Here are the results from our Sequential Write testing: [Chart and graph: sequential write results]

Looking at the graph it’s easy to see that RDMA is the transport mode of choice if your IO patterns are sequential. With TCP we see ~19.1% overhead, while RDMA results in ~1.1% overhead.

Working with OpenStack Images

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 20, 2016 02:45 AM

Creating Images

For creating images, I recommend the virt-builder tool that ships with RHEL based distributions and possibly others:

virt-builder centos-7.2 --format qcow2 --install "cloud-init" --selinux-relabel

Note the use of the --selinux-relabel option. If you specify --install but do not include this option, you may end up with instances that treat all attempts to log in as security violations and block them.

The cloud-init package is incredibly useful (discussed later) but isn’t available in CentOS images by default, so I recommend adding it to any image you create.

For the full list of supported targets, try virt-builder -l. Targets should include CirrOS as well as several versions of openSUSE, Fedora, CentOS, Debian, and Ubuntu.

Adding Packages to an existing Image

On RHEL based distributions, the virt-customize tool is available and makes adding a new package to an existing image simple.

virt-customize -v -a myImage --install "wget,ntp" --selinux-relabel

Note once again the use of the --selinux-relabel option. This should only be used for the last step of your customization. As above, not doing so may result in an instance that treats all attempts to log in as security violations and blocks them.

Richard Jones also has a good post about updating RHEL images since they require subscriptions. Just be sure to use --sm-unregister and --selinux-relabel at the very end.

Logging in

If you haven’t already, tell OpenStack about your keypair:

nova keypair-add myKey --pub-key ~/.ssh/id_rsa.pub

Now you can tell your provisioning tool to add it to the instances it creates. For Heat, the template would look like this:

myInstance:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    key_name: myKey

However, almost no image will let you log in as root, via ssh or on the console. Instead they normally create a new user that has full sudo access. Red Hat images default to cloud-user, while CentOS has a centos user.

If you don’t already know which user your instance has, you can use nova console-log myServer to see what happens at boot time.

Assuming you configured a key to add to the instance, you might see a line such as:

ci-info: ++++++Authorized keys from /home/cloud-user/.ssh/authorized_keys for user cloud-user+++++++

which tells you which user your image supports.
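
With the user name known and the keypair injected, logging in is a plain ssh call (the address below is a placeholder for your instance’s floating IP):

ssh -i ~/.ssh/id_rsa cloud-user@192.0.2.10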

Customizing an Instance at Boot Time

This section relies heavily on the cloud-init package. If it is not present in your images, be sure to add it using the techniques above before trying anything below.

Running Scripts

Running scripts on the instances once they’re up can be a useful way to customize your images, start services, and generally work around bugs in officially provided images.

The list of commands to run is specified as part of the user_data section of a Heat template or can be passed to nova boot with the --user-data option:

myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      #!/bin/sh -ex

      # Fix broken qemu/strstr()
      # https://bugzilla.redhat.com/show_bug.cgi?id=1269529#c9
      touch /etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned

Note the extra options passed to /bin/sh. The -e option tells the script to terminate if any command produces an error, and -x tells the shell to log everything that is being executed. This is particularly useful as it makes the script’s execution visible in the console’s log (nova console-log myServer).

When Scripts Take a Really Long Time

If we have scripts that take a really long time, we may want to delay the creation of subsequent resources until our instance is fully configured.

If we are using Heat, we can set this up by creating SwiftSignal and SwiftSignalHandle resources to coordinate resource creation with notifications/signals that could be coming from sources external or internal to the stack.

signal_handle:
  type: OS::Heat::SwiftSignalHandle

wait_on_server:
  type: OS::Heat::SwiftSignal
  properties:
    handle: {get_resource: signal_handle}
    count: 1
    timeout: 2000

We then add a layer of indirection to the user_data: portion of the instance definition, using the str_replace: function to replace all occurrences of “wc_notify” in the script with an appropriate curl PUT request built from the “curl_cli” attribute of the SwiftSignalHandle resource.

myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      str_replace:
        params:
          wc_notify:   { get_attr: ['signal_handle', 'curl_cli'] }
        template: |
          #!/bin/sh -ex

          my_command_that --takes-a-really long-time

          wc_notify --data-binary '{"status": "SUCCESS", "data": "Script execution succeeded"}'

Now the creation of myNode will only be considered successful if and when the script completes.

Installing Packages

One should avoid the temptation to hardcode calls to a specific package manager in a script, as it limits the usefulness of your template. Instead, this can be done in a platform-agnostic way using the packages directive.

Note that instance creation will not fail if packages fail to install or are already present. Check for any required binaries or files as part of the script.

user_data_format: RAW
user_data:
  #cloud-config
  # See http://cloudinit.readthedocs.io/en/latest/topics/examples.html
  packages:
    - ntp
    - wget

Note that this will NOT work for images that need a Red Hat subscription. There is supposed to be a way to have the instance register itself; however, I’ve had no success with this method, so instead I recommend creating a new image that has any packages listed here pre-installed.

Installing Packages and Running scripts

The first line of the user_data: section (#cloud-config or #!/bin/sh) is used to determine how it should be interpreted. So if we wish to take advantage of both scripting and cloud-init, we must combine the two pieces into a multi-part MIME message.

The cloud-init docs include a MIME helper script to assist in the creation of complex user_data: blocks.

Simply create a file for each section and invoke with a command line similar to:

python ./mime.py cloud.config:text/cloud-config cloud.sh:text/x-shellscript

The resulting output can then be pasted in as a template and even edited in-place later. Here is an example that includes notification for a long running process:

user_data_format: RAW
user_data:
  str_replace:
    params:
      wc_notify:   { get_attr: ['signal_handle', 'curl_cli'] }
    template: |
      Content-Type: multipart/mixed; boundary="===============3343034662225461311=="
      MIME-Version: 1.0
      
      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/cloud-config; charset="us-ascii"
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename="cloud.config"

      #cloud-config
      packages:
        - ntp
        - wget

      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/x-shellscript; charset="us-ascii"
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename="cloud.sh"
      
      #!/bin/sh -ex

      my_command_that --takes-a-really long-time

      wc_notify --data-binary '{"status": "SUCCESS", "data": "Script execution succeeded"}'

Evolving the OpenStack HA Architecture

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 07, 2016 04:06 AM

In the current OpenStack HA architecture used by Red Hat, SuSE and others, Systemd is the entity in charge of starting and stopping most OpenStack services. Pacemaker exists as a layer on top, signalling when this should happen, but Systemd is the part making it happen.

This is a valuable contribution for active/passive (A/P) services and those that require all their dependencies be available during their startup and shutdown sequences. However, as OpenStack matures, more and more components are able to operate in an unconstrained active/active capacity with little regard for the startup/shutdown order of their peers or dependencies - making them well suited to being managed by Systemd.

For this reason, a future revision of the HA architecture should limit Pacemaker’s involvement to core services like Galera and Rabbit as well as the few remaining OpenStack services that run A/P.

This would be particularly useful as we look towards a containerised future. It allows OpenStack to play nicely with the current generation of container managers, which lack orchestration, and it reduces recovery time and downtime by allowing for maximum parallelism.

Divesting most OpenStack services from the cluster also removes Pacemaker as a potential obstacle to moving them to WSGI. It is as yet unclear whether services will live under a single Apache instance or many, and the former would conflict with Pacemaker’s model of starting, stopping and monitoring services as individual components.

Objection 1 - Pacemaker as an Alerting Mechanism

Using Pacemaker as an alerting mechanism for a large software stack is of limited value. Of course Pacemaker needs to know when a service dies, but it necessarily takes action straight away rather than waiting to see whether there will be other failures with which it can correlate a root cause.

In large complex software stacks, the recovery and alerting components should not be the same thing because they do and should operate on different timescales.

Pacemaker also has no way to include the context of a failure in an alert and thus no way to report the difference between Nova failing and Nova failing because Keystone is dead. Indeed Keystone being the root cause could be easily lost in a deluge of notifications about the failure of services that depend on it.

For this reason, as the number of services and dependencies grows, Pacemaker makes a poor substitute for a well configured monitoring and alerting system (such as Nagios or Sensu) that can also integrate hardware and network metrics.

Objection 2 - Pacemaker has better Monitoring

Pacemaker’s native ability to monitor services is more flexible than Systemd’s which relies on a “PID up == service healthy” mode of thinking 1.

However, just as Systemd is the entity performing the startup and shutdown of most OpenStack services, it is also the one performing the actual service health checks.

To actually take advantage of Pacemaker’s monitoring capabilities, you would need to write Open Cluster Framework (OCF) agents 2 for every OpenStack service. While this would not take a rocket scientist to achieve, it is an opportunity for the way services are started in a clustered and non-clustered environment to diverge.
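
For context, an OCF agent is little more than a script implementing a handful of actions; a bare-bones sketch (hypothetical service name and health-check URL, with meta-data and validation omitted) might look like this:

#!/bin/sh
# Bare-bones OCF-style agent sketch for a hypothetical "my-api" service.
# A real agent also needs meta-data, validate-all and proper parameter handling.
. ${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat/ocf-shellfuncs

case "$1" in
    start)   systemctl start my-api || exit $OCF_ERR_GENERIC ;;
    stop)    systemctl stop my-api  || exit $OCF_ERR_GENERIC ;;
    monitor)
        # Check that the API actually answers, not just that a PID exists
        curl -sf http://localhost:8080/healthcheck >/dev/null || exit $OCF_NOT_RUNNING
        ;;
    *)       exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $OCF_SUCCESS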

So while it may feel good to look at a cluster and see that Pacemaker is configured to check the health of a service every N seconds, all that really achieves is to sync Pacemaker’s understanding of the service with what Systemd already knew. In practice, on average, this ends up delaying recovery by N/2 seconds instead of making it faster.

Bonus Round - Active/Passive FTW

Some people have the impression that A/P is a better or simpler mode of operation for services and in this way justify the continued use of Pacemaker to manage OpenStack services.

Support for A/P configurations is important; it allows us to make applications that are in no way cluster-aware more available by reducing the requirements on the application to almost zero.

However, the downside is slower recovery as the service must be bootstrapped on the passive node, which implies increased downtime. So at the point the service becomes smart enough to run in an unconstrained A/A configuration, you are better off to do so - with or without a cluster manager.

  1. Watchdog-like functionality is only a variation on this; it only tells you that the thread responsible for heartbeating to Systemd is alive and well - not whether the APIs it exposes are functioning.

  2. Think SYS-V init scripts with some extra capabilities and requirements particular to clustered/automated environments. It’s a standard historically supported by the Linux Foundation but hasn’t caught on much since it was created in the late ’90s.

Persistent and Replicated Docker Volumes with DRBD9 and DRBD Manage

Posted in LINBIT Blogs by Roland Kammerer at June 03, 2016 08:56 AM

Nowadays, Docker has support for plugins; for LINBIT, volume plugins are certainly the most interesting feature. Volume plugins open the way for storing content residing in usual Docker volumes on DRBD backed storage.

In this blog post we show a simple example of using our new Docker volume plugin to create a WordPress-powered blog with a MariaDB database, where both the content of the blog and the database are replicated between two cluster nodes.

The advantage of this setup is that there are multiple copies of your important data and that switching hosts that run the blog service is a breeze because other nodes in the cluster have instant access to the replicated data.

For the rest of this blog entry we assume two nodes with recent versions of Docker, DRBD9, and DRBD Manage. For the sake of simplicity we use two nodes based on Ubuntu Xenial, called alpha and bravo.

The first step is to install the required software on both nodes:

# sudo su -
$ add-apt-repository ppa:linbit/linbit-drbd9-stack
$ apt update
$ apt install -y docker.io docker-compose
$ apt install -y drbd-dkms drbd-utils \
  python-drbdmanage drbdmanage-docker-volume

Before we continue, let’s check if docker works:

docker run -it --rm alpine:latest /bin/echo It works

Setting up a drbdmanage cluster is documented here; we assume that alpha and bravo are already added to the cluster and that creating resources on a DRBD Manage level works as expected. This then looks as follows:

[Screenshot: drbdmanage node list]

The Docker plugin for DRBD Manage is not enabled by default, so let’s do that on both nodes:

$ systemctl enable docker-drbdmanage-plugin.socket
$ systemctl start docker-drbdmanage-plugin.socket

Let’s create our first docker volume that is backed by DRBD:

$ docker volume create -d drbdmanage --name=first --opt size=20
$ docker volume ls
DRIVER              VOLUME NAME
drbdmanage          first
$ docker volume rm first

For our blog service we create two Docker volumes: one for the WordPress content and one for the database. In this case we choose xfs as the file system and make the volumes 300MB in size. The volume driver has some other interesting options, like specifying additional file system options or a replica count for the number of redundant copies in the cluster (man drbdmanage-docker-volume). The MariaDB container is a bit picky about the size, so don’t make that volume too small, or the container will not start:

$ docker volume create -d drbdmanage \
    --name=bloghtml --opt fs=xfs --opt size=300
$ docker volume create -d drbdmanage \
    --name=blogdb   --opt fs=xfs --opt size=300
$ docker volume ls
DRIVER              VOLUME NAME
drbdmanage          blogdb
drbdmanage          bloghtml

After that, we create a yaml configuration for our blog service on both nodes:

$ mkdir ~/ha-blog && cd ~/ha-blog

Create a file with the name docker-compose.yml and the following content:

wordpress:
  image: wordpress
  links:
    - db:mysql
  ports:
    - 8080:80
  volume_driver: drbdmanage
  volumes:
    - bloghtml:/var/www/html

db:
  image: mariadb
  environment:
    MYSQL_ROOT_PASSWORD: mysecretpwd
  volume_driver: drbdmanage
  volumes:
    - blogdb:/var/lib/mysql

And now it is time to start our blog. For this demonstration we will just start the blog interactively; in a real-world scenario one would write a systemd service file that automates the following manual steps (a sketch of such a unit is shown at the end of this post).

docker-compose up

Wait until the two containers are started up:

[Screenshot: docker-compose output showing both containers running]

Open a browser and connect to http://alpha:8080. Do the initial WordPress setup and modify the initial post.

[Screenshot: the WordPress blog running on alpha]

When you are satisfied, we can now migrate our blog from node alpha to node bravo.

First press ctrl-c in alpha‘s terminal where docker-compose is still running. Being nice Docker citizens, we remove the now unused containers:

docker-compose rm

Now we switch to bravo and execute:

$ cd ~/ha-blog
$ docker-compose up

Now point your browser to bravo‘s IP on port 8080 and you will see the content you created on alpha before the migration.

[Screenshot: the same blog content served from bravo]

Please note that this demonstrates the capabilities of our Docker volume plugin for DRBD Manage, but is in no way a Tech-Guide on how to run a highly available WordPress blog. This would require further configuration, like floating IPs and/or DNS updates, and a cluster manager that starts/stops/monitors the required containers.

Still – for five minutes’ work, a nice result!
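
As mentioned earlier, a production setup would wrap the manual docker-compose invocation in a systemd unit; a minimal sketch (paths are assumptions) could look like this:

[Unit]
Description=HA blog via docker-compose
Requires=docker.service
After=docker.service

[Service]
WorkingDirectory=/root/ha-blog
ExecStart=/usr/bin/docker-compose up
ExecStop=/usr/bin/docker-compose stop
Restart=on-failure

[Install]
WantedBy=multi-user.target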

RDMA Performance with Real Storage

Posted in LINBIT Blogs by flip at April 19, 2016 07:47 AM

As an update to the previous post, we now have the Tech Guide for RDMA performance with non-volatile storage available online.

Just head over to the LINBIT Tech Guide area and read the HGST Ultrastar SN150 NVMe performance report! (Free registration required.)

RDMA Performance

Posted in LINBIT Blogs by flip at April 06, 2016 07:18 AM

As promised in the previous RDMA post, we gathered some performance data for the RDMA transport. Read and enjoy!

Basic hardware information:

  • Two IBM 8247-22L’s (Power8, 2 sockets * 10 CPUs, hyperthreading turned off)
  • 128GiByte RAM
  • ConnectX4 Infiniband, two connections with 100Gbit each
  • The DRBD TCP connection was run across one “bnx2x” 10Gbit adapter pair (ie. one in each server, no bonding)
  • dm-zero as backing storage, as we don’t have fast enough real storage available; directly on hardware there was no IO scheduler, and within the VM we switched to “noop”.
    NOTE: if you’d like to see performance data with real, persistent storage being used, check out our newest Tech Guide – “DRBD9 on an Ultrastar SN150 NVMe SSD“.

Software we used:

  • Ubuntu Xenial (not officially released at the time of testing)
  • Linux Kernel version 4.4.0-15-generic (ppc64el)
  • DRBD 9.0.1-1 (ded61af75823)
  • DRBD RDMA transport 2.0.0
  • fio version 2.2.10-1

Our underlying block devices were built to have some “persistent” (ha!) space at the beginning and the end, to keep the DRBD and filesystem superblocks; the rest in the middle was mapped to dm-zero:

zero-block-1: 0 8192 linear 1:1 0
zero-block-1: 8192 2147483648 zero
zero-block-1: 2147491840 8192 linear 1:1 8192
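
The table above is standard device-mapper syntax; such a device can be created by feeding it to dmsetup (the table file name is assumed):

dmsetup create zero-block-1 < zero-block-1.table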

Due to the large number of variables and potential test cases, we restricted fio’s run-time to 10 seconds (as that should be good enough for statistical purposes, see below).

The graphics only show data for a single thread (but for multiple IO-depths), for ease of reading.

For the performance point – here is fio directly on the hardware (i.e. without a virtualization layer in between).

DRBD9, connected via RDMA over 100Gbit IB, writing to dm-zero

This graphic shows a few points that should be highlighted:

  • For small block sizes (4kiB & 16kiB), the single-threaded/io-depth=1 performance is about 10k IOPs (that, times 10 seconds, amounts to 100,000 measurements – a nice statistical base, I believe); with io-depth=2 it’s 20k IOPs, and when io-depth is 8 or higher, we reach top performance of ~48k IOPs.
  • For large block sizes, the best bandwidth result is a tad above 11GiB/sec (sic!)
  • Last, but not least, the best latency was below 40µsec! For two threads, io-depth=2, 4KiB block size we had this result:
      lat (usec): min=39, max=4038, avg=97.44, stdev=36.22

 

As a small aside, here’s the same setup, but using TCP instead of RDMA; we kept the same scale to make comparison easier.

DRBD9, connected via TCP on 10Gbit, writing to dm-zero

As you can see, copying data around isn’t that efficient – TCP is clearly slower, topping out at 1.1GiB/sec on this hardware. (But I have to admit, apart from tcp_rmem and tcp_wmem I didn’t do any tuning here either).

Now, we move on to results from within a VM; let’s start with reading.

The VM sees the DRBD device as /dev/sdb; the scheduler was set to “noop” to not interfere with the read IOs.

Reading in a VM, DRBD handled in Hypervisor

Here we get quite nice results, too:

  • 3.2GiB/sec, within the VM, should be “good enough” for most purposes, right?
  • ~20k IOPs at some io-depths, and even 3.5k IOPs with sequential IO is still better than using hard disks on bare hardware.

Our next milestone is writing…

Write requests have additional constraints (compared to reading) – every single write request done in the VM has to be replicated (and confirmed) in DRBD in the Hypervisor before the okay is relayed to the VM’s application.

Writing from VM, DRBD in Hypervisor

The most visible difference is the bandwidth – it tops out at ~1.1GiB/sec.

Now, these bandwidths were measured in a hyperconverged setup – the host running the VM has a copy of the data available locally. As that might not always be the case, I detached this LV, and tested again.


So, if the hypervisor does not have local storage (but always has to ask some other node), we get these pictures:

Reading within a VM, remote storage only

Writing from a VM, remote storage only

As we can see, the results are mostly the same – apart from a bit of noise, the limiting factor here is the virtualization bottleneck, not the storage transport.

The only thing left now is a summary and conclusion…

  • We lack the storage speed in our test setup:
    Even now, without multi-queue capable DRBD, we can already utilize the full 100Gbit Infiniband RDMA bandwidth, and further performance optimizations will only move the parallelism and block sizes needed to reach line speed down to more common values.
  • VM performance is probably acceptable already
    If you need performance above the now available range (3.2GiB/sec reading, 1.1GiB/sec writing), you’ll want to put your workload on hardware anyway.
  • Might get much faster still by using DRBD within the VM, removing the virtualization delay.
    As the 4.4 kernel used does not yet support SR-IOV for the ConnectX-4 cards, we couldn’t test that yet (support for SR-IOV should be in the 4.5 series, though…). In theory this should give approximately the same speed in the VM as on hardware, as the OS running in the VM should be able to read and write data directly to/from the remote storage nodes…

I guess we’ll need to do another follow-up in this series later on … 😉

Questions? Contact sales@linbit.com!


 

Having Fun with the DRBD Manage Control Volume

Posted in LINBIT Blogs by Roland Kammerer at April 01, 2016 06:26 AM

As you might know, DRBD Manage is a tool that is used in the DRBD9 stack to manage (create, remove, snapshot) DRBD resources in a multi-node DRBD cluster. DRBD Manage stores the cluster information in the so-called control volume. The control volume is a DRBD9 resource itself, which is replicated across the whole cluster. This means that the control volume itself is just a block device, like all the regular DRBD resources.

The control volume is just a regular DRBD block device

In this case the control volume contains the cluster information for 4 nodes with 3 resources. Usually, the user shows this information with the corresponding drbdmanage commands.

Status information of drbdmanage

The cluster information is stored at known offsets in the control volume, which gives some space to sneak in some additional information. Let’s see what else is hidden in the control volume.

UUID in the control volume

The control volume contains a magic value, which is used by the blkid command, and a UUID that is generated at initialization time.

Oh, and it contains some nice ASCII Art if you show the first 10 lines:

The output of the head command

But that is not all; that part fits nicely into the first 512 bytes:

The output of the dd command
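
For readers following along, the screenshots above correspond roughly to commands like these (assuming the control volume is accessible as /dev/drbd0 on your cluster):

blkid /dev/drbd0
head -n 10 /dev/drbd0
dd if=/dev/drbd0 bs=512 count=1 | hexdump -C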

Maybe it has some extra magic powers? Let’s try and execute it with Perl:

The control volume can print itself

See? It can print its own content 😉

Testing SSD Drives with DRBD: SanDisk Optimus Ascend

Posted in LINBIT Blogs by brian at March 30, 2016 08:20 PM

This week we continue our SSD testing series with the SanDisk Optimus Ascend 2.5″ 800GB SAS drives.

Background:
SanDisk Corporation designs, develops and manufactures flash memory storage solutions. LINBIT is known for developing DRBD (Distributed Replicated Block Device), the backbone of Linux High Availability software. LINBIT tested how quickly data can be synchronously replicated from a SanDisk 800 GB SSD in server A to an identical SSD located in server B. Disaster Recovery replication was also investigated, using the same hardware to an off-site server.

For those who are unfamiliar with the “shared nothing” High Availability approach to block level synchronous data replication: DRBD uses two (2) separate servers so that if one (1) fails, the other takes over. Synchronous replication is completely transaction safe and is used for 100% data protection purposes. DRBD has been available as part of the mainline Linux kernel since version 2.6.33.

This post reviews DRBD in an active/passive configuration using synchronous replication (DRBD’s Protocol C). Server A is active and server B is passive. Due to DRBD’s positioning in the Linux kernel (just above the disk scheduler), DRBD is application agnostic. It can work with any filesystem, database, or application that writes data to disk on Linux.
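
For reference, a two-node resource of the kind tested here needs only a few lines of DRBD configuration; a sketch (host names, devices, and addresses are placeholders):

resource r0 {
    protocol C;               # synchronous replication
    device    /dev/drbd0;
    disk      /dev/sdb1;      # the SSD (or SSD RAID set) being replicated
    meta-disk internal;

    on server-a { address 10.0.0.1:7789; }
    on server-b { address 10.0.0.2:7789; }
}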

High Availability Testing: Sequential Read/Writes
Objective: Determine the performance implications of synchronous replication when using high performance SanDisk SSD drives.

In the initial test, LINBIT used a 10GbE connection between servers. The Ethernet Connection’s latency became the bottleneck when replicating data. We replaced the 10GbE with Dolphin Interconnects cards – removing the latency constraint.

Each test was run 5 times; the averages are displayed below:

[Chart and graph: sequential read/write results]

As you can see from the above graph, the overhead introduced by using DRBD synchronous replication was only 2.42%.

With an ext4 filesystem mounted on top of DRBD, writing 1GiB of data incurs a 1.41% performance hit. Even when writing a larger 10GiB file, the DRBD replication software never introduced more than an average 2.16% overhead.

Random Read/Write Tests:
After finding the theoretical maximum speeds of DRBD replication with SanDisk Optimus Ascend™ 800GB SSDs, LINBIT dug deeper by using random read and write assessments. These random read/writes simulate how many applications and databases work in a production environment. The purpose of random read/write tests is to provide a realistic example of what users will experience when they add a load to their systems.

Naturally, the disks will slow down when separating the reads and writes.

[Charts: random read/write throughput and IOPS]

On average, DRBD introduced a 1.02% overhead as compared to using a single disk without DRBD. In many of LINBIT’s random read/write tests, the disks performed faster with DRBD installed than without it. DRBD achieves this by allowing us to tune, or completely disable, write barriers and flushing at the block device level; this is considered safe as long as the user’s RAID controller has a healthy battery-backed write cache.
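
That tuning lives in the disk section of the resource configuration shown earlier; a sketch (again, only disable these with a battery- or flash-backed write cache):

    disk {
        disk-barrier no;
        disk-flushes no;
        md-flushes   no;
    }
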
Conclusion:
Shared Nothing High Availability and Disaster Recovery replication architectures, with the help of fast SSD storage, can add outstanding resiliency to IT systems with minimal performance implications.

LINBIT found that when synchronously replicating data they can achieve write speeds near the advertised speeds of using a single SanDisk 800GB SSD using sequential read/writes. While using random read/writes, DRBD will also have very little impact on SSD performance as compared to using a single drive. Users can guarantee 100% data protection without sacrificing performance using this Open Source Software Solution. Users simply need two separate systems, DRBD data replication software, and high performance storage in the form of SanDisk Optimus Ascend SSDs.

Our next installment will cover Micron’s M500DC SSD drives.  Until then, happy replicating!

Testing SSD Drives with DRBD: Intel DC 3700 Series

Posted in LINBIT Blogs by brian at February 23, 2016 07:07 PM

Over the next few weeks we’ll be posting results from tests that we’ve run against various manufacturers’ SSD drives, including Intel, SanDisk, and Micron, to name a few.

The first post in this series goes over our findings on the Intel DC S3700 Series 800GB SATA SSD drives.

Background:
Intel Corporation designs, manufactures, and sells integrated digital technology platforms worldwide. The company produces SSDs as well as the NAND flash memory products used inside them.

For those who are unfamiliar with the “shared nothing” High Availability approach to block level synchronous data replication: DRBD uses two (2) separate servers so that if one (1) fails, the other takes over. Synchronous replication is completely transaction safe and is used for 100% data protection purposes. DRBD has been available as part of the mainline Linux kernel since version 2.6.33.

This post reviews DRBD in an active/passive configuration using synchronous replication (DRBD’s Protocol C). Server A is active and server B is passive. Due to DRBD’s positioning in the Linux kernel (just above the disk scheduler), DRBD is application agnostic. It can work with any filesystem, database, or application that writes data to disk on Linux.

High Availability Testing: Sequential Read/Writes
Objective: Determine the performance implications of synchronous replication when using high performance Intel SSD drives.

In the initial test, LINBIT used a 10GbE connection between servers. The Ethernet Connection’s latency became the bottleneck when replicating data. We replaced the 10GbE with Dolphin Interconnects cards – removing the latency constraint.

Each test was run 5 times; the averages are displayed below:

[Chart 1.0: sequential read/write results]

The advertised Intel drive speeds are: read 500MB/s, write 460MB/s. As you can see from the above table, installing DRBD introduced negligible write overhead. Mounting an EXT4 filesystem on top of DRBD only incurs a 1.98% performance hit.

Running DRBD, the SSDs work above the advertised speed of the drive. In each write scenario, using high performance Intel SSD drives with DRBD performed either near or above advertised speeds for all sequential read/write tests. 0.5-2% overhead is a small price to pay for 100% guaranteed data integrity.

The data above in Chart 1.0, graphically represented:

[Graph: sequential read/write results]

High Availability Testing: Random Read/Write tests
Objective: Mimic production scenarios by using random reads and writes to determine the performance implications of synchronous replication.

Here we dig deeper, after finding the theoretical maximum speeds of DRBD replication with Intel DC S3700 800GB SSDs, by using random read and write assessments. These random reads and writes simulate how many applications and databases work in a production environment.

[Table: random read/write results]

The data demonstrates that in this type of environment, enacting DRBD for local data replication with Intel hardware will have a minimal impact on overall performance as compared to running a single SSD, and can even have positive implications.

As of DRBD 8.4, the DRBD software has the ability to do read balancing, used to increase the read performance of the DRBD device. As you can see, the read performance surpasses that of a single Intel SSD by up to 63.9%. This feature enables you to make use of the idle server, which would otherwise just be sitting there waiting for a failover.

We saw 11367 IOPS when writing to the SSD with the EXT4 filesystem without DRBD installed and 11480 IOPS when replicating writes with DRBD. This represents a slight performance enhancement when using DRBD and synchronously replicating data. The performance improvements are even bigger for reads.

[Graph: random read/write IOPS comparison]

Increased performance when using DRBD is counterintuitive. There is natural overhead when synchronously replicating data, so why are the disks performing faster? DRBD is carefully optimized for performance. This involves flushing kernel-internal request queues where it makes sense from DRBD’s point of view. This can lead to the effect that a certain test pattern gets executed faster with DRBD than without it.

In random read/write mode, it is safe to say that using these technologies together will enhance service availability with minimal performance implications.

Stay tuned next week for our findings on SanDisk’s Optimus Ascend™ 2.5″ 800GB SAS SSD drives.

Authored by : Greg Eckert, Matt Kereczman, Devin Vance

 

What is RDMA, and why should we care?

Posted in LINBIT Blogs by flip at December 21, 2015 09:31 PM

DRBD9 has a new transport abstraction layer and it is designed for speed; apart from SSOCKS and TCP, the next-generation link will be RDMA.

So, what is RDMA, and how is it different from TCP?


The TCP transport is a streaming protocol, which for nearly all Linux setups means that the Linux kernel takes care to deliver the messages in order and without losing any data. To send these messages, the TCP transport has to copy the supplied data into some buffers, which takes a bit of time. Yes, zero-copy send solutions exist, but on the receiving side the fragments have to be accumulated, sorted, and merged into buffers so that the storage (hard disks or SSDs) can do its DMA from contiguous 4KiB pages.
These internal copy functions, moving data into and out of buffers, are one of the major bottlenecks for network IO; you can start to see the performance degradation in the 10Gbit/sec range, and it continues to severely limit performance from there on up. All these copies also cause higher latency, affecting that all-important IOPS number. We talk about this in our user guide: Latency vs. IOPs.


In contrast to that, RDMA gives network hardware the ability to directly move data from RAM in one machine to RAM in another, without involving the CPU (apart from specifying what should be transferred). It comes in various forms and implementations (Infiniband, iWarp, RoCE) and with different on-wire protocols (some use IP, can therefore be routed, and so could be seen as “just” an advanced offload engine).

The common and important point is that the sender and receiver do not have to bother with splitting the data up (into MTU-sized chunks) or joining it back together (to get a single, aligned, 4KiB page that can be transmitted to storage, for example) – they just specify “here are 16 pages of 4kiB, please store data coming from this channel into these next time” and “please push those 32KiB across this channel“. This means real zero-copy send and receive, and much lower latency.

Another interesting fact is that some hardware allows splitting the physical device into multiple virtual ones; this feature is called SR-IOV, and it means that a VM can push memory pages directly to another machine, without involving the hypervisor OS or copying data around. Needless to say that this should improve performance quite a bit, as compared to cutting data into pieces and moving them through the hypervisor… 😉


Since we started on the transport layer abstraction in 7d7a29ae8 quite some effort was spent in that area; currently we’re doing a few benchmarks, and we’re about to publish performance results in the upcoming weeks – so stay tuned!

Spoiler alert: we’re going to use RAM-disks as “storage”, because we don’t have any fast-enough storage media available…


Minimum Viable Cluster

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at October 07, 2015 05:51 AM

In the past there was a clear distinction between high performance (HP) clustering and high availability (HA) clustering, however the lines have been blurring for some time. People have scaled HA clusters upwards and HP-inspired clusters have been used to provide availability through redundancy.

The trend in providing availability of late has been towards the HP model - pools of anonymous and stateless workers that can be replaced at will. A really attractive idea, but in order to pull it off they have to make assumptions that may or may not be compatible with some people’s workloads.

Assumptions are neither wrong nor bad, you just need to make sure they are compatible with your environment.

So when looking for an availability solution, keep Occam’s razor in mind but don’t be a slave to it. Look for the simplest architecture, but then work upwards until you find one that meets the needs of your actual (not ideal) application or stack.

Starting Simple

Application HA is the simplest kind of cluster you can deploy because the cluster and the application are the same thing. It takes care of talking to its peers, checking to see if they’re still online, deciding if it should remain operational (because too many peers were lost) and synchronising any data between itself and peers.

This gives you basic fault tolerance, when a node fails there are other copies with sufficient state to take up the workload.

Galera and RabbitMQ (with replicated queues) are two popular examples in this category.

However when I said Application HA was the simplest, that’s only from an admin’s point of view, because the application is doing everything.

Some issues the creators of these kinds of applications think about ahead of time:

  • Can I assume a node that I can’t see is offline?
  • What to do when some of the nodes cannot see other ones? (quorum)
  • What to do when half the nodes cannot see the other half? (split-brain)
  • Does it matter if the application is still active on nodes we cannot see? (data integrity)
  • Is there state that needs to synchronised? (replication)
  • If so, how to do so reliably and in the presence of past and future failures? (reconciliation)

So if you’re looking to create a custom application with similar properties, make sure you can fund the development team you will need to make it happen.

And remember that the reality of those simplifying assumptions will only be apparent after everything else has already hit the fan.

But let’s assume the best case here… if all you need is one of these existing applications, great! Install, configure, done. Right?

Maybe. It might depend on your hardware budget.

Unfortunately (or perhaps not) most companies aren’t Google, Twitter or Bookface. Most companies do not have thousands of nodes in their cluster, in fact getting many of them to have more than two can be a struggle.

In such environments the overhead of having 1?, 2?, 10?!? spare nodes (just in case of a failure - which will surely never happen) starts to represent a significant portion of their balance sheet.

As the number of spare nodes goes down, so does the number of failures that the system can absorb. It is irrelevant if a failure leaves two (or twenty) functional nodes if the load generated by the clients exceeds the application’s ability to keep up.

An overloaded system leads to operation timeouts which generates even more load and more timeouts. The surviving nodes aren’t really functional at that point either.

If the services lived in a widget of some kind (perhaps Docker containers or KVM images), we could have a higher level entity that would make new copies for us. Problem solved right?

Maybe. Is your application truly stateless?

Some are; Memcache is one that comes to mind because it’s a cache, neither creating nor consuming anything. However even web servers seem to want session state these days, so chances are your application isn’t stateless either.

Stateless is hard.

Where do new instances recover their state from? Who are its peers? A static list isn’t going to be possible if the widgets are anonymous cattle. Do you need a discovery protocol in your application?

There may also be a penalty for bringing up a brand new instance. For example, the sync time for a new Galera instance is a function of the dataset size and network bandwidth. That can easily run into the tens-of-minutes range.

So there is an incentive to either stop modelling everything as cattle or to keep the state somewhere else.

Ok, so lets put all the state in a database. Problem solved right?

Maybe. How smart is your widget manager?

You could create a single widget with both the application and the database. That would allow you to use systemd to achieve Node HA - the automated and deterministic management of a set of resources within a single node.

In some ways, systemd looks a lot like a cluster manager. It knows about sets of services, it knows relationships between them (so that the database is started before the application) and it knows how to recover the stack if something fails.

Unfortunately you’re out of luck if a failure on node A requires recovery (of the same or a different service) on nodeB because the caveat is that all these relationships must exist within a single node.

This of course is not the container model - which likes to have each service in its own widget. More importantly, you always need to pay the database synchronisation cost for every failure, which is not ideal.

Alternatively, if your application isn’t active-active, you don’t even get the option of combining them into a single flavour of widget.

By splitting them up into two however, the synchronisation cost is only payable when a database widget dies. This improves your recovery time and makes the widget purists happy, but now you need to make sure the application doesn’t start until the database is running, synchronised, and available to take requests.

About now you might be tempted to think about putting retry loops in the application instead.

Chances are however, there is another service that is a client of the application (and there is a client of the client of the …).

Every time you build in another level of retry loops, you increase your failure detection time and ultimately your downtime.

Hence the question: How smart is your widget manager?

  • It needs to ensure there are at least N copies of a widget active.
  • It might need to ensure there are less than M copies available.
  • It might need to ensure the application starts after the database.
  • It might need to be able to stop the application if not enough copies of the database are around and/or writable. Perhaps it got corrupted? Perhaps someone needs to do maintenance?

Lets assume the widget manager can do these things. Most can, that means we’re done right?

Maybe. What happens if the widget manager cannot see one of its peers?

Just because the widget manager cannot see one of its peers with a bunch of application widgets, does not mean they’re not happily swallowing client requests they can never process and/or writing to the data store via some other subnet.

If this does not apply to your application, consider yourself blessed.

For the rest of us, in order to preserve data integrity, we need the widget manager to take steps to ensure that the peer it can no longer see does not have any active widgets.

This is one reason why systemd is rarely sufficient on its own.

Hint: A great way to do this is to power off the host

Are you done yet?

One thing we skipped over is where the database itself is storing its state.

If you were using bare metal, you could store it there - but that’s old-fashioned. Storing it in the KVM image or docker container isn’t a good idea; you’d lose everything if the last container ever died.

Projects like glusterfs are options, just be sure you understand what happens when partitions form.

If you’re thinking of something like NFS or iSCSI, consider where those would come from. Almost certainly you don’t want a single node serving them up - that would introduce a single point of failure and the whole point of this is to remove those.

You could add a SAN into the mix for some hardware redundancy, however you need to ensure either:

  • exactly one node accesses the SAN at any time (active/passive), or
  • your filesystem can handle concurrent reads and write from multiple hosts (active/active)

Both options will require quorum and fencing in order to reliably hand out locks. This is the sweet-spot of a full blown cluster manager, System HA, and why traditional, scary, cluster managers like Pacemaker and Veritas exist.

Unless you’d like to manually resolve block-level conflicts after a split-brain, some part of the system needs to rigorously enforce these things. Otherwise it’s widget managers all the way down.

One of Us

Once you have a traditional cluster manager, you might be surprised how useful it can be.

A lot of applications are resilient to failures once they’re up, but have non-trivial startup sequences. Consider RabbitMQ:

  • Pick one active node and start rabbitmq-server
  • Everywhere else, run
    • Start rabbitmq-server
    • rabbitmqctl stop_app
    • rabbitmqctl join_cluster rabbit@${firstnode}
    • rabbitmqctl start_app

Now the Rabbit’s built-in HA can take over but to get to that point:

  • How do you pick which is the first node?
  • How do you tell everyone else who it is?
  • Can rabbitmq accept updates before all peers have joined?
  • Can your app?

This is the sort of thing traditional cluster managers do before breakfast. They are after all really just distributed finite state machines.

Recovery can be a troublesome time too:

http://previous.rabbitmq.com/v3_3_x/clustering.html

the last node to go down must be the first node to be brought online. If this doesn’t happen, the nodes will wait 30 seconds for the last disconnected node to come back online, and fail afterwards.

Depending on how the nodes were started, you may see some nodes running and some stopped. What happens if the last node isn’t online yet?

Some cluster managers support concepts like dual-phased services (or “Master/slave” to use the politically incorrect term) that can allow automated recovery even with constraints such as these. We have Galera agents that also take advantage of these capabilities - finding the ‘right’ node to bootstrap before synchronising it to all the peers.

Final thoughts

HA is a spectrum; where you fit depends on what assumptions you can make about your application stack.

Just don’t make those assumptions before you really understand the problem at hand, because retrofitting an application to remove simplifying assumptions (such as only supporting pets) is even harder than designing it in in the first place.

What’s your minimum viable cluster?

Receiving Reliable Notification of Cluster Events

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at September 01, 2015 02:16 AM

When Pacemaker 1.1.14 arrives, there will be a more reliable way to receive notification of cluster events.

In the past, we relied on an ocf:pacemaker:ClusterMon resource to monitor the cluster status with the crm_mon daemon and trigger alerts on each cluster event.

One of the arguments to ClusterMon was the location of a custom script that would be called when the event happened. This script could then create an SNMP trap, SMS, email, etc to alert the admin based on dynamically filled environment variables describing precisely the cluster event that occurred.

The Problem

Relying on a cluster resource proved to be not such a great idea for a number of reasons:

  • Alerts ceased if the resource was not running
  • There was no indication that the alerts had ceased
  • The resource was likely to be stopped at exactly the point that something interesting was happening
  • Old alerts were likely to be resent whenever the status section of the cib was rebuilt when a new DC was elected

The Solution

Clearly support for notifications needed to be baked into the core of Pacemaker, so that’s what we’ve now done. Finally (sorry, you wouldn’t believe the length of my TODO list).

To make it work, drop a script onto each of the nodes, /var/lib/pacemaker/notify.sh would be a good option, then tell the cluster to start using it:

    pcs property set notification-agent=/var/lib/pacemaker/notify.sh

Like resource agents, this one can be written in whatever language makes you happy - as long as it can read environment variables.

Pacemaker will check that your agent completed and report the return code. If the return code is not 0, Pacemaker will also log any output from your agent.

The agent is called asynchronously and should complete quickly. If it has not completed after 5 minutes it will be terminated by the cluster.

Where to Send Alerts

I think we can all agree that hard coding the intended recipient of the notification into the scripts would be a bad idea. It would make updating the recipient (vacation, change of role, change of employer) annoying and prevent the scripts from being reused between different clusters.

So there is also a notification-recipient cluster property which will be passed to the script. It can contain whatever you like, in whatever format you like, as long as the notification-agent knows what to do with it.

To get people started, the source includes a sample agent which assumes notification-recipient is a filename, eg.

    pcs property set notification-recipient=/var/lib/pacemaker/notify.log
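
A minimal agent along these lines might look like the sketch below. This is only an illustration (not the shipped sample); it appends the standard variables to whatever file notification-recipient points at:

    #!/bin/sh
    # /var/lib/pacemaker/notify.sh - append each cluster event to the recipient file
    logfile="${CRM_notify_recipient:-/var/lib/pacemaker/notify.log}"
    {
      echo "$(date) kind=${CRM_notify_kind} node=${CRM_notify_node}"
      echo "  task=${CRM_notify_task} rsc=${CRM_notify_rsc} rc=${CRM_notify_rc} desc=${CRM_notify_desc}"
    } >> "$logfile"
    exit 0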

Interface

We preserved the old list of environment variables, so any existing ClusterMon scripts will still work in this new mode. I have added a few extra ones though.

Environment variables common to all notification events:

  • CRM_notify_kind (New) Indicates the type of notification. One of resource, node, and fencing.
  • CRM_notify_version (New) Indicates the version of Pacemaker sending the notification.
  • CRM_notify_recipient The value specified by notification-recipient from the cluster configuration.

Additional environment variables available for notification of node up/down events (new):

  • CRM_notify_node The node name for which the status changed
  • CRM_notify_nodeid (New) The node id for which the status changed
  • CRM_notify_desc The current node state. One of member or lost.

Additional environment variables available for notification of fencing events (both successful and failed):

  • CRM_notify_node The node for which the status changed.
  • CRM_notify_task The operation that caused the status change.
  • CRM_notify_rc The numerical return code of the operation.
  • CRM_notify_desc The textual output relevant to the error code of the operation (if any) that caused the status change.

Additional environment variables available for notification of resource operations:

  • CRM_notify_node The node on which the status change happened.
  • CRM_notify_rsc The name of the resource that changed the status.
  • CRM_notify_task The operation that caused the status change.
  • CRM_notify_interval (New) The interval of a resource operation
  • CRM_notify_rc The numerical return code of the operation.
  • CRM_notify_target_rc The expected numerical return code of the operation.
  • CRM_notify_status The numerical representation of the status of the operation.
  • CRM_notify_desc The textual output relevant to the error code of the operation (if any) that caused the status change.

Fencing for Fun and Profit with SBD

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at August 31, 2015 06:51 AM

What is this Fencing Thing and do I Really Need it?

Fundamentally fencing is a mechanism for turning a question

Is node X capable of causing corruption?

into an answer

No

so that the cluster can safely initiate recovery after a failure.

This question exists because we cannot assume that an unreachable node is in fact off.

Sometimes the cluster will do this by powering the node off; clearly a dead node can do no harm. Other times we will use a combination of network (stop new work from arriving) and disk (stop a rogue process from writing anything to shared storage) fencing.

Fencing is a requirement of almost any cluster, regardless of whether it is active/active, active/passive or involves shared storage (or not).

One of the best ways of implementing fencing is with a remotely accessible power switch, however some environments may not allow them, may not see the value in them, or may only have ones that are unsuitable for clustering (such as IPMI devices that lose power with the host they control).

Enter SBD

SBD can be particularly useful in environments where traditional fencing mechanisms are not possible.

SBD integrates with Pacemaker, a watchdog device and, optionally, shared storage to arrange for nodes to reliably self-terminate when fencing is required (such as node failure or loss of quorum).

This is achieved through a watchdog device, which will reset the machine if SBD does not poke it on a regular basis or if SBD closes its connection “ungracefully”.

Without shared storage, SBD will arrange for the watchdog to expire if:

  • the local node loses quorum, or
  • the Pacemaker, Corosync or SBD daemons are lost on the local node and are not recovered, or
  • Pacemaker determines that the local node requires fencing, or
  • in the extreme case that Pacemaker kills the sbd daemon as part of recovery escalation

When shared storage is available, SBD can also be used to trigger fencing of its peers.

It does this through the exchange of messages via shared block storage such as a SAN, iSCSI, FCoE. SBD on the target peer sees the message and triggers the watchdog to reset the local node.
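
For illustration only (we ignore the shared-disk mode in the rest of this post), the shared disk is prepared and exercised with the sbd command line tool; the device path below is a placeholder for your shared LUN:

# initialise SBD metadata on the shared disk (wipes any existing SBD data on it)
sbd -d /dev/disk/by-id/my-shared-lun create

# show the node slots and any pending messages
sbd -d /dev/disk/by-id/my-shared-lun list

# write a test message into another node's slot
sbd -d /dev/disk/by-id/my-shared-lun message node-1 test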

These properties of SBD also make it particularly useful for dealing with network outages, potentially between different datacenters, or when the cluster needs to forcefully recover a resource that refuses to stop.

Documentation is another area where diskless SBD shines, because it requires no special knowledge of the user’s environment.

Not a Silver Bullet

One of the ways in which SBD recognises that the node has become unhealthy is to look for quorum being lost. However traditional quorum makes no sense in a two-node cluster and is often disabled by setting no-quorum-policy=ignore.

SBD will honour this setting though, so in the event of a network failure in a two-node cluster, the node isn’t going to self-terminate.

Likewise if you enabled Corosync 2’s two_node option, both sides will always have quorum and neither party will self-terminate.

It is therefore suggested to have three or more nodes when using SBD without shared storage.

Additionally, using SBD for fencing relies on at least part of a system that has already shown itself to be malfunctioning (otherwise we wouldn’t be fencing it) to function correctly.

Everything has been done to keep SBD as small, simple and reliable as possible, however all software has bugs and you should choose an appropriate level of paranoia for your circumstances.

Installation

RHEL 7 and derivatives like CentOS include sbd, so all you need is yum install -y sbd.

For other distributions, you’ll need to build it from source.

# git clone git@github.com:ClusterLabs/sbd.git
# cd sbd
# autoreconf -i
# ./configure

then either

# make rpm

or

# sudo make all install
# sudo install -D -m 0644 src/sbd.service /usr/lib/systemd/system/sbd.service
# sudo install -m 644 src/sbd.sysconfig /etc/sysconfig/sbd

NOTE: The instructions here do not apply to the version of SBD that currently ships with openSUSE and SLES.

Configuration

SBD’s configuration lives in /etc/sysconfig/sbd by default and we include a sample to get you started.

For our purposes here, we can ignore the shared disk functionality and concentrate on how SBD can help us recover from loss of quorum as well as daemon and resource-level failures.

Most of the defaults will be fine, and really all you need to do is specify the watchdog device present on your machine.

Simply set SBD_WATCHDOG_DEV to the path where we can find your device and that’s it. Below is the config from my cluster:

# grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

Beware: If uname -n does not match the name of the node in the cluster configuration, you will need to pass the advertised name to SBD with the -n option. Eg. SBD_OPTS="-n special-name-1"

Adding a Watchdog to a Virtual Machine

Anyone experimenting with virtual machines can add a watchdog device to an existing instance by editing the xml and restarting the instance:

virsh edit vmnode

Add <watchdog model='i6300esb'/> underneath the ‘<devices>’ tag. Save and close, then reboot the instance to have the config change take effect:

virsh destroy vmnode
virsh start vmnode

You can then confirm the watchdog was added:

virsh dumpxml vmnode | grep -A 1 watchdog 

The output should look something like:

<watchdog model='i6300esb' action='reset'>
  <alias name='watchdog0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</watchdog>

Using a Software Watchdog

If you do not have a real watchdog device, you should go out and get one.

However you’re probably investigating SBD because it was not possible/permitted to get a real fencing device, so there is a strong chance you’re going to be using a software based watchdog device.

Software based watchdog devices are not evil incarnate, however you should be aware of their limitations; they are after all software and require a degree of correctness from a system that has already shown itself to not be (functioning correctly, otherwise we wouldn’t be fencing it).

That being said, it still provides value when there is a network outage, potentially between different datacenters, or the cluster needs to forcefully recover a resource that refuses to stop.

To use a software watchdog, you’ll need to load the kernel’s softdog module:

/sbin/modprobe softdog

Once loaded you’ll see the device appear and you can set SBD_WATCHDOG_DEV accordingly:

# ls -al /dev/watchdog
crw-rw----. 1 root root 10, 130 Aug 31 14:19 /dev/watchdog

Don’t forget to arrange for the softdog module to be loaded at boot time too:

# echo softdog > /etc/modules-load.d/softdog.conf 

Using SBD

On a systemd based system, enabling SBD with systemctl enable sbd will ensure that SBD is automatically started and stopped whenever corosync is.

If you’re integrating SBD with a distro that doesn’t support systemd, you’ll likely want to edit the corosync or cman init script to both source the sysconfig file and start the sbd daemon.

Simulating a Failure

To see SBD in action, you could:

  • stop pacemaker without stopping corosync, and/or
  • kill the sbd daemon, and/or
  • use stonith_admin -F
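
For example (hypothetical node name, and note that the first command really will reboot it):

# from a surviving node, request fencing of a peer
stonith_admin -F node-2

# or, on the node under test, kill the sbd daemon ungracefully;
# the watchdog should fire within SBD_WATCHDOG_TIMEOUT
pkill -9 sbd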

Killing pacemakerd is usually not enough to trigger fencing because systemd will restart it “too” quickly. Likewise, killing one of the child daemons will only result in pacemakerd respawning them.

Uninstalling

On every host, run:

# systemctl disable sbd

Then on one node, run:

# pcs property set stonith-watchdog-timeout=0
# pcs cluster stop --all

At this point no part of the cluster, including Corosync, Pacemaker or SBD should be running on any node.

Now you can start the cluster again and completely remove the stonith-watchdog-timeout option:

# pcs cluster start --all
# pcs property unset stonith-watchdog-timeout

Troubleshooting

SBD will refuse to start if the configured watchdog device does not exist. You might see something like this:

# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)

To obtain more logging from SBD, pass additional -V options to the sbd daemon when launching it.

SBD will trigger the watchdog (and your node will reboot) if uname -n is different to the name of the node in the cluster configuration. If this is the case for you, pass the correct name to sbd with the -n option.

Pacemaker will refuse to start if it detects that SBD should be in use but cannot find the sbd process.

The have-watchdog property will indicate if Pacemaker considers SBD to be in use:

# pcs property 
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2609
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze

Benchmarking DRBD

Posted in LINBIT Blogs by flip at July 27, 2015 12:50 PM

We often see people on #drbd or on drbd-user trying to measure the performance of their setup. Here are a few best practices for doing this.


First, a few facts.

  • The synchronization rate shown in /proc/drbd has nothing to do with the replication rate. These are different things; don’t mistake the speed: value there for a performance indicator.
  • Use an appropriate tool. dd with default settings and cp don’t write to the device, but only into the Linux buffers at first – so timing these won’t tell you anything about your storage performance.
  • The hardest discipline is single-threaded, io-depth 1. Here every access has to wait for the preceding one to finish, so each bit of latency will bite you hard.
    Getting some bandwidth with four thousand concurrent writes is easy!
  • Network benchmarking isn’t that easy, either. iperf will typically send only NULs; checksum offloading might hide or create problems; switches, firewalls, etc. will all introduce noise.

What you want to do is this:

  1. Start at the bottom of the stack. Measure (and tune) the LV that DRBD will sit upon, then the network, then DRBD.
  2. Our suggestion is still to use a direct connection, ie. a crossover cable.
  3. If you don’t have any data on the device, test against the block device. A filesystem on top will create additional meta-data load and barriers, this can severely affect your IOPs. (Especially on rotating media.)
  4. Useful tools are fio with direct=1, and for a basic single-threaded io-depth=1 run you can use dd oflag=direct (for writes; when reading, set iflag instead).
    dd with bs=4096 is nice to measure the IOPs, bs=1M will give you the bandwidth (see the example after this list).
  5. Get enough data. Running dd with settings that make it finish within 0.5 seconds means that you are likely to suffer from outliers, make it run 5 seconds or longer!
    fio has the nice runtime parameter, just let it run 20 seconds to have some data.
  6. For any unexpected result try to measure again a minute later, then think hard about what could be wrong and where your cluster’s bottlenecks are.
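
As a concrete illustration of point 4, here are two single-threaded dd runs against a hypothetical test LV (direct I/O, and note that this destroys whatever is on the device):

# IOPs-oriented: 4k blocks
dd if=/dev/zero of=/dev/vg0/lv_test bs=4096 count=100000 oflag=direct

# bandwidth-oriented: 1M blocks
dd if=/dev/zero of=/dev/vg0/lv_test bs=1M count=5000 oflag=direct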

Some problems that we’ve seen in the past are:

  • Misaligned partitions (sector 63, anyone?) might hurt you plenty. Really.
    If you suffer from that, get the secondary correctly aligned, switch over, and re-do the previous primary node.
  • iperf goes fast, but a connected DRBD doesn’t: try turning off the offloading on the network cards; some will trash the checksum for non-zero data, and that means retransmissions.
  • Some RAID controllers can be tuned – to either IOPs or bandwidth. Sounds strange, but we have seen such effects.
  • Concurrent load – trying to benchmark the storage on your currently active database machine is not a good idea.
  • Broken networks should be looked for even if there are no error counters on the interface. Recently a pair started to connect just fine, but then couldn’t even synchronize with a meagre 10MiByte/sec…
    The best hint was the ethtool output that said Speed: 10MBit; switching cables did resolve that issue.

If you’re doing all that correctly, and are using a recent DRBD version (please, don’t come whining about DRBD 8.0.16 performance! ;), for a pure random-write IO you should only see 1-3% difference between the lower-level LV directly and a connected DRBD.


Update: here’s an example fio call.

fio --name $name --filename $dev --ioengine libaio --direct 1 \
   --rw randwrite --bs 4k --runtime 30s --numjobs $threads \
   --iodepth $iodepth --append-terse

Double Failure - Get out of Jail Free? Not so Fast

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 29, 2015 03:34 AM

Occasionally you might hear someone explain away a failure/recovery scenario with “that’s a double failure, we can’t/don’t protect against those”.

There are certainly situations where this is true. A node failure combined with a fencing device failure will and should prevent a cluster from recovering services on that node.

However!

It doesn’t mean we can ignore the failure. Nor does it make it acceptable to forget that services on the failed node still need to be recovered one day.

Playing the “double failure” card also requires the failures to be in different layers. Notice that the example above was for a node failure and fencing failure.

The failure of a second node while recovering from the first doesn’t count (unless it was your last one).

Just something to keep in mind in case anyone was thinking about designing something to support highly available openstack instances…

Life at the Intersection of Pets and Cattle

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 28, 2015 02:50 AM

Scale UP: Servers are like pets - you name them and when they get sick you nurse them back to health

Scale OUT: Servers are like cattle - you number them and when they get sick you shoot them

Why Pets?

The theory goes that pets have no place in the server room. Everything should be stateless and replicated and if one copy dies, who cares because there are 100 more.

Except real life isn’t like that.

It’s hard to build replicated, stateless, shared-nothing systems; it’s even hard just bringing them online, because the applications often need external context in order to distinguish between different recovery scenarios.

I cover some of these ideas in my Highly Available Openstack Deployments document.

Indeed that document shows that even the systems built to manage cattle contain pieces that need to be treated as pets. They demonstrate the very limitations of the model they advocate.

Life at the Intersection

Eventually someone realises they need a pet after-all.

This is when things get interesting, because baked in from the start is the assumption that we don’t need to care about cattle:

  • It doesn’t matter if some of the cattle die, there’s plenty more
  • It doesn’t matter if some of the cattle die when a paddock explodes, there’s plenty more
  • It doesn’t matter if some of the cattle are lost moving them from one paddock to another, there’s plenty more
  • It doesn’t matter if new cattle stock is lost unloading them into a new paddock, there’s plenty more
  • It doesn’t matter, just try again

The assumptions manifest themselves in a variety of ways:

  • Failed actions are not retried
  • Error reporting is confused for error handling
  • Incomplete records (since the cattle can be easily re-counted)

All of which makes adopting some cattle as pets really freaking hard.

Raising Pets in Openstack

Some things are easier said than done.

When the compute node hosting an instance dies, evacuate it elsewhere

Easy right?

All we need to do is notice that the compute node disappeared, make sure it’s really dead (otherwise it might be running twice, which would be bad), and pick someone to call evacuate.

Except:

  • You can’t call evacuate before nova notices its peer is gone
  • You can’t (yet) tell nova that its peer has gone

Ok, so we can define a fencing device that talks to nova, loops until it notices the peer is gone and calls evacuate.

Not so fast: the failure that took out the compute node may have also taken out part of the control plane, which needs fencing to complete before it can be recovered. However in order for fencing to complete, the control plane needs to have recovered (nova isn’t going to be able to tell you it noticed the peer died if your request can’t be authenticated).

Ok, so we can’t use a fencing device, but the cluster will let services know when their peers go away. The notifications are even independent of recovering the VIPs, so as long as at least one of the control nodes survives, we can block waiting for nova and make it work. We just need to arrange for only one of the survivors to perform the evacuations.

Job done, retire…

Not so fast kimosabi. Although we can recover a single set of failed compute and/or control nodes, what if there is a subsequent failure? You’ve had one failure; that means more work, and more work means more opportunities to create more failures.

Oh, and by the way, you can’t call evacuate more than once. Nor is there a definitive way to determine if an instance is being evacuated.

Here are some of the ways we could still fail to recover instances:

  • A compute node that is in the process of initiating evacuations dies
    It takes time for nova to accept the evacuation calls, there is a window for some to be lost if this node dies too.
  • A compute node which is receiving an evacuated node dies
    At what point does the new compute node “own” the instance such that, if this node died too, the instance would be picked up by a subsequent evacuate call? Depending on what is recorded inside nova and when, you might have a problem.
  • The control node which is orchestrating an evacuation dies
    Is there a window between a request being removed from the queue and it being actioned to the point that it will complete? Depending on what is recorded inside nova and when, you might have a problem.
  • A control node hosting one of the VIPs dies while an evacuation is in progress (probably)
    Do any of the activities associated with evacuation require the use of inter-component APIs? If so, you might have a problem if one of those APIs is temporarily unavailable
  • Some other entity (human or otherwise) also initiates an evacuation
    If there is no way for nova to say whether an instance is being evacuated, how can we expect an admin to know that initiating one would be unsafe?
  • It is the 3rd Tuesday of a month with 5 Saturdays and the moon is a waxing gibbous
    Ok, perhaps I’m a little jaded at this point.

Doing it Right

Hopefully I’ve demonstrated the difficulty of adding pets (highly available instances) as an afterthought. All it took to derail the efforts here was the seemingly innocuous decision that the admin should be responsible for retrying failed evacuations (based on it not having appeared somewhere else after a while?). Who knows what similar assumptions are still lurking.

At this point, people are probably expecting that I put my Pacemaker hat on and advocate for it to be given responsibility for all the pets. Sure we could do it, we could use nova APIs to manage them just like we do when people use their hypervisors directly.

But that’s never going to happen, so let’s look at the alternatives. I foresee three main options:

  1. First class support for pets in nova
    Seriously, the scheduler is the best place for all this, it has all the info to make decisions and the ability to make them happen.

  2. First class support for pets in something that replaces nova
    If the technical debt or political situation is such that nova cannot move in this direction, perhaps someone else might.

  3. Creation of a distributed finite state machine that:

    • watches or is somehow told of new instances to track
    • watches for successful fencing events
    • initiates and tracks evacuations
    • keeps track of its peer processes so that instances are still evacuated in the event of process or node failure

The cluster community has pretty much all the tech needed for the last option, but it is mostly written in C so I expect that someone will replicate it all in Python or Go :-)

If anyone is interested in pursuing capabilities in this area and would like to benefit from the knowledge that comes with 14 years experience writing cluster managers, drop me a line.

Adding Managed Compute Nodes to a Highly Available Openstack Control Plane

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 12, 2015 11:21 PM

As previously announced on RDO list and GitHub, we now have a way to allow Pacemaker to manage compute nodes within a single cluster while still allowing us to scale beyond corosync’s limits.

Having this single administrative domain then allows us to do clever things like automated recovery of VMs running on a failed or failing compute node.

The main difference with the previous deployment mode is that services on the compute nodes are now managed and driven by the Pacemaker cluster on the control plane.

The compute nodes do not become full members of the cluster and they no longer require the full cluster stack, instead they run pacemaker_remoted which acts as a conduit.

Assumptions

We start by assuming you have a functional Juno or Kilo control plane configured for HA and access to the pcs cluster CLI.

If you don’t have this already, there is a decent guide on Github for how to achieve this.

Basics

We start by installing the required packages onto the compute nodes from your favorite provider:

yum install -y openstack-nova-compute openstack-utils python-cinder openstack-neutron-openvswitch openstack-ceilometer-compute python-memcached wget openstack-neutron pacemaker-remote resource-agents pcs

While we’re here, we’ll also install some pieces that aren’t in any packages yet (do this on both the compute nodes and the control plane):

mkdir /usr/lib/ocf/resource.d/openstack/
wget -O /usr/lib/ocf/resource.d/openstack/NovaCompute https://github.com/beekhof/osp-ha-deploy/raw/master/pcmk/NovaCompute
chmod a+x /usr/lib/ocf/resource.d/openstack/NovaCompute 

wget -O /usr/sbin/fence_compute https://github.com/beekhof/osp-ha-deploy/raw/master/pcmk/fence_compute
chmod a+x /usr/sbin/fence_compute

Next, on one node generate a key that pacemaker on the control plane will use to authenticate with pacemaker-remoted on the compute nodes.

dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1

Now copy that to all the other machines (control plane and compute nodes).

At this point we can enable and start pacemaker-remoted on the compute nodes:

chkconfig pacemaker_remote on
service pacemaker_remote start

Finally, copy /etc/nova/nova.conf, /etc/nova/api-paste.ini, /etc/neutron/neutron.conf, and /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini from the control plane to each of your compute nodes.

If you’re using ceilometer, you’ll also want /etc/ceilometer/ceilometer.conf from there too.

Preparing the Control Plane

At this point, we need to take down the control plane in order to safely update the cluster configuration. We don’t want things to be bouncing around while we make large scale modifications.

pcs resource disable keystone

Next we must tell the cluster to look for and run the existing control plane services only on the control plane (and not on the about-to-be-defined compute nodes). We can automate this with the clever use of scripting tools:

for i in $(cibadmin -Q --xpath //primitive --node-path | tr ' ' '\n' | awk -F "id='" '{print $2}' | awk -F "'" '{print $1}' | uniq | grep -v "\-fence") ; do pcs constraint location $i rule resource-discovery=exclusive score=0 osprole eq controller ; done

Defining the Compute Node Services

Now we can create the services that can run on the compute node. We create them in a disabled state so that we have a chance to limit where they can run before the cluster attempts to start them.

pcs resource create neutron-openvswitch-agent-compute  systemd:neutron-openvswitch-agent --clone interleave=true --disabled --force
pcs constraint location neutron-openvswitch-agent-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute 

pcs resource create libvirtd-compute systemd:libvirtd  --clone interleave=true --disabled --force
pcs constraint location libvirtd-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute

pcs resource create ceilometer-compute systemd:openstack-ceilometer-compute --clone interleave=true --disabled --force
pcs constraint location ceilometer-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute

pcs resource create nova-compute ocf:openstack:NovaCompute user_name=admin tenant_name=admin password=keystonetest domain=${PHD_VAR_network_domain} --clone interleave=true notify=true --disabled --force
pcs constraint location nova-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute

Please note, a previous version of this post used:

pcs resource create nova-compute ocf:openstack:NovaCompute --clone interleave=true --disabled --force

Make sure you use the new form

Now that the services and their permitted locations are defined, we specify the order in which they must be started.

pcs constraint order start neutron-server-clone then neutron-openvswitch-agent-compute-clone require-all=false

pcs constraint order start neutron-openvswitch-agent-compute-clone then libvirtd-compute-clone
pcs constraint colocation add libvirtd-compute-clone with neutron-openvswitch-agent-compute-clone

pcs constraint order start libvirtd-compute-clone then ceilometer-compute-clone
pcs constraint colocation add ceilometer-compute-clone with libvirtd-compute-clone

pcs constraint order start ceilometer-notification-clone then ceilometer-compute-clone require-all=false

pcs constraint order start ceilometer-compute-clone then nova-compute-clone
pcs constraint colocation add nova-compute-clone with ceilometer-compute-clone

pcs constraint order start nova-conductor-clone then nova-compute-clone require-all=false

Configure Fencing for the Compute nodes

At this point we need to define how compute nodes can be powered off (‘fenced’ in HA terminology) in the event of a failure.

I have a switched APC PDU, the configuration for which looks like this:

pcs stonith create fence-compute fence_apc ipaddr=east-apc login=apc passwd=apc pcmk_host_map="east-01:2;east-02:3;east-03:4;"

But you might be using Drac or iLO, which would require you to define one for each node, eg.

pcs stonith create fence-compute-1 fence_ipmilan login="root" passwd="supersecret" ipaddr="192.168.1.1" pcmk_host_list="compute-1"
pcs stonith create fence-compute-2 fence_ipmilan login="root" passwd="supersecret" ipaddr="192.168.1.2" pcmk_host_list="compute-2"
pcs stonith create fence-compute-3 fence_ipmilan login="root" passwd="supersecret" ipaddr="192.168.1.3" pcmk_host_list="compute-3"

Be careful when using devices that lose power with the hosts they control. For such devices, a power failure and network failure look identical to the cluster, which makes automated recovery unsafe.

Obsolete Instructions

Please note, a previous version of this post included the following instructions, however they are no longer required.

Next we configure the integration piece that notifies nova whenever the cluster fences one of the compute nodes. Adjust the following command to conform to your environment:

pcs --force stonith create fence-nova fence_compute domain=example.com login=admin tenant-name=admin passwd=keystonetest auth-url=http://vip-keystone:35357/v2.0/

Use pcs stonith describe fence_compute if you need more information about any of the options.

Finally we instruct the cluster that both methods are required to consider the host safely fenced. Assuming the fence_ipmilan case, you would then configure:

pcs stonith level add 1 compute-1 fence-compute-1,fence-nova
pcs stonith level add 1 compute-2 fence-compute-2,fence-nova
pcs stonith level add 1 compute-3 fence-compute-3,fence-nova

Re-enabling the Control Plane and Registering Compute Nodes

The location constraints we defined above reference node properties which we now define with the help of some scripting magic:

for node in $(cibadmin -Q -o nodes | grep uname | sed s/.*uname..// | awk -F\" '{print $1}' | awk -F. '{print $1}'); do pcs property set --node ${node} osprole=controller; done

Connections to remote hosts are modelled as resources in Pacemaker. So in order to add them to the cluster, we define a service for each one and set the node property that allows it to run compute services.

Once again assuming the three compute nodes from earlier, we would run:

pcs resource create compute-1 ocf:pacemaker:remote
pcs resource create compute-2 ocf:pacemaker:remote
pcs resource create compute-3 ocf:pacemaker:remote

pcs property set --node compute-1 osprole=compute
pcs property set --node compute-2 osprole=compute
pcs property set --node compute-3 osprole=compute
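
If you want to double-check the attributes before re-enabling everything, you can dump the nodes section of the CIB; the attribute name is the one we set above:

cibadmin -Q -o nodes | grep osprole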

Thunderbirds are Go!

The only remaining step is to re-enable all the services and run crm_mon to watch the cluster bring them and the compute nodes up:

pcs resource enable keystone
pcs resource enable neutron-openvswitch-agent-compute
pcs resource enable libvirtd-compute
pcs resource enable ceilometer-compute
pcs resource enable nova-compute

Provisioning quickly nodes in a SeaMicro chassis with Ansible

Posted in Arrfab's Blog » Cluster by fabian.arrotin at January 12, 2015 02:19 PM

Recently I had to quickly test and deploy CentOS on 128 physical nodes, just to test the hardware and verify that all currently "supported" CentOS releases could be installed quickly when needed. The interesting bit is that it was a completely new infra, without any traditional deployment setup in place, so obviously, as sysadmins, we directly think about pxe/kickstart, which is so trivial to set up. That was the first time I had to "play" with SeaMicro devices/chassis though, and so I had to understand how they work (the SeaMicro 15K fabric chassis, to be precise). One thing to note is that those SeaMicro chassis don't provide a remote VGA/KVM feature (but who cares, as we'll automate the whole thing, right ?) but they instead provide either cli (ssh) or rest api access to the management interface, so that you can quickly reset/reconfigure a node, change vlan assignment, and so on.

It's not a secret that I like to use Ansible for ad-hoc tasks, and I thought that it would be (again) a good tool for that quick task. If you have used Ansible already, you know that you have to declare nodes and variables (not strictly needed, but really useful) in the inventory (if you don't gather the inventory from an external source). To configure my pxe setup (and so be able to reconfigure it when needed) I obviously needed to get the mac addresses from all 64 nodes in each chassis, decide that hostnames will be n${slot-number}., etc .. (and yes, in SeaMicro slot 1 = 0/0, slot 2 = 1/0, and so on ...)

The following quick-and-dirty bash script lets you do that in 2 seconds (ssh into the chassis, gather the information, and fill some variables in my ansible host_vars/${hostname} file) :

#!/bin/bash
ssh admin@hufty.ci.centos.org "enable ;  show server summary | include Intel ; quit" | while read line ;
  do
  seamicrosrvid=$(echo $line |awk '{print $1}')
  slot=$(echo $seamicrosrvid| cut -f 1 -d '/')
  id=$(( $slot + 1)); ip=$id ; mac=$(echo $line |awk '{print $3}')
  echo -e "name: n${id}.hufty.ci.centos.org \nseamicro_chassis: hufty \nseamicro_srvid: $seamicrosrvid \nmac_address: $mac \nip: 172.19.3.$ip \ngateway: 172.19.3.254 \nnetmask: 255.255.252.0 \nnameserver: 172.19.0.12 \ncentos_dist: 6" > inventory/n${id}.hufty.ci.centos.org
done

Nice, so we have all the ~/ansible/hosts/host_vars/${inventory_hostname} files in one go (I'll let you add ${inventory_hostname} to the ~/ansible/hosts/hosts.cfg file with the same script, modified to your needs).
For the next step, we assume that we already have dnsmasq installed on the "head" node, and that we also have an httpd setup to provide the kickstart to the nodes during installation.
So our basic ansible playbook looks like this :

---
- hosts: ci-nodes
  sudo: True
  gather_facts: False

  vars:
    deploy_node: admin.ci.centos.org
    seamicro_user_login: admin
    seamicro_user_pass: obviously-hidden-and-changed
    seamicro_reset_body:
      action: reset
      using-pxe: "true"
      username: "{{ seamicro_user_login }}"
      password: "{{ seamicro_user_pass }}"

  tasks:
    - name: Generate kickstart file[s] for Seamicro node[s]
      template: src=../templates/kickstarts/ci-centos-{{ centos_dist }}-ks.j2 dest=/var/www/html/ks/{{ inventory_hostname }}-ks.cfg mode=0755
      delegate_to: "{{ deploy_node }}"

    - name: Adding the entry in DNS (dnsmasq)
      lineinfile: dest=/etc/hosts regexp="^{{ ip }} {{ inventory_hostname }}" line="{{ ip }} {{ inventory_hostname }}"
      delegate_to: "{{ deploy_node }}"
      notify: reload_dnsmasq

    - name: Adding the DHCP entry in dnsmasq
      template: src=../templates/dnsmasq-dhcp.j2 dest=/etc/dnsmasq.d/{{ inventory_hostname }}.conf
      delegate_to: "{{ deploy_node }}"
      register: dhcpdnsmasq

    - name: Reloading dnsmasq configuration
      service: name=dnsmasq state=restarted
      run_once: true
      when: dhcpdnsmasq|changed
      delegate_to: "{{ deploy_node }}"

    - name: Generating the tftp configuration boot file
      template: src=../templates/pxeboot-ci dest=/var/lib/tftpboot/pxelinux.cfg/01-{{ mac_address | lower | replace(":","-") }} mode=0755
      delegate_to: "{{ deploy_node }}"

    - name: Resetting the Seamicro node[s]
      uri: url=https://{{ seamicro_chassis }}.ci.centos.org/v2.0/server/{{ seamicro_srvid }}
           method=POST
           HEADER_Content-Type="application/json"
           body='{{ seamicro_reset_body | to_json }}'
           timeout=60
      delegate_to: "{{ deploy_node }}"

    - name: Waiting for Seamicro node[s] to be available through ssh ...
      action: wait_for port=22 host={{ inventory_hostname }} timeout=1200
      delegate_to: "{{ deploy_node }}"

  handlers:
    - name: reload_dnsmasq
      service: name=dnsmasq state=reloaded

The first thing to notice is that you can use Ansible to provision nodes that aren't already running : people think that ansible is just for interacting with already provisioned and running nodes, but by providing useful information in the inventory, and by delegating actions, we can already start "managing" those yet-to-come nodes.
All the templates used in that playbook are really basic ones, so nothing "rocket science". For example, the only diff for the kickstart.j2 template is that we inject ansible variables (for network and storage) :

network  --bootproto=static --device=eth0 --gateway={{ gateway }} --ip={{ ip }} --nameserver={{ nameserver }} --netmask={{ netmask }} --ipv6=auto --activate
network  --hostname={{ inventory_hostname }}
<snip>
part /boot --fstype="ext4" --ondisk=sda --size=500
part pv.14 --fstype="lvmpv" --ondisk=sda --size=10000 --grow
volgroup vg_{{ inventory_hostname_short }} --pesize=4096 pv.14
logvol /home  --fstype="xfs" --size=2412 --name=home --vgname=vg_{{ inventory_hostname_short }} --grow --maxsize=100000
logvol /  --fstype="xfs" --size=8200 --name=root --vgname=vg_{{ inventory_hostname_short }} --grow --maxsize=1000000
logvol swap  --fstype="swap" --size=2136 --name=swap --vgname=vg_{{ inventory_hostname_short }}
<snip>

The dhcp step isn't mandatory, but at least in that subnet we only allow dhcp for "already known" mac addresses, retrieved from the ansible inventory (and previously fetched directly from the seamicro chassis) :

# {{ name }} ip assignement
dhcp-host={{ mac_address }},{{ ip }}

Same thing for the pxelinux tftp config file :

SERIAL 0 9600
DEFAULT text
PROMPT 0
TIMEOUT 50
TOTALTIMEOUT 6000
ONTIMEOUT {{ inventory_hostname }}-deploy

LABEL local
        MENU LABEL (local)
        MENU DEFAULT
        LOCALBOOT 0

LABEL {{ inventory_hostname}}-deploy
        kernel CentOS/{{ centos_dist }}/{{ centos_arch}}/vmlinuz
        MENU LABEL CentOS {{ centos_dist }} {{ centos_arch }}- CI Kickstart for {{ inventory_hostname }}
        {% if centos_dist == 7 -%}
	append initrd=CentOS/7/{{ centos_arch }}/initrd.img net.ifnames=0 biosdevname=0 ip=eth0:dhcp inst.ks=http://admin.ci.centos.org/ks/{{ inventory_hostname }}-ks.cfg console=ttyS0,9600n8
	{% else -%}
        append initrd=CentOS/{{ centos_dist }}/{{ centos_arch }}/initrd.img ksdevice=eth0 ip=dhcp ks=http://admin.ci.centos.org/ks/{{ inventory_hostname }}-ks.cfg console=ttyS0,9600n8
 	{% endif %}

The interesting part is the one on which I needed to spend more time : as said, it was the first time I had to play with SeaMicro hardware, so I had to dive into the documentation (which I *always* do, RTFM FTW !) and understand how to use their Rest API, but once done, it was a breeze. Ansible by default doesn't provide a native resource for Seamicro, but that's why Rest exists, right ? And thankfully, Ansible has a native URI module, which we use here. The only thing on which I had to spend more time was understanding how to properly construct the body, but declaring it in the yaml file as a variable/list and then converting it on the fly to json (with the magical body='{{ seamicro_reset_body | to_json }}' ) was the way to go, and it is so self-explanatory when read now.
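
For reference, the same reset call that the uri task performs can be reproduced with a plain curl command (chassis name, server id and credentials below are just the example values used in this post) :

curl -k -X POST "https://hufty.ci.centos.org/v2.0/server/0/0" \
     -H "Content-Type: application/json" \
     -d '{"action": "reset", "using-pxe": "true", "username": "admin", "password": "obviously-hidden-and-changed"}'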

And here we go : calling that ansible playbook, and suddenly 128 physical machines were being installed (and reinstalled with different CentOS versions - 5, 6, 7 - and arches i386, x86_64).

Hope this helps if you have to interact with SeaMicro chassis from within an ansible playbook too.

Switching from Ethernet to Infiniband for Gluster access (or why we had to …)

Posted in Arrfab's Blog » Cluster by fabian.arrotin at November 24, 2014 10:37 AM

As explained in my previous (small) blog post, I had to migrate a Gluster setup we have within CentOS.org Infra. As said in that previous blog post too, Gluster is really easy to install, and sometimes it can even "smell" too easy to be true. One thing to keep in mind when dealing with Gluster is that it's a "file-level" storage solution, so don't try to compare it with "block-level" solutions (so typically a NAS vs SAN comparison, even if "SAN" itself is wrong for such a discussion, as SAN is what's *between* your nodes and the storage itself, just a reminder).

Within CentOS.org infra, we have a multi-node Gluster setup that we use for multiple things at the same time. The Gluster volumes are used to store some files, but also to host (on different gluster volumes with different settings/ACLs) KVM virtual-disks (qcow2). People who know me will say : "hey, but for performance reasons, it's faster to just dedicate for example a partition, or a Logical Volume, instead of using qcow2 images sitting on top of a filesystem for Virtual Machines, right ?" and that's true. But with our limited amount of machines, and a need to "move" Virtual Machines without a proper shared storage solution (and because in our setup, those physical nodes *are* both glusterd and hypervisors), Gluster was an easy to use solution to :

It was working, but not that fast ... I then heard about the fact that (obviously) accessing those qcow2 image files through fuse wasn't efficient at all, but that Gluster had libgfapi, which could be used to "talk" directly to the gluster daemons, bypassing completely the need to mount your gluster volumes locally through fuse. Thankfully, qemu-kvm from CentOS 6 is built against libgfapi so it can use that directly (and that's the reason why it's automatically installed when you install the KVM hypervisor components). Results ? Better, but still not what I/we was/were expecting ...

When trying to find the issue, I discussed with some folks in the #gluster irc channel (irc.freenode.net) and suddenly I understood something that is *not* so obvious for Gluster in distributed+replicated mode : people who have dealt with storage solutions at the hardware level (or people using DRBD, which I did too in the past, and that I also liked a lot ..) expect the replication to happen automatically at the storage/server side, but that's not true for Gluster : in fact glusterd just exposes metadata to the gluster clients, which then know where to read/write (being "redirected" to the correct gluster nodes). That means that replication happens at the *client* side : in replicated mode, each client writes the same data twice : once to each server ...

So back to our example : as our nodes have two 1Gb/s Ethernet cards, one being a bridge used by the Virtual Machines and the other one "dedicated" to gluster, and as each node is itself a glusterd/gluster client, I let you think about the max perf we could get : for a write operation, 1Gbit/s divided by two (because of the replication), so ~ 125MB/s / 2 => in theory ~ 62 MB/sec (and then remove tcp/gluster overhead and that drops to ~ 55MB/s)

How to solve that ? Well, I tested that theory and confirmed it directly : in distributed-only mode, write performance was automatically doubled. So yes, running Gluster on Gigabit Ethernet was suddenly the bottleneck. Upgrading to 10Gb Ethernet wasn't something we could do, but, thanks to Justin Clift (and some other Gluster folks), we were able to find some "second hand" Infiniband hardware (10Gbps HCAs and switch).

While Gluster has native/builtin rdma/Infiniband capabilities (see the "transport" option in the "gluster volume create" command), we had in our case to migrate existing Gluster volumes from plain TCP/Ethernet to Infiniband, while trying to keep the downtime as small as possible. That is/was my first experience with Infiniband, but it's not as hard as it seems, especially when you discover IPoIB (IP over Infiniband). So from a Sysadmin POV, it's just "yet another network interface", but a 10Gbps one now :)
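
For completeness, a brand new volume using the native rdma transport would be created with something like the following (volume and brick names are purely illustrative; it's not what we did here, since we migrated existing TCP volumes and used IPoIB instead) :

gluster volume create myvol replica 2 transport tcp,rdma \
  node1.storage.example.org:/bricks/myvol node2.storage.example.org:/bricks/myvol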

The Gluster volume migration then goes like this (schedule a - obvious - downtime for this) :

On all gluster nodes (assuming that we start from machines installed only with @core group, so minimal ones) :

yum groupinstall "Infiniband Support"

chkconfig rdma on

<stop your clients or other  apps accessing gluster volumes, as they will be stopped>

service glusterd stop && chkconfig glusterd off &&  init 0

Then install the hardware in each server, connect all Infiniband cards to the IB switch (previously configured) and power all servers back on. When the machines are back online, you "just" have to configure the ib interfaces. As in my case the machines were "remote nodes" and I hadn't had a look at how they were configured, I had to use some IB tools to see which port was connected (a tool like "ibv_devinfo" showed me which port was active/connected, while "ibdiagnet" shows you the topology and other nodes/devices). In our case it was port 2, so let's create the ifcfg-ib{0,1} devices (ib1 being the one we'll use) :

DEVICE=ib1
TYPE=Infiniband
BOOTPROTO=static
BROADCAST=192.168.123.255
IPADDR=192.168.123.2
NETMASK=255.255.255.0
NETWORK=192.168.123.0
ONBOOT=yes
NM_CONTROLLED=no
CONNECTED_MODE=yes

The interesting part here is the "CONNECTED_MODE=yes" : for people who already use iscsi, you know that Jumbo frames are really important if you have a dedicated VLAN (and if the Ethernet switch supports Jumbo frames too). As stated in the IPoIB kernel doc, you can have two operation modes : datagram (default, 2044 bytes MTU) or Connected (up to 65520 bytes MTU). It's up to you to decide which one to use, but if you understood the Jumbo frames thing for iscsi, you get the point already.

An "ifup ib1" on all nodes will bring the interfaces up and you can verify that everything works by pinging each other node, including with larger mtu values :

ping -s 16384 <other-node-on-the-infiniband-network>

If everything's fine, you can then decide to start gluster, *but* don't forget that gluster uses FQDNs (at least I hope that's how you initially configured your gluster setup : already on a dedicated segment, and using different FQDNs for the storage vlan). You just have to update your local resolver (internal DNS, local hosts files, whatever you want) to be sure that gluster will then use the new IP subnet on the Infiniband network. (If you haven't previously defined different hostnames for your gluster setup, you can "just" update that in the different /var/lib/glusterd/peers/* and /var/lib/glusterd/vols/*/*.vol files.)

Restart the whole gluster stack (on all gluster nodes) and verify that it works fine :

service glusterd start

gluster peer status

gluster volume status

# and if you're happy with the results :

chkconfig glusterd on

So, in a short summary:

  • Infiniband isn't that difficult (especially if you use IPoIB, which has only a very small overhead)
  • Migrating gluster from Ethernet to Infiniband is also easy (especially if you carefully planned your initial design regarding IP subnet/VLAN/segment/DNS resolution for a "transparent" move)

Feature Spotlight - Smart Resource Restart from the Command Line

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 14, 2014 10:17 AM

Restarting a resource can be a complex affair if there are things that depend on that resource or if any of the operations might take a long time.

Stopping a resource is easy, but it can be hard for scripts to determine at what point the target resource has stopped (in order to know when to re-enable it), at what point it is appropriate to give up, and even what resources might have prevented the stop or start phase from completing.

For this reason, I am pleased to report that we will be introducing a --restart option for crm_resource in Pacemaker 1.1.13.

How it works

Assuming the following invocation

crm_resource --restart --resource dummy

The tool will:

  1. Check the current state of the cluster
  2. Set the target-role for dummy to stopped
  3. Calculate the future state of the cluster
  4. Compare the current state to the future state
  5. Work out the list of resources that still need to stop
  6. If there are resources to be stopped
    1. Work out the longest timeout of all stopping resources
    2. Look for changes until the timeout
    3. If nothing changed, indicate which resources failed to stop and exit
    4. Go back to step 4.
  7. Now that everything has stopped, remove the target-role setting for dummy to allow it to start again
  8. Calculate the future state of the cluster
  9. Compare the current state to the future state
  10. Work out the list of resources that still need to start
  11. If there are resources to be started
    1. Work out the longest timeout of all starting resources
    2. Look for changes until the timeout
    3. If nothing changed, indicate which resources failed to start and exit
    4. Go back to step 9.
  12. Done

Considering Clones

crm_resource is also smart enough to restart clone instances running on specific nodes with the optional --node hostname argument. In this scenario instead of setting target-role (which would take down the entire clone), we use the same logic as crm_resource --ban and crm_resource --clear to enable/disable the clone from running on the named host.
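
So, with a hypothetical clone and node name, restarting just the instance on that node looks like:

crm_resource --restart --resource my-clone --node node2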

Want to know more?

Drop by IRC or ask us a question on the Pacemaker mailing list

There is also plenty of documentation available.

Feature Spotlight - Controllable Resource Discovery

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 13, 2014 02:09 AM

Coming in 1.1.13 is a new option for location constraints: resource-discovery

This new option controls whether or not Pacemaker performs resource discovery for the specified resource on nodes covered by the constraint. The default always, preserves the pre-1.1.13 behaviour.

The options are:

  • always - (Default) Always perform resource discovery for the specified resource on this node.

  • never - Never perform resource discovery for the specified resource on this node. This option should generally be used with a -INFINITY score, although that is not strictly required.

  • exclusive - Only perform resource discovery for the specified resource on this node. Multiple location constraints using exclusive discovery for the same resource across different nodes creates a subset of nodes resource-discovery is exclusive to. If a resource is marked for exclusive discovery on one or more nodes, that resource is only allowed to be placed within that subset of nodes.
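
For example, a constraint restricting discovery to nodes carrying a particular node property might look like this with pcs (the resource name and the osprole property are illustrative):

pcs constraint location my-db-clone rule resource-discovery=exclusive score=0 osprole eq database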

Why would I want this?

Limiting resource discovery to a subset of nodes the resource is physically capable of running on can significantly boost performance when a large set of nodes is present. When pacemaker_remote is in use to expand the node count into the 100s of nodes range, this option can have a dramatic effect on the speed of the cluster.

Is using this option ever a bad idea?

Absolutely!

Setting this option to never or exclusive allows the possibility for the resource to be active in those locations without the cluster’s knowledge. This can lead to the resource being active in more than one location!

There are primarily three ways for this to happen:

  1. If the service is started outside the cluster’s control (ie. at boot time by init, systemd, etc; or by an admin)
  2. If the resource-discovery property is changed while part of the cluster is down or suffering split-brain
  3. If the resource-discovery property is changed for a resource/node while the resource is active on that node

When is it safe?

For the most part, it is only appropriate when:

  1. you have more than 8 nodes (including bare metal nodes with pacemaker-remoted), and
  2. there is a way to guarantee that the resource can only run in a particular location (eg. the required software is not installed anywhere else)

Want to know more?

Drop by IRC or ask us a question on the Pacemaker mailing list

There is also plenty of documentation available.

DRBD and SSD: I was made for loving you

Posted in LINBIT Blogs by flip at November 04, 2014 09:29 AM

When DRBD 8.4.4 integrated TRIM/Discard support, a lot of things got much better… for example, 700MB/sec over a 1GBit/sec connection.

As described in the Root-on-DRBD tech guide, my notebook uses DRBD on top of an SSD; apart from the IO speed, the other important thing is the Trim/Discard support.

In practice that means, e.g., that the resync goes much faster: most of the blocks that were written while being off-site have already been discarded again, and so the automatic fstrim can drop the needed amount of data by “up to” 100%.

Result: with a single SSD on one end, 1GBit network connectivity, and thin LVM on top of a 2-harddisk RAID1 on the other end, a resync rate of 700MB/sec!

Here are the log lines, heavily shortened so that they’re readable; starting at 09:39:00:

block drbd9: drbd_sync_handshake:
block drbd9: self 7:4:4:4 bits:15377903 flags:0
block drbd9: peer 4:0:2:2 bits:15358173 flags:4
block drbd9: uuid_compare()=1 by rule 70
block drbd9: Becoming sync source due to disk states.
block drbd9: peer( Unknown -> Sec ) conn( WFRepPar -> WFBitMapS )
block drbd9: send bitmap stats total 122068; compression: 98.4%
block drbd9: receive bitmap stats: total 122068; compression: 98.4%
block drbd9: helper cmd: drbdadm before-resync-src minor9
block drbd9: helper cmd: drbdadm before-resync-src minor9 exit code 0
block drbd9: conn( WFBitMapS -> SyncSource ) 
block drbd9: Began resync as SyncSrc (will sync 58 GB [15382819 bits]).
block drbd9: updated sync UUID 7:4:4:4

At 09:40:27 the resync concludes; the first line is the relevant one:

block drbd9: Resync done (total 87 sec; paused 0 sec; 707256 K/sec)
block drbd9: updated UUIDs 7:0:4:4
block drbd9: conn( SyncSource -> Connected ) pdsk( Inc -> UpToDate ) 

That’s how it’s done 😉

Root-on-DRBD followup: Pre-production staging servers

Posted in LINBIT Blogs by flip at October 16, 2014 12:28 PM

In the “Root-on-DRBD” Tech-Guide we showed how to cleanly get DRBD below the root filesystem, how to use it, and a few advantages and disadvantages. Now, if there’s a complete, live backup of a machine available, a few more use-cases become available; here we want to discuss testing upgrades of production servers.

Everybody knows that upgrading production servers can be risky business. Even for the simplest changes (like upgrading DRBD on a Secondary) things can go wrong. If you have an HA Cluster in place, you can at least avoid a lot of pressure: the active cluster member is still running normally, so you don’t have to hurry the upgrade as if you had only a single production server.

Now, in a perfect world, all changes would have to go through a staging server first, perhaps several times, until all necessary changes are documented and the affected people know exactly what to do. However, that means having a staging server that is as identical to the production machine as possible: exactly the same package versions, using production data during schema changes (helps to assess the DB load [queue your most famous TheDailyWTF article about that here]), and so on.
That’s quite some work.

Well, no, wait, it isn’t that much … if you have a simple process to copy the production server.

That might be fairly easy if the server is virtualized – a few clicks are sufficient; but on physical hardware you will need DRBD to quickly get the staging machine up-to-date after a failed attempt – and that’s exactly what DRBD can give you.

The trick is to “shut down” the machine in a way that leaves the root filesystem unused, resync DRBD from the production server, and then reboot into the freshly updated “installation”.
(Yes, the data volumes will have to be handled in a similar way – but that’s already possible with DRBD 8, and it will get even easier with DRBD 9.)
A sample script showing a basic outline is available in the resync-root branch of the Root-on-DRBD GitHub repository. It should be run on the staging server only.

Please note that this is a barely-tested draft – you’ll need to add quite a few installation-specific things, such as other DRBD resources to resynchronize at the same time, and so on!
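
For orientation only, here is a rough sketch of what such a script boils down to – this is not the script from the repository, the resource name root is made up, and it has to run from an environment where the root filesystem is not mounted (e.g. an initramfs hook or a rescue system):

#!/bin/sh
# Barely-tested sketch: reset the staging server's root device from production.
set -e
drbdadm secondary root                  # the local (staging) copy may be thrown away
drbdadm disconnect root
drbdadm connect --discard-my-data root  # resync the blocks changed during testing from production
until drbdadm dstate root | grep -q '^UpToDate'; do sleep 5; done
reboot                                  # come back up on the freshly copied installation

Thanks to DRBD’s change-tracking bitmap, only the blocks modified since the last sync get transferred, which is what keeps the turnaround short.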

Feedback is very welcome; Pull-requests even more so 😉

Release Candidate: 1.1.12-rc1

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 07, 2014 08:16 PM

As promised, this announcement brings the first release candidate for Pacemaker 1.1.12

https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.12-rc1

This release primarily focuses on important but mostly invisible changes under-the-hood:

  • The CIB is now O(2) faster. That’s 100x for those not familiar with Big-O notation :-)

    This has massively reduced the cluster’s use of system resources, allowing us to scale further on the same hardware, and dramatically reduced failover times for large clusters.

  • Support for ACLs is enabled by default.

    The new implementation can restrict cluster access for containers where pacemaker-remoted is used and is also more efficient.

  • All CIB updates are now serialized and pre-synchronized via the corosync CPG interface. This makes it impossible for updates to be lost, even when the cluster is electing a new DC.

  • Schema versioning changes

    New features are no longer silently added to the schema. Instead the ${Y} in pacemaker-${X}-${Y} will be incremented for simple additions, and ${X} will be bumped for removals or other changes requiring an XSL transformation.

    To take advantage of new features, you will need to update all the nodes and then run the equivalent of cibadmin --upgrade (see the example below).
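
    A minimal illustration of that upgrade step (assuming nothing beyond the cibadmin tool that ships with Pacemaker):

    # cibadmin --upgrade
    # cibadmin --query | grep validate-with

    The first command bumps the CIB to the newest schema it still validates against; the second shows which schema the CIB now declares in its validate-with attribute.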

Thank you to everyone who has tested the new CIB and ACL code already. Please keep those bug reports coming in!

List of known bugs to be investigated during the RC phase:

  • 5206 Fileencoding broken
  • 5194 A resource starts with a standby node. (Latest attrd does not serve as the crmd-transition-delay parameter)
  • 5197 Fail-over is delayed. (State transition is not calculated.)
  • 5139 Each node fenced in its own transition during start-up fencing
  • 5200 target node is over-utilized with allow-migrate=true
  • 5184 Pending probe left in the cib
  • 5165 Add support for transient node utilization attributes

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker

  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep

  3. Build Pacemaker

    # make rc

  4. Copy and deploy as needed

Details

Changesets: 633 Diff: 184 files changed, 12690 insertions(+), 5843 deletions(-)

Highlights

Features added since Pacemaker-1.1.11

  • Changes to the ACL schema to support nodes and unix groups
  • cib: Check ACLs prior to making the update instead of parsing the diff afterwards
  • cib: Default ACL support to on
  • cib: Enable the more efficient xml patchset format
  • cib: Implement zero-copy status update (performance)
  • cib: Send all r/w operations via the cluster connection and have all nodes process them
  • crm_mon: Display brief output if “-b/--brief” is supplied or ‘b’ is toggled
  • crm_ticket: Support multiple modifications for a ticket in an atomic operation
  • Fencing: Add the ability to call stonith_api_time() from stonith_admin
  • logging: daemons always get a log file, unless explicitly set to configured ‘none’
  • PE: Automatically re-unfence a node if the fencing device definition changes
  • pengine: cl#5174 - Allow resource sets and templates for location constraints
  • pengine: Support cib object tags
  • pengine: Support cluster-specific instance attributes based on rules
  • pengine: Support id-ref in nvpair with optional “name”
  • pengine: Support per-resource maintenance mode
  • pengine: Support site-specific instance attributes based on rules
  • tools: Display pending state in crm_mon/crm_resource/crm_simulate if --pending/-j is supplied (cl#5178)
  • xml: Add the ability to have lightweight schema revisions
  • xml: Enable resource sets in location constraints for 1.2 schema
  • xml: Support resources that require unfencing

Changes since Pacemaker-1.1.11

  • acl: Authenticate pacemaker-remote requests with the node name as the client
  • cib: allow setting permanent remote-node attributes
  • cib: Do not disable cib disk writes if on-disk cib is corrupt
  • cib: Ensure ‘cibadmin -R/--replace’ commands get replies
  • cib: Fix remote cib based on TLS
  • cib: Ignore patch failures if we already have their contents
  • cib: Resolve memory leaks in query paths
  • cl#5055: Improved migration support.
  • cluster: Fix segfault on removing a node
  • controld: Do not consider the dlm up until the address list is present
  • controld: handling startup fencing within the controld agent, not the dlm
  • crmd: Ack pending operations that were cancelled due to rsc deletion
  • crmd: Actions can only be executed if their pre-requisites completed successfully
  • crmd: Do not erase the status section for unfenced nodes
  • crmd: Do not overwrite existing node state when fencing completes
  • crmd: Do not start timers for already completed operations
  • crmd: Fenced nodes that return prior to an election do not need to have their status section reset
  • crmd: make lrm_state hash table not case sensitive
  • crmd: make node_state erase correctly
  • crmd: Prevent manual fencing confirmations from attempting to create node entries for unknown nodes
  • crmd: Prevent memory leak in error paths
  • crmd: Prevent memory leak when accepting a new DC
  • crmd: Prevent message relay from attempting to create node entries for unknown nodes
  • crmd: Prevent SIGPIPE when notifying CMAN about fencing operations
  • crmd: Report unsuccessful unfencing operations
  • crm_diff: Allow the generation of xml patchsets without digests
  • crm_mon: Allow the file created by --as-html to be world readable
  • crm_mon: Ensure resource attributes have been unpacked before displaying connectivity data
  • crm_node: Only remove the named resource from the cib
  • crm_node: Prevent use-after-free in tools_remove_node_cache()
  • crm_resource: Gracefully handle -EACCESS when querying the cib
  • fencing: Advertise support for reboot/on/off in the metadata for legacy agents
  • fencing: Automatically switch from ‘list’ to ‘status’ to ‘static-list’ if those actions are not advertised in the metadata
  • fencing: Correctly record which peer performed the fencing operation
  • fencing: default to ‘off’ when agent does not advertise ‘reboot’ in metadata
  • fencing: Execute all required fencing devices regardless of what topology level they are at
  • fencing: Pass the correct options when looking up the history by node name
  • fencing: Update stonith device list only if stonith is enabled
  • get_cluster_type: failing concurrent tool invocations on heartbeat
  • iso8601: Different logic is needed when logging and calculating durations
  • lrmd: Cancel recurring operations before stop action is executed
  • lrmd: Expose logging variables expected by OCF agents
  • lrmd: Merge duplicate recurring monitor operations
  • lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
  • mainloop: Fixes use after free in process monitor code
  • make resource ID case sensitive
  • mcp: Tell systemd not to respawn us if we exit with rc=100
  • pengine: Allow container nodes to migrate with connection resource
  • pengine: cl#5186 - Avoid running rsc on two nodes when node is fenced during migration
  • pengine: cl#5187 - Prevent resources in an anti-colocation from even temporarily running on a same node
  • pengine: Correctly handle origin offsets in the future
  • pengine: Correctly search failcount
  • pengine: Default sequential to TRUE for resource sets for consistency with colocation sets
  • pengine: Delay unfencing until after we know the state of all resources that require unfencing
  • pengine: Do not initiate fencing for unclean nodes when fencing is disabled
  • pengine: Do not unfence nodes that are offline, unclean or shutting down
  • pengine: Fencing devices default to only requiring quorum in order to start
  • pengine: fixes invalid transition caused by clones with more than 10 instances
  • pengine: Force record pending for migrate_to actions
  • pengine: handles edge case where container order constraints are not honored during migration
  • pengine: Ignore failure-timeout only if the failed operation has on-fail=”block”
  • pengine: Log when resources require fencing but fencing is disabled
  • pengine: Memory leaks
  • pengine: Unfencing is based on device probes, there is no need to unfence when normal resources are found active
  • Portability: Use basic types for DBus compatibility struct
  • remote: Allow baremetal remote-node connection resources to migrate
  • remote: Enable migration support for baremetal connection resources by default
  • services: Correctly reset the nice value for lrmd’s children
  • services: Do not allow duplicate recurring op entries
  • services: Do not block synced service executions
  • services: Fixes segfault associated with cancelling in-flight recurring operations.
  • services: Reset the scheduling policy and priority for lrmd’s children without relying on SCHED_RESET_ON_FORK
  • services_action_cancel: Interpret return code from mainloop_child_kill() correctly
  • stonith_admin: Ensure pointers passed to sscanf() are properly initialized
  • stonith_api_time_helper now returns when the most recent fencing operation completed
  • systemd: Prevent use-of-NULL when determining if an agent exists
  • upstart: Allow compilation with glib versions older than 2.28
  • xml: Better move detection logic for xml nodes
  • xml: Check all available schemas when doing upgrades
  • xml: Convert misbehaving #define into a more easily understood inline function
  • xml: If validate-with is missing, we find the most recent schema that accepts it and go from there
  • xml: Update xml validation to allow ‘<node type=remote />’

Release Candidate: 1.1.12-rc1 was originally published by Andrew Beekhof at That Cluster Guy on May 07, 2014.

DRBDmanage installation is now easier!

Posted in LINBIT Blogs by flip at March 21, 2014 05:03 PM

In the last blog post about DRBDmanage we mentioned

Initial setup is a bit involved (see the README)

… with the new release, this is no longer true!

All that’s needed is now one command to initialize a new DRBDmanage control volume:

nodeA# drbdmanage init «local-ip-address»

You are going to initalize a new drbdmanage cluster.
CAUTION! Note that:
  * Any previous drbdmanage cluster information may be removed
  * Any remaining resources managed by a previous drbdmanage
    installation that still exist on this system will no longer
    be managed by drbdmanage

Confirm:

  yes/no:

Acknowledging that question will (still) print a fair bit of data, i.e. the output of the commands that are run in the background; if everything works, you’ll get a freshly initialized DRBDmanage control volume, with the current node already registered.

Well, a single node is boring … let’s add further nodes!

nodeA# drbdmanage new-node «nodeB» «its-ip-address»

Join command for node nodeB:
  drbdmanage join some arguments ....

Now you copy and paste the one command line on the new node:

nodeB# drbdmanage join «arguments as above....»
You are going to join an existing drbdmanage cluster.
CAUTION! Note that:
...

Another yes and enter – and you’re done! Every further node is just one command on the existing cluster, which will give you the command line to use on the to-be-added node.
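
To double-check that the join worked, looking at the control volume itself is usually enough – assuming it keeps its default resource name .drbdctrl, and using the DRBD 9 status command:

nodeB# drbdadm status .drbdctrl

The other node should show up there as a connected, up-to-date peer.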


So, another major point is fixed … there are a few more things to be done, of course, but that was a big step (in the right direction) 😉

Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at March 19, 2014 01:24 PM

It has come to my attention that the potential for data corruption exists in Pacemaker versions 1.1.6 to 1.1.9

Everyone is strongly encouraged to upgrade to 1.1.10 or later.

Those using RHEL 6.4 or later (or a RHEL clone) should already have access to 1.1.10 via the normal update channels.
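
If you are unsure which version a node is running, checking takes a second (the package query is for RPM-based distributions):

# pacemakerd --version
# rpm -q pacemaker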

At issue is some faulty logic in a function called tengine_stonith_notify() which can incorrectly add successfully fenced nodes to a list, causing Pacemaker to subsequently erase that node’s status section when the next DC election occurs.

With the status section erased, the cluster thinks that node is safely down and begins starting any services it has on other nodes - despite those already being active.

In order to trigger the logic, the fenced node must:

  1. have been the previous DC
  2. been sufficiently functional to request its own fencing, and
  3. the fencing notification must arrive after the new DC has been elected, but before it invokes the policy engine

Given that this is the first we have heard of the issue since the problem was introduced in August 2011, the above sequence of events is apparently hard to hit under normal conditions.

Logs symptomatic of the issue look as follows:

# grep -e do_state_transition -e reboot  -e do_dc_takeover -e tengine_stonith_notify -e S_IDLE /var/log/corosync.log

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover: 	Marking gandalf, target of a previous stonith action, as clean
Mar 08 08:43:22 [9934] lorien       crmd:     info: do_state_transition: 	State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 08 08:43:28 [9934] lorien       crmd:     info: do_state_transition: 	State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]

Note in particular the final entry from tengine_stonith_notify():

Target may have been our leader gandalf (recorded: <unset>)

If you see this after Taking over DC status for this partition but prior to State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE, then you are likely to have resources running in more than one location after the next DC election.

The issue was fixed during a routine cleanup prior to Pacemaker-1.1.10 in @f30e1e43. However, the implications of what the old code allowed were not fully appreciated at the time.

Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9 was originally published by Andrew Beekhof at That Cluster Guy on March 19, 2014.

DRBDManage release 0.10

Posted in LINBIT Blogs by flip at February 06, 2014 03:11 PM

As already announced in another blog post, we’re preparing a new tool to simplify DRBD administration. Now we’re publishing its first release! Prior to DRBD Manage, in order to deploy a DRBD resource you’d have to create a config file and copy it to all the necessary nodes.  As The Internet says “ain’t nobody got time for that”.  Using DRBD Manage, all you need to do is execute the following command:

drbdmanage new-volume vol0 4 --deploy 3

Here is what happens on the back-end:

  • drbdmanage chooses three nodes from the available set;
  • creates a 4 GiB LV on each of these nodes;
  • generates DRBD configuration files;
  • writes the DRBD meta-data into the LV;
  • starts the initial sync, and
  • makes the volume on a node Primary so that it can be used right now.

This process takes only a few seconds.
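
For comparison, the manual workflow this replaces meant hand-writing a resource file roughly like the one below and copying it to every node – all names, addresses and option placement here are purely illustrative, not drbdmanage output:

resource vol0 {
    device     /dev/drbd100;
    disk       /dev/drbdpool/vol0;
    meta-disk  internal;

    on nodeA { address 10.1.1.1:7000; node-id 0; }
    on nodeB { address 10.1.1.2:7000; node-id 1; }
    on nodeC { address 10.1.1.3:7000; node-id 2; }

    connection-mesh { hosts nodeA nodeB nodeC; }
}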


Please note that there are some things to take into consideration:

  • drbdmanage is a lot to type; however, an alias dm=drbdmanage in your ~/.*shrc takes care of that 😉
  • Initial setup is a bit involved (see the README)
  • You’ll need at least DRBD 9.0.0-pre7.
  • Since both DRBD Manage and DRBD 9 are still under heavy development, there are more than likely some undiscovered bugs. Bug reports, ideas, wishes, or any other feedback are welcome.

Anyway – head over to the DRBD-Manage homepage and fetch the source tarballs (a few packages are prepared, too), or do a Git checkout if you plan to keep up-to-date. For questions please use the drbd-user mailing list; patches or other development-related topics are welcome on the drbd-dev mailing list.

What do you think? Drop us a note!


DRBD-Manager

Posted in LINBIT Blogs by flip at November 22, 2013 12:41 PM

One of the projects that LINBIT will publish soon is drbdmanage, which allows easy cluster-wide storage administration with DRBD 9.

Every DRBD user knows the drill – create an LV, write a DRBD resource configuration file, create-md, up, initial sync, …

But that is no more.

The new way is this: drbdmanage new-volume r0 50 deploy 4, and here comes your quadruple replicated 50 gigabyte DRBD volume.

This is accomplished by a cluster-wide DRBD volume that holds some drbdmanage data, and a daemon on each node that receives DRBD events from the kernel.

Every time some configuration change is wanted,

  1. drbdmanage writes into the common volume,
  2. causing the other nodes to see the Primary->Secondary events,
  3. so that they know to reload the new configuration,
  4. and act upon it – creating or removing an LV, reconfiguring DRBD, etc.
  5. and, if required, cause an initial sync.

As DRBD 8.4.4 now supports DISCARD/TRIM, the initial sync (on SSDs or thin LVM) is essentially free – a few seconds is all it takes. (See e.g. mkfs.ext4 for a possible user.)
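
As a small, made-up illustration (hypothetical minor number): even a plain mkfs benefits, since mkfs.ext4 discards the device before writing its structures (pass -E nodiscard to opt out), and with 8.4.4 those discards reach the peer’s backing device as well instead of leaving blocks that would otherwise need to be synced:

mkfs.ext4 /dev/drbd0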

Further use cases are various projects that can benefit from a “shared storage” layer – like oVirt, OpenStack, libVirt, etc.
Just imagine using a non-cluster-aware tool like virt-manager to create a new VM, and having the storage automatically synced across multiple nodes…

Interested? You’ll have to wait for a few weeks, but you can always drop us a line.


Announcing 1.1.11 Beta Testing

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 21, 2013 02:00 PM

With over 400 updates since the release of 1.1.10, it’s time to start thinking about a new release.

Today I have tagged release candidate 1. The most notable fixes include:

  • attrd: Implementation of a truly atomic attrd for use with corosync 2.x
  • cib: Allow values to be added/updated and removed in a single update
  • cib: Support XML comments in diffs
  • Core: Allow blackbox logging to be disabled with SIGUSR2
  • crmd: Do not block on proxied calls from pacemaker_remoted
  • crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
  • crmd: Use the load on our peers to know how many jobs to send them
  • crm_mon: add --hide-headers option to hide all headers
  • crm_report: Collect logs directly from journald if available
  • Fencing: On timeout, clean up the agent’s entire process group
  • Fencing: Support agents that need the host to be unfenced at startup
  • ipc: Raise the default buffer size to 128k
  • PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
  • PE: Allow location constraints to take a regex pattern to match against resource IDs
  • pengine: Distinguish between the agent being missing and something the agent needs being missing
  • remote: Properly version the remote connection protocol
  • services: Detect missing agents and permission errors before forking
  • Bug cl#5171 - pengine: Don’t prevent clones from running due to dependent resources
  • Bug cl#5179 - Corosync: Attempt to retrieve a peer’s node name if it is not already known
  • Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

If you are a user of pacemaker_remoted, you should take the time to read about changes to the online wire protocol that are present in this release.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. If you haven’t already, install Pacemaker’s dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy the rpms and deploy as needed

Announcing 1.1.11 Beta Testing was originally published by Andrew Beekhof at That Cluster Guy on November 21, 2013.