Quick links:

LINBIT Blogs: Root-on-DRBD followup: Pre-production staging servers

The Cluster Guy: Release Candidate: 1.1.12-rc1

LINBIT Blogs: DRBDmanage installation is now easier!

The Cluster Guy: Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

LINBIT Blogs: DRBDManage release 0.10

LINBIT Blogs: DRBD-Manager

The Cluster Guy: Announcing 1.1.11 Beta Testing

The Cluster Guy: Pacemaker and RHEL 6.4 (redux)

The Cluster Guy: Changes to the Remote Wire Protocol in 1.1.11

LINBIT Blogs: DRBD and the sync rate controller, part 2

LINBIT Blogs: DRBD Proxy 3.1: Performance improvements

The Cluster Guy: Pacemaker 1.1.10 - final

The Cluster Guy: Release candidate: 1.1.10-rc7

The Cluster Guy: Release candidate: 1.1.10-rc6

The Cluster Guy: GPG Quickstart

The Cluster Guy: Release candidate: 1.1.10-rc5

LINBIT Blogs: “umount is too slow”

Arrfab's Blog » Cluster: Rolling updates with Ansible and Apache reverse proxies

The Cluster Guy: Release candidate: 1.1.10-rc3

The Cluster Guy: Pacemaker Logging

The Cluster Guy: Debugging the Policy Engine

The Cluster Guy: Debugging Pacemaker

The Cluster Guy: Pacemaker on RHEL6.4

The Cluster Guy: Release candidate: 1.1.10-rc2

The Cluster Guy: Mixing Pacemaker versions

The Cluster Guy: Release candidate: 1.1.10-rc1

LINBIT Blogs: DRBD 8.4.3: faster than ever

The Cluster Guy: Now powered by Octopress

The Cluster Guy: Large Cluster Performance: Redux

LINBIT Blogs: Change the cluster distribution without downtime

Root-on-DRBD followup: Pre-production staging servers

Posted in LINBIT Blogs by flip at October 16, 2014 12:28 PM

In the “Root-on-DRBD” Tech Guide we showed how to cleanly get DRBD below the root filesystem, how to use it, and a few advantages and disadvantages. Now, if a complete, live backup of a machine is available, a few more use cases open up; here we want to discuss testing upgrades of production servers.

Everybody knows that upgrading production servers can be risky business. Even for the simplest changes (like upgrading DRBD on a Secondary) things can go wrong. If you have an HA Cluster in place, you can at least avoid a lot of pressure: the active cluster member is still running normally, so you don’t have to hurry the upgrade as if you had only a single production server.

Now, in a perfect world, all changes would have to go through a staging server first, perhaps several times, until all necessary changes are documented and the affected people know exactly what to do. However, that means having a staging server that is as identical to the production machine as possible: exactly the same package versions, using production data during schema changes (which helps to assess the DB load [cue your favourite TheDailyWTF article about that here]), and so on.
That’s quite some work.

Well, no, wait, it isn’t that much … if you have a simple process to copy the production server.

That might be fairly easy if the server is virtualized – a few clicks are sufficient; but on physical hardware you need a way to quickly get the staging machine up to date after a failed attempt – and that’s exactly what DRBD can give you.

The trick is to “shut down” the machine in a way that leaves the root filesystem unused, then resync DRBD from the production server and reboot into the freshly updated “installation”.
(Yes, the data will have to be handled in a similar way – but that’s possible with DRBD 8, and will get even easier with DRBD 9.)
A sample script that shows a basic outline is presented in the resync-root branch of the Root-on-DRBD GitHub repository. It should be run on the staging server only.

Please note that this is a barely-tested draft – you’ll need to add quite a few installation-specific things, like other DRBD resources to resynchronize at the same time, and so on!
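
For orientation, here is a minimal sketch of what such a script might do; the resource name “root” is an assumption, and the real script in the resync-root branch handles considerably more (services, other resources, sanity checks):

# run on the staging server only -- a sketch, not the real script!
# 1. stop everything that still writes to the root filesystem and make
#    sure nothing needs it anymore (e.g. pivot into a tmpfs)
# 2. throw the local changes away and do a full resync from production:
drbdadm invalidate root
# 3. wait until the resync is done, e.g. by watching /proc/drbd
# 4. reboot into the freshly synchronized installation:
reboot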

Feedback is very welcome; Pull-requests even more so ;)

Release Candidate: 1.1.12-rc1

Posted in The Cluster Guy at May 07, 2014 10:16 AM

As promised, this announcement brings the first release candidate for Pacemaker 1.1.12

https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.12-rc1

This release primarily focuses on important but mostly invisible changes under-the-hood:

  • The CIB is now O(2) faster. That’s 100x for those not familiar with Big-O notation :-)

    This has massively reduced the cluster’s use of system resources, allowing us to scale further on the same hardware, and dramatically reduced failover times for large clusters.

  • Support for ACLs is enabled by default.

    The new implementation can restrict cluster access for containers where pacemaker-remoted is used and is also more efficient.

  • All CIB updates are now serialized and pre-synchronized via the corosync CPG interface. This makes it impossible for updates to be lost, even when the cluster is electing a new DC.

  • Schema versioning changes

    New features are no longer silently added to the schema. Instead the ${Y} in pacemaker-${X}-${Y} will be incremented for simple additions, and ${X} will be bumped for removals or other changes requiring an XSL transformation.

    To take advantage of new features, you will need to update all the nodes and then run the equivalent of cibadmin --upgrade (see the example below).
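
    A minimal sketch of that sequence, assuming all nodes have already been updated:

      # show which schema the configuration currently validates against
      cibadmin --query | grep validate-with
      # rewrite the configuration against the newest schema that accepts it
      cibadmin --upgrade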

Thank you to everyone who has tested out the new CIB and ACL code already. Please keep those bug reports coming in!

List of known bugs to be investigated during the RC phase:

  • 5206 Fileencoding broken
  • 5194 A resource starts with a standby node. (Latest attrd does not serve as the crmd-transition-delay parameter)
  • 5197 Fail-over is delayed. (State transition is not calculated.)
  • 5139 Each node fenced in its own transition during start-up fencing
  • 5200 target node is over-utilized with allow-migrate=true
  • 5184 Pending probe left in the cib
  • 5165 Add support for transient node utilization attributes

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker

  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep

  3. Build Pacemaker

    # make rc

  4. Copy and deploy as needed

Details

Changesets: 633
Diff: 184 files changed, 12690 insertions(+), 5843 deletions(-)

Highlights

Features added since Pacemaker-1.1.11

  • Changes to the ACL schema to support nodes and unix groups
  • cib: Check ACLs prior to making the update instead of parsing the diff afterwards
  • cib: Default ACL support to on
  • cib: Enable the more efficient xml patchset format
  • cib: Implement zero-copy status update (performance)
  • cib: Send all r/w operations via the cluster connection and have all nodes process them
  • crm_mon: Display brief output if “-b/--brief” is supplied or ‘b’ is toggled
  • crm_ticket: Support multiple modifications for a ticket in an atomic operation
  • Fencing: Add the ability to call stonith_api_time() from stonith_admin
  • logging: daemons always get a log file, unless explicitly configured to ‘none’
  • PE: Automatically re-unfence a node if the fencing device definition changes
  • pengine: cl#5174 - Allow resource sets and templates for location constraints
  • pengine: Support cib object tags
  • pengine: Support cluster-specific instance attributes based on rules
  • pengine: Support id-ref in nvpair with optional “name”
  • pengine: Support per-resource maintenance mode
  • pengine: Support site-specific instance attributes based on rules
  • tools: Display pending state in crm_mon/crm_resource/crm_simulate if --pending/-j is supplied (cl#5178)
  • xml: Add the ability to have lightweight schema revisions
  • xml: Enable resource sets in location constraints for 1.2 schema
  • xml: Support resources that require unfencing

Changes since Pacemaker-1.1.11

  • acl: Authenticate pacemaker-remote requests with the node name as the client
  • cib: allow setting permanent remote-node attributes
  • cib: Do not disable cib disk writes if on-disk cib is corrupt
  • cib: Ensure ‘cibadmin -R/--replace’ commands get replies
  • cib: Fix remote cib based on TLS
  • cib: Ignore patch failures if we already have their contents
  • cib: Resolve memory leaks in query paths
  • cl#5055: Improved migration support.
  • cluster: Fix segfault on removing a node
  • controld: Do not consider the dlm up until the address list is present
  • controld: handling startup fencing within the controld agent, not the dlm
  • crmd: Ack pending operations that were cancelled due to rsc deletion
  • crmd: Actions can only be executed if their prerequisites completed successfully
  • crmd: Do not erase the status section for unfenced nodes
  • crmd: Do not overwrite existing node state when fencing completes
  • crmd: Do not start timers for already completed operations
  • crmd: Fenced nodes that return prior to an election do not need to have their status section reset
  • crmd: make lrm_state hash table not case sensitive
  • crmd: make node_state erase correctly
  • crmd: Prevent manual fencing confirmations from attempting to create node entries for unknown nodes
  • crmd: Prevent memory leak in error paths
  • crmd: Prevent memory leak when accepting a new DC
  • crmd: Prevent message relay from attempting to create node entries for unknown nodes
  • crmd: Prevent SIGPIPE when notifying CMAN about fencing operations
  • crmd: Report unsuccessful unfencing operations
  • crm_diff: Allow the generation of xml patchsets without digests
  • crm_mon: Allow the file created by –as-html to be world readable
  • crm_mon: Ensure resource attributes have been unpacked before displaying connectivity data
  • crm_node: Only remove the named resource from the cib
  • crm_node: Prevent use-after-free in tools_remove_node_cache()
  • crm_resource: Gracefully handle -EACCESS when querying the cib
  • fencing: Advertise support for reboot/on/off in the metadata for legacy agents
  • fencing: Automatically switch from ‘list’ to ‘status’ to ‘static-list’ if those actions are not advertised in the metadata
  • fencing: Correctly record which peer performed the fencing operation
  • fencing: default to ‘off’ when agent does not advertise ‘reboot’ in metadata
  • fencing: Execute all required fencing devices regardless of what topology level they are at
  • fencing: Pass the correct options when looking up the history by node name
  • fencing: Update stonith device list only if stonith is enabled
  • get_cluster_type: failing concurrent tool invocations on heartbeat
  • iso8601: Different logic is needed when logging and calculating durations
  • lrmd: Cancel recurring operations before stop action is executed
  • lrmd: Expose logging variables expected by OCF agents
  • lrmd: Merge duplicate recurring monitor operations
  • lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
  • mainloop: Fixes use after free in process monitor code
  • make resource ID case sensitive
  • mcp: Tell systemd not to respawn us if we exit with rc=100
  • pengine: Allow container nodes to migrate with connection resource
  • pengine: cl#5186 - Avoid running rsc on two nodes when node is fenced during migration
  • pengine: cl#5187 - Prevent resources in an anti-colocation from even temporarily running on a same node
  • pengine: Correctly handle origin offsets in the future
  • pengine: Correctly search failcount
  • pengine: Default sequential to TRUE for resource sets for consistency with colocation sets
  • pengine: Delay unfencing until after we know the state of all resources that require unfencing
  • pengine: Do not initiate fencing for unclean nodes when fencing is disabled
  • pengine: Do not unfence nodes that are offline, unclean or shutting down
  • pengine: Fencing devices default to only requiring quorum in order to start
  • pengine: fixes invalid transition caused by clones with more than 10 instances
  • pengine: Force record pending for migrate_to actions
  • pengine: handles edge case where container order constraints are not honored during migration
  • pengine: Ignore failure-timeout only if the failed operation has on-fail=”block”
  • pengine: Log when resources require fencing but fencing is disabled
  • pengine: Memory leaks
  • pengine: Unfencing is based on device probes, there is no need to unfence when normal resources are found active
  • Portability: Use basic types for DBus compatibility struct
  • remote: Allow baremetal remote-node connection resources to migrate
  • remote: Enable migration support for baremetal connection resources by default
  • services: Correctly reset the nice value for lrmd’s children
  • services: Do not allow duplicate recurring op entries
  • services: Do not block synced service executions
  • services: Fixes segfault associated with cancelling in-flight recurring operations.
  • services: Reset the scheduling policy and priority for lrmd’s children without relying on SCHED_RESET_ON_FORK
  • services_action_cancel: Interpret return code from mainloop_child_kill() correctly
  • stonith_admin: Ensure pointers passed to sscanf() are properly initialized
  • stonith_api_time_helper now returns when the most recent fencing operation completed
  • systemd: Prevent use-of-NULL when determining if an agent exists
  • upstart: Allow compilation with glib versions older than 2.28
  • xml: Better move detection logic for xml nodes
  • xml: Check all available schemas when doing upgrades
  • xml: Convert misbehaving #define into a more easily understood inline function
  • xml: If validate-with is missing, we find the most recent schema that accepts it and go from there
  • xml: Update xml validation to allow ’’

DRBDmanage installation is now easier!

Posted in LINBIT Blogs by flip at March 21, 2014 05:03 PM

In the last blog post about DRBDmanage we mentioned

Initial setup is a bit involved (see the README)

… with the new release, this is no longer true!

All that’s needed is now one command to initialize a new DRBDmanage control volume:

nodeA# drbdmanage init «local-ip-address»

You are going to initalize a new drbdmanage cluster.
CAUTION! Note that:
  * Any previous drbdmanage cluster information may be removed
  * Any remaining resources managed by a previous drbdmanage
    installation that still exist on this system will no longer
    be managed by drbdmanage

Confirm:

  yes/no:

Acknowledging that question will (still) print a fair bit of data, i.e. the output of the commands that are run in the background; if everything works, you’ll get a freshly initialized DRBDmanage control volume, with the current node already registered.

Well, a single node is boring … let’s add further nodes!

nodeA# drbdmanage new-node «nodeB» «its-ip-address»

Join command for node nodeB:
  drbdmanage join some arguments ....

Now you copy and paste the one command line on the new node:

nodeB# drbdmanage join «arguments as above....»
You are going to join an existing drbdmanage cluster.
CAUTION! Note that:
...

Another yes and enter – and you’re done! Every further node is just one command on the existing cluster, which will give you the command line to use on the to-be-added node.


So, another major point is fixed … there are a few more things to be done, of course, but that was a big step (in the right direction) ;)

Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

Posted in The Cluster Guy at March 19, 2014 02:24 AM

It has come to my attention that the potential for data corruption exists in Pacemaker versions 1.1.6 to 1.1.9

Everyone is strongly encouraged to upgrade to 1.1.10 or later.

Those using RHEL 6.4 or later (or a RHEL clone) should already have access to 1.1.10 via the normal update channels.
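
A quick way to check what a node is actually running (assuming an RPM-based distribution; package naming may differ elsewhere):

rpm -q pacemaker    # anything older than pacemaker-1.1.10 should be updated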

At issue is some faulty logic in a function called tengine_stonith_notify() which can incorrectly add successfully fenced nodes to a list, causing Pacemaker to subsequently erase that node’s status section when the next DC election occurs.

With the status section erased, the cluster thinks that node is safely down and begins starting any services it has on other nodes - despite those already being active.

In order to trigger the logic, the fenced node must:

  1. have been the previous DC
  2. been sufficiently functional to request its own fencing, and
  3. the fencing notification must arrive after the new DC has been elected, but before it invokes the policy engine

Given that this is the first we have heard of the issue since the problem was introduced in August 2011, the above sequence of events is apparently hard to hit under normal conditions.

Logs symptomatic of the issue look as follows:

# grep -e do_state_transition -e reboot  -e do_dc_takeover -e tengine_stonith_notify -e S_IDLE /var/log/corosync.log

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover:     Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover:     Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover:     Marking gandalf, target of a previous stonith action, as clean
Mar 08 08:43:22 [9934] lorien       crmd:     info: do_state_transition:    State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 08 08:43:28 [9934] lorien       crmd:     info: do_state_transition:    State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]

Note in particular the final entry from tengine_stonith_notify():

Target may have been our leader gandalf (recorded: <unset>)

If you see this after Taking over DC status for this partition but prior to State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE, then you are likely to have resources running in more than one location after the next DC election.

The issue was fixed during a routine cleanup prior to Pacemaker-1.1.10 in @f30e1e43. However, the implications of what the old code allowed were not fully appreciated at the time.

DRBDManage release 0.10

Posted in LINBIT Blogs by flip at February 06, 2014 03:11 PM

As already announced in another blog post, we’re preparing a new tool to simplify DRBD administration. Now we’re publishing its first release! Prior to DRBD Manage, in order to deploy a DRBD resource you’d have to create a config file and copy it to all the necessary nodes.  As The Internet says “ain’t nobody got time for that”.  Using DRBD Manage, all you need to do is execute the following command:

drbdmanage new-volume vol0 4 --deploy 3

Here is what happens on the back-end:

  • It chooses three nodes from the available set;
  • creates a 4 GiB LV on each of these nodes;
  • generates the DRBD configuration files;
  • writes the DRBD meta-data into the LVs;
  • starts the initial sync; and
  • makes the volume Primary on one node so that it can be used right away.

This process takes only a few seconds.
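
For comparison, the manual procedure this single command replaces looks roughly like the following sketch (the volume group name is made up for the example):

# on each of the three nodes:
lvcreate --name vol0 --size 4G vg0
# ... write /etc/drbd.d/vol0.res and copy it to all three nodes ...
drbdadm create-md vol0
drbdadm up vol0
# on exactly one node, to start the initial sync:
drbdadm primary --force vol0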


Please note that there are some things to take into consideration:

  • drbdmanage is a lot to type; however, an alias dm=drbdmanage in your ~/.*shrc takes care of that ;)
  • Initial setup is a bit involved (see the README)
  • You’ll need at least DRBD 9.0.0-pre7.
  • Since both DRBD Manage and DRBD 9 are still under heavy development, there are more than likely some undiscovered bugs. Bug reports, ideas, wishes, or any other feedback are welcome.

Anyway – head over to the DRBD-Manage homepage and fetch your source tarballs (a few packages are prepared, too), or a Git checkout if you plan to keep up-to-date. For questions please use the drbd-user mailing list; patches or other development-related topics are welcome on the drbd-dev mailing list.

What do you think? Drop us a note!


DRBD-Manager

Posted in LINBIT Blogs by flip at November 22, 2013 12:41 PM

One of the projects that LINBIT will publish soon is drbdmanage, which allows easy cluster-wide storage administration with DRBD 9.

Every DRBD user knows the drill – create an LV, write a DRBD resource configuration file, create-md, up, initial sync, …

But that is no more.

The new way is this: drbdmanage new-volume r0 50 deploy 4, and here comes your quadruple replicated 50 gigabyte DRBD volume.

This is accomplished by a cluster-wide DRBD volume that holds some drbdmanage data, and a daemon on each node that receives DRBD events from the kernel.

Every time some configuration change is wanted,

  1. drbdmanage writes into the common volume,
  2. causing the other nodes to see the Primary/Secondary role change events,
  3. so that they know to reload the new configuration,
  4. and act upon it – creating or removing an LV, reconfiguring DRBD, etc.
  5. and, if required, cause an initial sync.

As DRBD 8.4.4 now supports DISCARD/TRIM, the initial sync (on SSDs or thin LVM) is essentially free – a few seconds is all it takes. (See e.g. mkfs.ext4 for a possible user.)
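
As an illustration of that last point (the device name is assumed, and the behaviour depends on the backing storage supporting discards):

# mkfs.ext4 discards the whole device first where supported -- the
# "possible user" mentioned above; on thin LVM or SSD backends this is
# what makes the initial sync essentially free.
mkfs.ext4 /dev/drbd0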

Further use cases are various projects that can benefit from a “shared storage” layer – like oVirt, OpenStack, libvirt, etc.
Just imagine using a non-cluster-aware tool like virt-manager to create a new VM, and the storage gets automatically sync’ed across multiple nodes…

Interested? You’ll have to wait for a few weeks, but you can always drop us a line.


Announcing 1.1.11 Beta Testing

Posted in The Cluster Guy at November 21, 2013 03:00 AM

With over 400 updates since the release of 1.1.10, it’s time to start thinking about a new release.

Today I have tagged release candidate 1. The most notable fixes include:

  • attrd: Implementation of a truly atomic attrd for use with corosync 2.x
  • cib: Allow values to be added/updated and removed in a single update
  • cib: Support XML comments in diffs
  • Core: Allow blackbox logging to be disabled with SIGUSR2
  • crmd: Do not block on proxied calls from pacemaker_remoted
  • crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
  • crmd: Use the load on our peers to know how many jobs to send them
  • crm_mon: add --hide-headers option to hide all headers
  • crm_report: Collect logs directly from journald if available
  • Fencing: On timeout, clean up the agent’s entire process group
  • Fencing: Support agents that need the host to be unfenced at startup
  • ipc: Raise the default buffer size to 128k
  • PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
  • PE: Allow location constraints to take a regex pattern to match against resource IDs
  • pengine: Distinguish between the agent being missing and something the agent needs being missing
  • remote: Properly version the remote connection protocol
  • services: Detect missing agents and permission errors before forking
  • Bug cl#5171 - pengine: Don’t prevent clones from running due to dependent resources
  • Bug cl#5179 - Corosync: Attempt to retrieve a peer’s node name if it is not already known
  • Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

If you are a user of pacemaker_remoted, you should take the time to read about changes to the online wire protocol that are present in this release.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. If you haven’t already, install Pacemaker’s dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy the rpms and deploy as needed

Pacemaker and RHEL 6.4 (redux)

Posted in The Cluster Guy at November 03, 2013 10:33 PM

The good news is that as of November 1st, Pacemaker is now supported on RHEL 6.4 - with two caveats.

  1. You must be using the updated pacemaker, resource-agents and pcs packages
  2. You must be using CMAN for membership and quorum (background)

Technically, support is currently limited to Pacemaker’s use in the context of OpenStack. In practice however, any bug that can be shown to affect OpenStack deployments has a good chance of being fixed.

Since a cluster with no services is rather pointless, the heartbeat OCF agents are now also officially supported. However, as Red Hat’s policy is to only ship supported agents, some agents are not present for this initial release.

The three primary reasons for not shipping agents were:

  1. The software the agent controls is not shipped in RHEL
  2. Insufficient experience to provide support
  3. Avoiding agent duplication

Filing bugs is definitely the best way to get agents in the second category prioritized for inclusion.

Likewise, if there is no shipping agent that provides the functionality of agents in the third category (IPv6addr and IPaddr2 might be an example here), filing bugs is the best way to get that fixed.

In the meantime, since most of the agents are just shell scripts, downloading the latest upstream agents is a viable work-around in most cases. For example:

    agents="Raid1 Xen"
    for a in $agents; do wget -O /usr/lib/ocf/resource.d/heartbeat/$a https://github.com/ClusterLabs/resource-agents/raw/master/heartbeat/$a; done
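
Depending on how the files end up on disk, the downloaded agents may not be executable; marking them executable afterwards is a sensible extra step (not part of the original example):

    for a in $agents; do chmod 755 /usr/lib/ocf/resource.d/heartbeat/$a; done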

Changes to the Remote Wire Protocol in 1.1.11

Posted in The Cluster Guy at October 31, 2013 01:04 AM

Unfortunately the current wire protocol used by pacemaker_remoted for exchanging messages was found to be suboptimal and we have taken the decision to change it now before it becomes widely adopted.

We attempted to do this in a backwards compatible manner; however, the two methods we tried were either overly complicated and fragile, or not possible due to the way the released crm_remote_parse_buffer() function operated.

The changes include a versioned binary header that contains the size of the header, payload and total message, control flags and a big/little-endian detector.

These changes will appear in the upstream repo shortly and ship in 1.1.11. Anyone for whom this will be a problem is encouraged to get in contact to discuss possible options.

For RHEL users, any version on which pacemaker_remoted is supported will have the new versioned protocol. That means 7.0 and potentially a future 6.x release.

DRBD and the sync rate controller, part 2

Posted in LINBIT Blogs by flip at October 29, 2013 09:36 AM

As an update to the earlier blog post, take a look below.

As a reminder: this is about resynchronization (i.e. recovery after a node or network problem), not about replication.


If you’ve got a demanding application it’s possible that it completely fills your I/O bandwidth, disk and/or network, leaving no room for the synchronization to complete. To make the synchronization slow down and let the application proceed, DRBD has the dynamically adaptive resync rate controller.

It is enabled by default with 8.4, and disabled by default with 8.3.
To explicitly enable or disable, set c-plan-ahead to 20 (enable) or 0 (disable).

Note that, while enabled, the setting for the old fixed sync rate is used only as initial guess for the controller. After that, only the c-* settings are used, so changing the fixed sync rate while the controller is enabled won’t have much effect.

What it does

The resync controller tries to use up as much network and disk bandwidth as it can get, but no more than c-max-rate, and throttles if either

  • more resync requests are in flight than what amounts to c-fill-target
  • it detects application IO (read or write), and the current estimated resync rate is above c-min-rate.

The default c-min-rate with 8.4.x is 250 kiB/sec (the old default of the fixed sync rate); with 8.3.x it was 4 MiB/sec.

This “throttle if application IO is detected” is active even if the fixed sync rate is used. You can (but should not, see below) disable this specific throttling by setting c-min-rate to 0.

Tuning the resync controller

It’s hard, or next to impossible, for DRBD to detect how much activity your backend can handle. But it is very easy for DRBD to know how much resync-activity it causes itself.
So, you tune how much resync-activity you allow during periods of application activity.

To do that you should

  • set c-plan-ahead to 20 (default with 8.4), or more if there’s a lot of latency on the connection (WAN link with protocol A);
  • leave the fixed resync rate (the initial guess for the controller) at about 30% or less of what your hardware can handle;
  • set c-max-rate to 100% (or slightly more) of what your hardware can handle;
  • set c-fill-target to the minimum (just as high as necessary) that gets your hardware saturated, if the system is otherwise idle.
    In other words, figure out the maximum possible resync rate in your setup while the system is idle, then set c-fill-target to the minimum setting that still reaches that rate.
  • And finally, while checking application request latency/responsiveness, tune c-min-rate to the maximum that still allows for acceptable responsiveness (see the example below).
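
As a sketch only – the numbers depend entirely on your hardware and replication link, and the resource name “r0” is a placeholder – the resulting settings can be applied at runtime with something like:

drbdadm disk-options --c-plan-ahead=20 --c-max-rate=300M \
    --c-fill-target=1M --c-min-rate=30M r0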

Most parts of this post were originally published as an ML post by Lars.
Additional information can also be found in the drbd.conf manpage.


DRBD Proxy 3.1: Performance improvements

Posted in LINBIT Blogs by flip at October 14, 2013 06:35 AM

The threading model in DRBD Proxy 3.1 received a complete overhaul; below you can see the performance implications of these changes. First of all, since the old model suffered from the conflicting demands of low latency for the meta-data connections vs. high bandwidth for the data connections, a second set of pthreads has been added. The first set runs at the (normally negative) nice level the DRBD Proxy process is started at, while the second set, in order to be “nicer” to the other processes, adds +10 to the nice level and therefore gets a smaller share of the CPU time.

Secondly, the internal processing has been changed, too. This isn’t visible externally, of course – you can only notice the performance improvements.

[Graph: DRBD Proxy 3.1 buffer usage]

In the example graph above a few sections can be clearly seen:

  • From 0 to about 11.5 seconds the Proxy buffer gets filled. In case anyone’s interested, here’s the dd output:
    3712983040 Bytes (3.7 GB) copied, 11.4573 s, 324 MB/s
  • Up to ~44 seconds, lzma compression is active, with a single context. Slow, but it compresses best.
  • Then I switched to zlib; this is a fair bit faster. All cores are being used, so external requests (by some VMs and other processes) show up as irregular spikes. (Different compression ratios for various input data are “at fault”, too.)
  • At 56 seconds the compression is turned off completely; the time needed for the rest of the data (3GiB in about 13 seconds) shows the bonded-ethernet bandwidth of about 220MB/sec.

For two sets of test machines a plausible rate for transferring large blocks into the Proxy buffers is 450-500 MiB/sec.
For small buffers there are a few code paths that are not fully optimized yet; further improvements are to be expected in the next versions, too.

The roadmap for the near future includes a shared memory pool for all connections and WAN bandwidth shaping (i.e. limiting it to some configured value) — and some more ideas that have to be researched first.

Opinions? Contact us!


Pacemaker 1.1.10 - final

Posted in The Cluster Guy at July 26, 2013 12:11 AM

Announcing the release of Pacemaker 1.1.10

There were three changes of note since rc7:

  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cman: Do not pretend we know the state of nodes we’ve never seen

Along with assorted bug fixes, the major topics for this release were:

  • stonithd fixes
  • fixing memory leaks, often caused by incorrect use of glib reference counting
  • supportability improvements (code cleanup and deduplication, standardized error codes)

Release candidates for the next Pacemaker release (1.1.11) can be expected some time around November.

A big thank you to everyone who spent time testing the release candidates and/or contributed patches. However, now that Pacemaker is perfect, anyone reporting bugs will be shot :-)

To build rpm packages:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make release
    
  4. Copy and deploy as needed

Details - 1.1.10 - final

Changesets  602
Diff 143 files changed, 8162 insertions(+), 5159 deletions(-)

Highlights

Features added since Pacemaker-1.1.9

  • Core: Convert all exit codes to positive errno values
  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Allow options to be set recursively
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove); see the example below
  • crm_resource: Support OCF tracing when using --force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • PE: Suppress meaningless IDs when displaying anonymous clone status
  • Turn off auto-respawning of systemd services when the cluster starts them
  • Bug cl#5128 - pengine: Support maintenance mode for a single node
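
A short sketch of the new --ban/--clear workflow (resource and node names are placeholders, and option spellings may vary slightly between builds):

# keep my-rsc away from node2 ...
crm_resource --resource my-rsc --ban --node node2
# ... and later remove the constraint that --ban created
crm_resource --resource my-rsc --clear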

Changes since Pacemaker-1.1.9

  • crmd: cib: stonithd: Memory leaks resolved and improved use of glib reference counting
  • attrd: Fixes deleted attributes during dc election
  • Bug cf#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5148 - legacy: Correctly remove a node that used to have a different nodeid
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Bug cl#5152 - crmd: Correctly clean up fenced nodes during membership changes
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • Bug cl#5155 - pengine: Block the stop of resources if any depending resource is unmanaged
  • Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • Bug cl#5164 - crmd: Fixes crash when using pacemaker-remote
  • Bug cl#5164 - pengine: Fixes segfault when calculating transition with remote-nodes.
  • Bug cl#5167 - crm_mon: Only print “stopped” node list for incomplete clone sets
  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cib: Restore the ability to embed comments in the configuration
  • cluster: Detect and warn about node names with capitals
  • cman: Do not pretend we know the state of nodes we’ve never seen
  • cman: Do not unconditionally start cman if it is already running
  • cman: Support non-blocking CPG calls
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Ensure removed peers are erased from all caches
  • corosync: Nodes that can persist in sending CPG messages must be alive afterall
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crmd: Store last-run and last-rc-change for all operations
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Restore the ability to manually confirm that fencing completed
  • ipc: Allow unprivileged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: Support “crm_node --remove” with a node name for corosync plugin (bnc#805278)
  • lrmd: Default to the upstream location for resource agent scratch directory
  • lrmd: Pass errors from lsb metadata generation back to the caller
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (ie. promote for a clone)
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd
  • systemd: Reload systemd after adding/removing override files for cluster services
  • xml: Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy

Release candidate: 1.1.10-rc7

Posted in The Cluster Guy at July 22, 2013 12:50 AM

Announcing the seventh release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes to the policy engine, fencing daemon and crmd. We’ve squashed a bug involving constructing compressed messages and stonith-ng can now recover when a configuration ordering change is detected.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc7

Changesets  57
Diff 37 files changed, 414 insertions(+), 331 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc7

  • N/A

Changes since Pacemaker-1.1.10-rc6

  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • Bug cl#5164 - crmd: Fixes crmd crash when using pacemaker-remote
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cluster: Correctly construct the header for compressed messages
  • cluster: Detect and warn about node names with capitals
  • Core: remove the mainloop_trigger that are no longer needed.
  • corosync: Ensure removed peers are erased from all caches
  • cpg: Correctly free sent messages
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crm_mon: Bug cl#5167 - Only print “stopped” node list for incomplete clone sets
  • crm_node: Return 0 if --remove passed
  • fencing: Correctly detect existing device entries when registering a new one
  • lrmd: Prevent use-of-NULL in client library
  • pengine: cl5164 - Fixes pengine segfault when calculating transition with remote-nodes.
  • pengine: Do the right thing when admins specify the internal resource instead of the clone
  • pengine: Re-allow ordering constraints with fencing devices now that it is safe to do so

Release candidate: 1.1.10-rc6

Posted in The Cluster Guy at July 04, 2013 06:46 AM

Announcing the sixth release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes in the policy engine, fencing daemon and crmd. Previous fixes in rc5 have also now been confirmed.

Help is specifically requested for testing plugin-based clusters, ACLs, the --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

There is one bug open for David’s remote nodes feature (involving managing services on non-cluster nodes), but everything else seems good.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc6

Changesets  63
Diff 24 files changed, 356 insertions(+), 133 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc6

  • tools: crm_mon --neg-location drbd-fence-by-handler
  • pengine: cl#5128 - Support maintenance mode for a single node

Changes since Pacemaker-1.1.10-rc5

  • cluster: Correctly remove duplicate peer entries
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • pengine: Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (ie. promote for a clone)
  • pengine: Do the right thing when admins specify the internal resource instead of the clone

GPG Quickstart

Posted in The Cluster Guy at June 24, 2013 04:40 AM

It seemed timely that I should refresh both my GPG knowledge and my keys. I am summarizing my method (and sources) below in the event that they may prove useful to others:

Preparation

The following settings ensure that any keys you create in the future are strong ones by 2013’s standards. Paste the following into ~/.gnupg/gpg.conf:

# when multiple digests are supported by all recipients, choose the strongest one:
personal-digest-preferences SHA512 SHA384 SHA256 SHA224
# preferences chosen for new keys should prioritize stronger algorithms: 
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 BZIP2 ZLIB ZIP Uncompressed
# when making an OpenPGP certification, use a stronger digest than the default SHA1:
cert-digest-algo SHA512

The next batch of settings are optional but aim to improve the output of gpg commands in various ways - particularly against spoofing. Again, paste them into ~/.gnupg/gpg.conf:

# when outputting certificates, view user IDs distinctly from keys:
fixed-list-mode
# long keyids are more collision-resistant than short keyids (it's trivial to make a key with any desired short keyid)
keyid-format 0xlong
# If you use a graphical environment (and even if you don't) you should be using an agent:
# (similar arguments as  https://www.debian-administration.org/users/dkg/weblog/64)
use-agent
# You should always know at a glance which User IDs gpg thinks are legitimately bound to the keys in your keyring:
verify-options show-uid-validity
list-options show-uid-validity
# include an unambiguous indicator of which key made a signature:
# (see http://thread.gmane.org/gmane.mail.notmuch.general/3721/focus=7234)
sig-notation issuer-fpr@notations.openpgp.fifthhorseman.net=%g

Create a New Key

There are several checks for deciding if your old key(s) are any good. However, if you created a key more than a couple of years ago, then realistically you probably need a new one.

I followed instructions from Ana Guerrero’s post, which were the basis of the current debian guide, but selected the 2013 default key type:

  1. run gpg --gen-key
  2. Select (1) RSA and RSA (default)
  3. Select a keysize greater than 2048
  4. Set a key expiration of 2-5 years. [rationale]
  5. Do NOT specify a comment for User ID. [rationale]

Add Additional UIDs and Setting a Default

At this point my keyring gpg --list-keys looked like this:

pub   4096R/0x726724204C644D83 2013-06-24
uid                 [ultimate] Andrew Beekhof <andrew@beekhof.net>
sub   4096R/0xC88100891A418A6B 2013-06-24 [expires: 2015-06-24]

Like most people, I have more than one email address and I will want to use GPG with them too. So now is the time to add them to the key. You’ll want the gpg --edit-key command for this. Ana has a good example of adding UIDs and setting a preferred one. Just search her instructions for Add other UID.

Separate Subkeys for Encryption and Signing

The general consensus is that separate keys should be used for signing versus encryption.

tl;dr - you want to be able to encrypt things without signing them as “signing” may have unintended legal implications. There is also the possibility that signed messages can be used in an attack against encrypted data.

By default gpg will create a subkey for encryption, but I followed Debian’s subkey guide for creating one for signing too (instead of using the private master key).

Doing this allows you to make your private master key even safer by removing it from your day-to-day keychain.

The idea is to make a copy first and keep it in an even more secure location, so that if a subkey (or the machine its on) gets compromised, your master key remains safe and you are always in a position to revoke subkeys and create new ones.
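
A rough sketch of that last step, adapted from the Debian subkey guide (the key ID is the new key from above; file names are placeholders, and the guide itself has more detail):

gpg --export-secret-keys 0x726724204C644D83 > master-and-subkeys.gpg   # full backup, store somewhere very safe
gpg --export-secret-subkeys 0x726724204C644D83 > subkeys.gpg
gpg --delete-secret-keys 0x726724204C644D83
gpg --import subkeys.gpg
# "gpg --list-secret-keys" should now show "sec#" for the master key,
# indicating its secret part is no longer in the day-to-day keyring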

Sign the New Key with the Old One

If you have an old key, you should sign the new one with it. This tells everyone who trusted the old key that the new one is legitimate and can therefore also be trusted.

Here I went back to Ana’s instructions. Basically:

gpg --default-key OLDKEY --sign-key NEWKEY

or, in my case:

gpg --default-key 0xEC3584EFD449E59A --sign-key 0x726724204C644D83

Send it to a Key Server

Tell the world so they can verify your signature and send you encrypted messages:

gpg --send-key 0x726724204C644D83

Revoking Old UIDs

If you’re like me, your old key might have some addresses which you have left behind. You can’t remove addresses from your keys, but you can tell the world to stop using them.

To do this for my old key, I followed instructions on the gnupg mailing list

Everything still looks the same when you search for my old key:

pub  1024D/D449E59A 2007-07-20 Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@suse.de>
                               Andrew Beekhof <beekhof@gmail.com>
                               Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <abeekhof@novell.com>
     Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

But if you click through to the key details, you’ll see the addresses associated with my time at Novell/SuSE now show revok in red.

pub  1024D/D449E59A 2007-07-20            
     Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

uid Andrew Beekhof <beekhof@mac.com>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]

uid Andrew Beekhof <abeekhof@suse.de>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]
sig revok  D449E59A 2013-06-24 __________ __________ [selfsig]
...

This is how other people’s copy of gpg knows not to use this key for that address anymore. And also why it’s important to refresh your keys periodically.

Revoking Old Keys

Realistically though, you probably don’t want people using old and potentially compromised (or compromise-able) keys to send you sensitive messages. The best thing to do is revoke the entire key.

Since keys can’t be removed once you’ve uploaded them, you’re actually updating the existing entry. To do this you need the original private key - so keep it safe!

Some people advise you to pre-generate the revocation key - personally that seems like just one more thing to keep track of.

Orphaned keys that can’t be revoked still appear valid to anyone wanting to send you a secure message - a good reason to set an expiry date as a failsafe!

This is what one of my old revoked keys looks like:

pub  1024D/DABA170E 2004-10-11 *** KEY REVOKED *** [not verified]
                               Andrew Beekhof (SuSE VPN Access) <andrew@beekhof.net>
     Fingerprint=9A53 9DBB CF73 AB8F B57B  730A 3279 4AE9 DABA 170E 

Final Result

My new key:

pub  4096R/4C644D83 2013-06-24 Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@redhat.com>
     Fingerprint=C503 7BA2 D013 6342 44C0  122C 7267 2420 4C64 4D83 

Closing word

I am by no means an expert at this, so I would be very grateful to hear about any mistakes I may have made above.

Release candidate: 1.1.10-rc5

Posted in The Cluster Guy at June 19, 2013 12:32 AM

Let’s try this again… Announcing the fourth-and-a-half release candidate for Pacemaker 1.1.10

I previously tagged rc4 but ended up making several changes shortly afterwards, so it was pointless to announce it.

This RC is a result of cleanup work in several ancient areas of the codebase:

  • A number of internal membership caches have been combined
  • The three separate CPG code paths have been combined

As well as:

  • Moving clones is now both possible and sane
  • Improved behavior on systemd based nodes
  • and other assorted bugfixes (see below)

Please keep the bug reports coming in!

Help is specifically requested for testing plugin-based clusters, ACLs, the new --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

Also any light that can be shed on possible memory leaks would be much appreciated.

If everything looks good in a week from now, I will re-tag rc5 as final.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc5

Changesets  168
Diff 96 files changed, 4983 insertions(+), 3097 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc5

  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Implement –ban for moving resources away from nodes and –clear (replaces –unmove)
  • crm_resource: Support OCF tracing when using –force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • Turn off auto-respawning of systemd services when the cluster starts them

Changes since Pacemaker-1.1.10-rc3

  • Bug pengine: cl#5155 - Block the stop of resources if any depending resource is unmanaged
  • Convert all exit codes to positive errno values
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Nodes that can persist in sending CPG messages must be alive afterall
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Everyone who gets a fencing notification should mark the node as down
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Update the status section with details of nodes for which we only know the nodeid
  • crm_report: Find logs in compressed files
  • logging: If SIGTRAP is sent before tracing is turned on, turn it on
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd

“umount is too slow”

Posted in LINBIT Blogs by flip at May 27, 2013 07:31 AM

A question we see over and over again is

Why is umount so slow? Why does it take so long?

Part of the answer was already given in an earlier blog post; here’s some more explanation.

The write() syscall typically writes into RAM only. In Linux we call that “page cache“, or “buffer cache“, depending on what exactly the actual target of the write() system call was.

From that RAM (cache inside the operating system, high in the IO stack) the operating system periodically does write-outs, at its leisure, unless it is urged to write out particular pieces (or all of it) now.

A sync (or fsync(), or fdatasync(), or …) does exactly that: it urges the operating system to do the write out.
A umount also causes a write out of all not yet written data of the affected file system.

Note:

  • Of course the “performance” of writes that go into volatile RAM only will be much better than anything that goes to stable, persistent storage. All things that have only been written to cache but not yet synced (written out to the block layer) will be lost if you have a power outage or server crash.
    The Linux block layer has never seen these changes, DRBD has never seen these changes, so they cannot possibly be replicated anywhere.
    Data will be lost.

There are also controller caches which may or may not be volatile, and disk caches, which typically are volatile. These are below and outside the operating system, and not part of this discussion. Just make sure you disable all volatile caches on that level.

Now, for a moment, assume

  • you don’t have DRBD in the stack, and
  • a moderately capable IO backend that writes, say, 300 MByte/s, and
  • around 3 GiByte of dirty data around at the time you trigger the umount, and
  • you are not seek-bound, so your backend can actually reach that 300 MB/s,

you get a umount time of around 10 seconds.


Still with me?

Ok. Now, introduce DRBD to your IO stack, and add a long distance replication link. Just for the sake of me trying to explain it here, assume that because it is long distance and you have a limited budget, you can only afford 100 MBit/s. And “long distance” implies larger round trip times, so let’s assume we have an RTT of 100 ms.

Of course that would introduce a single IO request latency of > 100 ms for anything but DRBD protocol A, so you opt for protocol A. (In other words, using protocol A “masks” the RTT of the replication link from the application-visible latency.)

That was latency.

But the limited bandwidth of that replication link also limits your average sustained write throughput, in the given example to about 11 MiByte/s.
The same 3 GByte of dirty data would now drain much more slowly; in fact, that same umount would now take not 10 seconds, but about 5 minutes.

You can also take a look at a drbd-user mailing list post.


So, concluding: try to avoid having much unsaved data in RAM; it might bite you. For example, you want your cluster to do a switchover, but the umount takes too long and a timeout hits: the node gets (or at least should get) fenced, and the data not written to stable storage will be lost.

Please follow the advice about setting some sysctls to start write-out earlier!
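
As a rough example of such sysctls (the values are only a starting point and need to be tuned for your setup):

# start background write-out earlier and cap the amount of dirty data
sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
sysctl -w vm.dirty_bytes=$((192*1024*1024))
# persist the settings via /etc/sysctl.conf (or /etc/sysctl.d/) as appropriate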

Rolling updates with Ansible and Apache reverse proxies

Posted in Arrfab's Blog » Cluster by fabian.arrotin at May 23, 2013 04:36 PM

It's not a secret anymore that I use Ansible to do a lot of things. That goes from simple "one shot" actions with ansible on multiple nodes to "configuration management and deployment tasks" with ansible-playbook. One of the things I also really like about Ansible is that it's also a great orchestration tool.

For example, in some WSOA flows you can have a bunch of servers behind load balancer nodes. When you want to put a backend node/web server node in maintenance mode (to change configuration/update package/update app/whatever), you just "remove" that node from the production flow, do what you need to do, verify it's up again and put that node back in production. The principle of "rolling updates" is then interesting as you still have 24/7 flows in production.

But what if you're not in charge of the whole infrastructure? Say you're in charge of some servers, but not of the load balancers in front of them. Let's consider the following situation, and how we can still use Ansible to disable/enable a backend server behind Apache reverse proxies.

So here is the (simplified) situation: two Apache reverse proxies (using the mod_proxy_balancer module) load balance traffic to four backend nodes (JBoss in our simplified case). We can't directly touch those upstream Apache nodes, but we can still interact with them, thanks to the fact that "balancer manager support" is active (and protected!).

Let's have a look at a (simplified) Ansible inventory file:

[jboss-cluster]
jboss-1
jboss-2
jboss-3
jboss-4

[apache-group-1]
apache-node-1
apache-node-2

Let's now create a generic (write once, use many times) task to disable a backend node in Apache:

---
##############################################################################
#
# This task can be included in a playbook to pause a backend node
# being load balanced by Apache Reverse Proxies
# Several variables need to be defined :
#   - ${apache_rp_backend_url} : the URL of the backend server, as known by Apache server
#   - ${apache_rp_backend_cluster} : the name of the cluster as defined on the Apache RP (the group the node is member of)
#   - ${apache_rp_group} : the name of the group declared in hosts.cfg containing Apache Reverse Proxies
#   - ${apache_rp_user}: the username used to authenticate against the Apache balancer-manager
#   - ${apache_rp_password}: the password used to authenticate against the Apache balancer-manager
#   - ${apache_rp_balancer_manager_uri}: the URI where to find the balancer-manager Apache mod
#
##############################################################################
- name: Disabling the worker in Apache Reverse Proxies
  local_action: shell /usr/bin/curl -k --user ${apache_rp_user}:${apache_rp_password} "https://${item}/${apache_rp_balancer_manager_uri}?b=${apache_rp_backend_cluster}&w=${apache_rp_backend_url}&nonce=$(curl -k --user ${apache_rp_user}:${apache_rp_password} https://${item}/${apache_rp_balancer_manager_uri} |grep nonce|tail -n 1|cut -f 3 -d '&'|cut -f 2 -d '='|cut -f 1 -d '"')&dw=Disable"
  with_items: ${groups.${apache_rp_group}}

- name: Waiting 20 seconds to be sure no traffic is being sent anymore to that worker backend node
  pause: seconds=20

The interesting bit is with_items: it uses the apache_rp_group variable to know which Apache servers sit upstream (assuming you can have multiple nodes/clusters) and runs that command for every host in the list obtained from the inventory!

We can now, in the "rolling-updates" playbook, just include the previous task (assuming we saved it as ../tasks/apache-disable-worker.yml):

---
- hosts: jboss-cluster
  serial: 1
  user: root
  tasks:
    - include: ../tasks/apache-disable-worker.yml
    - etc/etc ...
    - wait_for: port=8443 state=started
    - include: ../tasks/apache-enable-worker.yml

But wait! As you've seen, we still need to declare some variables: let's do that in the inventory, under group_vars and host_vars!

group_vars/jboss-cluster:

# Apache reverse proxies settings
apache_rp_group: apache-group-1
apache_rp_user: my-admin-account
apache_rp_password: my-beautiful-pass
apache_rp_balancer_manager_uri: balancer-manager-hidden-and-redirected

host_vars/jboss-1:

apache_rp_backend_url: 'https://jboss1.myinternal.domain.org:8443'
apache_rp_backend_cluster: nameofmyclusterdefinedinapache

Now when we use that playbook, we'll have a local action that interacts with the balancer manager to disable that backend node while we do maintenance.

I'll let you imagine (and create) a ../tasks/apache-enable-worker.yml file to enable it again (which you'll call at the end of your playbook).
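
If you want a starting point, a minimal sketch of that enable task could simply mirror the disable task above, flipping the balancer-manager action from Disable to Enable (do verify the exact parameter names against your Apache version's balancer-manager form):

---
- name: Enabling the worker in Apache Reverse Proxies
  local_action: shell /usr/bin/curl -k --user ${apache_rp_user}:${apache_rp_password} "https://${item}/${apache_rp_balancer_manager_uri}?b=${apache_rp_backend_cluster}&w=${apache_rp_backend_url}&nonce=$(curl -k --user ${apache_rp_user}:${apache_rp_password} https://${item}/${apache_rp_balancer_manager_uri} |grep nonce|tail -n 1|cut -f 3 -d '&'|cut -f 2 -d '='|cut -f 1 -d '"')&dw=Enable"
  with_items: ${groups.${apache_rp_group}}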

Release candidate: 1.1.10-rc3

Posted in The Cluster Guy at May 23, 2013 12:32 AM

Announcing the third release candidate for Pacemaker 1.1.10

This RC is a result of work in several problem areas reported by users, some of which date back to 1.1.8:

  • manual fencing confirmations
  • potential problems reported by Coverity
  • the way anonymous clones are displayed
  • handling of resource output that includes non-printing characters
  • handling of on-fail=block

Please keep the bug reports coming in. There is a good chance that this will be the final release candidate and 1.1.10 will be tagged on May 30th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc3

Changesets  116
Diff 59 files changed, 707 insertions(+), 408 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc3

  • PE: Display a list of nodes on which stopped anonymous clones are not active instead of meaningless clone IDs
  • PE: Suppress meaningless IDs when displaying anonymous clone status

Changes since Pacemaker-1.1.10-rc2

  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • cib: CID#1023858 - Explicit null dereferenced
  • cib: CID#1023862 - Improper use of negative value
  • cib: CID#739562 - Improper use of negative value
  • cman: Our daemons have no need to connect to pacemakerd in a cman based cluster
  • crmd: Do not record pending delete operations in the CIB
  • crmd: Ensure pending and lost actions have values for last-run and last-rc-change
  • crmd: Insert async failures so that they appear in the correct order
  • crmd: Store last-run and last-rc-change for fail operations
  • Detect child processes that terminate before our SIGCHLD handler is installed
  • fencing: CID#739461 - Double close
  • fencing: Correctly broadcast manual fencing ACKs
  • fencing: Correctly mark manual confirmations as complete
  • fencing: Do not send duplicate replies for manual confirmation operations
  • fencing: Restore the ability to manually confirm that fencing completed
  • lrmd: CID#1023851 - Truncated stdio return value
  • lrmd: Don’t complain when heartbeat invokes us with -r
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • xml: Restore the ability to embed comments in the cib

Pacemaker Logging

Posted in The Cluster Guy at May 20, 2013 03:43 AM

Normal operation

Pacemaker inherits most of its logging settings from either CMAN or Corosync - depending on which it is running on top of.

In order to avoid spamming syslog, Pacemaker only logs a summary of its actions (NOTICE and above) to syslog.

If the level of detail in syslog is insufficient, you should enable a cluster log file. Normally one is configured by default and it contains everything except debug and trace messages.

To find the location of this file, either examine your CMAN (cluster.conf) or Corosync (corosync.conf) configuration file or look for syslog entries such as:

pacemakerd[1823]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log

If you do not see a line like this, either update the cluster configuration or set PCMK_debugfile in /etc/sysconfig/pacemaker.

crm_report also knows how to find all the Pacemaker-related logs and blackbox files.

If the level of detail in the cluster log file is still insufficient, or you simply wish to go blind, you can turn on debugging in Corosync/CMAN, or set PCMK_debug in /etc/sysconfig/pacemaker.

A minor advantage of setting PCMK_debug is that the value can be a comma-separated list of processes which should produce debug logging instead of a global yes/no.
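
As an illustration only (file location and daemon names as on RHEL/Fedora-style installations), /etc/sysconfig/pacemaker might then contain:

# detailed logs go to a dedicated cluster log file
PCMK_debugfile=/var/log/cluster/corosync.log
# debug logging only for selected daemons instead of a global yes
PCMK_debug=crmd,pengine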

When an ERROR occurs

Pacemaker includes support for a blackbox.

When enabled, the blackbox contains a rolling buffer of all logs (not just those sent to syslog or a file) and is written to disk after a crash or assertion failure.

The blackbox recorder can be enabled by setting PCMK_blackbox in /etc/sysconfig/pacemaker or at runtime by sending SIGUSR1. Eg.

killall -USR1 crmd

When enabled you’ll see a log such as:

crmd[1811]:   notice: crm_enable_blackbox: Initiated blackbox recorder: /var/lib/pacemaker/blackbox/crmd-1811

If a crash occurs, the blackbox will be available at that location. To extract the contents, pass it to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811

Which produces output like:

Dumping the contents of /var/lib/pacemaker/blackbox/crmd-1811
[debug] shm size:5242880; real_size:5242880; rb->word_size:1310720
[debug] read total of: 5242892
Ringbuffer:
 ->NORMAL
 ->write_pt [5588]
 ->read_pt [0]
 ->size [1310720 words]
 =>free [5220524 bytes]
 =>used [22352 bytes]
...
trace   May 19 23:20:55 gio_read_socket(368):0: 0x11ab920.5 1 (ref=1)
trace   May 19 23:20:55 pcmk_ipc_accept(458):0: Connection 0x11aee00
info    May 19 23:20:55 crm_client_new(302):0: Connecting 0x11aee00 for uid=0 gid=0 pid=24425 id=0e943a2a-dd64-49bc-b9d5-10fa6c6cb1bd
debug   May 19 23:20:55 handle_new_connection(465):2147483648: IPC credentials authenticated (24414-24425-14)
...
[debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header

When an ERROR occurs you’ll also see the function and line number that produced it such as:

crmd[1811]: Problem detected at child_death_dispatch:872 (mainloop.c), please see /var/lib/pacemaker/blackbox/crmd-1811.1 for additional details
crmd[1811]: Problem detected at main:94 (crmd.c), please see /var/lib/pacemaker/blackbox/crmd-1811.2 for additional details

Again, simply pass the files to qb-blackbox to extract and query the contents.

Note that a counter is added to the end to avoid name collisions.

Diving into files and functions

In case you have not already guessed, all logs include the name of the function that generated them. So:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

came from the function crm_update_peer_state().

To obtain more detail from that or any other function, you can set PCMK_trace_functions in /etc/sysconfig/pacemaker to a comma separated list of function names. Eg.

PCMK_trace_functions=crm_update_peer_state,run_graph

For a bigger stick, you may also activate trace logging for all the functions in a particular source file or files by setting PCMK_trace_files as well.

PCMK_trace_files=cluster.c,election.c

These additional logs are sent to the cluster log file. Note that enabling tracing options also alters the output format.

Instead of:

crmd:  notice: crm_cluster_connect:     Connecting to cluster infrastructure: cman

the output includes file and line information:

crmd: (   cluster.c:215   )  notice: crm_cluster_connect:   Connecting to cluster infrastructure: cman

But wait there’s still more

Still need more detail? You’re in luck! The blackbox can be dumped at any time, not just when an error occurs.

First, make sure the blackbox is active (we’ll assume it's the crmd that needs to be debugged):

killall -USR1 crmd

Next, discard any previous contents by dumping them to disk:

killall -TRAP crmd

Now cause whatever condition you’re trying to debug, and send -TRAP again when you’re ready to see the result.

killall -TRAP crmd

You can now look for the result in syslog:

grep -e crm_write_blackbox: /var/log/messages

This will include a filename containing the trace logging:

crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.1 for contents
crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.2 for contents

To extract the trace logging for our test, pass the most recent file to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811.2

At this point you’ll probably want to use grep :)

Debugging the Policy Engine

Posted in The Cluster Guy at May 20, 2013 01:49 AM

Finding the right node

The Policy Engine is the component that takes the cluster’s current state, decides on the optimal next state and produces an ordered list of actions to achieve that state.

You can get a summary of what the cluster did in response to resource failures and nodes joining/leaving the cluster by looking at the logs from pengine:

grep -e pengine\\[ -e pengine: /var/log/messages

Although the pengine process is active on all cluster nodes, it is only doing work on one of them. The “active” instance is chosen through the crmd’s DC election process and may move around as nodes leave/join the cluster.

If you do not see anything from pengine at the time the problem occurs, continue to the next machine.

If you do not see anything from pengine on any node, check that logging to syslog is enabled in your cluster configuration, and check the syslog configuration to see where messages are being sent. If in doubt, refer to Pacemaker Logging.

Once you have located the correct node to investigate, the first thing to do is look for the terms ERROR and WARN, eg.

grep -e pengine\\[ -e pengine: /var/log/messages | grep -e ERROR -e WARN

This will highlight any problems the software encountered.

Next expand the query to all pengine logs:

grep -e pengine\\[ -e pengine: /var/log/messages

The output will look a little like:

pengine[6132]:   notice: LogActions: Move    mysql  (Started corosync-host-1 -> corosync-host-4)
pengine[6132]:   notice: LogActions: Start   www    (corosync-host-6)
pengine[6132]:   notice: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-4424.bz2
pengine[6132]:   notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-4425.bz2

In the above logs, transition 7 resulted in mysql being moved and www being started. Later, transition 8 occurred but everything was where it should be and no action was required.

Other notable entries include:

pengine[6132]:  warning: cluster_status: We do not have quorum - fencing and resource management disabled
pengine[6132]:   notice: stage6: Scheduling Node corosync-host-1 for shutdown
pengine[6132]:  warning: stage6: Scheduling Node corosync-host-8 for STONITH

as well as

pengine[6132]:   notice: LogActions: Start   Fencing      (corosync-host-1 - blocked)

which indicates that the cluster would like to start the Fencing resource, but some dependency is not satisfied.

pengine[6132]:  warning: determine_online_status: Node corosync-host-8 is unclean

which indicates that either corosync-host-8 has failed, or a resource on it has failed to stop when requested.

pengine[6132]:  warning: unpack_rsc_op: Processing failed op monitor for www on corosync-host-4: unknown error (1)

which indicates a health check for the www resource failed with a return code of 1 (aka. OCF_ERR_GENERIC). See Pacemaker Explained for more details on OCF return codes.
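
If you want to dig into why the agent returned that code, you can usually run it by hand with the same parameters the cluster uses. A rough sketch, assuming www is an ocf:heartbeat:apache resource (the agent path and the configfile parameter are only examples; substitute your own agent and its parameters):

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_configfile=/etc/httpd/conf/httpd.conf
/usr/lib/ocf/resource.d/heartbeat/apache monitor; echo "rc=$?"

A return code of 0 means OCF_SUCCESS, 7 means OCF_NOT_RUNNING, and 1 is the generic error seen above.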

  • Is there anything from the Policy Engine at about the time of the problem?
    If not, go back to the crmd logs and see why no recovery was attempted.

  • Did pengine log why something happened? Does that sound correct?
    Excellent, thanks for playing.

Getting more detail from the Policy Engine

The job performed by the Policy Engine is complex and runs frequently, so to avoid filling up the disk with logs it only records what it is doing and rarely the reason why. Normally the why can be found in the crmd logs, but the Policy Engine also saves the current state (the cluster configuration and the state of all resources) to disk for situations when it can’t.

These files can later be replayed using crm_simulate with a higher level of verbosity to diagnose issues and, as part of our regression suite, to make sure they stay fixed afterwards.

Finding these state files is a matter of looking for logs such as

crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
pengine[1810]:   notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-473.bz2

The “correct” entry will depend on the context of your query.

Please note, sometimes events occur while the pengine is performing its calculation. In this situation, the calculation logged by process_pe_message() is discarded and a new one performed. As a result, not all transitions/files listed by the pengine process are executed by the crmd.

After obtaining the file named by run_graph() or process_pe_message(), either directly or from a crm_report archive, pass it to crm_simulate which will display its view of the cluster at that time:

crm_simulate --xml-file ./pe-input-473.bz2

  • Does the cluster state look correct?

    If not, file a bug. It is possible we have misparsed the state of the resources, and any calculation we make based on this would therefore also be wrong.

Next, see what recovery actions the cluster thinks need to be performed:

crm_simulate --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --run

In addition to the normal output, this command creates:

  • problem.graph, the ordered graph of actions, their parameters and prerequisites
  • problem.dot, a more human readable version of the same graph focussed on the action ordering.

Open problem.dot in dotty or graphviz to obtain a graphical representation (a command-line example follows the list below):

  • Arrows indicate ordering dependencies
  • Dashed-arrows indicate dependencies that are not present in the transition graph
  • Actions with a dashed border of any color do not form part of the transition graph
  • Actions with a green border form part of the transition graph
  • Actions with a red border are ones the cluster would like to execute but cannot run
  • Actions with a blue border are ones the cluster does not feel need to be executed
  • Actions with orange text are pseudo/pretend actions that the cluster uses to simplify the graph
  • Actions with black text are sent to the lrmd
  • Resource actions have text of the form ${rsc}_${action}_${interval} ${node}
  • Actions of the form ${rsc}_monitor_0 ${node} are the cluster’s way of finding out the resource’s status before we try to start it anywhere
  • Any action depending on an action with a red border will not be able to execute.
  • Loops are really bad. Please report them to the development team.
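
If you prefer the command line over dotty, the standard graphviz tools can render the file as well (the output format is up to you), for example:

dot -Tsvg problem.dot -o problem.svg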

Check the relative ordering of actions:

  • Are there any extra ones?
    Do they need to be removed from the configuration?
    Are they implied by the group construct?
  • Are there any missing?
    Are they specified in the configuration?

You can obtain excruciating levels of detail by adding additional -V options to the crm_simulate command line.

Now see what the cluster thinks the “next state” will look like:

crm_simulate --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --simulate

  • Does the new cluster state look correct based on the input and actions performed?
    If not, file a bug.

Debugging Pacemaker

Posted in The Cluster Guy at May 19, 2013 11:08 PM

Where to start

The first thing to do is look in syslog for the terms ERROR and WARN, eg.

grep -e ERROR -e WARN /var/log/messages

If nothing looks appropriate, find the logs from crmd:

grep -e crmd\\[ -e crmd: /var/log/messages

If you do not see anything from crmd, check that logging to syslog is enabled in your cluster configuration, and check the syslog configuration to see where messages are being sent. If in doubt, refer to Pacemaker Logging for how to obtain more detail.

Although the crmd process is active on all cluster nodes, decisions are only made on one of them. The “DC” is chosen through the crmd’s election process and may move around as nodes leave/join the cluster.

For node failures, you’ll always want the logs from the DC (or the node that becomes the DC).
For resource failures, you’ll want the logs from the DC and the node on which the resource failed.

Log entries like:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

indicate a node is no longer part of the cluster (either because it failed or was shut down)

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now member (was lost)

indicates a node has (re)joined the cluster

crmd[1811]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE ...
crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
crmd[1811]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE ...

indicates recovery was attempted

crmd[1811]:   notice: te_rsc_command: Initiating action 36: monitor www_monitor_0 on corosync-host-5
crmd[1811]:   notice: te_rsc_command: Initiating action 54: monitor mysql_monitor_10000 on corosync-host-4

indicates we performed a resource action, in this case we are checking the status of the www resource on corosync-host-5 and starting a recurring health check for mysql on corosync-host-4.

crmd[1811]:   notice: te_fence_node: Executing reboot fencing operation (83) on corosync-host-8 (timeout=60000)

indicates that we are attempting to fence corosync-host-8.

crmd[1811]:   notice: tengine_stonith_notify: Peer corosync-host-8 was terminated (st_notify_fence) by corosync-host-1 for corosync-host-1: OK

indicates that corosync-host-1 successfully fenced corosync-host-8.

Node-level failures

  • Did the crmd fail to notice the failure?

    If you do not see any entries from crm_update_peer_state(), check the corosync logs to see if membership was correct/timely

  • Did the crmd fail to initiate recovery?

    If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.

  • Did the crmd fail to perform recovery?

    If you DO see entries from do_state_transition() but the run_graph() entry(ies) include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything.

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Was fencing attempted?

    Check if the stonith-enabled property is set to true/1/yes, if so obtain file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Did fencing complete?

    Check that fencing resources are configured and, if so, proceed to Debugging Stonith.

Resource-level failures

  • Did the resource actually fail?

    If not, check for logs matching the resource name to see why the resource agent thought a failure occurred.

    Check the resource agent source to see what code paths could have produced those logs (or the lack of them)

  • Did crmd notice the resource failure?

    If not, check for logs matching the resource name to see if the resource agent noticed.

    Check that a recurring monitor was configured.

  • Did the crmd fail to initiate recovery?

    If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.

  • Did the crmd fail to perform recovery?

    If you DO see entries from do_state_transition() but the run_graph() entry(ies) include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything.

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Did resources stop/start/move unexpectedly or fail to stop/start/move when expected?

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

Pacemaker on RHEL6.4

Posted in The Cluster Guy at May 03, 2013 10:09 PM

Over the last couple of years, we have been evolving the stack in two ways of particular relevance to RHEL customers:

  • minimizing the differences to the default RHEL-6 stack, to reduce the implications of supporting Pacemaker there (making it more likely to happen)

  • adapting to changes to Corosync’s direction (ie. the removal of plugins and the addition of a quorum API) for the future

As a general rule, Red Hat does not ship packages it doesn’t at least plan to support. So part of readying Pacemaker for “supported” status is removing or deprecating the parts of Red Hat’s packages that they have no interest and/or capacity to support.

Removal of crmsh

For reasons that you may or may not agree with, Red Hat has decided to rely on pcs for command line and GUI cluster management in RHEL-7.

As a result there is no future, in RHEL, for the original cluster shell crmsh.

Normally it would have been deprecated. However, since crmsh is now a stand-alone project, its removal from the Pacemaker codebase also resulted in its removal from RHEL-6 once the packages were refreshed.

To fill the void and help prepare people for RHEL-7, pcs is now also available on RHEL-6.

Status of the Plugin

Anyone taking the recommended approach of using Pacemaker with CMAN (ie. cluster.conf) on RHEL-6 or any of its derivatives can stop reading for now (we’ll need to talk again when RHEL 7 comes out with corosync-2, but that’s another conversation).

Anyone using corosync.conf on RHEL 6 should keep reading…

One of the differences between the Pacemaker and rgmanager stacks is where membership and quorum come from.

Pacemaker has traditionally obtained it from a custom plugin, whereas rgmanager used CMAN. Neither source is “better” than the other; the only thing that matters is that everyone obtains it from the same place.

Since the rest of the components in a RHEL-6 cluster use CMAN, support for it was added to Pacemaker which also helps minimize the support load. Additionally, in RHEL-7, Corosync’s support for plugins such as Pacemaker’s (and CMAN’s) goes away.

Without any chance of being supported in the short or long-term, configuring plugin-based clusters (ie. via corosync.conf) is now officially deprecated in RHEL. As some of you may have already noticed, starting corosync in 6.4 produces the following entries in the logs:

Apr 23 17:35:36 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Apr 23 17:35:36 corosync [pcmk  ] ERROR: process_ais_conf:  Please see Chapter 8 of 'Clusters from Scratch' (http://clusterlabs.org/doc) for details on using Pacemaker with CMAN

Everyone is highly encouraged to switch to CMAN-based Pacemaker clusters

While the plugin will still be around, running Pacemaker in configurations that are not well tested by Red Hat (or, for the most part, by upstream either) contains an element of risk.

For example, the messages above were originally added for 6.3, however since logging from the plugin was broken for over a year, no-one noticed. It only got fixed when I was trying to figure out why no-one had complained about them yet!

A lack of logging is annoying but not usually problematic; unfortunately there is also something far worse…

Fencing Failures when using the Pacemaker plugin

It has come to light that fencing for plugin-based clusters is critically broken.

The cause was a single stray ‘n’-character, probably from a copy+paste, that prevents the crmd from correctly reacting to membership-level failures (ie. killall -9 corosync) of its peers.

The problem did not show up in any of Red Hat’s testing because of the way Pacemaker processes talk to their peers on other nodes when CMAN (or Corosync 2.0) is in use.

For CMAN and Corosync 2.0 we use Corosync’s CPG API which provides notifications when peer processes (the crmd in this case) join or leave the messaging group. These additional notifications from CPG follow a different code path and are unaffected by the bug… allowing the cluster to function as intended.

Unfortunately, despite the size and obviousness of the fix, a z-stream update for a bug affecting a deprecated use-case of an as-yet-unsupported package is a total non-starter.

People wanting to stick with plugin-based clusters should obtain 1.1.9 or later from the Clusterlabs repos, which include the fix.

You can read more about the bug and the fix on the Red Hat bugzilla.

For details on converting to a CMAN-based stack, please see Clusters from Scratch.

Switching to CMAN is really far less painful than it sounds

There is also a quickstart guide for easily generating cluster.conf; just substitute the names of your cluster nodes.
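
To give an idea of what that ends up looking like, here is a rough sketch of such a cluster.conf for a two-node cluster that hands all fencing off to Pacemaker via fence_pcmk (cluster and node names are placeholders; follow the quickstart guide or Clusters from Scratch for the authoritative version):

<cluster name="mycluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>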

Release candidate: 1.1.10-rc2

Posted in The Cluster Guy at May 03, 2013 02:12 AM

Announcing the second release candidate for Pacemaker 1.1.10

No major changes have been introduced, just some fixes for a few niggles that were discovered since RC1.

Unless blocker bugs are found, this will be the final release candidate and 1.1.10 will be tagged on May 10th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    On Fedora/RHEL and its derivatives, you can do this by running:

    # yum install -y yum-utils
    # make yumdep
    

    Otherwise you will need to investigate the spec file and/or wait for rpmbuild to report missing packages.

  3. Build Pacemaker

    # make rpm
    
  4. Copy and deploy as needed

Details - 1.1.10-rc2

Changesets  31
Diff 30 files changed, 687 insertions(+), 138 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc2

N/A

Changes since Pacemaker-1.1.10-rc1

  • Bug cl#5152 - Correctly clean up fenced nodes during membership changes
  • Bug cl#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • cman: Skip cman_pre_stop in the init script if fenced is not running
  • Core: Ensure the last field in transition keys is 36 characters
  • crm_mon: Check if a process can be daemonized before forking so the parent can report an error
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • crm_resource: Allow --cleanup without a resource name
  • init: Unless specified otherwise, assume cman is in use if cluster.conf exists
  • mcp: inhibit error messages without cman
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time

Mixing Pacemaker versions

Posted in The Cluster Guy at April 18, 2013 11:47 PM

When mixing Pacemaker versions, there are two factors that need to be considered. The first is obviously the package version - if that is the same, then there is no problem.

If not, then the Pacemaker feature set needs to be checked. This feature set increases far less regularly than the normal package version. Newer versions of Pacemaker expose this value in the output of pacemakerd --features:

$ pacemakerd --features
Pacemaker 1.1.9 (Build: 9048b7b)
Supporting v3.0.7: generated-manpages agent-manpages ascii-docs publican-docs ncurses gcov libqb-logging libqb-ipc lha-fencing upstart systemd nagios heartbeat corosync-native snmp

In this case, the feature set is 3.0.7 (major 3, minor 0, revision 7).

For older versions, you should refer to the definition of CRM_FEATURE_SET in crm.h; usually this is located at /usr/include/pacemaker/crm/crm.h.
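
For example (the exact path may differ depending on how your packages were built):

grep CRM_FEATURE_SET /usr/include/pacemaker/crm/crm.h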

If two packages or versions share the same feature set, then the expectation is that they are fully compatible. Any other behavior is a bug which needs to be reported.

If the feature sets between two versions differ but have the same major value (ie. the 3 in 3.0.7 and 3.1.5), then they are said to be upgrade compatible.

What does upgrade compatible mean?

When two versions are upgrade compatible, it means that they will co-exist during a rolling upgrade but not on an extended or permanent basis as the newer version requires all its peers to support API feature(s) that the old one does not have.

The following rules apply when mixing installations with different feature sets:

  • When electing a node to run the cluster (the Designated Co-ordinator or “DC”), the node with the lowest feature set always wins.
  • The DC records its feature set in the CIB
  • Nodes may not join the cluster if their feature set is less than the one recorded in the CIB

Example

Consider node1 with a feature set of 3.0.7 and node2 with feature set 3.0.8… when node2 first joins the cluster, node1 will naturally remain the DC.

However if node1 leaves the cluster, either by being shut down or due to a failure, node2 will become the DC (as it is by itself and by definition has the lowest feature set of any active node).

At this point, node1 will be rejected if it attempts to rejoin the cluster and will shut down, as its feature set is lower than that of the DC (node2).

Is this happening to me?

If you are affected by this, you will see an error in the logs along the lines of:

error: We can only support up to CRM feature set 3.0.7 (current=3.0.8)

In this case, the DC (node2) has feature set 3.0.8 but we (node1) only have 3.0.7.

To get these two nodes talking to each other again:

  1. stop the cluster on both nodes
  2. on both nodes, run:
    CIB_file=/path/to/cib.xml cibadmin -M -X '<cib crm_feature_set="3.0.7"/>'
    
  3. start node1 and wait until it is elected as the DC
  4. start node2

Release candidate: 1.1.10-rc1

Posted in The Cluster Guy at April 17, 2013 11:32 PM

A funny thing happened on the way to 1.1.9…

Between tagging it on the Friday, and announcing it on the following Monday, people started actually testing it and found a couple of problems.

Specifically, there were some significant memory leaks and problems in a couple of areas that our unit and integration tests can’t sanely test.

So while 1.1.9 is out, it was never formally announced. Instead we’ve been fixing the reported bugs (as well as looking for a few more by running valgrind on a live cluster) and preparing for 1.1.10.

Also, in an attempt to learn from previous mistakes, the new release procedure involves release candidates. If no blocker bugs are reported in the week following a release candidate, it is re-tagged as the official release.

So without further ado, here are the 1.1.9 release notes as well as what changed in 1.1.10-rc1.

Details - 1.1.10-rc1

Changesets  143
Diff 104 files changed, 3327 insertions(+), 1186 deletions(-)

Highlights

Features added in Pacemaker-1.1.10

  • crm_resource: Allow individual resources to be reprobed
  • mcp: Alternate Upstart job controlling both pacemaker and corosync
  • mcp: Prevent the cluster from trying to use cman even when it is installed

Changes since Pacemaker-1.1.9

  • Allow programs in the haclient group to use CRM_CORE_DIR
  • cman: Do not unconditionally start cman if it is already running
  • core: Ensure custom error codes are less than 256
  • crmd: Clean up memory at exit
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Indicate completion of refresh to callers
  • crmd: Indicate completion of re-probe to callers
  • crmd: Only perform a dry run for deletions if built with ACL support
  • crmd: Prevent use-after-free when the blackbox is enabled
  • crmd: Suppress secondary errors when no metadata is found
  • doc: Pacemaker Remote deployment and reference guide
  • fencing: Avoid memory leak in can_fence_host_with_device()
  • fencing: Clean up memory at exit
  • fencing: Correctly filter devices when no nodes are configured yet
  • fencing: Correctly unpack device parameters before using them
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Fix memory leaks during query phase
  • fencing: Prevent empty call-id during notification processing
  • fencing: Prevent invalid read in parse_host_list()
  • fencing: Prevent memory leak when registering devices
  • crmd: lrmd: stonithd: fixed memory leaks
  • ipc: Allow unpriviliged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: cl#5148 - Correctly remove a node that used to have a different nodeid
  • legacy: Support “crm_node --remove” with a node name for corosync plugin (bnc#805278)
  • logging: Better checks when determining if file based logging will work
  • Pass errors from lsb metadata generation back to the caller
  • pengine: Do not use functions from the cib library during unpack
  • Prevent use-of-NULL when reading CIB_shadow from the environment
  • Skip WNOHANG when waiting after sending SIGKILL to child processes
  • tools: crm_mon - Print a timing field only if its value is non-zero
  • Use custom OCF_ROOT_DIR if requested
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy
  • xml: Prevent use-after-free in cib_process_xpath()
  • xml: Prevent use-after-free when not processing all xpath query results

Details - 1.1.9

Changesets  731
Diff 1301 files changed, 92909 insertions(+), 57455 deletions(-)

Highlights

Features added in Pacemaker-1.1.9

  • corosync: Allow cman and corosync 2.0 nodes to use a name other than uname()
  • corosync: Use queues to avoid blocking when sending CPG messages
  • ipc: Compress messages that exceed the configured IPC message limit
  • ipc: Use queues to prevent slow clients from blocking the server
  • ipc: Use shared memory by default
  • lrmd: Support nagios remote monitoring
  • lrmd: Pacemaker Remote Daemon for extending pacemaker functionality outside corosync cluster.
  • pengine: Check for master/slave resources that are not OCF agents
  • pengine: Support a ‘requires’ resource meta-attribute for controlling whether it needs quorum, fencing or nothing
  • pengine: Support for resource container
  • pengine: Support resources that require unfencing before start

Changes since Pacemaker-1.1.8

  • attrd: Correctly handle deletion of non-existant attributes
  • Bug cl#5135 - Improved detection of the active cluster type
  • Bug rhbz#913093 - Use crm_node instead of uname
  • cib: Avoid use-after-free by correctly support cib_no_children for non-xpath queries
  • cib: Correctly process XML diff’s involving element removal
  • cib: Performance improvements for non-DC nodes
  • cib: Prevent error message by correctly handling peer replies
  • cib: Prevent ordering changes when applying xml diffs
  • cib: Remove text nodes from cib replace operations
  • cluster: Detect node name collisions in corosync
  • cluster: Preserve corosync membership state when matching node name/id entries
  • cman: Force fenced to terminate on shutdown
  • cman: Ignore qdisk ‘nodes’
  • core: Drop per-user core directories
  • corosync: Avoid errors when closing failed connections
  • corosync: Ensure peer state is preserved when matching names to nodeids
  • corosync: Clean up CMAP connections after querying node name
  • corosync: Correctly detect corosync 2.0 clusters even if we don’t have permission to access it
  • crmd: Bug cl#5144 - Do not updated the expected status of failed nodes
  • crmd: Correctly determin if cluster disconnection was abnormal
  • crmd: Correctly relay messages for remote clients (bnc#805626, bnc#804704)
  • crmd: Correctly stall the FSA when waiting for additional inputs
  • crmd: Detect and recover when we are evicted from CPG
  • crmd: Differentiate between a node that is up and coming up in peer_update_callback()
  • crmd: Have cib operation timeouts scale with node count
  • crmd: Improved continue/wait logic in do_dc_join_finalize()
  • crmd: Prevent election storms caused by getrusage() values being too close
  • crmd: Prevent timeouts when performing pacemaker level membership negotiation
  • crmd: Prevent use-after-free of fsa_message_queue during exit
  • crmd: Store all current actions when stalling the FSA
  • crm_mon: Do not try to render a blank cib and indicate the previous output is now stale
  • crm_mon: Fixes crm_mon crash when using snmp traps.
  • crm_mon: Look for the correct error codes when applying configuration updates
  • crm_report: Ensure policy engine logs are found
  • crm_report: Fix node list detection
  • crm_resource: Have crm_resource generate a valid transition key when sending resource commands to the crmd
  • date/time: Bug cl#5118 - Correctly convert seconds-since-epoch to the current time
  • fencing: Attempt to provide more information that just ‘generic error’ for failed actions
  • fencing: Correctly record completed but previously unknown fencing operations
  • fencing: Correctly terminate when all device options have been exhausted
  • fencing: cov#739453 - String not null terminated
  • fencing: Do not merge new fencing requests with stale ones from dead nodes
  • fencing: Do not start fencing until entire device topology is found or query results timeout.
  • fencing: Do not wait for the query timeout if all replies have arrived
  • fencing: Fix passing of parameters from CMAN containing ‘=’
  • fencing: Fix non-comparison when sorting devices by priority
  • fencing: On failure, only try a topology device once from the remote level.
  • fencing: Only try peers for non-topology based operations once
  • fencing: Retry stonith device for duration of action’s timeout period.
  • heartbeat: Remove incorrect assert during cluster connect
  • ipc: Bug cl#5110 - Prevent 100% CPU usage when looking for synchronous replies
  • ipc: Use 50k as the default compression threshold
  • legacy: Prevent assertion failure on routing ais messages (bnc#805626)
  • legacy: Re-enable logging from the pacemaker plugin
  • legacy: Relax the ‘active’ check for plugin based clusters to avoid false negatives
  • legacy: Skip peer process check if the process list is empty in crm_is_corosync_peer_active()
  • mcp: Only define HA_DEBUGLOG to avoid agent calls to ocf_log printing everything twice
  • mcp: Re-attach to existing pacemaker components when mcp fails
  • pengine: Any location constraint for the slave role applies to all roles
  • pengine: Avoid leaking memory when cleaning up failcounts and using containers
  • pengine: Bug cl#5101 - Ensure stop order is preserved for partially active groups
  • pengine: Bug cl#5140 - Allow set members to be stopped when the subseqent set has require-all=false
  • pengine: Bug cl#5143 - Prevent shuffling of anonymous master/slave instances
  • pengine: Bug rhbz#880249 - Ensure orphan masters are demoted before being stopped
  • pengine: Bug rhbz#880249 - Teach the PE how to recover masters into primitives
  • pengine: cl#5025 - Automatically clear failcount for start/monitor failures after resource parameters change
  • pengine: cl#5099 - Probe operation uses the timeout value from the minimum interval monitor by default (#bnc776386)
  • pengine: cl#5111 - When clone/master child rsc has on-fail=stop, insure all children stop on failure.
  • pengine: cl#5142 - Do not delete orphaned children of an anonymous clone
  • pengine: Correctly unpack active anonymous clones
  • pengine: Ensure previous migrations are closed out before attempting another one
  • pengine: Introducing the whitebox container resources feature
  • pengine: Prevent double-free for cloned primitive from template
  • pengine: Process rsc_ticket dependencies earlier for correctly allocating resources (bnc#802307)
  • pengine: Remove special cases for fencing resources
  • pengine: rhbz#902459 - Remove rsc node status for orphan resources
  • systemd: Gracefully handle unexpected DBus return types
  • Replace the use of the insecure mktemp(3) with mkstemp(3)

DRBD 8.4.3: faster than ever

Posted in LINBIT Blogs by flip at February 22, 2013 08:17 AM

For the people who don’t already have DRBD 8.4.3 deployed: here’s another good reason — Performance.

As you know DRBD marks the to-be-changed disk areas in the Activity Log.

Until now that meant, for random-write workloads, a DRBD speed penalty of up to 50%, ie. each application-issued write request translated into two write requests on storage.


With DRBD 8.4.3 Lars managed to reduce that overhead from 1:2 down to 64:65, ie. to about 1.6%. (In sales speak “up to 64 times faster” ;) )

Here are two graphics showing the difference on one of our test clusters; both using 10GigE and synchronous replication (protocol C):

Random Writes Benchmark, Spinning Disk
The raw LVM line shows the hardware limit of 350 IOPS; while 8.4.2 and 8.3.15 are quickly limited by hard disk seeks, the 8.4.3 bars go up much further – in this hardware setup we get 4 times the random-write performance!


When using SSDs the difference is even more visible: the 8.4.2 to 8.4.3 speedup is a factor of ~16.7.

Random Writes Benchmark, SSD
Again, the raw LVM line shows the hardware limit of 50k IOPS; 8.4.2 needs to wait for the synchronous writes (at 1.5k IOPS), but 8.4.3 gives 25k IOPS, at least half the pure SSD speed.


Please note that every setup is different — and storage subsystems are very complex beasts, with many non-linear, interacting parts. During our tests we found many “interesting” (but reproducible) behaviours – so you’ll have to tune your specific setup.


Furthermore, the activity log can now be much bigger; but, as the performance impact of leaving the “hot” area is now very much reduced, you may even want to lower al-extents – ie. tune the AL size to the working set, to reduce re-sync times after a failed Primary.

And, last but not least, the AL can be striped – this might help for some hardware setups, too.
Please see the documentation for more details.
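
As a rough sketch of where those knobs live in the configuration (option names as described in the DRBD 8.4.3 documentation; the values are placeholders to be tuned per setup):

resource r0 {
  disk {
    al-extents     1237;   # size of the activity log, ie. the "hot" working set
    al-stripes        4;   # stripe the AL (new in 8.4.3)
    al-stripe-size  32k;
  }
}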

BTW: these changes are in the DRBD 9 branch too, so you won’t lose the benefits.


Now powered by Octopress

Posted in The Cluster Guy at February 18, 2013 02:48 AM

With Posterous being shut down (even though I wasn’t using it for TheClusterGuy), I’ve decided the time has come to take back control of my content.

As a result I’ve started using Octopress for publishing.

It is pretty nifty. Octopress generates a static site (good for performance!) that can either be hosted at GitHub or (if GitHub ever goes dark) anywhere Apache can run. As you will notice, I was even able to easily import all my old posts!

For now I’m taking the GitHub path with a custom domain name (not the same as this one so that the old links still work).

I’m quite linking the theme/layout and feature set. I’m even taking the opportunity to support comments for the first time… we’ll see how long that lasts before the spammers make it no longer worth the time.

Large Cluster Performance: Redux

Posted in The Cluster Guy at February 17, 2013 11:49 PM

Normally I have access to 4 virtual cluster nodes on my laptop, however for the first time since leaving SUSE, I have had the opportunity to test with 8 and 16 node clusters.

Performance testing doesn’t just benefit large clusters, it helps reduce Pacemaker’s footprint for smaller ones too

I am very pleased to say that 8-nodes ‘just worked’. The fun began when I tried to step up to 16 nodes.

When Pacemaker 1.1.9 is released, it can confidently claim to support 16 node clusters

By far the biggest roadblock turned out to be a memory allocation issue in libxml2’s xmlNodeDump() function. Essentially, CIB updates that would normally complete in tenths of a second would, at the worst possible time, begin to take upwards of 18 seconds. This would cause the CIB to get backed up, updates would start timing out and the cluster would produce even more CIB updates while attempting to recover, compounding the problem.

As a result the cluster began falling over even before it could be brought fully online.

I was able to work-around this problem (and make some additional optimizations) with the following two commits:

55f8a94: Refactor: Use a custom xml-to-string function for performance
0066122: Refactor: Core: A faster and more consistant digest function

With such large clusters, we also started hitting IPC limits. If we were lucky, this meant the message didn’t get through.

Otherwise, the server (cib) got into a tightly coupled send/recv loop with a potentially slow client - causing the cib to get backed up, updates would start timing out and the cluster would produce even more CIB updates while attempting to recover. (Are you seeing the pattern yet?)

To account for this, the cluster now:

  • compresses messages that exceed the limit
  • uses shared memory for IPC so we know in advance if trying to send the current message would block
  • queues messages that would otherwise have blocked (and disconnects clients if they fall too far behind)

These are the relevant commits:

3c56aa1: Feature: Reliably detect when an IPC message size exceeds the connection's maximum
14bbe6c: Feature: Compress messages that exceed the configured IPC message limit
566db4f: Feature: IPC: Use queues to prevent slow clients from blocking the server
85543e6: Feature: Use shared memory for IPC by default

Additionally, a previous decision to broadcast updates even when nothing changed was determined to be causing significant and unnecessary load throughout the cluster. This and other updates that made more efficient use of the available system resources also helped:

004d515: High: cib: Performance improvements for non-DC nodes
01baad5: Medium: crmd: Don't go into a tight loop while waiting for something to happen
27b6306: High: crmd: Only call set_slave() if we were the DC to avoid spamming the the cib
a3bba8d: Refactor: cib: Construct each notification once, not once per client
24a7ec2: Refactor: ipc: Allow messages to be constructed once and sent multiple times
e72c348: Medium: cib: Improve performance by only validating every 20th slave update
21dfddc: High: crmd: Have cib operation timeouts scale with node count

Results

The results were quite remarkable

The graphs below show the 1, 5 and 15 minute load averages while the cluster ran each CTS test once.

This is node 1 (of 4) running Pacemaker 1.0.5 back in 2009:

1.0.5 - 4-nodes

Here is the current code running a CMAN cluster twice as big:

1.1.9 - 8-nodes

Note the difference in scale on the Y-axis (double in 2009)

The difference is even more pronounced for larger clusters. Here we have (node 1 of) a 13 node cluster running 1.0.5 again:

1.0.5 - 13-nodes

Note the load almost hits 8

compared with a 16 node Corosync 2.x cluster with 1.1.9:

1.1.9 - 16-nodes

Note the load still barely spikes above 1 even though the size of the cluster is over 30% larger than it was in 2009.

Even with the benefit of slightly more modern hardware, that’s a significant improvement.

Testing Notes

Unfortunately, these tests could not be a true apples-for-apples comparison.

The 16-node cluster was comprised of physical hardware with 8GB of RAM and two CPUs of the following type:

  • model name : Dual-Core AMD Opteron(tm) Processor 2212 HE
  • cpu MHz : 1000.000
  • cache size : 1024 KB
  • bogomips : 2000.04

The 8-node cluster was comprised of virtual machines with 512MB of RAM and one CPU of the following type:

  • model name : AMD Opteron 23xx (Gen 3 Class Opteron)
  • cpu MHz : 2099.998
  • cache size : 512 KB
  • bogomips : 4199.99

The machines from 2009 had 512MB RAM and (we believe) one 2 GHz HT Xeon (which looks comparable to the virtual machines in the 8-node case). The hardware appears to have met its demise so we’re unable to find more details.

If someone has the time/energy/interest to re-run the 1.0.5 tests on modern hardware, please get in touch :-)

Change the cluster distribution without downtime

Posted in LINBIT Blogs by flip at February 11, 2013 01:26 PM

Recently we’ve upgraded one of our virtualization clusters (more RAM), and in the course of this did an upgrade of the virtualization hosts from Ubuntu Lucid to RHEL 6.3 — without any service interruption.

That was not that complicated, really; as our core product DRBD works on (nearly) every Linux distribution, we simply

  1. live-migrated all VMs to one of the nodes;
  2. reinstalled the root filesystem on the other node with RHEL 6.3 and configured GRUB to boot into that one;
  3. installed matching DRBD modules;
  4. waited a few seconds for the resync to complete (which was really that fast, because we didn’t touch the existing logical volumes, and so the changed data were only a few GiB);
  5. and then let Pacemaker take control over the cluster again, allowing us to migrate the VMs to the newly installed node. Without any service interruption.

The key to this was that DRBD and Pacemaker are available in compatible versions on most current distributions — and that’s not a big problem, because we make such packages available for our customers in our repositories.

Upgrading DRBD from 8.3 to 8.4 at the same time is only a small, secondary change; after all, its network code can talk to different versions by design.