Quick links:

Arrfab's Blog » Cluster: Switching from Ethernet to Infiniband for Gluster access (or why we had to …)

That Cluster Guy: Feature Spotlight - Smart Resource Restart from the Command Line

That Cluster Guy: Feature Spotlight - Controllable Resource Discovery

LINBIT Blogs: DRBD and SSD: I was made for loving you

LINBIT Blogs: Root-on-DRBD followup: Pre-production staging servers

That Cluster Guy: Release Candidate: 1.1.12-rc1

LINBIT Blogs: DRBDmanage installation is now easier!

That Cluster Guy: Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

LINBIT Blogs: DRBDManage release 0.10

LINBIT Blogs: DRBD-Manager

That Cluster Guy: Announcing 1.1.11 Beta Testing

That Cluster Guy: Pacemaker and RHEL 6.4 (redux)

That Cluster Guy: Changes to the Remote Wire Protocol in 1.1.11

LINBIT Blogs: DRBD and the sync rate controller, part 2

LINBIT Blogs: DRBD Proxy 3.1: Performance improvements

That Cluster Guy: Pacemaker 1.1.10 - final

That Cluster Guy: Release candidate: 1.1.10-rc7

That Cluster Guy: Release candidate: 1.1.10-rc6

That Cluster Guy: GPG Quickstart

That Cluster Guy: Release candidate: 1.1.10-rc5

LINBIT Blogs: “umount is too slow”

Arrfab's Blog » Cluster: Rolling updates with Ansible and Apache reverse proxies

That Cluster Guy: Release candidate: 1.1.10-rc3

That Cluster Guy: Pacemaker Logging

That Cluster Guy: Debugging the Policy Engine

That Cluster Guy: Debugging Pacemaker

That Cluster Guy: Pacemaker on RHEL6.4

That Cluster Guy: Release candidate: 1.1.10-rc2

That Cluster Guy: Mixing Pacemaker versions

That Cluster Guy: Release candidate: 1.1.10-rc1

Switching from Ethernet to Infiniband for Gluster access (or why we had to …)

Posted in Arrfab's Blog » Cluster by fabian.arrotin at November 24, 2014 10:37 AM

As explained in my previous (small) blog post, I had to migrate a Gluster setup we have within the CentOS.org infra. As mentioned in that post too, Gluster is really easy to install, and sometimes it can even "smell" too easy to be true. One thing to keep in mind when dealing with Gluster is that it's a "file-level" storage solution, so don't try to compare it with "block-level" solutions (typically a NAS vs SAN comparison, even if "SAN" itself is the wrong term for such a discussion: a SAN is what sits *between* your nodes and the storage itself, just as a reminder).

Within the CentOS.org infra, we have a multi-node Gluster setup that we use for several things at the same time. The Gluster volumes are used to store some files, but also to host (on different gluster volumes with different settings/ACLs) KVM virtual disks (qcow2). People who know me will say: "hey, but for performance reasons, isn't it faster to just dedicate, for example, a partition or a Logical Volume, instead of using qcow2 images sitting on top of a filesystem for Virtual Machines?" and that's true. But with our limited number of machines, and a need to "move" Virtual Machines around without a proper shared storage solution (and because in our setup those physical nodes *are* both glusterd nodes and hypervisors), Gluster was an easy-to-use solution for that.

It was working, but not that fast ... I then heard that (obviously) accessing those qcow2 image files through fuse wasn't efficient at all, but that Gluster has libgfapi, which can be used to "talk" directly to the gluster daemons, completely bypassing the need to mount your gluster volumes locally through fuse. Thankfully, qemu-kvm from CentOS 6 is built against libgfapi and so can use it directly (which is why it's automatically installed when you install the KVM hypervisor components). The result? Better, but still not what we were expecting ...
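
As an illustration (the host and volume names below are made up), qemu's native gluster block driver can address such an image directly through libgfapi with a gluster:// URL, with no fuse mount involved:

# create / inspect a qcow2 image directly on the (hypothetical) "vmstore" volume
# served by gluster node gluster01, going through libgfapi instead of fuse
qemu-img create -f qcow2 gluster://gluster01/vmstore/vm01.qcow2 20G
qemu-img info gluster://gluster01/vmstore/vm01.qcow2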

When trying to find the issue, I discussed with some folks in the #gluster irc channel (irc.freenode.net) and suddenly understood something that is *not* so obvious about Gluster in distributed+replicated mode: people who have dealt with storage solutions at the hardware level (or people using DRBD, which I did too in the past, and which I also liked a lot) expect the replication to happen automatically on the storage/server side, but that's not true for Gluster. In fact glusterd just exposes metadata to the gluster clients, which then know where to read and write (being "redirected" to the correct gluster nodes). That means that replication happens on the *client* side: in replicated mode, the client itself writes the same data twice, once to each server ...

So back to our example: our nodes have 2*1Gb/s Ethernet cards, one being a bridge used by the Virtual Machines and the other one "dedicated" to gluster, and each node is itself both a glusterd server and a gluster client. I'll let you work out the maximum performance we could get for a write operation: 1Gbit/s, divided by two (because of the replication), so ~125MB/s / 2 => in theory ~62MB/s (then remove tcp/gluster overhead and that drops to ~55MB/s).

How to solve that? Well, I tested that theory and confirmed it directly: in distributed-only mode, write performance immediately doubled. So yes, running Gluster over Gigabit Ethernet had become the bottleneck. Upgrading to 10Gb Ethernet wasn't something we could do, but, thanks to Justin Clift (and some other Gluster folks), we were able to find some second-hand Infiniband hardware (10Gbps HCAs and a switch).

While Gluster has native/built-in rdma/Infiniband capabilities (see the "transport" option of the "gluster volume create" command), in our case we had to migrate existing Gluster volumes from plain TCP/Ethernet to Infiniband, while keeping the downtime as short as possible. That is/was my first experience with Infiniband, but it's not as hard as it seems, especially once you discover IPoIB (IP over Infiniband). From a sysadmin POV it's just "yet another network interface", but a 10Gbps one now :)
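
Just as a sketch (volume name and brick paths are made up, and the exact transport keywords depend on your Gluster version), a brand-new volume could pick its transport at creation time like this:

# 2-way replicated volume that speaks both TCP and RDMA
gluster volume create vmstore replica 2 transport tcp,rdma \
    gluster01:/bricks/vmstore/brick gluster02:/bricks/vmstore/brick
gluster volume start vmstore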

The Gluster volume migration then goes like this (schedule a downtime for this, obviously):

On all gluster nodes (assuming that we start from machines installed only with the @core group, so minimal ones):

yum groupinstall "Infiniband Support"

chkconfig rdma on

<stop your clients or other  apps accessing gluster volumes, as they will be stopped>

service glusterd stop && chkconfig glusterd off &&  init 0

Then install the hardware in each server, connect all Infiniband cards to the (previously configured) IB switch and power all servers back on. When the machines are back online, you "just" have to configure the ib interfaces. As in my case the machines were remote nodes and I couldn't physically look at how they were cabled, I had to use some IB tools to see which port was connected (a tool like "ibv_devinfo" showed me which port was active/connected, while "ibdiagnet" shows the topology and other nodes/devices). In our case it was port 2, so let's create the ifcfg-ib{0,1} files (ib1 being the one we'll use):

DEVICE=ib1
TYPE=Infiniband
BOOTPROTO=static
BROADCAST=192.168.123.255
IPADDR=192.168.123.2
NETMASK=255.255.255.0
NETWORK=192.168.123.0
ONBOOT=yes
NM_CONTROLLED=no
CONNECTED_MODE=yes

The interesting part here is "CONNECTED_MODE=yes": for people who already use iSCSI, you know that Jumbo frames are really important if you have a dedicated VLAN (and an Ethernet switch that supports Jumbo frames too). As stated in the IPoIB kernel doc, there are two operation modes: datagram (default, 2044-byte MTU) and Connected (up to a 65520-byte MTU). It's up to you to decide which one to use, but if you understood the Jumbo frames thing for iSCSI, you already get the point.
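
A quick sanity check (the interface name may of course differ) to confirm the IPoIB mode and the MTU actually in use:

# "connected" means the large 65520-byte MTU is available
cat /sys/class/net/ib1/mode
ip link show ib1 | grep -o 'mtu [0-9]*'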

An "ifup ib1" on all nodes will bring the interfaces up and you can verify that everything works by pinging each other node, including with larger mtu values :

ping -s 16384 <other-node-on-the-infiniband-network>

If everything's fine, you can then decide to start gluster, *but* don't forget that gluster uses FQDNs (at least I hope that's how you initially configured your gluster setup: already on a dedicated segment, and using different FQDNs for the storage vlan). You just have to update your local resolver (internal DNS, local hosts files, whatever you use) so that gluster will then use the new IP subnet on the Infiniband network. (If you haven't previously defined different hostnames for your gluster setup, you can "just" update them in the various /var/lib/glusterd/peers/* and /var/lib/glusterd/vols/*/*.vol files.)
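
A quick way to double-check that nothing still references the old addresses (the hostname below is made up):

# look for leftover references to the old (Ethernet) storage hostnames
grep -r "node01-storage.example.org" /var/lib/glusterd/peers /var/lib/glusterd/vols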

Restart the whole gluster stack (on all gluster nodes) and verify that it works fine :

service glusterd start

gluster peer status

gluster volume status

# and if you're happy with the results :

chkconfig glusterd on

So, in a short summary:

  • Infiniband isn't that difficult (especially if you use IPoIB, even though it adds a small overhead)
  • Migrating gluster from Ethernet to Infiniband is also easy (especially if you carefully planned your initial design regarding IP subnet/VLAN/segment/DNS resolution, so the move can be "transparent")

Feature Spotlight - Smart Resource Restart from the Command Line

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 14, 2014 09:17 PM

Restarting a resource can be a complex affair if there are things that depend on that resource or if any of the operations might take a long time.

Stopping a resource is easy, but it can be hard for scripts to determine at what point the target resource has stopped (in order to know when to re-enable it), at what point it is appropriate to give up, and even which resources might have prevented the stop or start phase from completing.

For this reason, I am pleased to report that we will be introducing a --restart option for crm_resource in Pacemaker 1.1.13.

How it works

Assuming the following invocation

crm_resource --restart --resource dummy

The tool will:

  1. Check the current state of the cluster
  2. Set the target-role for dummy to stopped
  3. Calculate the future state of the cluster
  4. Compare the current state to the future state
  5. Work out the list of resources that still need to stop
  6. If there are resources to be stopped
    1. Work out the longest timeout of all stopping resources
    2. Look for changes until the timeout
    3. If nothing changed, indicate which resources failed to stop and exit
    4. Go back to step 4.
  7. Now that everything has stopped, remove the target-role setting for dummy to allow it to start again
  8. Calculate the future state of the cluster
  9. Compare the current state to the future state
  10. Work out the list of resources that still need to start
  11. If there are resources to be started
    1. Work out the longest timeout of all starting resources
    2. Look for changes until the timeout
    3. If nothing changed, indicate which resources failed to start and exit
    4. Go back to step 9.
  12. Done

Considering Clones

crm_resource is also smart enough to restart clone instances running on specific nodes with the optional --node hostname argument. In this scenario, instead of setting target-role (which would take down the entire clone), we use the same logic as crm_resource --ban and crm_resource --clear to enable/disable the clone from running on the named host.
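
For example, a minimal invocation (resource and node names are made up) would be:

# restart only the clone instance running on node01, leaving the other instances alone
crm_resource --restart --resource my-clone --node node01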

Want to know more?

Drop by IRC or ask us a question on the Pacemaker mailing list

There is also plenty of documentation available.

Feature Spotlight - Smart Resource Restart from the Command Line was originally published by Andrew Beekhof at That Cluster Guy on November 14, 2014.

Feature Spotlight - Controllable Resource Discovery

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 13, 2014 01:09 PM

Coming in 1.1.13 is a new option for location constraints: resource-discovery

This new option controls whether or not Pacemaker performs resource discovery for the specified resource on the nodes covered by the constraint. The default, always, preserves the pre-1.1.13 behaviour.

The options are:

  • always - (Default) Always perform resource discovery for the specified resource on this node.

  • never - Never perform resource discovery for the specified resource on this node. This option should generally be used with a -INFINITY score, although that is not strictly required.

  • exclusive - Only perform resource discovery for the specified resource on this node. Multiple location constraints using exclusive discovery for the same resource across different nodes create the subset of nodes to which resource discovery for that resource is exclusive. If a resource is marked for exclusive discovery on one or more nodes, it is only allowed to be placed within that subset of nodes.
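
As a sketch (resource and node names are made up), such a constraint could be loaded into the CIB like this:

# ban "db" from node3 and skip resource discovery for it there
cibadmin -o constraints -C -X \
  '<rsc_location id="loc-db-node3" rsc="db" node="node3" score="-INFINITY" resource-discovery="never"/>'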

Why would I want this?

Limiting resource discovery to the subset of nodes the resource is physically capable of running on can significantly boost performance when a large set of nodes is present. When pacemaker_remote is used to expand the node count into the hundreds, this option can have a dramatic effect on the speed of the cluster.

Is using this option ever a bad idea?

Absolutely!

Setting this option to never or exclusive allows the possibility for the resource to be active in those locations without the cluster’s knowledge. This can lead to the resource being active in more than one location!

There are primarily three ways for this to happen:

  1. If the service is started outside the cluster’s control (ie. at boot time by init, systemd, etc; or by an admin)
  2. If the resource-discovery property is changed while part of the cluster is down or suffering split-brain
  3. If the resource-discovery property is changed for a resource/node while the resource is active on that node

When is it safe?

For the most part, it is only appropriate when:

  1. you have more than 8 nodes (including bare metal nodes with pacemaker-remoted), and
  2. there is a way to guarantee that the resource can only run in a particular location (eg. the required software is not installed anywhere else)

Want to know more?

Drop by IRC or ask us a question on the Pacemaker mailing list

There is also plenty of documentation available.

Feature Spotlight - Controllable Resource Discovery was originally published by Andrew Beekhof at That Cluster Guy on November 13, 2014.

DRBD and SSD: I was made for loving you

Posted in LINBIT Blogs by flip at November 04, 2014 09:29 AM

When DRBD 8.4.4 integrated TRIM/Discard support, a lot of things got much better… for example, 700MB/sec over a 1GBit/sec connection.

As described in the Root-on-DRBD tech guide, my notebook uses DRBD on top of an SSD; apart from the IO speed, the other important thing is the Trim/Discard support.

In practice that means, e.g., that the resync goes much faster: most of the blocks that were written while I was off-site have already been discarded again, and so the automatic fstrim can reduce the amount of data that needs to be resynced by “up to” 100%.
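
For reference, a minimal sketch of the kind of periodic trim meant here (run by hand, from cron, or via your distribution's fstrim timer, as you prefer):

# discard unused blocks on all mounted filesystems that support it
fstrim -a -v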

Result: with a single SSD on one end, 1GBit network connectivity, and thin LVM on top of a 2-harddisk RAID1 on the other end, a resync rate of 700MB/sec!

Here are the log lines, heavily shortened so that they’re readable; starting at 09:39:00:

block drbd9: drbd_sync_handshake:
block drbd9: self 7:4:4:4 bits:15377903 flags:0
block drbd9: peer 4:0:2:2 bits:15358173 flags:4
block drbd9: uuid_compare()=1 by rule 70
block drbd9: Becoming sync source due to disk states.
block drbd9: peer( Unknown -> Sec ) conn( WFRepPar -> WFBitMapS )
block drbd9: send bitmap stats total 122068; compression: 98.4%
block drbd9: receive bitmap stats: total 122068; compression: 98.4%
block drbd9: helper cmd: drbdadm before-resync-src minor9
block drbd9: helper cmd: drbdadm before-resync-src minor9 exit code 0
block drbd9: conn( WFBitMapS -> SyncSource ) 
block drbd9: Began resync as SyncSrc (will sync 58 GB [15382819 bits]).
block drbd9: updated sync UUID 7:4:4:4

At 09:40:27 the resync concludes; the first line is the relevant one:

block drbd9: Resync done (total 87 sec; paused 0 sec; 707256 K/sec)
block drbd9: updated UUIDs 7:0:4:4
block drbd9: conn( SyncSource -> Connected ) pdsk( Inc -> UpToDate ) 

That’s how it’s done ;)

Root-on-DRBD followup: Pre-production staging servers

Posted in LINBIT Blogs by flip at October 16, 2014 12:28 PM

In the “Root-on-DRBD” Tech Guide we showed how to cleanly get DRBD below the root filesystem, how to use it, and a few advantages and disadvantages. Now, if a complete, live backup of a machine is available, a few more use-cases open up; here we want to discuss testing upgrades of production servers.

Everybody knows that upgrading production servers can be risky business. Even for the simplest changes (like upgrading DRBD on a Secondary) things can go wrong. If you have an HA cluster in place, you can at least avoid a lot of pressure: the active cluster member is still running normally, so you don't have to rush the upgrade as if you had only a single production server.

Now, in a perfect world, all changes would have to go through a staging server first, perhaps several times, until all necessary changes are documented and the affected people know exactly what to do. However, that means having a staging server that is as identical to the production machine as possible: exactly the same package versions, using production data during schema changes (which helps to assess the DB load [cue your favourite TheDailyWTF article about that here]), and so on.
That’s quite some work.

Well, no, wait, it isn’t that much … if you have a simple process to copy the production server.

That might be fairly easy if the server is virtualized – a few clicks are sufficient; but on physical hardware you will need a way to quickly get the staging machine back up-to-date after a failed attempt – and that's exactly what DRBD can give you.

The trick is to “shut down” the machine in a way that leaves the root filesystem unused, then resync DRBD from the production server, and reboot into the freshly updated “installation”.
(Yes, the data volumes will have to be handled in a similar way – but that's possible with DRBD 8, and will get even easier with DRBD 9.)
A sample script that shows a basic outline is presented in the resync-root branch in the Root-on-DRBD github Repository. It should be run on the staging server only.

Please note that this is a barely-tested draft – you'll need to put quite a few installation-specific things in there, like other DRBD resources to resynchronize at the same time, and so on!
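
To give an idea of the flow only (this is NOT the script from the resync-root branch, just a very rough sketch; the resource name "root" and the rescue-shell approach are assumptions):

# run on the STAGING node from a rescue/initramfs shell, with the root fs not mounted
drbdadm secondary root     # make sure the staging side is no longer Primary
drbdadm invalidate root    # throw away local changes, resync everything from production
cat /proc/drbd             # repeat until the resync has finished
reboot                     # boot into the freshly refreshed installation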

Feedback is very welcome; Pull-requests even more so ;)

Release Candidate: 1.1.12-rc1

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 07, 2014 08:16 PM

As promised, this announcement brings the first release candidate for Pacemaker 1.1.12

https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.12-rc1

This release primarily focuses on important but mostly invisible changes under-the-hood:

  • The CIB is now O(2) faster. That’s 100x for those not familiar with Big-O notation :-)

    This has massively reduced the cluster’s use of system resources, allowing us to scale further on the same hardware, and dramatically reduced failover times for large clusters.

  • Support for ACLs is enabled by default.

    The new implementation can restrict cluster access for containers where pacemaker-remoted is used and is also more efficient.

  • All CIB updates are now serialized and pre-synchronized via the corosync CPG interface. This makes it impossible for updates to be lost, even when the cluster is electing a new DC.

  • Schema versioning changes

    New features are no longer silently added to the schema. Instead the ${Y} in pacemaker-${X}-${Y} will be incremented for simple additions, and ${X} will be bumped for removals or other changes requiring an XSL transformation.

    To take advantage of new features, you will need to update all the nodes and then run the equivalent of cibadmin --upgrade (sketched below).
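
    A minimal sketch of that last step, once every node is running the new packages:

    # bump the configuration to the latest schema the installed software supports
    # (add --force if cibadmin refuses because the CIB still validates against an old schema)
    cibadmin --upgrade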

Thank you to everyone who has tested out the new CIB and ACL code already. Please keep those bug reports coming in!

List of known bugs to be investigated during the RC phase:

  • 5206 Fileencoding broken
  • 5194 A resource starts with a standby node. (Latest attrd does not serve as the crmd-transition-delay parameter)
  • 5197 Fail-over is delayed. (State transition is not calculated.)
  • 5139 Each node fenced in its own transition during start-up fencing
  • 5200 target node is over-utilized with allow-migrate=true
  • 5184 Pending probe left in the cib
  • 5165 Add support for transient node utilization attributes

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker

  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep

  3. Build Pacemaker

    # make rc

  4. Copy and deploy as needed

Details

Changesets: 633
Diff: 184 files changed, 12690 insertions(+), 5843 deletions(-)

Highlights

Features added since Pacemaker-1.1.11

  • Changes to the ACL schema to support nodes and unix groups
  • cib: Check ACLs prior to making the update instead of parsing the diff afterwards
  • cib: Default ACL support to on
  • cib: Enable the more efficient xml patchset format
  • cib: Implement zero-copy status update (performance)
  • cib: Send all r/w operations via the cluster connection and have all nodes process them
  • crm_mon: Display brief output if “-b/--brief” is supplied or ‘b’ is toggled
  • crm_ticket: Support multiple modifications for a ticket in an atomic operation
  • Fencing: Add the ability to call stonith_api_time() from stonith_admin
  • logging: daemons always get a log file, unless explicitly configured to ‘none’
  • PE: Automatically re-unfence a node if the fencing device definition changes
  • pengine: cl#5174 - Allow resource sets and templates for location constraints
  • pengine: Support cib object tags
  • pengine: Support cluster-specific instance attributes based on rules
  • pengine: Support id-ref in nvpair with optional “name”
  • pengine: Support per-resource maintenance mode
  • pengine: Support site-specific instance attributes based on rules
  • tools: Display pending state in crm_mon/crm_resource/crm_simulate if --pending/-j is supplied (cl#5178)
  • xml: Add the ability to have lightweight schema revisions
  • xml: Enable resource sets in location constraints for 1.2 schema
  • xml: Support resources that require unfencing

Changes since Pacemaker-1.1.11

  • acl: Authenticate pacemaker-remote requests with the node name as the client
  • cib: allow setting permanent remote-node attributes
  • cib: Do not disable cib disk writes if on-disk cib is corrupt
  • cib: Ensure ‘cibadmin -R/--replace’ commands get replies
  • cib: Fix remote cib based on TLS
  • cib: Ignore patch failures if we already have their contents
  • cib: Resolve memory leaks in query paths
  • cl#5055: Improved migration support.
  • cluster: Fix segfault on removing a node
  • controld: Do not consider the dlm up until the address list is present
  • controld: handling startup fencing within the controld agent, not the dlm
  • crmd: Ack pending operations that were cancelled due to rsc deletion
  • crmd: Actions can only be executed if their prerequisites completed successfully
  • crmd: Do not erase the status section for unfenced nodes
  • crmd: Do not overwrite existing node state when fencing completes
  • crmd: Do not start timers for already completed operations
  • crmd: Fenced nodes that return prior to an election do not need to have their status section reset
  • crmd: make lrm_state hash table not case sensitive
  • crmd: make node_state erase correctly
  • crmd: Prevent manual fencing confirmations from attempting to create node entries for unknown nodes
  • crmd: Prevent memory leak in error paths
  • crmd: Prevent memory leak when accepting a new DC
  • crmd: Prevent message relay from attempting to create node entries for unknown nodes
  • crmd: Prevent SIGPIPE when notifying CMAN about fencing operations
  • crmd: Report unsuccessful unfencing operations
  • crm_diff: Allow the generation of xml patchsets without digests
  • crm_mon: Allow the file created by --as-html to be world readable
  • crm_mon: Ensure resource attributes have been unpacked before displaying connectivity data
  • crm_node: Only remove the named resource from the cib
  • crm_node: Prevent use-after-free in tools_remove_node_cache()
  • crm_resource: Gracefully handle -EACCESS when querying the cib
  • fencing: Advertise support for reboot/on/off in the metadata for legacy agents
  • fencing: Automatically switch from ‘list’ to ‘status’ to ‘static-list’ if those actions are not advertised in the metadata
  • fencing: Correctly record which peer performed the fencing operation
  • fencing: default to ‘off’ when agent does not advertise ‘reboot’ in metadata
  • fencing: Execute all required fencing devices regardless of what topology level they are at
  • fencing: Pass the correct options when looking up the history by node name
  • fencing: Update stonith device list only if stonith is enabled
  • get_cluster_type: failing concurrent tool invocations on heartbeat
  • iso8601: Different logic is needed when logging and calculating durations
  • lrmd: Cancel recurring operations before stop action is executed
  • lrmd: Expose logging variables expected by OCF agents
  • lrmd: Merge duplicate recurring monitor operations
  • lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
  • mainloop: Fixes use after free in process monitor code
  • make resource ID case sensitive
  • mcp: Tell systemd not to respawn us if we exit with rc=100
  • pengine: Allow container nodes to migrate with connection resource
  • pengine: cl#5186 - Avoid running rsc on two nodes when node is fenced during migration
  • pengine: cl#5187 - Prevent resources in an anti-colocation from even temporarily running on a same node
  • pengine: Correctly handle origin offsets in the future
  • pengine: Correctly search failcount
  • pengine: Default sequential to TRUE for resource sets for consistency with colocation sets
  • pengine: Delay unfencing until after we know the state of all resources that require unfencing
  • pengine: Do not initiate fencing for unclean nodes when fencing is disabled
  • pengine: Do not unfence nodes that are offline, unclean or shutting down
  • pengine: Fencing devices default to only requiring quorum in order to start
  • pengine: fixes invalid transition caused by clones with more than 10 instances
  • pengine: Force record pending for migrate_to actions
  • pengine: handles edge case where container order constraints are not honored during migration
  • pengine: Ignore failure-timeout only if the failed operation has on-fail=”block”
  • pengine: Log when resources require fencing but fencing is disabled
  • pengine: Memory leaks
  • pengine: Unfencing is based on device probes, there is no need to unfence when normal resources are found active
  • Portability: Use basic types for DBus compatibility struct
  • remote: Allow baremetal remote-node connection resources to migrate
  • remote: Enable migration support for baremetal connection resources by default
  • services: Correctly reset the nice value for lrmd’s children
  • services: Do not allow duplicate recurring op entries
  • services: Do not block synced service executions
  • services: Fixes segfault associated with cancelling in-flight recurring operations.
  • services: Reset the scheduling policy and priority for lrmd’s children without relying on SCHED_RESET_ON_FORK
  • services_action_cancel: Interpret return code from mainloop_child_kill() correctly
  • stonith_admin: Ensure pointers passed to sscanf() are properly initialized
  • stonith_api_time_helper now returns when the most recent fencing operation completed
  • systemd: Prevent use-of-NULL when determining if an agent exists
  • upstart: Allow compilation with glib versions older than 2.28
  • xml: Better move detection logic for xml nodes
  • xml: Check all available schemas when doing upgrades
  • xml: Convert misbehaving #define into a more easily understood inline function
  • xml: If validate-with is missing, we find the most recent schema that accepts it and go from there
  • xml: Update xml validation to allow ‘<node type=remote />’

Release Candidate: 1.1.12-rc1 was originally published by Andrew Beekhof at That Cluster Guy on May 07, 2014.

DRBDmanage installation is now easier!

Posted in LINBIT Blogs by flip at March 21, 2014 05:03 PM

In the last blog post about DRBDmanage we mentioned

Initial setup is a bit involved (see the README)

… with the new release, this is no longer true!

All that’s needed is now one command to initialize a new DRBDmanage control volume:

nodeA# drbdmanage init «local-ip-address»

You are going to initalize a new drbdmanage cluster.
CAUTION! Note that:
  * Any previous drbdmanage cluster information may be removed
  * Any remaining resources managed by a previous drbdmanage
    installation that still exist on this system will no longer
    be managed by drbdmanage

Confirm:

  yes/no:

Acknowledging that question will (still) print a fair bit of data, ie. the output of the commands that are run in the background; if everything works, you’ll get a freshly initialized DRBDmanage control volume, with the current node already registered.

Well, a single node is boring … let’s add further nodes!

nodeA# drbdmanage new-node «nodeB» «its-ip-address»

Join command for node nodeB:
  drbdmanage join some arguments ....

Now you copy and paste the one command line on the new node:

nodeB# drbdmanage join «arguments as above....»
You are going to join an existing drbdmanage cluster.
CAUTION! Note that:
...

Another yes and enter – and you’re done! Every further node is just one command on the existing cluster, which will give you the command line to use on the to-be-added node.


So, another major point is fixed … there are a few more things to be done, of course, but that was a big step (in the right direction) ;)

Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at March 19, 2014 01:24 PM

It has come to my attention that the potential for data corruption exists in Pacemaker versions 1.1.6 to 1.1.9

Everyone is strongly encouraged to upgrade to 1.1.10 or later.

Those using RHEL 6.4 or later (or a RHEL clone) should already have access to 1.1.10 via the normal update channels.

At issue is some faulty logic in a function called tengine_stonith_notify() which can incorrectly add successfully fenced nodes to a list, causing Pacemaker to subsequently erase that node’s status section when the next DC election occurs.

With the status section erased, the cluster thinks that node is safely down and begins starting any services it has on other nodes - despite those already being active.

In order to trigger the logic, the fenced node must:

  1. have been the previous DC
  2. been sufficiently functional to request its own fencing, and
  3. the fencing notification must arrive after the new DC has been elected, but before it invokes the policy engine

Given that this is the first we have heard of the issue since the problem was introduced in August 2011, the above sequence of events is apparently hard to hit under normal conditions.

Logs symptomatic of the issue look as follows:

# grep -e do_state_transition -e reboot  -e do_dc_takeover -e tengine_stonith_notify -e S_IDLE /var/log/corosync.log

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover: 	Marking gandalf, target of a previous stonith action, as clean
Mar 08 08:43:22 [9934] lorien       crmd:     info: do_state_transition: 	State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 08 08:43:28 [9934] lorien       crmd:     info: do_state_transition: 	State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]

Note in particular the final entry from tengine_stonith_notify():

Target may have been our leader gandalf (recorded: <unset>)

If you see this after Taking over DC status for this partition but prior to State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE, then you are likely to have resources running in more than one location after the next DC election.

The issue was fixed during a routine cleanup prior to Pacemaker-1.1.10 in @f30e1e43. However, the implications of what the old code allowed were not fully appreciated at the time.

Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9 was originally published by Andrew Beekhof at That Cluster Guy on March 19, 2014.

DRBDManage release 0.10

Posted in LINBIT Blogs by flip at February 06, 2014 03:11 PM

As already announced in another blog post, we’re preparing a new tool to simplify DRBD administration. Now we’re publishing its first release! Prior to DRBD Manage, in order to deploy a DRBD resource you’d have to create a config file and copy it to all the necessary nodes.  As The Internet says “ain’t nobody got time for that”.  Using DRBD Manage, all you need to do is execute the following command:

drbdmanage new-volume vol0 4 --deploy 3

Here is what happens on the back-end:

  • It chooses three nodes from the available set;
  • drbdmanage creates a 4GiB LV on all these nodes;
  • generates DRBD configuration files;
  • writes the DRBD meta-data into the LV;
  • starts the initial sync, and
  • makes the volume on a node Primary so that it can be used right now.

This process takes only a few seconds.


Please note that there are some things to take into consideration:

  • drbdmanage is a lot to type; however, an alias dm=drbdmanage in your ~/.*shrc takes care of that ;)
  • Initial setup is a bit involved (see the README)
  • You’ll need at least DRBD 9.0.0-pre7.
  • Since both DRBD Manage and DRBD 9 are still under heavy development, there are more than likely some undiscovered bugs. Bug reports, ideas, wishes, or any other feedback are welcome.

Anyway – head over to the DRBD-Manage homepage and fetch your source tarballs (a few packages are prepared, too), or a GIT checkout if you plan to keep up-to-date. For questions please use the drbd-user mailing list; patches, or other development-related topics are welcome on the drbd-dev mailing list.

What do you think? Drop us a note!


DRBD-Manager

Posted in LINBIT Blogs by flip at November 22, 2013 12:41 PM

One of the projects that LINBIT will publish soon is drbdmanage, which allows easy cluster-wide storage administration with DRBD 9.

Every DRBD user knows the drill – create an LV, write a DRBD resource configuration file, create-md, up, initial sync, …

But that is no more.

The new way is this: drbdmanage new-volume r0 50 deploy 4, and here comes your quadruple replicated 50 gigabyte DRBD volume.

This is accomplished by a cluster-wide DRBD volume that holds some drbdmanage data, and a daemon on each node that receives DRBD events from the kernel.

Every time some configuration change is wanted,

  1. drbdmanage writes into the common volume,
  2. causing the other nodes to see the Primary/Secondary events,
  3. so that they know to reload the new configuration,
  4. and act upon it – creating or removing an LV, reconfiguring DRBD, etc.
  5. and, if required, cause an initial sync.

As DRBD 8.4.4 now supports DISCARD/TRIM, the initial sync (on SSDs or Thin LVM) is essentially free – a few seconds is all it takes. (See eg. mkfs.ext4 for a possible user).

Further use cases are various projects that can benefit from a “shared storage” layer – like oVirt, OpenStack, libvirt, etc.
Just imagine using a non-cluster-aware tool like virt-manager to create a new VM, and the storage gets automatically sync’ed across multiple nodes…

Interested? You’ll have to wait for a few weeks, but you can always drop us a line.


Announcing 1.1.11 Beta Testing

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 21, 2013 02:00 PM

With over 400 updates since the release of 1.1.10, it’s time to start thinking about a new release.

Today I have tagged release candidate 1. The most notable fixes include:

  • attrd: Implementation of a truly atomic attrd for use with corosync 2.x
  • cib: Allow values to be added/updated and removed in a single update
  • cib: Support XML comments in diffs
  • Core: Allow blackbox logging to be disabled with SIGUSR2
  • crmd: Do not block on proxied calls from pacemaker_remoted
  • crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
  • crmd: Use the load on our peers to know how many jobs to send them
  • crm_mon: add --hide-headers option to hide all headers
  • crm_report: Collect logs directly from journald if available
  • Fencing: On timeout, clean up the agent’s entire process group
  • Fencing: Support agents that need the host to be unfenced at startup
  • ipc: Raise the default buffer size to 128k
  • PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
  • PE: Allow location constraints to take a regex pattern to match against resource IDs
  • pengine: Distinguish between the agent being missing and something the agent needs being missing
  • remote: Properly version the remote connection protocol
  • services: Detect missing agents and permission errors before forking
  • Bug cl#5171 - pengine: Don’t prevent clones from running due to dependent resources
  • Bug cl#5179 - Corosync: Attempt to retrieve a peer’s node name if it is not already known
  • Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

If you are a user of pacemaker_remoted, you should take the time to read about changes to the online wire protocol that are present in this release.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. If you haven’t already, install Pacemaker’s dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy the rpms and deploy as needed

Announcing 1.1.11 Beta Testing was originally published by Andrew Beekhof at That Cluster Guy on November 21, 2013.

Pacemaker and RHEL 6.4 (redux)

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 04, 2013 09:33 AM

The good news is that as of November 1st, Pacemaker is now supported on RHEL 6.4 - with two caveats.

  1. You must be using the updated pacemaker, resource-agents and pcs packages
  2. You must be using CMAN for membership and quorum (background)

Technically, support is currently limited to Pacemaker’s use in the context of OpenStack. In practice however, any bug that can be shown to affect OpenStack deployments has a good chance of being fixed.

Since a cluster with no services is rather pointless, the heartbeat OCF agents are now also officially supported. However, as Red Hat’s policy is to only ship supported agents, some agents are not present for this initial release.

The three primary reasons for not shipping agents were:

  1. The software the agent controls is not shipped in RHEL
  2. Insufficient experience to provide support
  3. Avoiding agent duplication

Filing bugs is definitely the best way to get agents in the second category prioritized for inclusion.

Likewise, if there is no shipping agent that provides the functionality of agents in the third category (IPv6addr and IPaddr2 might be an example here), filing bugs is the best way to get that fixed.

In the meantime, since most of the agents are just shell scripts, downloading the latest upstream agents is a viable work-around in most cases. For example:

    agents="Raid1 Xen"
    for a in $agents; do wget -O /usr/lib/ocf/resource.d/heartbeat/$a https://github.com/ClusterLabs/resource-agents/raw/master/heartbeat/$a; done

Pacemaker and RHEL 6.4 (redux) was originally published by Andrew Beekhof at That Cluster Guy on November 04, 2013.

Changes to the Remote Wire Protocol in 1.1.11

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at October 31, 2013 12:04 PM

Unfortunately the current wire protocol used by pacemaker_remoted for exchanging messages was found to be suboptimal and we have taken the decision to change it now before it becomes widely adopted.

We attempted to do this in a backwards-compatible manner; however, the two methods we tried were either overly complicated and fragile, or not possible due to the way the released crm_remote_parse_buffer() function operated.

The changes include a versioned binary header that contains the size of the header, payload and total message, control flags and a big/little-endian detector.

These changes will appear in the upstream repo shortly and will ship in 1.1.11. Anyone for whom this will be a problem is encouraged to get in contact to discuss possible options.

For RHEL users, any version on which pacemaker_remoted is supported will have the new versioned protocol. That means 7.0 and potentially a future 6.x release.

Changes to the Remote Wire Protocol in 1.1.11 was originally published by Andrew Beekhof at That Cluster Guy on October 31, 2013.

DRBD and the sync rate controller, part 2

Posted in LINBIT Blogs by flip at October 29, 2013 09:36 AM

As an update to the earlier blog post, take a look below.

As a reminder: this is about resynchronization (ie. recovery after a node or network problem), not about the replication.


If you’ve got a demanding application it’s possible that it completely fills your I/O bandwidth, disk and/or network, leaving no room for the synchronization to complete. To make the synchronization slow down and let the application proceed, DRBD has the dynamically adaptive resync rate controller.

It is enabled by default with 8.4, and disabled by default with 8.3.
To explicitly enable or disable, set c-plan-ahead to 20 (enable) or 0 (disable).

Note that, while enabled, the setting for the old fixed sync rate is used only as an initial guess for the controller. After that, only the c-* settings are used, so changing the fixed sync rate while the controller is enabled won’t have much effect.

What it does

The resync controller tries to use up as much network and disk bandwidth as it can get, but no more than c-max-rate, and throttles if either

  • more resync requests are in flight than what amounts to c-fill-target
  • it detects application IO (read or write), and the current estimated resync rate is above c-min-rate.

The default c-min-rate with 8.4.x is 250 kiB/sec (the old default of the fixed sync-rate), with 8.3.x it was 4MiB/sec.

This “throttle if application IO is detected” is active even if the fixed sync rate is used. You can (but should not, see below) disable this specific throttling by setting c-min-rate to 0.

Tuning the resync controller

It’s hard, or next to impossible, for DRBD to detect how much activity your backend can handle. But it is very easy for DRBD to know how much resync-activity it causes itself.
So, you tune how much resync-activity you allow during periods of application activity.

To do that you should

  • set c-plan-ahead to 20 (default with 8.4), or more if there’s a lot of latency on the connection (WAN link with protocol A);
  • leave the fixed resync rate (the initial guess for the controller) at about 30% or less of what your hardware can handle;
  • set c-max-rate to 100% (or slightly more) of what your hardware can handle;
  • set c-fill-target to the minimum (just as high as necessary) that gets your hardware saturated, if the system is otherwise idle.
    In other words, figure out the maximum possible resync rate in your setup while the system is idle, then set c-fill-target to the minimum setting that still reaches that rate.
  • And finally, while checking application request latency/responsiveness, tune c-min-rate to the maximum that still allows for acceptable responsiveness.
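
Putting those knobs together, a hedged drbd.conf fragment (DRBD 8.4 option names; every number below is purely illustrative and depends entirely on your hardware) might look like:

resource r0 {
  disk {
    c-plan-ahead   20;    # enable the dynamic controller
    resync-rate    30M;   # only the initial guess while the controller is active
    c-max-rate     110M;  # roughly what the hardware can sustain
    c-fill-target  1M;    # just enough in-flight data to saturate the link when idle
    c-min-rate     4M;    # resync floor while application IO is detected
  }
  # ... existing volume/host sections ...
}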

Most parts of this post were originally published as an ML post by Lars.
You can find additional information in the drbd.conf manpage.


DRBD Proxy 3.1: Performance improvements

Posted in LINBIT Blogs by flip at October 14, 2013 06:35 AM

The threading model in DRBD Proxy 3.1 received a complete overhaul; below you can see the performance implications of these changes. First of all, as the old model suffered from the conflicting requirements of low latency for the meta-data connections vs. high bandwidth for the data connections, a second set of pthreads has been added. The first set runs at the (normally negative) nice level the DRBD Proxy process is started at, while the second set, in order to be “nicer” to the other processes, adds +10 to the nice level and therefore gets a smaller chunk of the CPU time.

Secondly, the internal processing has been changed, too. This isn’t visible externally, of course – you can only notice the performance improvements.

DRBD Proxy 3.1 buffer usage

In the example graph above a few sections can be clearly seen:

  • From 0 to about 11.5 seconds the Proxy buffer gets filled. In case anyone’s interested, here’s the dd output:
    3712983040 Bytes (3.7 GB) copied, 11.4573 s, 324 MB/s
  • Until about 44 seconds, lzma compression is active, with a single context. Slow, but it compresses best.
  • Then I switched to zlib; this is a fair bit faster. All cores are being used, so external requests (by some VMs and other processes) show up as irregular spikes. (Different compression ratios for various input data are “at fault”, too.)
  • At 56 seconds the compression is turned off completely; the time needed for the rest of the data (3GiB in about 13 seconds) shows the bonded-ethernet bandwidth of about 220MB/sec.

For two sets of test machines a plausible rate for transferring large blocks into the Proxy buffers is 450-500MiB/sec.
For small buffers there are a few code paths that are not fully optimized yet; further improvements are to be expected in the next versions, too.

The roadmap for the near future includes a shared memory pool for all connections and WAN bandwidth shaping (ie. limitation to some configured value) — and some more ideas that have to be researched first.

Opinions? Contact us!


Pacemaker 1.1.10 - final

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 26, 2013 10:11 AM

Announcing the release of Pacemaker 1.1.10

There were three changes of note since rc7:

  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cman: Do not pretend we know the state of nodes we’ve never seen

Along with assorted bug fixes, the major topics for this release were:

  • stonithd fixes
  • fixing memory leaks, often caused by incorrect use of glib reference counting
  • supportability improvements (code cleanup and deduplication, standardized error codes)

Release candidates for the next Pacemaker release (1.1.11) can be expected some time around November.

A big thank you to everyone who spent time testing the release candidates and/or contributed patches. However, now that Pacemaker is perfect, anyone reporting bugs will be shot :-)

To build rpm packages:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make release
    
  4. Copy and deploy as needed

Details - 1.1.10 - final

Changesets  602
Diff 143 files changed, 8162 insertions(+), 5159 deletions(-)


Highlights

Features added since Pacemaker-1.1.9

  • Core: Convert all exit codes to positive errno values
  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Allow options to be set recursively
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • PE: Suppress meaningless IDs when displaying anonymous clone status
  • Turn off auto-respawning of systemd services when the cluster starts them
  • Bug cl#5128 - pengine: Support maintenance mode for a single node

Changes since Pacemaker-1.1.9

  • crmd: cib: stonithd: Memory leaks resolved and improved use of glib reference counting
  • attrd: Fixes deleted attributes during dc election
  • Bug cf#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5148 - legacy: Correctly remove a node that used to have a different nodeid
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Bug cl#5152 - crmd: Correctly clean up fenced nodes during membership changes
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • Bug cl#5155 - pengine: Block the stop of resources if any depending resource is unmanaged
  • Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • Bug cl#5164 - crmd: Fixes crash when using pacemaker-remote
  • Bug cl#5164 - pengine: Fixes segfault when calculating transition with remote-nodes.
  • Bug cl#5167 - crm_mon: Only print “stopped” node list for incomplete clone sets
  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cib: Restore the ability to embed comments in the configuration
  • cluster: Detect and warn about node names with capitals
  • cman: Do not pretend we know the state of nodes we’ve never seen
  • cman: Do not unconditionally start cman if it is already running
  • cman: Support non-blocking CPG calls
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Ensure removed peers are erased from all caches
  • corosync: Nodes that can persist in sending CPG messages must be alive afterall
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crmd: Store last-run and last-rc-change for all operations
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Restore the ability to manually confirm that fencing completed
  • ipc: Allow unprivileged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: Support “crm_node --remove” with a node name for corosync plugin (bnc#805278)
  • lrmd: Default to the upstream location for resource agent scratch directory
  • lrmd: Pass errors from lsb metadata generation back to the caller
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (ie. promote for a clone)
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd
  • systemd: Reload systemd after adding/removing override files for cluster services
  • xml: Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy

Pacemaker 1.1.10 - final was originally published by Andrew Beekhof at That Cluster Guy on July 26, 2013.

Release candidate: 1.1.10-rc7

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 22, 2013 10:50 AM

Announcing the seventh release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes to the policy engine, fencing daemon and crmd. We’ve squashed a bug involving constructing compressed messages and stonith-ng can now recover when a configuration ordering change is detected.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc7

Changesets  57
Diff 37 files changed, 414 insertions(+), 331 deletions(-)


Highlights

### Features added in Pacemaker-1.1.10-rc7

  • N/A

Changes since Pacemaker-1.1.10-rc6

  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • Bug cl#5164 - crmd: Fixes crmd crash when using pacemaker-remote
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cluster: Correctly construct the header for compressed messages
  • cluster: Detect and warn about node names with capitals
  • Core: remove mainloop_trigger objects that are no longer needed
  • corosync: Ensure removed peers are erased from all caches
  • cpg: Correctly free sent messages
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crm_mon: Bug cl#5167 - Only print “stopped” node list for incomplete clone sets
  • crm_node: Return 0 if --remove passed
  • fencing: Correctly detect existing device entries when registering a new one
  • lrmd: Prevent use-of-NULL in client library
  • pengine: cl5164 - Fixes pengine segfault when calculating transition with remote-nodes.
  • pengine: Do the right thing when admins specify the internal resource instead of the clone
  • pengine: Re-allow ordering constraints with fencing devices now that it is safe to do so

Release candidate: 1.1.10-rc7 was originally published by Andrew Beekhof at That Cluster Guy on July 22, 2013.

Release candidate: 1.1.10-rc6

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 04, 2013 04:46 PM

Announcing the sixth release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes in the policy engine, fencing daemon and crmd. Previous fixes in rc5 have also now been confirmed.

Help is specifically requested for testing plugin-based clusters, ACLs, the --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

There is one bug open for David’s remote nodes feature (involving managing services on non-cluster nodes), but everything else seems good.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc6

Changesets  63
Diff 24 files changed, 356 insertions(+), 133 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc6

  • tools: crm_mon --neg-location drbd-fence-by-handler
  • pengine: cl#5128 - Support maintenance mode for a single node

Changes since Pacemaker-1.1.10-rc5

  • cluster: Correctly remove duplicate peer entries
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • pengine: Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (ie. promote for a clone)
  • pengine: Do the right thing when admins specify the internal resource instead of the clone

Release candidate: 1.1.10-rc6 was originally published by Andrew Beekhof at That Cluster Guy on July 04, 2013.

GPG Quickstart

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 24, 2013 02:40 PM

It seemed timely that I should refresh both my GPG knowledge and my keys. I am summarizing my method (and sources) below in the event that they may prove useful to others:

Preparation

The following settings ensure that any keys you create in the future are strong ones by 2013’s standards. Paste the following into ~/.gnupg/gpg.conf:

# when multiple digests are supported by all recipients, choose the strongest one:
personal-digest-preferences SHA512 SHA384 SHA256 SHA224
# preferences chosen for new keys should prioritize stronger algorithms: 
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 BZIP2 ZLIB ZIP Uncompressed
# when making an OpenPGP certification, use a stronger digest than the default SHA1:
cert-digest-algo SHA512

The next batch of settings is optional but aims to improve the output of gpg commands in various ways - particularly against spoofing. Again, paste them into ~/.gnupg/gpg.conf:

# when outputting certificates, view user IDs distinctly from keys:
fixed-list-mode
# long keyids are more collision-resistant than short keyids (it's trivial to make a key with any desired short keyid)
keyid-format 0xlong
# If you use a graphical environment (and even if you don't) you should be using an agent:
# (similar arguments as  https://www.debian-administration.org/users/dkg/weblog/64)
use-agent
# You should always know at a glance which User IDs gpg thinks are legitimately bound to the keys in your keyring:
verify-options show-uid-validity
list-options show-uid-validity
# include an unambiguous indicator of which key made a signature:
# (see http://thread.gmane.org/gmane.mail.notmuch.general/3721/focus=7234)
sig-notation issuer-fpr@notations.openpgp.fifthhorseman.net=%g

Create a New Key

There are several checks for deciding if your old key(s) are any good. However, if you created a key more than a couple of years ago, then realistically you probably need a new one.

I followed instructions from Ana Guerrero’s post, which were the basis of the current debian guide, but selected the 2013 default key type:

  1. run gpg --gen-key
  2. Select (1) RSA and RSA (default)
  3. Select a keysize greater than 2048
  4. Set a key expiration of 2-5 years. [rationale]
  5. Do NOT specify a comment for User ID. [rationale]
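
For reference, the same choices can be made non-interactively with gpg’s unattended key generation. This is only a sketch - the name and address are placeholders, and depending on your gpg version you may still be prompted for a passphrase:

# placeholder identity; no comment field, as per the advice above
cat > keyparams <<'EOF'
Key-Type: RSA
Key-Length: 4096
Subkey-Type: RSA
Subkey-Length: 4096
Name-Real: Your Name
Name-Email: you@example.com
Expire-Date: 2y
%commit
EOF
gpg --batch --gen-key keyparams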

Adding Additional UIDs and Setting a Default

At this point my keyring (as shown by gpg --list-keys) looked like this:

pub   4096R/0x726724204C644D83 2013-06-24
uid                 [ultimate] Andrew Beekhof <andrew@beekhof.net>
sub   4096R/0xC88100891A418A6B 2013-06-24 [expires: 2015-06-24]

Like most people, I have more than one email address and I will want to use GPG with them too. So now is the time to add them to the key. You’ll want the gpg --edit-key command for this. Ana has a good example of adding UIDs and setting a preferred one. Just search her instructions for Add other UID.
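
The edit session looks roughly like this (an annotated sketch; the UID number is whatever gpg assigns to the address you want as the default):

gpg --edit-key 0x726724204C644D83
gpg> adduid       # prompts for the extra name and email address
gpg> uid 2        # select the UID you want as the default
gpg> primary      # mark it as the primary UID
gpg> save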

Separate Subkeys for Encryption and Signing

The general consensus is that separate keys should be used for signing versus encryption.

tl;dr - you want to be able to encrypt things without signing them as “signing” may have unintended legal implications. There is also the possibility that signed messages can be used in an attack against encrypted data.

By default gpg will create a subkey for encryption, but I followed Debian’s subkey guide for creating one for signing too (instead of using the private master key).

Doing this allows you to make your private master key even safer by removing it from your day-to-day keychain.

The idea is to make a copy first and keep it in an even more secure location, so that if a subkey (or the machine it’s on) gets compromised, your master key remains safe and you are always in a position to revoke subkeys and create new ones.
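
Roughly, the approach from the Debian guide boils down to something like the following - the key ID is the one from above, the paths are placeholders, and newer gpg versions store secret keys a little differently:

gpg --edit-key 0x726724204C644D83    # then: addkey, choose "RSA (sign only)", then save
gpg --export-secret-keys 0x726724204C644D83 > /safe/offline/master-and-subkeys.gpg
gpg --export-secret-subkeys 0x726724204C644D83 > subkeys-only.gpg
gpg --delete-secret-key 0x726724204C644D83   # only after verifying the backup!
gpg --import subkeys-only.gpg                # day-to-day keyring now holds the subkeys only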

Sign the New Key with the Old One

If you have an old key, you should sign the new one with it. This tells everyone who trusted the old key that the new one is legitimate and can therefore also be trusted.

Here I went back to Ana’s instructions. Basically:

gpg --default-key OLDKEY --sign-key NEWKEY

or, in my case:

gpg --default-key 0xEC3584EFD449E59A --sign-key 0x726724204C644D83

Send it to a Key Server

Tell the world so they can verify your signature and send you encrypted messages:

gpg --send-key 0x726724204C644D83

Revoking Old UIDs

If you’re like me, your old key might have some addresses which you have left behind. You can’t remove addresses from your keys, but you can tell the world to stop using them.

To do this for my old key, I followed instructions on the gnupg mailing list
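
In short, it is an interactive edit of the old key followed by a fresh upload; something like:

gpg --edit-key 0xEC3584EFD449E59A
gpg> uid 2        # select the address you no longer use
gpg> revuid       # confirm the revocation
gpg> save
gpg --send-key 0xEC3584EFD449E59A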

Everything still looks the same when you search for my old key:

pub  1024D/D449E59A 2007-07-20 Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@suse.de>
                               Andrew Beekhof <beekhof@gmail.com>
                               Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <abeekhof@novell.com>
	 Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

But if you click through to the key details, you’ll see the addresses associated with my time at Novell/SuSE now show revok in red.

pub  1024D/D449E59A 2007-07-20            
	 Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

uid Andrew Beekhof <beekhof@mac.com>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]

uid Andrew Beekhof <abeekhof@suse.de>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]
sig revok  D449E59A 2013-06-24 __________ __________ [selfsig]
...

This is how other people’s copy of gpg knows not to use this key for that address anymore, and also why it’s important to refresh your keys periodically.

Revoking Old Keys

Realistically though, you probably don’t want people using old and potentially compromised (or compromise-able) keys to send you sensitive messages. The best thing to do is revoke the entire key.

Since keys can’t be removed once you’ve uploaded them, you’re actually updating the existing entry. To do this you need the original private key - so keep it safe!

Some people advise you to pre-generate the revocation certificate - personally that seems like just one more thing to keep track of.

Orphaned keys that can’t be revoked still appear valid to anyone wanting to send you a secure message - a good reason to set an expiry date as a failsafe!
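
Assuming you still have the private key, the revocation itself is only a few commands (the filename is just an example):

gpg --gen-revoke 0x32794AE9DABA170E > daba170e-revocation.asc
gpg --import daba170e-revocation.asc
gpg --send-key 0x32794AE9DABA170E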

This is what one of my old revoked keys looks like:

pub  1024D/DABA170E 2004-10-11 *** KEY REVOKED *** [not verified]
                               Andrew Beekhof (SuSE VPN Access) <andrew@beekhof.net>
	 Fingerprint=9A53 9DBB CF73 AB8F B57B  730A 3279 4AE9 DABA 170E 

Final Result

My new key:

pub  4096R/4C644D83 2013-06-24 Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@redhat.com>
	 Fingerprint=C503 7BA2 D013 6342 44C0  122C 7267 2420 4C64 4D83 

Closing word

I am by no means an expert at this, so I would be very grateful to hear about any mistakes I may have made above.

GPG Quickstart was originally published by Andrew Beekhof at That Cluster Guy on June 24, 2013.

Release candidate: 1.1.10-rc5

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 19, 2013 10:32 AM

Let’s try this again… Announcing the fourth and a half release candidate for Pacemaker 1.1.10

I previously tagged rc4 but ended up making several changes shortly afterwards, so it was pointless to announce it.

This RC is a result of cleanup work in several ancient areas of the codebase:

  • A number of internal membership caches have been combined
  • The three separate CPG code paths have been combined

As well as:

  • Moving clones is now both possible and sane
  • Improved behavior on systemd based nodes
  • and other assorted bugfixes (see below)

Please keep the bug reports coming in!

Help is specifically requested for testing plugin-based clusters, ACLs, the new --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

Also any light that can be shed on possible memory leaks would be much appreciated.

If everything looks good in a week from now, I will re-tag rc5 as final.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc5

Changesets  168
Diff 96 files changed, 4983 insertions(+), 3097 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc5

  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • Turn off auto-respawning of systemd services when the cluster starts them

Changes since Pacemaker-1.1.10-rc3

  • Bug pengine: cl#5155 - Block the stop of resources if any depending resource is unmanaged
  • Convert all exit codes to positive errno values
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Nodes that can persist in sending CPG messages must be alive after all
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Everyone who gets a fencing notification should mark the node as down
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Update the status section with details of nodes for which we only know the nodeid
  • crm_report: Find logs in compressed files
  • logging: If SIGTRAP is sent before tracing is turned on, turn it on
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd

Release candidate: 1.1.10-rc5 was originally published by Andrew Beekhof at That Cluster Guy on June 19, 2013.

“umount is too slow”

Posted in LINBIT Blogs by flip at May 27, 2013 07:31 AM

A question we see over and over again is

Why is umount so slow? Why does it take so long?

Part of the answer was already given in an earlier blog post; here’s some more explanation.

The write() syscall typically writes into RAM only. In Linux we call that “page cache”, or “buffer cache”, depending on what exactly the actual target of the write() system call was.

From that RAM (cache inside the operating system, high in the IO stack) the operating system periodically does write-outs, at its leisure, unless it is urged to write out particular pieces (or all of it) now.

A sync (or fsync(), or fdatasync(), or …) does exactly that: it urges the operating system to do the write out.
A umount also causes a write out of all not yet written data of the affected file system.

Note:

  • Of course the “performance” of writes that go into volatile RAM only will be much better than anything that goes to stable, persistent, storage. All things that have only been written to cache but not yet synced (written out to the block layer) will be lost if you have a power outage or server crash.
    The linux block layer has never seen these changes, DRBD has never seen these changes, they cannot possibly be replicated anywhere.
    Data will be lost.

There are also controller caches which may or may not be volatile, and disk caches, which typically are volatile. These are below and outside the operating system, and not part of this discussion. Just make sure you disable all volatile caches on that level.

Now, for a moment, assume

  • you don’t have DRBD in the stack, and
  • a moderately capable IO backend that writes, say, 300 MByte/s, and
  • around 3 GiByte of dirty data around at the time you trigger the umount, and
  • you are not seek-bound, so your backend can actually reach that 300 MB/s,

you get a umount time of around 10 seconds.


Still with me?

Ok. Now, introduce DRBD to your IO stack, and add a long distance replication link. Just for the sake of me trying to explain it here, assume that because it is long distance and you have a limited budget, you can only afford 100 MBit/s. And “long distance” implies larger round trip times, so let’s assume we have an RTT of 100 ms.

Of course that would introduce a single IO request latency of > 100 ms for anything but DRBD protocol A, so you opt for protocol A. (In other words, using protocol A “masks” the RTT of the replication link from the application-visible latency.)

That was latency.

But, the limited bandwidth of that replication link also limits your average sustained write throughput, in the given example to about 11 MiByte/s.
The same 3 GByte of dirty data would now drain much more slowly; in fact, that same umount would now take not 10 seconds, but about 5 minutes.
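
Spelled out, the back-of-the-envelope numbers for this example are:

3 GiB of dirty data            ~ 3072 MiB
local backend at 300 MB/s:       3072 / 300 ~ 10 seconds
100 MBit/s replication link:     ~ 11 MiByte/s sustained
same data over that link:        3072 / 11  ~ 280 seconds, i.e. roughly 5 minutes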

You can also take a look at a drbd-user mailing list post.


So, concluding: try to avoid having much unsaved data in RAM; it might bite you. For example, you want your cluster to do a switchover, but the umount takes too long and a timeout hits: the node (should) get fenced, and the data not written to stable storage will be lost.

Please follow the advice about setting some sysctls to start write-out earlier!
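
The usual knobs for starting write-out earlier are the vm.dirty_* sysctls. The values below are purely illustrative and need to be tuned for your hardware and workload (see the earlier blog post for the reasoning):

# start background write-out much earlier than the default
sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
# upper limit on dirty data before writers are throttled
sysctl -w vm.dirty_bytes=$((256*1024*1024))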

Rolling updates with Ansible and Apache reverse proxies

Posted in Arrfab's Blog » Cluster by fabian.arrotin at May 23, 2013 04:36 PM

It's not a secret anymore that I use Ansible to do a lot of things. That goes from simple "one shot" actions with ansible on multiple nodes to "configuration management and deployment tasks" with ansible-playbook. One of the things I really like about Ansible is the fact that it's also a great orchestration tool.

For example, in some WSOA flows you can have a bunch of servers behind load balancer nodes. When you want to put a backend node/web server node in maintenance mode (to change configuration/update package/update app/whatever), you just "remove" that node from the production flow, do what you need to do, verify it's up again and put that node back in production. The principle of "rolling updates" is then interesting as you still have 24/7 flows in production.

But what if you're not in charge of the whole infrastructure? For example, you're in charge of some servers, but not of the load balancers in front of them. Let's consider the following situation, and how we'll use Ansible to still disable/enable a backend server behind Apache reverse proxies.

So here is the (simplified) situation: two Apache reverse proxies (using the mod_proxy_balancer module) are used to load balance traffic to four backend nodes (JBoss in our simplified case). We can't directly touch those upstream Apache nodes, but we can still interact with them, thanks to the fact that "balancer manager support" is active (and protected!).

Let's have a look at a (simplified) ansible inventory file :

[jboss-cluster]
jboss-1
jboss-2
jboss-3
jboss-4

[apache-group-1]
apache-node-1
apache-node-2
Let's now create a generic (write once, use many times) task to disable a backend node from Apache:

---
##############################################################################
#
# This task can be included in a playbook to pause a backend node
# being load balanced by Apache Reverse Proxies
# Several variables need to be defined :
#   - ${apache_rp_backend_url} : the URL of the backend server, as known by Apache server
#   - ${apache_rp_backend_cluster} : the name of the cluster as defined on the Apache RP (the group the node is member of)
#   - ${apache_rp_group} : the name of the group declared in hosts.cfg containing Apache Reverse Proxies
#   - ${apache_rp_user}: the username used to authenticate against the Apache balancer-manager
#   - ${apache_rp_password}: the password used to authenticate against the Apache balancer-manager
#   - ${apache_rp_balancer_manager_uri}: the URI where to find the balancer-manager Apache mod
#
##############################################################################
- name: Disabling the worker in Apache Reverse Proxies
  local_action: shell /usr/bin/curl -k --user ${apache_rp_user}:${apache_rp_password} "https://${item}/${apache_rp_balancer_manager_uri}?b=${apache_rp_backend_cluster}&w=${apache_rp_backend_url}&nonce=$(curl -k --user ${apache_rp_user}:${apache_rp_password} https://${item}/${apache_rp_balancer_manager_uri} |grep nonce|tail -n 1|cut -f 3 -d '&'|cut -f 2 -d '='|cut -f 1 -d '"')&dw=Disable"
  with_items: ${groups.${apache_rp_group}}

- name: Waiting 20 seconds to be sure no traffic is being sent anymore to that worker backend node
  pause: seconds=20

The interesting bit is the with_items one: it will use the apache_rp_group variable to know which Apache servers are used upstream (assuming you can have multiple nodes/clusters) and will run that command for every host in the list obtained from the inventory!

We can now, in the "rolling updates" playbook, just include the previous task (assuming we saved it as ../tasks/apache-disable-worker.yml):

---
- hosts: jboss-cluster
  serial: 1
  user: root
  tasks:
    - include: ../tasks/apache-disable-worker.yml
    - etc/etc ...
    - wait_for: port=8443 state=started
    - include: ../tasks/apache-enable-worker.yml

But wait! As you've seen, we still need to declare some variables: let's do that in the inventory, under group_vars and host_vars!

group_vars/jboss-cluster:

# Apache reverse proxies settings
apache_rp_group: apache-group-1
apache_rp_user: my-admin-account
apache_rp_password: my-beautiful-pass
apache_rp_balancer_manager_uri: balancer-manager-hidden-and-redirected

host_vars/jboss-1:

apache_rp_backend_url: 'https://jboss1.myinternal.domain.org:8443'
apache_rp_backend_cluster: nameofmyclusterdefinedinapache

Now when we use that playbook, we'll have a local action that interacts with the balancer manager to disable that backend node while we do maintenance.

I'll let you imagine (and create) a ../tasks/apache-enable-worker.yml file to enable it again (which you'll call at the end of your playbook); a rough sketch of the underlying call is below.
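
The call is simply the mirror image of the disable one. A hedged sketch, with ${apache_rp_node} and ${nonce} standing in for the looped host and the nonce scraped exactly as in the disable task (the dw=Enable parameter mirrors the dw=Disable used earlier and may differ between Apache versions):

curl -k --user ${apache_rp_user}:${apache_rp_password} \
  "https://${apache_rp_node}/${apache_rp_balancer_manager_uri}?b=${apache_rp_backend_cluster}&w=${apache_rp_backend_url}&nonce=${nonce}&dw=Enable"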

Release candidate: 1.1.10-rc3

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 23, 2013 10:32 AM

Announcing the third release candidate for Pacemaker 1.1.10

This RC is a result of work in several problem areas reported by users, some of which date back to 1.1.8:

  • manual fencing confirmations
  • potential problems reported by Coverity
  • the way anonymous clones are displayed
  • handling of resource output that includes non-printing characters
  • handling of on-fail=block

Please keep the bug reports coming in. There is a good chance that this will be the final release candidate and 1.1.10 will be tagged on May 30th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc3

Changesets  116
Diff 59 files changed, 707 insertions(+), 408 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc3

  • PE: Display a list of nodes on which stopped anonymous clones are not active instead of meaningless clone IDs
  • PE: Suppress meaningless IDs when displaying anonymous clone status

Changes since Pacemaker-1.1.10-rc2

  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • cib: CID#1023858 - Explicit null dereferenced
  • cib: CID#1023862 - Improper use of negative value
  • cib: CID#739562 - Improper use of negative value
  • cman: Our daemons have no need to connect to pacemakerd in a cman based cluster
  • crmd: Do not record pending delete operations in the CIB
  • crmd: Ensure pending and lost actions have values for last-run and last-rc-change
  • crmd: Insert async failures so that they appear in the correct order
  • crmd: Store last-run and last-rc-change for fail operations
  • Detect child processes that terminate before our SIGCHLD handler is installed
  • fencing: CID#739461 - Double close
  • fencing: Correctly broadcast manual fencing ACKs
  • fencing: Correctly mark manual confirmations as complete
  • fencing: Do not send duplicate replies for manual confirmation operations
  • fencing: Restore the ability to manually confirm that fencing completed
  • lrmd: CID#1023851 - Truncated stdio return value
  • lrmd: Don’t complain when heartbeat invokes us with -r
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • xml: Restore the ability to embed comments in the cib

Release candidate: 1.1.10-rc3 was originally published by Andrew Beekhof at That Cluster Guy on May 23, 2013.

Pacemaker Logging

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 20, 2013 01:43 PM

Normal operation

Pacemaker inherits most of its logging settings from either CMAN or Corosync - depending on what it’s running on top of.

In order to avoid spamming syslog, Pacemaker only logs a summary of its actions (NOTICE and above) to syslog.

If the level of detail in syslog is insufficient, you should enable a cluster log file. Normally one is configured by default and it contains everything except debug and trace messages.

To find the location of this file, either examine your CMAN (cluster.conf) or Corosync (corosync.conf) configuration file or look for syslog entries such as:

pacemakerd[1823]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log

If you do not see a line like this, either update the cluster configuration or set PCMK_debugfile in /etc/sysconfig/pacemaker

crm_report also knows how to find all the Pacemaker related logs and blackbox files

If the level of detail in the cluster log file is still insufficient, or you simply wish to go blind, you can turn on debugging in Corosync/CMAN, or set PCMK_debug in /etc/sysconfig/pacemaker.

A minor advantage of setting PCMK_debug is that the value can be a comma-separated list of processes which should produce debug logging instead of a global yes/no.
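
For example, a hypothetical /etc/sysconfig/pacemaker snippet could look like this (the log path is just an illustration):

# send detailed logs to a dedicated file
PCMK_debugfile=/var/log/pacemaker-debug.log
# enable debug logging only for selected daemons (or use "yes" for all of them)
PCMK_debug=crmd,pengine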

When an ERROR occurs

Pacemaker includes support for a blackbox.

When enabled, the blackbox contains a rolling buffer of all logs (not just those sent to syslog or a file) and is written to disk after a crash or assertion failure.

The blackbox recorder can be enabled by setting PCMK_blackbox in /etc/sysconfig/pacemaker or at runtime by sending SIGUSR1. Eg.

killall -USR1 crmd

When enabled you’ll see a log such as:

crmd[1811]:   notice: crm_enable_blackbox: Initiated blackbox recorder: /var/lib/pacemaker/blackbox/crmd-1811

If a crash occurs, the blackbox will be available at that location. To extract the contents, pass it to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811

Which produces output like:

Dumping the contents of /var/lib/pacemaker/blackbox/crmd-1811
[debug] shm size:5242880; real_size:5242880; rb->word_size:1310720
[debug] read total of: 5242892
Ringbuffer:
 ->NORMAL
 ->write_pt [5588]
 ->read_pt [0]
 ->size [1310720 words]
 =>free [5220524 bytes]
 =>used [22352 bytes]
...
trace   May 19 23:20:55 gio_read_socket(368):0: 0x11ab920.5 1 (ref=1)
trace   May 19 23:20:55 pcmk_ipc_accept(458):0: Connection 0x11aee00
info    May 19 23:20:55 crm_client_new(302):0: Connecting 0x11aee00 for uid=0 gid=0 pid=24425 id=0e943a2a-dd64-49bc-b9d5-10fa6c6cb1bd
debug   May 19 23:20:55 handle_new_connection(465):2147483648: IPC credentials authenticated (24414-24425-14)
...
[debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header

When an ERROR occurs you’ll also see the function and line number that produced it such as:

crmd[1811]: Problem detected at child_death_dispatch:872 (mainloop.c), please see /var/lib/pacemaker/blackbox/crmd-1811.1 for additional details
crmd[1811]: Problem detected at main:94 (crmd.c), please see /var/lib/pacemaker/blackbox/crmd-1811.2 for additional details

Again, simply pass the files to qb-blackbox to extract and query the contents.

Note that a counter is added to the end so as to avoid name collisions.

Diving into files and functions

In case you have not already guessed, all logs include the name of the function that generated them. So:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

came from the function crm_update_peer_state().

To obtain more detail from that or any other function, you can set PCMK_trace_functions in /etc/sysconfig/pacemaker to a comma separated list of function names. Eg.

PCMK_trace_functions=crm_update_peer_state,run_graph

For a bigger stick, you may also activate trace logging for all the functions in a particular source file or files by setting PCMK_trace_files as well.

PCMK_trace_files=cluster.c,election.c

These additional logs are sent to the cluster log file. Note that enabling tracing options also alters the output format.

Instead of:

crmd:  notice: crm_cluster_connect: 	Connecting to cluster infrastructure: cman

the output includes file and line information:

crmd: (   cluster.c:215   )  notice: crm_cluster_connect: 	Connecting to cluster infrastructure: cman

But wait there’s still more

Still need more detail? You’re in luck! The blackbox can be dumped at any time, not just when an error occurs.

First, make sure the blackbox is active (we’ll assume it’s the crmd that needs to be debugged):

killall -USR1 crmd

Next, discard any previous contents by dumping them to disk

killall -TRAP crmd

Now cause whatever condition you’re trying to debug, and send -TRAP again when you’re ready to see the result.

killall -TRAP crmd

You can now look for the result in syslog:

grep -e crm_write_blackbox: /var/log/messages

This will include a filename containing the trace logging:

crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.1 for contents
crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.2 for contents

To extract the trace logging for our test, pass the most recent file to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811.2

At this point you’ll probably want to use grep :)

Pacemaker Logging was originally published by Andrew Beekhof at That Cluster Guy on May 20, 2013.

Debugging the Policy Engine

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 20, 2013 11:49 AM

Finding the right node

The Policy Engine is the component that takes the cluster’s current state, decides on the optimal next state and produces an ordered list of actions to achieve that state.

You can get a summary of what the cluster did in response to resource failures and nodes joining/leaving the cluster by looking at the logs from pengine:

grep -e pengine\\[ -e pengine: /var/log/messages

Although the pengine process is active on all cluster nodes, it is only doing work on one of them. The “active” instance is chosen through the crmd’s DC election process and may move around as nodes leave/join the cluster.

If you do not see anything from pengine at the time the problem occurs, continue to the next machine.

If you do not see anything from pengine on any node, check your cluster configuration to see whether logging to syslog is enabled, and check the syslog configuration to see where it is being sent. If in doubt, refer to Pacemaker Logging.

Once you have located the correct node to investigate, the first thing to do is look for the terms ERROR and WARN, eg.

grep -e pengine\\[ -e pengine: /var/log/messages | grep -e ERROR -e WARN

This will highlight any problems the software encountered.

Next expand the query to all pengine logs:

grep -e pengine\\[ -e pengine: /var/log/messages

The output will look a little like:

pengine[6132]:   notice: LogActions: Move	 mysql	(Started corosync-host-1 -> corosync-host-4)
pengine[6132]:   notice: LogActions: Start   www	(corosync-host-6)
pengine[6132]:   notice: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-4424.bz2
pengine[6132]:   notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-4425.bz2

In the above logs, transition 7 resulted in mysql being moved and www being started. Later, transition 8 occurred but everything was where it should be and no action was required.

Other notable entries include:

pengine[6132]:  warning: cluster_status: We do not have quorum - fencing and resource management disabled
pengine[6132]:   notice: stage6: Scheduling Node corosync-host-1 for shutdown
pengine[6132]:  warning: stage6: Scheduling Node corosync-host-8 for STONITH

as well as

pengine[6132]:   notice: LogActions: Start   Fencing      (corosync-host-1 - blocked)

which indicates that the cluster would like to start the Fencing resource, but some dependency is not satisfied.

pengine[6132]:  warning: determine_online_status: Node corosync-host-8 is unclean

which indicates that either corosync-host-8 has failed, or a resource on it has failed to stop when requested.

pengine[6132]:  warning: unpack_rsc_op: Processing failed op monitor for www on corosync-host-4: unknown error (1)

which indicates a health check for the www resource failed with a return code of 1 (aka. OCF_ERR_GENERIC). See Pacemaker Explained for more details on OCF return codes.

  • Is there anything from the Policy Engine at about the time of the problem?
    If not, go back to the crmd logs and see why no recovery was attempted.

  • Did pengine log why something happened? does that sound correct?
    Excellent, thanks for playing.

Getting more detail from the Policy Engine

The job performed by the Policy Engine is a very complex and frequent task, so to avoid filling up the disk with logs, it only indicates what it is doing and rarely the reason why. Normally the why can be found in the crmd logs, but it also saves the current state (the cluster configuration and the state of all resources) to disk for situations when it can’t.

These files can later be replayed using crm_simulate with a higher level of verbosity to diagnose issues and, as part of our regression suite, to make sure they stay fixed afterwards.

Finding these state files is a matter of looking for logs such as

crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
pengine[1810]:   notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-473.bz2

The “correct” entry will depend on the context of your query.

Please note, sometimes events occur while the pengine is performing its calculation. In this situation, the calculation logged by process_pe_message() is discarded and a new one performed. As a result, not all transitions/files listed by the pengine process are executed by the crmd.

After obtaining the file named by run_graph() or process_pe_message(), either directly or from a crm_report archive, pass it to crm_simulate which will display its view of the cluster at that time:

crm_simulate --xml-file ./pe-input-473.bz2
  • Does the cluster state look correct?

    If not, file a bug. It is possible we have misparsed the state of the resources; any calculation we make based on this would therefore also be wrong.

Next, see what recovery actions the cluster thinks need to be performed:

crm_simulate --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --run

In addition to the normal output, this command creates:

  • problem.graph, the ordered graph of actions, their parameters and prerequisites
  • problem.dot, a more human readable version of the same graph focussed on the action ordering.

Open problem.dot in dotty or graphviz to obtain a graphical representation:

  • Arrows indicate ordering dependencies
  • Dashed-arrows indicate dependencies that are not present in the transition graph
  • Actions with a dashed border of any color do not form part of the transition graph
  • Actions with a green border form part of the transition graph
  • Actions with a red border are ones the cluster would like to execute but cannot run
  • Actions with a blue border are ones the cluster does not feel need to be executed
  • Actions with orange text are pseudo/pretend actions that the cluster uses to simplify the graph
  • Actions with black text are sent to the lrmd
  • Resource actions have text of the form ${rsc}_${action}_${interval} ${node}
  • Actions of the form ${rsc}_monitor_0 ${node} are the cluster’s way of finding out the resource’s status before we try and start it anywhere
  • Any action depending on an action with a red border will not be able to execute.
  • Loops are really bad. Please report them to the development team.

Check the relative ordering of actions:

  • Are there any extra ones?
    Do they need to be removed from the configuration?
    Are they implied by the group construct?
  • Are there any missing?
    Are they specified in the configuration?

You can obtain excruciating levels of detail by adding additional -V options to the crm_simulate command line.
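
For example, each extra -V bumps the verbosity up another notch:

crm_simulate -VVVV --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --run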

Now see what the cluster thinks the “next state” will look like:

crm_simulate --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --simulate
  • Does the new cluster state look correct based on the input and actions performed?
    If not, file a bug.

Debugging the Policy Engine was originally published by Andrew Beekhof at That Cluster Guy on May 20, 2013.

Debugging Pacemaker

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 20, 2013 09:08 AM

Where to start

The first thing to do is look in syslog for the terms ERROR and WARN, eg.

grep -e ERROR -e WARN /var/log/messages

If nothing looks appropriate, find the logs from crmd

grep -e crmd\\[ -e crmd: /var/log/messages

If you do not see anything from crmd, check your cluster configuration to see whether logging to syslog is enabled, and check the syslog configuration to see where it is being sent. If in doubt, refer to Pacemaker Logging for how to obtain more detail.

Although the crmd process is active on all cluster nodes, decisions are only occurring on one of them. The “DC” is chosen through the crmd’s election process and may move around as nodes leave/join the cluster.

For node failures, you’ll always want the logs from the DC (or the node that becomes the DC).
For resource failures, you’ll want the logs from the DC and the node on which the resource failed.

Log entries like:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

indicate a node is no longer part of the cluster (either because it failed or was shut down)

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now member (was lost)

indicates a node has (re)joined the cluster

crmd[1811]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE ...
crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
crmd[1811]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE ...

indicates recovery was attempted

crmd[1811]:   notice: te_rsc_command: Initiating action 36: monitor www_monitor_0 on corosync-host-5
crmd[1811]:   notice: te_rsc_command: Initiating action 54: monitor mysql_monitor_10000 on corosync-host-4

indicates we performed a resource action, in this case we are checking the status of the www resource on corosync-host-5 and starting a recurring health check for mysql on corosync-host-4.

crmd[1811]:   notice: te_fence_node: Executing reboot fencing operation (83) on corosync-host-8 (timeout=60000)

indicates that we are attempting to fence corosync-host-8.

crmd[1811]:   notice: tengine_stonith_notify: Peer corosync-host-8 was terminated (st_notify_fence) by corosync-host-1 for corosync-host-1: OK

indicates that corosync-host-1 successfully fenced corosync-host-8.

Node-level failures

  • Did the crmd fail to notice the failure?

    If you do not see any entries from crm_update_peer_state(), check the corosync logs to see if membership was correct/timely

  • Did the crmd fail to initiate recovery?

    If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.

  • Did the crmd fail to perform recovery?

    If you DO see entries from do_state_transition() but the run_graph() entry(ies) include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything.

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Was fencing attempted?

    Check if the stonith-enabled property is set to true/1/yes (see the example query after this list); if so, obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Did fencing complete?

    Check the configuration of the fencing resources and, if it looks correct, proceed to Debugging Stonith.
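
One quick way to check the stonith-enabled property mentioned above (it defaults to true when unset):

crm_attribute --type crm_config --name stonith-enabled --query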

Resource-level failures

  • Did the resource actually fail?

    If not, check for logs matching the resource name to see why the resource agent thought a failure occurred.

    Check the resource agent source to see what code paths could have produced those logs (or the lack of them)

  • Did crmd notice the resource failure?

    If not, check for logs matching the resource name to see if the resource agent noticed.

    Check that a recurring monitor was configured.

  • Did the crmd fail to initiate recovery?

    If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.

  • Did the crmd fail to perform recovery?

    If you DO see entries from do_state_transition() but the run_graph() entry(ies) include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything.

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Did resources stop/start/move unexpectedly or fail to stop/start/move when expected?

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

Debugging Pacemaker was originally published by Andrew Beekhof at That Cluster Guy on May 20, 2013.

Pacemaker on RHEL6.4

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 04, 2013 08:09 AM

Over the last couple of years, we have been evolving the stack in two ways of particular relevance to RHEL customers:

  • minimizing the differences to the default RHEL-6 stack to reduce the implication of supporting Pacemaker there (making it more likely to happen)

  • adapting to changes to Corosync’s direction (ie. the removal of plugins and the addition of a quorum API) for the future

As a general rule, Red Hat does not ship packages it doesn’t at least plan to support. So part of readying Pacemaker for “supported” status is removing or deprecating the parts of Red Hat’s packages that they have no interest and/or capacity to support.

Removal of crmsh

For reasons that you may or may not agree with, Red Hat has decided to rely on pcs for command line and GUI cluster management in RHEL-7.

As a result there is no future, in RHEL, for the original cluster shell crmsh.

Normally it would have been deprecated. However since crmsh is now a stand-alone project, its removal from the Pacemaker codebase also resulted in its removal from RHEL-6 once the packages were refreshed.

To fill the void and help prepare people for RHEL-7, pcs is now also available on RHEL-6.

Status of the Plugin

Anyone taking the recommended approach of using Pacemaker with CMAN (ie. cluster.conf) on RHEL-6 or any of its derivatives can stop reading for now (we’ll need to talk again when RHEL 7 comes out with corosync-2, but that’s another conversation).

Anyone using corosync.conf on RHEL 6 should keep reading…

One of the differences between the Pacemaker and rgmanager stacks is where membership and quorum come from.

Pacemaker has traditionally obtained it from a custom plugin, whereas rgmanager used CMAN. Neither source is “better” than the other, the only thing that matters is that everyone obtains it from the same place.

Since the rest of the components in a RHEL-6 cluster use CMAN, support for it was added to Pacemaker which also helps minimize the support load. Additionally, in RHEL-7, Corosync’s support for plugins such as Pacemaker’s (and CMAN’s) goes away.

Without any chance of being supported in the short or long-term, configuring plugin-based clusters (ie. via corosync.conf) is now officially deprecated in RHEL. As some of you may have already noticed, starting corosync in 6.4 produces the following entries in the logs:

Apr 23 17:35:36 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Apr 23 17:35:36 corosync [pcmk  ] ERROR: process_ais_conf:  Please see Chapter 8 of 'Clusters from Scratch' (http://clusterlabs.org/doc) for details on using Pacemaker with CMAN

Everyone is highly encouraged to switch to CMAN-based Pacemaker clusters

While the plugin will still be around, running Pacemaker in configurations that are not well tested by Red Hat (or, for the most part, by upstream either) contains an element of risk.

For example, the messages above were originally added for 6.3, however since logging from the plugin was broken for over a year, no-one noticed. It only got fixed when I was trying to figure out why no-one had complained about them yet!

A lack of logging is annoying but not usually problematic, unfortunately there is also something far worse…

Fencing Failures when using the Pacemaker plugin

It has come to light that fencing for plugin-based clusters is critically broken.

The cause was a single stray ‘n’-character, probably from a copy+paste, that prevented the crmd from correctly reacting to membership-level failures (ie. killall -9 corosync) of its peers.

The problem did not show up in any of Red Hat’s testing because of the way Pacemaker processes talk to their peers on other nodes when CMAN (or Corosync 2.0) is in use.

For CMAN and Corosync 2.0 we use Corosync’s CPG API which provides notifications when peer processes (the crmd in this case) join or leave the messaging group. These additional notifications from CPG follow a different code path and are unaffected by the bug… allowing the cluster to function as intended.

Unfortunately, despite the size and obviousness of the fix, a z-stream update for a bug affecting a deprecated use-case of an as-yet-unsupported package is a total non-starter.

People wanting to stick with plugin-based clusters should obtain 1.1.9 or later from the Clusterlabs repos, which include the fix.

You can read more about the bug and the fix on the Red Hat bugzilla

For details on converting to a CMAN-based stack, please see Clusters from Scratch.

Switching to CMAN is really far less painful than it sounds

There is also a quickstart guide for easily generating cluster.conf, just substitute the name of your cluster nodes.

Pacemaker on RHEL6.4 was originally published by Andrew Beekhof at That Cluster Guy on May 04, 2013.

Release candidate: 1.1.10-rc2

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 03, 2013 12:12 PM

Announcing the second release candidate for Pacemaker 1.1.10

No major changes have been introduced, just some fixes for a few niggles that were discovered since RC1.

Unless blocker bugs are found, this will be the final release candidate and 1.1.10 will be tagged on May 10th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    On Fedora/RHEL and its derivatives, you can do this by running:

    # yum install -y yum-utils
    # make yumdep
    

    Otherwise you will need to investigate the spec file and/or wait for rpmbuild to report missing packages.

  3. Build Pacemaker

    # make rpm
    
  4. Copy and deploy as needed

Details - 1.1.10-rc2

Changesets  31
Diff 30 files changed, 687 insertions(+), 138 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc2

N/A

Changes since Pacemaker-1.1.10-rc1

  • Bug cl#5152 - Correctly clean up fenced nodes during membership changes
  • Bug cl#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • cman: Skip cman_pre_stop in the init script if fenced is not running
  • Core: Ensure the last field in transition keys is 36 characters
  • crm_mon: Check if a process can be daemonized before forking so the parent can report an error
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • crm_resource: Allow --cleanup without a resource name
  • init: Unless specified otherwise, assume cman is in use if cluster.conf exists
  • mcp: inhibit error messages without cman
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time

Release candidate: 1.1.10-rc2 was originally published by Andrew Beekhof at That Cluster Guy on May 03, 2013.

Mixing Pacemaker versions

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at April 19, 2013 09:47 AM

When mixing Pacemaker versions, there are two factors that need to be considered. The first is obviously the package version - if that is the same, then there is no problem.

If not, then the Pacemaker feature set needs to be checked. This feature set increases far less regularly than the normal package version. Newer versions of Pacemaker expose this value in the output of pacemakerd --features:

$ pacemakerd --features
Pacemaker 1.1.9 (Build: 9048b7b)
Supporting v3.0.7: generated-manpages agent-manpages ascii-docs publican-docs ncurses gcov libqb-logging libqb-ipc lha-fencing upstart systemd nagios heartbeat corosync-native snmp

In this case, the feature set is 3.0.7 (major 3, minor 0, revision 7).

For older versions, you should refer to the definition of CRM_FEATURE_SET in crm.h, usually this will be located at /usr/include/pacemaker/crm/crm.h.
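
For example, something along these lines will show it:

grep CRM_FEATURE_SET /usr/include/pacemaker/crm/crm.h
# e.g.  #define CRM_FEATURE_SET "3.0.7"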

If two packages or versions share the same feature set, then the expectation is that they are fully compatible. Any other behavior is a bug which needs to be reported.

If the feature sets between two versions differ but have the same major value (ie. the 3 in 3.0.7 and 3.1.5), then they are said to be upgrade compatible.

What does upgrade compatible mean?

When two versions are upgrade compatible, it means that they will co-exist during a rolling upgrade but not on an extended or permanent basis as the newer version requires all its peers to support API feature(s) that the old one does not have.

The following rules apply when mixing installations with different feature sets:

  • When electing a node to run the cluster (the Designated Co-ordinator or “DC”), the node with the lowest feature set always wins.
  • The DC records its feature set in the CIB
  • Nodes may not join the cluster if their feature set is less than the one recorded in the CIB

Example

Consider node1 with a feature set of 3.0.7 and node2 with feature set 3.0.8… when node2 first joins the cluster, node1 will naturally remain the DC.

However if node1 leaves the cluster, either by being shut down or due to a failure, node2 will become the DC (as it is by itself and by definition has the lowest feature set of any active node).

At this point, node1 will be rejected if it attempts to rejoin the cluster and will shut down, as its feature set is lower than that of the DC (node2).

Is this happening to me?

If you are affected by this, you will see an error in the logs along the lines of:

error: We can only support up to CRM feature set 3.0.7 (current=3.0.8)

In this case, the DC (node2) has feature set 3.0.8 but we (node1) only have 3.0.7.

To get these two nodes talking to each other again:

  1. stop the cluster on both nodes
  2. on both nodes, run: CIB_file=/path/to/cib.xml cibadmin -M -X ‘’
  3. start node1 and wait until it is elected as the DC
  4. start node2

Mixing Pacemaker versions was originally published by Andrew Beekhof at That Cluster Guy on April 19, 2013.

Release candidate: 1.1.10-rc1

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at April 18, 2013 09:32 AM

A funny thing happened on the way to 1.1.9…

Between tagging it on the Friday, and announcing it on the following Monday, people started actually testing it and found a couple of problems.

Specifically, there were some significant memory leaks and problems in a couple of areas that our unit and integration tests can’t sanely test.

So while 1.1.9 is out, it was never formally announced. Instead we’ve been fixing the reported bugs (as well as looking for a few more by running valgrind on a live cluster) and preparing for 1.1.10.

Also, in an attempt to learn from previous mistakes, the new release procedure involves release candidates. If no blocker bugs are reported in the week following a release candidate, it is re-tagged as the official release.

So without further ado, here is the 1.1.9 release notes as well as what changed in 1.1.10-rc1.

Details - 1.1.10-rc1

Changesets  143
Diff 104 files changed, 3327 insertions(+), 1186 deletions(-)


Highlights

Features added in Pacemaker-1.1.10

  • crm_resource: Allow individual resources to be reprobed
  • mcp: Alternate Upstart job controlling both pacemaker and corosync
  • mcp: Prevent the cluster from trying to use cman even when it is installed

Changes since Pacemaker-1.1.9

  • Allow programs in the haclient group to use CRM_CORE_DIR
  • cman: Do not unconditionally start cman if it is already running
  • core: Ensure custom error codes are less than 256
  • crmd: Clean up memory at exit
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Indicate completion of refresh to callers
  • crmd: Indicate completion of re-probe to callers
  • crmd: Only perform a dry run for deletions if built with ACL support
  • crmd: Prevent use-after-free when the blackbox is enabled
  • crmd: Suppress secondary errors when no metadata is found
  • doc: Pacemaker Remote deployment and reference guide
  • fencing: Avoid memory leak in can_fence_host_with_device()
  • fencing: Clean up memory at exit
  • fencing: Correctly filter devices when no nodes are configured yet
  • fencing: Correctly unpack device parameters before using them
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Fix memory leaks during query phase
  • fencing: Prevent empty call-id during notification processing
  • fencing: Prevent invalid read in parse_host_list()
  • fencing: Prevent memory leak when registering devices
  • crmd: lrmd: stonithd: fixed memory leaks
  • ipc: Allow unprivileged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: cl#5148 - Correctly remove a node that used to have a different nodeid
  • legacy: Support “crm_node --remove” with a node name for corosync plugin (bnc#805278)
  • logging: Better checks when determining if file based logging will work
  • Pass errors from lsb metadata generation back to the caller
  • pengine: Do not use functions from the cib library during unpack
  • Prevent use-of-NULL when reading CIB_shadow from the environment
  • Skip WNOHANG when waiting after sending SIGKILL to child processes
  • tools: crm_mon - Print a timing field only if its value is non-zero
  • Use custom OCF_ROOT_DIR if requested
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy
  • xml: Prevent use-after-free in cib_process_xpath()
  • xml: Prevent use-after-free when not processing all xpath query results

Details - 1.1.9

Changesets  731
Diff 1301 files changed, 92909 insertions(+), 57455 deletions(-)


Highlights

Features added in Pacemaker-1.1.9

  • corosync: Allow cman and corosync 2.0 nodes to use a name other than uname()
  • corosync: Use queues to avoid blocking when sending CPG messages
  • ipc: Compress messages that exceed the configured IPC message limit
  • ipc: Use queues to prevent slow clients from blocking the server
  • ipc: Use shared memory by default
  • lrmd: Support nagios remote monitoring
  • lrmd: Pacemaker Remote Daemon for extending pacemaker functionality outside corosync cluster.
  • pengine: Check for master/slave resources that are not OCF agents
  • pengine: Support a ‘requires’ resource meta-attribute for controlling whether it needs quorum, fencing or nothing
  • pengine: Support for resource container
  • pengine: Support resources that require unfencing before start

Changes since Pacemaker-1.1.8

  • attrd: Correctly handle deletion of non-existent attributes
  • Bug cl#5135 - Improved detection of the active cluster type
  • Bug rhbz#913093 - Use crm_node instead of uname
  • cib: Avoid use-after-free by correctly support cib_no_children for non-xpath queries
  • cib: Correctly process XML diff’s involving element removal
  • cib: Performance improvements for non-DC nodes
  • cib: Prevent error message by correctly handling peer replies
  • cib: Prevent ordering changes when applying xml diffs
  • cib: Remove text nodes from cib replace operations
  • cluster: Detect node name collisions in corosync
  • cluster: Preserve corosync membership state when matching node name/id entries
  • cman: Force fenced to terminate on shutdown
  • cman: Ignore qdisk ‘nodes’
  • core: Drop per-user core directories
  • corosync: Avoid errors when closing failed connections
  • corosync: Ensure peer state is preserved when matching names to nodeids
  • corosync: Clean up CMAP connections after querying node name
  • corosync: Correctly detect corosync 2.0 clusters even if we don’t have permission to access it
  • crmd: Bug cl#5144 - Do not update the expected status of failed nodes
  • crmd: Correctly determine if cluster disconnection was abnormal
  • crmd: Correctly relay messages for remote clients (bnc#805626, bnc#804704)
  • crmd: Correctly stall the FSA when waiting for additional inputs
  • crmd: Detect and recover when we are evicted from CPG
  • crmd: Differentiate between a node that is up and coming up in peer_update_callback()
  • crmd: Have cib operation timeouts scale with node count
  • crmd: Improved continue/wait logic in do_dc_join_finalize()
  • crmd: Prevent election storms caused by getrusage() values being too close
  • crmd: Prevent timeouts when performing pacemaker level membership negotiation
  • crmd: Prevent use-after-free of fsa_message_queue during exit
  • crmd: Store all current actions when stalling the FSA
  • crm_mon: Do not try to render a blank cib and indicate the previous output is now stale
  • crm_mon: Fix a crash when using SNMP traps
  • crm_mon: Look for the correct error codes when applying configuration updates
  • crm_report: Ensure policy engine logs are found
  • crm_report: Fix node list detection
  • crm_resource: Have crm_resource generate a valid transition key when sending resource commands to the crmd
  • date/time: Bug cl#5118 - Correctly convert seconds-since-epoch to the current time
  • fencing: Attempt to provide more information than just ‘generic error’ for failed actions
  • fencing: Correctly record completed but previously unknown fencing operations
  • fencing: Correctly terminate when all device options have been exhausted
  • fencing: cov#739453 - String not null terminated
  • fencing: Do not merge new fencing requests with stale ones from dead nodes
  • fencing: Do not start fencing until the entire device topology is found or the query results time out.
  • fencing: Do not wait for the query timeout if all replies have arrived
  • fencing: Fix passing of parameters from CMAN containing ‘=’
  • fencing: Fix non-comparison when sorting devices by priority
  • fencing: On failure, only try a topology device once from the remote level.
  • fencing: Only try peers for non-topology based operations once
  • fencing: Retry stonith device for duration of action’s timeout period.
  • heartbeat: Remove incorrect assert during cluster connect
  • ipc: Bug cl#5110 - Prevent 100% CPU usage when looking for synchronous replies
  • ipc: Use 50k as the default compression threshold
  • legacy: Prevent assertion failure on routing ais messages (bnc#805626)
  • legacy: Re-enable logging from the pacemaker plugin
  • legacy: Relax the ‘active’ check for plugin based clusters to avoid false negatives
  • legacy: Skip peer process check if the process list is empty in crm_is_corosync_peer_active()
  • mcp: Only define HA_DEBUGLOG to avoid agent calls to ocf_log printing everything twice
  • mcp: Re-attach to existing pacemaker components when mcp fails
  • pengine: Any location constraint for the slave role applies to all roles
  • pengine: Avoid leaking memory when cleaning up failcounts and using containers
  • pengine: Bug cl#5101 - Ensure stop order is preserved for partially active groups
  • pengine: Bug cl#5140 - Allow set members to be stopped when the subsequent set has require-all=false
  • pengine: Bug cl#5143 - Prevent shuffling of anonymous master/slave instances
  • pengine: Bug rhbz#880249 - Ensure orphan masters are demoted before being stopped
  • pengine: Bug rhbz#880249 - Teach the PE how to recover masters into primitives
  • pengine: cl#5025 - Automatically clear failcount for start/monitor failures after resource parameters change (see the example after this list)
  • pengine: cl#5099 - Probe operation uses the timeout value from the minimum interval monitor by default (bnc#776386)
  • pengine: cl#5111 - When clone/master child rsc has on-fail=stop, ensure all children stop on failure.
  • pengine: cl#5142 - Do not delete orphaned children of an anonymous clone
  • pengine: Correctly unpack active anonymous clones
  • pengine: Ensure previous migrations are closed out before attempting another one
  • pengine: Introducing the whitebox container resources feature
  • pengine: Prevent double-free for cloned primitive from template
  • pengine: Process rsc_ticket dependencies earlier for correctly allocating resources (bnc#802307)
  • pengine: Remove special cases for fencing resources
  • pengine: rhbz#902459 - Remove rsc node status for orphan resources
  • systemd: Gracefully handle unexpected DBus return types
  • Replace the use of the insecure mktemp(3) with mkstemp(3)
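
Related to the automatic failcount clearing noted above (cl#5025), failure counts can also be inspected and cleared by hand. A minimal sketch follows, with my_fs again standing in for a real resource name:

    # one-shot cluster status including per-resource fail counts
    crm_mon --one-shot --failcounts

    # clear the fail count and failed-operation history for my_fs
    crm_resource --cleanup --resource my_fs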

Release candidate: 1.1.10-rc1 was originally published by Andrew Beekhof at That Cluster Guy on April 18, 2013.