Quick links:

That Cluster Guy: A New Fencing Mechanism (TBD)

That Cluster Guy: A New Thing

That Cluster Guy: Two Nodes - The Devil is in the Details

That Cluster Guy: Containerizing Databases with Kubernetes and Stateful Sets

That Cluster Guy: HA for Composible Deployments of OpenStack

That Cluster Guy: Thoughts on HA for Multi-Subnet Deployments of OpenStack

That Cluster Guy: Working with OpenStack Images

That Cluster Guy: Evolving the OpenStack HA Architecture

That Cluster Guy: Minimum Viable Cluster

That Cluster Guy: Receiving Reliable Notification of Cluster Events

That Cluster Guy: Fencing for Fun and Profit with SBD

That Cluster Guy: Double Failure - Get out of Jail Free? Not so Fast

That Cluster Guy: Life at the Intersection of Pets and Cattle

That Cluster Guy: Adding Managed Compute Nodes to a Highly Available Openstack Control Plane

That Cluster Guy: Feature Spotlight - Smart Resource Restart from the Command Line

That Cluster Guy: Feature Spotlight - Controllable Resource Discovery

That Cluster Guy: Release Candidate: 1.1.12-rc1

That Cluster Guy: Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

That Cluster Guy: Announcing 1.1.11 Beta Testing

That Cluster Guy: Pacemaker and RHEL 6.4 (redux)

That Cluster Guy: Changes to the Remote Wire Protocol in 1.1.11

That Cluster Guy: Pacemaker 1.1.10 - final

That Cluster Guy: Release candidate: 1.1.10-rc7

That Cluster Guy: Release candidate: 1.1.10-rc6

That Cluster Guy: GPG Quickstart

That Cluster Guy: Release candidate: 1.1.10-rc5

That Cluster Guy: Release candidate: 1.1.10-rc3

That Cluster Guy: Pacemaker Logging

A New Fencing Mechanism (TBD)

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at March 07, 2018 02:11 AM

Protecting Database Centric Applications

In the same way that some applications require the ability to persist records to disk, for some applications the loss of access to the database means game over - more so than disconnection from the storage.

Cinder-volume is one such application and, as it moves towards an active/active model, it is important that a failure in one peer does not represent a SPoF. In the Cinder architecture, the API server has no way to know if a cinder-volume process is fully functional - so it will still receive new requests to execute.

A cinder-volume process that has lost access to the storage will naturally be unable to complete requests. Worse though is losing access to the database, as this means the result of an action cannot be recorded.

For some operations this is ok, if wasteful, because the operation will fail and be retried. Deletion of something that was already deleted is usually treated as a success, and a re-attempted volume creation will simply return a new volume. However, performing the same resize operation twice is highly problematic since the recorded old size no longer matches the actual size.

Even the safe operations may never complete because the bad cinder-volume process may end up being asked to perform the cleanup operations from its own failures, which would result in additional failures.

Additionally, despite not being recommended, some Cinder drivers make use of locking. For those drivers it is just as crucial that any locks held by a faulty or hung peer can be recovered within a finite period of time. Hence the need for fencing.

Since power-based fencing is so dependent on node hardware, and there is always some kind of storage involved, the idea of leveraging the SBD[1] (Storage Based Death) project’s capabilities to do disk-based heartbeating and poison pills is attractive. When combined with a hardware watchdog, it is an extremely reliable way to ensure safe access to shared resources.

However in Cinder’s case, not all vendors can provide raw access to a small block device on the storage. Additionally, it is really access to the database that needs protecting not the storage. So while useful, it is still relatively easy to construct scenarios that would defeat SBD.

A New Type of Death

Where SBD uses storage APIs to protect applications persisting data to disk, we could have an equivalent based on SQL calls that does the same for cinder-volume and other database-centric applications.

I therefore propose TBD - “Table Based Death” (or “To Be Decided” depending on how you’re wired).

Instead of heartbeating to a designated slot on a block device, the slots become rows in a small database table that this new daemon would interact with via SQL.
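Purely for illustration, the table could be as small as one row per node. A sketch of what it might look like (the database, table and column names here are my own invention, not part of any existing implementation):

mysql <<'SQL'
CREATE DATABASE IF NOT EXISTS tbd;
CREATE TABLE tbd.tbd_slots (
    node_name  VARCHAR(64) PRIMARY KEY,                        -- which peer owns this slot
    last_seen  TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP, -- updated by the owner on every heartbeat
    command    VARCHAR(16) NOT NULL DEFAULT 'clear'            -- 'clear', or 'poison' to request self-fencing
);
SQL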

When a peer is connected to the database, a cluster manager like Pacemaker can use a poison pill to fence the peer in the event of a network, node, or resource level failure. Should the peer ever lose quorum or its connection to the database, surviving peers can assume with a degree of confidence that it will self-terminate via the watchdog after a known interval.

The desired behaviour can be derived from the following properties:

  1. Quorum is required to write poison pills into a peer’s slot

  2. A peer that finds a poison pill in its slot triggers its watchdog and reboots

  3. A peer that loses its connection to the database won’t be able to write status information to its slot, which will trigger the watchdog

  4. A peer that loses its connection to the database won’t be able to write a poison pill into another peer’s slot

  5. If the underlying database loses too many peers and reverts to read-only, we won’t be able to write to our slot, which triggers the watchdog

  6. If a peer loses its connection to the other peers, the survivors maintain quorum and can write a poison pill into the lost node’s slot (1), ensuring the peer will terminate due to scenario (2) or (3)

If N seconds is the worst-case time a peer would need to either notice a poison pill or a disconnection from the database and trigger the watchdog, then we can arrange for services to be recovered after some multiple of N has elapsed, in the same way that Pacemaker does for SBD.
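To make the timing discussion concrete, here is a rough shell sketch of what such a daemon’s main loop might look like. It assumes the illustrative table above and glosses over many details (credentials, timeouts, and the fact that a real implementation would drive the watchdog device properly rather than via shell):

HEARTBEAT=5                 # seconds; the watchdog timeout would be some multiple of this
ME=$(hostname)
while sleep $HEARTBEAT; do
    # Losing the database connection, finding a poison pill, or hitting a
    # read-only database all cause us to stop petting the watchdog.
    cmd=$(mysql -N -B -e "SELECT command FROM tbd.tbd_slots WHERE node_name='$ME'") || break
    [ "$cmd" = "poison" ] && break
    mysql -e "UPDATE tbd.tbd_slots SET last_seen=NOW() WHERE node_name='$ME'" || break
    echo . > /dev/watchdog  # pet the hardware watchdog
done
# Falling out of the loop means the watchdog is no longer being reset,
# so the node self-fences once the watchdog timeout expires.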

While TBD would be a valuable addition to a traditional cluster architecture, it is also conceivable that it could be useful in a stand-alone configuration. Consideration should therefore be given during the design phase as to how best to consume membership, quorum, and fencing requests from multiple sources - not just a particular application or cluster manager.

Limitations

Just as in the SBD architecture, we need TBD to be configured to use the same persistent store (database) as is being consumed by the applications it is protecting. This is crucial as it means the same criteria that enables the application to function, also results in the node self-terminating if it cannot be satisfied.

However for security reasons, the table would ideally live in a different namespace and with different access permissions.

It is also important to note that significant design challenges would need to be faced in order to protect applications managed by the same cluster that was providing the highly available database being consumed by TBD. Consideration would particularly need to be given to the behaviour of TBD and the applications it was protecting during shutdown and cold-start scenarios. Care would need to be taken to avoid unnecessary self-fencing operations and to ensure that correctly handling these scenarios does not compromise the failure responses.

Footnotes

[1] SBD lives under the ClusterLabs banner but can operate without a traditional corosync/pacemaker stack.

A New Thing

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at February 16, 2018 03:32 AM

I made a new thing.

If you’re interested in Kubernetes and/or managing replicated applications, such as Galera, then you might also be interested in an operator that allows this class of applications to be managed natively by Kubernetes.

There is plenty to read on why the operator exists, how replication is managed and the steps to install it if you’re interested in trying it out.

There is also a screencast (asciicast) that demonstrates the major concepts.

Feedback welcome.

Two Nodes - The Devil is in the Details

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at February 15, 2018 11:52 PM

tl;dr - Many people love 2-node clusters because they seem conceptually simpler and 33% cheaper, but while it’s possible to construct good ones, most will have subtle failure modes

The first step towards creating any HA system is to look for and try to eliminate single points of failure, often abbreviated as SPoF.

It is impossible to eliminate all risk of downtime, and especially when one considers the additional complexity that comes with introducing additional redundancy, concentrating on single (rather than chains of related and therefore decreasingly probable) points of failure is widely accepted as a suitable compromise.

The natural starting point then is to have more than one node. However before the system can move services to the surviving node after a failure, in general, it needs to be sure that they are not still active elsewhere.

So not only are we looking for SPoFs, but we are also looking to balance risks and consequences, and the calculus will be different for every deployment [1]

There is no downside if a failure causes both members of a two-node cluster to serve up the same static website. However it’s a very different story if it results in both sides independently managing a shared job queue or providing uncoordinated write access to a replicated database or shared filesystem.

So in order to prevent a single node failure from corrupting your data or blocking recovery, we rely on something called fencing.

Fencing

At its heart, fencing turns the question “Can our peer cause data corruption?” into the answer “no” by isolating the peer both from incoming requests and from persistent storage. The most common approach to fencing is to power off failed nodes.

There are two categories of fencing which I will call direct and indirect, but which could equally be called active and passive. Direct methods involve action on the part of surviving peers, such as interacting with an IPMI or iLO device, whereas indirect methods rely on the failed node to somehow recognise it is in an unhealthy state (or is at least preventing remaining members from recovering) and signal a hardware watchdog to panic the machine.

Quorum helps in both these scenarios.

Direct Fencing

In the case of direct fencing, quorum can be used to prevent fencing races when the network fails. By including the concept of quorum, there is enough information in the system (even without connectivity to their peers) for nodes to automatically know whether they should initiate fencing and/or recovery.

Without quorum, both sides of a network split will rightly assume the other is dead and rush to fence the other. In the worst case, both sides succeed, leaving the entire cluster offline. The next worst case is a death match, a never-ending cycle of nodes coming up, not seeing their peers, rebooting them and initiating recovery, only to be rebooted when their peer goes through the same logic.

The problem with fencing is that the most commonly used devices become inaccessible due to the same failure events we want to use them to recover from. Most IPMI and iLO cards both lose power with the hosts they control and, by default, use the same network that is causing the peers to believe the others are offline.

Sadly the intricacies of IPMI and iLO devices are rarely a consideration at the point when hardware is being purchased.

Indirect Fencing

Quorum is also crucial for driving indirect fencing and, when done right, can allow survivors to safely assume that missing nodes have entered a safe state after a defined period of time.

In such a setup, the watchdog’s timer is reset every N seconds unless quorum is lost. If the timer (usually some multiple of N) expires, then the machine performs an ungraceful power off (not shutdown).

This is very effective but without quorum to drive it, there is insufficient information from within the cluster to determine the difference between a network outage and the failure of your peer. The reason this matters is that without a way to differentiate between the two cases, you are forced to choose a single behaviour mode for both.

The problem with choosing a single response is that there is no course of action that both maximises availability and prevents corruption.

  • If you choose to assume the peer is alive but it actually failed, then the cluster has unnecessarily stopped services.

  • If you choose to assume the peer is dead but it was just a network outage, then the best case scenario is that you have signed up for some manual reconciliation of the resulting datasets.

No matter what heuristics you use, it is trivial to construct a single failure that either leaves both sides running or where the cluster unnecessarily shuts down the surviving peer(s). Taking quorum away really does deprive the cluster of one of the most powerful tools in its arsenal.

Given no other alternative, the best approach is normally to sacrifice availability. Making corrupted data highly available does no one any good, and manually reconciling divergent datasets is no fun either.

Quorum

Quorum sounds great right?

The only drawback is that in order to have it in a cluster with N members, you need to be able to see N/2 + 1 of your peers - which is impossible in a two-node cluster after one node has failed.

Which finally brings us to the fundamental issue with two-nodes:

quorum does not make sense in two-node clusters, and

without it there is no way to reliably determine a course of action that both maximises availability and prevents corruption

Even in a system of two nodes connected by a crossover cable, there is no way to conclusively differentiate between a network outage and a failure of the other node. Unplugging one end (whose likelihood is surely proportional to the distance between the nodes) would be enough to invalidate any assumption that link health equals peer node health.

Making Two Nodes Work

Sometimes the client can’t or won’t make the additional purchase of a third node and we need to look for alternatives.

Option 1 - Add a Backup Fencing Method

A node’s iLO or IPMI device represents a SPoF because, by definition, if it fails the survivors cannot use it to put the node into a safe state. In a cluster of 3 nodes or more, we can mitigate this with a quorum calculation and a hardware watchdog (an indirect fencing mechanism as previously discussed). In the two-node case we must instead use network power switches (aka. power distribution units or PDUs).

After a failure, the survivor first attempts to contact the primary (the built-in iLO or IPMI) fencing device. If that succeeds, recovery proceeds as normal. Only if the iLO/IPMI device fails is the PDU invoked and assuming it succeeds, recovery can again continue.

Be sure to place the PDU on a different network to the cluster traffic, otherwise a single network failure will prevent access to both fencing devices and block service recovery.

You might be wondering at this point… doesn’t the PDU represent a single point of failure? To which the answer is “definitely”.

If that risk concerns you, and you would not be alone, connect both peers to two PDUs and tell your cluster software to use both when powering peers on and off. Now the cluster remains active if one PDU dies, and would require a second fencing failure of either the other PDU or an IPMI device in order to block recovery.
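For reference, Pacemaker expresses this primary-then-backup ordering as fencing levels (also called fencing topology). A hedged sketch for a single node might look like the following - the device names, addresses, credentials and agent parameters are placeholders and vary between fence-agents versions:

# Level 1: the node's own IPMI/iLO device
pcs stonith create fence-node1-ipmi fence_ipmilan ip=10.0.1.11 username=admin password=secret pcmk_host_list=node1
pcs stonith level add 1 node1 fence-node1-ipmi

# Level 2: only attempted if level 1 fails - both PDU outlets feeding node1
pcs stonith create fence-node1-pdu1 fence_apc_snmp ip=10.0.2.21 port=3 pcmk_host_list=node1
pcs stonith create fence-node1-pdu2 fence_apc_snmp ip=10.0.2.22 port=3 pcmk_host_list=node1
pcs stonith level add 2 node1 fence-node1-pdu1,fence-node1-pdu2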

Option 2 - Add an Arbitrator

In some scenarios, although a backup fencing method would be technically possible, it is politically challenging. Many companies like to have a degree of separation between the admin and application folks, and security conscious network admins are not always enthusiastic about handing over the usernames and passwords to the PDUs.

In this case, the recommended alternative is to create a neutral third-party that can supplement the quorum calculation.

In the event of a failure, a node needs to be able to see either its peer or the arbitrator in order to recover services. The arbitrator also acts as a tie-breaker if both nodes can see the arbitrator but not each other.

This option needs to be paired with an indirect fencing method, such as a watchdog that is configured to panic the machine if it loses its connections to both its peer and the arbitrator. In this way, the survivor is able to assume with reasonable confidence that its peer will be in a safe state after the watchdog expiry interval.
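With the corosync/pacemaker stack, one way to construct this is a quorum device (corosync-qnetd running on the arbitrator) paired with watchdog-only SBD so the watchdog fires when quorum is lost. A rough sketch, with the host name invented for the example:

# On the arbitrator (it runs neither corosync nor pacemaker):
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start

# On both cluster nodes: use the arbitrator as the tie-breaker...
pcs quorum device add model net host=arbitrator.example.com algorithm=ffsplit

# ...and enable watchdog-only (diskless) SBD as the indirect fencing method
yum install sbd
pcs stonith sbd enable
pcs property set stonith-watchdog-timeout=10s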

The practical difference between an arbitrator and a third node is that the arbitrator has a much lower footprint and can act as a tie-breaker for more than one cluster.

Option 3 - More Human Than Human

The final approach is for survivors to continue hosting whatever services they were already running, but not start any new ones until either the problem resolves itself (network heals, node reboots) or a human takes on the responsibility of manually confirming that the other side is dead.

Bonus Option

Did I already mention you could add a third node? We test those a lot :-)

Two Racks

For the sake of argument, let’s imagine I’ve convinced you, the reader, of the merits of a third node; we must now consider the physical arrangement of the nodes. If they are placed in (and obtain power from) the same rack, that too represents a SPoF, and one that cannot be resolved by adding a second rack.

If this is surprising, consider what happens when the rack with two nodes fails and how the surviving node would differentiate between this case and a network failure.

The short answer is that it can’t and we’re back to having all the problems of the two-node case. Either the survivor:

  • ignores quorum and incorrectly tries to initiate recovery during network outages (whether fencing is able to complete is a different story and depends on whether a PDU is involved and whether it shares power with any of the racks), or

  • respects quorum and unnecessarily shuts itself down when its peer fails

Either way, two racks is no better than one and the nodes must either be given independent supplies of power or be distributed across three (or more, depending on how many nodes you have) racks.

Two Datacenters

By this point the more risk averse readers might be thinking about disaster recovery. What happens when an asteroid hits the one datacenter with our three nodes distributed across three different racks? Obviously Bad Things(tm) but depending on your needs, adding a second datacenter might not be enough.

Done properly, a second datacenter gives you a (reasonably) up-to-date and consistent copy of your services and their data. However, just like the two-node and two-rack scenarios, there is not enough information in the system to both maximise availability and prevent corruption (or diverging datasets). Even with three nodes (or racks), distributing them across only two datacenters leaves the system unable to reliably make the correct decision in the (now far more likely) event that the two sides cannot communicate.

Which is not to say that a two-datacenter solution is never appropriate. It is not uncommon for companies to want a human in the loop before taking the extraordinary step of failing over to a backup datacenter. Just be aware that if you want automated failover, you’re either going to need a third datacenter in order for quorum to make sense (either directly or via an arbitrator) or find a way to reliably power fence an entire datacenter.

Footnotes

[1] Not everyone needs redundant power companies with independent transmission lines. Although the paranoia paid off for at least one customer when their monitoring detected a failing transformer. The customer was on the phone trying to warn the power company when it finally blew.

Containerizing Databases with Kubernetes and Stateful Sets

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at February 15, 2017 01:45 AM

The canonical example for Stateful Sets with a replicated application in Kubernetes is a database.

As someone looking at how to move foundational OpenStack services to containers, and eventually Kubernetes in the future, this is great news as databases are very typical of applications with complex bootstrap and recovery processes.

If we can successfully show Kubernetes managing a multi-master database natively and safely, the patterns would be broadly applicable and there would be one less reason to have a traditional cluster manager in such contexts.

TL;DR

Kubernetes today is arguably unsuitable for deploying databases UNLESS the pod owner has the ability to verify the physical status of the underlying hardware and is prepared to perform manual recovery in some scenarios.

General Comments

The example allows for N slaves but limits itself to a single master.

Which is absolutely a valid deployment, but it does prevent us from exploring some of the more interesting multi-master corner cases and, unfortunately from an HA perspective, makes pod 0 a single point of failure because:

  • although MySQL slaves can be easily promoted to masters, the containers do not expose such a mechanism, and even if they did

  • writers are told to connect to pod 0 explicitly rather than use the mysql-read service

So if the worker on which pod 0 is running hangs or becomes unreachable, you’re out of luck.

The loss of this worker currently puts Kubernetes in a no-win situation. Either it does the safe thing (the current behaviour) and prevents the pod from being recovered or the attached volume from being accessed, leading to more downtime (because it requires an admin to intervene) than a traditional HA solution. Or it allows the pod to be recovered, risking data corruption if the worker (and by inference, the pod) is not completely dead.

Ordered Bootstrap and Recovery

One of the more important capabilities of StatefulSets is that:

Pod N cannot be recovered, created or destroyed until all pods 0 to N-1 are active and healthy.

This allows container authors to make many simplifying assumptions during bootstrapping and scaling events (such as who has the most recent copy of the data at a given point).

Unfortunately, until we get pod safety and termination guarantees, it means that if a worker node crashes or becomes unreachable, its pods are unrecoverable and any auto-scaling policies cannot be enacted.

Additionally, the enforcement of this policy only happens at scheduling time.

This means that if there is a delay enacting the scheduler’s results, an image must be downloaded, or an init container is part of the scale up process, there is a significant period of time in which an existing pod may die before new replicas can be constructed.

As I type this, the current status on my testbed demonstrates this fragility:

# kubectl get pods
NAME                         READY     STATUS        RESTARTS   AGE
[...]
hostnames-3799501552-wjd65   0/1       Pending       0          4m
mysql-0                      2/2       Running       4          4d
mysql-2                      0/2       Init:0/2      0          19h
web-0                        0/1       Unknown       0          19h

As described, the feature suggests this state (mysql-2 being in the init state while mysql-1 is not active) can never happen.

While such behaviour remains possible, container authors must take care to include logic to detect and handle such scenarios. The easiest course of action is to call exit and cause the container to be re-scheduled.

The example partially addresses this race condition by bootstrapping pod N from N-1. This limits the impact of pod N’s failure to pod N+1’s startup/recovery period.

It is easy to conceive of an extended solution that closed the window completely by trying pods N-1 to 0 in order until it found an active peer to sync from.
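A sketch of what that extension might look like, as a script inside the pod (the service names and the reachability test are assumptions based on the example, and a real implementation would also need to pick the most up-to-date peer rather than just the first reachable one):

#!/bin/sh
# Probe peers N-1 .. 0 and record the first reachable one as our sync source.
ordinal=${HOSTNAME##*-}          # e.g. "2" for mysql-2
i=$((ordinal - 1))
while [ $i -ge 0 ]; do
    if mysql -h "mysql-$i.mysql" -e 'SELECT 1' >/dev/null 2>&1; then
        echo "mysql-$i.mysql" > /tmp/sync-source
        exit 0
    fi
    i=$((i - 1))
done
exit 1   # no active peer found; fail so Kubernetes restarts us and we try again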

Extending the Pattern to Galera

All Galera peers are writable, which makes some aspects easier and others more complicated.

Bootstrapping pod 0 would require some logic to determine if it is bootstrapping the cluster (--wsrep="") or in recovery mode (--wsrep=all:6868,the:6868,peers:6868) but special handling of pod 0 has precedent and is not onerous. The remaining pods would unconditionally use --wsrep=all:6868,the:6868,peers:6868.
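As a rough illustration of that special handling, using Galera’s standard wsrep_cluster_address option rather than the example’s flags (the service and peer names are assumptions, and a production entrypoint would need to check every peer and recover from the most advanced one):

#!/bin/sh
ordinal=${HOSTNAME##*-}
PEERS="gcomm://galera-0.galera,galera-1.galera,galera-2.galera"

if [ "$ordinal" = "0" ] && ! mysql -h galera-1.galera -e 'SELECT 1' >/dev/null 2>&1; then
    # No existing cluster visible: pod 0 bootstraps a brand new one
    exec mysqld --wsrep_cluster_address="gcomm://"
else
    # Everyone else (and a recovering pod 0) joins the existing cluster
    exec mysqld --wsrep_cluster_address="$PEERS"
fi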

Pod 0 is no longer a single point of failure with respect to writes; however, the loss of the worker it is hosted on will continue to inhibit scaling events until manually confirmed and cleaned up by an admin.

Violations of the linear start/stop ordering could be significant if they result from a failure of pod 0 and occur while bootstrapping pod 1. Further, if pod 1 was stopped significantly earlier than pod 0, then depending on the implementation details of Galera, it is conceivable that a failure of pod 0 while pod 1 is synchronising might result in either data loss or pod 1 becoming out of sync.

Removing Shared Storage

One of the main reasons to choose a replicated database is that it doesn’t require shared storage.

Having multiple slaves certainly assists read scalability, and if we modified the example to use multiple masters it would likely improve write performance and failover times. However having multiple copies of the database on the same shared storage does not provide additional redundancy over what the storage already provides - and that is important to some customers.

While there are ways to give containers access to local storage, attempting to make use of them for a replicated database is problematic:

  • It is currently not possible to enforce that pods in a Stateful Set always run on the same node.

    Kubernetes does have the ability to assign node affinity for pods, however since the Stateful Sets are a template, there is no opportunity to specify a different kubernetes.io/hostname selector for each copy.

    As the example is written, this is particularly important for pod 0 as it is the only writer and the only one guaranteed to have the most up-to-date version of the data.

    It might be possible to work around this problem if the replica count exceeded the worker count and all peers were writable masters; however, incorporating such logic into the pod would negate much of the benefit of using Stateful Sets.

  • A worker going offline prevents the pod from being started.

    In the shared storage case, it was possible to manually verify the host was down, delete the pod and have Kubernetes restart it.

    Without shared storage this is no longer possible for pod 0 as that worker is the only one with the data used to bootstrap the slaves.

    The only options are to bring the worker back, or manually alter the node affinities to have pod 0 replace the slave on the worker with the most up-to-date one.

Summing Up

While Stateful Sets may not satisfy those looking for data redundancy, they are a welcome addition to Kubernetes that will require pod safety and termination guarantees before they can really shine. The example gives us a glimpse of the future but arguably shouldn’t be used in production yet.

Those looking to manage a database with Kubernetes today would be advised to use individual pods and/or vanilla ReplicaSets; they will need the ability to verify the physical status of the underlying hardware and should be prepared to perform manual recovery in some scenarios.

HA for Composible Deployments of OpenStack

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 24, 2016 03:20 AM

One of the hot topics for OpenStack deployments is composable roles - the ability to mix-and-match which services live on which nodes.

This is mostly a solved problem for services not managed by the cluster, but what of the services still managed by the cluster?

Considerations

  1. Scale up

    Naturally we want to be able to add more capacity easily

  2. Scale down

    And have the option to take it away again if it is no longer necessary

  3. Role re-assignment post-deployment

    Ideally the task of taking capacity from one service and giving it to another would be a core capability and not require a node be nuked from orbit first.

  4. Flexible role assignments

    Ideally, the architecture would not impose limitations on how roles are assigned.

    By allowing roles to be assigned on an ad-hoc basis, we can allow arrangements that avoid single-points-of-failure (SPoF) and potentially take better advantage of the available hardware. For example:

    • node 1: galera and rabbit
    • node 2: galera and mongodb
    • node 3: rabbit and mongodb

    This also has implications when just one of the roles needs to be scaled up (or down). If roles become inextricably linked at install time, this requires every service in the group to scale identically - potentially resulting in higher hardware costs when there are services that cannot do so and must be separated.

    Instead, even if two services (let’s say galera and rabbit) are originally assigned to the same set of nodes, this should imply nothing about how either of them can or should be scaled in the future.

    We want the ability to deploy a new rabbit server without requiring it host galera too.

Scope

This need only apply to non-OpenStack services, however it could be extended to those as well if you were unconvinced by my other recent proposal.

At Red Hat, the list of services affected would be:

  • HAProxy
  • Any VIPs
  • Galera
  • Redis
  • Mongo DB
  • Rabbit MQ
  • Memcached
  • openstack-cinder-volume

Additionally, if the deployment has been configured to provide Highly Available Instances:

  • nova-compute-wait
  • nova-compute
  • nova-evacuate
  • fence-nova

Proposed Solution

In essence, I propose that there be a single native cluster, consisting of between 3 (the minimum sane cluster size) and 16 (roughly Corosync’s out-of-the-box limit) nodes, augmented by a collection of zero-or-more remote nodes.

Both native and remote nodes will have roles assigned to them, allowing Pacemaker to automagically move resources to the right location based on the roles.

Note that all nodes, both native and remote, can have zero-or-more roles and it is also possible to have a mixture of native and remote nodes assigned to the same role.

This will allow us, by changing a few flags (and potentially adding extra remote nodes to the cluster), to go from a fully collapsed deployment to a fully segregated one - and not only at install time.

If installers wish to support it [1], this architecture can cope with roles being split out (or recombined) after deployment, and of course the cluster won’t need to be taken down and resources will move as appropriate.

Although there is no hard requirement that anything except the fencing devices run on the native nodes, best practice would arguably dictate that HAProxy and the VIPs be located there unless an external load balancer is in use.

The purpose of this would be to limit the impact of a hypothetical pacemaker-remote bug. Should such a bug exist, by virtue of being the gateway to all the other APIs, HAProxy and the VIPs are the elements one would least want to be affected.

Some installers may even choose to enforce this in the configuration, but “by convention” is probably sufficient.

Implementation Details

The key to this implementation is Pacemaker’s concept of node attributes and expressions that make use of them.

Instance attributes can be created with commands of the form:

pcs property set --node controller-0 proxy-role=true

Note that this differs from the osprole=compute/controller scheme used in the Highly Available Instances instructions. That arrangement wouldn’t work here as each node may have several roles assigned to it.
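Since these are just per-node attributes, giving a node several roles is simply a matter of setting several of them. For example (proxy-role comes from above; the other attribute names are mine, chosen to follow the same pattern):

pcs property set --node controller-0 proxy-role=true galera-role=true rabbitmq-role=true
pcs property set --node rabbitmq-extra-0 rabbitmq-role=true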

Under the covers, the result in Pacemaker’s configuration would look something like:

<cib ...>
  <configuration>
    <nodes>
      <node id="1" uname="controller-0">
        <instance_attributes id="controller-0-attributes">
          <nvpair id="controller-0-proxy-role" name="proxy-role" value="true"/>
...

These attributes can then be referenced in location constraints that restrict the resource to a subset of the available nodes based on certain criteria.

For example, we would use the following for HAProxy:

pcs constraint location haproxy-clone rule score=0 proxy-role eq true

which would create the following under the covers:

<rsc_location id="location-haproxy" rsc="haproxy-clone">
  <rule id="location-haproxy-rule" score="0">
    <expression id="location-haproxy-rule-expr" attribute="proxy-role" operation="eq" value="true"/>
  </rule>
</rsc_location>

Any node, native or remote, not meeting the criteria is automatically eliminated as a possible host for the service.

Pacemaker also defines some node attributes automatically based on a node’s name and type. These are also available for use in constraints. This allows us, for example, to force a resource such as nova-evacuate to run on a “real” cluster node with the command:

pcs constraint location nova-evacuate rule score=0 "#kind" ne remote

For deployments based on Pacemaker 1.1.15 or later, we can also simplify the configuration by using pattern matching in our constraints.

  1. Restricting all the VIPs to nodes with the proxy role:

     <rsc_location id="location-haproxy-ips" resource-discovery="exclusive" rsc-pattern="^(ip-.*)"/>
    
  2. Restricting nova-compute to compute nodes (assuming a standardized naming convention is used):

     <rsc_location id="location-nova-compute-clone" resource-discovery="exclusive" rsc-pattern="nova-compute-(.*)"/>
    

Final Result

This is what a fully active cluster would look like:

9 nodes configured
87 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
RemoteOnline: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 rabbitmq-extra-0 storage-0 storage-1 ]

 ip-172.16.3.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-192.0.2.17 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-172.16.2.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.16.2.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.16.1.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Slaves: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.3.30 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Slaves: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 rabbitmq-extra-0 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started storage-0
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: nova-compute-wait-clone [nova-compute-wait]
     Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
 fence-nova (stonith:fence_compute): Started overcloud-controller-0
 storage-0 (ocf::pacemaker:remote): Started overcloud-controller-1
 storage-1 (ocf::pacemaker:remote): Started overcloud-controller-2
 overcloud-compute-0 (ocf::pacemaker:remote): Started overcloud-controller-0
 overcloud-compute-1 (ocf::pacemaker:remote): Started overcloud-controller-1
 overcloud-compute-2 (ocf::pacemaker:remote): Started overcloud-controller-2
 rabbitmq-extra-0 (ocf::pacemaker:remote): Started overcloud-controller-0

A small wish, but it would be nice if installers used meaningful names for the VIPs instead of the underlying IP addresses they manage.

  1. One reason they may not do so on day one is the careful co-ordination that some services can require when there is no overlap between the old and new sets of nodes assigned to a given role. Galera is one specific case that comes to mind.

Thoughts on HA for Multi-Subnet Deployments of OpenStack

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 24, 2016 03:20 AM

In a normal deployment, in order to direct traffic to the same HAProxy instance, Pacemaker will ensure that each VIP is configured on at most one HAProxy machine.

However in a spine and leaf network architecture, the nodes are in multiple subnets and there may be a limitation that the machines can not be part of a common L3 network that the VIPs could be added to.

Once the traffic reaches HAProxy everything should JustWork(tm) - modulo creating the appropriate networking rules, the problem is getting it to the proxy.

The approach to dealing with this will need to differ based on the latencies that can be guaranteed between every node in the cluster. At Red Hat, we define LAN-like latencies to be 2ms or better - consistently and between every node that would make up the cluster.

Low Latency Links

You have more flexibility in low latency scenarios as the cluster software can operate as designed.

At a high level, the possible ways forward are:

  1. Decide the benefit isn’t worth it and create an L3 network just for the VIPs to live on.

  2. Put all the controllers into a single subnet.

    Just be mindful of what will happen if the switch connecting them goes bad.

  3. Replace the HAProxy/VIP portion of the architecture with a load balancer appliance that is accessible from the child subnets.

  4. Move the HAProxy/VIP portion of the architecture into a dedicated 3-node cluster load balancer that is accessible from the child subnets.

    The new cluster would need the list of controllers and some health checks which could be as simple as “is the controller up/down” or as complex as “is service X up/down”.

Right now creating a load balancer near the network spine would have to be an extra manual step for users of TripleO. However once composable roles (the ability to mix and match which services go on which nodes) are supported, it should be possible to install such a configuration out of the box by placing three machines near the spine and giving only them the “haproxy” role.

Higher Latency

Corosync has very strict latency requirements of no more than 2ms for any of its links. Assuming your installer can deploy across subnets, the existence of a higher-latency link would be a barrier to the creation of a highly available deployment.

To work around these requirements, we can use Pacemaker Remote to extend Pacemaker’s ability to manage services on nodes separated by higher latency links.

In TripleO, the work needed to make this style of deployment possible is already planned as part of our “HA for composable roles” design.

As per option 4 of the low latency case, such a deployment would consist of a three node cluster containing only HAProxy and some floating IPs.

The rest of the nodes that make up the traditional OpenStack control-plane are managed as remote cluster nodes. This means that instead of running a traditional Corosync and Pacemaker stack, they have only the pacemaker-remote daemon and do not participate in leader elections or quorum calculations.
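For reference, turning a machine into a remote node is comparatively lightweight. A sketch for RHEL-style distributions (host names are examples, and the authkey must match the one used by the cluster nodes):

# On the would-be remote node
yum install pacemaker-remote resource-agents pcs
mkdir -p /etc/pacemaker
scp cluster-node:/etc/pacemaker/authkey /etc/pacemaker/authkey
firewall-cmd --permanent --add-port=3121/tcp && firewall-cmd --reload
systemctl enable --now pacemaker_remote

# On an existing cluster node, add it to the cluster as a remote resource
pcs resource create compute-0 ocf:pacemaker:remote server=compute-0.example.com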

External Load Balancers

If you wish to use a dedicated load balancer, then the 3-node cluster would just co-ordinate the actions of the remote nodes and not host any services locally.

An installer may conceivably create them anyway but leave them disabled to simplify the testing matrix.

General Considerations

The creation of a separate subnet or set of subnets for fencing is highly encouraged.

In general we want to avoid the possibility of a single network(ing) failure taking out communication to both a set of nodes and the device that can turn them off.

Everything in HA is a trade-off between the chance of a particular failure occurring and the consequences if it ever actually happens. Everyone will likely draw the line in a different place based on their risk aversion; all I can do is make recommendations based on my background in this field.

Working with OpenStack Images

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 20, 2016 02:45 AM

Creating Images

For creating images, I recommend the virt-builder tool that ships with RHEL based distributions and possibly others:

virt-builder centos-7.2 --format qcow2 --install "cloud-init" --selinux-relabel

Note the use of the --selinux-relabel option. If you specify --install but do not include this option, you may end up with instances that treat all attempts to log in as security violations and block them.

The cloud-init package is incredibly useful (discussed later) but isn’t available in CentOS images by default, so I recommend adding it to any image you create.

For the full list of supported targets, try virt-builder -l. Targets should include CirrOS as well as several versions of openSUSE, Fedora, CentOS, Debian, and Ubuntu.

Adding Packages to an existing Image

On RHEL based distributions, the virt-customize tool is available and makes adding a new package to an existing image simple.

virt-customize  -v -a myImage --install "wget,ntp" --selinux-relabel

Note once again the use of the --selinux-relabel option. This should only be used for the last step of your customization. As above, not doing so may result in an instance that treats all attempts to log in as security violations and blocks them.

Richard Jones also has a good post about updating RHEL images since they require subscriptions. Just be sure to use --sm-unregister and --selinux-relabel at the very end.

Logging in

If you haven’t already, tell OpenStack about your keypair:

nova keypair-add myKey --pub-key ~/.ssh/id_rsa.pub

Now you can tell your provisioning tool to add it to the instances it creates. For Heat, the template would look like this:

myInstance:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    key_name: myKey

However almost no image will let you log in, via ssh or on the console, as root. Instead they normally create a new user that has full sudo access. Red Hat images default to cloud-user while CentOS has a centos user.

If you don’t already know which user your instance has, you can use nova console-log myServer to see what happens at boot time.

Assuming you configured a key to add to the instance, you might see a line such as:

ci-info: ++++++Authorized keys from /home/cloud-user/.ssh/authorized_keys for user cloud-user+++++++

which tells you which user your image supports.

Customizing an Instance at Boot Time

This section relies heavily on the cloud-init package. If it is not present in your images, be sure to add it using the techniques above before trying anything below.

Running Scripts

Running scripts on the instances once they’re up can be a useful way to customize your images, start services and generally work-around bugs in officially provided images.

The list of commands to run is specified as part of the user_data section of a Heat template or can be passed to nova boot with the --user-data option:

myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      #!/bin/sh -ex

      # Fix broken qemu/strstr()
      # https://bugzilla.redhat.com/show_bug.cgi?id=1269529#c9
      touch /etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned

Note the extra options passed to /bin/sh. The -e tells the script to terminate if any command produces an error and the -x tells the shell to log everything that is being executed. This is particularly useful as it causes the script’s execution to be available in the console’s log (nova console-log myServer).

When Scripts Take a Really Long Time

If we have scripts that take a really long time, we may want to delay the creation of subsequent resources until our instance is fully configured.

If we are using Heat, we can set this up by creating SwiftSignal and SwiftSignalHandle resources to coordinate resource creation with notifications/signals that could be coming from sources external or internal to the stack.

signal_handle:
  type: OS::Heat::SwiftSignalHandle

wait_on_server:
  type: OS::Heat::SwiftSignal
  properties:
    handle: {get_resource: signal_handle}
    count: 1
    timeout: 2000

We then add a layer of indirection to the user_data: portion of the instance definition using the str_replace: function to replace all occurrences of “wc_notify” in the script with an appropriate curl PUT request using the “curl_cli” attribute of the SwiftSignalHandle resource.

myNode:
  type: OS::Nova::Server
  properties:
    image: { get_param: image }
    flavor: { get_param: flavor }
    user_data_format: RAW
    user_data:
      str_replace:
        params:
          wc_notify:   { get_attr: ['signal_handle', 'curl_cli'] }
        template: |
          #!/bin/sh -ex

          my_command_that --takes-a-really long-time

          wc_notify --data-binary '{"status": "SUCCESS", "data": "Script execution succeeded"}'

Now the creation of myNode will only be considered successful if and when the script completes.

Installing Packages

One should avoid the temptation to hardcode calls to a specific package manager as part of a script as it limits the usefulness of your template. Instead, this is done in a platform agnostic way using the packages directive.

Note that instance creation will not fail if packages fail to install or are already present. Check for any required binaries or files as part of the script.

user_data_format: RAW
user_data:
  #cloud-config
  # See http://cloudinit.readthedocs.io/en/latest/topics/examples.html
  packages:
    - ntp
    - wget

Note that this will NOT work for images that need a Red Hat subscription. There is supposed to be a way to have it register, however I’ve had no success with this method and instead I recommend you create a new image that has any packages listed here pre-installed.

Installing Packages and Running scripts

The first line of the user_data: section (#cloud-config or #!/bin/sh) is used to determine how it should be interpreted. So if we wish to take advantage of scripting and cloud-init, we must combine the two pieces into a multi-part MIME message.

The cloud-init docs include a MIME helper script to assist in the creation of complex user_data: blocks.

Simply create a file for each section and invoke with a command line similar to:

python ./mime.py cloud.config:text/cloud-config cloud.sh:text/x-shellscript

The resulting output can then be pasted in as a template and even edited in-place later. Here is an example that includes notification for a long running process:

user_data_format: RAW
user_data:
  str_replace:
    params:
      wc_notify:   { get_attr: ['signal_handle', 'curl_cli'] }
    template: |
      Content-Type: multipart/mixed; boundary="===============3343034662225461311=="
      MIME-Version: 1.0
      
      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/cloud-config; charset="us-ascii"
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename="cloud.config"

      #cloud-config
      packages:
        - ntp
        - wget

      --===============3343034662225461311==
      MIME-Version: 1.0
      Content-Type: text/x-shellscript; charset="us-ascii"
      Content-Transfer-Encoding: 7bit
      Content-Disposition: attachment; filename="cloud.sh"
      
      #!/bin/sh -ex

      my_command_that --takes-a-really long-time

      wc_notify --data-binary '{"status": "SUCCESS", "data": "Script execution succeeded"}'

Evolving the OpenStack HA Architecture

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 07, 2016 04:06 AM

In the current OpenStack HA architecture used by Red Hat, SuSE and others, Systemd is the entity in charge of starting and stopping most OpenStack services. Pacemaker exists as a layer on top, signalling when this should happen, but Systemd is the part making it happen.

This is a valuable contribution for active/passive (A/P) services and those that require all their dependencies be available during their startup and shutdown sequences. However as OpenStack matures, more and more components are able to operate in an unconstrained active/active capacity with little regard for the startup/shutdown order of their peers or dependencies - making them well suited to be managed by Systemd.

For this reason, a future revision of the HA architecture should limit Pacemaker’s involvement to core services like Galera and Rabbit as well as the few remaining OpenStack services that run A/P.

This would be particularly useful as we look towards a containerised future. It both allows OpenStack to play nicely with the current generation of container managers which lack Orchestration, as well as reduces recovery and downtime by allowing for the maximum parallelism.

Divesting most OpenStack services from the cluster also removes Pacemaker as a potential obstacle for moving them to WSGI. It is as-yet unclear if services will live under a single Apache instance or many and the former would conflict with Pacemaker’s model of starting, stopping and monitoring services as individual components.

Objection 1 - Pacemaker as an Alerting Mechanism

Using Pacemaker as an alerting mechanism for a large software stack is of limited value. Of course Pacemaker needs to know when a service dies, but it necessarily takes action straight away rather than waiting around to see if there will be other failures with which it can correlate a root cause.

In large complex software stacks, the recovery and alerting components should not be the same thing because they do and should operate on different timescales.

Pacemaker also has no way to include the context of a failure in an alert and thus no way to report the difference between Nova failing and Nova failing because Keystone is dead. Indeed Keystone being the root cause could be easily lost in a deluge of notifications about the failure of services that depend on it.

For this reason, as the number of services and dependancies grow, Pacemaker makes a poor substitute for a well configured monitoring and alerting system (such as Nagios or Sensu) that can also integrate hardware and network metrics.

Objection 2 - Pacemaker has better Monitoring

Pacemaker’s native ability to monitor services is more flexible than Systemd’s, which relies on a “PID up == service healthy” mode of thinking [1].

However, just as Systemd is the entity performing the startup and shutdown of most OpenStack services, it is also the one performing the actual service health checks.

To actually take advantage of Pacemaker’s monitoring capabilities, you would need to write Open Cluster Framework (OCF) agents [2] for every OpenStack service. While this would not take a rocket scientist to achieve, it is an opportunity for the way services are started in clustered and non-clustered environments to diverge.
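For the curious, the skeleton of such an agent is not large. A deliberately naive sketch for a hypothetical openstack-foo service (the health check URL is invented, and real agents implement the full set of OCF actions, return codes and metadata):

#!/bin/sh
case "$1" in
    start)    systemctl start openstack-foo; exit $? ;;
    stop)     systemctl stop openstack-foo;  exit $? ;;
    monitor)
        # This is the part that differs per service: hit the API, not just the PID.
        if curl -sf http://localhost:9999/healthcheck >/dev/null; then
            exit 0   # OCF_SUCCESS
        else
            exit 7   # OCF_NOT_RUNNING
        fi ;;
    meta-data)
        echo "<!-- XML describing the agent and its parameters goes here -->"; exit 0 ;;
    *)        exit 3 ;;   # OCF_ERR_UNIMPLEMENTED
esac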

So while it may feel good to look at a cluster and see that Pacemaker is configured to check the health of a service every N seconds, all that really achieves is to sync Pacemaker’s understanding of the service with what Systemd already knew. In practice, on average, this ends up delaying recovery by N/2 seconds instead of making it faster.

Bonus Round - Active/Passive FTW

Some people have the impression that A/P is a better or simpler mode of operation for services, and in this way justify the continued use of Pacemaker to manage OpenStack services.

Support for A/P configurations is important: it allows us to make applications that are in no way cluster-aware more available by reducing the requirements on the application to almost zero.

However, the downside is slower recovery as the service must be bootstrapped on the passive node, which implies increased downtime. So at the point the service becomes smart enough to run in an unconstrained A/A configuration, you are better off to do so - with or without a cluster manager.

  1. Watchdog-like functionality is only a variation on this, it only tells you that the thread responsible for heartbeating to Systemd is alive and well - not if the APIs it exposes are functioning. 

  2. Think SYS-V init scripts with some extra capabilities and requirements particular to clustered/automated environment. It’s a standard historically supported by the Linux Foundation but hasn’t caught on much since it was created in the late 90’s. 

Minimum Viable Cluster

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at October 07, 2015 05:51 AM

In the past there was a clear distinction between high performance (HP) clustering and high availability (HA) clustering, however the lines have been blurring for some time. People have scaled HA clusters upwards and HP-inspired clusters have been used to provide availability through redundancy.

The trend in providing availability of late has been towards the HP model - pools of anonymous and stateless workers that can be replaced at will. A really attractive idea, but in order to pull it off they have to make assumptions that may or may not be compatible with some peoples’ workloads.

Assumptions are neither wrong nor bad, you just need to make sure they are compatible with your environment.

So when looking for an availability solution, keep Occam’s razor in mind but don’t be a slave to it. Look for the simplest architecture but then work upwards until you find one that meets the needs of your actual (not ideal) application or stack.

Starting Simple

Application HA is the simplest kind of cluster you can deploy because the cluster and the application are the same thing. It takes care of talking to its peers, checking to see if they’re still online, deciding if it should remain operational (because too many peers were lost) and synchronising any data between itself and peers.

This gives you basic fault tolerance, when a node fails there are other copies with sufficient state to take up the workload.

Galera and RabbitMQ (with replicated queues) are two popular examples in this category.

However when I said Application HA was the simplest, that’s only from an admin’s point of view, because the application is doing everything.

Some issues the creators of these kinds of applications think about ahead of time:

  • Can I assume a node that I can’t see is offline?
  • What to do when some of the nodes cannot see other ones? (quorum)
  • What to do when half the nodes cannot see the other half? (split-brain)
  • Does it matter if the application is still active on nodes we cannot see? (data integrity)
  • Is there state that needs to be synchronised? (replication)
  • If so, how to do so reliably and in the presence of past and future failures? (reconciliation)

So if you’re looking to create a custom application with similar properties, make sure you can fund the development team you will need to make it happen.

And remember that the reality of those simplifying assumptions will only be apparent after everything else has already hit the fan.

But let’s assume the best-case here… if all you need is one of these existing applications, great! Install, configure, done. Right?

Maybe. It might depend on your hardware budget.

Unfortunately (or perhaps not) most companies aren’t Google, Twitter or Bookface. Most companies do not have thousands of nodes in their cluster, in fact getting many of them to have more than two can be a struggle.

In such environments the overhead of having 1?, 2?, 10?!? spare nodes (just in case of a failure - which will surely never happen) starts to represent a significant portion of their balance sheet.

As the number of spare nodes goes down, so does the number of failures that the system can absorb. It is irrelevant if a failure leaves two (or twenty) functional nodes if the load generated by the clients exceeds the application’s ability to keep up.

An overloaded system leads to operation timeouts which generates even more load and more timeouts. The surviving nodes aren’t really functional at that point either.

If the services lived in a widget of some kind (perhaps Docker containers or KVM images), we could have a higher level entity that would make new copies for us. Problem solved right?

Maybe. Is your application truly stateless?

Some are. Memcache is one that comes to mind because it’s a cache, neither creating nor consuming anything. However, even web servers seem to want session state these days, so chances are your application isn’t stateless either.

Stateless is hard.

Where do new instances recover their state from? Who are its peers? A static list isn’t going to be possible if the widgets are anonymous cattle. Do you need a discovery protocol in your application?

There may also be a penalty for bringing up a brand new instance. For example, the sync time for a new Galera instance is a function of the dataset size and network bandwidth. That can easily run into the tens-of-minutes range.

So there is an incentive to either stop modelling everything as cattle or to keep the state somewhere else.

Ok, so let’s put all the state in a database. Problem solved, right?

Maybe. How smart is your widget manager?

You could create a single widget with both the application and the database. That would allow you to use systemd to achieve Node HA

  • the automated and deterministic management of a set of resources within a single node.

In some ways, systemd looks a lot like a cluster manager. It knows about sets of services, it knows relationships between them (so that the database is started before the application) and it knows how to recover the stack if something fails.
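
As a concrete (and entirely hypothetical) sketch, the application’s unit file might express that relationship like this - the service and binary names here are my own invention:

[Unit]
Description=Example application that needs the local database
Requires=mariadb.service
After=mariadb.service

[Service]
ExecStart=/usr/bin/example-app
Restart=on-failure

[Install]
WantedBy=multi-user.target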

Unfortunately you’re out of luck if a failure on node A requires recovery (of the same or a different service) on node B, because the caveat is that all these relationships must exist within a single node.

This of course is not the container model - which likes to have each service in its own widget. More importantly, you always need to pay the database synchronisation cost for every failure, which is not ideal.

Alternatively, if your application isn’t active-active, you don’t even get the option of combining them into a single flavour of widget.

By splitting them up into two however, the synchronisation cost is only payable when a database widget dies. This improves your recovery time and makes the widget purists happy, but now you need to make sure the application doesn’t start until the database is running, synchronised and available to take requests.

About now you might be tempted to think about putting retry loops in the application instead.

Chances are, however, that there is another service that is a client of the application (and there is a client of the client of the …).

Every time you build in another level of retry loops, you increase your failure detection time and ultimately your downtime.

Hence the question: How smart is your widget manager?

  • It needs to ensure there are at least N copies of a widget active.
  • It might need to ensure there are less than M copies available.
  • It might need to ensure the application starts after the database.
  • It might need to be able to stop the application if not enough copies of the database are around and/or writable. Perhaps it got corrupted? Perhaps someone needs to do maintenance?

Let’s assume the widget manager can do these things. Most can, so that means we’re done, right?

Maybe. What happens if the widget manager cannot see one of its peers?

Just because the widget manager cannot see one of its peers with a bunch of application widgets does not mean those widgets aren’t happily swallowing client requests they can never process and/or writing to the data store via some other subnet.

If this does not apply to your application, consider yourself blessed.

For the rest of us, in order to preserve data integrity, we need the widget manager to take steps to ensure that the peer it can no longer see does not have any active widgets.

This is one reason why systemd is rarely sufficient on its own.

Hint: A great way to do this is to power off the host

Are you done yet?

One thing we skipped over is where the database itself is storing its state.

If you were using bare metal, you could store it there - but that’s old-fashioned. Storing it in the KVM image or docker container isn’t a good idea: you’d lose everything if the last container ever died.

Projects like glusterfs are options, just be sure you understand what happens when partitions form.

If you’re thinking of something like NFS or iSCSI, consider where those would come from. Almost certainly you don’t want a single node serving them up - that would introduce a single point of failure and the whole point of this is to remove those.

You could add a SAN into the mix for some hardware redundancy, however you need to ensure either:

  • exactly one node accesses the SAN at any time (active/passive), or
  • your filesystem can handle concurrent reads and writes from multiple hosts (active/active)

Both options will require quorum and fencing in order to reliably hand out locks. This is the sweet-spot of a full blown cluster manager, System HA, and why traditional, scary, cluster managers like Pacemaker and Veritas exist.

Unless you’d like to manually resolve block-level conflicts after a split-brain, some part of the system needs to rigorously enforce these things. Otherwise it’s widget managers all the way down.

One of Us

Once you have a traditional cluster manager, you might be surprised how useful it can be.

A lot of applications are resilient to failures once they’re up, but have non-trivial startup sequences. Consider RabbitMQ:

  • Pick one active node and start rabbitmq-server
  • Everywhere else, run
    • Start rabbitmq-server
    • rabbitmqctl stop_app
    • rabbitmqctl join_cluster rabbit@${firstnode}
    • rabbitmqctl start_app
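
For illustration, assuming systemd-based hosts and a hand-picked first node called rabbit-1 (both assumptions on my part), that sequence boils down to something like:

# on the chosen first node
systemctl start rabbitmq-server

# on every other node
systemctl start rabbitmq-server
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@rabbit-1
rabbitmqctl start_app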

Now the Rabbit’s built-in HA can take over, but to get to that point:

  • How do you pick which is the first node?
  • How do you tell everyone else who it is?
  • Can rabbitmq accept updates before all peers have joined?
  • Can your app?

This is the sort of thing traditional cluster managers do before breakfast. They are, after all, really just distributed finite state machines.

Recovery can be a troublesome time too:

http://previous.rabbitmq.com/v3_3_x/clustering.html

the last node to go down must be the first node to be brought online. If this doesn’t happen, the nodes will wait 30 seconds for the last disconnected node to come back online, and fail afterwards.

Depending on how the nodes were started, you may see some nodes running and some stopped. What happens if the last node isn’t online yet?

Some cluster managers support concepts like dual-phased services (or “Master/slave” to use the politically incorrect term) that can allow automated recovery even with constraints such as these. We have Galera agents that also take advantage of these capabilities - finding the ‘right’ node to bootstrap before synchronising it to all the peers.

Final thoughts

HA is a spectrum, where you fit depends on what assumptions you can make about your application stack.

Just don’t make those assumptions before you really understand the problem at hand, because retrofitting an application to remove simplifying assumptions (such as only supporting pets) is even harder than designing it in in the first place.

What’s your minimum viable cluster?

Receiving Reliable Notification of Cluster Events

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at September 01, 2015 02:16 AM

When Pacemaker 1.1.14 arrives, there will be a more reliable way to receive notification of cluster events.

In the past, we relied on an ocf:pacemaker:ClusterMon resource to monitor the cluster status with the crm_mon daemon and trigger alerts on each cluster event.

One of the arguments to ClusterMon was the location of a custom script that would be called when the event happened. This script could then create an SNMP trap, SMS, email, etc to alert the admin based on dynamically filled environment variables describing precisely the cluster event that occurred.

The Problem

Relying on a cluster resource proved to be not such a great idea for a number of reasons:

  • Alerts ceased if the resource was not running
  • There was no indication that the alerts had ceased
  • The resource was likely to be stopped at exactly the point that something interesting was happening
  • Old alerts were likely to be resent whenever the status section of the cib was rebuilt when a new DC was elected

The Solution

Clearly support for notifications needed to be baked into the core of Pacemaker, so that’s what we’ve now done. Finally (sorry, you wouldn’t believe the length of my TODO list).

To make it work, drop a script onto each of the nodes (/var/lib/pacemaker/notify.sh would be a good option), then tell the cluster to start using it:

    pcs property set notification-agent=/var/lib/pacemaker/notify.sh

Like resource agents, this one can be written in whatever language makes you happy - as long as it can read environment variables.

Pacemaker will check that your agent completed and report its return code. If the return code is not 0, Pacemaker will also log any output from your agent.

The agent is called asynchronously and should complete quickly. If it has not completed after 5 minutes it will be terminated by the cluster.

Where to Send Alerts

I think we can all agree that hard coding the intended recipient of the notification into the scripts would be a bad idea. It would make updating the recipient (vacation, change of role, change of employer) annoying and prevent the scripts from being reused between different clusters.

So there is also a notification-recipient cluster property which will be passed to the script. It can contain whatever you like, in whatever format you like, as long as the notification-agent knows what to do with it.

To get people started, the source includes a sample agent which assumes notification-recipient is a filename, eg.

    pcs property set notification-recipient=/var/lib/pacemaker/notify.log
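
In the same spirit, here is a minimal sketch of such an agent (my own illustration, not the shipped sample) that simply appends one line per event to that file:

    #!/bin/sh
    # Minimal notification agent: append a line per cluster event to the
    # file named by the notification-recipient cluster property.
    if [ -z "${CRM_notify_recipient}" ]; then
        echo "No notification-recipient configured" >&2
        exit 1
    fi
    echo "$(date): kind=${CRM_notify_kind} node=${CRM_notify_node} rsc=${CRM_notify_rsc} task=${CRM_notify_task} desc=${CRM_notify_desc}" >> "${CRM_notify_recipient}"
    exit 0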

Interface

We preserved the old list of environment variables, so any existing ClusterMon scripts will still work in this new mode. I have added a few extra ones though.

Environment variables common to all notification events:

  • CRM_notify_kind (New) Indicates the type of notification. One of resource, node, or fencing.
  • CRM_notify_version (New) Indicates the version of Pacemaker sending the notification.
  • CRM_notify_recipient The value specified by notification-recipient from the cluster configuration.

Additional environment variables available for notification of node up/down events (new):

  • CRM_notify_node The node name for which the status changed
  • CRM_notify_nodeid (New) The node id for which the status changed
  • CRM_notify_desc The current node state. One of member or lost.

Additional environment variables available for notification of fencing events (both successful and failed):

  • CRM_notify_node The node for which the status changed.
  • CRM_notify_task The operation that caused the status change.
  • CRM_notify_rc The numerical return code of the operation.
  • CRM_notify_desc The textual output of the relevant error code of the operation (if any) that caused the status change.

Additional environment variables available for notification of resource operations:

  • CRM_notify_node The node on which the status change happened.
  • CRM_notify_rsc The name of the resource that changed the status.
  • CRM_notify_task The operation that caused the status change.
  • CRM_notify_interval (New) The interval of a resource operation
  • CRM_notify_rc The numerical return code of the operation.
  • CRM_notify_target_rc The expected numerical return code of the operation.
  • CRM_notify_status The numerical representation of the status of the operation.
  • CRM_notify_desc The textual output of the relevant error code of the operation (if any) that caused the status change.

Fencing for Fun and Profit with SBD

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at August 31, 2015 06:51 AM

What is this Fencing Thing and do I Really Need it?

Fundamentally fencing is a mechanism for turning a question

Is node X capable of causing corruption or divergent datasets?

into an answer

No

so that the cluster can safely initiate recovery after a failure.

This question exists because we cannot assume that an unreachable node is in fact off.

Sometimes it will do this by powering the node off - clearly a dead node can do no harm. Other times we will use a combination of network (stop traffic from arriving) and disk (stop a rogue process from writing anything to shared storage) fencing.

Fencing is a requirement of almost any cluster, regardless of whether it is active/active, active/passive or involves shared storage (or not).

One of the best ways of implementing fencing is with a remotely accessible power switch, however some environments may not allow them, see the value in them, or have ones that are not suitable for clustering (such as IPMI devices that lose power with the host they control).

Enter SBD

SBD can be particularly useful in environments where traditional fencing mechanisms are not possible.

SBD integrates with Pacemaker, a watchdog device and, optionally, shared storage to arrange for nodes to reliably self-terminate when fencing is required (such as node failure or loss of quorum).

This is achieved through a watchdog device, which will reset the machine if SBD does not poke it on a regular basis or if SBD closes its connection “ungracefully”.

Without shared storage, SBD will arrange for the watchdog to expire if:

  • the local node loses quorum, or
  • the Pacemaker, Corosync or SBD daemons are lost on the local node and are not recovered, or
  • Pacemaker determines that the local node requires fencing, or
  • in the extreme case that Pacemaker kills the sbd daemon as part of recovery escalation

When shared storage is available, SBD can also be used to trigger fencing of its peers.

It does this through the exchange of messages via shared block storage such as a SAN, iSCSI, FCoE. SBD on the target peer sees the message and triggers the watchdog to reset the local node.

These properties of SBD also make it particularly useful for dealing with network outages, potentially between different datacenters, or when the cluster needs to forcefully recover a resource that refuses to stop.

Documentation is another area where diskless SBD shines, because it requires no special knowledge of the user’s environment.

Not a Silver Bullet

One of the ways in which SBD recognises that the node has become unhealthy is to look for quorum being lost. However traditional quorum makes no sense in a two-node cluster and is often disabled by setting no-quorum-policy=ignore.

SBD will honour this setting though, so in the event of a network failure in a two-node cluster, the node isn’t going to self-terminate.

Likewise if you enabled Corosync 2’s two_node option, both sides will always have quorum and neither party will self-terminate.
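
For reference, that option lives in the quorum section of corosync.conf, along these lines:

quorum {
    provider: corosync_votequorum
    two_node: 1
}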

It is therefore suggested to have three or more nodes when using SBD without shared storage.

Additionally, using SBD for fencing relies on at least part of a system that has already shown itself to be malfunctioning (otherwise we wouldn’t be fencing it) to function correctly.

Everything has been done to keep SBD as small, simple and reliable as possible, however all software has bugs and you should choose an appropriate level of paranoia for your circumstances.

Installation

RHEL 7 and derivatives like CentOS include sbd, so all you need is yum install -y sbd.

For other distributions, you’ll need to build it from source.

# git clone git@github.com:ClusterLabs/sbd.git
# cd sbd
# autoreconf -i
# ./configure

then either

# make rpm

or

# sudo make all install
# sudo install -D -m 0644 src/sbd.service /usr/lib/systemd/system/sbd.service
# sudo install -m 644 src/sbd.sysconfig /etc/sysconfig/sbd

NOTE: The instructions here do not apply to the version of SBD that currently ships with openSUSE and SLES.

Configuration

SBD’s configuration lives in /etc/sysconfig/sbd by default, and we include a sample to get you started.

For our purposes here, we can ignore the shared disk functionality and concentrate on how SBD can help us recover from loss of quorum as well as daemon and resource-level failures.

Most of the defaults will be fine, and really all you need to do is specify the watchdog device present on your machine.

Simply set SBD_WATCHDOG_DEV to the path where we can find your device and that’s it. Below is the config from my cluster:

# grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_PACEMAKER=yes
SBD_STARTMODE=clean
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

Beware: If uname -n does not match the name of the node in the cluster configuration, you will need to pass the advertised name to SBD with the -n option. Eg. SBD_OPTS="-n special-name-1"
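
Depending on your setup, you will likely also want to tell Pacemaker how long to allow for the watchdog to fire, via the stonith-watchdog-timeout property referenced in the Uninstalling section below. A value comfortably larger than SBD_WATCHDOG_TIMEOUT is the usual advice; assuming the 5 second timeout above, something like:

# pcs property set stonith-watchdog-timeout=10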

Adding a Watchdog to a Virtual Machine

Anyone experimenting with virtual machines can add a watchdog device to an existing instance by editing the xml and restarting the instance:

virsh edit vmnode

Add <watchdog model='i6300esb'/> underneath the <devices> tag. Save and close, then reboot the instance to have the config change take effect:

virsh destroy vmnode
virsh start vmnode

You can then confirm the watchdog was added:

virsh dumpxml vmnode | grep -A 1 watchdog 

The output should look something like:

<watchdog model='i6300esb' action='reset'>
  <alias name='watchdog0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</watchdog>

Using a Software Watchdog

If you do not have a real watchdog device, you should go out and get one.

However you’re probably investigating SBD because it was not possible/permitted to get a real fencing device, so there is a strong chance you’re going to be using a software-based watchdog device.

Software-based watchdog devices are not evil incarnate, however you should be aware of their limitations. They are, after all, software, and require a degree of correctness from a system that has already shown itself to be malfunctioning (otherwise we wouldn’t be fencing it).

That being said, it still provides value when there is a network outage, potentially between different datacenters, or when the cluster needs to forcefully recover a resource that refuses to stop.

To use a software watchdog, you’ll need to load the kernel’s softdog module:

/sbin/modprobe softdog

Once loaded you’ll see the device appear and you can set SBD_WATCHDOG_DEV accordingly:

# ls -al /dev/watchdog
crw-rw----. 1 root root 10, 130 Aug 31 14:19 /dev/watchdog

Don’t forget to arrange for the softdog module to be loaded at boot time too:

# echo softdog > /etc/modules-load.d/softdog.conf 

Using SBD

On a systemd based system, enabling SBD with systemctl enable sbd will ensure that SBD is automatically started and stopped whenever corosync is.

If you’re integrating SBD with a distro that doesn’t support systemd, you’ll likely want to edit the corosync or cman init script to both source the sysconfig file and start the sbd daemon.

Simulating a Failure

To see SBD in action, you could:

  • stop pacemaker without stopping corosync, and/or
  • kill the sbd daemon, and/or
  • use stonith_admin -F
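
The last of those, for example, would look something like this (the node name is a placeholder for one of your own):

# stonith_admin -F node-2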

Killing pacemakerd is usually not enough to trigger fencing because systemd will restart it “too” quickly. Likewise, killing one of the child daemons will only result in pacemakerd respawning them.

Uninstalling

On every host, run:

# systemctl disable sbd

Then on one node, run:

# pcs property set stonith-watchdog-timeout=0
# pcs cluster stop --all

At this point no part of the cluster, including Corosync, Pacemaker or SBD should be running on any node.

Now you can start the cluster again and completely remove the stonith-watchdog-timeout option:

# pcs cluster start --all
# pcs property unset stonith-watchdog-timeout

Troubleshooting

SBD will refuse to start if the configured watchdog device does not exist. You might see something like this:

# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; disabled)
   Active: inactive (dead)

To obtain more logging from SBD, pass additional -V options to the sbd daemon when launching it.
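
Since the daemon is normally launched for you, one way to pass extra options is via SBD_OPTS in /etc/sysconfig/sbd (the same mechanism used for -n above), eg.

SBD_OPTS="-V -V"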

SBD will trigger the watchdog (and your node will reboot) if uname -n is different to the name of the node in the cluster configuration. If this is the case for you, pass the correct name to sbd with the -n option.

Pacemaker will refuse to start if it detects that SBD should be in use but cannot find the sbd process.

The have-watchdog property will indicate if Pacemaker considers SBD to be in use:

# pcs property 
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2609
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: freeze

Double Failure - Get out of Jail Free? Not so Fast

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 29, 2015 03:34 AM

Occasionally you might hear someone explain away a failure/recovery scenario with “that’s a double failure, we can’t/don’t protect against those”.

There are certainly situations where this is true. A node failure combined with a fencing device failure will and should prevent a cluster from recovering services on that node.

However!

It doesn’t mean we can ignore the failure. Nor does it make it acceptable to forget that services on the failed node still need to be recovered one day.

Playing the “double failure” card also requires the failures to be in different layers. Notice that the example above was for a node failure and fencing failure.

The failure of a second node while recovering from the first doesn’t count (unless it was your last one).

Just something to keep in mind in case anyone was thinking about designing something to support highly available openstack instances…

Life at the Intersection of Pets and Cattle

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 28, 2015 02:50 AM

Scale UP: Servers are like pets - you name them and when they get sick you nurse them back to health

Scale OUT: Servers are like cattle - you number them and when they get sick you shoot them

Why Pets?

The theory goes that pets have no place in the server room. Everything should be stateless and replicated and if one copy dies, who cares because there are 100 more.

Except real life isn’t like that.

It’s hard to build replicated, stateless, shared-nothing systems; it’s even hard just bringing them online, because the applications often need external context in order to distinguish between different recovery scenarios.

I cover some of these ideas in my Highly Available Openstack Deployments document.

Indeed that document shows that even the systems built to manage cattle contain pieces that need to be treated as pets. They demonstrate the very limitations of the model they advocate.

Life at the Intersection

Eventually someone realises they need a pet after all.

This is when things get interesting, because baked in from the start is the assumption that we don’t need to care about cattle:

  • It doesn’t matter if some of the cattle die, there’s plenty more
  • It doesn’t matter if some of the cattle die when a paddock explodes, there’s plenty more
  • It doesn’t matter if some of the cattle are lost moving them from one paddock to another, there’s plenty more
  • It doesn’t matter if new cattle stock is lost unloading them into a new paddock, there’s plenty more
  • It doesn’t matter, just try again

The assumptions manifest themselves in a variety of ways:

  • Failed actions are not retried
  • Error reporting is confused for error handling
  • Incomplete records (since the cattle can be easily re-counted)

All of which makes adopting some cattle as pets really freaking hard.

Raising Pets in Openstack

Some things are easier said than done.

When the compute node hosting an instance dies, evacuate it elsewhere

Easy right?

All we need to do is notice that the compute node disappeared, make sure it’s really dead (otherwise it might be running twice, which would be bad), and pick someone to call evacuate.

Except:

  • You can’t call evacuate before nova notices its peer is gone
  • You can’t (yet) tell nova that its peer has gone

Ok, so we can define a fencing device that talks to nova, loops until it notices the peer is gone and calls evacuate.

Not so fast, the failure that took out the compute node may have also taken out part of the control plane, which needs fencing to complete before it can be recovered. However in order for fencing to complete, the control plane needs to have recovered (nova isn’t going to be able to tell you it noticed the peer died if your request can’t be authenticated).

Ok, so we can’t use a fencing device, but the cluster will let services know when their peers go away. The notifications are even independent of recovering the VIPs, so as long as at least one of the control nodes survives, we can block waiting for nova and make it work. We just need to arrange for only one of the survivors to perform the evacuations.

Job done, retire…

Not so fast kimosabi. Although we can recover a single set of failed compute and/or control nodes, what if there is a subsequent failure? You’ve had one failure, that means more work, more work means more opportunities to create more failures.

Oh, and by the way, you can’t call evacuate more than once. Nor is there a definitive way to determine if an instance is being evacuated.

Here are some of the ways we could still fail to recover instances:

  • A compute node that is in the process of initiating evacuations dies
    It takes time for nova to accept the evacuation calls, there is a window for some to be lost if this node dies too.
  • A compute node which is receiving an evacuated instance dies
    At what point does the new compute node “own” the instance such that, if this node died too, the instance would be picked up by a subsequent evacuate call? Depending on what is recorded inside nova and when, you might have a problem.
  • The control node which is orchestrating an evacuation dies
    Is there a window between a request being removed from the queue and it being actioned to the point that it will complete? Depending on what is recorded inside nova and when, you might have a problem.
  • A control node hosting one of the VIPs dies while an evacuation is in progress (probably)
    Do any of the activities associated with evacuation require the use of inter-component APIs? If so, you might have a problem if one of those APIs is temporarily unavailable
  • Some other entity (human or otherwise) also initiates an evacuation
    If there is no way for nova to say if an instance is being evacuated, how can we expect an admin to know that initiating one would be unsafe?
  • It is the 3rd Tuesday of a month with 5 Saturdays and the moon is a waxing gibbous
    Ok, perhaps I’m a little jaded at this point.

Doing it Right

Hopefully I’ve demonstrated the difficulty of adding pets (highly available instances) as an afterthought. All it took to derail the efforts here was the seemingly innocuous decision that the admin should be responsible for retrying failed evacuations (based on it not having appeared somewhere else after a while?). Who knows what similar assumptions are still lurking.

At this point, people are probably expecting that I put my Pacemaker hat on and advocate for it to be given responsibility for all the pets. Sure we could do it, we could use nova APIs to manage them just like we do when people use their hypervisors directly.

But that’s never going to happen, so let’s look at the alternatives. I foresee three main options:

  1. First class support for pets in nova
    Seriously, the scheduler is the best place for all this, it has all the info to make decisions and the ability to make them happen.

  2. First class support for pets in something that replaces nova
    If the technical debt or political situation is such that nova cannot move in this direction, perhaps someone else might.

  3. Creation of a distributed finite state machine that:

    • watches or is somehow told of new instances to track
    • watches for successful fencing events
    • initiates and tracks evacuations
    • keeps track of its peer processes so that instances are still evacuated in the event of process or node failure

The cluster community has pretty much all the tech needed for the last option, but it is mostly written in C so I expect that someone will replicate it all in Python or Go :-)

If anyone is interested in pursuing capabilities in this area and would like to benefit from the knowledge that comes with 14 years’ experience writing cluster managers, drop me a line.

Adding Managed Compute Nodes to a Highly Available Openstack Control Plane

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 12, 2015 11:21 PM

As previously announced on RDO list and GitHub, we now have a way to allow Pacemaker to manage compute nodes within a single cluster while still allowing us to scale beyond corosync’s limits.

Having this single administrative domain then allows us to do clever things like automated recovery of VMs running on a failed or failing compute node.

The main difference with the previous deployment mode is that services on the compute nodes are now managed and driven by the Pacemaker cluster on the control plane.

The compute nodes do not become full members of the cluster and no longer require the full cluster stack; instead they run pacemaker_remoted, which acts as a conduit.

Assumptions

We start by assuming you have a functional Juno or Kilo control plane configured for HA and access to the pcs cluster CLI.

If you don’t have this already, there is a decent guide on Github for how to achieve this.

Basics

We start by installing the required packages onto the compute nodes from your favourite provider:

yum install -y openstack-nova-compute openstack-utils python-cinder openstack-neutron-openvswitch openstack-ceilometer-compute python-memcached wget openstack-neutron pacemaker-remote resource-agents pcs

While we’re here, we’ll also install some pieces that aren’t in any packages yet (do this on both the compute nodes and the control plane):

mkdir /usr/lib/ocf/resource.d/openstack/
wget -O /usr/lib/ocf/resource.d/openstack/NovaCompute https://github.com/beekhof/osp-ha-deploy/raw/master/pcmk/NovaCompute
chmod a+x /usr/lib/ocf/resource.d/openstack/NovaCompute 

wget -O /usr/sbin/fence_compute https://github.com/beekhof/osp-ha-deploy/raw/master/pcmk/fence_compute
chmod a+x /usr/sbin/fence_compute

Next, on one node, generate a key that pacemaker on the control plane will use to authenticate with pacemaker-remoted on the compute nodes.

dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1

Now copy that to all the other machines (control plane and compute nodes).
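
Something like the following would do it (the host names, and SSH access as root, are assumptions about your environment):

for host in controller-1 controller-2 controller-3 compute-1 compute-2 compute-3; do
    ssh ${host} mkdir -p /etc/pacemaker
    scp /etc/pacemaker/authkey ${host}:/etc/pacemaker/authkey
done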

At this point we can enable and start pacemaker-remoted on the compute nodes:

chkconfig pacemaker_remote on
service pacemaker_remote start

Finally, copy /etc/nova/nova.conf, /etc/nova/api-paste.ini, /etc/neutron/neutron.conf, and /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini from the control plane to each of your compute nodes.

If you’re using ceilometer, you’ll also want /etc/ceilometer/ceilometer.conf from there too.

Preparing the Control Plane

At this point, we need to take down the control plane in order to safely update the cluster configuration. We don’t want things to be bouncing around while we make large scale modifications.

pcs resource disable keystone

Next we must tell the cluster to look for and run the existing control plane services only on the control plane (and not the about to be defined compute nodes). We can automate this with the clever use of scripting tools:

for i in $(cibadmin -Q --xpath //primitive --node-path | tr ' ' '\n' | awk -F "id='" '{print $2}' | awk -F "'" '{print $1}' | uniq | grep -v "\-fence") ; do pcs constraint location $i rule resource-discovery=exclusive score=0 osprole eq controller ; done

Defining the Compute Node Services

Now we can create the services that can run on the compute node. We create them in a disabled state so that we have a chance to limit where they can run before the cluster attempts to start them.

pcs resource create neutron-openvswitch-agent-compute  systemd:neutron-openvswitch-agent --clone interleave=true --disabled --force
pcs constraint location neutron-openvswitch-agent-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute 

pcs resource create libvirtd-compute systemd:libvirtd  --clone interleave=true --disabled --force
pcs constraint location libvirtd-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute

pcs resource create ceilometer-compute systemd:openstack-ceilometer-compute --clone interleave=true --disabled --force
pcs constraint location ceilometer-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute

pcs resource create nova-compute ocf:openstack:NovaCompute user_name=admin tenant_name=admin password=keystonetest domain=${PHD_VAR_network_domain} --clone interleave=true notify=true --disabled --force
pcs constraint location nova-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute

Please note, a previous version of this post used:

pcs resource create nova-compute ocf:openstack:NovaCompute --clone interleave=true --disabled --force

Make sure you use the new form

Now that the services, and where they can be located, are defined, we specify the order in which they must be started.

pcs constraint order start neutron-server-clone then neutron-openvswitch-agent-compute-clone require-all=false

pcs constraint order start neutron-openvswitch-agent-compute-clone then libvirtd-compute-clone
pcs constraint colocation add libvirtd-compute-clone with neutron-openvswitch-agent-compute-clone

pcs constraint order start libvirtd-compute-clone then ceilometer-compute-clone
pcs constraint colocation add ceilometer-compute-clone with libvirtd-compute-clone

pcs constraint order start ceilometer-notification-clone then ceilometer-compute-clone require-all=false

pcs constraint order start ceilometer-compute-clone then nova-compute-clone
pcs constraint colocation add nova-compute-clone with ceilometer-compute-clone

pcs constraint order start nova-conductor-clone then nova-compute-clone require-all=false

Configure Fencing for the Compute nodes

At this point we need to define how compute nodes can be powered off (‘fenced’ in HA terminology) in the event of a failure.

I have a switched APC PDU, the configuration for which looks like this:

pcs stonith create fence-compute fence_apc ipaddr=east-apc login=apc passwd=apc pcmk_host_map="east-01:2;east-02:3;east-03:4;"

But you might be using Drac or iLO, which would require you to define one for each node, eg.

pcs stonith create fence-compute-1 fence_ipmilan login="root" passwd="supersecret" ipaddr="192.168.1.1" pcmk_host_list="compute-1"
pcs stonith create fence-compute-2 fence_ipmilan login="root" passwd="supersecret" ipaddr="192.168.1.2" pcmk_host_list="compute-2"
pcs stonith create fence-compute-3 fence_ipmilan login="root" passwd="supersecret" ipaddr="192.168.1.3" pcmk_host_list="compute-3"

Be careful when using devices that lose power with the hosts they control. For such devices, a power failure and a network failure look identical to the cluster, which makes automated recovery unsafe.

Obsolete Instructions

Please note, a previous version of this post included the following instructions, however they are no longer required.

Next we configure the integration piece that notifies nova whenever the cluster fences one of the compute nodes. Adjust the following command to conform to your environment:

pcs --force stonith create fence-nova fence_compute domain=example.com login=admin tenant-name=admin passwd=keystonetest auth-url=http://vip-keystone:35357/v2.0/

Use pcs stonith describe fence_compute if you need more information about any of the options.

Finally we instruct the cluster that both methods are required to consider the host safely fenced. Assuming the fence_ipmilan case, you would then configure:

pcs stonith level add 1 compute-1 fence-compute-1,fence-nova
pcs stonith level add 1 compute-2 fence-compute-2,fence-nova
pcs stonith level add 1 compute-3 fence-compute-3,fence-nova

Re-enabling the Control Plane and Registering Compute Nodes

The location constraints we defined above reference node properties which we now define with the help of some scripting magic:

for node in $(cibadmin -Q -o nodes | grep uname | sed s/.*uname..// | awk -F\" '{print $1}' | awk -F. '{print $1}'); do pcs property set --node ${node} osprole=controller; done

Connections to remote hosts are modelled as resources in Pacemaker. So in order to add them to the cluster, we define a service for each one and set the node property that allows it to run compute services.

Once again assuming the three compute nodes from earlier, we would run:

pcs resource create compute-1 ocf:pacemaker:remote
pcs resource create compute-2 ocf:pacemaker:remote
pcs resource create compute-3 ocf:pacemaker:remote

pcs property set --node compute-1 osprole=compute
pcs property set --node compute-2 osprole=compute
pcs property set --node compute-3 osprole=compute

Thunderbirds are Go!

The only remaining step is to re-enable all the services and run crm_mon to watch the cluster bring them and the compute nodes up:

pcs resource enable keystone
pcs resource enable neutron-openvswitch-agent-compute
pcs resource enable libvirtd-compute
pcs resource enable ceilometer-compute
pcs resource enable nova-compute

Feature Spotlight - Smart Resource Restart from the Command Line

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 14, 2014 10:17 AM

Restarting a resource can be a complex affair if there are things that depend on that resource or if any of the operations might take a long time.

Stopping a resource is easy, but it can be hard for scripts to determine at what point the target resource has stopped (in order to know when to re-enable it), at what point it is appropriate to give up, and even what resources might have prevented the stop or start phase from completing.

For this reason, I am pleased to report that we will be introducing a --restart option for crm_resource in Pacemaker 1.1.13.

How it works

Assuming the following invocation

crm_resource --restart --resource dummy

The tool will:

  1. Check the current state of the cluster
  2. Set the target-role for dummy to stopped
  3. Calculate the future state of the cluster
  4. Compare the current state to the future state
  5. Work out the list of resources that still need to stop
  6. If there are resources to be stopped
    1. Work out the longest timeout of all stopping resources
    2. Look for changes until the timeout
    3. If nothing changed, indicate which resources failed to stop and exit
    4. Go back to step 4.
  7. Now that everything has stopped, remove the target-role setting for dummy to allow it to start again
  8. Calculate the future state of the cluster
  9. Compare the current state to the future state
  10. Work out the list of resources that still need to start
  11. If there are resources to be started
    1. Work out the longest timeout of all starting resources
    2. Look for changes until the timeout
    3. If nothing changed, indicate which resources failed to start and exit
    4. Go back to step 9.
  12. Done

Considering Clones

crm_resource is also smart enough to restart clone instances running on specific nodes with the optional --node hostname argument. In this scenario instead of setting target-role (which would take down the entire clone), we use the same logic as crm_resource --ban and crm_resource --clear to enable/disable the clone from running on the named host.
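
So, assuming a clone named dummy-clone with an instance running on a node called node-2 (both placeholder names), restarting just that instance would look like:

crm_resource --restart --resource dummy-clone --node node-2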

Want to know more?

Drop by IRC or ask us a question on the Pacemaker mailing list

There is also plenty of documentation available.

Feature Spotlight - Controllable Resource Discovery

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 13, 2014 02:09 AM

Coming in 1.1.13 is a new option for location constraints: resource-discovery

This new option controls whether or not Pacemaker performs resource discovery for the specified resource on nodes covered by the constraint. The default always, preserves the pre-1.1.13 behaviour.

The options are:

  • always - (Default) Always perform resource discovery for the specified resource on this node.

  • never - Never perform resource discovery for the specified resource on this node. This option should generally be used with a -INFINITY score, although that is not strictly required.

  • exclusive - Only perform resource discovery for the specified resource on this node. Multiple location constraints using exclusive discovery for the same resource across different nodes creates a subset of nodes resource-discovery is exclusive to. If a resource is marked for exclusive discovery on one or more nodes, that resource is only allowed to be placed within that subset of nodes.
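
As a sketch, and reusing the rule syntax shown in the compute-node post above (the galera-clone resource and the osprole node attribute are assumptions of mine), restricting discovery of a resource to the nodes tagged as controllers might look like:

pcs constraint location galera-clone rule resource-discovery=exclusive score=0 osprole eq controller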

Why would I want this?

Limiting resource discovery to a subset of the nodes the resource is physically capable of running on can significantly boost performance when a large set of nodes is present. When pacemaker_remote is in use to expand the node count into the 100s of nodes range, this option can have a dramatic effect on the speed of the cluster.

Is using this option ever a bad idea?

Absolutely!

Setting this option to never or exclusive allows the possibility for the resource to be active in those locations without the cluster’s knowledge. This can lead to the resource being active in more than one location!

There are primarily three ways for this to happen:

  1. If the service is started outside the cluster’s control (ie. at boot time by init, systemd, etc; or by an admin)
  2. If the resource-discovery property is changed while part of the cluster is down or suffering split-brain
  3. If the resource-discovery property is changed for a resource/node while the resource is active on that node

When is it safe?

For the most part, it is only appropriate when:

  1. you have more than 8 nodes (including bare metal nodes with pacemaker-remoted), and
  2. there is a way to guarentee that the resource can only run in a particular location (eg. the required software is not installed anywhere else)

Want to know more?

Drop by IRC or ask us a question on the Pacemaker mailing list

There is also plenty of documentation available.

Release Candidate: 1.1.12-rc1

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 07, 2014 08:16 PM

As promised, this announcement brings the first release candidate for Pacemaker 1.1.12

https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.12-rc1

This release primarily focuses on important but mostly invisible changes under-the-hood:

  • The CIB is now O(2) faster. That's 100x for those not familiar with Big-O notation :-)

    This has massively reduced the cluster’s use of system resources, allowing us to scale further on the same hardware, and dramatically reduced failover times for large clusters.

  • Support for ACLs is enabled by default.

    The new implementation can restrict cluster access for containers where pacemaker-remoted is used and is also more efficient.

  • All CIB updates are now serialized and pre-synchronized via the corosync CPG interface. This makes it impossible for updates to be lost, even when the cluster is electing a new DC.

  • Schema versioning changes

    New features are no longer silently added to the schema. Instead the ${Y} in pacemaker-${X}-${Y} will be incremented for simple additions, and ${X} will be bumped for removals or other changes requiring an XSL transformation.

    To take advantage of new features, you will need to update all the nodes and then run the equivalent of cibadmin --upgrade.
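
    For example (a minimal sketch - the grep is simply a convenient way to confirm which schema is now in use):

    # cibadmin --upgrade
    # cibadmin --query | grep validate-with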

Thank you to everyone that has tested out the new CIB and ACL code already. Please keep those bug reports coming in!

List of known bugs to be investigated during the RC phase:

  • 5206 Fileencoding broken
  • 5194 A resource starts with a standby node. (Latest attrd does not serve as the crmd-transition-delay parameter)
  • 5197 Fail-over is delayed. (State transition is not calculated.)
  • 5139 Each node fenced in its own transition during start-up fencing
  • 5200 target node is over-utilized with allow-migrate=true
  • 5184 Pending probe left in the cib
  • 5165 Add support for transient node utilization attributes

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker

  2. Install dependencies (if you haven't already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep

  3. Build Pacemaker

    # make rc

  4. Copy and deploy as needed

Details

Changesets: 633
Diff: 184 files changed, 12690 insertions(+), 5843 deletions(-)

Highlights

Features added since Pacemaker-1.1.11

  • Changes to the ACL schema to support nodes and unix groups
  • cib: Check ACLs prior to making the update instead of parsing the diff afterwards
  • cib: Default ACL support to on
  • cib: Enable the more efficient xml patchset format
  • cib: Implement zero-copy status update (performance)
  • cib: Send all r/w operations via the cluster connection and have all nodes process them
  • crm_mon: Display brief output if "-b/--brief" is supplied or 'b' is toggled
  • crm_ticket: Support multiple modifications for a ticket in an atomic operation
  • Fencing: Add the ability to call stonith_api_time() from stonith_admin
  • logging: daemons always get a log file, unless explicitly configured to 'none'
  • PE: Automatically re-unfence a node if the fencing device definition changes
  • pengine: cl#5174 - Allow resource sets and templates for location constraints
  • pengine: Support cib object tags
  • pengine: Support cluster-specific instance attributes based on rules
  • pengine: Support id-ref in nvpair with optional “name”
  • pengine: Support per-resource maintenance mode
  • pengine: Support site-specific instance attributes based on rules
  • tools: Display pending state in crm_mon/crm_resource/crm_simulate if --pending/-j is supplied (cl#5178)
  • xml: Add the ability to have lightweight schema revisions
  • xml: Enable resource sets in location constraints for 1.2 schema
  • xml: Support resources that require unfencing

Changes since Pacemaker-1.1.11

  • acl: Authenticate pacemaker-remote requests with the node name as the client
  • cib: allow setting permanent remote-node attributes
  • cib: Do not disable cib disk writes if on-disk cib is corrupt
  • cib: Ensure 'cibadmin -R/--replace' commands get replies
  • cib: Fix remote cib based on TLS
  • cib: Ignore patch failures if we already have their contents
  • cib: Resolve memory leaks in query paths
  • cl#5055: Improved migration support.
  • cluster: Fix segfault on removing a node
  • controld: Do not consider the dlm up until the address list is present
  • controld: handling startup fencing within the controld agent, not the dlm
  • crmd: Ack pending operations that were cancelled due to rsc deletion
  • crmd: Actions can only be executed if their pre-requisites completed successfully
  • crmd: Do not erase the status section for unfenced nodes
  • crmd: Do not overwrite existing node state when fencing completes
  • crmd: Do not start timers for already completed operations
  • crmd: Fenced nodes that return prior to an election do not need to have their status section reset
  • crmd: make lrm_state hash table not case sensitive
  • crmd: make node_state erase correctly
  • crmd: Prevent manual fencing confirmations from attempting to create node entries for unknown nodes
  • crmd: Prevent memory leak in error paths
  • crmd: Prevent memory leak when accepting a new DC
  • crmd: Prevent message relay from attempting to create node entries for unknown nodes
  • crmd: Prevent SIGPIPE when notifying CMAN about fencing operations
  • crmd: Report unsuccessful unfencing operations
  • crm_diff: Allow the generation of xml patchsets without digests
  • crm_mon: Allow the file created by --as-html to be world readable
  • crm_mon: Ensure resource attributes have been unpacked before displaying connectivity data
  • crm_node: Only remove the named resource from the cib
  • crm_node: Prevent use-after-free in tools_remove_node_cache()
  • crm_resource: Gracefully handle -EACCESS when querying the cib
  • fencing: Advertise support for reboot/on/off in the metadata for legacy agents
  • fencing: Automatically switch from ‘list’ to ‘status’ to ‘static-list’ if those actions are not advertised in the metadata
  • fencing: Correctly record which peer performed the fencing operation
  • fencing: default to ‘off’ when agent does not advertise ‘reboot’ in metadata
  • fencing: Execute all required fencing devices regardless of what topology level they are at
  • fencing: Pass the correct options when looking up the history by node name
  • fencing: Update stonith device list only if stonith is enabled
  • get_cluster_type: failing concurrent tool invocations on heartbeat
  • iso8601: Different logic is needed when logging and calculating durations
  • lrmd: Cancel recurring operations before stop action is executed
  • lrmd: Expose logging variables expected by OCF agents
  • lrmd: Merge duplicate recurring monitor operations
  • lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
  • mainloop: Fixes use after free in process monitor code
  • make resource ID case sensitive
  • mcp: Tell systemd not to respawn us if we exit with rc=100
  • pengine: Allow container nodes to migrate with connection resource
  • pengine: cl#5186 - Avoid running rsc on two nodes when node is fenced during migration
  • pengine: cl#5187 - Prevent resources in an anti-colocation from even temporarily running on a same node
  • pengine: Correctly handle origin offsets in the future
  • pengine: Correctly search failcount
  • pengine: Default sequential to TRUE for resource sets for consistency with colocation sets
  • pengine: Delay unfencing until after we know the state of all resources that require unfencing
  • pengine: Do not initiate fencing for unclean nodes when fencing is disabled
  • pengine: Do not unfence nodes that are offline, unclean or shutting down
  • pengine: Fencing devices default to only requiring quorum in order to start
  • pengine: fixes invalid transition caused by clones with more than 10 instances
  • pengine: Force record pending for migrate_to actions
  • pengine: handles edge case where container order constraints are not honored during migration
  • pengine: Ignore failure-timeout only if the failed operation has on-fail=”block”
  • pengine: Log when resources require fencing but fencing is disabled
  • pengine: Memory leaks
  • pengine: Unfencing is based on device probes, there is no need to unfence when normal resources are found active
  • Portability: Use basic types for DBus compatibility struct
  • remote: Allow baremetal remote-node connection resources to migrate
  • remote: Enable migration support for baremetal connection resources by default
  • services: Correctly reset the nice value for lrmd’s children
  • services: Do not allow duplicate recurring op entries
  • services: Do not block synced service executions
  • services: Fixes segfault associated with cancelling in-flight recurring operations.
  • services: Reset the scheduling policy and priority for lrmd's children without relying on SCHED_RESET_ON_FORK
  • services_action_cancel: Interpret return code from mainloop_child_kill() correctly
  • stonith_admin: Ensure pointers passed to sscanf() are properly initialized
  • stonith_api_time_helper now returns when the most recent fencing operation completed
  • systemd: Prevent use-of-NULL when determining if an agent exists
  • upstart: Allow compilation with glib versions older than 2.28
  • xml: Better move detection logic for xml nodes
  • xml: Check all available schemas when doing upgrades
  • xml: Convert misbehaving #define into a more easily understood inline function
  • xml: If validate-with is missing, we find the most recent schema that accepts it and go from there
  • xml: Update xml validation to allow ‘<node type=remote />’

Potential for data corruption affecting Pacemaker 1.1.6 through 1.1.9

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at March 19, 2014 01:24 PM

It has come to my attention that the potential for data corruption exists in Pacemaker versions 1.1.6 to 1.1.9

Everyone is strongly encouraged to upgrade to 1.1.10 or later.

Those using RHEL 6.4 or later (or a RHEL clone) should already have access to 1.1.10 via the normal update channels.

At issue is some faulty logic in a function called tengine_stonith_notify() which can incorrectly add successfully fenced nodes to a list, causing Pacemaker to subsequently erase that node’s status section when the next DC election occurs.

With the status section erased, the cluster thinks that node is safely down and begins starting any services it has on other nodes - despite those already being active.

In order to trigger the logic, the fenced node must:

  1. have been the previous DC
  2. been sufficiently functional to request its own fencing, and
  3. the fencing notification must arrive after the new DC has been elected, but before it invokes the policy engine

Given that this is the first we have heard of the issue since the problem was introduced in August 2011, the above sequence of events is apparently hard to hit under normal conditions.

Logs symptomatic of the issue look as follows:

# grep -e do_state_transition -e reboot  -e do_dc_takeover -e tengine_stonith_notify -e S_IDLE /var/log/corosync.log

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover: 	Marking gandalf, target of a previous stonith action, as clean
Mar 08 08:43:22 [9934] lorien       crmd:     info: do_state_transition: 	State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 08 08:43:28 [9934] lorien       crmd:     info: do_state_transition: 	State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]

Note in particular the final entry from tengine_stonith_notify():

Target may have been our leader gandalf (recorded: <unset>)

If you see this after Taking over DC status for this partition but prior to State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE, then you are likely to have resources running in more than one location after the next DC election.

The issue was fixed during a routine cleanup prior to Pacemaker-1.1.10 in @f30e1e43. However, the implications of what the old code allowed were not fully appreciated at the time.

Announcing 1.1.11 Beta Testing

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 21, 2013 02:00 PM

With over 400 updates since the release of 1.1.10, it's time to start thinking about a new release.

Today I have tagged release candidate 1. The most notable fixes include:

  • attrd: Implementation of a truly atomic attrd for use with corosync 2.x
  • cib: Allow values to be added/updated and removed in a single update
  • cib: Support XML comments in diffs
  • Core: Allow blackbox logging to be disabled with SIGUSR2
  • crmd: Do not block on proxied calls from pacemaker_remoted
  • crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
  • crmd: Use the load on our peers to know how many jobs to send them
  • crm_mon: add --hide-headers option to hide all headers
  • crm_report: Collect logs directly from journald if available
  • Fencing: On timeout, clean up the agent’s entire process group
  • Fencing: Support agents that need the host to be unfenced at startup
  • ipc: Raise the default buffer size to 128k
  • PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
  • PE: Allow location constraints to take a regex pattern to match against resource IDs
  • pengine: Distinguish between the agent being missing and something the agent needs being missing
  • remote: Properly version the remote connection protocol
  • services: Detect missing agents and permission errors before forking
  • Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
  • Bug cl#5179 - Corosync: Attempt to retrieve a peer’s node name if it is not already known
  • Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

If you are a user of pacemaker_remoted, you should take the time to read about changes to the on-the-wire protocol that are present in this release.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. If you haven't already, install Pacemaker's dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy the rpms and deploy as needed

Pacemaker and RHEL 6.4 (redux)

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at November 04, 2013 09:33 AM

The good news is that as of November 1st, Pacemaker is now supported on RHEL 6.4 - with two caveats.

  1. You must be using the updated pacemaker, resource-agents and pcs packages
  2. You must be using CMAN for membership and quorum (background)

Technically, support is currently limited to Pacemaker’s use in the context of OpenStack. In practice however, any bug that can be shown to affect OpenStack deployments has a good chance of being fixed.

Since a cluster with no services is rather pointless, the heartbeat OCF agents are now also officially supported. However, as Red Hat’s policy is to only ship supported agents, some agents are not present for this initial release.

The three primary reasons for not shipping agents were:

  1. The software the agent controls is not shipped in RHEL
  2. Insufficient experience to provide support
  3. Avoiding agent duplication

Filing bugs is definitely the best way to get agents in the second category prioritized for inclusion.

Likewise, if there is no shipping agent that provides the functionality of agents in the third category (IPv6addr and IPaddr2 might be an example here), filing bugs is the best way to get that fixed.

In the meantime, since most of the agents are just shell scripts, downloading the latest upstream agents is a viable work-around in most cases. For example:

    agents="Raid1 Xen"
    for a in $agents; do wget -O /usr/lib/ocf/resource.d/heartbeat/$a https://github.com/ClusterLabs/resource-agents/raw/master/heartbeat/$a; done

Changes to the Remote Wire Protocol in 1.1.11

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at October 31, 2013 12:04 PM

Unfortunately the current wire protocol used by pacemaker_remoted for exchanging messages was found to be suboptimal and we have taken the decision to change it now before it becomes widely adopted.

We attempted to do this in a backwards compatible manner; however, the two methods we tried were either overly complicated and fragile, or not possible due to the way the released crm_remote_parse_buffer() function operated.

The changes include a versioned binary header that contains the size of the header, payload and total message, control flags and a big/little-endian detector.

These changes will appear in the upstream repo shortly and ship in 1.1.11. Anyone for whom this will be a problem is encouraged to get in contact to discuss possible options.

For RHEL users, any version on which pacemaker_remoted is supported will have the new versioned protocol. That means 7.0 and potentially a future 6.x release.

Pacemaker 1.1.10 - final

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 26, 2013 10:11 AM

Announcing the release of Pacemaker 1.1.10

There were three changes of note since rc7:

  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cman: Do not pretend we know the state of nodes we’ve never seen

Along with assorted bug fixes, the major topics for this release were:

  • stonithd fixes
  • fixing memory leaks, often caused by incorrect use of glib reference counting
  • supportability improvements (code cleanup and deduplication, standardized error codes)

Release candidates for the next Pacemaker release (1.1.11) can be expected some time around November.

A big thank you to everyone that spent time testing the release candidates and/or contributed patches. However, now that Pacemaker is perfect, anyone reporting bugs will be shot :-)

To build rpm packages:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven't already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make release
    
  4. Copy and deploy as needed

Details - 1.1.10 - final

Changesets  602
Diff 143 files changed, 8162 insertions(+), 5159 deletions(-)


Highlights

Features added since Pacemaker-1.1.9

  • Core: Convert all exit codes to positive errno values
  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Allow options to be set recursively
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check start stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • PE: Suppress meaningless IDs when displaying anonymous clone status
  • Turn off auto-respawning of systemd services when the cluster starts them
  • Bug cl#5128 - pengine: Support maintenance mode for a single node

Changes since Pacemaker-1.1.9

  • crmd: cib: stonithd: Memory leaks resolved and improved use of glib reference counting
  • attrd: Fixes deleted attributes during dc election
  • Bug cf#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5148 - legacy: Correctly remove a node that used to have a different nodeid
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Bug cl#5152 - crmd: Correctly clean up fenced nodes during membership changes
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • Bug cl#5155 - pengine: Block the stop of resources if any depending resource is unmanaged
  • Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • Bug cl#5164 - crmd: Fixes crash when using pacemaker-remote
  • Bug cl#5164 - pengine: Fixes segfault when calculating transition with remote-nodes.
  • Bug cl#5167 - crm_mon: Only print “stopped” node list for incomplete clone sets
  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cib: Restore the ability to embed comments in the configuration
  • cluster: Detect and warn about node names with capitals
  • cman: Do not pretend we know the state of nodes we’ve never seen
  • cman: Do not unconditionally start cman if it is already running
  • cman: Support non-blocking CPG calls
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Ensure removed peers are erased from all caches
  • corosync: Nodes that can persist in sending CPG messages must be alive after all
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crmd: Store last-run and last-rc-change for all operations
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Restore the ability to manually confirm that fencing completed
  • ipc: Allow unprivileged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: Support "crm_node --remove" with a node name for corosync plugin (bnc#805278)
  • lrmd: Default to the upstream location for resource agent scratch directory
  • lrmd: Pass errors from lsb metadata generation back to the caller
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (ie. promote for a clone)
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd
  • systemd: Reload systemd after adding/removing override files for cluster services
  • xml: Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy

Release candidate: 1.1.10-rc7

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 22, 2013 10:50 AM

Announcing the seventh release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes to the policy engine, fencing daemon and crmd. We’ve squashed a bug involving constructing compressed messages and stonith-ng can now recover when a configuration ordering change is detected.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven't already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc7

Changesets  57
Diff 37 files changed, 414 insertions(+), 331 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc7

  • N/A

Changes since Pacemaker-1.1.10-rc6

  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • Bug cl#5164 - crmd: Fixes crmd crash when using pacemaker-remote
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cluster: Correctly construct the header for compressed messages
  • cluster: Detect and warn about node names with capitals
  • Core: remove the mainloop_trigger objects that are no longer needed.
  • corosync: Ensure removed peers are erased from all caches
  • cpg: Correctly free sent messages
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crm_mon: Bug cl#5167 - Only print “stopped” node list for incomplete clone sets
  • crm_node: Return 0 if --remove passed
  • fencing: Correctly detect existing device entries when registering a new one
  • lrmd: Prevent use-of-NULL in client library
  • pengine: cl#5164 - Fixes pengine segfault when calculating transition with remote-nodes.
  • pengine: Do the right thing when admins specify the internal resource instead of the clone
  • pengine: Re-allow ordering constraints with fencing devices now that it is safe to do so

Release candidate: 1.1.10-rc6

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at July 04, 2013 04:46 PM

Announcing the sixth release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes in the policy engine, fencing daemon and crmd. Previous fixes in rc5 have also now been confirmed.

Help is specifically requested for testing plugin-based clusters, ACLs, the --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

There is one bug open for David’s remote nodes feature (involving managing services on non-cluster nodes), but everything else seems good.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven't already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc6

Changesets  63
Diff 24 files changed, 356 insertions(+), 133 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc6

  • tools: crm_mon --neg-location drbd-fence-by-handler
  • pengine: cl#5128 - Support maintenance mode for a single node

Changes since Pacemaker-1.1.10-rc5

  • cluster: Correctly remove duplicate peer entries
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • pengine: Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (ie. promote for a clone)
  • pengine: Do the right thing when admins specify the internal resource instead of the clone

GPG Quickstart

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 24, 2013 02:40 PM

It seemed timely that I should refresh both my GPG knowledge and my keys. I am summarizing my method (and sources) below in the event that they may prove useful to others:

Preparation

The following settings ensure that any keys you create in the future are strong ones by 2013’s standards. Paste the following into ~/.gnupg/gpg.conf:

# when multiple digests are supported by all recipients, choose the strongest one:
personal-digest-preferences SHA512 SHA384 SHA256 SHA224
# preferences chosen for new keys should prioritize stronger algorithms: 
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 BZIP2 ZLIB ZIP Uncompressed
# when making an OpenPGP certification, use a stronger digest than the default SHA1:
cert-digest-algo SHA512

The next batch of settings are optional but aim to improve the output of gpg commands in various ways - particularly against spoofing. Again, paste them into ~/.gnupg/gpg.conf:

# when outputting certificates, view user IDs distinctly from keys:
fixed-list-mode
# long keyids are more collision-resistant than short keyids (it's trivial to make a key with any desired short keyid)
keyid-format 0xlong
# If you use a graphical environment (and even if you don't) you should be using an agent:
# (similar arguments as  https://www.debian-administration.org/users/dkg/weblog/64)
use-agent
# You should always know at a glance which User IDs gpg thinks are legitimately bound to the keys in your keyring:
verify-options show-uid-validity
list-options show-uid-validity
# include an unambiguous indicator of which key made a signature:
# (see http://thread.gmane.org/gmane.mail.notmuch.general/3721/focus=7234)
sig-notation issuer-fpr@notations.openpgp.fifthhorseman.net=%g

Create a New Key

There are several checks for deciding if your old key(s) are any good. However, if you created a key more than a couple of years ago, then realistically you probably need a new one.

I followed instructions from Ana Guerrero’s post, which were the basis of the current debian guide, but selected the 2013 default key type:

  1. run gpg --gen-key
  2. Select (1) RSA and RSA (default)
  3. Select a keysize greater than 2048
  4. Set a key expiration of 2-5 years. [rationale]
  5. Do NOT specify a comment for User ID. [rationale]

Add Additional UIDs and Setting a Default

At this point my keyring gpg --list-keys looked like this:

pub   4096R/0x726724204C644D83 2013-06-24
uid                 [ultimate] Andrew Beekhof <andrew@beekhof.net>
sub   4096R/0xC88100891A418A6B 2013-06-24 [expires: 2015-06-24]

Like most people, I have more than one email address and I will want to use GPG with them too. So now is the time to add them to the key. You'll want the gpg --edit-key command for this. Ana has a good example of adding UIDs and setting a preferred one. Just search her instructions for Add other UID.
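
In outline (a sketch only - the key id here is my own, substitute yours, and the UID number depends on your key), the interactive session looks something like:

gpg --edit-key 0x726724204C644D83
gpg> adduid        # answer the prompts for the extra name/address
gpg> uid 2         # select whichever UID should be the default
gpg> primary
gpg> save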

Separate Subkeys for Encryption and Signing

The general consensus is that separate keys should be used for signing versus encryption.

tl;dr - you want to be able to encrypt things without signing them as “signing” may have unintended legal implications. There is also the possibility that signed messages can be used in an attack against encrypted data.

By default gpg will create a subkey for encryption, but I followed Debian’s subkey guide for creating one for signing too (instead of using the private master key).

Doing this allows you to make your private master key even safer by removing it from your day-to-day keychain.

The idea is to make a copy first and keep it in an even more secure location, so that if a subkey (or the machine its on) gets compromised, your master key remains safe and you are always in a position to revoke subkeys and create new ones.
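
The short version is sketched below; back the master key up first, then add the signing subkey (see the Debian guide for the full procedure and its caveats):

gpg --export-secret-keys --armor 0x726724204C644D83 > master-secret.asc
gpg --edit-key 0x726724204C644D83
gpg> addkey        # choose "RSA (sign only)" plus a suitable size and expiry
gpg> save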

Sign the New Key with the Old One

If you have an old key, you should sign the new one with it. This tells everyone who trusted the old key that the new one is legitimate and can therefore also be trusted.

Here I went back to Ana’s instructions. Basically:

gpg --default-key OLDKEY --sign-key NEWKEY

or, in my case:

gpg --default-key 0xEC3584EFD449E59A --sign-key 0x726724204C644D83

Send it to a Key Server

Tell the world so they can verify your signature and send you encrypted messages:

gpg --send-key 0x726724204C644D83

Revoking Old UIDs

If you’re like me, your old key might have some addresses which you have left behind. You can’t remove addresses from your keys, but you can tell the world to stop using them.

To do this for my old key, I followed instructions on the gnupg mailing list
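
The gist of it is below (a sketch only - the UID number depends on your key):

gpg --edit-key 0xEC3584EFD449E59A
gpg> uid 2         # select the address you no longer use
gpg> revuid
gpg> save
gpg --send-key 0xEC3584EFD449E59A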

Everything still looks the same when you search for my old key:

pub  1024D/D449E59A 2007-07-20 Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@suse.de>
                               Andrew Beekhof <beekhof@gmail.com>
                               Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <abeekhof@novell.com>
	 Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

But if you click through to the key details, you’ll see the addresses associated with my time at Novell/SuSE now show revok in red.

pub  1024D/D449E59A 2007-07-20            
	 Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

uid Andrew Beekhof <beekhof@mac.com>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]

uid Andrew Beekhof <abeekhof@suse.de>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]
sig revok  D449E59A 2013-06-24 __________ __________ [selfsig]
...

This is how other people's copy of gpg knows not to use this key for that address anymore. And also why it's important to refresh your keys periodically.

Revoking Old Keys

Realistically though, you probably don’t want people using old and potentially compromised (or compromise-able) keys to send you sensitive messages. The best thing to do is revoke the entire key.

Since keys can’t be removed once you’ve uploaded them, you’re actually updating the existing entry. To do this you need the original private key - so keep it safe!

Some people advise you to pre-generate the revocation certificate - personally that seems like just one more thing to keep track of.

Orphaned keys that can’t be revoked still appear valid to anyone wanting to send you a secure message - a good reason to set an expiry date as a failsafe!
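
The procedure itself is short (sketched below with OLDKEY as a placeholder; gpg will ask for a reason and a confirmation):

gpg --gen-revoke 0xOLDKEY > revoke.asc    # requires the original private key
gpg --import revoke.asc                   # apply it to your local keyring
gpg --send-key 0xOLDKEY                   # update the copy on the key servers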

This is what one of my old revoked keys looks like:

pub  1024D/DABA170E 2004-10-11 *** KEY REVOKED *** [not verified]
                               Andrew Beekhof (SuSE VPN Access) <andrew@beekhof.net>
	 Fingerprint=9A53 9DBB CF73 AB8F B57B  730A 3279 4AE9 DABA 170E 

Final Result

My new key:

pub  4096R/4C644D83 2013-06-24 Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@redhat.com>
	 Fingerprint=C503 7BA2 D013 6342 44C0  122C 7267 2420 4C64 4D83 

Closing word

I am by no means an expert at this, so I would be very grateful to hear about any mistakes I may have made above.

Release candidate: 1.1.10-rc5

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at June 19, 2013 10:32 AM

Let's try this again… Announcing the fourth and a half release candidate for Pacemaker 1.1.10

I previously tagged rc4 but ended up making several changes shortly afterwards, so it was pointless to announce it.

This RC is a result of cleanup work in several ancient areas of the codebase:

  • A number of internal membership caches have been combined
  • The three separate CPG code paths have been combined

As well as:

  • Moving clones is now both possible and sane
  • Improved behavior on systemd based nodes
  • and other assorted bugfixes (see below)

Please keep the bug reports coming in!

Help is specifically requested for testing plugin-based clusters, ACLs, the new --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

Also any light that can be shed on possible memory leaks would be much appreciated.

If everything looks good in a week from now, I will re-tag rc5 as final.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc5

Changesets  168
Diff 96 files changed, 4983 insertions(+), 3097 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc5

  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check start stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • Turn off auto-respawning of systemd services when the cluster starts them

Changes since Pacemaker-1.1.10-rc3

  • Bug pengine: cl#5155 - Block the stop of resources if any depending resource is unmanaged
  • Convert all exit codes to positive errno values
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Nodes that can persist in sending CPG messages must be alive after all
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Everyone who gets a fencing notification should mark the node as down
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Update the status section with details of nodes for which we only know the nodeid
  • crm_report: Find logs in compressed files
  • logging: If SIGTRAP is sent before tracing is turned on, turn it on
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd

Release candidate: 1.1.10-rc3

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 23, 2013 10:32 AM

Announcing the third release candidate for Pacemaker 1.1.10

This RC is a result of work in several problem areas reported by users, some of which date back to 1.1.8:

  • manual fencing confirmations
  • potential problems reported by Coverity
  • the way anonymous clones are displayed
  • handling of resource output that includes non-printing characters
  • handling of on-fail=block

Please keep the bug reports coming in. There is a good chance that this will be the final release candidate and 1.1.10 will be tagged on May 30th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]	# make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc3

Changesets  116
Diff 59 files changed, 707 insertions(+), 408 deletions(-)


Highlights

Features added in Pacemaker-1.1.10-rc3

  • PE: Display a list of nodes on which stopped anonymous clones are not active instead of meaningless clone IDs
  • PE: Suppress meaningless IDs when displaying anonymous clone status

Changes since Pacemaker-1.1.10-rc2

  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • cib: CID#1023858 - Explicit null dereferenced
  • cib: CID#1023862 - Improper use of negative value
  • cib: CID#739562 - Improper use of negative value
  • cman: Our daemons have no need to connect to pacemakerd in a cman based cluster
  • crmd: Do not record pending delete operations in the CIB
  • crmd: Ensure pending and lost actions have values for last-run and last-rc-change
  • crmd: Insert async failures so that they appear in the correct order
  • crmd: Store last-run and last-rc-change for fail operations
  • Detect child processes that terminate before our SIGCHLD handler is installed
  • fencing: CID#739461 - Double close
  • fencing: Correctly broadcast manual fencing ACKs
  • fencing: Correctly mark manual confirmations as complete
  • fencing: Do not send duplicate replies for manual confirmation operations
  • fencing: Restore the ability to manually confirm that fencing completed
  • lrmd: CID#1023851 - Truncated stdio return value
  • lrmd: Don’t complain when heartbeat invokes us with -r
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • xml: Restore the ability to embed comments in the cib

Pacemaker Logging

Posted in That Cluster Guy by Andrew Beekhof (andrew@beekhof.net) at May 20, 2013 01:43 PM

Normal operation

Pacemaker inherits most of its logging settings from either CMAN or Corosync - depending on what it's running on top of.

In order to avoid spamming syslog, Pacemaker only logs a summary of its actions (NOTICE and above) to syslog.

If the level of detail in syslog is insufficient, you should enable a cluster log file. Normally one is configured by default and it contains everything except debug and trace messages.

To find the location of this file, either examine your CMAN (cluster.conf) or Corosync (corosync.conf) configuration file or look for syslog entries such as:

pacemakerd[1823]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log

If you do not see a line like this, either update the cluster configuration or set PCMK_debugfile in /etc/sysconfig/pacemaker
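
For a Corosync based cluster, the relevant part of corosync.conf looks something like this (the values are only an example):

logging {
    to_syslog: yes
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}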

crm_report also knows how to find all the Pacemaker related logs and blackbox files

If the level of detail in the cluster log file is still insufficient, or you simply wish to go blind, you can turn on debugging in Corosync/CMAN, or set PCMK_debug in /etc/sysconfig/pacemaker.

A minor advantage of setting PCMK_debug is that the value can be a comma-separated list of processes which should produce debug logging instead of a global yes/no.
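
So a typical /etc/sysconfig/pacemaker for a debugging session might contain something like the following (example values only):

# Send everything, including debug messages, to this file
PCMK_debugfile=/var/log/pacemaker.log
# Only these processes need to produce debug output
PCMK_debug=crmd,pengine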

When an ERROR occurs

Pacemaker includes support for a blackbox.

When enabled, the blackbox contains a rolling buffer of all logs (not just those sent to syslog or a file) and is written to disk after a crash or assertion failure.

The blackbox recorder can be enabled by setting PCMK_blackbox in /etc/sysconfig/pacemaker or at runtime by sending SIGUSR1. Eg.

killall -USR1 crmd

When enabled you’ll see a log such as:

crmd[1811]:   notice: crm_enable_blackbox: Initiated blackbox recorder: /var/lib/pacemaker/blackbox/crmd-1811

If a crash occurs, the blackbox will be available at that location. To extract the contents, pass it to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811

Which produces output like:

Dumping the contents of /var/lib/pacemaker/blackbox/crmd-1811
[debug] shm size:5242880; real_size:5242880; rb->word_size:1310720
[debug] read total of: 5242892
Ringbuffer:
 ->NORMAL
 ->write_pt [5588]
 ->read_pt [0]
 ->size [1310720 words]
 =>free [5220524 bytes]
 =>used [22352 bytes]
...
trace   May 19 23:20:55 gio_read_socket(368):0: 0x11ab920.5 1 (ref=1)
trace   May 19 23:20:55 pcmk_ipc_accept(458):0: Connection 0x11aee00
info    May 19 23:20:55 crm_client_new(302):0: Connecting 0x11aee00 for uid=0 gid=0 pid=24425 id=0e943a2a-dd64-49bc-b9d5-10fa6c6cb1bd
debug   May 19 23:20:55 handle_new_connection(465):2147483648: IPC credentials authenticated (24414-24425-14)
...
[debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header

When an ERROR occurs you’ll also see the function and line number that produced it such as:

crmd[1811]: Problem detected at child_death_dispatch:872 (mainloop.c), please see /var/lib/pacemaker/blackbox/crmd-1811.1 for additional details
crmd[1811]: Problem detected at main:94 (crmd.c), please see /var/lib/pacemaker/blackbox/crmd-1811.2 for additional details

Again, simply pass the files to qb-blackbox to extract and query the contents.

Note that a counter is added to the end so as to avoid name collisions.

Diving into files and functions

In case you have not already guessed, all logs include the name of the function that generated them. So:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

came from the function crm_update_peer_state().

To obtain more detail from that or any other function, you can set PCMK_trace_functions in /etc/sysconfig/pacemaker to a comma separated list of function names. Eg.

PCMK_trace_functions=crm_update_peer_state,run_graph

For a bigger stick, you may also activate trace logging for all the functions in a particular source file or files by setting PCMK_trace_files as well.

PCMK_trace_files=cluster.c,election.c

These additional logs are sent to the cluster log file. Note that enabling tracing options also alters the output format.

Instead of:

crmd:  notice: crm_cluster_connect: 	Connecting to cluster infrastructure: cman

the output includes file and line information:

crmd: (   cluster.c:215   )  notice: crm_cluster_connect: 	Connecting to cluster infrastructure: cman

But wait there’s still more

Still need more detail? You’re in luck! The blackbox can be dumped at any time, not just when an error occurs.

First, make sure the blackbox is active (we'll assume it's the crmd that needs to be debugged):

killall -USR1 crmd

Next, discard any previous contents by dumping them to disk

killall -TRAP crmd

Now cause whatever condition you're trying to debug, and send -TRAP again when you're ready to see the result.

killall -TRAP crmd

You can now look for the result in syslog:

grep -e crm_write_blackbox: /var/log/messages

This will include a filename containing the trace logging:

crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.1 for contents
crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.2 for contents

To extract the trace logging for our test, pass the most recent file to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811.2

At this point you’ll probably want to use grep :)
