One of the hot topics for OpenStack deployments is composable roles - the ability to mix-and-match which services live on which nodes.
This is mostly a solved problem for services not managed by the cluster, but what of the services still managed by the cluster?
Naturally we want to be able to add more capacity easily
And have the option to take it away again if it is no longer necessary
Role re-assignment post-deployment
Ideally the task of taking capacity from one service and giving it to another would be a core capability and not require a node be nuked from orbit first.
Flexible role assignments
Ideally, the architecture would not impose limitations on how roles are assigned.
By allowing roles to be assigned on an ad-hoc basis, we can allow arrangements that avoid single-points-of-failure (SPoF) and potentially take better advantage of the available hardware. For example:
- node 1: galera and rabbit
- node 2: galera and mongodb
- node 3: rabbit and mongodb
This also has implications when just one of the roles needs to be scaled up (or down). If roles become inextricably linked at install time, this requires every service in the group to scale identically - potentially resulting in higher hardware costs when there are services that cannot do so and must be separated.
Instead, even if two services (lets say
rabbit) are originally assigned to the same set of nodes, this should imply nothing about how either of them can or should be scaled in the future.
We want the ability to deploy a new
rabbitserver without requiring it host
This need only apply to non-OpenStack services, however it could be extended to those as well if you were unconvinced by my other recent proposal.
At Red Hat, the list of services affected would be:
- Any VIPs
- Mongo DB
- Rabbit MQ
Additionally, if the deployment has been configured to provide Highly Available Instances:
In essance, I propose that there be a single native cluster, consisting of between 3 (the minimum sane cluster size) and 16 (roughly Corosync’s out-of-the-box limit) nodes, augmented by a collection of zero-or-more remote nodes.
Both native and remote nodes will have roles assigned to them, allowing Pacemaker to automagically move resources to the right location based on the roles.
Note that all nodes, both native and remote, can have zero-or-more roles and it is also possible to have a mixture of native and remote nodes assigned to the same role.
This will allow us, by changing a few flags (and potentially adding extra remote nodes to the cluster) go from a fully collapsed deployment to a fully segregated one - and not only at install time.
If installers wish to support it1, this architecture can cope with roles being split out (or recombined) after deployment and of course the cluster wont need to be taken down and resources will move as appropriate.
Although there is no hard requirement that anything except the fencing devices run on the native nodes, best practice would arguably dictate that HAProxy and the VIPs be located there unless an external load balancer is in use.
The purpose of this would be to limit the impact of a hypothetical pacemaker-remote bug. Should such a bug exist, by virtue of being the gateway to all the other APIs, HAProxy and the VIPs are the elements one would least want to be affected.
Some installers may even choose to enforce this in the configuration, but “by convention” is probably sufficient.
Instance attributes can be created with commands of the form:
pcs property set --node controller-0 proxy-role=true
Note that this differs from the
osprole=compute/controllerscheme used in the Highly Available Instances instructions. That arrangement wouldn’t work here as each node may have serveral roles assigned to it.
Under the covers, the result in Pacemaker’s configuration would look something like:
<cib ...> <configuration> <nodes> <node id="1" uname="controller-0"> <instance_attributes id="controller-0-attributes"> <nvpair id="controller-0-proxy-role" name="proxy-role" value="true"/> ...
For example, we would use the following for HAProxy:
pcs constraint location haproxy-clone rule score=0 proxy-role eq true
which would create the following under the covers:
<rsc_location id="location-haproxy" rsc="haproxy-clone"> <rule id="location-haproxy-rule" score="0"> <expression id="location-haproxy-rule-expr" attribute="proxy-role" operation="eq" value="true"/> </rule> </rsc_location>
Any node, native or remote, not meeting the criteria is automatically eliminated as a possible host for the service.
Pacemaker also defines some node attributes automatically based on a
node’s name and type. These are also available for use in
constraints. This allows us, for example, to force a resource such as
nova-evacuate to run on a “real” cluster node with the command:
pcs constraint location nova-evacuate rule score=0 "#kind" ne remote
For deployments based on Pacemaker 1.1.15 or later, we can also simplify the configuration by using pattern matching in our constraints.
Restricting all the VIPs to nodes with the proxy role:
<rsc_location id="location-haproxy-ips" resource-discovery="exclusive" rsc-pattern="^(ip-.*)"/>
nova-computeto compute nodes (assuming a standardized naming convention is used):
<rsc_location id="location-nova-compute-clone" resource-discovery="exclusive" rsc-pattern="nova-compute-(.*)"/>
This is what a fully active cluster would look like:
9 nodes configured 87 resources configured Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] RemoteOnline: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 rabbitmq-extra-0 storage-0 storage-1 ] ip-172.16.3.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.0.2.17 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-172.16.2.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 ip-172.16.2.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-172.16.1.4 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 Clone Set: haproxy-clone [haproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: galera-master [galera] Slaves: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] ip-18.104.22.168 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Master/Slave Set: redis-master [redis] Slaves: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: mongod-clone [mongod] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 rabbitmq-extra-0 ] Clone Set: memcached-clone [memcached] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] openstack-cinder-volume (systemd:openstack-cinder-volume): Started storage-0 Clone Set: nova-compute-clone [nova-compute] Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ] Clone Set: nova-compute-wait-clone [nova-compute-wait] Started: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ] nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0 fence-nova (stonith:fence_compute): Started overcloud-controller-0 storage-0 (ocf::pacemaker:remote): Started overcloud-controller-1 storage-1 (ocf::pacemaker:remote): Started overcloud-controller-2 overcloud-compute-0 (ocf::pacemaker:remote): Started overcloud-controller-0 overcloud-compute-1 (ocf::pacemaker:remote): Started overcloud-controller-1 overcloud-compute-2 (ocf::pacemaker:remote): Started overcloud-controller-2 rabbitmq-extra-0 (ocf::pacemaker:remote): Started overcloud-controller-0
A small wish, but it would be nice if installers used meaningful names for the VIPs instead the underlying IP addresses they manage.
One reason they may not do so on day one, is the careful co-ordination that some services can require when there is no overlap between the old and new sets of nodes assigned to a given role. Galera is one specific case that comes to mind. ↩