Updating a live OpenStack cloud

jistr, marios, gfidente

Talk Overview

  • Deployment structure recap
  • Major version upgrades vs. minor version updates
  • Updates in TripleO
  • What could possibly go wrong?!
  • Demo video

Deployment Structure

Upgrades (major version)

  • DB schemas and AMQP messaging (RPC) can change
  • Upgrade controllers in parallel
    • DB schema must match service expectations
    • Cloud management downtime
  • Upgrade computes serially (in batches)
    • No direct DB connection, RPC version pinning (sketch below)
  • Service-by-service upgrade is also possible
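
The RPC pinning mentioned above, as a minimal sketch assuming crudini and RHEL-style service names; the pin goes on the nodes whose services send RPC to nova-compute:

  # Pin RPC sent to nova-compute at the old version so upgraded services
  # can keep talking to not-yet-upgraded computes ("kilo" is illustrative)
  crudini --set /etc/nova/nova.conf upgrade_levels compute kilo
  systemctl restart openstack-nova-conductor

  # Once every compute runs the new code, drop the pin and restart again
  crudini --del /etc/nova/nova.conf upgrade_levels compute
  systemctl restart openstack-nova-conductor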

Updates (minor version)

  • DB schemas do not change; AMQP messaging does not change or is backwards compatible
  • The challenge is meeting uptime expectations
  • Rolling updates on the controllers (node-by-node)
    • Some services rely on Pacemaker to leave and rejoin the cluster cleanly
  • “Normal” package update on the computes

The Setup

Start Update

  • From the Manager node
    • openstack overcloud update stack overcloud -i --templates -e overcloud-resource-registry-puppet.yaml -e …
  • Sets a pre-update hook on each node (a native Heat feature).
  • Also sets the UpdateIdentifier. The update needs to proceed one node at a time (see the sketch below).
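
What the breakpoints look like from the operator's side; a sketch assuming a python-heatclient new enough to ship the generic hook commands (the resource path is illustrative, and the -i flag drives the clearing interactively for you):

  # Which resources are currently held at their pre-update hook?
  openstack stack hook poll overcloud --nested-depth 5

  # Release one node so the update can proceed onto it
  # (resource path is illustrative)
  openstack stack hook clear --pre-update overcloud Controller/0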

Compute Node

  • Simplest case (no Pacemaker, fewer services: nova-compute and the agents)
  • Update the OpenStack Puppet modules, then a plain package update (sketched below)
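
A minimal sketch of the node-local part on a compute; the service restart is shown for illustration:

  # No cluster choreography needed here
  yum -y update
  systemctl restart openstack-nova-compute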

Controller Node

  • One controller at a time (the breakpoints exist mainly for this).
  • Matching pre/post-update environments (for example, the Neutron Pacemaker constraints).
  • Stop the cluster on that controller, maintenance mode
  • Yum update
  • Rejoin the cluster (see the sketch below)
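
The per-controller sequence as a sketch, assuming pcs manages the cluster; the final status check is our addition:

  pcs cluster stop            # take this node's resources out of the cluster
  yum -y update               # pull in the new packages
  pcs cluster start           # rejoin the cluster
  pcs status                  # confirm resources recover before the next node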

yum_update.sh

  • Delivered as the config property of a Heat SoftwareDeployment.
  • Checks the update_identifier first
    • echo "Not running due to unset update_identifier"
  • Contains the update logic: Pacemaker handling and the yum update (the guard is sketched below)
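
The guard quoted above, roughly as it sits at the top of the script; update_identifier arrives as an input of the SoftwareDeployment:

  #!/bin/bash
  # Do nothing unless an update was actually requested via the UpdateIdentifier
  if [[ -z "$update_identifier" ]]; then
      echo "Not running due to unset update_identifier"
      exit 0
  fi
  # ...the Pacemaker handling and the actual "yum update" follow here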

What could possibly go wrong?!

Nothing, in four different ways!

  • Tooling Bugs (Heat network_id/network mapping)
  • Workflow Bugs (Pacemaker constraints)
  • Subtle Bugs (Neutron L3 HA)
  • Evil Bugs (UpdateIdentifier)

What could possibly go wrong?!

We don't want a new network, we're just calling it by name

network_id <-> network (name) mapping

Heat developers helped; we had to backport the fix into the stable branch

https://bugzilla.redhat.com/show_bug.cgi?id=1291845

(Tooling)

What could possibly go wrong?!

How come Pacemaker is failing to fail over the services?

Services fail to stop

HA people helped; we needed to fix the constraints to shut down cleanly!

(Workflow)

What could possibly go wrong?!

Why is that IP still on the node we supposedly killed?

Neutron agents were down but keepalived was still running

Neutron developers helped; a newer version wasn't affected

https://bugzilla.redhat.com/show_bug.cgi?id=1175251

(Subtle)

What could possibly go wrong?!

UPDATE_COMPLETE, yet scaling fails!

Luckily there are people testing all of this

https://bugzilla.redhat.com/show_bug.cgi?id=1290572

(Evil)

Demo Video

Help us make it better

  • Puppet modules distributed via the Undercloud
  • Update steps tied to the services
  • A Pacemaker resource provider for Puppet?
  • More?