<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>marios log - tripleo</title>
<description>Posts categorized as 'tripleo'</description>
<link>http://mariosandreou.com</link>

<item>
<title>My summary of the OpenStack Stein Infrastructure Summit and Train PTG aka Denver III</title>
<description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;blog_title&quot;&gt;

My summary of the OpenStack Stein Infrastructure Summit and Train PTG aka Denver III

&lt;/div&gt;

&lt;p&gt;This was the first re-combined event, with both the summit and the project teams
gathering happening in the same week, and the third consecutive year that
OpenStack has descended on Denver. This was also the first Open Infrastructure
summit - the foundation is expanding to allow non-OpenStack projects to
use the Open Infrastructure foundation for housing their projects.&lt;/p&gt;

&lt;p&gt;This is a brief summary with pointers of the sessions or rooms I attended in
the order they happened. The full &lt;a href=&quot;https://www.openstack.org/summit/denver-2019/summit-schedule/#day=2019-04-29&quot;&gt;summit schedule is here&lt;/a&gt; and the &lt;a href=&quot;https://www.openstack.org/ptg/#tab_schedule&quot;&gt;PTG schedule is here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is a list of some of the etherpads used in various summit sessions in
&lt;a href=&quot;https://wiki.openstack.org/wiki/Forum/Denver2019&quot;&gt;this wiki page&lt;/a&gt; thanks
to T. Carrez who let me take a photo of his screen for the URL :).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#photos&quot;&gt;Photos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#summit-day-one&quot;&gt;Open Infra Summit Day 1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#summit-day-two&quot;&gt;Open Infra Summit Day 2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#summit-day-three&quot;&gt;Open Infra Summit Day 3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#ptg-day-one&quot;&gt;Project Teams Gathering Day 1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#ptg-day-two&quot;&gt;Project Teams Gathering Day 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;photos&quot;&gt;Photos&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;photos&quot;&gt; &lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/images/DenverSummit/market-place.jpg&quot;&gt;The Marketplace&lt;/a&gt; - [warn] panoramic
large-ish file ~15MB&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/images/DenverSummit/big-blue-bear.jpg&quot;&gt;big-blue-bear&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/images/DenverSummit/snow.jpg&quot;&gt;SnowpenStack&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/images/DenverSummit/downtown.jpg&quot;&gt;Downtown&lt;/a&gt; - [warn] panoramic large-ish
~ 16MB&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;summit-day-one&quot;&gt;Summit Day One&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;summit-day-one&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My general impression was of slightly reduced attendance - though I should note
the last summit I attended was Austin, unless I’m mistaken, having attended the
PTGs but not the summits since. There were roughly 2000 summit attendees according
to one of the keynote speakers. Having said that, J. Bryce gave some interesting
numbers in his keynote, highlighting that Stein is the &lt;strong&gt;19th&lt;/strong&gt; on-time release
for OpenStack, and that OpenStack is still the 3rd largest open source project in the
world, with 105,000 members across 180 countries and 65,000 merged changes
in the last year.&lt;/p&gt;

&lt;p&gt;It was interesting to hear from Deutsche Telekom - especially that they are
using and contributing to Zuul upstream and that they rely on CI for their
ever-growing deployments. One of the numbers given is that they are adding capacity
at 400 servers &lt;strong&gt;per week&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Some other interesting points from the keynotes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the increasing use of Ironic as a standalone service outside of OpenStack
deployments for managing baremetal infrastructure (further highlighting
the OpenInfra vs OpenStack-only theme),&lt;/li&gt;
  &lt;li&gt;the increasing adoption of Zuul for CI, and that it is being adopted as a
foundation project,&lt;/li&gt;
  &lt;li&gt;Ericsson brought a 5G network to summit - apparently the first 5G network
in the United States - which was available at their booth and which uses
OpenStack for its infrastructure. There was also a demonstration of the
latency differences between 3/4/5G networks involving VR headsets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides the keynotes I attended the OpenStack Ansible project update - there
was a shout-out for the TripleO team by Mohammed Naser, who highlighted the
excellent cross-team collaboration story between the TripleO tempest team and
the Ansible project. Finally I attended a talk called “multicloud ci/cd with
openstack and kubernetes” where the presenter set up a simple ‘hello world’
application across a number of different geographic locations and showed how
CI/CD meant he could make a simple change to the app and have it tested and then
deployed across the different clouds running that application.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;summit-day-two&quot;&gt;Summit Day Two&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;summit-day-two&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I attended the Zuul project BOF (‘birds of a feather’) where it was interesting
to hear about various folks that are running Zuul internally - some on older
versions and wanting to upgrade.&lt;/p&gt;

&lt;p&gt;I also caught the “Deployment Tools: defined common capabilities” session, where folks
who work on or are knowledgeable about the various OpenStack deployment tools,
including TripleO, got together and used &lt;a href=&quot;https://etherpad.openstack.org/p/DEN-deployment-tools-capabilities&quot;&gt;this etherpad&lt;/a&gt;
to try to compile a list of ‘tags’ which the various tools can claim to
implement. Examples include &lt;em&gt;containerized&lt;/em&gt; (i.e. support for containerized
deployments), version support, day-2 operations etc. The first step will be
to further distill and then socialize these ‘capabilities’ via the
&lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-discuss/&quot;&gt;openstack-discuss mailing list&lt;/a&gt;.&lt;/p&gt;
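&lt;p&gt;To make the ‘tags’ idea concrete, here is a purely illustrative sketch - the tool
names are real deployment projects but the tag assignments are hypothetical
placeholders, not the list the etherpad settled on:&lt;/p&gt;

```python
# Illustrative only: the 'capability tags' idea from the session as a tiny
# data structure. Tag assignments here are hypothetical placeholders, not
# the list compiled in the etherpad.
CAPABILITIES = {
    'tripleo': {'containerized', 'day2-operations', 'baremetal-provisioning'},
    'kolla-ansible': {'containerized', 'day2-operations'},
    'openstack-ansible': {'day2-operations'},
}

def tools_claiming(tag):
    """Return the deployment tools that claim a given capability tag."""
    return sorted(t for t, tags in CAPABILITIES.items() if tag in tags)

print(tools_claiming('containerized'))  # ['kolla-ansible', 'tripleo']
```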

&lt;p&gt;The Airship project update was the next session I went to and it was quite well
attended. In general it was interesting to hear about the similarities in
the concepts and approach taken in Airship compared to TripleO - especially
the concept of an ‘undercloud’, and that deployment is driven by yaml files
which define the deployment and service configuration values. In Airship these
yaml files are known as charts. The equivalent in TripleO is the &lt;a href=&quot;https://opendev.org/openstack/tripleo-heat-templates&quot;&gt;tripleo
  heat templates&lt;/a&gt; repo,
which holds the deployment and service configuration for TripleO deployments.&lt;/p&gt;

&lt;p&gt;Finally there was an interesting session on running Zuul on top of Kubernetes using
&lt;a href=&quot;https://opendev.org/openstack/openstack-helm&quot;&gt;Helm Charts&lt;/a&gt;. The presenters
said the charts used in their deployment would be made available upstream
“soon”. This then spawned a side conversation with weshay and sshnaidm about
using Kubernetes for the TripleO CI squad’s zuul-based reproducer. Prompted by
weshay, we held a micro-hackfest exploring the use of
&lt;a href=&quot;https://github.com/rancher/k3s&quot;&gt;k3s - 5 less than k8s&lt;/a&gt;. Taking the
docker-compose file we tried to convert it using the &lt;a href=&quot;https://github.com/kubernetes/kompose&quot;&gt;kompose tool&lt;/a&gt;. We got far enough to run the k3s service but stumbled on the lack of
support for dependencies in kompose. We could investigate writing some Helm
charts to do this, but it is still TBD whether k3s is a direction we will adopt for
the reproducer this cycle or whether we will keep podman, which replaced docker
(sshnaidm++ was working on this).&lt;/p&gt;
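&lt;p&gt;For reference, the rough shape of what we tried - a sketch only: the compose
file name and output directory are illustrative, and the install line is the
k3s project’s upstream bootstrap script:&lt;/p&gt;

```shell
# Install k3s (lightweight single-binary Kubernetes) via its bootstrap script
curl -sfL https://get.k3s.io | sh -

# Attempt to convert the reproducer's docker-compose file into Kubernetes
# manifests; file name and output directory here are illustrative
kompose convert -f docker-compose.yml -o k8s-manifests/

# Apply whatever kompose generated against the local k3s cluster
k3s kubectl apply -f k8s-manifests/
```

&lt;p&gt;This is roughly where we stumbled: kompose does not translate the compose
file’s service dependencies (the depends_on ordering), so the converted
manifests start everything at once.&lt;/p&gt;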

&lt;hr /&gt;

&lt;h3 id=&quot;summit-day-three&quot;&gt;Summit Day Three&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;summit-day-three&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On Wednesday the first session I attended was a comparison of TripleO, Kolla
and Airship as deployment tools. The common requirement was support for
container-based deployments. You can see &lt;a href=&quot;https://www.openstack.org/summit/denver-2019/summit-schedule/events/23180/choosing-the-containerized-cloud-provisioning-tool-that-best-suits-your-need&quot;&gt;event details here&lt;/a&gt; - apparently there should be a recording, though this isn’t available
at the time of writing. Again it was interesting to hear about the similarities
between the Airship and TripleO approaches to config management, including
the ‘undercloud’ management node.&lt;/p&gt;

&lt;p&gt;I then went to the very well attended and well led (by slagle and
emilienm) &lt;a href=&quot;https://www.openstack.org/summit/denver-2019/summit-schedule/events/23738/tripleo-project-update&quot;&gt;TripleO project update&lt;/a&gt;. Again there should be a recording available at
some point via that link, but it isn’t there at present. Besides a
general Stein update, slagle introduced scale (thousands of nodes, not
hundreds) and edge as main use cases for these ‘thousand node
deployments’. These concepts were then further discussed in subsequent TripleO
sessions noted in the following paragraphs.&lt;/p&gt;

&lt;p&gt;The first of these TripleO sessions was the forum devoted to scale,
led by slagle - the &lt;a href=&quot;https://etherpad.openstack.org/p/DEN-tripleo-forum-scale&quot;&gt;etherpad is here&lt;/a&gt;.
There is a good list of the identified and discussed “bottleneck services” on
the undercloud - including Heat, Ironic, Mistral&amp;amp;Zaqar, Neutron, Keystone and
Ansible - and the technical challenges around possibly removing them. This was
further explored during the PTG.&lt;/p&gt;

&lt;p&gt;Finally I was at the &lt;a href=&quot;https://www.openstack.org/summit/denver-2019/summit-schedule/events/23727/openstack-infrastructure-project-project-update&quot;&gt;Open Infrastructure project update&lt;/a&gt;
given by C. Boylan, which highlighted the move to opendev.org, and then the &lt;a href=&quot;https://www.openstack.org/summit/denver-2019/summit-schedule/events/23726/zuul-project-update&quot;&gt;zuul
project update&lt;/a&gt; by J. Blair.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;project-teams-gathering-day-1&quot;&gt;Project Teams Gathering Day 1&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;ptg-day-one&quot;&gt; &lt;/a&gt;
I spent the PTG in the TripleO room &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-train&quot;&gt;Room etherpad&lt;/a&gt;
and &lt;a href=&quot;/images/DenverSummit/tripleo-ptg-1.jpg&quot;&gt;picture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The etherpad contains notes from the various discussions but I highlight some of
the main themes here. As usual there was a brief retrospective on the Stein
cycle, some of which was captured in &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-stein-retrospective&quot;&gt;this etherpad&lt;/a&gt;. This was followed by an operator feedback session - one of the
main issues raised was ‘needs more scale’.&lt;/p&gt;

&lt;p&gt;Slagle led the discussion on Edge, which introduced and discussed the
requirements for the Distributed Compute Node architecture, where we will have
a central deployment for our controllers and compute nodes spread across a
number of edge locations. There was participation here from both the Edge
working group and the Ironic project.&lt;/p&gt;

&lt;p&gt;Then fultonj and gfidente led the storage squad update (notes on the main
&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-train&quot;&gt;tripleo room etherpad&lt;/a&gt;).
Among other things, there was discussion around ceph deployments ‘at the edge’
and the challenges there, as well as the triggering of tripleo jobs from ceph-ansible
pull requests.&lt;/p&gt;

&lt;p&gt;Finally emilien led the Deployment squad topics (notes on the &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-train&quot;&gt;tripleo room etherpad&lt;/a&gt;). In particular there was further discussion
around making the undercloud ‘lighter’ by considering which services we might
remove. For this cycle it is likely that we keep Mistral, albeit changing the
way we use it so that it &lt;em&gt;only&lt;/em&gt; executes ansible, and keep Neutron and
os-net-config as is, but make the network configuration be applied more
directly by ansible. There was also discussion around the use of Nova and
whether we can just use Ironic directly. There will be exploration around the
use of &lt;a href=&quot;https://github.com/openstack/metalsmith&quot;&gt;metalsmith&lt;/a&gt; to provide the
information about the nodes in our deployment that we would lose by removing Nova.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;project-teams-gathering-day-2&quot;&gt;Project Teams Gathering Day 2&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;ptg-day-two&quot;&gt; &lt;/a&gt;
&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-train&quot;&gt;Room etherpad&lt;/a&gt;
and &lt;a href=&quot;/images/DenverSummit/tripleo-ptg-2.jpg&quot;&gt;day two picture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Slagle led the first session, which revisited the “thousand node scale” topic
introduced in the tripleo operator forum and captured in the
&lt;a href=&quot;https://etherpad.openstack.org/p/DEN-tripleo-forum-scale&quot;&gt;tripleo-forum-scale etherpad&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The HA session was introduced by bandini and dciabrin (see the main room etherpad
for notes). Some of the topics raised here were: the need for a new workflow
for minor deployment configuration changes, such as changing a service password;
how we can improve the issue posed by a partial/temporary disconnection of one
of the cluster/controlplane nodes; and whether pacemaker should be the default
in upstream deployments (a topic revisited at most summits…). There was no strong
push-back on the latter, however it is still to be proposed as a
gerrit change so remains TBD.&lt;/p&gt;

&lt;p&gt;The upgrades squad was represented by chem, jfrancoa and ccamacho. There are
notes in &lt;a href=&quot;https://etherpad.openstack.org/p/upgrades_denver_ptg&quot;&gt;this upgrades session etherpad&lt;/a&gt;.
Amongst other topics there was discussion around ‘FFWD II’, which is Queens to
Train (and which includes the upgrade from Centos7 to Centos8), as well as a
discussion around a completely fresh approach to the upgrades workflow that
uses a separate set of nodes for the controlplane. The idea is to replicate
the existing controlplane onto 3 new nodes but deploying the target upgrade
version. This could mean more than 3 nodes if you have distributed
the controlplane services across a number of dedicated nodes, like Networker
for example. Once the ‘new’ controlplane is ready you would migrate the data
from your old controlplane, and at that point there would be a controlplane
outage. However, since the target controlplane is ready to go, the hope is that
the switch-over from old to new controlplane will be a relatively painless
process once the details are worked out this cycle. For the rest of the
nodes (Compute etc.) the existing workflow would be used, with the tripleoclient
running the relevant ansible playbooks to deliver upgrades on a per-node basis.&lt;/p&gt;

&lt;p&gt;The TripleO CI squad was represented by weshay, quiquell, sshnaidm and myself.
The session was introduced by weshay and we had a good discussion lasting
well over an hour about numerous topics (captured in the &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-train&quot;&gt;main tripleo room etherpad&lt;/a&gt;), including: the performance gains from moving
to standalone jobs; plans around the standalone-upgrade job, in particular that
for stable/stein it should be green and voting now (&lt;a href=&quot;https://tree.taiga.io/project/tripleo-ci-board/us/959&quot;&gt;taiga story in progress&lt;/a&gt;); the work around rhel7/8 on baremetal and the
software factory jobs; and using browbeat to monitor changes to the deployment time
and possibly alert or even block if a change is significant.&lt;/p&gt;

&lt;p&gt;Finally weshay showed off the shiny new zuul-based reproducer (kudos quiquell
and sshnaidm). In short, you can find the &lt;a href=&quot;http://logs.openstack.org/98/656398/5/check/tripleo-ci-fedora-28-standalone/8e43e29/logs/reproducer-quickstart/&quot;&gt;reproducer-quickstart&lt;/a&gt;
in any TripleO ci job and follow the related &lt;a href=&quot;http://logs.openstack.org/98/656398/5/check/tripleo-ci-fedora-28-standalone/8e43e29/logs/README-reproducer.html&quot;&gt;reproducer README&lt;/a&gt;
to have your own zuul and gerrit running the given job using either
libvirt or ovb (i.e. on rdocloud). This is the first time the new reproducer
was introduced to the wider team, and whilst we (the TripleO CI squad) would probably
still call this a beta, we think it’s ready enough for any early adopters who
might find it interesting and useful to try it out - the CI squad
would certainly appreciate any feedback.&lt;/p&gt;
</description>
<published>2019-05-06 00:00:00 +0300</published>
<link>http://mariosandreou.com/tripleo/2019/05/06/open-infrastructure-summit-denver-3.html</link>
</item>

<item>
<title>My summary of the OpenStack Stein PTG in Denver</title>
<description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;blog_title&quot;&gt;

My summary of the OpenStack Stein PTG in Denver

&lt;/div&gt;

&lt;p&gt;After only 3 take-offs/landings I was very happy to participate in the Stein
PTG in Denver.
This is a brief summary, with pointers, of the sessions or rooms I attended in
the order they happened (&lt;a href=&quot;https://web14.openstack.org/assets/ptg/Denver-map.pdf&quot;&gt;Stein PTG Schedule&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#upgrades-ci-standalone&quot;&gt;Upgrades ci with the stand-alone deployment&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#upgrades-sig&quot;&gt;Upgrades SIG&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#edge-room&quot;&gt;Edge room&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#tripleo&quot;&gt;TripleO room&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;upgrades-ci-with-the-stand-alone-deployment&quot;&gt;Upgrades CI with the stand-alone deployment&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;upgrades-ci-standalone&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We had a productive impromptu round table (weshay++) in one of the empty rooms
with the tripleo ci folks present (weshay, panda, sshnaidm, arxcruz, marios),
the tripleo upgrades folks present (chem and holser), as well as emeritus PTL mwahaha,
around the stand-alone deployment and how we can use it for upgrades ci. We introduced
the proposed &lt;a href=&quot;https://review.openstack.org/#/c/579854/&quot;&gt;spec&lt;/a&gt; and one of
the main topics discussed was whether, ultimately, it is worth solving all of these
subproblems only to end up with some approximation of the upgrade.&lt;/p&gt;

&lt;p&gt;The consensus was yes, since we can have two types of upgrades job: one using the
stand-alone to ci the actual tasks, i.e. the upgrade_tasks and deployment_tasks
for each service in the tripleo-heat-templates, and another job (the current
job, which will be adapted) to ci the upgrades workflow - tripleoclient, mistral
workflows etc. There was general agreement on this approach between the upgrades
and ci representatives, so that we could try to sell it to the wider team in
the tripleo room on Wednesday together.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;upgrades-special-interest-group&quot;&gt;Upgrades Special Interest Group&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;upgrades-sig&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/upgrade-sig-ptg-stein&quot;&gt;Room etherpad&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Monday afternoon was spent in the upgrades SIG room. There was first discussion
of the placement api extraction and how this would have to be dealt with during
the upgrade, with a solution sketched out around the db migrations required.&lt;/p&gt;

&lt;p&gt;This led into discussion around pre-upgrade checks that could deal with things
like db migrations (or just check whether something is missing and fail accordingly
before the upgrade). As I was reminded during the lunchtime presentations, pre-upgrade
checks are one of the Stein community goals (together with python-3).
The idea is that each service would own a set of checks that should be performed
before an upgrade is run, and that they would be invoked via the openstack client
(something along the lines of ‘openstack pre-upgrade-check nova’). I believe there
is already some implementation (from the nova team) but I don’t readily have
the details.&lt;/p&gt;
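&lt;p&gt;As far as I know, the nova implementation referred to is the ‘nova-status
upgrade check’ command rather than an openstack subcommand. A minimal sketch of
how such a check could gate an upgrade script - the echo messages are mine:&lt;/p&gt;

```shell
# Run nova's pre-upgrade checks on a node with nova installed; the command
# exits zero only when all checks pass, so it can gate an upgrade script.
if nova-status upgrade check; then
    echo "pre-upgrade checks passed - safe to proceed with the upgrade"
else
    echo "pre-upgrade checks reported issues - aborting"
    exit 1
fi
```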

&lt;p&gt;There was then a productive discussion about the purpose and direction of the
upgrades SIG. One of the points raised was that the SIG should not be just
about the fast forward upgrade, even though that has been a main focus until
now. The pre-upgrade checks are a good example of this, and the SIG will try
to continue to promote these, with adoption by all the OpenStack services.
On that note I proposed that whilst the services themselves will own the
service-specific pre-upgrade checks, it is the deployment projects which will own
the pre-upgrade infrastructure checks, such as a healthy cluster/database or
responding service endpoints.&lt;/p&gt;

&lt;p&gt;There was of course discussion around the fast forward upgrade with
status updates from the deployment projects present (kolla-ansible, TripleO,
charms, OSA). TripleO is the only project with an implemented workflow at present.
Finally there was a discussion about whether we’re doing better in terms of
operator experience for upgrades in general and how we can continue to improve
(e.g. rolling upgrades was one of the discussed points here).&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;edge-room&quot;&gt;Edge room&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;edge-room&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/EdgeComputingGroupPTG4&quot;&gt;Room etherpad&lt;/a&gt;
 &lt;a href=&quot;https://etherpad.openstack.org/p/edge-requirements-stein-ptg&quot;&gt;Room etherpad2&lt;/a&gt;
 &lt;a href=&quot;https://wiki.openstack.org/wiki/Edge_Computing_Group/Use_Cases&quot;&gt;Use cases&lt;/a&gt;
 &lt;a href=&quot;https://www.openstack.org/edge-computing/cloud-edge-computing-beyond-the-data-center?lang=en_US&quot;&gt;Edge primer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was only in attendance for the first part of this session, which was about
understanding the requirements (and hopefully continuing to find the common
ground). The room started with a review of the various use cases proposed in
Dublin and of any work on them since then. One of the main points raised by shardy is
that whilst in TripleO we have a number of exploratory efforts ongoing (like
split controlplane for example), it would be good to have a specific architecture
to aim for, and that is missing currently. It was agreed that the existing use
cases will be extended to include the proposed architecture and that these
can serve as a starting point for anyone looking to deploy with edge locations.&lt;/p&gt;

&lt;p&gt;There are pointers to the rest of the edge sessions in the etherpad above.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;tripleo-room&quot;&gt;TripleO room&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;tripleo&quot;&gt; &lt;/a&gt;
&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-stein&quot;&gt;Room etherpad&lt;/a&gt;
&lt;a href=&quot;https://www.dropbox.com/sh/2pmvfkstudih2wf/AADlnSNAHoJcNToiJET6buvPa/TripleO?dl=0&amp;amp;preview=DSC_4440.JPG&amp;amp;subfolder_nav_tracking=1&quot;&gt;Team picture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The order of sessions was slightly revised from that listed in the etherpad
above because the East coast storms forced folks to change travel plans. The
following order is to the best of my recollection ;)&lt;/p&gt;

&lt;h4 id=&quot;tripleo-and-edge-cloud-deployments&quot;&gt;TripleO and Edge cloud deployments&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-stein-edge&quot;&gt;Session etherpad&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There was first a summary from the Edge room by shardy and then tripleo-specific
discussion around the current work (split controlplane). There was
some discussion around possibly using/repurposing “the multinode job” for
multiple stacks to simulate the Edge locations in ci. There was also discussion
around the networking aspects (though this will depend on the architecture, which we
don’t yet have fully targeted) with respect to the tripleo deployment networks
(controlplane/internalapi etc.) in an edge deployment. Finally there was
consideration of the work needed in tripleo-common and the mistral workflows
needed for the split controlplane deployment.&lt;/p&gt;

&lt;h4 id=&quot;os--platform&quot;&gt;OS / Platform&lt;/h4&gt;
&lt;p&gt;(tracked on main tripleo etherpad linked above)&lt;/p&gt;

&lt;p&gt;The main items discussed here were Python 3 support, removing instack-undercloud
and “that upgrade” to Centos8 on Stein.&lt;/p&gt;

&lt;p&gt;For Python3 the discussion included the fact that in TripleO we are bound by whatever python the deployed services support (as well as by what the upstream distribution will be, i.e. Centos 7/8, and which python ships where).&lt;/p&gt;

&lt;p&gt;For the Centos8/Stein upgrade the upgrades folks chem and holser led the
discussion, outlining how we will need a completely new workflow, which may be
dictated in large part by how Centos8 is delivered. One of the approaches
discussed here was to use a completely external/distinct upgrade workflow for
the OS, versus the TripleO-driven OpenStack upgrade itself.
We got into more detail about this during the Baremetal session (see below).&lt;/p&gt;

&lt;h4 id=&quot;tripleo-ci&quot;&gt;TripleO CI&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/ptg_denver_2018_tripleo_ci&quot;&gt;Session etherpad&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the first items raised was the stand-alone deployment and its use in ci.
The general proposal is that we should use a lot more of it! In particular to
replace existing jobs (like scenarios 1/2) with a standalone deployment.&lt;/p&gt;

&lt;p&gt;There was also discussion around the stand-alone for the upgrades ci as we
agreed with the upgrades folks on Monday (&lt;a href=&quot;https://review.openstack.org/#/c/579854/&quot;&gt;spec&lt;/a&gt;). The idea of service vs workflow upgrades was presented/solidified here
and I have just updated v8 of the spec accordingly to emphasise this point.&lt;/p&gt;

&lt;p&gt;Other points discussed in the CI session were testing ovb in infra and how we
could make jobs voting. The first move will be towards removing te-broker.&lt;/p&gt;

&lt;p&gt;There was also some consideration of the involvement of the ci team with other
squads and vice versa. There is a new column in our &lt;a href=&quot;https://trello.com/b/U1ITy0cu/tripleo-and-rdo-ci&quot;&gt;trello board&lt;/a&gt; called “requests from other DFG”.&lt;/p&gt;

&lt;p&gt;A further point raised was the reproducer scripts and future directions, including
running, and not only generating, these in ci. As a related side note, it sounds like
folks are using the reproducer and having some success.&lt;/p&gt;

&lt;h4 id=&quot;ansible--framework&quot;&gt;Ansible / Framework&lt;/h4&gt;
&lt;p&gt;(tracked on main tripleo etherpad linked above)&lt;/p&gt;

&lt;p&gt;In this session an overview of the work towards splitting out the ansible tasks
from the tripleo-heat-templates into re-usable roles was given by jillr and
slagle. More info and pointers are in the main tripleo etherpad above.&lt;/p&gt;

&lt;h4 id=&quot;security&quot;&gt;Security&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-security-ptg-stein&quot;&gt;Session etherpad&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There was discussion around the workflow to change overcloud/service passwords (this is
currently borked!). In particular there are problems around trying to CI this, since the
deploy takes too long to fit deploy + stack update for the passwords and
validation within the timeout. This could possibly be a 3rd-party (but then
non-voting) job for now. There was also an overview of work towards using Castellan
with TripleO, as well as discussion around selinux and locking down ssh.&lt;/p&gt;

&lt;h4 id=&quot;ux--ui&quot;&gt;UX / UI&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ui-ptg-stein&quot;&gt;Session etherpad&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CLI/UI feature parity is a main goal for this cycle (and probably beyond - it
seems there is a &lt;em&gt;lot&lt;/em&gt; to do), together with the plan management operations around this.
There was also good discussion around validations, with Tengu joining remotely via Bluejeans
to champion the effort of providing a nice way to run these via the tripleoclient.&lt;/p&gt;

&lt;h4 id=&quot;baremetal&quot;&gt;Baremetal&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-baremetal-ptg-stein&quot;&gt;Session etherpad&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This session started with discussion around metalsmith vs nova on the undercloud
and the upgrade path required to make this so. Also considered were
overcloud image customization and network automation
(ansible with the python-networking-ansible ml2 driver).&lt;/p&gt;

&lt;p&gt;Unexpectedly, the most interesting part of this session for me personally
was an impromptu design session started by ipilcher (prompted by a question
from phuongh, who I believe was new to the room). The session was about the
upgrade to Centos8, and three main approaches were explored: the “big bang”
(everything off, upgrade, everything back on), “some kind of rolling upgrade”, and
finally supporting either Centos8/Rocky or Centos7/Stein. The first and third
were deemed unworkable, but there was a very lively and well-engaged group
design session trying to navigate to a workable process for the ‘rolling upgrade’
aka split personality. Thanks to ipilcher (via bandini) the &lt;a href=&quot;https://drive.google.com/file/d/1IbcS7xcltxdsST1zJpOk7JTnurlC-8hL/view&quot;&gt;whiteboards looked like this&lt;/a&gt;.&lt;/p&gt;
</description>
<published>2018-09-18 00:00:00 +0300</published>
<link>http://mariosandreou.com/tripleo/2018/09/18/openstack-stein-ptg-denver.html</link>
</item>

<item>
<title>My summary of the OpenStack Rocky PTG in Dublin</title>
<description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;blog_title&quot;&gt;

My summary of the OpenStack Rocky PTG in Dublin

&lt;/div&gt;

&lt;p&gt;I was fortunate to be part of the OpenStack PTG in Dublin this February. Here
is a summary of the sessions I was able to attend. In the end the second day
of the TripleO meetup (&lt;a href=&quot;https://calendar.google.com/calendar/embed?src=tgpb5tv12mlu7kge5oqertje78%40group.calendar.google.com&amp;amp;ctz=Europe%2FDublin&quot;&gt;Thursday&lt;/a&gt;) was disrupted
as we had to leave the PTG venue. However we still managed to cover a wide
range of topics, some of which are summarized here.&lt;/p&gt;

&lt;p&gt;In short, and in the order attended:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#ffu&quot;&gt;FFU&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#release_cycles&quot;&gt;Release cycles&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#tripleo&quot;&gt;TripleO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;ffu&quot;&gt;FFU&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;ffu&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/ffu-ptg-rocky&quot;&gt;session etherpad&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;There are at least 5 different ways of doing FFU! There was a deployment projects
update (tripleo, openstack-ansible, kolla, charms)&lt;/li&gt;
  &lt;li&gt;Some folks are trying to do it manually (via operator feedback)&lt;/li&gt;
  &lt;li&gt;We will form a SIG (freenode #openstack-upgrades?). The first order of business
is documenting something - agreeing on best practices when doing FFU - with
meetings every 2 weeks?&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;release-cycles&quot;&gt;Release Cycles&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;release_cycles&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/release-cycles-ptg-rocky&quot;&gt;session etherpad&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Release cadence to stay at 6 months for now. Wide discussion about the potential impacts of a longer release cycle including maintenance of stable branches, deployment project/integration testing and d/stream product release cycles, marketing, documentation and others. In the end the merits of a frequent upstream release cycle won, or at least, there was no consensus about getting a longer cycle.&lt;/li&gt;
  &lt;li&gt;On the other hand operators still think upgrades suck and don’t want to do them every six months. FFU is being relied on as the least painful way to do upgrades at a longer cadence than the upstream 6 month development cycle, which for now will stay as is.&lt;/li&gt;
  &lt;li&gt;An ‘extended maintenance’ tag or policy will be introduced for projects that want to support stable branches longer term (LTS)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;tripleo&quot;&gt;TripleO&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;tripleo&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-rocky&quot;&gt;main tracking etherpad&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;retro session (emilienm) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-rocky-retro&quot;&gt;session etherpad&lt;/a&gt; some main points here are: ‘do more and better ci’, communicate more and review at least a bit outside your squad, improve bug triage, bring back deep dives.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ci session (weshay) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-ci&quot;&gt;session etherpad&lt;/a&gt; some main points here are: ‘we need more attention on promotion’, upcoming features like new jobs (containerized undercloud, upgrades jobs), more communication with squads (ongoing with upgrades for example, continuing to integrate the tripleo-upgrade role), and python3 testing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;config download (slagle) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-config-download&quot;&gt;session etherpad&lt;/a&gt; some main points are: Rocky will bring the config-download and ansible-playbook workflow for deployment of the environment, not just upgrades.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;all in one (dprince) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-rocky-all-in-one&quot;&gt;session etherpad&lt;/a&gt; some main points: using the containerized undercloud, have an ‘all-in-one’ role with only those services you need for your development at a given time. Some discussion around the potential CLI and pointers to more info at &lt;a href=&quot;https://review.openstack.org/#/c/547038/&quot;&gt;https://review.openstack.org/#/c/547038/&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;tripleo for generic provisioning (shadower) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-generic-provisioning&quot;&gt;session etherpad&lt;/a&gt; some main points are re-using the config download with external_deploy_tasks (idea is kubernetes or openshift deployed in a tripleo overcloud), some work still needed on the interfaces and discussion around ironic nodes and ansible.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;upgrades (marios o/, chem, jistr, lbezdick) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-rocky-upgrades&quot;&gt;session etherpad&lt;/a&gt;, some main points are: improvements in the ci and paying down tech debt (moving to using the tripleo-upgrade role now), containerized undercloud upgrade is coming in Rocky (emilien investigating), and Rocky will be a stabilization cycle with focus on improvements to the operator experience including validations, backup/restore, documentation and cli/ui. Integration with the UI might be considered during Rocky, to be revisited with the UI squad.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;containerized undercloud (dprince, emilienm) &lt;a href=&quot;https://trello.com/b/nmGSNPoQ/containerized-undercloud&quot;&gt;trello board&lt;/a&gt; dprince gave a demonstration of a running containerized undercloud environment and reviewed the current work from the trello board. It is running well today and we can consider switching to a containerized undercloud by default in Rocky.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;multiple ceph clusters (gfidente, johfulto), &lt;a href=&quot;https://blueprints.launchpad.net/tripleo/+spec/deploy-multiple-ceph-clusters&quot;&gt;linked blueprint&lt;/a&gt;, discussion around possible approaches including having multiple heat stacks. gfidente or jfulton are better sources of info if you are interested in this feature.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;workflows api (thrash) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-rocky-workflows-api&quot;&gt;session etherpad&lt;/a&gt;, some main points are: fixing inconsistencies in workflows (all should have an output value, and not try to get that from a zaqar message) and fixing usability, making a v2 tripleo mistral workflows api (tripleo-common) and re-organising the directories, moving existing things under v1, and looking into optimizing the calls to swift to avoid the large number of individual object GETs that currently happen.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;UI (jtomasek) &lt;a href=&quot;https://etherpad.openstack.org/p/tripleo-ptg-rocky-ui&quot;&gt;session etherpad&lt;/a&gt; some main points here are: adding UI support for the new composable networks configuration, integration with the coming config-download deployment, continuing to increase UI/CLI feature parity, allowing deployment of multiple plans, prototyping workflows to derive parameters for the operator based on input for specific scenarios (like HCI), and investigating root device hints support and setting physical_network on particular nodes. Florian led a side session in the Hotel on Thursday morning after we were kicked out of Croke Park stadium because &lt;a href=&quot;https://twitter.com/jistr/status/968976088486547457&quot;&gt;nodublin&lt;/a&gt;, where we discussed allowing operators to upload custom validations and prototyping the use of swift for storing validations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You might note that there are errors in the html validator for this post, but its late here and I’m in no mood to fight that right now. Yes, I know. cool story bro&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</description>
<published>2018-03-07 00:00:00 +0200</published>
<link>http://mariosandreou.com/tripleo/2018/03/07/openstack-rocky-ptg-dublin.html</link>
</item>

<item>
<title>Deploying a stable/mitaka OpenStack with tripleo-docs (and grep, git-blame and git-log).</title>
<description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;blog_title&quot;&gt;

Deploying a stable/mitaka OpenStack with tripleo-docs (and grep, git-blame and git-log).

&lt;/div&gt;

&lt;p&gt;This post is about how I was able to mostly successfully follow the tripleo-docs,
to deploy a stable/mitaka 3-control 1-compute development (virt) setup so I can
ultimately test upgrading this to Newton.&lt;/p&gt;

&lt;p&gt;I wasn’t sure there was something worth writing here, but then the same
tools I used to address the two issues I hit &lt;em&gt;deploying&lt;/em&gt; mitaka kept coming
up during the week when trying to &lt;em&gt;upgrade&lt;/em&gt; that environment. I’ve had to use
a lot of grep and git blame/log to get to the bottom of &lt;a href=&quot;https://bugs.launchpad.net/tripleo/+bug/1593182&quot;&gt;issues&lt;/a&gt;
I’m &lt;a href=&quot;https://bugs.launchpad.net/tripleo/+bug/1593736&quot;&gt;seeing&lt;/a&gt;
trying to upgrade the undercloud from stable/mitaka to latest/newton.&lt;/p&gt;

&lt;p&gt;The Newton upgrade work is ongoing and possibly worthy of a future post.&lt;/p&gt;

&lt;p&gt;I guess this post is mostly about git blame, and about munging the Change-Id
into a URL to get to the actual gerrit code review from an error/issue you are seeing.&lt;/p&gt;

&lt;p&gt;For the record I deployed stable/mitaka following the instructions at
&lt;a href=&quot;http://docs.openstack.org/developer/tripleo-docs/&quot;&gt;tripleo-docs&lt;/a&gt; and setting
stable/mitaka repos in appropriate places. For example, during the &lt;a href=&quot;http://docs.openstack.org/developer/tripleo-docs/environments/environments.html#virtual-environment&quot;&gt;virt-setup&lt;/a&gt; and the
&lt;a href=&quot;http://docs.openstack.org/developer/tripleo-docs/installation/installation.html&quot;&gt;undercloud installation&lt;/a&gt;
I followed the ‘Stable Branch’ admonition and enabled mitaka repos like:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo curl -o /etc/yum.repos.d/delorean-mitaka.repo http://trunk.rdoproject.org/centos7-mitaka/current/delorean.repo
sudo curl -o /etc/yum.repos.d/delorean-deps-mitaka.repo http://trunk.rdoproject.org/centos7-mitaka/delorean-deps.repo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then when &lt;a href=&quot;http://docs.openstack.org/developer/tripleo-docs/basic_deployment/basic_deployment_cli.html#get-images&quot;&gt;building images&lt;/a&gt;
I enabled the mitaka repo like:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;export NODE_DIST=centos7
export USE_DELOREAN_TRUNK=1
export DELOREAN_TRUNK_REPO=&quot;http://trunk.rdoproject.org/centos7-mitaka/current/&quot;
export DELOREAN_REPO_FILE=&quot;delorean.repo&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The two issues I hit:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#pebcak&quot;&gt;The pebcak issue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#overcloud_memory&quot;&gt;The overcloud needs moar memory bug&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;the-pebcak-issue&quot;&gt;The pebcak issue.&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;pebcak&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I call this the pebcak issue because, whilst there is indeed a &lt;a href=&quot;https://bugs.launchpad.net/tripleo/+bug/1584792&quot;&gt;bona-fide
bug&lt;/a&gt; here, I only
hit it because of a nit in my deployment command.&lt;/p&gt;

&lt;p&gt;My deployment command looked like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 \
  --libvirt-type qemu \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e network_env.yaml --ntp-server &quot;pool.ntp.org&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Deploying like that ^^^ got me this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;The files ('overcloud-without-mergepy.yaml', 'overcloud.yaml') not found
in the /usr/share/openstack-tripleo-heat-templates/ directory
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Err.. no I’m pretty sure those files are there (!)&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# [stack@instack ~]$ ls -l /usr/share/openstack-tripleo-heat-templates/overcloud-without-mergepy.yaml
  lrwxrwxrwx. 1 root root 14 Jun 17 08:55 /usr/share/openstack-tripleo-heat-templates/overcloud-without-mergepy.yaml -&amp;gt; overcloud.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I know that message very likely comes from the tripleoclient, so I traced it.
The code has actually already been fixed on master, so grep gave me nothing there.
However, trying the same grep against stable/mitaka:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[m@m python-tripleoclient]$ git checkout stable/mitaka
Switched to branch 'stable/mitaka'
[m@m python-tripleoclient]$ grep -rni &quot;not found in the&quot; ./*
./tripleoclient/v1/overcloud_deploy.py:414:  message = &quot;The files {0} not
found in the {1} directory&quot;.format(
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we can use git blame to get to the code review that fixed it. We know
which file the error message comes from, so we can blame that file on the
master branch. Since it is fixed there, some commit must have fixed it:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[m@m python-tripleoclient]$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
[m@m python-tripleoclient]$ git blame tripleoclient/v1/overcloud_deploy.py

1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  382)     def _try_overcloud_deploy_with_compat_yaml(self, tht_root, stack,
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  383)                                                stack_name, parameters,
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  384)                                                environments, timeout):
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  385)         messages = ['The following errors occurred:']
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  386)         for overcloud_yaml_name in constants.OVERCLOUD_YAML_NAMES:
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  387)             overcloud_yaml = os.path.join(tht_root, overcloud_yaml_name)
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  388)             try:
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  389)                 self._heat_deploy(stack, stack_name, overcloud_yaml,
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  390)                                   parameters, environments, timeout)
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  391)             except six.moves.urllib.error.URLError as e:
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  392)                 messages.append(str(e.reason))
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  393)             else:
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  394)                 return
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  395)         raise ValueError('\n'.join(messages))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The git blame output may not display well above, but I see the following line as
particularly interesting since it is different from stable/mitaka:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  392)                 messages.append(str(e.reason))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So now we can use git log to see the actual commit and check it is the one we
are looking for:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[m@m python-tripleoclient]$ git log 7a05679e
commit 7a05679ebc944e3bec6f20c194c40fae1cf39d8d
Author: James Slagle &amp;lt;jslagle@redhat.com&amp;gt;
Date:   Fri Apr 1 08:57:41 2016 -0400

Show correct missing files when an error occurs

This function was swallowing all missing file exceptions, and then
printing a message saying overcloud.yaml or
overcloud-without-mergepy.yaml were not found.

The problem is that the URLError could occur for any missing file, such
as a missing environment file, typo in a relative patch or filename,
etc. And in those cases, the error message is actually quite misleading,
especially if the overcloud.yaml does exist at the exact shown path.

This change makes it such that the actual missing file paths are shown
in the output.

Closes-Bug: 1584792
Change-Id: Id9a70cb50d7dfa3dde72eefe0a5eaea7985236ff
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now that sounds promising! So not only do we have the actual bug number, but
we have the Change-Id. We can use &lt;em&gt;that&lt;/em&gt; to get to the gerrit code review:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[m@m ~]$ gimmeGerrit Id9a70cb50d7dfa3dde72eefe0a5eaea7985236ff
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;strong&gt;gimmeGerrit&lt;/strong&gt; is a bash alias in my .profile:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gimme_gerrit() {
    gerrit_url=&quot;http://review.openstack.org/#q,$1,n,z&quot;
    firefox $gerrit_url
}

alias gimmeGerrit=gimme_gerrit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
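The alias is just text munging on the Change-Id. Here is a self-contained sketch of the same trick, with the commit message from the git log output above inlined so it runs anywhere:

```shell
# Build the Gerrit search URL from a Change-Id footer, as gimmeGerrit does.
# The commit message is inlined here so the example is self-contained.
commit_msg='Show correct missing files when an error occurs

Closes-Bug: 1584792
Change-Id: Id9a70cb50d7dfa3dde72eefe0a5eaea7985236ff'

# Pull out the Change-Id footer, then splice it into the search URL:
change_id=$(printf '%s\n' "$commit_msg" | sed -n 's/^Change-Id: //p')
gerrit_url="http://review.openstack.org/#q,${change_id},n,z"
echo "$gerrit_url"
```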

&lt;p&gt;So from the review to &lt;a href=&quot;https://review.openstack.org/#/c/300462/&quot;&gt;master&lt;/a&gt; I just
made a cherry-pick to &lt;a href=&quot;https://review.openstack.org/#/c/329438/&quot;&gt;stable/mitaka&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, the reason I was seeing this issue in the first place was that my deploy
command was indeed wrong (it’s just that the error message was eaten by this
particular bug). I was using ‘network_env.yaml’ but I had actually created
network-env.yaml. Yes, much palmface, but if I hadn’t I wouldn’t have backported
the fix, so meh.&lt;/p&gt;
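The whole trail (grep for the message, git blame the line, git log the commit, read the Change-Id) can be replayed end to end in a throwaway repo. Everything below is fabricated for illustration: the repo, the file and both commits stand in for python-tripleoclient.

```shell
# Replay the grep, git blame, git log trail from this post in a throwaway
# repo. The repo, file and commits are all made up for illustration.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo 'message = "files not found"' > deploy.py
git add deploy.py
git commit -q -m 'initial commit'
# A second commit "fixes" the line, carrying a Change-Id footer:
echo 'message = "The files {0} not found in the {1} directory"' > deploy.py
git commit -q -a -m 'Show correct missing files when an error occurs

Change-Id: Id9a70cb50d7dfa3dde72eefe0a5eaea7985236ff'
# grep for the error string, blame the line, then read the Change-Id:
file=$(grep -rl --exclude-dir=.git 'not found in the' . | head -1)
sha=$(git blame -l -- "$file" | awk '{print $1; exit}')
change_id=$(git log -1 --format=%B "$sha" | sed -n 's/^Change-Id: //p')
echo "$change_id"
```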

&lt;hr /&gt;

&lt;h3 id=&quot;the-overcloud-needs-moar-memory-bug&quot;&gt;The overcloud needs moar memory bug.&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;overcloud_memory&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is more or less well known in the tripleo community that 4GB overcloud nodes
will no longer cut it even in a virt environment, which is why we default to
5GB on current &lt;a href=&quot;https://github.com/openstack/instack-undercloud/blob/2dec7d7521799c0323d076cd66ba71ebb444c706/scripts/instack-virt-setup#L89&quot;&gt;master&lt;/a&gt;
instack-undercloud.&lt;/p&gt;
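Since instack-virt-setup only sets NODE_MEM as a default (NODE_MEM=${NODE_MEM:-5120}), exporting it before running the script overrides the overcloud node memory. A minimal sketch, where the 6144 value is purely illustrative:

```shell
# NODE_MEM is only a default in instack-virt-setup, so exporting it
# beforehand overrides the per-node memory (value is in MiB).
# 6144 here is just an illustrative value, not a recommendation.
export NODE_MEM=6144
echo "overcloud nodes will get ${NODE_MEM} MiB each"
```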

&lt;p&gt;I was seeing OOM issues on the overcloud nodes with current stable/mitaka like:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;16021:Jun 14 10:53:07 overcloud-controller-0 os-collect-config[2330]: u001b[0m\n\u001b[1;31mWarning: Not collecting exported resources without storeconfigs\u001b[0m\n\u001b[1;31mWarning: Not collecting exported resources without storeconfigs\u001b[0m\n\u001b[1;31mWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.\u001b[0m\n\u001b[1;31mWarning: Not collecting exported resources without storeconfigs\u001b[0m\n\u001b[1;31mWarning: Not collecting exported resources without storeconfigs\u001b[0m\n\u001b[1;31mWarning: Not collecting exported resources without storeconfigs\u001b[0m\n\u001b[1;31mError: /Stage[main]/Main/Pacemaker::Constraint::Base[storage_mgmt_vip-then-haproxy]/Exec[Creating order constraint storage_mgmt_vip-then-haproxy]: Could not evaluate: Cannot allocate memory - fork(2)\u001b[0m\n\u001b[1;31mError: /Stage[main]/Main/Pacemaker::Resource::Service[openstack-nova-novncproxy]/Pacemaker::Resource::Systemd[openstack-nova-novncproxy]/Pcmk_resource[openstack-nova-novncproxy]: Could not evaluate: Cannot allocate memory - /usr/sbin/pcs resource show openstack-nova-novncproxy &amp;gt; /dev/null 2&amp;gt;&amp;amp;1 2&amp;gt;&amp;amp;1\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Main/Pacemaker::Constraint::Base[nova-vncproxy-then-nova-api-constraint]/Exec[Creating order constraint nova-vncproxy-then-nova-api-constraint]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Main/Pacemaker::Constraint::Colocation[nova-api-with-nova-vncproxy-colocation]/Pcmk_constraint[colo-openstack-nova-api-clone-openstack-nova-novncproxy-clone]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: 
/Stage[main]/Main/Pacemaker::Constraint::Base[nova-consoleauth-then-nova-vncproxy-constraint]/Exec[Creating order constraint nova-consoleauth-then-nova-vncproxy-constraint]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Main/Pacemaker::Constraint::Colocation[nova-vncproxy-with-nova-consoleauth-colocation]/Pcmk_constraint[

16313:Jun 14 10:53:07 overcloud-controller-0 os-collect-config[2330]:
Error: /Stage[main]/Sahara::Service::Api/Service[sahara-api]: Could not
evaluate: Cannot allocate memory - fork(2)
16314:Jun 14 10:53:07 overcloud-controller-0 os-collect-config[2330]:
Error: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Config[haproxy]/Concat[/etc/haproxy/haproxy.cfg]/Exec[concat_/etc/haproxy/haproxy.cfg]:
Could not evaluate: Cannot allocate memory - fork(2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Suspecting from previous experience this would be defaulted in instack-undercloud:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[m@m instack-undercloud]$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
[m@m instack-undercloud]$ grep -rni 'NODE_MEM' ./*
./scripts/instack-virt-setup:89:export NODE_MEM=${NODE_MEM:-5120}

[m@m instack-undercloud]$ git blame scripts/instack-virt-setup | grep  NODE_MEM
2dec7d75 (Carlos Camacho  2016-03-30 09:17:44 +0000  89) export NODE_MEM=${NODE_MEM:-5120}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So using git log to see more about 2dec7d75:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[m@m instack-undercloud]$ git log 2dec7d75
commit 2dec7d7521799c0323d076cd66ba71ebb444c706
Author: Carlos Camacho &amp;lt;ccamacho@redhat.com&amp;gt;
Date:   Wed Mar 30 09:17:44 2016 +0000

    Overcloud is not able to deploy with the default 4GB of RAM using instack-undercloud

    When deploying the overcloud with the default value of 4GB of RAM the overcloud fails throwing &quot;Cannot allocate memory&quot; errors.
    By increasing the default memory to 5GB the error is solved in instack-undercloud

    Change-Id: I29036edeebefc1959643a04c5396e72863fdca5f
    Closes-Bug: #1563750
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So as in the case of the pebcak issue, gimmeGerrit yields the &lt;a href=&quot;https://review.openstack.org/#/c/299232/&quot;&gt;review&lt;/a&gt;
so I then just cherry-picked that to &lt;a href=&quot;https://review.openstack.org/#/c/329874/&quot;&gt;stable/mitaka&lt;/a&gt;
too.&lt;/p&gt;

</description>
<published>2016-06-17 00:00:00 +0300</published>
<link>http://mariosandreou.com/tripleo/2016/06/17/deploy-tripleo-stable-mitaka.html</link>
</item>

<item>
<title>Monitoring a tripleo Overcloud upgrade</title>
<description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;blog_title&quot;&gt;

Monitoring a tripleo Overcloud upgrade

&lt;/div&gt;

&lt;p&gt;The tripleo overcloud upgrades workflow (&lt;a href=&quot;https://review.openstack.org/#/c/308985/&quot;&gt;WIP Docs&lt;/a&gt;)
has been well tested for upgrades to stable/liberty. There is &lt;a href=&quot;https://blueprints.launchpad.net/tripleo/+spec/overcloud-upgrades&quot;&gt;ongoing work&lt;/a&gt;
to adapt this workflow for upgrades to stable/mitaka/newton (current master),
as well as to change the process altogether and make it &lt;a href=&quot;https://review.openstack.org/#/c/319264/&quot;&gt;more composable&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post is a description of the kinds of things I look for when monitoring a
stable/liberty upgrade: verification points after a given step and some
explanation at various points that may or may not be helpful. I recently had to
share a lot of this information as part of a customer POC upgrade and thought
it would be useful to have it written down somewhere.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#undercloud&quot;&gt;Upgrade the undercloud&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#upgrade_init&quot;&gt;Upgrade init step&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#upgrade_controllers&quot;&gt;Upgrade controllers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#upgrade_compute_and_ceph&quot;&gt;Upgrade compute and ceph nodes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#upgrade_converge&quot;&gt;Upgrade converge - apply config deployment wide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For reference, the overcloud being upgraded in the examples below was deployed
like:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack overcloud deploy --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;upgrade-your-undercloud&quot;&gt;Upgrade your undercloud.&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;undercloud&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first thing to check for, and very likely re-instate, is any post-install
customization you made to your undercloud, such as the creation of a new ovs
interface for talking to your overcloud nodes, or any custom IP routes. The
undercloud upgrade will revert those and you’ll have to re-add/create them.&lt;/p&gt;
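One low-tech precaution is to snapshot that state before upgrading so anything the upgrade reverts is easy to re-create afterwards. A rough sketch, assuming routes and ovs bridges are the customizations you care about; the paths are illustrative and each command is hedged with || true since not every tool is present everywhere:

```shell
# Snapshot undercloud network customizations before the upgrade so anything
# the upgrade reverts is easy to re-create. Paths are illustrative.
backup=$(mktemp -d)
ip route show > "$backup/routes.txt" 2>/dev/null || true
ovs-vsctl show > "$backup/ovs.txt" 2>/dev/null || true
ls "$backup"
```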

&lt;p&gt;The upgrade to liberty delivers a new &lt;em&gt;upgrade-non-controller.sh&lt;/em&gt; script for
the undercloud, so you can check this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[stack@instack ~]$ which upgrade-non-controller.sh
/bin/upgrade-non-controller.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Other than that I always just sanity check that services are running OK post
upgrade:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[stack@instack ~]$ openstack-service status
MainPID=2107 Id=neutron-dhcp-agent.service ActiveState=active
MainPID=2106 Id=neutron-openvswitch-agent.service ActiveState=active
MainPID=1191 Id=neutron-server.service ActiveState=active
MainPID=1232 Id=openstack-glance-api.service ActiveState=active
MainPID=1172 Id=openstack-glance-registry.service ActiveState=active
MainPID=1201 Id=openstack-heat-api-cfn.service ActiveState=active
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;execute-the-upgrade-initialization-step&quot;&gt;Execute the upgrade initialization step&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;upgrade_init&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is called the initialization step since it sets up the repos on the
overcloud nodes (for the upgrade we are going to) and delivers the upgrade
script to the non-controller nodes. This step is instigated through the
inclusion of the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker-init.yaml&quot;&gt;major-upgrade-pacemaker-init.yaml&lt;/a&gt;
in the deployment command. For example:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack overcloud deploy --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once the heat stack has gone to &lt;em&gt;UPDATE_COMPLETE&lt;/em&gt; you can check all non controller
nodes for the presence of the newly delivered upgrade script &lt;em&gt;tripleo_upgrade_node.sh&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-novacompute-0 ~]# ls -l /root
-rwxr-xr-x. 1 root root 348 Jun  3 11:26 tripleo_upgrade_node.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One point to note is that the rpc version used for pinning nova rpc during
the upgrade is set in the compute upgrade script:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-novacompute-0 ~]# cat tripleo_upgrade_node.sh
### DO NOT MODIFY THIS FILE
### This file is automatically delivered to the compute nodes as part of the
### tripleo upgrades workflow

# pin nova to kilo (messaging +-1) for the nova-compute service

crudini  --set /etc/nova/nova.conf upgrade_levels compute mitaka

yum -y install python-zaqarclient  # needed for os-collect-config
yum -y update
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The upgrade_levels compute line above is actually written using the
parameter we passed in the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker-init.yaml#L2&quot;&gt;major-upgrade-pacemaker-init.yaml&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You should also see the updated /etc/yum.repos.d/* on all overcloud nodes after
this step, so you can confirm that all is in order for the upgrade to proceed.&lt;/p&gt;
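A quick sanity check is to grep the delorean repo file for the release you expect. A self-contained sketch: the repo file contents are inlined here (and the release name is illustrative), where on a real node you would grep /etc/yum.repos.d/delorean.repo directly:

```shell
# Sanity-check that a delorean repo file points at the expected release.
# The repo file is written to a temp path so the sketch is runnable;
# on an overcloud node you would check /etc/yum.repos.d/ directly.
repo_file=$(mktemp)
printf '[delorean]\nname=delorean\nbaseurl=http://trunk.rdoproject.org/centos7-mitaka/current/\nenabled=1\n' > "$repo_file"

expected=mitaka   # illustrative; use the release you are upgrading to
if grep -q "centos7-${expected}" "$repo_file"; then
  result="repo points at ${expected}"
else
  result="unexpected release in ${repo_file}"
fi
echo "$result"
```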

&lt;hr /&gt;

&lt;h3 id=&quot;upgrade-controller-nodes-and-your-entire-pacemaker-cluster&quot;&gt;Upgrade controller nodes (and your entire pacemaker cluster)&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;upgrade_controllers&quot;&gt; &lt;/a&gt;
&lt;em&gt;(I skipped upgrading the swift nodes, as there isn’t much of interest to say about it; see
the &lt;a href=&quot;https://review.openstack.org/#/c/308985/&quot;&gt;WIP Docs&lt;/a&gt; for more or ping me).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This step upgrades your controller nodes, and during this process the entire
pacemaker cluster will be taken offline - this is normal. The step
is instigated by including the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker.yaml&quot;&gt;major-upgrade-pacemaker.yaml&lt;/a&gt;
environment file. For example:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack overcloud deploy --templates /home/stack/tripleo-heat-templates
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --control-scale 3 --compute-scale 1 --libvirt-type qemu
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I typically observe the pacemaker cluster during the upgrade process. For
example on controller-1 I have &lt;strong&gt;watch -d pcs status&lt;/strong&gt; and on controller-2 I
have &lt;strong&gt;watch -d 'pcs status | grep -ni stop -C 2'&lt;/strong&gt; (note the quotes, so
the grep runs inside watch rather than on its output). During the upgrade the
pacemaker cluster goes down completely at some point, before the yum packages are
updated, and then the cluster is brought back up.&lt;/p&gt;

&lt;p&gt;Once you start to see pacemaker services go down it means that the code in
&lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh&quot;&gt;major_upgrade_controller_pacemaker_1.sh&lt;/a&gt;
 is running and eventually the cluster is stopped completely.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Every 2.0s: pcs status | grep -ni stop -C2 -B1                                                               Fri Jun  3 11:52:07 2016

Error: cluster is not currently running on this node
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point you can start to monitor /var/log/yum.log to see packages being
upgraded.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-controller-0 ~]# tail -f /var/log/yum.log
Jun 03 11:51:52 Updated: erlang-otp_mibs-18.3.3-1.el7.x86_64
Jun 03 11:51:52 Installed: python2-rjsmin-1.0.12-2.el7.x86_64
Jun 03 11:51:52 Updated: python-django-compressor-2.0-1.el7.noarch
Jun 03 11:51:53 Updated: ntp-4.2.6p5-22.el7.centos.2.x86_64
Jun 03 11:51:53 Updated: rabbitmq-server-3.6.2-3.el7.noarch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once the cluster starts to come back online and services start then
you know that &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh&quot;&gt;major_upgrade_controller_pacemaker_2.sh&lt;/a&gt;
is being executed.&lt;/p&gt;
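
&lt;p&gt;One rough way to watch the services coming back is to count the resources
that pcs reports as started (a sketch only - the expected total depends on your
deployment):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-controller-0 ~]# watch -d 'pcs status | grep -c Started'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;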

&lt;p&gt;After the stack is UPDATE_COMPLETE, you can check that the rpc pin is set in
nova.conf on all controllers:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-controller-0 ~]# grep -rni upgrade -A 1 /etc/nova/*
/etc/nova/nova.conf:106:[upgrade_levels]
/etc/nova/nova.conf-107-compute = mitaka
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;upgrade-compute-and-ceph-nodes&quot;&gt;Upgrade compute and ceph nodes&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;upgrade_compute_and_ceph&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This uses the &lt;a href=&quot;https://github.com/openstack/tripleo-common/blob/463cf7f922291dc47593caabe5ef4e8b728c2f55/scripts/upgrade-non-controller.sh&quot;&gt;upgrade-non-controller.sh&lt;/a&gt; script to execute
the &lt;em&gt;tripleo_upgrade_node.sh&lt;/em&gt; on each non-controller node, for example:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[stack@instack ~]$ upgrade-non-controller.sh --upgrade overcloud-novacompute-0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On both node types you can check that the yum update has been executed
successfully. Note that the &lt;em&gt;tripleo_upgrade_node.sh&lt;/em&gt; script is customized for
each node type, so it &lt;em&gt;will&lt;/em&gt; differ between compute and ceph nodes for
example. However, in all cases there will at some point be a
&lt;strong&gt;yum -y update&lt;/strong&gt;. See &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/major_upgrade_compute.sh&quot;&gt;major_upgrade_compute.sh&lt;/a&gt;
and &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/major_upgrade_ceph_storage.sh&quot;&gt;major_upgrade_ceph_storage.sh&lt;/a&gt; for
more info on how else they might differ.&lt;/p&gt;
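
&lt;p&gt;As on the controllers, /var/log/yum.log is a quick way to confirm that the
&lt;strong&gt;yum -y update&lt;/strong&gt; really did run on a given node, for example:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-novacompute-0 ~]# grep 'Updated:' /var/log/yum.log | tail -n 3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;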

&lt;p&gt;For compute nodes you can check that upgrade_levels is set for the nova
rpc pinning in /etc/nova/nova.conf (which in the case of computes is used by
nova-compute itself; the api/scheduler/conductor services live on the controllers).&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-novacompute-0 ~]# grep -rni upgrade -A 1 /etc/nova/*
/etc/nova/nova.conf:106:[upgrade_levels]
/etc/nova/nova.conf-107-compute = mitaka
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;upgrade-converge---apply-config-deployment-wide-and-restart-things&quot;&gt;Upgrade converge - apply config deployment wide and restart things.&lt;/h3&gt;
&lt;p&gt;&lt;a id=&quot;upgrade_converge&quot;&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last step in the upgrade workflow is where we re-apply the deployment-wide
config as specified by the tripleo-heat-templates used in the deploy/upgrade
commands. It is instigated by including the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker-converge.yaml&quot;&gt;major-upgrade-pacemaker-converge.yaml&lt;/a&gt; environment file, for example:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;openstack overcloud deploy --templates /home/stack/tripleo-heat-templates
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --control-scale 3 --compute-scale 1 --libvirt-type qemu
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For both &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker-init.yaml&quot;&gt;major-upgrade-pacemaker-init.yaml&lt;/a&gt;
(upgrade initialisation) as well as &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker.yaml&quot;&gt;major-upgrade-pacemaker.yaml&lt;/a&gt;
(controller upgrade) we specify for the resource registry:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;OS::TripleO::ControllerPostDeployment: OS::Heat::None
OS::TripleO::ComputePostDeployment: OS::Heat::None
OS::TripleO::ObjectStoragePostDeployment: OS::Heat::None
OS::TripleO::BlockStoragePostDeployment: OS::Heat::None
OS::TripleO::CephStoragePostDeployment: OS::Heat::None
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which means that things like &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/puppet/controller-config-pacemaker.yaml&quot;&gt;controller-config-pacemaker.yaml&lt;/a&gt;
&lt;em&gt;do not&lt;/em&gt; run for controllers during those steps. That is, application of
the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/tree/bcd726f1242d78169e6a5687e998473c1043c622/puppet/manifests&quot;&gt;overcloud_**.pp manifests&lt;/a&gt;
does not happen during upgrade initialisation or the controller upgrade.&lt;/p&gt;

&lt;p&gt;However for converge we simply do not override this in the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/environments/major-upgrade-pacemaker-converge.yaml&quot;&gt;major-upgrade-pacemaker-converge.yaml&lt;/a&gt;
environment file, so that the normal puppet manifests get applied for each node,
delivering any config changes (e.g. updates to liberty had to deal with a
rabbitmq password change causing issues such &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=1321132&quot;&gt;as this&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Since we are applying new config, we need to make sure everything is restarted
to pick it up, so we run
&lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/pacemaker_resource_restart.sh&quot;&gt;pacemaker_resource_restart.sh&lt;/a&gt; after the normal puppet manifests are applied.&lt;/p&gt;

&lt;p&gt;So during this step the pacemaker cluster will first go into an “unmanaged”
state - this is to be expected and not a cause for alarm. It happens because,
as a matter of practice, before applying the controller puppet manifest we set
the cluster to maintenance mode (as we are going to write to the pacemaker
resource definitions/constraints in the cib) &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/pre_puppet_pacemaker.yaml&quot;&gt;like this&lt;/a&gt;,
which uses the &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/pacemaker_maintenance_mode.sh&quot;&gt;script here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After the manifest is applied we unset maintenance mode &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/post_puppet_pacemaker.yaml&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
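
&lt;p&gt;If you want to see this happening you can watch the maintenance-mode cluster
property on one of the controllers - it should flip to true and then back to
false around the application of the manifest (assuming your pcs version
supports querying a single property like this):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-controller-0 ~]# watch -d 'pcs property show maintenance-mode'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;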

&lt;p&gt;You should then see services restarting as &lt;a href=&quot;https://github.com/openstack/tripleo-heat-templates/blob/bcd726f1242d78169e6a5687e998473c1043c622/extraconfig/tasks/pacemaker_resource_restart.sh&quot;&gt;pacemaker_resource_restart.sh&lt;/a&gt; is executed. Seeing all the services running again at this
point is a good indication that the converge step is coming to a successful end.&lt;/p&gt;
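
&lt;p&gt;Once converge has gone to UPDATE_COMPLETE, a couple of simple sanity checks
are to confirm that pcs reports nothing stopped and that the openstack services
are reporting in - for example for nova (obviously check whichever services
matter to your deployment):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@overcloud-controller-0 ~]# pcs status | grep -ni stop -C 2
[stack@instack ~]$ source overcloudrc
[stack@instack ~]$ nova service-list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;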

</description>
<published>2016-06-03 00:00:00 +0300</published>
<link>http://mariosandreou.com/tripleo/2016/06/03/monitor-tripleo-upgrade.html</link>
</item>

</channel>
</rss>
