  1. Hippo CMS
  2. CMS-14073

Upgrading to v14.3.x in a running cluster may cause (possibly intermittent) failures reloading the site on not-yet-upgraded instances

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 14.3.0, 14.3.1, 14.3.2
    • Fix Version/s: 14.2.3, 14.3.3
    • None
    • Pulsar
    • Pulsar 245 - Eng OKRs

    Description

      Release v14.3.0+ introduces a new hippo:identifiable mixin and a hippo:identifier property with a dynamic default value; see CMS-13587.
      To support this custom JCR extension, an updated Jackrabbit version, (hippo-)jackrabbit 2.18.5-h3, is needed; it is bundled with release v14.3.0.

      However, during an upgrade in a running cluster, cluster instances that are still on an older version may fail to process the related node type changes.

      This overlooked problem is not expected to be permanent (it should be resolved once all cluster instances have been upgraded), but the sites running on the not-yet-upgraded cluster instances may no longer be able to serve requests.

      We're looking into a possible remedy for this problem, but at this moment we strongly advise against upgrading to v14.3.2 (or an earlier 14.3.x) until we have resolved this.

      2020-10-13: Root cause analysis and resolution

      The problem turned out to be a bug (CMS-14076) that caused a running old instance to persist its failed-to-update (CMS-13707) node type definitions back into the database.
      This effectively 'rolled back' (some of) the intended node type changes needed for v14.3.x, which then caused the site to fail rendering, even after a completed deployment and restart of the cluster!

      This problem is only triggered during a rolling upgrade deployment, when old instances that are still running receive the cluster synchronization event with the node type changes and then process them incorrectly, as described above.

      When using a stop/start cluster deployment, all instances already have the node type changes when they start, and therefore have no problem processing them.
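      The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class names, the node type names ("example:existing", "example:newMixin") and the set-based "shared store" are all hypothetical and do not correspond to actual Jackrabbit or Hippo CMS code. The flag writesBackOnFailure stands in for the CMS-14076 behaviour.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only; none of these classes exist in Jackrabbit or Hippo CMS. */
final class NodeTypeSyncModel {

    /** Stand-in for the shared database holding the persisted node type names. */
    static final class SharedStore {
        final Set<String> nodeTypes = new HashSet<>();
    }

    /** A still-running old instance that only understands the node types it shipped with. */
    static final class OldInstance {
        private final Set<String> known;
        private final boolean writesBackOnFailure; // hypothetical flag modelling CMS-14076

        OldInstance(Set<String> known, boolean writesBackOnFailure) {
            this.known = known;
            this.writesBackOnFailure = writesBackOnFailure;
        }

        /** Handles a cluster sync event; returns true if the change could be processed. */
        boolean onNodeTypesChanged(SharedStore store) {
            if (known.containsAll(store.nodeTypes)) {
                return true;
            }
            if (writesBackOnFailure) {
                // The stale local definitions overwrite the shared ones: the 'roll back'.
                store.nodeTypes.retainAll(known);
            }
            return false;
        }
    }

    /** Simulates one rolling-upgrade step; returns whether the new node type is still persisted. */
    static boolean newTypeSurvives(boolean writesBackOnFailure) {
        SharedStore store = new SharedStore();
        store.nodeTypes.add("example:existing");
        OldInstance old = new OldInstance(Set.of("example:existing"), writesBackOnFailure);

        // An already-upgraded instance registers the new node type ...
        store.nodeTypes.add("example:newMixin");
        // ... and the old instance receives the synchronization event.
        old.onNodeTypesChanged(store);

        return store.nodeTypes.contains("example:newMixin");
    }
}
```

      In this model, newTypeSurvives(true) shows the new definition being lost from the shared store, while newTypeSurvives(false) shows the old instance still failing to process the change but leaving the shared definitions intact.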

      With only the CMS-14076 bug fixed, a rolling upgrade would still cause intermittent errors on the old instances that are still running, but as soon as those instances are upgraded as well, the errors would resolve automatically and the instances would be stable again.
      However, because we already needed to backport the CMS-14076 fix, we also backported the changes for CMS-13707, including the newer jackrabbit v2.18.5-h3, and thereby even those intermittent errors are now resolved.

      To support a rolling upgrade to v14.3.x (or later) we will therefore provide an intermediate patch release, v14.2.3, with these two fixes.

      Also, the CMS-14076 fix, as well as some other important fixes and improvements, will be bundled in a new v14.3.3 release.

      The recommended upgrade path from v14.x to v14.3.x or later is therefore:

      • Upgrade to the latest v14.3.3 or later
      • Either:
        • use a cluster stop/start deploy, possibly combined with a blue/green deployment: this will not (and did not) have this issue at all
      • or:
        1. first do an intermediate rolling deploy to v14.2.3
        2. wait until all instances have been upgraded
        3. immediately thereafter do a rolling deploy to v14.3.3
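
      The upgrade path above can be sketched as a small decision helper. This is a minimal illustration, not tooling shipped with Hippo CMS: the class UpgradePlanner and its methods are hypothetical, versions are assumed to be plain dotted numbers such as "14.1.0", and the handling of instances already on 14.3.0 to 14.3.2 (going straight to v14.3.3) is an assumption that the text above does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Hippo CMS. */
final class UpgradePlanner {
    static final String INTERMEDIATE = "14.2.3"; // intermediate patch release
    static final String TARGET = "14.3.3";       // first 14.3.x release with the fix

    /** Compares dotted numeric versions such as "14.2.3"; missing parts count as 0. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Returns the ordered list of versions to deploy, starting from the current version. */
    static List<String> plan(String current, boolean rollingDeploy) {
        List<String> steps = new ArrayList<>();
        if (compareVersions(current, TARGET) >= 0) {
            return steps; // already on the target or later: nothing to do
        }
        // A rolling deploy from before v14.2.3 goes through the intermediate release first.
        // Assumption: versions at or above v14.2.3 (including 14.3.0 to 14.3.2) skip it.
        if (rollingDeploy && compareVersions(current, INTERMEDIATE) < 0) {
            steps.add(INTERMEDIATE);
        }
        steps.add(TARGET);
        return steps;
    }
}
```

      For example, plan("14.1.0", true) yields the two-step path via v14.2.3, whereas plan("14.1.0", false) (a stop/start deploy) goes directly to v14.3.3. The helper does not model step 2 of the rolling path: each step should only start once all instances have completed the previous one.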

      Note: since v14.2.x releases are no longer maintained/updated, the intermediate v14.2.3 patch release is only provided to enable a rolling upgrade to v14.3.x or later, and is not supported for actual production usage.

       

            People

              Assignee: Unassigned
              Reporter: Ate Douma (adouma)