  1. [Read Only] - Hippo Repository
  2. REPO-2305

Concurrent UpgradeToV13 execution in a clustered/cloud deployment may fail to complete or revert specific necessary changes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Top
    • Resolution: Fixed
    • Affects Version/s: 13.4.1
    • Fix Version/s: 14.1.0, 13.4.2, 12.6.9, 14.0.1
    • Component/s: None
    • Labels:
      None
    • Processed by team:
      Pulsar
    • Sprint:
      Platform 226 - Taxonomy 1

      Description

      A problem has been detected during an upgrade to v13: the necessary UpgradeToV13 changes may not be applied correctly when they are executed concurrently by multiple (clustered) instances at startup.
      The problem is caused by the (Jackrabbit) cluster-wide event processing of intermediate incompatible nodetype changes (as required for the upgrade to v13), which fail to be processed by the event-receiving instance (because they are regarded as incompatible from Jackrabbit's POV).
      Such failed external (broadcast) intermediate nodetype changes are ignored, but may thereafter be followed by other (intermediate) nodetype changes which are compatible.
      As a result, the receiving instance(s) may end up with an incomplete/half-baked nodetype change, which they will also persist to the database.
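
      The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: a nodetype is reduced to a map of property names to types, and a change counts as "incompatible" when it retypes an existing property. The class and method names are hypothetical, and the compatibility rule is not Jackrabbit's actual check.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a nodetype is a map of property name -> type, and a change
// is "compatible" when it does not alter the type of an existing property.
// These rules are illustrative assumptions, not Jackrabbit's actual checks.
class HalfBakedNodeTypeDemo {

    // One intermediate nodetype change: set a property to a given type.
    static final class Change {
        final String property;
        final String type;

        Change(String property, String type) {
            this.property = property;
            this.type = type;
        }
    }

    // Source instance: applies every intermediate change in order.
    static Map<String, String> applyAll(Map<String, String> start, List<Change> changes) {
        Map<String, String> def = new LinkedHashMap<>(start);
        for (Change c : changes) {
            def.put(c.property, c.type);
        }
        return def;
    }

    // Receiving instance: re-checks each broadcast change, silently skips the
    // ones it regards as incompatible, but still applies the later ones.
    static Map<String, String> applyWithRecheck(Map<String, String> start, List<Change> changes) {
        Map<String, String> def = new LinkedHashMap<>(start);
        for (Change c : changes) {
            String existing = def.get(c.property);
            boolean compatible = existing == null || existing.equals(c.type);
            if (compatible) {
                def.put(c.property, c.type);
            }
        }
        return def;
    }

    public static void main(String[] args) {
        Map<String, String> before = Map.of("title", "String");
        List<Change> upgrade = List.of(
                new Change("title", "Long"),     // incompatible: retypes a property
                new Change("summary", "String")  // compatible: adds a property
        );
        System.out.println("source:   " + applyAll(before, upgrade));
        System.out.println("receiver: " + applyWithRecheck(before, upgrade));
    }
}
```

      In this model the receiver accepts the second change but not the first, so its definition no longer matches the source's: the half-baked state that would then be persisted.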

      This problem can only occur during concurrent clustered/cloud-based startup.
      Cluster instances that are started only after a first (single) MigrateToV13 execution has completed will not have this problem, because nodetype change events from older/earlier journal entries are always ignored since the earlier fix from REPO-2196.
      So this problem can be prevented by upgrading to v13 (once) using only a single repository instance/node, and only thereafter spinning up additional cluster instances.

      However, to make the upgrade process more resilient and convenient, and to allow concurrent startup even for the upgrade to v13, the following changes are needed:

      1. Ensure sequential execution of migration logic which requires specific nodetype changes (e.g. MigrateToV13), using a cluster-wide lock.
      2. Improve and optimize the Jackrabbit processing of external nodetype change events:
        • These events do not need to be checked again for potentially incompatible changes, as the changes have already been verified and applied by the source instance.
          The only purpose of the broadcast event is to notify the receiving instance of these changes and update it in memory (e.g. update/flush its cached model).
        • The updated nodetype model on the receiving instance does not need to be persisted again, as the changes have already been applied and persisted by the source instance.
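
      The first change (sequential execution under a cluster-wide lock) could look roughly like the sketch below. It is not the actual implementation: ClusterLock, InMemoryClusterLock and SharedRepository are hypothetical names, threads stand in for cluster instances, and an in-process lock stands in for a lock that would really have to live in storage shared by all instances. The "already migrated" check inside the lock is likewise an assumption of this sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: all type names here are hypothetical. A real cluster-wide lock
// must be backed by storage shared by all instances; the in-process lock
// below merely stands in for it.
class SequentialMigrationDemo {

    interface ClusterLock {
        void acquire();
        void release();
    }

    static class InMemoryClusterLock implements ClusterLock {
        private final ReentrantLock lock = new ReentrantLock();
        public void acquire() { lock.lock(); }
        public void release() { lock.unlock(); }
    }

    // Shared state standing in for the repository.
    static class SharedRepository {
        boolean migrated = false;                       // only accessed while holding the lock
        final AtomicInteger migrationRuns = new AtomicInteger();
    }

    // What each cluster instance does at startup: take the lock, then migrate
    // only if no other instance has completed the migration already.
    static void startInstance(ClusterLock lock, SharedRepository repo) {
        lock.acquire();
        try {
            if (!repo.migrated) {
                repo.migrationRuns.incrementAndGet();   // the migration steps would run here
                repo.migrated = true;
            }
        } finally {
            lock.release();
        }
    }

    // Starts the given number of "instances" concurrently and returns how
    // often the migration logic was actually executed.
    static int runCluster(int instances) {
        ClusterLock lock = new InMemoryClusterLock();
        SharedRepository repo = new SharedRepository();
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        for (int i = 0; i < instances; i++) {
            pool.execute(() -> startInstance(lock, repo));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("instances did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return repo.migrationRuns.get();
    }

    public static void main(String[] args) {
        System.out.println("migration runs: " + runCluster(4));
    }
}
```

      Because the check and the migration happen inside the same locked section, concurrently starting instances execute one after another, and later instances see the completed state instead of a sequence of intermediate changes.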

      The above changes will also be back-ported to v12.6 (v12.6.9), as it uses and requires similar MigrateToV12 logic for upgrading from v11.
      So far there have been no reports of similar problems with the upgrade to v12, which was simpler in this regard; the back-port is done just to be sure and to align the general and common logic.

      The UpgradeToV13 logic is intended to be needed and executed only once, for which it uses a specific 'marker' nodetype definition (hipposys:ntd_v13).
      An additional 'marker' nodetype definition (hipposys:ntd_v13b) will therefore be added to force a (one-time) re-execution on already upgraded deployments, to repair and complete potentially incomplete upgrades caused by this issue.
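
      The marker mechanism can be sketched as follows. This is a simplified illustration, not the actual upgrade code: a plain set of nodetype names stands in for the repository's nodetype registry, the method names are hypothetical, and the sketch assumes the upgrade steps are safe to run a second time.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: the set of registered nodetype names stands in for the
// repository's nodetype registry; needsUpgrade/upgradeIfNeeded are
// hypothetical names.
class MarkerNodeTypeDemo {

    static final String MARKER_V13 = "hipposys:ntd_v13";
    static final String MARKER_V13B = "hipposys:ntd_v13b";

    // The upgrade has to run (again) as long as the newest marker is missing,
    // even when the original marker is already present.
    static boolean needsUpgrade(Set<String> registeredNodeTypes) {
        return !registeredNodeTypes.contains(MARKER_V13B);
    }

    // Runs the upgrade if needed and registers both markers.
    // Returns true when the upgrade logic was executed.
    static boolean upgradeIfNeeded(Set<String> registeredNodeTypes) {
        if (!needsUpgrade(registeredNodeTypes)) {
            return false;
        }
        // ... the upgrade steps would go here (assumed safe to re-run) ...
        registeredNodeTypes.add(MARKER_V13);
        registeredNodeTypes.add(MARKER_V13B);
        return true;
    }

    public static void main(String[] args) {
        // A deployment that was upgraded before the additional marker existed.
        Set<String> alreadyUpgraded = new HashSet<>(Set.of(MARKER_V13));
        System.out.println("first startup ran upgrade:  " + upgradeIfNeeded(alreadyUpgraded));
        System.out.println("second startup ran upgrade: " + upgradeIfNeeded(alreadyUpgraded));
    }
}
```

      A deployment carrying only the original marker is upgraded once more on its next startup; after that, the presence of the additional marker prevents further runs.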

      For v14 there was no need for a MigrateToV14, but v14 still 'carries' the no longer needed MigrateToV13 logic, which will therefore now be removed rather than updated.

              People

              • Assignee:
                Unassigned
                Reporter:
                adouma Ate Douma
              • Votes:
                0
                Watchers:
                3
