[REPO-2196] Jackrabbit cluster sync of outdated NodeTypeRecord changes during startup may break (revert) non-trivial node type changes

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 13.0.0, 13.1.0
Fix Version/s: 13.2.0, 13.1.2
Component/s: None
Labels: None

Description

During the startup of an upgrade from brXM v12.* to v13.*, specific (deprecated) nodetypes and nodetype property or child definitions are removed by a MigrateToV13 process.
Jackrabbit normally does not allow such nodetype definition removals (they are non-trivial changes), but brXM can perform them internally (and only internally).

After the release of brXM v13, a potential complication was detected: under specific conditions (see below), the standard Jackrabbit cluster synchronization may 'replay' these nodetype changes (actually: the full nodetype definition at the time of the change) when another (outdated) cluster node is started.
Normally this would not be a problem, as long as there are only forward-compatible, trivial changes. The 'replay' of a single cluster-sync nodetype event might still fail (default Jackrabbit behavior), but such errors would only be logged, without consequences. With the fix in this issue, which makes sure that nodetype events from the journal table are ignored during the 'cluster-sync phase', these harmless errors are no longer logged either.

However, when there are non-trivial nodetype definition removals, the replay of an earlier (trivial or non-trivial) nodetype change may revert the later non-trivial removals. A nodetype change event record in the repository journal table contains the entire nodetype definition at the time of the event, and at the time of that earlier change the later removal had not been done yet. The earlier record therefore still contains the definitions that were removed later.
The earlier (trivial or non-trivial) change may then actually be applicable (and thus be executed), while the later non-trivial nodetype definition removals will fail, because during replay these events are executed by Jackrabbit itself and are not internally controlled and allowed through brXM.
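The revert described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not Jackrabbit or brXM code: `NodeTypeRecord`, `is_trivial`, `replay`, and the `ex:` names are hypothetical, a nodetype is reduced to a set of property names, and a change is treated as "trivial" when it only adds properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTypeRecord:
    """Hypothetical journal record: a full snapshot of one nodetype."""
    revision: int
    name: str
    properties: frozenset


def is_trivial(current, proposed):
    # Assumption for this sketch: a change is trivial if nothing is removed.
    return current <= proposed


def replay(registry, records):
    """Replay journal records in revision order, the way an outdated
    cluster node would; return the revisions that could not be applied."""
    failed = []
    for rec in sorted(records, key=lambda r: r.revision):
        current = registry.get(rec.name, frozenset())
        if is_trivial(current, rec.properties):
            registry[rec.name] = rec.properties  # full snapshot overwrites
        else:
            failed.append(rec.revision)  # removal rejected, only logged
    return failed


# Persisted state after the migration: ex:legacy has already been removed.
registry = {"ex:doc": frozenset({"ex:title"})}
journal = [
    # Earlier change, recorded while ex:legacy still existed.
    NodeTypeRecord(10, "ex:doc", frozenset({"ex:title", "ex:legacy"})),
    # Later removal of ex:legacy by the migration.
    NodeTypeRecord(11, "ex:doc", frozenset({"ex:title"})),
]
failed = replay(registry, journal)
```

In this model the record at revision 10 is accepted because, compared to the migrated definition, it only adds `ex:legacy`; the removal at revision 11 is then rejected, so the deprecated property stays registered.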

As a result, an upgrade from v12.* to the current v13.* versions may leave unintended half-baked or broken nodetype definitions, which may cause a failure during startup or (worse) problems or failures at a later stage at runtime.

The specific conditions under which this may happen are (all required):

a clustered set of brXM nodes is used, and
one brXM v12.* node has been upgraded to v13, and then
one or more 'existing' brXM cluster nodes are started which have an entry in the REPOSITORY_LOCAL_REVISIONS table, and
use an existing or 'manually' copied local Lucene index, and
have a LOCAL_REVISION_ID (in the REPOSITORY_LOCAL_REVISIONS table) which is older than the entry in REPOSITORY_GLOBAL_REVISIONS

Only under the above conditions will one of these additional brXM nodes perform a cluster node synchronization at startup, which may then lead to the problem reported above.
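As a rough aid, the combination of conditions can be written down as a single predicate. This is a minimal sketch, not a supported diagnostic: `ClusterNodeState` and `startup_sync_at_risk` are hypothetical names, the values have to be collected by hand (for example from the two revision tables), and it assumes that revision ids are increasing integers so that "older" means "lower".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterNodeState:
    """Hypothetical summary of one cluster node that is about to start."""
    clustered: bool                 # part of a clustered set of brXM nodes
    upgraded_node_exists: bool      # one node was already upgraded to v13
    local_revision: Optional[int]   # None: no entry in REPOSITORY_LOCAL_REVISIONS
    global_revision: int            # entry in REPOSITORY_GLOBAL_REVISIONS
    reuses_lucene_index: bool       # existing or manually copied local index


def startup_sync_at_risk(state):
    """True only when all listed conditions hold at the same time."""
    return (
        state.clustered
        and state.upgraded_node_exists
        and state.local_revision is not None
        and state.reuses_lucene_index
        and state.local_revision < state.global_revision
    )


outdated = ClusterNodeState(True, True, 1200, 1350, True)
fresh = ClusterNodeState(True, True, None, 1350, False)
print(startup_sync_at_risk(outdated), startup_sync_at_risk(fresh))
```

The revision numbers above are made up for illustration; a node without a local revision entry or without a reused index does not match the conditions.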

Note that even if an initial upgrade to v13 involving multiple cluster nodes was successful (escaping the above conditions), the problem may still occur if yet another existing brXM cluster node is started under the above conditions at a later stage!

It is also important to note that the above conditions do not apply when the Lucene Index Export Addon is used to provide an up-to-date index (created after the upgrade to v13) to these additional brXM cluster nodes.

To prevent this problem from occurring again, the logic of the Jackrabbit ClusterNode has been extended to ignore/skip nodetype change events during (and only during) the startup of a cluster node.
Those events were never needed anyway: the Jackrabbit NodeTypeRegistry, which manages the nodetype definitions, (by default) uses the database to persist the current definitions, and therefore does not need nodetype change events to be replayed during startup.
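The idea of the fix can be sketched as follows. This is an illustration only and does not reflect the actual Jackrabbit ClusterNode API: `ClusterNodeSketch`, its methods, and the `(kind, payload)` record shape are assumptions made for this example.

```python
class ClusterNodeSketch:
    """Hypothetical cluster node that skips nodetype records while starting."""

    def __init__(self):
        self.starting = True
        self.applied = []   # records that were processed
        self.skipped = []   # nodetype records ignored during startup

    def sync(self, records):
        for kind, payload in records:
            if kind == "nodetype" and self.starting:
                # Current definitions are loaded from the persisted
                # registry, so the journal snapshot is not replayed.
                self.skipped.append(payload)
                continue
            self.applied.append((kind, payload))

    def start(self, backlog):
        # Catch up on the journal backlog, then leave startup mode.
        self.sync(backlog)
        self.starting = False


node = ClusterNodeSketch()
node.start([("nodetype", "ex:doc"), ("change", "/content/a")])
node.sync([("nodetype", "ex:new")])  # after startup: handled as before
```

Only the backlog processed during startup is filtered; nodetype events that arrive once the node is running are still applied.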

To repair nodetype definitions that may already be half-baked or broken after an upgrade to v13, the original MigrateToV13 process, which normally would be executed only once, has been changed to run one additional time, thereby re-applying the intended nodetype changes for the upgrade to v13 where needed.

Issue Links

relates to

REPO-2305 Concurrent UpgradeToV13 execution in a clustered/cloud deployment may fail to complete or revert specific necessary changes

Closed

People

Assignee: Unassigned

Reporter: Ate Douma (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 13/May/19 9:04 PM

Updated: 24/Jan/20 4:25 PM

Resolved: 14/May/19 5:32 PM