[REPO-1529] Inefficient and error prone (wrt OOM) start up of cluster nodes that have a revision id but do not yet have a Lucene index - Issues

Details

Type: Bug
Status: Closed
Priority: Top
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Story Points:
2
Sprint:
Platform Sprint 135, Platform Sprint 136

Description

The problem is largely described in

https://issues.apache.org/jira/browse/JCR-3905

That issue, in turn, is partly a result of https://issues.apache.org/jira/browse/JCR-3162

This issue is also related to REPO-1353: REPO-1353 implemented a fix to avoid an OOM when a lot of revisions need to be processed. However, it is still very easy to get an OOM when starting a cluster node that has no index (yet) but already has a data revision id! Note that if the cluster node is completely new (that is, the REPOSITORY_LOCAL_REVISIONS table has no JOURNAL_ID entry for the new cluster node), the problem does not occur, because the value from REPOSITORY_GLOBAL_REVISION is used instead.

When a cluster node without an index starts up, the following happens:

1) org.apache.jackrabbit.core.RepositoryImpl#initStartupWorkspaces --> org.apache.jackrabbit.core.query.lucene.MultiIndex#createInitialIndex creates the initial index by traversing all child nodes below the 'root node'
2) After createInitialIndex, SearchIndex processes the pending journal changes by calling checkPendingJournalChanges(context):

index.createInitialIndex(context.getItemStateManager(),
        context.getRootId(), rootPath);
checkPendingJournalChanges(context);

checkPendingJournalChanges fetches the List<ChangeLogRecord> of all journal records after the cluster node's local revision. That revision can lag far behind, so the List<ChangeLogRecord> can become very large and cause an OOM. If it does not cause an OOM, it then removes from the newly created index every document for each ChangeLogRecord.
3) Next, RepositoryImpl#RepositoryImpl (org.apache.jackrabbit.core.RepositoryImpl#RepositoryImpl) starts the cluster node (org.apache.jackrabbit.core.cluster.ClusterNode#start). This triggers org.apache.jackrabbit.core.journal.AbstractJournal#sync with startup = true, which processes the same pending journal changes from (2) a second time: the documents that were removed from the index in (2) are now added again.
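The three steps above can be illustrated with a toy simulation. This is a hypothetical model, not Jackrabbit code: the index is a Set of node ids, the journal is a List where the position is the revision, and the helper names (buildInitialIndex, removePendingEagerly, replayOnStartupSync) are invented stand-ins for the steps they are commented with. It shows that steps (2) and (3) cancel each other out, while step (2) still materializes the whole backlog in memory.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the startup sequence; not Jackrabbit API. */
public class StartupSimulation {

    // Models step 1 (MultiIndex#createInitialIndex): index every node currently in the repository.
    static Set<String> buildInitialIndex(List<String> repositoryNodes) {
        return new HashSet<>(repositoryNodes);
    }

    // Models step 2 (checkPendingJournalChanges): the whole backlog after localRevision is
    // copied into one list before processing; with a far-behind revision this list gets huge.
    static int removePendingEagerly(Set<String> index, List<String> journal, int localRevision) {
        List<String> pending = new ArrayList<>(journal.subList(localRevision, journal.size()));
        for (String nodeId : pending) {
            index.remove(nodeId);
        }
        return pending.size();
    }

    // Models step 3 (AbstractJournal#sync with startup = true): the same records are
    // replayed and the documents removed in step 2 are added back.
    static int replayOnStartupSync(Set<String> index, List<String> journal, int localRevision) {
        int replayed = 0;
        for (int rev = localRevision; rev < journal.size(); rev++) {
            index.add(journal.get(rev));
            replayed++;
        }
        return replayed;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("a", "b", "c", "d");
        // One journal record per node; the list position is the revision (0..3).
        List<String> journal = List.of("a", "b", "c", "d");
        int localRevision = 1; // this node's stored revision lags behind the journal head

        Set<String> index = buildInitialIndex(nodes);
        int removed = removePendingEagerly(index, journal, localRevision);
        int readded = replayOnStartupSync(index, journal, localRevision);

        System.out.println("removed=" + removed + " readded=" + readded
                + " indexUnchanged=" + index.equals(new HashSet<>(nodes)));
        // prints: removed=3 readded=3 indexUnchanged=true
    }
}
```

With localRevision equal to the journal head (4 here), both helpers do no work; in this model the wasted work and the size of the eagerly built list grow with how far the stored revision lags.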

Sigh.....

Obviously, the fix should be fairly simple: when a new index is created, start from REPOSITORY_GLOBAL_REVISION instead of the cluster node's local revision!
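A minimal sketch of the proposed rule, assuming a hypothetical helper that receives the node's local revision (empty when REPOSITORY_LOCAL_REVISIONS has no row for its JOURNAL_ID), the REPOSITORY_GLOBAL_REVISION value, and a flag telling whether the index was just built from scratch. None of these names are Jackrabbit API; the sketch only encodes the decision described in this issue.

```java
import java.util.OptionalLong;

/** Hypothetical sketch of the proposed start-revision rule; not Jackrabbit code. */
public class StartRevisionSketch {

    /**
     * Returns the journal revision from which a starting cluster node should replay changes.
     *
     * @param localRevision  the node's row in REPOSITORY_LOCAL_REVISIONS, empty if it has none
     * @param globalRevision the value of REPOSITORY_GLOBAL_REVISION
     * @param indexIsNew     true if the Lucene index was just created at startup
     */
    static long startRevision(OptionalLong localRevision, long globalRevision, boolean indexIsNew) {
        if (!localRevision.isPresent()) {
            // Completely new cluster node: the global revision is already used in this case.
            return globalRevision;
        }
        if (indexIsNew) {
            // Proposed fix: a freshly built index already reflects the repository content,
            // so the (possibly far behind) local revision is ignored.
            return globalRevision;
        }
        // Existing node with an existing index: continue from where it left off.
        return localRevision.getAsLong();
    }

    public static void main(String[] args) {
        System.out.println(startRevision(OptionalLong.empty(), 5000L, true));   // prints 5000
        System.out.println(startRevision(OptionalLong.of(100L), 5000L, true));  // prints 5000
        System.out.println(startRevision(OptionalLong.of(100L), 5000L, false)); // prints 100
    }
}
```

One assumption to verify in a real fix: the global revision would likely need to be read before the initial index traversal starts, so that changes committed while the index is being built are still replayed afterwards.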

Issue Links

causes

REPO-1536 Drop patch for REPO-1408 which is obsolete since REPO-1529

Closed

relates to

REPO-1353 Out of memory error when adding a new cluster node due to large number of revisions to process

Closed

REPO-1407 Don't read full journal_table when starting with clean index

Closed

REPO-1408 Register new cluster node as up to date with the current global revision

Closed

REPO-1541 Release and deploy hippo jackrabbit patched version 2.10.1-h13

Closed

People

Assignee: Unassigned

Reporter: Ard Schrijvers

Votes: 0

Watchers: 2

Dates

Created: 21/Jul/16 12:59 PM

Updated: 06/Nov/17 9:12 AM

Resolved: 10/Aug/16 12:26 PM