  1. Hippo CMS
  2. CMS-15077

[Backport] Support consistent HST Model Loading

Details

    • Bug
    • Status: Closed
    • Normal
    • Resolution: Fixed
    • None
    • 14.7.7, 15.1.0, 15.0.1
    • None
    • None
    • Quasar
    • Team Quasar 286 Sprint

    Description

      Problem Description

      The HST loads its model in three steps:

      1. JCR Nodes are read into a lightweight, in-memory HstNode model
      2. From the HstNode model, an HstModel containing the Host/Mount configuration is built
      3. On request, a Channel Configuration (sitemap, components, menu, etc.) is loaded and appended to the HST Model

      The HST Model is an append-only model, and everything that has been created is immutable. Caching is used heavily:

      1. Upon JCR Node changes, only those parts of the in-memory HstNode model that can be affected are replaced
      2. On invalidation, Channel Configuration parts that are not impacted are not rebuilt but taken from a configuration cache

      The above strategy is implemented to have optimal HST Model loading performance.
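
      The caching strategy above can be sketched roughly as follows. This is a minimal illustration, not the HST implementation: ChannelConfig, ConfigCache and SketchModel are hypothetical names, and the "load" is a placeholder for the real, expensive channel configuration loading.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an immutable channel configuration.
final class ChannelConfig {
    final String channelId;

    ChannelConfig(String channelId) {
        this.channelId = channelId;
    }
}

// Hypothetical configuration cache that survives model rebuilds.
final class ConfigCache {
    private final Map<String, ChannelConfig> cache = new ConcurrentHashMap<>();
    final AtomicInteger loads = new AtomicInteger(); // counts "expensive" loads

    ChannelConfig getOrLoad(String channelId) {
        return cache.computeIfAbsent(channelId, id -> {
            loads.incrementAndGet();
            return new ChannelConfig(id);
        });
    }

    // Evict only the channel whose configuration was affected by a change.
    void evict(String channelId) {
        cache.remove(channelId);
    }
}

// Hypothetical append-only model: entries are added on request, never replaced.
final class SketchModel {
    private final ConfigCache sharedCache;
    private final Map<String, ChannelConfig> channels = new ConcurrentHashMap<>();

    SketchModel(ConfigCache sharedCache) {
        this.sharedCache = sharedCache;
    }

    ChannelConfig channel(String channelId) {
        return channels.computeIfAbsent(channelId, sharedCache::getOrLoad);
    }
}
```

      In this sketch, a rebuilt model (a new SketchModel) reuses every cached channel configuration that was not evicted, so only the impacted channel is loaded again.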

      Cache Invalidation Complexity

      When a change gets persisted below /hst:hst, the HstConfigurationEventListener#onEvent gets triggered which invalidates the HstModel:

      @Override
      public void onEvent(EventIterator events) {
          synchronized(hstModelImpl) {
              counter.incrementAndGet();
              hstEventsCollector.collect(events);
              if (hstEventsCollector.hasEvents()) {
                  hstModelImpl.invalidate();
              }
          }
      }
      

      The hstEventsCollector collects all the events the HST must process for fine-grained invalidation of the HstNode model and the HST configuration objects. The invalidation synchronizes on the 'hstModelImpl' object because during an HstModel build, neither the model nor the HstNodes / Configuration objects are allowed to be invalidated. Once the HstModel is built, the 'append only' model gets tied to the current HTTP request, making sure that the request keeps using that instance of the HstModel. After the model is built, the above invalidation is allowed to proceed and invalidate the model, such that on the next request the HstModel gets rebuilt.
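
      The interplay between building, request pinning and invalidation can be sketched as below. ModelHolder and PinnedModel are hypothetical names used only for illustration; the real HST classes carry far more state.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical immutable model snapshot; 'generation' only identifies the build.
final class PinnedModel {
    final int generation;

    PinnedModel(int generation) {
        this.generation = generation;
    }
}

// Hypothetical holder: building and invalidating use the same lock,
// so an invalidation cannot interleave with a build in progress.
final class ModelHolder {
    private final Object lock = new Object();
    private final AtomicInteger builds = new AtomicInteger();
    private PinnedModel current; // guarded by lock

    PinnedModel getModel() {
        synchronized (lock) {
            if (current == null) {
                current = new PinnedModel(builds.incrementAndGet());
            }
            return current;
        }
    }

    void invalidate() {
        synchronized (lock) {
            current = null; // the next getModel() call rebuilds
        }
    }
}
```

      A request that obtained a PinnedModel before invalidate() keeps using that same immutable instance; only a later call to getModel() sees a rebuilt model.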

      However, there is a problem here. The HstConfigurationEventListener is an asynchronous JCR EventListener, meaning it is notified by a separate thread, different from the thread that persisted the JCR changes. As a result, events can (and will) arrive with a delay. The delay is normally short, but it is unpredictable and can be longer under some circumstances. For the sake of argument, assume that an hst:mount node gets deleted and replaced. This happens all the time during mvn integration tests, which typically restore a backup of /hst:hst at the end of a test, so the next test gets a restored /hst:hst configuration with changed JCR UUIDs. Also assume that a test runs (this can happen in production as well, though it is very unlikely) which takes the JCR UUID from the Mount object in the HST Model. The following actions are then performed, e.g. in a test, in this order:

      1. During test tear-down, an hst:mount is replaced below /hst:hst and the change is persisted
      2. A JCR event arrives in HstConfigurationEventListener#onEvent, the event is collected, and the current HstModel instance is invalidated
      3. An HST integration test method fetches the HstModel and gets the Mount object backed by the hst:mount from step 1
      4. The test does a 'resty' interaction on the 'canonical identifier' of the Mount object (typically the UUID of the hst:mount JCR Node that backs the Mount object). The 'resty' interaction looks up the JCR Node by the 'canonical identifier' of the Mount object and performs some actions on it

      The hst:mount in the above example can equally be an hst:sitemapitem, hst:component, etc. The logic of the steps stays the same.

      Steps 1-4 make the problem straightforward to see when the events of step (2) arrive after step (3) has completed. In that case, step (3) gets the old HstModel, with the old 'canonical identifier' for the mount. That old identifier can no longer be matched to a JCR Node in step (4), because the node no longer exists. When a thread sleep is added to the asynchronous listener, many HST mvn integration tests start to fail.
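
      The race can be reproduced deterministically in a small simulation in which event delivery is triggered by hand instead of by an observation thread. FakeRepository, StaleModelCache, the "uuid-N" identifiers and the path string are all hypothetical; no JCR API is used.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model cache that remembers the mount identifier until invalidated.
final class StaleModelCache {
    private final FakeRepository repo;
    private String cachedMountUuid;

    StaleModelCache(FakeRepository repo) {
        this.repo = repo;
    }

    // Step 3: fetch the model and read the mount's canonical identifier.
    String mountIdentifier() {
        if (cachedMountUuid == null) {
            cachedMountUuid = repo.currentMountUuid();
        }
        return cachedMountUuid;
    }

    void invalidate() {
        cachedMountUuid = null;
    }
}

// Hypothetical repository with delayed (manually delivered) events.
final class FakeRepository {
    private final Map<String, String> pathByUuid = new HashMap<>();
    private final Queue<String> pendingEvents = new ArrayDeque<>();
    private int uuidCounter;
    private String mountUuid;

    FakeRepository() {
        replaceMount();
        pendingEvents.clear(); // initial state: nothing to deliver
    }

    // Step 1: delete and re-create the mount; the event is queued, not delivered.
    void replaceMount() {
        if (mountUuid != null) {
            pathByUuid.remove(mountUuid);
        }
        mountUuid = "uuid-" + (++uuidCounter);
        pathByUuid.put(mountUuid, "/hst:hst/hypothetical-mount");
        pendingEvents.add("/hst:hst/hypothetical-mount");
    }

    String currentMountUuid() {
        return mountUuid;
    }

    // Step 4: lookup by canonical identifier.
    boolean nodeExists(String uuid) {
        return pathByUuid.containsKey(uuid);
    }

    // Step 2: asynchronous delivery, happening whenever the listener thread runs.
    void deliverPendingEvents(StaleModelCache cache) {
        if (!pendingEvents.isEmpty()) {
            pendingEvents.clear();
            cache.invalidate();
        }
    }
}
```

      Calling mountIdentifier() after replaceMount() but before deliverPendingEvents(...) returns an identifier for which nodeExists(...) is false, which is exactly the failure described for step (4).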

      Poor man's workaround

      The HST integration tests have had two poor man's workarounds:

      1. After making changes below /hst:hst, add a fairly arbitrary heuristic thread sleep of 100ms to make it very likely that the async HstConfigurationEventListener#onEvent has been triggered
      2. Programmatically trigger invalidation of the HstModel via EventPathsInvalidator, specifying the events for the event collector

      However, the heuristic of (1) sometimes fails on Jenkins (flakiness). And apart from that, it is very obscure and hard to understand for others when or why to include a thread sleep.

      Solution

      Failed Approach

      Jackrabbit supports a marker interface, SynchronousEventListener, which, if implemented, makes the listener synchronous: in other words, the thread persisting the changes also triggers HstConfigurationEventListener#onEvent. This very simple approach, however, results in one undesirable problem and one very serious problem: a livelock or deadlock (it was unclear which).

      Problems:

      1. Making the HstConfigurationEventListener a synchronous listener means that completing a persist synchronously triggers HstConfigurationEventListener#onEvent, which in its body has
        synchronized(hstModelImpl) 
        

        As a result, a persist can block on another thread that holds the lock on hstModelImpl; a thread holds this lock while it is loading the HstModel. If loading the HstModel takes some time, that persist (and all other persists) are blocked until loading finishes

      2. The HstModel is loaded via HstNodes, which are loaded from JCR Nodes with an HST configuration user. For reasons that remain unclear, making the HstConfigurationEventListener a synchronous listener appeared to cause a livelock or deadlock that blocked both the HstModel loading and the HstConfigurationEventListener from proceeding
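
      Problem 1 can be illustrated with the sketch below; it does not attempt to reproduce the unexplained livelock or deadlock of problem 2. SyncListenerSketch is a hypothetical name, and a ReentrantLock with tryLock() stands in for the synchronized(hstModelImpl) block so that the demonstration reports contention instead of blocking.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

final class SyncListenerSketch {
    // Stand-in for the monitor of hstModelImpl.
    final ReentrantLock modelLock = new ReentrantLock();

    // Body of a synchronous listener: needs the model lock to invalidate.
    // Returns whether the lock was available immediately.
    boolean listenerCouldProceed() {
        if (modelLock.tryLock()) {
            try {
                return true;
            } finally {
                modelLock.unlock();
            }
        }
        return false;
    }

    // Simulates a persist on another thread, which runs the listener inline.
    boolean persistFromOtherThread() {
        AtomicBoolean proceeded = new AtomicBoolean();
        Thread persister = new Thread(() -> proceeded.set(listenerCouldProceed()));
        persister.start();
        try {
            persister.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return proceeded.get();
    }
}
```

      While one thread holds modelLock (as a model build would), persistFromOtherThread() reports that the listener could not proceed; with a real synchronized block the persisting thread would simply wait for the build to finish.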

      Successful Approach

      Making the HstConfigurationEventListener synchronous results in hard-to-untangle livelocks or deadlocks, which I could not figure out from thread dumps. Apart from this severe issue, there is the undesirable potential latency in saving. There is also the rule of thumb that a listener should be lightweight and return as quickly as possible, which certainly holds for synchronous listeners. An approach that works extremely well, and that lets all HST mvn integration tests run without a thread sleep for model changes, is as follows:

      1. Introduce an extra separate Synchronous EventListener on /hst:hst which looks as follows:
        // Synchronous listener: invoked on the thread that persists the change
        private class SynchronousOnEventCounter implements EventListener, SynchronousEventListener {

            private final AtomicLong counter = new AtomicLong();

            @Override
            public void onEvent(final EventIterator events) {
                counter.incrementAndGet();
            }

            public long getOnEventCount() {
                return counter.get();
            }
        }
        

        All this listener does is count how many times #onEvent is invoked.

      2. Also add a counter to the existing HstConfigurationEventListener, which gets incremented in onEvent
      3. Add a new field, buildNumber, to the in-memory VirtualHostsService (HstModel), which is equal to the HstConfigurationEventListener onEvent counter

      With the above ingredients, we can be sure that the HstModel is invalidated in time, even when the async HstConfigurationEventListener has not received the event yet. This is done by the following logic:

      If there is a non-null HstModel and the buildNumber of that model is equal to invalidationMonitor.getSynchronousOnEventsCounter(), then that model is valid. Previously, we always returned the HstModel if it was not null. Now we also check against the synchronous event listener that the model really does not lag behind. If it lags behind, we know that the VirtualHostsService should have been invalidated, but the async HstConfigurationEventListener has simply not been notified yet. In this state we must WAIT until the onEvent counter of the async HstConfigurationEventListener becomes equal to the onEvent counter of the synchronous listener AND, at the same time, not synchronize on the HstModel, as this would result in a deadlock.

      As a heuristic, we wait at most 1 second for the async listener to catch up with the sync listener. If that doesn't happen within one second, we log an error and continue with the normal HstModel building. The error logging is mostly for feedback and is not expected to occur.
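
      The validity check and bounded wait described above could look roughly like the following sketch. It is an illustration under assumptions, not the actual fix: InvalidationMonitorSketch, VersionedModel, VersionedModelProvider and the polling loop are hypothetical, and the timeout is a constructor parameter so that it can be shortened in tests (the description uses 1 second).

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical monitor holding both onEvent counters.
final class InvalidationMonitorSketch {
    final AtomicLong syncCounter = new AtomicLong();  // bumped on the persisting thread
    final AtomicLong asyncCounter = new AtomicLong(); // bumped by the async listener

    void onSynchronousEvent() { syncCounter.incrementAndGet(); }

    void onAsynchronousEvent() { asyncCounter.incrementAndGet(); }

    // Polls, without holding any model lock, until the async counter has
    // caught up or the timeout expires. Returns true when caught up.
    boolean awaitAsyncCaughtUp(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (asyncCounter.get() < syncCounter.get()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}

// Hypothetical model carrying the async counter value it was built at.
final class VersionedModel {
    final long buildNumber;

    VersionedModel(long buildNumber) { this.buildNumber = buildNumber; }
}

final class VersionedModelProvider {
    private final InvalidationMonitorSketch monitor;
    private final long timeoutMillis;
    private volatile VersionedModel model;
    final AtomicInteger timeouts = new AtomicInteger(); // stands in for error logging

    VersionedModelProvider(InvalidationMonitorSketch monitor, long timeoutMillis) {
        this.monitor = monitor;
        this.timeoutMillis = timeoutMillis;
    }

    VersionedModel getModel() {
        VersionedModel m = model;
        // Valid only if the model does not lag behind the synchronous counter.
        if (m != null && m.buildNumber == monitor.syncCounter.get()) {
            return m;
        }
        // Wait OUTSIDE the model lock, so the async listener is never blocked by us.
        if (!monitor.awaitAsyncCaughtUp(timeoutMillis)) {
            timeouts.incrementAndGet(); // real code would log an error here
        }
        synchronized (this) {
            long asyncCount = monitor.asyncCounter.get();
            VersionedModel latest = model;
            if (latest == null || latest.buildNumber != asyncCount) {
                latest = new VersionedModel(asyncCount);
                model = latest;
            }
            return latest;
        }
    }
}
```

      The point of the design is that the cheap synchronous counter only signals that a change exists, while all real invalidation work stays in the asynchronous listener; the reader of the model bridges the gap by waiting briefly without holding the model lock.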

          People

            aschrijvers Ard Schrijvers