Wednesday, September 2, 2009

Most wanted features in Ehcache

At our company, we finally decided to move forward and adopt Ehcache as our caching provider. The integration has presented me with some great challenges, and it forced me to learn quite a bit about Ehcache even before we made the decision.

Ehcache, as an API, is well tested and perfectly capable of holding up under heavy load. What's troubling me now is monitoring the caching activity that goes on under the hood in Ehcache. In most normal cases we don't have to worry about what's happening inside Ehcache, but there are times when we want to know very specific things, such as:

  • How far has the replication reached?
  • Have all the servers (copies of Ehcache) in the cluster acknowledged the replication event successfully?
  • How do we identify replication failures?
Ehcache currently doesn't support event handling at this level of granularity. Thanks to its extensible architecture, we can still write our own implementation classes to introduce it ourselves. But, as you know, in the world of open source every piece of code you write is worthless if some super-smart guy has already done it before.
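Just to give an idea of where such hooks would plug in, here is a rough sketch of a custom listener built on the standard Ehcache 1.x event API (net.sf.ehcache.event.CacheEventListener). It only logs local cache events; the per-node replication acknowledgements I'm really after are not exposed through this interface, so treat it as a starting point rather than a solution.

    import net.sf.ehcache.CacheException;
    import net.sf.ehcache.Ehcache;
    import net.sf.ehcache.Element;
    import net.sf.ehcache.event.CacheEventListener;

    // Sketch only: logs cache events so that replication-relevant activity
    // (puts, updates, removes) can at least be observed on each node.
    public class ReplicationMonitoringListener implements CacheEventListener {

        public void notifyElementPut(Ehcache cache, Element element) throws CacheException {
            log("PUT", cache, element);
        }

        public void notifyElementUpdated(Ehcache cache, Element element) throws CacheException {
            log("UPDATE", cache, element);
        }

        public void notifyElementRemoved(Ehcache cache, Element element) throws CacheException {
            log("REMOVE", cache, element);
        }

        public void notifyElementExpired(Ehcache cache, Element element) {
            // Expiry and eviction are local events and are normally not replicated.
        }

        public void notifyElementEvicted(Ehcache cache, Element element) {
        }

        public void notifyRemoveAll(Ehcache cache) {
            System.out.println("REMOVE ALL on cache " + cache.getName());
        }

        public void dispose() {
        }

        public Object clone() throws CloneNotSupportedException {
            return super.clone();
        }

        private void log(String operation, Ehcache cache, Element element) {
            System.out.println(operation + " key=" + element.getObjectKey()
                    + " on cache " + cache.getName());
        }
    }

To activate it, the listener would be attached to a cache in ehcache.xml through a cacheEventListenerFactory entry pointing at a small factory class that extends net.sf.ehcache.event.CacheEventListenerFactory.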

One other thing I'm stuck on is maintaining coherency between the database and the cache.

The caching solution we're looking at requires the database to be the 'master' of all the caches: we must make sure the database always has the latest information, and at the same time up-to-date data must be available to all the runtime components that take hits from users in real time.

  • At any given time, how do I know whether the data cached in Ehcache is consistent with the database?
Has anyone encountered this kind of situation before? How were you able to handle it?

3 comments:

Alex Miller said...

As you may or may not know, Terracotta recently acquired the IP/copyright for Ehcache and hired Ehcache guru Greg Luck.

Regarding coherency, Terracotta clustered Ehcache is coherent across nodes in the cluster (in contrast to replicated Ehcache with RMI, etc.). There is integration with Terracotta already (see http://bit.ly/2nwGbQ for details), but we are currently hard at work bringing even simpler and better native integration to the Ehcache project, to be released this fall.

With respect to the database consistency issue, I'm not sure what more you're looking for exactly. If all of your nodes write through the cache AND the cache is coherent, I think that answers your question. If some external force is updating the db, then there is always the chance of stale data.
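To make the write-through point concrete, here is a minimal sketch. The cache calls are the plain Ehcache 1.x API; User and UserDao are just placeholder application classes. Every write goes to the database first and to the cache second, so the database stays the master and reads served from the cache reflect the last write that went through this path.

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    // Sketch of a write-through repository: database first, cache second.
    public class UserRepository {

        private final Cache cache;
        private final UserDao userDao;

        public UserRepository(CacheManager cacheManager, UserDao userDao) {
            this.cache = cacheManager.getCache("users");
            this.userDao = userDao;
        }

        public void save(User user) {
            userDao.update(user);                        // the database stays the master
            cache.put(new Element(user.getId(), user));  // then refresh the cache
        }

        public User findById(Long id) {
            Element element = cache.get(id);
            if (element != null) {
                return (User) element.getObjectValue();
            }
            User user = userDao.findById(id);            // cache miss: read through
            if (user != null) {
                cache.put(new Element(id, user));
            }
            return user;
        }
    }

    // Placeholder DAO and domain object, assumed for the sketch.
    interface UserDao {
        void update(User user);
        User findById(Long id);
    }

    class User implements java.io.Serializable {
        private final Long id;
        User(Long id) { this.id = id; }
        Long getId() { return id; }
    }

If some other process updates the database directly, bypassing this path, the cache will not see it — that's the stale-data case I mentioned.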

bloggingDude said...

Thanks Alex for your comment. Of course, I'm aware that Terracotta has acquired Ehcache. I'm definitely looking out for the new release with better native Terracotta integration.

Regarding your question, I was actually looking at a negative scenario in which one of the servers goes offline from the cluster and rejoins after some time. This server might have missed some updates made through Ehcache. My objective is to identify whether this server is really out of sync or not.

For example, if there were no updates in the cluster throughout this downtime, the server doesn't have to worry about syncing with the database when it comes back online. If I can identify this, I'd save a significant amount of time by avoiding the sync process.
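Something along these lines is what I have in mind. Since the database is the master anyway, a single marker row could record when the cluster last changed, and each node remembers the last marker it applied; a rejoining node only runs the full sync when the marker moved while it was away. MetaDao and the marker row are just placeholders, not anything Ehcache provides.

    // Sketch of a "did I miss anything?" check driven from the database (the master).
    public class ResyncCheck {

        private final MetaDao metaDao;

        public ResyncCheck(MetaDao metaDao) {
            this.metaDao = metaDao;
        }

        // Called in the same transaction as every business update,
        // so the marker always reflects the latest cluster-wide change.
        public void recordClusterUpdate() {
            metaDao.updateLastClusterUpdate(System.currentTimeMillis());
        }

        // Called when a node comes back online; the argument is the marker
        // value this node had stored locally before it went offline.
        public boolean needsResync(long lastMarkerSeenByThisNode) {
            long lastClusterUpdate = metaDao.loadLastClusterUpdate();
            return lastClusterUpdate > lastMarkerSeenByThisNode;
        }
    }

    // Placeholder DAO for a single last_cluster_update marker row.
    interface MetaDao {
        void updateLastClusterUpdate(long timestampMillis);
        long loadLastClusterUpdate();
    }

Clock skew between machines would matter if wall-clock timestamps were compared across nodes, so a monotonically increasing version number in the marker row would probably be safer than System.currentTimeMillis().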

Alex Miller said...

That's a good question and I'm afraid I am not enough of an expert yet in replicated Ehcache to know the answer.

For Terracotta, the cache is coherent, which means there are a lot of rules about what happens when a server "goes offline". I'm inferring from what you say that you really mean "becomes disconnected from the rest of the cluster" and not "the server process dies and restarts".

In the disconnected case, from the point of view of that server, it is not allowed to continue performing operations if it cannot verify whether its data is current. The server will at that point wait (according to a number of reconnection parameters) until the connection has been re-established. It's not uncommon to then use cluster events to detect this situation and have the node kill itself. The process is then restarted and attempts to rejoin as a new node.
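As a very rough sketch of that last pattern: ClusterMembershipListener below is just a stand-in interface for illustration, not the actual cluster events API. The point is only that a node which learns it has been cut off stops taking traffic and exits, so an external supervisor can restart the process and let it rejoin as a fresh node.

    // Illustration only: an assumed listener interface standing in for real cluster events.
    interface ClusterMembershipListener {
        void thisNodeDisconnected();
    }

    public class SelfTerminatingClusterListener implements ClusterMembershipListener {

        public void thisNodeDisconnected() {
            // Stop serving requests, then exit; a process supervisor (wrapper
            // script, init daemon, etc.) restarts the JVM, which rejoins the
            // cluster as a brand new node instead of a stale one.
            System.exit(1);
        }
    }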