Efficient caching in Spring Batch applications

This article references two blog posts on the TopTal blog:

Motivation

There are a lot of resources on batch processing in Spring and also on cache support in Spring. This post explores how these two areas combine.

All examples are derived from a real-world use case at a reinsurance company. The company maintains a huge proprietary data store. We developed batch jobs which transformed source data into its derivations. These derivations were mostly used for more efficient reporting, and reporting departments depended on our work. Once a transformation was specified and ordered, our batch jobs were expected to provide results within hours, ideally on the same working day.

The most important strategy we used was caching. We performed various operations on input data. Some of them were very cheap and simple calculations. Others, however, were expensive lookups in various dictionaries, mostly in a database. The database was very well optimized and was able to perform a lookup on the order of a few milliseconds or tenths of a millisecond. However, we processed billions of records, and this could lead to execution times of hours, days or weeks.

It was only natural to avoid repeated lookups by employing caching. Let's explore the problems and strategies we encountered.

Differences from online services

Before we proceed, we need to be clear about what we expect from caching.

When developing an online service, users are generally very satisfied with response times of e.g. 50 ms. If the result depends on an expensive operation which takes e.g. 500 ms to load, then we cache it. We also do not expect the cache to grow too much. We use whatever cache comes out of the box with the framework.

However, in batch applications, we often process billions of records. We generally need to process a record in a fraction of a millisecond. This means that the lookup in the cache itself must be just as fast. Due to the volume of processed data, the cache grows fast and needs to be limited. Usually, the cache that comes out of the box does not have these properties.

Example batch job

Our example batch job will calculate a checksum of rows of pseudorandom words. For every integer (called a “word number”), we run a simple deterministic algorithm which creates a seemingly random word:

  2277 -> popczbuiyyzhndkxynrqohumbzcyndtdtkievvtqruigkptopwuknzjeb
248053 -> jylqesjiwatntiaozbsifogqykayrsavbgkxjycdxuvlceizlgstmclyh
668659 -> cpkltqvxeubxczkkeluwmtcohzbjbgtiobghkbwiwfudkahqdmdvmlocl

These words are generated by a WordGenerator and the operation represents an expensive lookup in a dictionary:
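The actual WordGenerator is in the linked GitHub repository. As a minimal sketch of the idea, the word can be derived deterministically from a seeded random generator; the generation itself stands in for the expensive dictionary lookup (class and method names here are assumptions):

import java.util.Random;

public class WordGenerator {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private static final int WORD_LENGTH = 57;

    // Deterministically derives a seemingly random word from a word number.
    // In the real system this call represents an expensive dictionary lookup.
    public String getWord(long wordNumber) {
        Random random = new Random(wordNumber);   // same seed -> same word
        StringBuilder word = new StringBuilder(WORD_LENGTH);
        for (int i = 0; i < WORD_LENGTH; i++) {
            word.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return word.toString();
    }
}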

Next, we will generate pseudorandom rows of words. These represent the flat files which contained source data in our batch processes. Our example rows have fixed cardinalities. The first column contains only word numbers ranging from 0 to 10. The second column ranges from 0 to 100. The N-th column ranges from 0 to the N-th power of 10. This setup is similar to the flat files we encountered in real production. Some columns contained only a few values (e.g. years from 1960 to 2018). Other columns contained a broad range of values (e.g. bank transaction IDs):
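A sketch of how such rows can be generated (the class name is illustrative; column n, counted from 1, ranges from 0 to the n-th power of 10):

import java.util.Random;

public class RowGenerator {

    private static final int COLUMNS = 6;
    private final Random random = new Random();

    // Produces one row of word numbers with fixed per-column cardinalities.
    public long[] nextRow() {
        long[] row = new long[COLUMNS];
        long bound = 10;
        for (int column = 0; column < COLUMNS; column++) {
            row[column] = (long) (random.nextDouble() * bound);
            bound *= 10;
        }
        return row;
    }
}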

Here is an example of such a flat file:

[1, 65, 961, 5313, 98361, 180915]
[9, 57, 406, 8290, 69353, 333818]
[4, 45, 225, 2156, 73740, 467670]
[8, 29, 769, 558, 44227, 582468]
[3, 69, 587, 4424, 48614, 199967]
[1, 60, 33, 7401, 19605, 352871]
[6, 48, 851, 1267, 40345, 486722]
[0, 32, 43, 9670, 94479, 601521]
[5, 72, 214, 3535, 15218, 219020]
[3, 64, 307, 6513, 69858, 371924]
[9, 52, 478, 378, 90597, 505775]
[2, 88, 670, 5133, 44732, 104221]
[8, 76, 488, 8999, 65471, 238073]

Our batch process will calculate the sum of the hash codes of all words (not of the word numbers).

Implementation in Spring Batch

The source code of the implementation is located on GitHub.

The configuration of Spring Batch is very simple and consists of only a single type of step:
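A chunk-oriented step wiring the reader, processor and writer together might look roughly like this (bean names and the chunk size are assumptions; the real configuration is in the repository):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChecksumStepConfiguration {

    @Bean
    public Step plainStep(StepBuilderFactory stepBuilderFactory,
                          ItemReader<long[]> rowItemReader,
                          ItemProcessor<long[], Long> checksumProcessor,
                          ItemWriter<Long> checkSumWriter,
                          StepExecutionListener resultListener) {
        return stepBuilderFactory.get("Plain")
                .<long[], Long>chunk(1000)          // chunk size is an assumption
                .reader(rowItemReader)
                .processor(checksumProcessor)
                .writer(checkSumWriter)
                .listener(resultListener)
                .build();
    }
}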

RowItemReader simply generates the flat file with word numbers.

ChecksumProcessor converts every word number from the row into a word and calculates the sum of all word hash codes:
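A minimal sketch of such a processor; the WordGenerator field is the word source which the later steps swap for cached variants:

import org.springframework.batch.item.ItemProcessor;

public class ChecksumProcessor implements ItemProcessor<long[], Long> {

    private final WordGenerator wordGenerator;

    public ChecksumProcessor(WordGenerator wordGenerator) {
        this.wordGenerator = wordGenerator;
    }

    @Override
    public Long process(long[] row) {
        long checksum = 0;
        for (long wordNumber : row) {
            // The expensive part: word number -> word, then hash the word.
            checksum += wordGenerator.getWord(wordNumber).hashCode();
        }
        return checksum;
    }
}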

CheckSumWriter adds up the checksums from one chunk into what will become the checksum of the whole flat file:
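A sketch of the writer, which simply accumulates the per-row checksums into a running total:

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.batch.item.ItemWriter;

public class CheckSumWriter implements ItemWriter<Long> {

    private final AtomicLong checksum = new AtomicLong();

    @Override
    public void write(List<? extends Long> chunk) {
        // Add the checksums of one chunk to the checksum of the whole flat file.
        chunk.forEach(checksum::addAndGet);
    }

    public long getChecksum() {
        return checksum.get();
    }
}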

Finally, a StepExecutionListener collects and prints the results and statistics of the step.

To compare various strategies, our job will process the same flat file 8 times. This is represented by 8 steps, each of them configured with a different variation of ChecksumProcessor:
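The job then simply chains the step variants, each built like the step above but with a different word source or processor (a shortened sketch; only three variants are shown and the bean names are assumptions):

// Inside the same @Configuration class as the step definitions.
@Bean
public Job checksumJob(JobBuilderFactory jobBuilderFactory,
                       Step plainStep,
                       Step springCachedStep,
                       Step hashMapStep) {
    return jobBuilderFactory.get("checksumJob")
            .start(plainStep)
            .next(springCachedStep)
            .next(hashMapStep)
            // ... .next(...) for the remaining cached variants
            .build();
}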

Results

First, we will execute all steps in a Java 8 JVM with 4 GB of heap space.

Plain

This step uses no caching at all. It gives us a reference speed of how fast we can process records without any optimization.

Executing step: [Plain]                                                   
Finished step. Word source: Plain generator. Processor: Simple processor  
Speed: 67474.11 rows/sec                                                  

Spring cached

This step uses a dedicated bean managed by Spring. This bean delegates calls to a plain word generator, but with the invocation cached via the @Cacheable annotation:
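A sketch of such a delegating bean (the cache name is an assumption; caching also has to be switched on with @EnableCaching in a configuration class):

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Component;

@Component
public class SpringCachedWordSource {

    private final WordGenerator wordGenerator;

    public SpringCachedWordSource(WordGenerator wordGenerator) {
        this.wordGenerator = wordGenerator;
    }

    // Spring intercepts this call and serves repeated word numbers from the "words" cache.
    @Cacheable("words")
    public String getWord(long wordNumber) {
        return wordGenerator.getWord(wordNumber);
    }
}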

Results:

Executing step: [Spring cached]                                          
Finished step. Word source: Cached by Spring. Processor: Simple processor
Word source statistics: No statistics                                    
Speed: 83101.34 rows/sec                                                 

We can see only a slight increase in performance. This is due to the fact that we did not configure this cache at all. First of all, you really need to set up a suitable cache size. All caches in the following steps will use a maximum size of 200,000 entries; the default Spring cache might be too small. It is necessary to gain access to the CacheManager and configure the cache there. The CacheManager also provides statistics on how the cache performed.
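One way to do this, sketched here with a Caffeine-backed CacheManager (this particular backend and the cache name are assumptions, not necessarily what the measured step used):

import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.CacheManager;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;

// Inside an @EnableCaching @Configuration class.
@Bean
public CacheManager cacheManager() {
    CaffeineCacheManager cacheManager = new CaffeineCacheManager("words");
    cacheManager.setCaffeine(Caffeine.newBuilder()
            .maximumSize(200_000)   // the limit used by the following steps
            .recordStats());        // exposes hit/miss statistics
    return cacheManager;
}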

There are other things to look out for. For example, our real-world setup was using a version of Spring which created JDK proxies to enable the caching annotations. Time measurements and profiling sessions confirmed that a considerable amount of time was being spent in the JDK proxies. The Spring version used in our example uses CGLIB. This should enable the JVM's JIT compiler to inline the caching advice more effectively.

Another problem with the default caches was key generation. A key consisted of all parameters of the cached method and was constructed via reflection. This was another source of delay.

We can only guess why the default cache did not enable better performance. In a future article, we will explore how Spring's caching can be configured for higher performance.

HashMap

This step uses the JRE's HashMap. The size of the map is unlimited:
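A sketch of the manual approach (hit/miss counters are omitted for brevity; the class name is an assumption):

import java.util.HashMap;
import java.util.Map;

public class HashMapCachedWordSource {

    private final WordGenerator wordGenerator;
    private final Map<Long, String> cache = new HashMap<>();

    public HashMapCachedWordSource(WordGenerator wordGenerator) {
        this.wordGenerator = wordGenerator;
    }

    // Direct map access with no proxy in between, so the JIT can inline the whole call.
    public String getWord(long wordNumber) {
        return cache.computeIfAbsent(wordNumber, wordGenerator::getWord);
    }
}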

Executing step: [HashMap]                                                         
Finished step. Word source: Cached manually - HashMap. Processor: Simple processor
Word source statistics: Hit count: 11154589, miss count: 845411                   
Speed: 290824.49 rows/sec                                                         

We see a huge improvement in speed. The processor accesses the map directly, without any proxies. This enables the JVM's JIT compiler to fully inline access to the cache.

However, if we ran this step with the JVM option -Xmx128m, we would soon run into an OutOfMemoryError. And this is simply not an option. We observed that an OOME in production would bring down the whole infrastructure. Many components of the batch execution framework simply could not recover. We had to restart our JVMs and middleware services. This required a lot of bureaucracy and a lot of waiting time.

For this reason, all caches had to be limited in size.

LinkedHashMap, Guava, EhCache

We measured three caches with a maximum size fixed at 200,000 elements.
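The limited LinkedHashMap relies on the removeEldestEntry hook; a sketch, assuming simple LRU semantics:

import java.util.LinkedHashMap;
import java.util.Map;

public class LimitedLinkedHashMap<K, V> extends LinkedHashMap<K, V> {

    private final int maxSize;

    public LimitedLinkedHashMap(int maxSize) {
        super(16, 0.75f, true);   // accessOrder = true turns the map into an LRU cache
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the limit is exceeded.
        return size() > maxSize;
    }
}

The Guava and EhCache variants were configured with the same limit of 200,000 entries. Here are the results: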

Executing step: [LinkedHashMap]                                                                
Finished step. Word source: Cached manually - LimitedLinkedHashMap. Processor: Simple processor
Word source statistics: Hit count: 10189804, miss count: 1810196                               
Speed: 184842.88 rows/sec                                                                      
Executing step: [Guava]
Finished step. Word source: Cached manually - Guava. Processor: Simple processor
Word source statistics: CacheStats{hitCount=10190217, missCount=1809783}
Speed: 135722.04 rows/sec
Executing step: [EhCache]                                                        
Finished step. Word source: Cached manually - EhCache. Processor: Simple processor
Word source statistics: Hits: 10297676 Misses: 1702324                           
Speed: 117240.17 rows/sec                                                        

All of them obviously performed worse than the unlimited map. It would be interesting to see why Guava and EhCache performed worse than LinkedHashMap.

LinkedHashMap(smart)

You might have already observed that the function (word number) -> word is not the ideal candidate for caching. After all, we are not interested in the word itself. We only need its hash code. In light of this, this step uses a LinkedHashMap, but with a capacity 20 times larger than the other caches. We can afford this size because, instead of long strings, we cache only simple integers.
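A sketch of this variant, reusing the LimitedLinkedHashMap from above: the cache sits inside the processor and maps a word number directly to the hash code of its word, so both keys and values are small boxed numbers (the class name and the exact capacity are assumptions):

import java.util.Map;
import org.springframework.batch.item.ItemProcessor;

public class CachedChecksumProcessor implements ItemProcessor<long[], Long> {

    private final WordGenerator wordGenerator;
    // Word number -> hash code of its word; 20 times the capacity of the other caches.
    private final Map<Long, Integer> hashCache = new LimitedLinkedHashMap<>(4_000_000);

    public CachedChecksumProcessor(WordGenerator wordGenerator) {
        this.wordGenerator = wordGenerator;
    }

    @Override
    public Long process(long[] row) {
        long checksum = 0;
        for (long wordNumber : row) {
            checksum += hashCache.computeIfAbsent(wordNumber,
                    n -> wordGenerator.getWord(n).hashCode());
        }
        return checksum;
    }
}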

Executing step: [LinkedHashMap(smart)]                                          
Finished step. Word source: Plain generator. Processor: Cached processor
Word source statistics: No statistics                                   
Processor statistics: Hit count: 11154589, miss count: 845411           
Speed: 448833.03 rows/sec                                               

As expected, this strategy enabled the highest speed.

HashMap(soft)

This step uses a very unusual cache. It is again a simple unlimited HashMap, but referenced via Java's SoftReference. Let's execute it using -Xmx128m:

Executing step: [HashMap(soft)]                                                                                 
Cache has been garbage collected. Creating empty one.                                                           
Cache has been garbage collected. Creating empty one.                                                           
Cache has been garbage collected. Creating empty one.                                                           
Cache has been garbage collected. Creating empty one.                                                           
Cache has been garbage collected. Creating empty one.                                                           
Cache has been garbage collected. Creating empty one.                                                           
Cache has been garbage collected. Creating empty one.                                                           
-----------------------------------------------------------------------                                         
Finished step. Word source: Cached manually - HashMap - soft reference GC count: 7 . Processor: Simple processor
Word source statistics: Garbage collected: 7 times. Last cache statistics: Hit count: 360050, miss count: 122464
Speed: 172324.66 rows/sec                                                                                       

Using a SoftReference allowed the JVM to garbage collect the whole HashMap. When we detected this, we simply created a new empty one. It happened 7 times, and we still got slightly better performance than with Guava or EhCache.

We used this strategy frequently, mostly for defensive reasons. As already mentioned, an OOME crashed the whole infrastructure, and this was the way to prevent it. We made educated guesses about how large a cache we could afford, and we adjusted these estimates further at runtime.

You might ask yourself why we did not wrap every value in a separate soft reference. Java comes with a WeakHashMap but not with a SoftHashMap; Guava's CacheBuilder comes with an option for soft values. This approach is not recommended for our use case: a huge number of soft references comes with performance penalties. It is therefore preferable to soft-reference whole segments of the cache, or the whole cache itself.
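A sketch of this defensive cache: the whole HashMap sits behind a single SoftReference, and whenever the garbage collector has reclaimed it under memory pressure, a fresh empty map is created (counters are omitted; the class name is an assumption):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCachedWordSource {

    private final WordGenerator wordGenerator;
    private SoftReference<Map<Long, String>> cacheReference =
            new SoftReference<>(new HashMap<>());

    public SoftCachedWordSource(WordGenerator wordGenerator) {
        this.wordGenerator = wordGenerator;
    }

    public String getWord(long wordNumber) {
        Map<Long, String> cache = cacheReference.get();
        if (cache == null) {
            // The GC reclaimed the whole map; start over with an empty one.
            System.out.println("Cache has been garbage collected. Creating empty one.");
            cache = new HashMap<>();
            cacheReference = new SoftReference<>(cache);
        }
        return cache.computeIfAbsent(wordNumber, wordGenerator::getWord);
    }
}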

Conclusion

We explored a type of batch application which processes a huge number of very small input records. We showed that the most efficient caching can be achieved by avoiding any kind of AOP proxy and by avoiding reflection-based key generation.

In scenarios where input data is transformed several times, it is advisable to put the cache around function compositions which have the smallest possible key sizes and the smallest possible value sizes.

If cache sizes cannot be determined precisely enough to prevent OOM errors, then caches referenced via soft references represent a useful defensive mechanism.