Cache expiration puzzle

Jan 27, 2008

I’m in doubt and I need your help on this important subject. It’s about caching, and about whether to use absolute or sliding expiration. I’ve been thinking really hard about this and can find both pros and cons to each solution. The scenario is this:

When the BlogEngine.NET application starts, it loads all posts into memory. Because BlogEngine.NET is a single-blog-per-installation application, this doesn’t matter much, since each post only uses about 50 KB of memory. So my 350 posts only use about 17.5 MB of RAM. However, some people have thousands of posts even on a single installation. The goal is to reduce the memory footprint by using intelligent caching.

The simplest solution is to take the post body (the text of a post) and make it lazy loaded, so it only loads when a visitor requests it. By lazy loading I mean that the body is read from IO, whether from a database or an XML file, only when it is requested, instead of being put into memory when the application starts.

The post body is by far the biggest part of a post and therefore has the single biggest influence on memory consumption.
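As a rough illustration of the lazy loading idea, here is a minimal Python sketch (BlogEngine.NET itself is a .NET application, so this is not its code). The Post class, the load_body callback and fake_storage_read are hypothetical names invented for the example. Note that this sketch keeps the body forever once it has been read; expiring it again is the caching question discussed in the rest of the post.

```python
from typing import Callable, Optional


class Post:
    """A post whose metadata stays in memory while the body is read on demand."""

    def __init__(self, post_id: str, title: str, load_body: Callable[[str], str]) -> None:
        self.post_id = post_id
        self.title = title
        self._load_body = load_body        # hypothetical reader (database or XML file)
        self._body: Optional[str] = None   # nothing is read at application start

    @property
    def body(self) -> str:
        # The first access triggers the IO read; later accesses reuse the result.
        if self._body is None:
            self._body = self._load_body(self.post_id)
        return self._body


reads = []


def fake_storage_read(post_id: str) -> str:
    # Stand-in for the real data provider; records every read so they can be counted.
    reads.append(post_id)
    return f"body of {post_id}"


post = Post("42", "Cache expiration puzzle", fake_storage_read)
reads_before_access = list(reads)   # still empty: creating the post did no IO
first = post.body
second = post.body                  # served from memory, no second read
```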

Sliding expiration

What I’ve done is to cache the post body with a sliding expiration of 3 minutes. That means that when a visitor requests a particular post, the body is cached for three minutes from that point on. One minute later, another visitor requests the same post and then the cache expiration is reset to 3 minutes again and so on.
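The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the sliding idea, not the ASP.NET cache: SlidingCache, its ttl parameter and the injectable clock are assumptions made so the example can be run with a fake clock.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SlidingCache:
    """Every successful read pushes the entry's expiry forward by ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._items: Dict[str, Tuple[str, float]] = {}   # key -> (value, expires_at)

    def set(self, key: str, value: str) -> None:
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            del self._items[key]      # expired: the caller has to reload from IO
            return None
        self._items[key] = (value, now + self._ttl)   # slide the window
        return value


# Fake clock (in seconds) so the example does not have to wait three minutes.
now = [0.0]
cache = SlidingCache(ttl=180, clock=lambda: now[0])
cache.set("post-1", "body")

now[0] = 60
hit_at_60 = cache.get("post-1")     # hit; expiry moves from 180 to 240

now[0] = 230
hit_at_230 = cache.get("post-1")    # still a hit because of the earlier read; expiry moves to 410

now[0] = 410
miss_at_410 = cache.get("post-1")   # 180 seconds without a read: expired
```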

The problem with this approach shows up if you have a rather popular blog where people request a lot of both new and old posts. Then you might end up in a situation where almost all post bodies are cached all the time and never expire. On the positive side, performance remains high, since no IO loading occurs when all post bodies are already in memory.

For less popular blogs, another positive is that all the rarely visited posts are removed from the cache, but the front page still loads directly from memory most of the time.

Absolute expiration

The other possibility is to use an absolute expiration, also of 3 minutes. That means that no matter how many requests a post gets, the cache entry always expires 3 minutes after the first request. The problem here is that on a popular blog performance takes a hit, since every post has to be read from IO again every 3 minutes. The positive thing is that cached bodies are released to garbage collection very frequently.

Another issue with this solution is that for a new blog with few requests, the performance hit will be the biggest. It has to do IO reads at almost every request since there are more than 3 minutes between each request. As I see it, absolute expiration will reduce memory consumption most, but take the biggest performance hit.
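For comparison, absolute expiration can be sketched the same way. Again this is a Python illustration rather than the ASP.NET cache, and AbsoluteCache, ttl and the injectable clock are assumed names. The only real difference from a sliding cache is that reads never move the expiry time.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class AbsoluteCache:
    """An entry expires ttl seconds after it was stored, however often it is read."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._items: Dict[str, Tuple[str, float]] = {}   # key -> (value, expires_at)

    def set(self, key: str, value: str) -> None:
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]      # expired: the next request pays for an IO read
            return None
        return value                  # reads do not extend the lifetime


# Fake clock (in seconds) so the example is deterministic.
now = [0.0]
cache = AbsoluteCache(ttl=180, clock=lambda: now[0])
cache.set("post-1", "body")

now[0] = 60
hit_at_60 = cache.get("post-1")     # hit

now[0] = 179
hit_at_179 = cache.get("post-1")    # hit, but the expiry is still 180

now[0] = 180
miss_at_180 = cache.get("post-1")   # expired despite the recent reads
```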

Your suggestion

There are pros and cons with each solution and I can’t figure out what makes the most sense. I lean toward sliding expiration since it will make sure the front page always loads from memory, but then again, it also keeps certain posts in memory all the time.

What is your take on this issue, and is three minutes the right expiration time span?

Comments (20)

Phil Garcia United States
1/27/2008 8:42:23 PM #

Per the MSDN documentation, “when system memory becomes scarce, the cache automatically removes seldom-used or low-priority items to free memory.”

Since the framework will simply unload cached items when memory is needed elsewhere, you do not necessarily have to tightly balance memory usage and IO performance with a cache expiration policy or duration.

With this in mind, I would recommend a sliding expiration between 10 and 60 minutes (or more). And if possible, I would randomize the expiration a little bit, so large parts or sections of the cache are not unloaded to the garbage collector all at once.

Keep up the good work!
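Randomizing the expiration a little could look roughly like this in Python. The function name jittered_ttl, the 20% spread and the 30 minute base are all assumptions chosen for the example, not values from the comment above.

```python
import random
from typing import Optional


def jittered_ttl(base_seconds: float, spread: float = 0.2,
                 rng: Optional[random.Random] = None) -> float:
    """Return base_seconds varied by up to +/- spread (a fraction of the base)."""
    rng = rng or random.Random()
    return base_seconds * (1 + rng.uniform(-spread, spread))


# A seeded generator keeps the example repeatable; real code would not seed it.
rng = random.Random(1)
ttls = [jittered_ttl(30 * 60, rng=rng) for _ in range(5)]
# Each value lies between 24 and 36 minutes (1440 to 2160 seconds), so entries
# cached at the same moment do not all expire at the same moment.
```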

Paul United Kingdom
1/27/2008 11:06:12 PM #

Phil is correct in that the framework automatically clears stuff down when it's constrained by memory, so there is no need to worry about keeping the expiration short in that context.

However, you are absolutely correct to worry about the danger of permanently holding data, so I would suggest an absolute expiry, up to 60 minutes in the future.

Another concern, possibly not for BlogEngine but for others reading this post, is that on a web farm your servers each have their own Cache and you cannot always guarantee the same server will be hit by subsequent requests. (Even if you're not on a web farm today, it may be a business requirement tomorrow, so always be aware of these concerns.) This may not immediately seem like a problem if they're each holding data for one hour: it's still only one database request per server per hour.

But what if server A gets cached at 10:00 and server B gets cached at 10:10, then the data is changed by server C at 11:05?  Server B will notice the change 5 minutes later, server A will not know for another 55 minutes.  So you create a situation where refreshing the page shows different data each time you refresh.

So actually, it's probably good practice to set an absolute expiry, rounding the expiry time to the next full hour, so all servers expire cached data at 11:00, regardless of whether they were loaded at 10:00 or 10:10.  Of course then you're relying on your servers having the same time, but that *should* be fair enough.  Better than not trying to synchronise, anyway.
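Rounding the expiry to the next full hour is simple date arithmetic. A small Python sketch, where next_full_hour is a hypothetical helper name and the two load times are the ones from the example above:

```python
from datetime import datetime, timedelta


def next_full_hour(now: datetime) -> datetime:
    """Return the start of the hour that follows now."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


loaded_on_a = datetime(2008, 1, 27, 10, 0)    # server A cached at 10:00
loaded_on_b = datetime(2008, 1, 27, 10, 10)   # server B cached at 10:10

expiry_a = next_full_hour(loaded_on_a)
expiry_b = next_full_hour(loaded_on_b)
# Both servers drop their copy at 11:00, provided their clocks agree.
```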

Better still, use the nifty new functionality in SQL Server 2005 that expires the cache automatically when the data is changed.

Paul

Paul United Kingdom
1/27/2008 11:08:41 PM #

Dammit, I meant "changed by server C at 10:05" :)

Michael Australia
1/28/2008 2:11:51 AM #

I would side with Paul. Using absolute expiry will allow new content to flow to the front end quickly. With sliding expiration on a busy site, you would never get new content on the homepage. If you have a new site or one without a lot of traffic, then the IO may not be much of a concern because the traffic is so low.

The SQL caching works great when you only have one or a small number of servers, but this caching method is very different when you are in a bigger environment.  When the data is updated and the SQL cache is expired immediately, ALL of the servers hit the database at the same time, which can create a very bad rush condition if you have a lot of servers with a lot of traffic.

My recommendation is to use absolute expiry and have a lower cache time; 3 minutes should be fine.  I think the whole point is to stop the rush of traffic to the database or IO source, and this should achieve that.

As a side note, it would be quite easy to enable absolute/sliding/SQL caching by a config change.  Have a default to absolute and 3 minutes, but have it overrideable.

mike

huobazi People's Republic of China
1/28/2008 3:23:20 AM #

I think building a static HTML file in an HttpModule/HttpHandler (using a stream/filter) first and then transferring to it can also improve performance.
When the entity/post is changed or modified, we can rebuild the HTML file.

This post shows how to build the static files in an HttpHandler:
www.cnblogs.com/.../...AndMakeStaticHtmlFiles.html

The demo web project can be downloaded at www.cnblogs.com/Files/huobazi/BuildHtmlDemo.rar

In BlogEngine we can use a filter in CompressionModule for static files.

An online demo of static files in ASP.NET:
http://www.devedu.com/default.aspx
http://www.devedu.com/default.html

www.devedu.com/Doc/DotNet/AspNet/default.3.aspx
www.devedu.com/Doc/DotNet/AspNet/default.3.html

Hope this is helpful~

TweeZz
1/28/2008 4:02:41 AM #

Hi,

I would also go for a sliding expiration. In my opinion 3 minutes is really not a lot. But maybe for this blog it is a good choice to have posts cached for (only) 3 minutes. Since there is a big difference between the proposed 3 and 60 minutes, why not let everyone choose for themselves what type of caching they want to use (if any) and what time span should be used?
If solution A is good for blog 1, but solution B is better for blog 2, then I would make it configurable. Then everybody can be made happy and you can stop breaking your head on this issue.

Ingmar Netherlands
1/28/2008 5:41:55 AM #

I would go for the sliding expiration. If you run a popular blog, you need more hardware (memory in this case). A popular blog will probably have more income and place a higher value on uptime, so it should use better hosting.
If you have a low profile blog, you can have cheap hosting. I think it's the natural way of choice.

I wouldn't recommend making the settings editable via the admin interface. Most blogs run on shared hosting accounts, so you don't have any idea what the impact on the machine is. If you decide to make them editable, please use the web.config so that not everyone can edit this setting, keeping it with the 'host' admin and not the 'site' admin.

Dan Atkinson United Kingdom
1/28/2008 10:14:19 AM #

It seems that you've already made up your mind and you're just looking for validation, but you're right though.

As Phil pointed out first, having a sliding expiration with a randomised timer between a few minutes and an hour allows the cache to expire smoothly.

Ryan Australia
1/28/2008 10:35:24 AM #

If you have a sliding expiration and a popular site, wouldn't that mean that you would NEVER have any new content on the site until no one was on your site?  Wouldn't having stale content be a really bad thing?  Even with a randomized timeout, as long as you had 1 visitor every X seconds, the homepage would never update.  Seems like a bad idea.  

I guess it's the argument where blog A is different than blog B, so what works for one doesn't work well for another.  A setting seems like the best way to go.

Denny Ferrassoli United States
1/28/2008 3:37:12 PM #

I agree with Ingmar. Allow the settings to be managed. However, I would prefer the ability to edit the settings in my admin interface. BlogEngine.NET is very easy to set up and deploy, so keep it that way. I don't see a reason to make changes in web.config for advanced users. A description of what each setting does and its effects should suffice.

Bruno 'Shine' Figueiredo Portugal
1/28/2008 3:42:46 PM #

You could also slide the expiration based on how popular a post is.
This would allow the GC to smoothly release resources.

Paul United Kingdom
1/28/2008 7:52:11 PM #

> You could also slide the expiration based on how popular a post is.
> This would allow the GC to smoothly release resources.

So you want to slide the cache expiration so that more popular threads are expired more quickly?  How does that seem like a good idea?  They're the ones that you want to cache, it's the ones that change regularly that you want to expire.

That said, my comment last night (in my defense, when I was half asleep), also had a flaw that I'm astonished no one picked up.  If the example I gave had a set expiry date of "the next full hour" then there would still be a 50 minute period where the two servers had different information cached.

Caching is a dangerous game and you have to consider the consequences.  Unless you have triggers set to expire when the data changes, there is always going to be a risk involved.  I don't agree with Michael's comment that SQL Server 2005's triggers lead to heavy access when data in the cache is expired.  Expiring the cached object does not require the server to read it again immediately, it just says to the server "next time you need this data, refresh it from the database cause what you have is currently out of date".

I would suggest a much better approach to data retrieval is to always get exactly as much data as a page requires and hold it for the complete lifecycle of that page then throw it away.  In the unlikely event that the next page requires the same data, get it again.

Most, if not all, database servers can handle some lively activity as long as your indexing is well done.  Caching for a period of time should only be done if you know the data will not change in that period or in a situation where you really don't care whether the user sees immediate updates.

Say, for example, you have an ecommerce site with a list of countries in a drop down list - you may want to add some obscure country that was missed, once in a blue moon, but you may be happy to cache it once per hour and when the user from that obscure country complains you tell them to come back in two hours and try again.

And even in this kind of circumstance I would never recommend sliding expiry.  The risk of never updating your site is way too great.  If your site isn't that popular then don't bother caching the data in the first place.

Mads Kristensen Denmark
1/28/2008 8:16:16 PM #

Guys, thank you very much for your insight on this subject. What I think is to use a combination of invalidating the cache when data changes and a sliding expiration of maybe 30 minutes. That way the post body will stay cached for at least 30 minutes after its last request, or until the data changes. The data almost never changes, but in case it does, the cache will be reloaded for that post.
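A minimal Python sketch of that combination, assuming a hypothetical PostBodyCache class with a load callback standing in for the data provider. It is not the BlogEngine.NET implementation, only an illustration of sliding expiration plus explicit invalidation when a post is saved.

```python
import time
from typing import Callable, Dict, Tuple


class PostBodyCache:
    """Sliding expiration plus explicit invalidation when a post changes."""

    def __init__(self, load: Callable[[str], str], ttl: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load      # hypothetical reader for the database or XML provider
        self._ttl = ttl        # 30 minutes by default, as suggested above
        self._clock = clock
        self._items: Dict[str, Tuple[str, float]] = {}   # post_id -> (body, expires_at)

    def get(self, post_id: str) -> str:
        now = self._clock()
        entry = self._items.get(post_id)
        if entry is None or now >= entry[1]:
            body = self._load(post_id)    # miss or expired: read from IO
        else:
            body = entry[0]
        self._items[post_id] = (body, now + self._ttl)   # slide the window
        return body

    def invalidate(self, post_id: str) -> None:
        # Called by whatever code saves a post, so readers never see the old body.
        self._items.pop(post_id, None)


storage = {"42": "first draft"}
loads = []


def load_body(post_id: str) -> str:
    loads.append(post_id)
    return storage[post_id]


cache = PostBodyCache(load_body)
cache.get("42")
cache.get("42")                   # second read is served from memory

storage["42"] = "edited"          # the post is changed...
cache.invalidate("42")            # ...so its cache entry is dropped
after_edit = cache.get("42")      # reloaded from storage
```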

Also, I think you are right that this has to be a setting so people can easily adjust this based on their needs and data provider. The XML provider is of course slower than the database provider so it might need to be cached for longer periods.

Bruno 'Shine' Figueiredo Portugal
1/29/2008 8:57:59 AM #

@Paul: That is the main idea...the most popular have a bigger sliding expiration. Like Mads says, a combination of a cache invalidation when data changes,  with a sliding expiration based on how popular the post is (bigger the popularity, bigger the time to expire), like I said, could be one of many solutions.

This said, I agree with you when you say that caching must be used with caution, but I think with a correct usage, it can be most useful. That's why "God" invented the cache invalidation :P
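One possible way to make the sliding window grow with popularity is sketched below in Python. The linear formula, the 3 minute base and the 60 minute cap are assumptions picked for illustration; the comment above does not specify how popularity should map to a time span.

```python
def ttl_for(hits: int, base_seconds: float = 180, max_seconds: float = 3600) -> float:
    """Return a sliding window that grows with the hit count, up to a cap."""
    return min(base_seconds * (1 + hits), max_seconds)


rarely_read = ttl_for(0)      # no hits yet: the base window of 3 minutes
fairly_read = ttl_for(4)      # 4 hits: 5 times the base window, 15 minutes
very_popular = ttl_for(1000)  # capped at 60 minutes
```

A cache would call ttl_for with the post's current hit count each time it slides the expiry, so frequently read posts stay in memory longer while rarely read ones are released quickly.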

Olmo
1/29/2008 10:07:21 AM #

I've an implementation of an in-memory data structure: RecentsDictionary. It's like a dictionary (get by keys) and a heap together, where key value pairs are prioritized when accessed. And the least used is removed if the maximum number of items is reached.

I have used part of the code from someone else but I don't remember from who :S

I've the source code if necessary.
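The source of RecentsDictionary is not shown in this thread, so the following Python sketch is only a guess at the general idea: a size-limited dictionary that evicts the least recently used entry. It borrows the class name from the comment, but the max_items parameter and the use of OrderedDict (instead of a heap) are assumptions.

```python
from collections import OrderedDict
from typing import Optional


class RecentsDictionary:
    """Size-limited dictionary that drops the least recently used entry when full."""

    def __init__(self, max_items: int) -> None:
        self._max_items = max_items
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def set(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)   # evict the least recently used entry


recents = RecentsDictionary(max_items=2)
recents.set("a", "A")
recents.set("b", "B")
recents.get("a")          # "a" is now more recent than "b"
recents.set("c", "C")     # the dictionary is full, so "b" is evicted
```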

Charles Nurse United States
1/29/2008 12:57:00 PM #

In DotNetNuke we use sliding expiration for the most part, although module developers are free to do their own thing.  We have a setting available to the "Host" user account that determines the time span to use (from 0-120 minutes), and whenever something is changed we invalidate the cache.

andrei United States
2/4/2008 8:22:30 AM #

comment test

Andre Velloso Canada
3/11/2008 5:22:00 AM #

I would go for sliding expiration with a greater expiration time. For the blog I would cache each individual post with a unique key. Then the most popular posts would always be cached. Another advantage of doing this is that when the post is updated you can update the cache by just using the same key. For BlogEngine it is not difficult to handle the cache because both the admin and the public interface run under the same application. When you have separate applications to do these tasks it is a little more tricky.

Roman Clarkson United States
3/12/2008 7:37:02 PM #

This is one of those examples that require a human to pose the question, sliding or absolute?  The question that comes to my mind is, "how can automation answer it for me?"  Couldn't BE.N track relevant statistics that would answer the question, sliding or absolute?  Who has time to think about this stuff, let the application decide based on a good set of rules.

Secondly, perhaps limit the caching of posts to something like the last 30 days or last 100 posts.  That would give sliding the advantage and limit IO trips.

Mr .Phucked United States
6/24/2008 8:03:23 PM #

I'm extremely interested in people's findings with these different caching strategies.  I currently have a pretty high traffic site with many pages receiving 25k page views a day and many with over 400 comments.
The site is experiencing out of memory exceptions approximately every 3-4 hours.
Caching may help to alleviate these problems, I hope!
Has anyone tried the caching on a high traffic site?

The site in question btw is www.thatsphucked.com (not safe for work)


About the author

Mads Kristensen
Program Manager at the Microsoft Web Platform team and founder of BlogEngine.NET.

Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer’s view in any way.