An Async Html cache – Part I - Writing the cache

While converting a financial VBA Excel add-in to .NET (more on that in later posts), I found myself in dire need of an HTML cache that can be called from multiple threads without blocking them. Picture it as a glorified dictionary where each entry is (url, cachedHtml). The only difference is that when you ask for a page, you pass a callback to be invoked once the HTML has been loaded (which could be immediately, if someone else has already retrieved it).

In essence, I want this:

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))

I’m no expert in the .NET Parallel Extensions, but I’ve had help. Stephen Toub helped so much with this that he could have blogged about it himself. By the way, this code runs on Visual Studio 2010, which we haven’t shipped yet. I believe that, with some modifications, it can run on Visual Studio 2008 plus the .NET Parallel Extensions CTP, but you’ll have to change a bunch of names.

In any case, here it comes. First, let’s add some imports.

Imports System.Collections.Concurrent
Imports System.Threading.Tasks
Imports System.Threading
Imports System.Net

Then, let’s define an asynchronous cache.

Public Class AsyncCache(Of TKey, TValue)

This thing needs to store the (url, html) pairs somewhere and, luckily enough, there is a handy ConcurrentDictionary that I can use. The cache also needs to know how to load a TValue given a TKey. In ‘programmingese’, that means:

    Private _loader As Func(Of TKey, TValue)
    Private _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

I’ll need a way to create it.

    Public Sub New(ByVal l As Func(Of TKey, TValue))
        _loader = l
    End Sub

Notice in the above code the use of the Task class for my dictionary instead of TValue. Task is a very good abstraction for “do some work asynchronously and call me when you are done”. It’s easy to initialize and it’s easy to attach callbacks to it. Indeed, this is what we’ll do next:
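To make the “call me when you are done” idea concrete, here is a minimal sketch that is separate from the cache itself. The names TaskCallbackSketch and RunThenCall are made up for illustration; the sketch only relies on the Task(Of TResult) constructor, Start and ContinueWith.

```vb
Imports System
Imports System.Threading.Tasks

Module TaskCallbackSketch
    ' Hypothetical helper: runs the loader as a task and hands its result to the callback.
    ' Returns the continuation so that the caller can wait for the callback if it wants to.
    Function RunThenCall(ByVal loader As Func(Of String), ByVal callback As Action(Of String)) As Task
        ' Creating the task does not run it; nothing happens until Start() is called.
        Dim work As New Task(Of String)(loader)

        ' The continuation runs after the task has finished and reads its result.
        Dim continuation As Task = work.ContinueWith(Sub(t) callback(t.Result))

        work.Start()
        Return continuation
    End Function

    Sub Main()
        Dim done As Task = RunThenCall(
            Function() "<html>hello</html>",
            Sub(html) Console.WriteLine("Loaded {0} characters", html.Length))

        ' Block only so that this console program does not exit before the callback runs.
        done.Wait()
    End Sub
End Module
```

A continuation can be attached before or after the task has completed; either way it runs once the result is available, which is exactly the behavior the cache needs.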

    Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))

        Dim task As Task(Of TValue) = Nothing
        If Not _map.TryGetValue(key, task) Then
            task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent)
            If _map.TryAdd(key, task) Then
                task.Start()
            Else
                task.Cancel()
                _map.TryGetValue(key, task)
            End If
        End If

        task.ContinueWith(Sub(t) callback(t.Result))
    End Sub

End Class

Wow. OK, let me explain. This method is divided into two parts. The first part is just a thread-safe way to say “give me the task corresponding to this key or, if the task hasn’t been inserted in the cache yet, create it and insert it”. The second part just says “add callback to the list of functions to be called when the task has finished running”.

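The first part needs some more explanation. What is TaskCreationOptions.DetachedFromParent? It says that the created task does not prevent the parent task from terminating: the task that created the child won’t wait for the child to finish. (This is the pre-release API. In the released .NET Framework 4 the option is gone: tasks are detached by default, and TaskCreationOptions.AttachedToParent opts in to the parent/child relationship.) The rest is better explained in comments.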

        If Not _map.TryGetValue(key, task) Then ' Is the task in the cache? (Loc. X)
            task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent) ' No, create it
            If _map.TryAdd(key, task) Then ' Try to add it
                task.Start() ' I succeeded. I’m the one who added this task. I can safely start it.
            Else
                task.Cancel() ' I failed, someone inserted the task after I checked in (Loc. X). Cancel it.
                _map.TryGetValue(key, task) ' And get the one that someone inserted
            End If
        End If
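On the parent/child relationship described above: the sketch below shows the difference using the released .NET 4 API, where a child is detached unless it is created with TaskCreationOptions.AttachedToParent. It is only an illustration under that assumption and is not part of the cache; AttachSketch and ParentFinishesEarly are hypothetical names.

```vb
Imports System
Imports System.Threading
Imports System.Threading.Tasks

Module AttachSketch
    ' Starts a parent task whose body starts a child that blocks on a gate.
    ' Returns True if the parent completed while the child was still blocked.
    Function ParentFinishesEarly(ByVal childOptions As TaskCreationOptions) As Boolean
        Dim gate As New ManualResetEventSlim(False)
        Dim child As Task = Nothing

        Dim parent As Task = Task.Factory.StartNew(
            Sub()
                child = Task.Factory.StartNew(Sub() gate.Wait(), childOptions)
            End Sub)

        ' Give the parent half a second to complete while the child is still blocked.
        Dim finishedEarly As Boolean = parent.Wait(500)

        gate.Set()    ' Release the child.
        parent.Wait() ' The parent body has run by now, so child is assigned.
        child.Wait()
        Return finishedEarly
    End Function

    Sub Main()
        Console.WriteLine("Detached (default): {0}",
                          ParentFinishesEarly(TaskCreationOptions.None))
        Console.WriteLine("AttachedToParent:   {0}",
                          ParentFinishesEarly(TaskCreationOptions.AttachedToParent))
    End Sub
End Module
```

An attached child keeps its parent from completing, so the second call should report False. A detached child does not, which is the behavior the cache wants for its loader tasks.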

Got it? Well, I admit I trust Stephen that this is what I should do …
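For readers on the released .NET Framework 4 or later, here is a hedged sketch of the same idea. It assumes that DetachedFromParent and Task.Cancel() are not available there (as far as I know, cancellation moved to CancellationToken), so the losing task is simply never started. It also swaps the TryAdd/TryGetValue pair for ConcurrentDictionary.GetOrAdd and returns the continuation instead of being a Sub; both are my changes, not the original design. AsyncCacheSketch and AsyncCacheDemo are made-up names.

```vb
Imports System
Imports System.Collections.Concurrent
Imports System.Threading
Imports System.Threading.Tasks

Public Class AsyncCacheSketch(Of TKey, TValue)
    Private ReadOnly _loader As Func(Of TKey, TValue)
    Private ReadOnly _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

    Public Sub New(ByVal loader As Func(Of TKey, TValue))
        _loader = loader
    End Sub

    ' Returns the continuation so that callers can wait for the callback if they need to.
    Public Function GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue)) As Task
        Dim entry As Task(Of TValue) = Nothing
        If Not _map.TryGetValue(key, entry) Then
            ' Not cached yet: prepare a task, but do not start it.
            Dim candidate As New Task(Of TValue)(Function() _loader(key))

            ' GetOrAdd returns whichever task ended up in the dictionary.
            entry = _map.GetOrAdd(key, candidate)

            ' Only the thread whose task was stored starts it; a losing task is dropped unstarted.
            If entry Is candidate Then candidate.Start()
        End If

        Return entry.ContinueWith(Sub(t) callback(t.Result))
    End Function
End Class

Module AsyncCacheDemo
    Sub Main()
        Dim loads As Integer = 0
        Dim cache As New AsyncCacheSketch(Of String, String)(
            Function(k)
                Interlocked.Increment(loads) ' Count how often the loader really runs.
                Return k.ToUpperInvariant()
            End Function)

        Dim first As Task = cache.GetValueAsync("a", Sub(v) Console.WriteLine(v))
        Dim second As Task = cache.GetValueAsync("a", Sub(v) Console.WriteLine(v))
        Task.WaitAll(first, second)

        Console.WriteLine("Loader ran {0} time(s)", loads)
    End Sub
End Module
```

Both callbacks receive the value, but the loader runs only once for the key because the second request finds the task already in the dictionary. As in the original, a loader that throws surfaces its exception when the continuation reads t.Result, so production code would want to handle that case.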

I can then create my little HTML Cache by using the above class as in:

Public Class HtmlCache

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))
        _asyncCache.GetValueAsync(url, callback)
    End Sub

    Private Function LoadWebPage(ByVal url As String) As String
        Using client As New WebClient()
            'Test.PrintThread("Downloading on thread {0} ...")
            Return client.DownloadString(url)
        End Using
    End Function

    Private _asyncCache As New AsyncCache(Of String, String)(AddressOf LoadWebPage)

End Class

This is trivial: I just create an AsyncCache and initialize it with a method that knows how to load a web page. I then implement GetHtmlAsync by delegating to the underlying GetValueAsync on AsyncCache.

It is somewhat bizarre to call WebClient.DownloadString when the design could be revised to take advantage of its asynchronous version. Maybe I’ll do that in another post. Next time, I’ll write code to use this thing.
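As a rough idea of what that revision could look like, the sketch below wraps WebClient.DownloadStringAsync and its DownloadStringCompleted event in a TaskCompletionSource, so that no thread sits blocked while the page downloads. This is my assumption about one possible direction, not the design from a later post; AsyncDownloadSketch and DownloadStringTask are hypothetical names.

```vb
Imports System
Imports System.Net
Imports System.Threading.Tasks

Public Module AsyncDownloadSketch
    ' Hypothetical helper: exposes WebClient's event-based download as a Task(Of String).
    Public Function DownloadStringTask(ByVal url As String) As Task(Of String)
        Dim tcs As New TaskCompletionSource(Of String)()
        Dim client As New WebClient()

        AddHandler client.DownloadStringCompleted,
            Sub(sender As Object, e As DownloadStringCompletedEventArgs)
                ' Translate the event outcome into the state of the task.
                If e.Cancelled Then
                    tcs.TrySetCanceled()
                ElseIf e.Error IsNot Nothing Then
                    tcs.TrySetException(e.Error)
                Else
                    tcs.TrySetResult(e.Result)
                End If
                client.Dispose()
            End Sub

        ' Starts the download and returns immediately; the handler above completes the task.
        client.DownloadStringAsync(New Uri(url))
        Return tcs.Task
    End Function
End Module
```

To plug this into the cache, the loader would have to return a Task(Of TValue) instead of a TValue, and the dictionary would store that task directly rather than wrapping a blocking call in a new one.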

Published Monday, April 27, 2009 4:57 PM by lucabol

Comments

# Simpler method

It would be much easier to use a normal thread safe collection class.

Each element would have:

 key

 url (string)

 status (loaded, failed, waiting to load, partially loaded)

 last status change (date/time)

 html_loaded (string)

 last_referenced (date/time)

 can_timeout_and_be_deleted(boolean)

Class methods

  Get HTML from URL(boolean lookup_only = false, int max_block_seconds = 0 /* -1 block forever, 0 - don't block, otherwise block for X seconds*/)

  Get HTML from KEY(boolean lookup_only = false)

  Delete_entry(URL)

  Delete_entry(KEY)

A thread or threads internal to the class would load the html asynchronously and be invoked via a clock timer with ticks a few seconds apart.

Attaching a callback for each request is much harder to implement.  It is up to the method requesting the URL to decide whether or not it blocks, needs an asynchronous callback/interrupt or polls for data.

The idea is that for nearly all cases, no new threads should be created and no new callbacks should be hooked up.  This keeps your code easier to understand and debug.  Common faults and scenarios are handled easily:

 - requesting thread terminates

 - asynchronous load times out

 - error loading html

 - html hasn't been used for 5 minutes and can be removed (a tunable cache parameter)

 - memory limit of cache reached and unreferenced html strings can be removed (a tunable cache parameter)

 - duplicate request for a URL/KEY from more than one thread

 - html can be loaded from multiple sources (web, file, network share, ftp, database, etc.).

 - html load failed as html string exceeds the size limit on loaded string (e.g., a tunable cache parameter)

 - The common problem with attempting a callback for a method that is terminated is avoided.  That's a problem when the callback requires the cache to build a complex packet of data to pass in the callback.

This is quite similar to the basic page handling algorithm in a virtual memory system (circa 1980).  It's how one handled this in systems lacking real threading or with non-reentrant GUI message handling (VB6 GUI/MFC GUI posting a message to the current winform indicating that the asynchronous request completed).

Wednesday, April 29, 2009 1:20 PM by Greg

# re: An Async Html cache – part I

Thanks Greg, these are good comments.

We have a different design goal though. Both solutions are valid. I want the method requesting the URL to have the flexibility of deciding what to do (aka have a callback). I do want the exposed API to be async.

The rest of your comments talk to the difference between writing production code and a conceptual example. I'm doing the latter here.

Wednesday, April 29, 2009 1:39 PM by lucabol

# async

The idea of wrapping the asynchronous cache handler in a class is to reduce or eliminate the need for callers to be asynchronous.  This makes coding the caller's class much easier.

The other aspect is that the amount of work done in an asynchronous callback should be minimal since you don't know when it will be executed.  For example, you get a callback call with the HTML you need whilst you are destroying the caller's object.  This is more important when dealing with large amounts of data in each cache entry (e.g., large xml strings) since processing each cache entry may take considerable time.

Wednesday, April 29, 2009 3:11 PM by Greg

# Luca Bolognese on HTML caching in VB.NET (Lisa Feigenbaum)

You may know Luca Bolognese from his well-known work on C# LINQ. Luca is now the Group Program Manager

Wednesday, April 29, 2009 3:15 PM by The Visual Basic Team

# An Async Html cache – part II – Testing the cache

Other posts: Part I – Writing the cache Let’s try out our little cache. First I want to write a synchronous

Friday, May 08, 2009 11:53 AM by Luca Bolognese's WebLog