A group blog from members of the VB team
(Note: there was a grievous error in this post based on a bad assumption on my part regarding VB. Not feeling the need to hide my ignorance :-), I have instead made a number of edits in this post to bring it back to some semblance of reality.)
One thing that gets me annoyed with myself is realizing that the product or service I’ve just bought has some hidden costs that I didn’t anticipate. It might be as complicated as realizing that my plane ticket has all sorts of byzantine surcharges and luggage costs, or as simple as budgeting for a new killer gaming PC and forgetting about the sales tax – it seems there are always extra costs out there that will come back to bite you. Being (as I have said in past postings to this blog) an inherently lazy person, and being fully caught up in our “I must have the sparkly thing right now” culture, I’m apt to overlook the fine print in any purchase agreement unless I really force myself to slow down and pay attention to the details.
One of the times I really have to pay attention is when I’m reviewing code here at Microsoft. As a member of the release team that makes the final decisions as to what last-minute fixes are safe enough to be applied to an imminent release, it’s my job to take a look at the proposed fix and make sure that it doesn’t introduce more problems than it purports to solve – i.e., that it doesn’t add hidden costs.
On the rare occasion that I find a problem in someone’s code, it’s usually something that’s obviously dysfunctional – a case where a variable isn’t initialized properly, or where a different team has simultaneously made a change that renders this change dangerous. But some problems are more subtle, and might not be obvious if only minimal testing and review have been done. Consider the following code (very similar to a case I reviewed here at work the other day):
Public MyList As New List(Of String)

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim s As New System.Text.StringBuilder
    For i As Integer = 0 To MyList.Count - 1
        s.Append(MyList(i))
    Next
End Sub
Edit: This is an error in my blog post – I hadn’t realized that VB, unlike many other languages, only evaluates the stop condition once (thanks to Phil Wells for pointing this out). To make my point, I would have to use code like this:

Dim i As Integer = 0
Do While i <= MyList.Count() - 1
    s.Append(MyList(i))
    i += 1
Loop

The method Count() might still have a hidden cost, but in the For loop case, I’m at least only hitting it once. The fact that VB only evaluates the stop condition once leads to a different issue to be aware of that I’ll cover in my next blog post.
Seems pretty simple, yes? When a button is clicked, the code iterates through the strings in a list and appends them to a StringBuilder, creating (for whatever reason) one long string containing all of the contents. And, to cursory appearances, it will function correctly.
However, the code has hidden costs. You see, written this way, the value MyList.Count is going to be reevaluated on every iteration of the loop, even though the value itself isn’t changing. Furthermore, my indexing into MyList, which always works relative to the beginning of the list, will also have a non-zero cost. Both of these hurt the performance of the code and will leave my customers unhappy over the sloth-like operation of my application.
In this case, the cost isn’t really high: Count is a property wrapping a value stored as a private integer member, and index referencing also has tricks to speed it up. But there’s still a cost, and it can add up. It’s even worse if the value Count() returns has to be calculated on the fly! Consider this really awful code where I’ve created my own “list” type:
Public Class MyLameObject
    ' For some unknown reason, I'm creating my own linked list instead of
    ' using a List(Of) generic. Please don't do this for real!
    Public data As String
    Public NextMLO As MyLameObject

    Public Sub Insert(ByVal mlo As MyLameObject)
        ' Insert mlo immediately after this node.
        If NextMLO Is Nothing Then
            NextMLO = mlo
        Else
            mlo.NextMLO = NextMLO
            NextMLO = mlo
        End If
    End Sub

    Public Function Count() As Integer
        Dim ct As Integer = 0
        Dim current As MyLameObject = Me
        Do While current IsNot Nothing
            ct += 1
            current = current.NextMLO
        Loop
        Return ct
    End Function

    Public Function Item(ByVal index As Integer) As MyLameObject
        If index >= Count() OrElse index < 0 Then Return Nothing
        Dim current As MyLameObject = Me
        For i As Integer = 0 To index - 1
            current = current.NextMLO
        Next
        Return current
    End Function
End Class

Public MyLameObjectHeader As New MyLameObject

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) _
        Handles Button1.Click
    Dim s As New System.Text.StringBuilder
    For i As Integer = 0 To MyLameObjectHeader.Count - 1
        s.Append(MyLameObjectHeader.Item(i).data)
    Next
End Sub
Perhaps as a school assignment, I’ve created a singly-linked list which contains objects of type “MyLameObject” (instead of using optimized collections that we provide). I’ve also created a member function called “Count” which returns the number of objects in my list (starting from the current node) and another member function called “Item” which returns me a reference to a specific object in the list, as indexed from the current node. The code will compile and it will “work.” But, oh! It has so many hidden costs, and if the definition of MyLameObject exists in an assembly for which I don’t have the source code, I might not even realize it!
First, let’s look at the button click handler. It calls Count() on every single iteration, and there will be as many iterations as the value Count() returns. Count(), in turn, walks the entire list every time it’s called. So if my list contains 1000 items, I’ll be referencing a MyLameObject one million times! (This is what computer science classes call an “O(n²) problem.”)
But wait, it gets worse: the code for Item() also calls Count() on each iteration to verify that the index is not out of bounds – that brings us up to a count of 1,001,000 MyLameObject references.
Edit: As noted above, that isn’t true for VB, though it would be for many other languages
And, of course, Item() itself needs to walk the list to actually get to the desired item, on every iteration – that adds 500,500 more object hits (using the n(n+1)/2 formula), for a grand total of 1,500,500 object hits. And all that just to concatenate 1000 strings! That, I fear, is going to slow down my code’s performance somewhat.
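To make the quadratic blow-up concrete, here is a rough instrumentation sketch. Nothing in it comes from the code above: the CountingNode class, its Visits counter, and the NextNode field are all my own hypothetical additions, written only so you can watch the cost grow.

```vbnet
' Hypothetical instrumentation sketch (not part of the original example):
' a Shared counter records every node reference, making visible the
' quadratic cost of calling a list-walking Count() inside a loop.
Public Class CountingNode
    Public Shared Visits As Long = 0
    Public NextNode As CountingNode

    Public Function Count() As Integer
        Dim ct As Integer = 0
        Dim current As CountingNode = Me
        Do While current IsNot Nothing
            CountingNode.Visits += 1   ' one node reference per step
            ct += 1
            current = current.NextNode
        Loop
        Return ct
    End Function
End Class
```

Build a 1000-node chain and call Count() on the head once per iteration of a 1000-iteration loop, and Visits should land around 1,000,000 – each call walks all 1000 nodes, and you make 1000 calls.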
Now, of course I could add code to keep track of the value of Count(), and that would reduce my cost to a third of what it was. I could use some clever hashing algorithm to get to the items more quickly while still using method calls. I could enumerate the list myself instead of relying on “Count” or on “Item,” and that would really cut my costs down dramatically. Or, best of all, I could avoid reinventing the wheel and use the classes that VB.NET already supplies for me, since we do our best to optimize them.
But that’s not the point that I’m trying to make here. The point is, whenever you call a function, you are at its mercy performance-wise – you are stuck with whatever hidden costs it contains. You might not even be able to determine what these hidden costs are if you don’t own the code you’re calling, except by experimentation. And if you call that function iteratively, you’re just compounding the problem.
Now, you can’t avoid making calls into code you don’t own – at some level, we all use libraries or assemblies that are handed to us. But you can be smart about how you use them. Consider *this* code:
Dim count As Integer = MyList.Count
For i As Integer = 0 To count - 1
    s.Append(MyList(i))
Next
Now, I’m only calling Count() once, so I minimize the impact of its cost! Or,
Edit: As above, this doesn’t actually make a difference in VB. Of course, you still have the cost of “Count(),” whatever it may be. In this case, it’s just a property read; if it’s some home-grown collection or data repository, Count() (or its analog) could be a lot more costly.
Consider this even better example that you can use on collections that support IEnumerable (as all System.Collections types do):
For Each data As String In MyList
    s.Append(data)
Next
Yeah! Now I’m firing on all cylinders! Now I don’t have to worry about indexing costs either – I’m just walking the list, collecting my strings in the fastest way possible. Really, the only cost I have to deal with is the cost of actually appending the data (which is in fact why I’m using the highly-optimized StringBuilder class instead of just concatenating strings using a much more costly s &= data line).
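To show the whole pattern end to end, here is a minimal, self-contained sketch; the sample strings are made up for illustration, and the comment notes the costlier concatenation alternative:

```vbnet
' Sketch: walking a list with For Each and appending via StringBuilder.
' StringBuilder appends into a reusable internal buffer; by contrast,
' s &= data would allocate and copy a brand-new string every iteration.
Dim myList As New List(Of String)
myList.Add("alpha")
myList.Add("beta")
myList.Add("gamma")

Dim sb As New System.Text.StringBuilder()
For Each data As String In myList
    sb.Append(data)
Next
' sb.ToString() is now "alphabetagamma"
```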
So, rule of thumb: whenever calling any method, expect hidden costs (magnified if you’re in a loop) and work to minimize them. Call the methods outside of the loop if they are providing data that will remain the same throughout the loop, such as a count, and avoid them altogether if there’s an alternative way to iterate through the data that doesn’t implicitly involve resetting your position. Your customers will appreciate the time savings!
‘Til next time,
Great article, informative and to the point, an eye opener for me.
Keep them coming,
One other hidden cost of your original example is that it will throw an unhandled exception each time it is executed. It should have been
For i As Integer = 0 To MyList.Count - 1
Otherwise, a very good article!
Another issue - unless you have really simple code in there, there is a risk that the number of elements in MyList will change while you are iterating the list (don'tcha just love multi-threading, particularly when it can be triggered when you don't expect it - read: COM interop). If you're going to cache the count of items, you'd better lock that collection.
You state that MyList.Count will be evaluated on every iteration of the loop. It's my understanding that this is true in C#, but not VB.NET. http://msdn.microsoft.com/en-us/library/5z06z1kb.aspx states clearly that the start, end and step expressions are only evaluated once, before the loop begins.
I'm not convinced that your argument about the last block of code having the least hidden costs is true. You've simply moved from one set of hidden costs to another and made the code more complex in the process. I'm curious to know whether I'm thinking this through properly (I didn't do any perf testing).
Firstly, every good compiler book has a chapter on optimization. One of the first things every good optimization chapter covers is redundant expression removal. Count being called inside a for loop is a classic case. Any good compiler will convert the code to the local-variable form that you gave, so doing it yourself gains you nothing. In the case of .NET, I'd expect the JIT to do it if the compiler itself didn't.
Secondly, you've replaced N compares and method calls (one for each iteration) with an object creation (the enumerator), N method calls (MoveNext), N comparisons (while MoveNext), and N property invocations (Current). I fail to see how this is an improvement on perf. Even if you assume the list indexer is slow, surely it is faster than all of this. Unless the enumerator is a private nested class, it is unlikely to have access to any better methods than your code. Complicating this even more is the check for IDisposable, which the compiler may or may not optimize out.
Wow, lots of comments -- good! In order, then:
@Jim: Thanks for the compliment. I've actually been remiss in posting recently as we've been a little busy, but I hope to ramp up a bit on the blog now.
@Karl: D'oh! After all these years, my 0-based iterations are still colliding with my 1-based iterations. Thanks for pointing this out; I've fixed it in the article. (Obviously, this is code that I would never actually run for obvious reasons...)
@KG2V: Absolutely correct, and in fact I had originally planned to have another twist in this post where I iterated through the list and deleted entries, since that shows off that particular problem very well. However, the article was already getting overlong for the point I was really trying to make, which is simply "try to control your performance as much as possible" rather than "avoid assuming that your list is constant," so I removed the "delete" code and just left the Insert code in so readers would understand how the objects in the list were connected. The "changing list" topic is what I'll be doing in my next post later this month.
@Phil Wells: That is a big goof on my part -- I actually hadn't realized that difference in VB. I'll convert the example to Do...While to make the point. (This actually helps my next blog post, though, which involves deletion in a loop.)
@Michael Taylor: The costs are less because iteration/enumeration just does an O(n) walk -- it doesn't care about how many items are in the collection, it just walks until the end. You are absolutely right that there are still costs (and indeed you must always assume that). Count, however, will *not* be optimized away (except, as noted by Phil, for VB) because the value of Count, being a method, must be assumed to vary by the compiler. Often a compiler will inline a method, which definitely helps, but the code is still being run.
OK. That's fixed up now (how embarrassing!) -- a number of "-1"s added in, and the conversion to a Do loop to make the initial point. I'm sorry for the goof-up.
More comments: I phrased my response to Michael ambiguously -- I should have emphasized the "just", meaning that I don't have to call Count() at all before the O(n) walk. The For...Next loop therefore isn't necessarily a bad way to go, even if the method is in the stop condition, given that VB only evaluates it once. The hidden cost is "how much does Count() cost me," because it's still called at least once -- so it's something to be aware of. I prefer to use enumeration wherever I can, for this and several other reasons. First, I don't care about the count nor how it's generated, so it isn't costing me anything -- it isn't even assessed once. Second, the reference object is being reused just as it would be in my loop, so no extra cost there. Third, the iterator checking for Nothing as the stop condition is analogous to the For loop checking its stop condition. And fourth, I won't ever have to worry about the end index and whether or not I added a -1 to the stop condition :-). Ultimately, my savings in this case are the cost of "Count" plus some peace of mind.
Bleah -- sorry, too much rewrite, I'll try to do better next time.
Doesn't the for each version convert the direct "Item" call into a virtual "Current"? The latter comes with a hidden cost of no JIT inlining.
Yes, that's correct, and as you're implying, makes the choice even better.
I've always thought that the compiler optimized the code when using List.Count() or Array.Length()
I ran a test and looked at the code using Reflector.
I compiled the test code below...
Dim list As New List(Of Integer)(20)
Dim x As Integer = 0
For i = 0 To list.Count - 1
    ' Do something useless
    x = Me.list(i) + 1
Next
Reflector produces code like this...
Me.list = New List(Of Integer)(20)
Dim num3 As Integer = (Me.list.Count - 1)
Dim i As Integer = 0
Do While (i <= num3)
    Dim num As Integer = (Me.list.Item(i) + 1)
    i += 1
Loop
Is Reflector optimizing the code or is it showing the results of the compiler optimizing the code?
Sorry. I missed Phil Wells' comment. Asked and answered.
No problem... this blog post caused plenty of confusion, to my discredit. :-)
Overly expensive maintenance and support cost is the main one I find in reading someone else's code.
Dumbed down, simple, 1st year out of college code is often times the easiest and lowest cost in the long run. It's easiest to read and easiest to not make a careless mistake when modifying.
Is there a tool to quantify execution time between points in code?
In classic ASP I used a timer that was very helpful, but in the case of .NET a lot of machinery runs under the hood from a single call, and it's not at all easy to intuit the cost without using it for a while.
Yes, the Team System version of Visual Studio has a performance profiler -- see http://msdn.microsoft.com/en-us/library/ms182371.aspx.
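For quick, informal spot checks (as opposed to full profiling), the System.Diagnostics.Stopwatch class can bracket a region of code. A minimal sketch, where DoWork is a hypothetical stand-in for whatever you want to measure:

```vbnet
' Rough timing sketch using System.Diagnostics.Stopwatch.
' DoWork() is a made-up placeholder for the code under test.
Dim sw As System.Diagnostics.Stopwatch = System.Diagnostics.Stopwatch.StartNew()
DoWork()
sw.Stop()
Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds)
```

A Stopwatch won't tell you where the time went inside the call, of course, which is why the profiler is the better tool for anything beyond a sanity check.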