March, 2010

  • Kirill Osenkov

    Interview question

    • 47 Comments

    Here’s a nice simple interview question:

    In a given .NET string, assume there are line breaks in standard \r\n form (basically Environment.NewLine).

    Write a method that inserts a space between two consecutive line breaks to separate any two line breaks from each other.

    Also, anyone venture a guess what is a practical application for such a method?

  • Kirill Osenkov

    Interview answers

    • 8 Comments

    In the previous post, I’ve come up with this interview question:

    In a given .NET string, assume there are line breaks in standard \r\n form (basically Environment.NewLine).

    Write a method that inserts a space between two consecutive line breaks to separate any two line breaks from each other.

    Update: Rik Hemsley has posted an absolutely fantastic summary in form of a Visual Studio solution with all the answers from the comments, unit-tests and benchmarks:

    http://code.google.com/p/kirill-question/

    Jordan Terrell’s RegExp-based solution seems to be the fastest one.

    Thanks Rik! This is very helpful!

    Source of the problem

    First I’ll explain how I came up with the problem. I was writing some code that generated HTML, specifically, some text inside the <PRE> tag, interspersed with <BR /> tags. I’m not an expert in HTML, that’s why it came as a surprise to me that if you have multiple consecutive <BR /> elements, the browser will only render one line break (well, at least the browser I checked it on - IE8). My code generated the <BR /> tags from Environment.NewLine substrings in the source text. So, to preserve empty lines in the resulting HTML, I had to intersperse the line breaks with at least one space – then IE would keep the empty lines when rendering <PRE> contents. I didn’t keep direct line breaks because some HTML viewers that I care about don’t render them correctly.

    My original solution

    Anyway, I’ve gotten a lot of great responses that contain a lot of food for thought. Mostly I was amazed how the responses actually reflected my own thinking as I was coming up with a solution. My ultimate solution (which was also correctly suggested by ram) was this:

        public static string IntersperseLineBreaks(string source)
        {
            string result = source.Replace("\n\r", "\n \r");
            return result;
        }

    A temporary local variable here is for your debugging convenience until my team comes up with a way to inspect the return value of a method in the debugger.

    Ambiguous problem statement

    Of course, banko correctly noted that the problem is stated ambiguosly, since Environment.NewLine is “\r\n” on Windows, and “\n” on Unix. Besides, as Jordan Terrell mentions, the behavior is unclear on source texts that contain wild mixes like “\r\n\r” and “\r\n\n”. Jordan also suggests that I include unit-tests to specify correct solutions.

    Well, I apologize for a problem statement that is not specified well enough (I simply didn’t have time to be that thorough!). I didn’t do it on purpose, but now when I think about it, most real-world requirements are ambiguous and not well specified. It is always interesting to see how a candidate deals with ambiguity and unclear requirements. So be prepared to expect unclear problem statements in your interviews and deal with them gracefully.

    There are several good strategies to impress the interviewer in this case, the simplest being to ask clarifying questions in the places where you think the requirements are unclear. Let them know that you are trying to explicitly understand and model the requirements in your head and are filling in the gaps.

    Another approach is to explicitly identify and parameterize the variability in the requirements and give a generic solution that works in all cases. Finally, if the problem statement is unclear, you can totally assert, assume or demand requirements that make your life easier. For instance, I remember I said “let’s assume that C# has tuples” during my interview at Microsoft. The interviewer laughed and said OK. I produced a nice solution that used tuples, and then he asked how would I change it to not use tuples. But that’s a totally different question ;)

    Assessment criteria

    If I were an interviewer (and I’ve never been one yet) and I would happen to ask this question, I guess I would look at several things in answers:

    1. Code reuse (e.g. my solution above relies on the Replace implementation in the framework vs. reimplementing a custom version of it for this particular problem, which might be error prone). However the point of interview questions is usually to see how people implement algorithms, so I guess for a sorting question an answer like List<T>.Sort() would rate very high on code reuse, but pretty much useless otherwise :)
    2. Correctness – some people gave similar solutions to mine, which call replace twice, but are more resilient to “weird input streams” with non-normalized line breaks. Such solutions will be twice as slow (since they will do full string replacement twice), but if a candidate understands the trade-offs between performance and correctness, I’m totally fine with it.
    3. Performance – writing your own StringBuilder-based implementation of Replace specifically suited for this particular problem is totally OK. In fact, some people gave very efficient solutions using this approach (e.g. Jared Parsons (a developer who I work with, who sits in the office down the hall) had built a custom ‘replacer’ state machine in F#) and Thomas Levesque correctly used a StringBuilder to avoid unnecessary string allocations. Some people fell into the trap of using string concatenation instead of a StringBuilder, which is by far less efficient, since it allocates ~ O(n^2) memory vs. O(n log n) memory allocated by StringBuilder. I would want a .NET candidate to understand how strings work in .NET, what are allocations, how GC works, and why string concatenation is bad when used in a loop.
    4. Style – having a concise declarative version is great when performance is not critical. Having readable code that is expressive and easy to maintain can sometimes trump performance. Moreover, usually people who are able to express their thoughts in code concisely will likely also master the skill of performance tuning when necessary.

    So I guess it’s impossible to declare a single winner because different solutions win when different assessment criteria are used. In terms of readability, getting things done and reusing as much as possible, I like ram's solution (and mine :P), even if it won’t work in certain scenarios (I’m OK with making assumptions about the source text). In terms of algorithm implementation, other solutions are certainly much better, those that use a StringBuilder or build the resulting array char-by-char. Finally, in terms of expressiveness I really liked some declarative solutions (using LINQ), even if they’re less effective (e.g. use string concatenation).

    string.Replace(“\r\n\r\n”, “\r\n \r\n”)

    One solution deserves some special explanation since it nicely demonstrates the core of the problem. The Replace method continues matching substrings after the end of the replacement, so if your replacement patterns overlap, it won’t replace the second one:

    image

    To overcome this problem, some people suggested the following:

            while (str.Contains("\r\n\r\n"))
                str = str.Replace("\r\n\r\n", "\r\n \r\n");

    Note that this loop will never run more than twice (this is where I goofed myself). GSEJ corrected me, pointing out that only two passes are necessary – first pass leaves a maximum of two consecutive line breaks, second pass eliminates them all.

    However we notice that in the replacement pattern (“\r\n\r\n”, “\r\n \r\n”), the start and the end are the same (hence the first and last chars overlap with previous and next occurrences of the pattern). The pattern is redundant, so we can make it shorter and only replace the crux of the pattern: Replace(“\n\r”, “\n \r”):

    image

    This pattern is non-redundant and doesn’t overlap with the neighbors, hence no second pass required.

    Well, anyway, thanks everyone for your answers! It was truly interesting for me and hopefully educational and entertaining for some of you.

  • Kirill Osenkov

    &apos; is in XML, in HTML use &#39;

    • 4 Comments

    I just got hit by a very confusing "by design" behavior and it took me a while to figure out what's going on.

    Here is the line of code:

        text = System.Security.SecurityElement.Escape(text);

    This method replaces invalid XML characters in a string with their valid XML equivalent.

    The problem that I had is that when escaping some VB code using this method and then pasting it into Windows Live Writer, VB comments ' became &amp;apos;.

    Well, it turns out, XML supports &apos; to denote the apostrophe symbol '. However HTML doesn't officially support &apos; and hence Live Writer "HTML-escaped" my already "XML-escaped" string.

    Solution:

        text = System.Security.SecurityElement.Escape(text);
        // HTML doesn't support XML's &apos;
        // need to use &#39; instead
        // http://www.w3.org/TR/html4/sgml/entities.html
        // http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2005-October/004973.html
        // http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
        // http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/
        // http://nedbatchelder.com/blog/200703/random_html_factoid_no_apos.html
        text = text.Replace("&apos;", "&#39;");
  • Kirill Osenkov

    Stress testing Visual Studio 2010 – Part 2

    • 1 Comments

    Read part 1

    In the first part of this series I've started talking about our general approach to stress testing Visual Studio. In this post I'll talk about what parameters we're measuring. In the next post I’ll explain how we're measuring them and what tools we use for it.

    VM

    The main metric that we are tracking is the devenv.exe process virtual memory. Virtual memory describes the address space of the process, and it can be backed by RAM or page file on disk. Until a region in the address space is committed, it doesn’t consume any actual memory.

    Our primary goal is to not let VM grow over 1.5 GB, because otherwise the address space on 32-bit OS becomes so fragmented that we are unable to allocate new chunks of memory and crash with OOM (Out-Of-Memory). Fortunately we’re now fulfilling this goal even on very large projects. Typical Visual Studio 2010 VM size on startup is anywhere from 300-500 MB. Note however that this doesn’t mean that it consumes that much RAM, no. It merely means that the process reserved this much address space which it might or might not actually fill in the future with “real memory” (committed memory that actually consumes RAM or page file).

    Our secondary goal is that VM doesn’t keep growing if VS is not working with new data (e.g. there are no memory leaks).

    If VM grows in our tests, it typically grows in chunks of approximately 15-20 MB. VM growth is usually caused by the private/committed memory growing.

    Working set

    Working set is a subset of virtual memory currently being stored in the RAM. If the memory is swapped out to the page file, working set may drop to almost nothing. This is the memory that gets freed when you minimize your Windows application or call SetProcessWorkingSetSize. It’s not a very useful counter since depending on how much is swapped out to the page file working set size can differ dramatically.

    Private bytes

    A more useful measure is the actual amount of memory consumed by this particular instance of the application not shared with other processes. It can be backed by physical memory (RAM) or page file and serves as a more exact measurement of what does your process consume.

    http://www.itwriting.com/dotnetmem.php

    Process\Privates Bytes shows all private bytes used in the process (including native memory).

    Committed bytes

    The .NET CLR Memory\# Total Committed Bytes is the private bytes used for managed heap.

    Bytes in all 4 managed heaps

    Another useful measurement is the consumed managed memory. The CLR currently has 3 generations in the garbage collector: 0-gen heap, 1-gen heap and 2-gen heap (GC.MaxGeneration currently returns 2). Moreover, for objects larger than 85,000 bytes there is a special fourth heap called the “large object heap”. The LOH is not compacted.

    Objects in gen-0 are short lived, young, recently allocated instances. As they survive garbage collections, they get promoted to the first and then the second generation. Gen-2 heap contains long lived objects that aren’t collected very often. If you have a memory leak, your leaked objects will most likely end up being promoted to the Gen-2 heap and will stay there.

    GC Handles

    Since VS is a mixed managed-native application (increasingly managed with every release), COM interop still plays an important role. System.Runtime.InteropServices.GCHandle is a managed structure that provides a means to access a managed object from unmanaged memory. You use GC Handles to prevent the GC from collecting a managed object if it’s only being used from native code. GCHandles are also used to pin an object in memory and prevent the GC from moving it. Typically a VS process has around 30,000-40,000 GC Handles, which is normal. However if we see GC Handles grow up to 70,000 and beyond, we likely have a managed/native memory leak somewhere.

    CCW

    CCW, or COM-Callable-Wrapper, is another COM interop counter that we’re tracking. If it grows above 2000, I’ll likely start investigating a leak bug. CCW and GC Handles sometimes leak together (if one is leaking, then the other is leaking).

    GDI handles

    GDI handles are objects used by the operating system to describe brushes, pens, and other parts of GDI drawing system.

    A Windows process can allocate a max of 10,000 GDI handles.

    If there is a GDI leak, we usually notice it very soon since if there is a leak, 10,000 get exhausted relatively fast. If you notice drawing weirdness or some controls don’t refresh correctly, check the GDI objects column in the Task Manager for your process. Tracking down GDI leaks is relatively easy – find what operation increases the number of GDI handles, and go through the code – search for places where you forget to call ReleaseDC, DeleteObject and such.

    Many people are surprised that Visual Studio 2010, a WPF application, can still leak GDI handles. The explanation is that some remaining parts of the UI are still written in native code/GDI, and it’s them who is typically leaking, if any. WPF doesn’t consume GDI objects and hence can’t leak them.

    USER objects

    User objects are very similar to GDI in nature, they’re also used by the operating system. To be frank, I’ve never seen them leaking.

    Handles

    OS handles are typically created when you open a file, a registry key, a semaphore or a mutex, and the like, or create a thread for instance. Typically the code should close the handle after using it (ever forgot to close a file?). Handles are measured by the Task Manager or also check out the SysInternals Handle tool.

    Threads

    The number of threads is worth watching because a typical .NET thread allocates 1 MB for its stack. Threads and VM are typically related – if you’re seeing a sudden 1 or 2 MB jump in VM, check the threads - it’s likely the runtime has started 1 or 2 new threads.

    VS has typically around 30-50 threads. We’re working to bring this number down wherever we can, because it’s generally good to not start your own threads, but instead preferably use the thread pool (ThreadPool.QueueUserWorkItem) and abstractions such as the TPL (Task Parallel Library). Speaking of which, if you have a chance to take Jeffrey Richter’s training on threading, go take it. It’s an awesome experience.

    In the next post I will talk about how exactly we measure these values and what tools we’re using for this.

  • Kirill Osenkov

    LiveGeometry @ Coding4Fun

    • 0 Comments

    Check out this article I wrote for Coding4Fun:

    http://blogs.msdn.com/coding4fun/archive/2010/03/01/9971021.aspx

    Hope you like it!

Page 1 of 1 (5 items)