In the first step of profile-guided optimization, we reduced the total CPU sample count of ReadNewLog.ReadFile from 6,223 to 5,803; the second step reduced it further to 3,982. Wouldn't it be nice if we could get it below 3,111 samples, essentially doubling the speed of CLRProfiler's profile loading?

The biggest targets now are ReadChar and ReadInt, which have 1,573 and 1,160 exclusive samples respectively.

Here is the current implementation of ReadChar and ReadInt:

 

internal int ReadChar()
{
    pos++;
    if (bufPos < bufLevel)
        return buffer[bufPos++];
    else
        return FillBuffer();
}

int ReadInt()
{
    while (c == ' ' || c == '\t')
        c = ReadChar();
    bool negative = false;
    if (c == '-')
    {
        negative = true;
        c = ReadChar();
    }
    if (c >= '0' && c <= '9')
    {
        int value = 0;
        if (c == '0')
        {
            c = ReadChar();
            if (c == 'x' || c == 'X')
                value = ReadHex();
        }
        while (c >= '0' && c <= '9')
        {
            value = value * 10 + c - '0';
            c = ReadChar();
        }

        if (negative)
            value = -value;
        return value;
    }
    else
    {
        return -1;
    }
}

Here is the modified version:

int FastReadInt()
{
    int lc = c;
    int len = 0;

    while (lc == ' ' || lc == '\t')
    {
        if (bufPos < bufLevel)
            lc = buffer[bufPos++];
        else
            lc = FillBuffer();

        len++;
    }

    uint diff = (uint) (lc - '0');

    int value = 0;

    if (diff <= 9)
    {
        do
        {
            value = value * 10 + (int) diff;

            if (bufPos < bufLevel)
                lc = buffer[bufPos++];
            else
                lc = FillBuffer();

            len++;
            diff = (uint) (lc - '0');
        }
        while (diff <= 9);
    }
    else
    {
        value = -1;
    }

    c = lc;
    pos += len;

    return value;
}
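To check that the fast path agrees with the old one on the input it is meant for, the two routines can be run side by side on the same text. The following is only a sketch under stated assumptions: MiniReader is a hypothetical stand-in for the real reader class, the whole input sits in one in-memory buffer, FillBuffer is assumed to return -1 at end of input, and SlowReadInt is ReadInt reduced to its decimal path (no sign, no hex).

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the real reader class, for illustration only.
class MiniReader
{
    readonly byte[] buffer;
    readonly int bufLevel;
    int bufPos;
    public int c;     // current character, as in the original reader
    public long pos;  // offset of the current character

    public MiniReader(string text)
    {
        buffer = Encoding.ASCII.GetBytes(text);
        bufLevel = buffer.Length;
        c = ReadChar();  // prime c with the first character
    }

    // Assumption: the real FillBuffer refills from the file; here the
    // buffer already holds everything, so we only signal end of input.
    int FillBuffer() => -1;

    public int ReadChar()
    {
        pos++;
        return bufPos < bufLevel ? buffer[bufPos++] : FillBuffer();
    }

    // ReadInt reduced to unsigned decimal numbers.
    public int SlowReadInt()
    {
        while (c == ' ' || c == '\t')
            c = ReadChar();
        if (c < '0' || c > '9')
            return -1;
        int value = 0;
        while (c >= '0' && c <= '9')
        {
            value = value * 10 + c - '0';
            c = ReadChar();
        }
        return value;
    }

    // Same logic as FastReadInt: locals inside the loops, members
    // written back once at the end.
    public int FastReadInt()
    {
        int lc = c;
        int len = 0;
        while (lc == ' ' || lc == '\t')
        {
            lc = bufPos < bufLevel ? buffer[bufPos++] : FillBuffer();
            len++;
        }
        uint diff = unchecked((uint)(lc - '0'));
        int value = -1;
        if (diff <= 9)
        {
            value = 0;
            do
            {
                value = value * 10 + (int)diff;
                lc = bufPos < bufLevel ? buffer[bufPos++] : FillBuffer();
                len++;
                diff = unchecked((uint)(lc - '0'));
            }
            while (diff <= 9);
        }
        c = lc;
        pos += len;
        return value;
    }
}
```

Creating two MiniReader instances over the same string and calling SlowReadInt on one and FastReadInt on the other should leave the return value, c and pos identical after every call, as long as the text contains no negative or hexadecimal numbers.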

 

We change only the two most frequent calls to ReadInt, in call ('c') command processing, to try the new FastReadInt method:

    case 'C':
    case 'c':
    {
        c = ReadChar();
        if (pos < startFileOffset || pos >= endFileOffset)
        {
            while (c >= ' ')
                c = ReadChar();
            break;
        }
        int threadIndex = FastReadInt();
        int stackTraceIndex = FastReadInt();

 

Result for top-10 most expensive (exclusive) functions:

ReadFile now only has 2,977 inclusive samples. We more than doubled the loading performance of CLRProfiler in 3 simple steps.
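The "more than doubled" claim can be checked directly from the sample counts quoted in this series. The sketch below assumes that inclusive sample counts are proportional to CPU time; SampleMath and Speedup are names made up for this illustration.

```csharp
using System;

static class SampleMath
{
    // Inclusive ReadFile sample counts quoted in the text:
    // baseline, after step 1, after step 2, after step 3.
    public static readonly int[] Samples = { 6223, 5803, 3982, 2977 };

    // Speedup relative to the baseline, as a ratio of sample counts.
    public static double Speedup(int baseline, int current)
    {
        return (double)baseline / current;
    }
}
```

Speedup(6223, 2977) comes out slightly above 2, while the value after the second step, Speedup(6223, 3982), is still below 2.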

Here are the changes:

  • pos and c are class member variables, and pos is a long to support huge files. Updating them for every character is not cheap. FastReadInt works on local variables instead and writes the member variables back once, just before it returns.
  • ReadInt has too many features built in: it supports negative integers and hexadecimal numbers. FastReadInt drops both, which is safe when it is called from Call command processing.
  • With the simplified code, a character is read at only two places, so we can easily inline those two calls.
  • The (c >= '0' && c <= '9') double comparison can be replaced by a subtraction and a single unsigned integer comparison (I just learned this trick this week while reading the BCL source code).
  • We do not need to check the first digit twice as in the old code.
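The unsigned comparison trick from the list above is easy to check in isolation. Here is a minimal sketch; the DigitCheck class and its method names are made up for this illustration, and the unchecked keyword is only there in case the project is compiled with overflow checking turned on.

```csharp
using System;

static class DigitCheck
{
    // The straightforward range test: two comparisons.
    public static bool IsDigitTwoCompares(int c)
    {
        return c >= '0' && c <= '9';
    }

    // Subtract '0' and reinterpret the result as unsigned. Anything
    // below '0' wraps around to a very large value, so one comparison
    // against 9 rejects both sides of the range.
    public static bool IsDigitOneCompare(int c)
    {
        return unchecked((uint)(c - '0')) <= 9;
    }
}
```

Looping over every character value (plus -1, the assumed end-of-input marker) and comparing the two methods should show that they always agree.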

There are still things we could do to improve FastReadInt's performance. But now that we've demonstrated the power of profile-guided CPU time optimization to improve CLRProfiler's profile loading time, we should move on to other optimizations.