Why we need a new lifetime management model

 

Managed code relies on Garbage Collection (GC) to manage the lifetime of object. For Add-In, object reference can be cross-appdomain or cross-process. We cannot rely on GC solely to give us automatic lifetime management. We need a new model that works cross any isolation boundary and can keep Add-In alive when it is in use and shut it down when it ends its lifetime. We also want to make sure that the container for the Add-In (either appdomain or process) can be recycled when there is no active Add-In.

 

We created a lifetime model based on remoting infrastructure. In this model, every object cross the isolation boundary will be the ISponsor of itself. Remoting infrastructure will ask the sponsor whether it should renew the lease and keep it alive. CLR Add-In model has implemented ISponsor logic in a class called ContractBase, which is supposed to be used by all contract based objects that go across the isolation boundary. Add-In and pipeline developers can easily use our ContractBase as the base class for lifetime management.  The purpose of this blog is to help you understand our lifetime management model and avoid a few bad coding patterns in your Add-In development.

 

 

Lifetime of Add-In and Reference Counting

 

If you read our blogs or wrote a simple Add-In pipeline before, you would know that an Add-In is kept alive by Add-In Adapter which inherits from ContractBase.  The Add-In lifetime management is essentially ContractBase lifetime management.

 

Below is how the ContractBase life cycle looks like. Assuming the left side is the LA appdomain and the right side is the RA appdomain (don’t assume LA is the Host and RA is the Add-In. It could be the other way around for data adapters)

 

2 contract   ---------------------------------------------------------- 1 new ContractBase( );  TotalToken=0

3 contract.AcquireLifetimeToken ------------------------------- 4 ContractBase; TotalToken =1

5 Use contract functions  ----------------------------------------- 5 ContractBase renew lease since TotalToken!=0

6 contract.RevokeLifetimeToken  ------------------------------- 7 ContractBase; TotalToken=0

9 contract cannot be used anymore -------------------------------8 If TotalToken==0 disconnect or AD.Unload         

 

First a new ContractBase object will be created in RA and passed to LA as a MarhalByRefObject. Inside LA this object will be treated as a contract (typically it implements both IContract and other functions that are supposed to go across the isolation boundary).

 

The contract can call AcquireLifetimeToken to notify RA to increase the token count. Once the token has been acquired, the contract interface can be considered as a live object.  

 

I used 5 twice. It means that operations at both sides of the isolation boundary can execute simultaneously.

 

When contract user finished all the work it intended to do, it could call contract.RevokeLifetimeToken to notify RA side again. This API will decrease the ref count.

 

Once the ref count reaches 0 it won’t renew the lease; therefore it is disconnected. If the ContractBase is the only adapter of the appdomain and the appdomain is created by our activation API, we may even unload the appdomain.

 

If we try to use the contract in LA after it is disconnected, we will see exceptions in LA (either RemotingException or AppdomainUnloadedException).

 

 

Understand ContractHandle, Remoting and Garbage Collection

 

From above explanation, we may notice that the lifetime management work is really done by two APIs in IContract, AcquireLifetimeToken and RevokeLifetimeToken. If you compare IContract and IUnknown, you will agree that both interfaces carry some similarities. One advantage of IContract is that it uses a list of tokens to keep track of all the references. Each reference will get a unique token. Therefore it is not likely to release someone else’s token, which is common in COM and it isvery hard to debug.

 

To improve the user experience dealing with the IContract, we created the ContractHandle class. ContractHandle calls AcquireLifetimeToken in the constructor and RevokeLifetimeToken in the finalizer or its dispose method. This class actually brings us close to the GC world. Since GC can manage the lifetime of ContractHandle, it will call RevokeLifetimeToken if it sees no root for ContractHandle. Acquire-Revoke lifetime token will automatically pair up, and we won’t confuse ourselves with complex if-else blocks.

 

ContractHandle brings us closer to a GC based world. There are still some significant difference between GC and our lifetime management model. We do have some guidelines in terms of using the ContractBase.

 

1.        Always hold a contract handle before using the contract functionality.

2.        Avoid decreasing token count down to zero if contract is still in use.

3.        Avoid circular reference of IContract.

4.        Be aware of JIT optimization while using ContractHandle.

 

Violating above rules will cause serious bugs in your product. Even worse, the bugs are totally not deterministic. If GC does not happen at the right moment, you may not even be able to find it in your dev test environment. But sooner or later, customers will see them.

 

Fortunately, we have run into problems violating above rules. We learned how to debug them, fix them. We want to blog about it so that you don’t need to make the same mistakes.

 

 

Demo Scenario

 

To illustrate the issues that user may see in lifetime management, I created below sample. This sample contains several bugs that I will discuss later. In our pipeline model, Add-In and Add-In base are not really interesting for lifetime management. To simplify the scenario, Add-In and Addin base are intentionally omitted. We will focus on contract, host side adapter and Add-In adapter.

 

Here is the contract.  IDemoContract contains two methods, Reverse and StoreContract. IDemoDataContract contains getter setter to an integer value.

 

    [AddInContract]

    public interface IDemoContract : IContract

    {

        IDemoDataContract Reverse(IDemoDataContract data);

        void StoreContract(IDemoDataContract data);

    }

    public interface IDemoDataContract : IContract

    {

        int GetValue();       

        void SetValue( int x);       

    }

    public delegate void Func();

 

Here is the AddInAdapter. It implments the IDemoContract. It also contains DataAdapter which implements IDemoDataContract.   To make our debugging experience better, we changed the LeaseTime in the Add-In appdomain to 5 seconds. The default lease time is 5 minutes that is tool long for debugging.

 

    [AddInAdapter]

    public class TestAddInAdapter : ContractBase, IDemoContract

    {

        ITestAddInBase TestAddInBase;

        ContractHandle storedContractHandle;

 

        internal TestAddInAdapter(ITestAddInBase inTestAddInBase)

        {

            TestAddInBase = inTestAddInBase;

            System.Runtime.Remoting.Lifetime.LifetimeServices.LeaseTime = new TimeSpan(0, 0, 5);

            System.Runtime.Remoting.Lifetime.LifetimeServices.RenewOnCallTime = new TimeSpan(0, 0, 5);

        }

        public IDemoDataContract Reverse(IDemoDataContract data)

        {

            DataAdapter retData = new DataAdapter();

            retData.SetValue(0 - data.GetValue());

            return retData;

        }

        public void StoreContract(IDemoDataContract data)

        {

            storedContractHandle = new ContractHandle(data);

        }       

    }  

 

    public class DataAdapter : ContractBase, IDemoDataContract

    {

        int m_Value = 0;

        public int GetValue() { return m_Value; }

        public void SetValue(int num) { m_Value = num; }      

    }

 

 

Here is the HostAdapter. It implements the host side view. DataAdapter2 also implements IDemoDataContract. It supports the SetTrigger function. It will write to console if x is set to zero. RunTest() is a simple method defined in HSVBlogSample

 

    [HostAdapter]

    public class TestHostAdapter : HSVBlogSample

    {

        IDemoContract contract;       

        ContractHandle handle;

 

        internal TestHostAdapter(IDemoContract inContract)

        {

            contract = inContract;

            handle = new ContractHandle(inContract);            

        }

 

        public override void RunTest()

        {

            DataAdapter2 data = new DataAdapter2();

            for (int i = 1; i <= 10; i++)

            {

                data.SetValue(i);

                data.SetTrigger(delegate { Trigger(); });

                IDemoDataContract retData = contract.Reverse(data);

                Console.WriteLine("The value is {0}", retData.GetValue());

            }

            contract.StoreContract(data);

        }

        public void Trigger() { Console.WriteLine("Value Set to Zero.");}

    }   

 

    public class DataAdapter2 : ContractBase, IDemoDataContract

    {

        int m_Value = 0;

        Func trigger;

       

        public int GetValue() { return m_Value; }

        public void SetValue(int num) { m_Value = num; if (m_Value == 0) trigger.Invoke();}

        public void SetTrigger(Func func) { trigger = func; }

    }       

 

 

Incorrect Coding Pattern -1  (unprotected IContract issue)

 

The first issue is in the implementation of Reverse method.

 

        public IDemoDataContract Reverse(IDemoDataContract data)

        {

            DataAdapter retData = new DataAdapter();

            retData.SetValue( 0 - data.GetValue() );

            return retData;

        }

 

The first question is where is data from. It is called from HostAdapter into AddInAdapter, therefore crossing the isolation boundary. It seems that nobody is managing the lifetime of data. It could be disconnected at the other side of the isolation boundary.

 

Since data is a contract, it is not safe to use it without taking a ContractHandle. There is a easy way to illustrate the issue here if you add Thread.Sleep() for 10 seconds before data.GetValue(). It is very likely to see a RemotingException.

 

One way to fix the issue is writing code like below

 

            using (ContractHandle handle = new ContractHandle(data))

            {

                DataAdapter retData = new DataAdapter();

                retData.SetValue(0 - data.GetValue());

                return retData;

            }

 

A little exercise: there us a similar issue in HostAdapter. Can you find it?

 

 

Incorrect Coding Pattern -2 (multiple reference issue)

 

Assuming you fixed above issue, the next type of issue will surface. It is a little more complex than the first one. You do need to know some background to understand this. Let’s look at the HostAdapter

     

            DataAdapter2 data = new DataAdapter2();

            for (int i = 1; i <= 10; i++)

            {

                data.SetValue(i);                                                             

                data.SetTrigger(delegate { Trigger(); });                       

                IDemoDataContract retData = contract.Reverse(data);  /// May throw Here

                Console.WriteLine("The value is {0}", retData.GetValue());

            }

            contract.StoreContract(data);                                              /// May throw Here

 

 

The problem here is that we are calling contract.Reverse(data) multiple times on the same data. Reverse(data) is executed at the other side of the isolation boundary. During the first cycle (i=1), the other side of the isolation boundary will have one ContractHandle attached with data.  After Reverse(data), that ContractHandle will go out of scope and could be disposed by GC. Therefore the TotalToken becomes zero for data. During the second cycle (i=2), the other side of isolation boundary will call ContractHandle constructor again on the same data, this will try to increase the token count from zero to one. This violates our rule No. 2 above.

 

Typically you won’t see any exception if GC does not happen. As your code logic becomes complex, this is very likely to give you surprise. If you put GC.Collect() before contract.Reverse, you will see an exception with this message “Once all outstanding references to a Contract have been revoked, new ones may not be acquired.  during the second cycle.

 

People may ask why don’t you disable this rule and allow token to go down to zero and go back up again. We cannot. We need a way to know when we could do clean up to remove dead appdomains and dispose unmanaged objectes etc. Removing this rule will potentially cause huge memory leak.

 

There are two solutions for this kind of problem. We can either create unique DataAdpater for each call. Or we can use a ContractHandle to increase the token count before it drops to zero like below.

 

            using (ContractHandle handle = new ContractHandle(data))

            {

                 for (int i = 1; i <= 10; i++)

                {

                    data.SetValue(i);

                    data.SetTrigger(delegate { Trigger(); });

                    IDemoDataContract retData = contract.Reverse(data);

                   Console.WriteLine("The value is {0}", retData.GetValue());

               }

               contract.StoreContract(data);

            }

 

 

Incorrect Coding Pattern -3 (circular reference issue)

 

I wrote a host with code like below. It simply activates a pipeline and call RunTest(). Token is a valid AddInToken we get from our discovery API. Remember that our HostAdapter inherits from HSVBlogSample.

 

            while (true)

            {

                HSVBlogSample hsv = token.Activate<HSVBlogSample>(AddInSecurityLevel.FullTrust);

                hsv.RunTest();

                hsv = null;

                Thread.Sleep(10000); GC.Collect(); GC.WaitForPendingFinalizers();

                Thread.Sleep(10000); GC.Collect(); GC.WaitForPendingFinalizers();

            }

 

Most people would not expect a memory leak in managed application like above. The host view is set to null. GC should take care of the rest.

 

We indeed has a memory leak here. Before we show the details, I want to explain a little bit about why we have multiple sleep and multiple GC.Collect here. Remember we set the lease time to 5 secondes. Sleep for 10 seconds help remoting to disconnect objects. In our lifetime management model, objects are not cleaned in one shot by GC. Depending on how you connect up your objects, we might need multiple GC and remoting-disconnection to clean up the memory.

 

In above example, no matter how many times we call GC or let the thread to sleep. The memory leak won’t go away. We can let the program run a few cycles and put a breakpoint at the beginning of the loop, then verify with SOS.

 

First, you can load SOS and dump appdomains.

 

.load sos

extension D:\WINDOWS\Microsoft.NET\Framework\v2.0.orcasx86ret\sos.dll loaded

 

!dumpdomain

 

--------------------------------------

Domain 9: 0028eaa0

LowFrequencyHeap: 0028eac4

HighFrequencyHeap: 0028eb1c

StubHeap: 0028eb74

Stage: OPEN

SecurityDescriptor: 002f6008

Name: Test

 

With above code, we can easily find OPEN appdomains like above besides the default appdomain for the host. Now we are sure that appdomains are not unloaded. Therefore someone must be keeping it alive from the host side. Someone must be holding a ContractHandle to keep the addin alive. The dumpheap command can verify that.

 

 

!dumpheap -type System.AddIn.Pipeline

 Address       MT     Size

0280a86c 05491914       20    

0280cec0 05259f6c       20    

0281906c 05491914       20    

0281c3dc 04f59f6c       20    

total 4 objects

Statistics:

      MT    Count    TotalSize Class Name

05259f6c        1           20 System.AddIn.Pipeline.ContractHandle

04f59f6c        1           20 System.AddIn.Pipeline.ContractHandle

05491914        2           40 System.AddIn.Pipeline.ContractHandle

Total 4 objects

 

From dumpheap output, we do see live ContracHandle objects in the memory. That is really strange. Let’s find out who is keeping them alive.

 

!gcroot 0280a86c

Scan Thread 2692 OSTHread a84

Scan Thread 1904 OSTHread 770

Scan Thread 2472 OSTHread 9a8

Scan Thread 2428 OSTHread 97c

Scan Thread 3524 OSTHread dc4

Scan Thread 3520 OSTHread dc0

DOMAIN(0021D738):HANDLE(WeakLn):1da10fc:Root:027bea50(System.Runtime.Remoting.Contexts.Context)->

027911b4(System.AppDomain)->

027be914(System.Runtime.Remoting.DomainSpecificRemotingData)->

0280ace4(System.Runtime.Remoting.Lifetime.LeaseManager)->

0280ad0c(System.Collections.Hashtable)->

0280ad44(System.Collections.Hashtable+bucket[])->

0280b6cc(System.Runtime.Remoting.Lifetime.Lease)->

0280ab2c(DataAdapter2)->

0280b0c0(Func)->

0280a85c(TestHostAdapter)->

0280a86c(System.AddIn.Pipeline.ContractHandle)

DOMAIN(0021D738):HANDLE(WeakSh):1da1298:Root:0280a85c(TestHostAdapter)->

02806818()

DOMAIN(0021D738):HANDLE(WeakSh):1da129c:Root:0280a85c(TestHostAdapter)->

02806818()

 

Examine the gcroot output closely will reveal why all those objects are alive.

ContractHandle is kept alive by TestHostAdapter

TestHostAdapter is kept alive by DataAdapter2’s trigger method.

DataAdapter2 is kept alive because remoting did not disconnect it.

It seems remoting LeaseManager still consider DataAdapter2 a live objects.

 

Why remoting did not disconnect DataAdapter2? Let’s dump the object and find it out.

 

!do 0280ab2c

Name: DataAdapter2

MethodTable: 05491c5c

EEClass: 051744a0

Size: 44(0x2c) bytes

 (D:\vbl\orcas\clrtest\testbin\AddIn\AddInImpl\RunBlogSampleCode\HostSideAdapters\Test1HostSideAdapters.dll)

Fields:

      MT    Field   Offset                 Type VT     Attr    Value Name

790f6f50  400018a        4        System.Object  0 instance 0280b200 __identity

00000000  4000143        8                       0 instance 0280ab58 m_lifetimeTokens

00000000  4000144        c                       0 instance 00000000 m_contractIdentifiers

790f6f50  4000145       10        System.Object  0 instance 0280ab70 m_contractIdentifiersLock

7910016c  4000146       14        System.Random  0 instance 0280ab7c m_random

79104974  4000147       1c       System.Boolean  1 instance        0 m_zeroReferencesLeft

790fb9c0  4000148       18         System.Int32  1 instance        0 m_tokenOfAppdomainOwner

790fb9c0  4000003       24         System.Int32  1 instance       10 m_Value

05491d64  4000004       20                 Func  0 instance 0280b0c0 trigger

 

Let’s examing m_lifetimeTokens which is a List<int> defined in ContractBase. The list is to hold all the acquired tokens.

 

!do 0280ab58

Name: System.Collections.Generic.List`1[[System.Int32, mscorlib]]

MethodTable: 7917bf7c

EEClass: 791a6c60

Size: 24(0x18) bytes

 (D:\WINDOWS\Microsoft.NET\Framework\v2.0.orcasx86ret\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)

Fields:

      MT    Field   Offset                 Type VT     Attr    Value Name

7910f3b0  40009c7        4       System.Int32[]  0 instance 0280ac88 _items

790fb9c0  40009c8        c         System.Int32  1 instance        1 _size

790fb9c0  40009c9       10         System.Int32  1 instance       23 _version

790f6f50  40009ca        8        System.Object  0 instance 00000000 _syncRoot

7910f3b0  40009cb        0       System.Int32[]  0   shared   static _emptyArray

 

It turns out the size of the list is 1, which means that someone is still referencing DataAdapter2. That is why remoting did not disconnect it.

 

Examing AddInAdapter code closely, we will see that TestHostAdapter keeps AddInAdapter alive.

AddInAdapter keeps the storedContractHandle alive, which holds the contract of DataAdapter2.

 

 

We got a big circular reference here. In the usual managed application, circular reference will be deemed as dead object. Unfortunaly, with the help of remoting infrastructure, our circular reference becomes real reference and kept alive by remoting LeaseManager.

 

To fix the above sample, we need to break the circular reference. One simple way is to define the Trigger() function in the HostAdapter to be static. Another way is to avoid storing ContractHandle in the AddInAdapter.

 

As you can imagine, even more complex circular reference pattern can be created. Stress testing is definitely helpful to discover such kind of issues.

 

 

Incorrect Coding Pattern -4 (JIT optimization issue)

 

When we discuss the first incorrect coding pattern, I left a question. There is one IContract with no protection in the HostAdapter. I don’t know whether you find out the answer or not. Here is the problematic code.

 

IDemoDataContract retData = contract.Reverse(data);

Console.WriteLine("The value is {0}", retData.GetValue());

 

Above code may throw exception while doing Console.WriteLine if retData gets disconnected. Since retData is from the other side of the isolation boundary, we need a ContractHandle to protect it.

 

Here is the fix.

 

IDemoDataContract retData = contract.Reverse(data);

ContractHandle handle2 = new ContractHandle(data);

Console.WriteLine("The value is {0}", retData.GetValue());

 

Now handle2 will call AcquireLifetimeToken and we have some protection here to make sure retData won’t be disconnected.

 

This is the same issue we discussed in coding pattern -1. So what is the incorrect coding pattern -4?

 

Take a second look at above fix. This actually is not the right way to fix the problem. The fix itself is the incorrect coding pattern that we want to discuss.

 

Assuming one GC happens right before Console.WriteLine, will the handle2 be collected? I have seen failures with this coding pattern. It seems that JIT optimization is smart enough to know that ContractHandle will not be referenced later, so it could be assumed dead. Then GC calls its finalizer. Therefore Console.WriteLine is not protected and may still throw object disconnected exception.

 

The recommended coding pattern looks like below.

using (ContractHandle handle2 = new ContractHandle(retData))

{

IDemoDataContract retData = contract.Reverse(data);

Console.WriteLine("The value is {0}", retData.GetValue());

}

 

You can also use a field in the adapter class to hold the ContractHandle if you are sure that you are not going to have circular reference issue.

 

 

Summary

 

In this blog, we looked at our Add-In lifetime management model. Our model sits between pure reference counting model and pure GC based model. It does have pros and cons. We also listed a few coding patterns that user may encounter in their development. Keeping the issues in mind will help you write good quality pipeline code.

All the issues we listed above may not cause problems immediately on a dev machine due to the nature of how GC works.  Stress testing and running tests on machines with different configuration are very helpful to discover issues like these before you deploy your app to customers’ machines.