SimpleScript Part Seven: Binder Skeleton

In Part Five I was discussing modules: there is a "global module" and any number of additional modules.  Each module is associated with a named item, and the only module associated with more than one named item is the global module.  This means that each module is going to need its own name table to keep track of the functions, variable names, values, and so on that it contains.  We'll call such devices "binders" because they bind names to values.

For reasons which will become clearer in future episodes, it is convenient to have the binder implement the IDispatch interface.  Normally if you had to implement IDispatch you'd build a type library, and then have your implementation of IDispatch call through to the ITypeInfo methods which look up dispatch ids, invoke methods, and so on.  There's a good reason for that: rolling your own IDispatch code can be extremely tricky.  But we're pretty much stuck here -- we haven't got the convenience of a known vtable interface to wrap a type library around at compile time.  The script engine name tables are going to be extremely dynamic, so we're going to have to roll our own IDispatch.

The code isn't nearly done yet, obviously, but I've got a good skeleton in place.  Take a look at binder.cpp for the details.   I've pretty much got the high-level semantics of the dispatch interface worked out, but the actual low-level implementation details all return E_NOTIMPL.  What I need to build here is a special hash table which can efficiently look up values by both numeric id and name.  That's not too challenging; what is going to be more difficult is implementing the SCRIPTITEM_GLOBALMEMBERS feature.  We need to be able to add another arbitrary dispatch object to our binder and dynamically aggregate all of its methods, without causing any collision in dispatch ids! 
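To make the "look up values by both numeric id and name" idea concrete, here is a minimal sketch of such a dual-index table -- not the binder.cpp implementation, just an illustration of the shape of the data structure. The `Entry` type and the method names are invented for this example; the real binder would traffic in DISPIDs and VARIANTs.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical entry; the real binder would store a dispatch id,
// a VARIANT value, flags, and so on.
struct Entry { std::string name; int value; };

class NameTable {
public:
    // Returns the id for the name, adding a fresh entry if absent.
    // Name -> id is one hash lookup.
    int GetIdOfName(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(entries_.size());
        entries_.push_back({name, 0});
        ids_.emplace(name, id);
        return id;
    }
    // Id -> entry is a bounds-checked array index.
    Entry* GetEntryById(int id) {
        return (id >= 0 && id < static_cast<int>(entries_.size()))
            ? &entries_[id] : nullptr;
    }
private:
    std::unordered_map<std::string, int> ids_;  // name -> id
    std::vector<Entry> entries_;                // id -> entry
};
```

Because ids are handed out densely, the id-to-entry direction needs no hashing at all; the SCRIPTITEM_GLOBALMEMBERS aggregation is the hard part precisely because ids coming from a foreign dispatch object are *not* under our control like this.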

Default property semantics are a little weird.  In the wacky world of late bound objects, an object may have a "default" method.  This feature was created so that you could call the Item method of a collection the same way you dereference an array in Visual Basic -- you make it look like a function call.  That is, since this code works:

Dim arr(10)
arr(1) = 123

then so should this code:

Set dict = CreateObject("Scripting.Dictionary")
dict(1) = 123

The last line is the same as

dict.item(1) = 123

Such is the wackiness of VB -- because the array dereference and function call syntax are conflated, they ended up making function calls on invisible methods be legal on the left hand side of an assignment statement.  The mind fairly boggles.   We'll be good citizens in SimpleScript.  If we have an object in our binder and someone attempts to call it, we'll just pass on the arguments to its default method and let it sort out what the right thing to do is. 
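The "pass it on" policy can be sketched in a few lines. In COM the default member lives at DISPID_VALUE (which is zero); everything else below -- the toy dispatch type, the callable map -- is invented purely to illustrate the forwarding decision, and is nothing like real IDispatch plumbing.

```cpp
#include <cassert>
#include <functional>
#include <map>

// In COM, the default member of a dispatch object has DISPID_VALUE == 0.
const int kDispidValue = 0;

// Toy stand-in for a dispatch object: a map from dispatch id to a callable.
struct ToyDispatch {
    std::map<int, std::function<int(int)>> members;
    int Invoke(int dispid, int arg) { return members.at(dispid)(arg); }
};

// The binder policy: when a bound value is itself an object and someone
// "calls" it, don't try to interpret the call -- forward the arguments
// to the object's default member and let it sort things out.
int CallBoundObject(ToyDispatch& obj, int arg) {
    return obj.Invoke(kDispidValue, arg);
}
```

This is why `dict(1)` and `dict.Item(1)` end up in the same place: the collection has nominated Item as its default member, so the forwarded call lands on it.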

As you can see, there is a whole lot of parameter validation, and altogether about eleven separate cases that I cover in the implementation of Invoke.  The actual implementation in VBScript and JScript is much, much more complex than this because they support IDispatchEx, garbage collection, property accessor functions, and numerous other features that complicate the code.  But still, this is the longest function you're likely to see me write in this project; I'll be very surprised if I write any code more complicated than the dispatch logic.

Speaking of garbage collection, I'm not quite sure what kind of garbage collector I'm going to build into SimpleScript.  Building a mark-and-sweep collector like JScript has would be instructive, but it would also take a lot of time and effort.  What do you guys think?  (If you look at invoke.cpp you'll see that I'm thinking ahead about issues we're going to run into in garbage collection.  We're going to run into related issues in the script execution code; it is possible for ill-behaved hosts to shut down the script engine in a callback while it is executing!  We need to be robust in the face of such behaviour.)

The next step is to get the actual guts of the binder working, and then build a list of code blocks to be executed when the script engine goes to started state.  The clone semantics for the code blocks are a little tricky, so we'll get that straightened away, and then maybe, just maybe, actually write a language parser.

UPDATE:  For your convenience, I've put a zip file containing all the sources at http://www.eric.lippert.com/simplescript.zip.  I'll keep this zip file updated as I change the sources on the blog.

  • re: What collector to build?

    I think it would be easier to pick a GC algorithm after we know what the language is going to be like.

    Here would be my hit list of design decisions to make before thinking about how to GC:

    Does the language support references?
    Does the language support global variables?
    Does the language support closures?
    Are the scope and lifetime of a variable closely related?
    What are the major object lifetime patterns?
    Does the average object contain other objects?
    What's the size of the average object?

    Up until this point there's been very little code that cares about GC and I think that's going to continue to be true for a while. There's no need to commit to a method early on.
  • Indeed! As I've mentioned before in my blog, normally the design process consists of a series of questions which dig deeply into the customer requirements. You figure out what problem they need to solve, and that drives the design.

    But with a pedagogic example like SimpleScript, the cart gets put before the horse. The "problem" that I'm looking to solve is to demonstrate practical issues in implementing language processors. What I'm looking for is what sort of problems people would like to see solved. If people want to see how a M&S GC is implemented, then I'll design a language that has circularly referenced objects that live beyond their scopes.

    I'll therefore broaden the question -- what sort of language features do you think would be instructive? I started this project with the intention of making the language simple because I figured that the interesting, poorly understood part was the semantics of the engine interfaces. But if it turns out that what people are really interested in are GCs, closures, expando objects, and the like, then I'll start laying the groundwork for those things earlier rather than trying to bolt them on at the end.
  • more re: What collector to build?

    What do you want to get out of this project? I'd love to see a copy collector because designing one is pretty interesting. And once you've got a working Cheney scan, it's easy to experiment with an incremental collector, generational collector, train collector, type segregating collector, whatever. But you seem to have some priorities in mind already about what you want to focus on.
  • And while that may look like a carefully crafted reply to your question, it was written before your post. I'd have to think a while about proper pedagogy and such.
  • Just to dejargonify that a bit, so that anyone following along at home who hasn't written a GC algorithm lately can keep up:

    * a mark-n-sweep garbage collector works by first marking every object in the heap as "collect me", then clearing the mark on all "known alive" objects. Then it clears the mark on everything that the "known alive" objects refer to, and so on, until all the dead objects are marked and none of the alive ones are marked. JScript uses a mark-n-sweep collector.

    The trick is to avoid deep recursion, as you can blow your stack pretty easily. JScript uses some very gross code to avoid recursion.

    Large long-lived working sets are expensive -- the .NET framework uses a generational M&S GC to deal with this problem, ie, it segregates objects by age and collects old objects less frequently, on the assumption that most objects die young.
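    A toy version of that description, with an explicit worklist instead of recursion so that a long reference chain can't blow the stack (the "very gross code" problem). The `Obj` type and ownership scheme here are invented for illustration only.

```cpp
#include <cassert>
#include <vector>

struct Obj {
    bool dead = true;        // the "collect me" mark
    std::vector<Obj*> refs;  // objects this one refers to
};

// heap owns the objects; roots are the "known alive" set.
void MarkAndSweep(std::vector<Obj*>& heap, const std::vector<Obj*>& roots) {
    for (Obj* o : heap) o->dead = true;           // mark: assume all dead
    std::vector<Obj*> work(roots.begin(), roots.end());
    while (!work.empty()) {                       // trace without recursion
        Obj* o = work.back(); work.pop_back();
        if (!o->dead) continue;                   // already known alive
        o->dead = false;
        for (Obj* r : o->refs) work.push_back(r);
    }
    std::vector<Obj*> live;
    for (Obj* o : heap)                           // sweep: free the dead
        if (o->dead) delete o; else live.push_back(o);
    heap.swap(live);
}
```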

    * A copy GC keeps two heaps. When a collection happens, every live object is copied from the old heap to the new heap, and then everything in the old heap is destroyed.

    A copy GC has a bunch of problems. You double the total amount of memory consumed. Objects cannot be "pinned" to any particular memory location. You must be able to fix up references to objects in the new heap. And the memory sloshes back and forth all the time; large working sets get quite expensive.

    The Cheney algorithm is a way to avoid recursion in a copy GC.

    * an incremental GC does a little bit of GC at a time rather than everything at once, and therefore distributes the time spent collecting more evenly.

    * The Train algorithm is an incremental copying GC which I don't understand, but I gather is pretty cool.

    * VBScript uses a simple stack-based garbage collector, because in VBScript an object's lifetime is tightly tied to its scope. As objects come into scope, they get stuck onto a stack. When a scope ends, we pop the stack back and free up everything.
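    The scope-tied scheme is simple enough to sketch completely; the names and the `std::string` stand-in for "object" below are invented for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

class ScopeStack {
public:
    // Entering a scope just remembers where the stack is now.
    size_t EnterScope() const { return objects_.size(); }
    // Objects get stuck onto the stack as they come into scope.
    void Track(std::string name) { objects_.push_back(std::move(name)); }
    // Ending the scope pops everything back to the remembered mark,
    // freeing the objects in LIFO order.
    void LeaveScope(size_t mark) { objects_.resize(mark); }
    size_t LiveCount() const { return objects_.size(); }
private:
    std::vector<std::string> objects_;
};
```

    The appeal is that collection cost is proportional to the scope being left, with no tracing at all -- but it only works because the language guarantees lifetime never outlives scope.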
  • > the .NET framework uses a generational M&S GC to deal with this problem

    I'm going to guess that it essentially uses a copy collector for the new generation and only uses mark-and-compact on the tenured generations (where the old objects live). If all of the objects are going to move anyways, there will be less of a pause for copying than M&C which is the goal for Eden. But then I've never seen the code...

    Train algorithm- a train collector divides memory into blocks called cars. An object has references either within its car (good) or to other cars (bad). The collector over time hooks together cars that refer to each other into a train. Collection within a train is fast since we've partitioned the heap into smaller sets that we don't need to trace between. This allows a train collector to run GCs with very short pauses even on older generations. Train algorithms are also used in many ways besides garbage collection... Downside: the train collector has less throughput compared to M&C.

    Type segregating collector- a type segregating collector arranges objects in memory according to their type. It can group together types by size (allowing for an arena-like layout to be used) or group together types by whether they can contain references. If all of the objects that can't contain reference are together on the heap, you can just test the pointer address of the object to see if the collector needs to bother looking at it. Downside: requires the collector to be very knowledgeable about your type system.

    All of the techniques that are being mentioned here are primarily for uniprocessor designs. Once you've got multiple processors, it's generally better to sacrifice throughput on a particular processor for parallel collection. And as the number of processors gets even larger, you can sacrifice latency on a particular processor for concurrent collection.
  • Speaking of comments...

    >>>
    // We must addref the pointer before the invocation. Why? Consider
    // this scenario: the invocation calls a method which calls back
    // into the script engine, which triggers a garbage collection, which
    // does the final release on the dispatch object. Then control returns
    // back to the call to Invoke, which dereferences its "this" pointer
    // and promptly crashes.
    <<<

    Part of the reason this big comment is necessary is because doing bare AddRefs and Releases is opaque and mysterious. Can the GCable thing be protected instead of the dispatch object? It would be natural to root a GCable thing here since the local stack is a root. Then I wouldn't have to think about what might happen to pdisp during the call.
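    One way to make the bare AddRef/Release less opaque is to wrap the pin in an RAII guard, so the keep-alive intent is visible at the call site and the Release can't be forgotten on any exit path. A sketch with an invented stand-in for the COM refcounting (a real IUnknown would delete itself at zero):

```cpp
#include <cassert>

// Invented stand-in for a COM-style refcounted object.
struct Refcounted {
    int refs = 1;
    void AddRef() { ++refs; }
    void Release() { --refs; }  // real code would delete at zero
};

// Pins the object for the duration of one scope, e.g. across an
// Invoke that might re-enter the engine and trigger a collection.
class KeepAlive {
public:
    explicit KeepAlive(Refcounted* p) : p_(p) { p_->AddRef(); }
    ~KeepAlive() { p_->Release(); }
    KeepAlive(const KeepAlive&) = delete;
    KeepAlive& operator=(const KeepAlive&) = delete;
private:
    Refcounted* p_;
};
```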
  • Binder::GetIDsOfNames >>>
    if (IID_NULL != riid)
    {
    return hr= DISP_E_UNKNOWNINTERFACE;
    goto LError;
    }
    <<<

    This probably isn't what you wanted to do.


    Binder::Invoke >>>
    Bug("We don't support named arguments to except to property puts.")
    <<<

    Too many to's here.


    The goto LError; kind of bugs me. It makes a lot of sense when there actually is an error. But it's also used to skip to the end of a function even when everything is ok. I'd be quicker to understand what you were doing if it was goto out; on success or something.


    >>>
    HRESULT GetIdOfName(const WCHAR * pszName , DISPID * pdispid);
    HRESULT GetNameById(DISPID dispid, Name * * ppName);
    <<<

    Interesting that these aren't symmetric.
  • // Failing to
    // provide room for the return value is dumb, but not illegal.

    Arrgh.


    Binder::Invoke sure was a lot of fun to read.
  • Re: failing to provide room -- it's possible that the property get has a side effect and you care about that, not the return value. That's a very rare situation, but it is legal.
  • Re: IID_NULL -- Hmm, I'm surprised the compiler didn't catch that. Looks like I need to bump the warning level up to max.

    Re: Two to's is too many. Yep.

    Re: LError on non-error. Yeah, you're right. I might fix that up.
  • My main interest here is in the script engine interfaces; I'm coming from a perspective of wanting to implement an engine for an existing language. In this case I want to know what the engine needs to be aware of for a GC'd language, but the actual GC may as well be a black box. I just need to know how my language behaves.

    On the other hand, I'm interested in GCs for separate reasons. One of the most intriguing things to me is the ability to move memory, for better locality, larger allocation, or whatever. This is also more interesting because it requires awareness of the language -- what a memory reference is, etc. So if you do decide to explore a complex GC, I echo Nicholas' comments in that I'd like to see a good stepping stone for experimenting with more exotic things.
  • Since I promised earlier to write more on this I thought I'd catch up.

    There seem to be three big tradeoff areas to consider:

    Scriptyness- how well the language supports quick, little programs
    Parser complexity- how hard the language is to process
    Runtime complexity- how hard the language is to execute

    In general, scripty languages are harder to parse and run (automatic semicolonization, expando objects). We can trade parser and runtime complexity by moving things like type checking to be early or late (which also impacts the scripty feel). These choices affect other "cool stuff to do" items. Syntax coloring needs to match the parser and debugging needs to understand the runtime.

    There's a minimum level of scriptyness required. Fortran77Script#.NET really just misses the point. I think though that complex scripty features are better described than implemented. Slotted objects are neat but you wouldn't want to read the code that handles them after the preprocessor has gotten done with it. My goals are:

    * Eliminate boilerplate and uninteresting code. Make a language that lets you do a two line job in two lines.
    * Support programs more interesting than "Hello world", less interesting than being able to compile itself

    I consider the parser the least interesting part. There are a lot of good books on how to parse just about anything. As long as the parser works and it's not too hard to add new language features that's fine by me. There will probably be a lot of questions about code generation which is not as commonly talked about. And that leads to...

    The runtime which I consider the most interesting part. There are two ways to go here. The first is to develop a new runtime which would explain a lot about memory management, call resolution, and how existing languages are implemented. There have been only a few good discussions like this. The second is to leverage an existing runtime. I've never seen a really detailed look going through the entire process. The original roadmap had Managed Code - ? so maybe.
  • Thanks Nicholas. I think you're right on the money.

    I've never written a copy-based GC before, so that might be an interesting experiment. And I agree that a goal of the language should be brevity.

    I once considered writing up a .NET language that I was going to call "VERBOSE" -- basically a simple C#-like demonstration language but with the property that there were NO optional keywords. A typical class declaration would look something like

    class {
        name := "MyClass",
        sealed := true,
        extends := class {
            name := "MyBaseClass",
            namespace := "MyNamespace"
        },

    etc, etc. It would have been wonderfully horrid. I never did get around to it though.

    I agree that the runtime is the interesting part, but I will probably concentrate upon the infrastructure -- the GC, the error reporting mechanism, etc -- and not implement things like a big string manipulation library, a date object, etc.
