I guess this is turning out as a series of ObjectSpaces posts.  I did the first one a few weeks back covering the origin of OPath in “The Power of the Dot,” and followed that with “What’s the Big Idea.”  Now I have a few more in the cooker that you might find interesting.

 

At the time when ObjectSpaces was still in swaddling clothes, barely a prototype still teething on my hard disk, before it grew up, graduated, matriculated and finally started to dress for success, I was still experimenting with the concepts underlying object query and manipulation.  We had not set out to build an object database, but Luca and I were finding that many of the same problems that plagued object database designs were infesting our plans as well.

 

One of the hardest things to get right with an object database, a database that allows you to persist your objects as your objects, complete with dynamic reference lookup (swizzling et al), is the ability to gauge how much data to pull down from the server at one time.  This doesn’t sound like such a big deal if you are familiar with a relational database.  With a relational database you specify exactly the amount of data you want with every query using a projection; that is, unless you like running slipshod with your asterisk hanging out.  With an object query, you generally want to retrieve the whole object.  Well, at least that’s the paradigm you want to write your program against.  What you really want is for the underlying system to figure out just what data you are using and only go fetch that at runtime.

 

Of course, this turns out to be an unbelievably hard thing to do.  Some of the bits of data on the object are cheap to fetch and access, other data is more costly.  The typical example is the large binary photograph stored in your account database.  You don’t want to keep pulling that one down every time you go after the current balances.  Yet sometimes you need that data and sometimes you don’t.  If you do need it and you are processing large batches of accounts on the client, you don’t want to fire off an additional query per account just to get that part.  Unfortunately, this is only the tip of the iceberg.
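
To put a number on that last complaint, here is a minimal sketch in Python.  Nothing in it is ObjectSpaces; `FakeAccountStore` and its methods are hypothetical stand-ins for a remote database, and all the store does is count round trips.

```python
class FakeAccountStore:
    """Hypothetical stand-in for a remote database; it only counts round trips."""

    def __init__(self, n):
        self.balances = {i: 100.0 * i for i in range(n)}
        self.photos = {i: b"\x00" * 1024 for i in range(n)}  # the "costly" column
        self.queries = 0

    def fetch_balances(self):
        # One query for the cheap data of every account.
        self.queries += 1
        return dict(self.balances)

    def fetch_photo(self, account_id):
        # One query per account for the costly data.
        self.queries += 1
        return self.photos[account_id]

    def fetch_balances_with_photos(self):
        # One query that brings cheap and costly data back together.
        self.queries += 1
        return {i: (self.balances[i], self.photos[i]) for i in self.balances}


def photos_on_demand(store):
    """Fetch balances first, then go back to the server once per account."""
    balances = store.fetch_balances()
    return {i: store.fetch_photo(i) for i in balances}


def photos_prefetched(store):
    """Ask for everything in a single round trip."""
    rows = store.fetch_balances_with_photos()
    return {i: photo for i, (_balance, photo) in rows.items()}
```

With 50 accounts, `photos_on_demand` costs 51 queries and `photos_prefetched` costs one.  The catch is that the second form drags every photograph along even when you only wanted the balances.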

 

Remember, with objects your data is defined in a network, or graph.  When you had a relational database and wanted to correlate information across multiple flat tables you wrote a query that introduced an explicit join operation.  You told it what to access and what to bring back, even if that was an explosive Cartesian product.  It was what you wanted so you knew what to expect.  But with a graph, all that’s changed.  Now you have these convenient little properties on your objects that let you navigate around your object space, pulling in bits here and there without so much as an if-you-please because it’s just too darn easy.  Heck, it’s a party.  Have a drink, here’s a swizzle stick.  Sure, all the data is here in memory.  It’s been here all along.  Have you met it?  The whole graph, I mean.  Don’t believe me?  Let me introduce you.  Just start following those dots.  You’ve got customers?  Give ’em a dot.  Now you’ve got Orders, and if you peek just a little further you’ll see that you’ve also got Order-Lines and Products and Shippers and Warehouses and dang, is this the whole database or what?

 

So you see, the problem extends well beyond what you might have first guessed was the data in question.  When you use the object paradigm you conveniently start to forget there’s actually a database underlying all this stuff, and that you still are sucking down bits through a straw from somewhere in the far beyond.  When you start off writing queries against Customers, you could easily end up manipulating Orders and Addresses and whatnot.  Or you could not.  Maybe you’re only interested in pulling up a phone number.  How do you go about making sure the system is not pulling down thirty gigs of data when all you wanted to do was order a pizza?

 

It’s easy, you might say.  There are many solutions to this problem.  It is true.  We’ve looked at most.  Some are practical, and some are highly theoretical and dicey.   They basically focus on solving the problem in one of three ways.

 

1)      All related items are fetched on demand.  The system is optimized for common scenarios of only accessing closely related primitive values.

 

2)      An administrator hand-optimizes the definition, declaration or mapping of the object structures, identifying which bits of data naturally group with other bits of data.  They take into account the likely access patterns and optimize for those.  Grouped items pre-fetch together.  Related items are pulled in on demand via additional queries.

 

3)      The object database system is designed to optimize itself over time, adjusting the pre-fetch/demand-load relationships between objects.  If most of your apps access Customers without Orders, Orders will always be loaded on demand via additional queries.  If they are generally accessed together, Orders will always be queried for and brought back at the same time Customer data is retrieved.

 

The first option is basically an admission that the technology is a novelty and not intended for rigorous use.  The third option is highly suspect and specious.  While it is certainly possible to build a system that does in fact monitor its usage and tune itself, it is utter foolishness to think that this would suffice for any well-used database.  Real databases host myriad applications with significantly different patterns of use.  An autonomic system will optimize for the middle ground, and will never behave optimally for any case.

 

That leaves you with option #2, which is the most popular method in use today.  Yet, it generally has the same problem as option #3, except that a human brain has chosen to optimize it in favor of one of the two extremes.  It’s not that human brains are fallible.  It’s just an impossible problem to get right.  You see, any system that relies on static assessment of optimization parameters is bound to be wrong half the time, and any system that relies on dynamic assessment of optimization parameters across the gamut of real-world applications is bound to be wrong all of the time.

 

So there we were, Luca and I, stuck with the same problem that every O.P. system to date has had to face.  We knew we would have to pick one of the options and just go with it.  Luca hoped we wouldn’t end up with the first option, and was willing to go with the second.  I hated all three.  I loathed them, because I too am a programmer, and I too know what the customers will face when trying to build their applications.  Handing over the power to do your job well to an administrator who may know nothing about your needs just seemed a bad idea.  It’s the application programmers who know what the app is about to do.  They know what data they need, just like they did when they wrote apps against relational databases.  What we needed to do was put the power back in the hands of the people.  If SQL can do it with projections of rows, then by gum, we certainly were going to figure out a way to do projections of objects.  And that meant projecting everything, even the network, the graph, the entire matrix.

 

That’s when I hit upon the solution.  Like the dot, it was the graph, stupid.  Object queries needed that additional bit of information that would allow the user to specify exactly what reachable parts of the network should be pre-fetched together.  So I took the OPath parser and added an additional context that would allow the specification of a list of properties, sub-properties and so on that would form a tree of access paths.  Anything touched along these paths would trigger a pre-fetch of that data.  With a simple list of dot expressions you could easily specify what part of the matrix you wanted to span.
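
To picture how a list of dot expressions turns into a tree of access paths, here is a rough sketch in Python.  It is not the OPath parser; the comma-separated syntax and the name `build_span_tree` are assumptions made purely for illustration.

```python
def build_span_tree(span):
    """Fold a list of dotted paths into a nested dict of access paths.

    Assumption: paths are separated by commas and segments by dots.
    Paths that share a prefix share a branch of the tree.
    """
    tree = {}
    for path in span.split(","):
        node = tree
        for name in path.strip().split("."):
            if name:  # ignore empty segments, e.g. from an empty span
                node = node.setdefault(name, {})
    return tree


tree = build_span_tree("Orders.Lines.Product, Orders.Shipper, Address")
# tree == {"Orders": {"Lines": {"Product": {}}, "Shipper": {}}, "Address": {}}
```

The two paths that start at Orders merge into a single branch, so a loader walking this tree would visit Orders once and then fan out to Lines and Shipper.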

 

The best thing about it was that it was an opt-in solution.  It was a way of saying, “Hey, I’m going to be dotting through this stuff, just wanted you to know.”  If you did not specify the new span parameter, then you could still navigate through all the properties, reaching all your data.  If you did, then your server could put the order in all at the same time, so all the data would come back hot and ready to eat.
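
The opt-in behavior can be mimicked with a toy loader.  This is a sketch in Python under loud assumptions: `Session`, `get`, `follow` and the `round_trips` counter are all hypothetical names, and the "database" is just a dictionary of object ids and their relations.

```python
class Session:
    """Toy loader: a span pre-fetches, anything else loads on demand."""

    def __init__(self, graph):
        self.graph = graph      # {object_id: {relation_name: [object_id, ...]}}
        self.cache = set()      # ids already pulled down from the "server"
        self.round_trips = 0

    def get(self, root, span=()):
        # One round trip loads the root plus everything along the span paths.
        self.round_trips += 1
        self.cache.add(root)
        for path in span:
            frontier = [root]
            for name in path.split("."):
                frontier = [c for o in frontier
                            for c in self.graph[o].get(name, [])]
                self.cache.update(frontier)
        return root

    def follow(self, obj, name):
        # Dotting through a relation costs a round trip only on a cache miss.
        children = self.graph[obj].get(name, [])
        missing = [c for c in children if c not in self.cache]
        if missing:
            self.round_trips += 1
            self.cache.update(missing)
        return children


def walk_order_lines(session, customer):
    """Dot from a customer through Orders to every order line."""
    return [line
            for order in session.follow(customer, "Orders")
            for line in session.follow(order, "Lines")]


GRAPH = {
    "cust": {"Orders": ["o1", "o2"]},
    "o1": {"Lines": ["l1"]},
    "o2": {"Lines": ["l2"]},
    "l1": {},
    "l2": {},
}
```

Walking the graph after `get("cust")` costs four round trips: one for the customer, one for its orders, and one per order for the lines.  Walking it after `get("cust", span=["Orders.Lines"])` costs one.  Both walks reach the same lines, which is the point of opt-in: the span changes when the data arrives, not what you can get to.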

 

Luca was a bit dubious of the idea at first, but I wore him down.  Ever since then, ObjectSpaces has had the span parameter, and yes, the name derives from my own ultra-nerdiness.  It refers to the span of a space as used in linear algebra.  Crack a book if you don’t believe me.  I’m sure you have one in a box in your basement like I do.  Wife got rid of it years ago?  Mine tried, too.  I tricked her by re-labeling the box, “trinkets, odds and ends.”  That stuff she keeps.

 

But I digress.

 

Matt