Todd Bishop mentioned the Higgins project this afternoon. For some reason this one just hit me wrong. I mean, we need a federated identity management system. Until we do everyone will have to roll their own solution and interop will just be a dream. But, just because Microsoft proposes something, does that mean that the open source community needs to come up with something different?
The quote that hit me was:
"Being an open source effort, Higgins will support any computer running Linux, Windows or any operating system, and will support any identity management system."
It's that repeated use of "any" that bothered me so much. It's as if it's going to solve all the world's problems. But, can it? Not really. Not out of the box. Instead, it's an "open" solution, which I read as "download the bits, tweak them according to your needs, and hope for the best." Don't get me wrong, I think the open source community is doing great things. Sometimes, though, we just need a single-bladed pocket knife.
I'm getting close to finishing the RSS connector now. I've finished the cache code and I've switched the query engine over to use the CRM 3.0 web services. I'm not sure how we'll release this, but I suspect it'll be part of the CRM GotDotnet sandbox somewhere. Stay tuned.
I was talking to some folks yesterday about this whole Software as a Service thing. While we were talking I started trying to define to myself what SaaS is. Now, I know there are many smarter, more experienced people doing the same and that there is no right answer here. Instead, I decided to break it down into three different business models that are supported by internet-based software.
Software supporting a business -- the Software as a Storefront model
Think of this as the Amazon model. Amazon is in the product resale business. Their business model is about selling things and their storefront is an application on the internet. Amazon doesn't make money by providing their application any more than a dry cleaner with a web presence makes money through that presence. It's an enabling technology. (Let's ignore some of Amazon's new service offerings for the moment.)
Business selling interactive software -- the Software as an Application model
This is the SalesForce model. SalesForce sells their online application and their business model is about making money selling that application. Another model comes to mind here that's like the SF model, but slightly different. SF sells their software directly to their customers. One (unnamed) company that I'm thinking of sells their software as a service to other companies who turn around and resell it. Maybe this is the service side of the Software as an Application model. In this case the company is an application provider.
Business selling service software - Software as a Service
In my opinion this is the model that makes Web 2.0 spin. It's about exposing interesting functionality so that other consumers can combine and compose new applications on top. A handful of "name brand" (without naming the brands) services jump to mind: map services, auction services, product inventory, and an ISV community around existing Software as an Application vendors.
This model doesn't imply anything about the user or developer experience. The service is the interesting bit to the provider. I can see service vendors also selling application software or other services (see one of the above models). This is the area that I'm most interested in right now.
What's interesting here is the business model. That is, if the service is effectively headless, and the end user doesn't ever see the 'service' then how does a company make money with this model? That's a question we'll all need to think about in the long run.
It's stuck in LegalLand right now. It's not really Legal's fault either. When I put the original prototype together for the BizSummit and PDC I was able to take liberties with the set of APIs that I used. Yup, I ended up using unsupported functionality. Shouldn't be much of a surprise, I have access to all the internals, I'm on the team, I needed to get a job done, and this blog has always been about pushing past the envelope.
Turns out that once the connector was public the demand went way up and the internal pressure to release it went up. Problem is there's no dedicated resource for "fixing" the bits that I cheated on. That means that we (read that Microsoft) can't release the connector without doing one of two expensive things: document the undocumented or fix the code.
Ideally I'd like to fix the code. I really didn't push the envelope all that much. In fact, all I did was cheat a bit and use the COMProxy instead of the shiny new SWS (of which I am a long-time champion) for tweaking the requested queries, and directly access the application-level metadata cache. Fixing the COMProxy issue is an afternoon's worth of work. It really just means pulling the WSDL, ripping out all the bits that have no bearing on RSS (so it loads faster), and tweaking a few query functions. The metadata cache is another issue altogether.
Here's why. The application-level metadata cache has two nice properties: it's already loaded and can be shared with the application, thus cutting down on memory requirements, and it has a reasonable object model (note that I didn't say it has a good object model… if it did we would publish it). That means I need to define an object model that makes sense, and I'd want to make it "big enough" to be useful. Plus, I would need to write a bunch of code to read the WS-based metadata and transform it into the nice object model.

I've been assuming that anything I do in that space, once released, will probably end up in general usage (I would actually hope so, because I wouldn't want everyone to have to go through this same pain). If I'm right about that then I would want to make sure that the cache is really usable. But then that means I'd need to spend more time "getting it right". There's also the added problem that if I, as an aside, release a metadata cache programming model, people will come to expect something much like it in a future product release (which is why I never made the 1.2 web service code generally available - when I was finally ready to release it, the product team took up the banner and built one themselves).
If there really is demand for the RSS stuff and/or an object-based metadata cache, let me know and I'll try to get it on the radar. If not, let me know and I'll keep working on other things.
I've been spending the last few weeks developing a graduate-level database modeling course as part of my MSE capstone project. Turns out that creating a 10-week course covering database modeling as part of a Software Engineering program is a lot of work. It's not like the problem space isn't well understood. Quite the opposite: there's a ton of material available. I've done course development in the past, but it's usually been limited to a single session, or at most a week-long "boot camp" style. This is completely different.
First, there's a fuzzy line between software engineering and computer science. This course is supposed to be firmly in the SE camp, which means that deep dives into the theory of database systems are not part of the material. That's not to say that the information isn't interesting. It is. But, if you're a student trying to understand the SE aspects of databases so you can move up in the world (say from being a developer to being a project manager or senior designer), then you only need to know enough about the implementation details to know when something is a seriously bad idea.
The course breaks down into two main sections: the first half, up to the mid-term, is the SE side of the theory. It covers topics such as locking, 2PL / 2PC, relationship types, normalization, and basic modeling techniques. The second half is all about well-understood patterns in the business application space. The second half is easy to justify - this information is immediately applicable to many students' current jobs. It's that first half that gets nasty.
Anyway, I used to sit in classes and think "I can do a better job teaching this class". I might still think that in some cases, but now I've had to put my money where my mouth is and prove that I can at least structure the class in the first place. I've always known that there's a huge skill and experience gap between the most and least experienced students. I had to keep that in mind the whole time and it changed the material that I could throw in. I compromised a bit though. Most of the material is aimed at the high-end of the average, but each module includes a set of additional reading to keep advanced students interested. I also threw an advanced topics module in during the last class week. It might be too deep for the less experienced students, but it should provide some head-scratching for the more experienced.
This class goes "production" for Spring quarter this year. I'll be attending as an observer to see how things pan out and I'll apply what I learn to tweaking that class and as input to two others that I'm working on. This should be interesting.
It was interesting to see Mitch Milam's post about the metadata browser. This was a little tool I put together pre-V1.0 ship, but which didn't make the schedule until V3. Mitch points out the "published" component which displays the entity metadata in a nice format. If you edit list.aspx you'll see two sections commented out that provide links to individual entity schemas. This is unsupported and undocumented functionality that we considered calling "sample code". Turns out that it made it into the box but not enabled by default. These schemas work nicely with the code generators that I blogged about a while back.
In a past life (I'm fairly certain it was a past life because there are days when I'm sure I'm paying for it), I worked on a pair of very large-scale data-cleansing systems. They will go unnamed, except that I'll refer to them as System DBS and System ES. Both systems had a specific set of goals and in the case of DBS a very specific target problem domain. The goals were quite simple:
Given that DBS was specifically designed to run over data from a given problem domain (let's call it telecom data) one might assume that the problem was well-constrained. If one did assume that, one would be very wrong. So, a small team of developers set out to generalize the problem space to cover different domains and designed ES as a result. ES followed the same path as above, only in a very general way and without the overly complex rules engine (12 years later I think we might have been able to use that rules engine, but at the time our distaste for it was clear in the ES design). As an aside outside of parentheses, this was called the U model, mainly for the shape the model took on while we drew it on the whiteboard.
So, what's the point of that history lesson and what does it have to do with duplicate detection? Well, remember that I mentioned that the primary domain was telecom. That means we covered concepts such as customers, addresses, telephone numbers, physical plant data (there are a lot of little pieces that go into getting telephone service over a land line), bills & invoices, and payments. In all there were some 30 different systems involved in sourcing this data to our engine. One of the first things that needs to happen in the DBS problem space is that a set of candidate keys needs to be identified that works across all systems, or at least across enough systems that in the end all systems can be logically linked. In the telecom world that was the phone number.
In the U.S. (actually in the North American dialing area) telephone numbers are 10 digits long and always follow a very specific format. I won't go into the formal names for those various groups of digits or even why there are groups, but let's just say that each of the groups can become part of a key. Nearly every installation of DBS was in the U.S., so building a telephone number parser wasn't too terribly difficult. You're either looking for 7 or 10 digits (unless you run across a PBX in the data and then you have to start messing about with extensions). Well, the big DBS installation I worked on, and what drove much of the ES design, was not based in the U.S., but was instead in another country with very different telephone formats. Some places used 4 digits, some 5, some 7… you get the point. The plan was to configure and run DBS to first dedup that data so we could convert the whole country to 10 digit dialing. A secondary goal, once the customer realized what we could do, was to dedup the whole lot of data to see what we could find (did you know that many times the phone company doesn't know that your home already has a connected telephone line, so they send a technician out with new gear to hook it up?).
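To make the parsing step concrete, here's a minimal Python sketch of the kind of NANP normalizer I'm describing. The function name, the trunk-prefix handling, and the default-area-code fallback are my own illustration, not the actual DBS code:

```python
import re

def normalize_nanp(raw, default_area_code=None):
    """Reduce a raw North American phone string to a 10-digit key.

    Strips punctuation, drops a leading '1' trunk prefix, and
    (optionally) prepends a default area code to 7-digit numbers.
    Returns None when the digits don't fit the 7/10-digit pattern.
    """
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop the trunk '1'
    if len(digits) == 7 and default_area_code:
        digits = default_area_code + digits  # assume local dialing
    return digits if len(digits) == 10 else None

print(normalize_nanp("(425) 555-0100"))   # 4255550100
print(normalize_nanp("555-0100", "425"))  # 4255550100
print(normalize_nanp("x1234"))            # None
```

The non-U.S. installation would have needed an entirely different set of length and grouping rules, which is exactly why this step was configuration, not code, in ES.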
I can hear you now: "Get on with the discussion and tell us why duplicate detection is hard."
Remember that I mentioned the step about identifying a candidate key? Well, in the case of a phone number it's fairly simple (let's make some additional simplifying assumptions that phone numbers are never reused, each person has only one phone number, and numbers never move from person to person). Once you see a phone number in a normalized format you can then query over the set of existing data looking for other instances of that phone number. In a live RDBMS that can be done using a unique key constraint over the normalized phone number column that will throw back an error when an insert or update attempts to violate that key. In our simple world this works every time because when you get an error on an insert you know that you have either a duplicate record or an error in the key. For updates you know you have an error in one of the keys of the updated record or in at most one existing record in the database.
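The unique-constraint approach is easy to demonstrate with SQLite from Python; the table and column names here are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE subscriber (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        phone TEXT NOT NULL UNIQUE   -- normalized 10-digit key
    )
""")
con.execute("INSERT INTO subscriber (name, phone) VALUES (?, ?)",
            ("Alice", "4255550100"))

try:
    con.execute("INSERT INTO subscriber (name, phone) VALUES (?, ?)",
                ("A. Smith", "4255550100"))
except sqlite3.IntegrityError as err:
    # The engine rejects the insert: either a true duplicate
    # or a bad key -- someone still has to decide which.
    print("duplicate detected:", err)
```

Note that the database only tells you a collision happened; all the interesting work is deciding what the collision means.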
Now, let's extend this from our idealized phone number world to something that's more CRM-ish. A phone number is not a reasonable candidate key because it does change over time, two people can share it, and many people have many numbers. A solution to this problem is to identify a new candidate key. One approach is to construct a key from various bits of useful information. For example one might use some normalized elements of a contact's name, a phone number, possibly an email address, and their home address. Once we've extended the key to cover enough elements to guarantee uniqueness (which is not possible and is left as a proof to the reader) in our problem space we will invariably run into the case where that key isn't wide enough.
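Here's a sketch of the key-synthesis idea, assuming we hash a handful of normalized fields. The field choices and the hashing are illustrative only; as noted above, no choice of fields actually guarantees uniqueness:

```python
import hashlib
import re

def _norm(value):
    """Lowercase and strip everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (value or "").lower())

def candidate_key(contact):
    """Synthesize a candidate key from several normalized fields.

    Widening the key reduces false matches, but two genuinely
    distinct people can still collide on every field chosen here.
    """
    parts = (_norm(contact.get("last_name")),
             _norm(contact.get("first_name"))[:1],  # first initial only
             _norm(contact.get("phone")),
             _norm(contact.get("email")),
             _norm(contact.get("postal_code")))
    return hashlib.sha1("|".join(parts).encode()).hexdigest()

a = {"first_name": "Robert", "last_name": "Smith",
     "phone": "(425) 555-0100", "email": "RSMITH@example.com",
     "postal_code": "98052"}
b = {"first_name": "Rob", "last_name": "Smith",
     "phone": "425-555-0100", "email": "rsmith@example.com",
     "postal_code": "98052"}
print(candidate_key(a) == candidate_key(b))  # True: same person, same key
```

Using only the first initial is a deliberate trade-off in this sketch: it lets "Robert" and "Rob" collide (good for dedup) at the cost of more false positives.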
Let's see what happens when we insert a new record. First, we construct / synthesize a candidate key and put it into one of the columns in the INSERT statement. Then we fire the statement at the database and wait for an error. Let's assume we get a key violation back so we have a few options: we can change the key, we can ask the data supplier what to do, or we can punt. If we change the key automatically then we've simply ignored the duplicate detection problem and we might as well not have done any of this work. Same thing with punting except that it's overly harsh on the other side: the data doesn't go into the database.
We might ask the user what to do. Well, simply telling them that their just-entered data would create a duplicate record in the system and therefore must be in error wouldn't be particularly useful. How would they know what part of it is in error? How would they even know what to do with a duplicate? One thing that DBS and ES did was return the new record and the existing duplicate(s) in a nice bit of dynamic UI so that the user could see both records essentially side-by-side and make a judgment call. This worked for our solution because we specifically engineered it so that there was a headless service running but also a staff of "Error Resolution and Correction Clerks". That is, people were sitting in the dark waiting for bad data to pop up on their screen; they'd make a call based on the original data, the duplicate data, and occasionally the data from the source system.
Let's say we do something like the first approach where we simply return the "offending" records and the new record and let the user decide. Then, the user decides that these two records are actually different from one another but that the data is 100% correct. In this case the user or the system could mar the record in such a way that it's no longer considered a duplicate and complete the write operation. What just happened here? Well, the candidate key for the new item will no longer raise an error when it's the cause of a duplication because that key is different. We could mar the data in a predictable way so that the key stays intact but so that there's a "larger" unique key over the data, but again that wouldn’t cause the insert to fail.
The next option is to use a unique key over the candidate key plus some invented data (invented in a predictable way that is) and query the data on the candidate key before attempting an insert. Now we're getting somewhere. We allow duplicate candidate keys but invent a wider key that guarantees uniqueness (see above for details on the widening proof). But we still haven't solved the problem because we don't have a reasonable way to verify that a duplicate is really a duplicate.
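A rough sketch of that query-then-insert pattern, with a predictably widened key (table and column names invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE contact (
        id       INTEGER PRIMARY KEY,
        name     TEXT,
        cand_key TEXT NOT NULL,          -- duplicates allowed here
        wide_key TEXT NOT NULL UNIQUE    -- cand_key + invented suffix
    )
""")

def insert_contact(con, name, cand_key):
    """Query on the candidate key first, then insert with a widened key.

    Returns the rows that already share the candidate key so a human
    (or a rule) can decide whether the new record really is a duplicate.
    """
    dupes = con.execute(
        "SELECT id, name FROM contact WHERE cand_key = ?",
        (cand_key,)).fetchall()
    # Invent the widening suffix predictably: a per-key sequence number.
    wide_key = f"{cand_key}#{len(dupes)}"
    con.execute(
        "INSERT INTO contact (name, cand_key, wide_key) VALUES (?, ?, ?)",
        (name, cand_key, wide_key))
    return dupes

print(insert_contact(con, "Alice", "smith|4255550100"))     # []
print(insert_contact(con, "A. Smith", "smith|4255550100"))  # [(1, 'Alice')]
```

The insert always succeeds; the suspected duplicates come back as data rather than as an error, which is what makes the side-by-side review flow described above possible.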
This all gets horribly complex when you're dealing with multiple record types or subclasses of types (think of the customer case in MS-CRM where "customer" might mean Account or Contact). This means you need a candidate key that crosses type boundaries and a way to reconcile duplicates across types.
Anyway, that's why duplicate detection is hard. Note that I didn't say it was impossible, just hard.
[Note: This is personal opinion, it doesn't reflect the viewpoint of Microsoft or the Microsoft CRM team. This is my take on the DMF and both its shortcomings and ultimate potential. Don't assume that anything that seems like a prediction here is apt to happen. I'm only peripherally involved with the DMF team and I don't set direction for them.]
It's a Framework
The "F" in DMF is all about frameworks. Why? Because creating a general-purpose data migration tool or product is extremely difficult, expensive, error-prone, and unlikely to meet our customers' needs. That's right. The DMF is a framework because that's the best approach we could take and the most we could provide without setting unrealistic expectations. Simply put, there isn't a way for us to create a tool that can detect all possible data formats from all possible CRM "applications" and correctly get that data into your shiny new (or slightly used) MS-CRM without the potential for serious data disaster.
Let's look at a few scenarios to see why the framework approach was recommended and pursued by the R&D team. First, we can assume that existing CRM systems have been customized (I don't have the exact numbers, but my gut tells me that it's a high percentage). Next, we can assume that a CRM system has been in use long enough to collect a reasonable amount of data (otherwise why would we worry about migrating data from an existing system to a shiny new MS-CRM?).
Given that an existing system has been customized and has been running for some time, there are likely to be a few "dirty" bits of data floating around. That doesn't mean that there's a bug in the in-place system. By "dirty" I simply mean that the data in any given database column may have both syntactic and semantic problems. For example, in the U.S., states are typically abbreviated to two uppercase letters. But that hasn't always been the case. Minnesota is conventionally abbreviated MN (at least that's what the post office would like to see), but it's conceivable that collected data includes other abbreviations like "Minn.", misspellings, fully-specified values, and even missing data.
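A toy version of that kind of cleansing rule, assuming a hand-built synonym table (the table and function are mine, not DMF code):

```python
# Hypothetical cleansing rule for one "dirty" column: U.S. state values.
STATE_SYNONYMS = {
    "mn": "MN", "minn": "MN", "minn.": "MN", "minnesota": "MN",
    "wa": "WA", "wash": "WA", "wash.": "WA", "washington": "WA",
}

def clean_state(raw):
    """Map a free-form state value onto its USPS abbreviation.

    Returns None for missing or unrecognized values so the record
    can be routed to manual review instead of silently migrated.
    """
    if not raw:
        return None
    return STATE_SYNONYMS.get(raw.strip().lower())

for value in ("MN", "Minn.", "minnesota", "Minesota", ""):
    print(value, "->", clean_state(value))
```

Even this trivial rule needs a human decision for the misspelled "Minesota" case, which is the general shape of the data-cleansing problem.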
That's just one simple case. Phone number formats and addresses are notoriously hard to agree upon. More about that particular problem in a few days when I get around to talking about why duplicate detection is actually damned difficult to do well.
What we wanted to provide and what we did provide
Ideally we would like to have shipped something with a lot less user-facing emphasis on the "F" part of DMF. One of our goals, which we simply didn't meet, was to provide a Big Green Button that when pressed would discover your other CRM data, clean it up, normalize it, automatically match it to your new MS-CRM system (including all the customizations you put in place and any others that we might discover while migrating your data), and last, but not least, migrate that data. Really, that's what we wanted to do. [Bobert, if you're reading this you'll remember working on another system just like this about 10 years ago and about 9000 miles away.] Well, we didn't ship one of those, so what did we ship?
The general idea behind the DMF is that you're not migrating a single system just once. You might be, in which case the DMF still provides a ton of value. One of the assumptions we had was that MS-CRM customers would be migrating from any number of essentially unknown systems. So, without some really great AI we would need a bit of manual intervention. That is, we'd need to ask a number of questions about your data: what format is it, what source systems hold it, what are the syntactic modifications, and what semantic rules are applied. In many cases we assumed that at least the latter two questions couldn't be answered directly: you would need to discover those rules as you went.
Why is this a multi-step process?
It was precisely this problem that drove the idea of the intermediary staging database - the CDF. The idea here was to incentivize partners to either create adapters from source systems for resale (i.e. connect the CDF to Act or Goldmine) or to build a consulting business model around migrating custom data (Access databases, Excel files). We would provide the back-end services such as constructing the CDF from your customizations and moving the data into your production system.
There were three huge problems with this model: we didn't get the partners we wanted; we didn't provide a key piece of technology; and we didn't get the CDF construction logic completed. In retrospect I think the partner model would have been easier to sell if we (and the partners) were up-front about including data acquisition, cleansing, and migration costs directly in the CRM purchase price. Not doing so left the customers with an unexpected bill for these services. We missed the key data cleansing middleware that would have taken all the source data, applied a set of cleansing rules, and produced useful production data. The problem is simply that the technology is extremely hard to get right and even when it is right still requires a set of domain-aware eyeballs to verify the production rules. Finally, we could have and should have done a better job reading your customizations (pick lists and pick list value mappings in particular) and applying them to the CDF and the cleansing / mapping rules.
What's next for the DMF?
That's a good question. I know where I'd like to see the DMF go in future releases, but I can't promise that the team has the same point of view. In particular I think we can do a lot better job with the back-end CDF construction; we can do a much better job with value maps; we should be able to better manage keys; and we should do a better job at basic data cleansing. This latter bit is the most important in my mind: without clean data the value of your CRM system rapidly deteriorates. This isn't just a DMF problem, but if we could verify that source data, once scrubbed, met certain criteria, we would be a lot closer to helping with the problem.
Another area where the DMF could stand some improvement is managing multi-phase migrations. The idea of the DMF works great for one-time migrations where all the source data from all the source systems is moved into the CDF at the same time. It doesn't necessarily help if the data is moved in piecemeal unless the DMF includes basic rules around duplicate detection, prevention, and clean-up. If the CDF holds source data over time we can get closer to solving the problem because we can identify these issues during clean-up and "do the right thing." However, if the CDF takes on more of a bulk-load / bulk-import role as a staging area, then the actual import step from CDF to CRM needs to include reasonable rules covering data clean-up rule application at the platform level. That's another topic for another day though.