Hi folks,

My name is Atul Adya and I am a software architect in the ADO.NET team. After working on persistent object stores, object caching, distributed file systems, and diagnostics in wireless networking, I have been spending time on a very important problem faced by programmers - reshaping/transformation/mapping of data from one data model to another.

We have been designing and implementing an innovative and powerful solution to address the mapping problem in ADO.NET vNext. In this blog post, I will describe the overall design of this entity mapping architecture. You can play with our mapping technology by downloading the CTP bits.

1. Model to Model Mapping in ADO.NET vNext

Developers typically use object-oriented technology for writing applications and they use relational databases to store their data. However, because of the "impedance mismatch" between the data models of these two technologies, there is a considerable amount of effort that is spent in many projects to address this mismatch. To bridge this gap, a number of Object-Relational (OR) Mapping solutions have been invented that lessen the burden on the programmer.

In addition to applications, a number of data services can also take advantage of a higher level of abstraction offered by a conceptual model of the data rather than the store model (See MSDN article). To cater to the need of applications and data services, ADO.NET vNext provides a more general mapping solution - OR mapping being just one, albeit a very important, instance of the general mapping solution. ADO.NET vNext provides a conceptual model - the Entity Data Model (EDM) - that allows applications and data services to model, communicate and manipulate data at a higher level of abstraction. Thus, the bridging technology that we need is one that maps store side concepts (i.e., relations) to a higher level of abstraction, i.e., the EDM.

Before we dive into the mapping strategy in ADO.NET vNext, it is important to recall the different layers in the stack - the Object Services layer, the Map Provider layer and the Store layer. The ADO.NET mapping approach is different than previous OR mappings in several ways:

(1) It separates the Map Provider layer (described by the EDM) from the Object Services layer where CLR objects are built. The mappings between these layers are relatively trivial - they are essentially one-to-one mappings although we do take care of relationship and navigation property fix-ups in this mapping component.

(2) The EDM to the store mapping component does the powerful structural transformation between the EDM layer and the store layer. It allows developers to map using different inheritance strategy mappings, split entities across tables, mapping with equality filter conditions, etc.

(3) As discussed below, the mapping transformations are internally expressed as eSQL views which provide significant benefits for querying and update.

Actually, the mapping engine that transforms between the EDM and store space performs mappings from one EDM model to another EDM model - note the store can be modeled in EDM terms. Our current implementation assumes that the store has a flat relational model but that restriction can be easily removed in later versions.

In this blog, I will focus on the EDM-to-EDM mapping engine (a.k.a. Entity Mapping engine). This engine takes the CSDL, SSDL and MSL files and produces mapping information for consumption by the update and query pipelines. Recall that these XML files have the following information:

  • CSDL file: The Entity Data Model for the application
  • SSDL file: The database as expressed in EDM terms
  • MSL file: The mapping information about how various entity/association sets and tables connect with each other.

2. Where we came from?

The main goal of the mapping infrastructure in ADO.NET vNext is to translate various operations from the application space to the relational space:

  • Queries: Given an eSQL or LINQ query in EDM terms, the mapping engine must provide information to allow the query to be translated into store terms and shipped to the database server.
  • Updates: Given a set of objects that have been inserted, modified or deleted in the object cache, the mapping engine needs to have sufficient information to help transform these changes into INSERT, DELETE and UPDATE database statements.

Along with these requirements, the mapping engine has to help in propagating server-generated keys and other server-generated values to the EDM space. Furthermore, it also has to provide sufficient information to allow concurrency control checks to be performed.

When we started looking at this problem, we realized that there were a number of important (and sometimes conflicting) requirements that made the problem quite difficult:

a) Flexibility of the mappings: Looking at various scenarios, we observed that people really wanted to map application data to tables in very powerful ways. With a large amount of data present in existing databases, we cannot tell users to change the data or add/remove columns according to the needs of the mapping engine!

For non-legacy database scenarios, existing data is not issue. However, developers may want to have really flexible mappings to the store, i.e., decoupling the thinking of how the store is laid out and how the entities are modeled is important. For example, developers may want to model complex inheritance hierarchies and then map them according to a variety of mapping strategies to database tables, i.e., the three well-known strategies and combinations of them. An enterprise-quality mapping engine needs to be powerful enough to support a large number of scenarios.

b) Robustness of design: Given the plethora of desired scenarios, one way to handle them is to write code that explicitly handles all the different cases. To our knowledge, most OR mapping solutions use this strategy. Even though this approach can be made to work for queries, it is much more difficult to make it work for updates to cached entities and set-based/DML updates. Since every scenario has to be carefully programmed for, the system can become somewhat fragile as more and more scenarios are supported by the mapping solution - clearly, this is undesirable. Thus, robustness of the design as the scenarios scale is critical.

Furthermore, a "case-by-case" programmed design has huge development and maintenance costs - every scenario has to be explicitly programmed for and when a new scenario is added, it needs to be accounted for in the code base. For example, if there is support for the Table-Per-Hierarchy mapping scenario and we want to allow combination of this strategy with a different inheritance approach, we have to explicitly code it. Thus, it is undesirable to have a solution that supports different scenarios via case-by-case reasoning.

c) Correct round-tripping of data: We need to maintain fidelity of the EDM data as changes are made to it, i.e., when an application's changes in the EDM space are propagated to the store and back to the EDM space, the data should be re-created in full fidelity (assuming no concurrent modifications). We need correctness guarantees in the design to ensure that such data round-trips.

3. To View or not to View - that is the question

To address these concerns, we took a radically different approach, thanks to our colleagues in the Database Group at Microsoft Research - Sergey Melnik and Phil Bernstein have been working on these issues for the past few years (see http://research.microsoft.com/~melnik/  for the latest papers).

One of the key insights that we had was that there are two aspects to mapping - the transformation/shaping aspects (e.g., mapping am entity to multiple tables) and the runtime aspects (e.g., translating queries, updates, concurrency control). Wouldn't it be a good idea if we could separate the two problems? Well, we did precisely that. The shaping issues are handled by the View Generator (a.k.a. Mapping Compiler) and the rest is handled by the runtime. The View Generator (as we see later why it is called so) takes the input Mapping Specification Language or MSL files and converts them into Mapping Views that are easily consumable by the update and query pipelines.

Let us consider queries first. To handle data shaping from tables to the rich EDM model, we can express the mapping as a Query Mapping View or QM View that expresses entity sets and association sets in terms of tables using eSQL.

For example, suppose entity type ECustomer is a subtype of entity type EPerson and we have an entity set EPersons of type EPerson, i.e., it can contain EPerson or ECustomer entities. Suppose this entity set is mapped to two tables SPersons and SCustomers such that the former contains all the persons and the latter contains all the customers - this inheritance strategy is also known as Table-Per-Concrete-Class or Horizontal Partitioning. Thus, we have the following (key fields are underlined; ECustomer has an extra field - "credit").

EDM Types: EPerson(pid, name) ECustomer(credit) <inherits from> EPerson
Tables:          SPersons(pid, name) SCustomers(pid, name, credit)

For this scenario, the QM View for EPersons is:

    CASE WHEN _from0 THEN EPerson(p.pid, p.name)
                ELSE ECustomer(p.pid, p.name, p.credit)
    (SELECT pid, name, null, true as _from0 FROM SPersons
     SELECT pid, name, credit, null as _from0 FROM SCustomers) as p

Essentially, this view is saying that EPersons can be populated from SPersons and SCustomers by performing a union of the tables and then constructing EPerson entities for tuples that came from the SPersons table and ECustomer entities for tuples from the SCustomers table (using the _from0 "artificial" field to determine where the tuple came from).

Thus, as explained by Murali in an earlier posting, when the query pipeline receives a query

SELECT VALUE p FROM EPersons AS p WHERE p.pid = 17

it substitutes EPersons with the view above (i.e. "view unfolding") so that all the "leaf" expressions in the query tree are in database terms - at this point, the query pipeline can refactor the query tree such that the relational operators are at the bottom of the tree (which can be shipped off to the server).

Thus, query mapping views give us a nice way of thinking about the problem in a r, i.e., the whole shaping/transformation exercise has been reduced to coming up with the appropriate eSQL view! Furthermore, shaping has nothing to do with the runtime artifacts like delay loading, concurrency control, etc.

This declarative approach gives us a lot of power, composability, flexibility, elegance, and, most importantly, the ability to reason about the correctness of the approach in a very clean manner. As stated above, from the query pipeline's perspective, mapping has just been reduced to view unfolding - there are no complicated mapping data structures or state management to keep track of mapping information! Isn't that cool? It gets even better. Read on …

4. Fine! That was a nice Jedi mind trick. But what about updates?

If we are expressing entity sets as views on tables, then propagating updates to the store essentially boils down to solving the well-known View Update problem in databases. Yay! We can leverage all the knowledge from that field and we are done. Well, not quite. It turns out that there are a number of limitations on view updates, e.g., most systems allow updates only to single-table views, or at best to a single table in a multi-table view - if you modify a view involving a join of two tables, databases typically do not allow updates to both tables. Of course, in mapping scenarios, a user may modify multiple fields of an entity that has been mapped to multiple tables - it is unacceptable to inform the user that such modifications are disallowed. Thus, the view update solution does not work for our environment.

So the question is - how do we handle translation of updates from the EDM space to the Store space in a declarative manner? We solve this problem by introducing a novel concept of Update Mapping Views or UM Views. These are views in which tables are expressed in terms of entity sets and association sets. Essentially, they are the "reverse" of the QM Views. For example, for the EPersons, SPersons, SCustomers example discussed above, the UM View for SPersons is (for SCustomers, the view is similar):

SELECT VALUE SPerson(p.pid, p.name)
FROM EPersons as p

The UM View simply says that the SPersons table contains all entities of type EPerson - similarly, for SCustomers. So what do we do with modifications of entities? Well, the modification problem for the above example is - given changes to EPersons, propagate them to SPersons and SCustomers. That is, given changes to the "base set" EPersons we need to propagate the updates to the "view" SPersons and the "view" SCustomers. Wait a minute - that is nothing but the Materialized View Maintenance problem in databases!! Unlike the View Update problem, Materialized View Maintenance is not as restrictive, e.g., you can have a single change to the base relation be propagated to multiple materialized views (recall that this would have corresponded to a multi-table update if we used the View Update approach on the QM Views).

Thus, by having "reverse" views we have converted our View Update problem with limited solutions into the much more flexible View Maintenance problem. Not only do we gain flexibility, it turns out that we can use the knowledge of Incremental Materialized View Maintenance algorithms that have been developed in the literature for propagating small changes in the EDM space to the table space. Of course, these algorithms were developed for the relational model for very general scenarios. Instead, we had to develop optimized algorithms for a non-relational model (i.e., the EDM) for our specific settings (i.e., we have keys on both sides, etc).

Thus, in the above example, given a change to the EPersons entity set, we can propagate the changes to the SPersons or SCustomers table depending on the type of the entity (how update propagation is done is something we will blog later).

5. Data Roundtripping

Our bidirectional views are generated in such a manner that data can roundtrip from the EDM to the store and back. To achieve this goal, the QM Views and UM Views need to be inverses of each other - let's see what that means. Suppose that an entity set E is spread across tables S1, S2, … Sn. That is,

E = QMView(S1, S2, …Sn)

We know that S1 = UMView1(E), S2 = UMView2(E), Sn = UMViewn(E).
(Note that he above views may involve other entity or association sets, e.g., S2 = UMView2(E2). For simplicity, we are ignoring such a scenario in this discussion)
So we get the round-tripping condition that must be satisfied for the views related to E:

E = QMView(UMView1(E), UMView2(E), … UMViewn(E))

That is, applying the update mapping views on E (sending the data to the store) and then applying the query mapping views on that result (querying data in E from the store) must yield E itself. The views used in ADO.NET vNext must not violate this condition. For the example above, it must be the case that:

EPersons = QMView(SPersons, SCustomers)
EPersons = QMView(UMView1(EPersons), UMView2(EPersons))

The reader can verify that this is indeed true.

6. Summary

To summarize our approach:

1.  We separate out shaping/transformation issues from runtime aspects

2.  We use a declarative style for expressing these transformations - for queries, we use Query Mapping Views that express entity/association sets in terms of tables. For updates, tables are expressed in terms of entity/association sets using Update Mapping Views.

3.  During query execution, translation of an EDM query to the store space is done by view unfolding the Query Mapping views.

4.  During update propagation, translation of entity changes to tables is performed by using incremental view maintenance algorithms over the Update Mapping Views.

5.  The bidirectional views declaratively say that data will roundtrip from the EDM to the store and back (assume a correct implementation :)

The component that converts the MSL files into eSQL views for each table, entity set, and association set is called the View Generator or Mapping Compiler.

7. Future posts

Hopefully, this post has piqued your interest about the mapping technology. There are a lot of things that I did not talk about today. They are topics for future postings:

1. Generation of views - You must be wondering how we get these views. Does the end-user write them? That seems painful! The answer is - we generate them from the XML mapping specification given to us. How we do it is actually quite interesting.

2. How does one express mappings? We have an XML format for specifying mappings (in an MSL file). It turns out that we have formal semantics for each of those mapping concepts in the MSL file. Understanding these semantics really helps in better understanding the mapping input and various mapping issues.

3. Validation of input mappings - We do extensive validation on the MSL file to make sure that the mapping is valid, i.e., to ensure data can roundtrip from the EDM to the Store and back. There is a ton of interesting constraint propagation work that we have done there.

4. Capabilities of mappings - The power of ADO.NET vNext's mapping capabilities can be appreciated by looking at some examples and the corresponding views.

5. Details on the update pipeline - How does propagation of updates happen via view maintenance? There are a number of interesting and subtle issues in the update pipeline that one has to take care of.

Please let us know if you have questions, criticisms, problems w.r.t. our Entity Mapping approach.