
Entity SQL

ADO.NET’s Entity Client comes with a client-side query engine that provides a consistent entity-level platform over different store providers. The input to that query engine is an entity-level query written in a language naturally called Entity SQL. The purpose of Entity SQL is to provide a common, entity-level query language that is independent of any data store, so that queries written in it are the same across different store implementations. One might expect Entity SQL to be a small subset of SQL constructs common to most data stores. Those who do will be pleasantly surprised: Entity SQL is a very rich functional language whose expressiveness exceeds that of conventional SQL dialects.

 

Expressions

The building block of Entity SQL is the expression. The language is about composing complex expressions from simpler ones. Thus the simplest Entity SQL query looks like this:

 

-- Returns one scalar item

1;

 

The domain of expression types is multidimensional. With regard to multiplicity, expressions fall into two categories: items and collections. The difference between those two is easy to guess – collections are subject to relational operations while items are not. Here are some valid collection expressions:

 

-- Returns a collection of one element

{1};

 

 

-- Returns a collection of three elements

{1, 2, 3};

 

With regard to source, expressions may be literals, parameters, or nested expressions. That means anywhere a value is expected, it could be parameterized or composed of some other expression. Parameterization has both security and performance benefits. By passing user-supplied input as parameters, one avoids the risk of SQL injection. And because the query text stays constant, a frequently used query is likely to stay in the query plan cache and does not have to be recompiled for each execution.

 

Composition means that queries are no longer constrained to the tabular shape of conventional SQL. A nested expression may be placed anywhere a value of that type would be accepted. For instance:

 

-- Returns a row of four elements: a scalar and three nested collections.

-- x must be passed in the parameter collection.

ROW(@x, {@x}, {@x, 4, 5}, {@x, 7, 8, 9});

 

-- Returns a hierarchical collection of one element at top-level

-- x must be passed in the parameter collection.

{{{@x}}};

 

If a collection represents the vertical axis of a conventional SQL row set, then what represents the horizontal axis, i.e. how can we make a composite item? In addition to the scalar items that we’ve seen already, there are rows and entities. An Entity SQL row is identical to the conventional SQL row: it’s a set of items, possibly of different types. A row is constructed through the ROW() constructor.

 

An entity is an instance of an entity type from the entity model. An entity could be flattened to a row through the VALUE keyword. The opposite is also true – an entity may be constructed from a row using the entity type constructor as long as all the properties are supplied in the exact order as they are defined in the entity model.

 

Now we are ready for some samples with relational operations. All the samples below use the Northwind sample model:

 

-- Returns one row with two elements

ROW(1, 'abc');

 

 

-- Returns a collection of five identical rows with two elements each

SELECT ROW(1, 'abc')

FROM {11, 12, 13, 14, 15};

 

 

-- Returns a collection of two rows with two elements each

{ROW(1, 'abc')}

UNION ALL

{ROW(2, 'xyz')};

 

 

-- Returns a collection of all the entities from the Categories set

SELECT c

FROM Northwind.Categories AS c;

 

 

-- Returns a collection of rows where each row represents a flattened entity from the Categories set

SELECT VALUE c

FROM Northwind.Categories AS c;

 

Navigate vs. Join

An entity model has no notion of normalization as that is only applicable to store implementations. Therefore entity-level queries against a well-designed model should rarely need to use JOIN. Instead, those queries should use navigation properties. Those are properties that represent relationships in an entity model. For instance:

 

-- Returns a collection of rows where each row contains a Product entity and a reference to its corresponding Category entity

SELECT p, p.Categories

FROM Northwind.Products AS p;

 

Navigating the same relationship in the opposite direction is even more interesting:

 

-- Returns a collection of rows where each row contains a Category entity and a collection of references to its corresponding Product entities

SELECT c, c.Products

FROM Northwind.Categories AS c;

 

References and Entities

References and entities in Entity SQL differ with regard to entity properties: properties are directly applicable only to entities, not to references. A reference must be “dereferenced” to an entity first. For instance, to query all products and the name of each product’s category, we need the following query:

 

-- Returns a collection of rows where each row contains a Product entity and the name of its corresponding Category

SELECT p, DEREF(NAVIGATE(p, Northwind.FK_Products_Categories)).CategoryName

FROM Northwind.Products AS p;

 

Nested Queries

As mentioned above, a nested query may be used anywhere an expression of the type it returns is accepted. Furthermore, since Entity SQL is not limited to a tabular result set as conventional SQL dialects are, one can place a nested query even in a projection clause:

 

-- Returns a collection of rows where each row contains a Category entity and a collection of its corresponding (dereferenced) Product entities

SELECT c, (SELECT DEREF(p)

                FROM NAVIGATE(c, Northwind.FK_Products_Categories) AS p) AS Products

FROM Northwind.Categories AS c;

 

Paging

Paging is a hot feature that web applications demand from data providers. Roughly speaking, paging is extracting a range (page) of a base collection, starting at a given offset and proceeding for a given length (size). The benefit of paging is reducing the cost of building and transporting the result on the backend and of processing it on the frontend.

 

Entity SQL supports physical paging. That’s what the major data store providers support today. Physical paging means both the offset and the size of the requested page are given in number of items. For instance, if an application is rendering products from the Northwind model in groups of 10, on page 1 it will render items 1 through 10, on page 2 – items 11 through 20, and so forth. Physical paging has some deficiencies with regard to changes in the base collection:

·         For each item that is removed from a lower-index page, the rest of the collection will shift towards the beginning, and one item will not be shown at all.

·         For each item that is added to a lower-index page, the rest of the collection will shift towards the end, and one item will be shown more than once.

Despite its semantic deficiencies, physical paging can be implemented efficiently by data providers. It is also easy to comprehend and to adopt in applications.
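The two deficiencies are easy to reproduce outside of any data store. The following is a minimal C++ sketch, not Entity SQL: the hypothetical page() helper mimics what a skip/limit pair returns over an already ordered collection of integers, and demonstrate_shift() replays the removal and the insertion case on items 1 through 30 with a page size of 10. All names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns up to `limit` items of `ordered`, starting `skip` items from the beginning.
// (Hypothetical helper; it only imitates physical paging over an in-memory vector.)
std::vector<int> page(const std::vector<int>& ordered, std::size_t skip, std::size_t limit) {
    if (skip >= ordered.size()) return std::vector<int>();
    const std::size_t end = std::min(ordered.size(), skip + limit);
    return std::vector<int>(ordered.begin() + static_cast<std::ptrdiff_t>(skip),
                            ordered.begin() + static_cast<std::ptrdiff_t>(end));
}

// Replays both deficiencies after page 1 (items 1..10) has already been rendered.
void demonstrate_shift() {
    std::vector<int> items;
    for (int i = 1; i <= 30; ++i) items.push_back(i);

    // Case 1: item 3 is removed from page 1 before page 2 is requested.
    std::vector<int> removed = items;
    removed.erase(removed.begin() + 2);
    assert(page(removed, 10, 10).front() == 12);  // item 11 is never shown

    // Case 2: a new item is added in front of page 1 before page 2 is requested.
    std::vector<int> added = items;
    added.insert(added.begin(), 0);
    assert(page(added, 10, 10).front() == 10);    // item 10 is shown a second time
}
```

Calling demonstrate_shift() runs both cases; the assertions encode the "one item skipped" and "one item repeated" outcomes described in the two bullets.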

 

Paging implies some sort order. Although in standard SQL one can query a range without specifying any ordering, Entity SQL doesn’t allow that exception – paging constructs are invalid without an ORDER BY clause. The clauses to specify page offset and page size are SKIP and LIMIT respectively. The following example would retrieve products from position 21 through position 30 ordered by name:

 

-- Returns the products at positions 21 through 30, ordered by product name

SELECT p

FROM Northwind.Products AS p

ORDER BY p.ProductName

SKIP 20

LIMIT 10;

 

Each of SKIP and LIMIT is optional. When LIMIT is given alone, it is equivalent to TOP in the projection clause. However, mixing SKIP and TOP is not allowed.

 

Differences from Standard SQL

Although Entity SQL shares a set of operations and syntax constructs with standard SQL, it should not be approached as another “SQL dialect”. Instead, it should be regarded as a new, collection-oriented, query language with its own merits.

 

The syntax of standard SQL is driven by the tabular shape of the result sets that data stores return. There are some shortcuts that are not available in Entity SQL since Entity SQL does not imply any particular shape of the collections it operates on. Following are examples of how some standard SQL statements map into Entity SQL:

 

-- Pseudo-T-SQL

SELECT

FROM

 

UNION ALL

 

SELECT

FROM …;

 

 

Writing such a statement in Entity SQL is ambiguous since the operands of the UNION ALL operator are unclear. It would work over named collections like this:

 

-- Pseudo-Entity SQL

c1

UNION ALL

c2;

 

 

The rule in Entity SQL is that nested queries must always be enclosed in parentheses: 

 

-- Pseudo-Entity SQL

( SELECT

FROM … )

 

UNION ALL

 

( SELECT

FROM … );

 

Another convenient shortcut in T-SQL is that an ORDER BY clause may be specified at the end of the last query of a UNION ALL operator:

 

-- Pseudo-T-SQL

SELECT

FROM

 

UNION ALL

 

SELECT

FROM

 

ORDER BY …;

 

Entity SQL’s UNION ALL operator doesn’t have an ORDER BY clause at all. Again, nesting should be used:

 

-- Pseudo-Entity SQL

SELECT

FROM

    ( 

        ( SELECT

           FROM … )

 

          UNION ALL

 

        ( SELECT

          FROM … )

    )

ORDER BY …;

 

Overall, our belief is that having stricter rules and fewer exceptions is better for the adoption of a programming language. Once a developer abstracts herself from the tabular constraint of conventional result sets, Entity SQL will become much easier to comprehend than standard SQL.

 

 

A beta version of the Entity SQL language reference is available as part of the June CTP distribution. Take a look and try out some queries. Don’t hesitate to post questions and ask for help at the Entity Framework forum. Feedback is welcome.

 

 

Zlatko Michailov

Program Manager, Data Programmability

Microsoft Corp.

Posted by Zlatko Michailov | 3 Comments

Entity Framework June CTP Released

The June CTP of Entity Framework was released earlier this week. It includes work from the milestone when we switched from adding features to shaping the final product. More details on the content can be found in Danny’s blog post.

 

The bits can be downloaded from the Microsoft Download Center. There are some installation issues due to the point in time at which the CTP branch was cut off the main line. Workarounds are discussed at the ADO.NET blog.

 

Feedback is welcome at the ADO.NET Technology Preview forum.

 

 

Zlatko Michailov

Program Manager, Data Programmability

Microsoft Corp.

 


RE: Hello World!

It’s been over a year since my last post on this blog… I must have been too shy about Entity Framework and my role in it. I guess I owe an explanation.

 

Entity Framework is steadily driving towards release. The main planned features are in. We are aiming at giving the product a finished shape rather than adding as many features as possible. For feature overviews and latest news, check out the ADO.NET Team blog.

 

I am the program manager for the following Entity Framework components:

·         EntityClient (EntityConnection, EntityCommand, EntityParameter, etc.)

·         Query pipeline

·         Update pipeline

·         Entity SQL

 

Although I’m not the architect of those components, I’ve been scoping them for almost 2 years, and I’m driving them towards completion. I’ve been answering questions related to them on the Entity Framework forum. I’ll be happy to discuss usage patterns and, of course, answer more questions.


ADO.NET v.Next Links

Check out these links on ADO.NET v.Next:

 

Blogs:

          http://blogs.msdn.com/data

            http://blogs.msdn.com/dataaccess

 

Channel 9:

http://channel9.msdn.com/Showpost.aspx?postid=191667 

 

Screencast:

http://blogs.msdn.com/dataaccess/archive/2006/05/10/594781.aspx


MSDK Synchronization Primitives - SemiMutex

SemiMutex is a sophisticated type of mutex that mixes exclusive and shared locks following this widely accepted concept:

  • An attempt to perform an exclusive lock blocks until all active locks are released.
  • An attempt to perform a shared lock blocks until any active exclusive lock is released.

 

The tricky part is how to make an exclusive lock attempt hold back subsequent lock attempts of either kind while it waits for all shared locks to be released, and to do so using minimum system resources.

 

MSDK provides a very elegant implementation where each method consists of literally a couple of lines of code. That is made possible by the Blocking Counter that is used internally. Each shared/read lock increments the Blocking Counter, while its corresponding release decrements it. At entry level, access is serialized through a regular mutex, which holds back subsequent lock attempts. For an exclusive lock attempt to wait patiently for all shared locks to be released without consuming CPU cycles, all it needs to do is call WaitUntilClear() on the embedded Blocking Counter. That call waits on the internal Clear Event member of the Blocking Counter, which is signaled in a permanent manner when the counter value becomes 0.
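To make the idea concrete, here is a minimal sketch of the same scheme written against the C++11 standard library. It is not the MSDK code from semimutex.h: the ReaderCounter class is a hypothetical stand-in that offers only the Blocking Counter operations needed here, and the method names LockShared/UnlockShared/LockExclusive/UnlockExclusive are assumptions made for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the Blocking Counter: only the operations this sketch needs.
class ReaderCounter {
public:
    void Increment() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
    }
    void Decrement() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--value_ == 0) clear_.notify_all();
    }
    // Blocks, without spinning, until the counter value is 0.
    void WaitUntilClear() {
        std::unique_lock<std::mutex> lock(mutex_);
        clear_.wait(lock, [this] { return value_ == 0; });
    }
private:
    std::mutex mutex_;
    std::condition_variable clear_;
    unsigned value_ = 0;
};

class SemiMutex {
public:
    void LockShared() {
        // The entry mutex is held only briefly, unless a writer owns it.
        std::lock_guard<std::mutex> entry(entry_);
        readers_.Increment();
    }
    void UnlockShared() { readers_.Decrement(); }
    void LockExclusive() {
        entry_.lock();              // holds back every subsequent lock attempt
        readers_.WaitUntilClear();  // waits for the active shared locks to drain
    }
    void UnlockExclusive() { entry_.unlock(); }
private:
    std::mutex entry_;
    ReaderCounter readers_;
};

// Usage: two writers and two readers share one integer.
int run_readers_and_writers() {
    SemiMutex guard;
    int shared = 0;
    std::vector<std::thread> threads;
    for (int w = 0; w < 2; ++w) {
        threads.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                guard.LockExclusive();
                ++shared;
                guard.UnlockExclusive();
            }
        });
    }
    for (int r = 0; r < 2; ++r) {
        threads.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                guard.LockShared();
                assert(shared >= 0 && shared <= 2000);
                guard.UnlockShared();
            }
        });
    }
    for (std::thread& t : threads) t.join();
    return shared;
}
```

Note that a thread holding a shared lock must not request another one while a writer is waiting; the sketch, like most simple reader/writer locks, does not support that kind of recursion.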

 

 

SemiMutex makes the implementation of the Single-Writer–Multiple-Readers multithreading pattern absolutely trivial.

 

Enjoy!

 

Source code:

http://cvs.sourceforge.net/viewcvs.py/msdk/include/semimutex.h?rev=1.2&view=auto

 

Download MSDK:

          http://prdownloads.sourceforge.net/msdk/msdk-2.10.053.tar.gz?download (37K)

 


HOWTO: Group Outlook mail messages by conversation without wasting 2 lines per thread

Problem

I like Outlook Express as an email client. I like how nicely it groups message threads without wasting space. It’s excellent for… home use only though. When it comes to work, integration with Calendar becomes the most important factor - I manage 50+% of my time through Calendar. Thus, I have no choice but to use the big brother, Outlook, for email. The problem I have with Outlook is that it adds 2 extra lines per message thread that I don’t want to see. Imagine you have a folder full of collapsed message threads. The useful information in the view is only 33% because for each thread (which is collapsed to a single line), Outlook adds two more (unnecessary) lines – a blank one and a copy of the subject.

 

Half-Solution

Here’s what I’ve discovered to keep my messages organized by thread and by receive time:

  • Disable grouping
  • Sort by: Conversation Index (ascending), Receive Time (ascending)

Then Outlook is smart enough to simulate something like threads. The problem is that I cannot collapse threads, but I can live with that for now.

 

Feedback Welcome

If anyone has an idea how to group messages in collapsible threads, or knows a mail client that integrates with Calendar, please let me know. I’ll appreciate that.


MSDK Synchronization Primitives - Blocking Counter

The MSDK Blocking Counter is a simple synchronization primitive that turns out to be very powerful in modeling threading patterns. It wasn’t originally planned – it emerged during the design of the thread message queue, and proved to be very helpful later on.

 

Basic API

Blocking Counter is a counter with the following basic methods:

  • Increment         (operator ++)        – never blocks; increments the value of the counter.
  • BlockingDecrement (operator --)        – if the value of the counter is strictly positive, it decrements it without blocking; if the value of the counter is 0, it blocks until another thread increments the value.

 

Business Meaning

The Blocking Counter has a very common business meaning – it represents a pool of resources. A resource may be added at any time (Increment/++). A resource may be allocated only when there are 1 or more resources in the pool (BlockingDecrement/--). That is the opposite of Semaphore where we start with a fixed number of resources in the pool, and threads “borrow” resources temporarily. Interestingly enough a Semaphore may be modeled using a Blocking Counter, while the other way around is not possible.

 

Additional API

Additionally the Blocking Counter primitive exposes two blocking methods:

  • WaitUntilClear      - blocks until the counter value becomes 0 without affecting it.
  • WaitUntilDirty      - blocks until the counter value becomes positive without affecting it.
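The four methods above fit in a few dozen lines. The following is a minimal sketch built on a C++11 mutex and condition variable. It is not the MSDK implementation from bcounter.h (which wraps Win32 and Pthreads directly and exposes operators ++ and --); the Value() accessor and the consume_three() demo are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class BlockingCounter {
public:
    // Never blocks; adds one resource to the pool.
    void Increment() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
        changed_.notify_all();
    }
    // Blocks while the value is 0, then takes one resource.
    void BlockingDecrement() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return value_ > 0; });
        --value_;
        changed_.notify_all();
    }
    // Blocks until the value is 0; does not change it.
    void WaitUntilClear() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return value_ == 0; });
    }
    // Blocks until the value is positive; does not change it.
    void WaitUntilDirty() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return value_ > 0; });
    }
    // Snapshot of the current value (hypothetical helper).
    unsigned Value() {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }
private:
    std::mutex mutex_;
    std::condition_variable changed_;
    unsigned value_ = 0;
};

// Usage: a consumer blocks until the producer adds resources to the pool.
unsigned consume_three() {
    BlockingCounter pool;
    unsigned consumed = 0;
    std::thread consumer([&] {
        for (int i = 0; i < 3; ++i) {
            pool.BlockingDecrement();
            ++consumed;
        }
    });
    for (int i = 0; i < 3; ++i) pool.Increment();
    consumer.join();
    pool.WaitUntilClear();  // returns immediately: the pool is empty again
    assert(pool.Value() == 0);
    return consumed;
}
```

A single condition variable with notify_all() keeps the sketch short; a tuned implementation would likely use separate wake-up conditions for "clear" and "dirty".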

 

Application

A Blocking Counter could be used in queues, stacks, or other data structures, where a pop attempt should block until there is at least one available item. In MSDK Blocking Counters are used in SemiMutex and BlockingSequence, which are further used in ReadWrite and Thread respectively.

 

Source code:

http://cvs.sourceforge.net/viewcvs.py/msdk/include/bcounter.h?rev=1.4&view=auto

 

Download MSDK:

          http://prdownloads.sourceforge.net/msdk/msdk-2.10.053.tar.gz?download (37K)

 

 

SemiMutex is another interesting primitive and it will be the topic of my next post.

 


MSDK Synchronization Primitives - Event

Normally an event would not deserve any special attention as a synchronization primitive. However, the MSDK event compensates for the inconsistent/incomplete behavior of both the Win32 and the POSIX event primitives.

 

 

Win32 Event

A Win32 event has one of two types:

  • Manual           - once signaled, the event remains signaled until explicitly reset. While the event is in signaled state, threads consume it without blocking.
  • Automatic     - a signal is consumed exactly once. If one or more threads are blocked when an automatic event is signaled, exactly one thread is unblocked, and the event is reset immediately after that. If no thread is waiting on the event when it is signaled, the event remains signaled until a thread wants to consume it. That thread consumes the event without blocking, and the event is reset immediately after that.

Additionally, Win32 has an interesting function, PulseEvent(), which has a different effect on different types of events and which, unfortunately, is documented as unreliable. Pulsing an automatic event releases one blocked thread, or the signal is lost. Pulsing a manual event releases all blocked threads, or the signal is lost. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/pulseevent.asp

 

 

POSIX Condition

The corresponding primitive in POSIX is Condition. Conditions are only pulsing, i.e. if no thread is waiting on the condition, the signal is lost. POSIX has a function, pthread_cond_broadcast(), that explicitly releases all blocked threads. Additionally, a POSIX condition is always combined with a mutex, which turns out to be very convenient for modeling complex behaviors like MSDK Event.

http://www.die.net/doc/linux/man/man3/pthread_cond_signal.3.html

http://www.die.net/doc/linux/man/man3/pthread_cond_broadcast.3.html

 

 

MSDK Event

As I stated in my initial post on multithreading, I bet on Win32 as the main threading platform (http://blogs.msdn.com/zlatkom/archive/2006/03/08/Multithreading_WinPosix.aspx). Therefore I keep the terminology as close to Win32 as possible. However, I disagree with the concept that events should have a fixed type and that all the signaling they get is based on that type. So I changed that in the following manner: events have no specific type. Instead, the signaler specifies how it wants to signal the event. There are three possible ways to signal an event:

  • Pulse           – if one or more threads are blocked waiting on this event, exactly one thread gets unblocked and the event is reset. If no thread is waiting on this event, the signal is lost. This is the equivalent of a POSIX condition and pulsing a Win32 automatic event.
  • Signal           – if one or more threads are blocked waiting on this event, exactly one thread gets unblocked and the event is reset. If no thread is waiting on this event, the event remains in signaled state until a thread wants to consume it. That thread continues execution without blocking and the event is reset immediately after that. This is the default. This is the equivalent of a Win32 automatic event.
  • Permanent   – the event remains signaled until explicitly reset or signaled with Pulse or Signal. During that time all threads that want to consume the event continue their execution without blocking. This is the equivalent of a Win32 manual event and broadcasting a POSIX signal plus remaining signaled.
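The three signaling modes can be sketched on top of a C++11 mutex and condition variable. This is not the MSDK code from event.h. The names Set(), Reset(), Wait() and the non-blocking TryWait() helper are assumptions made for illustration, and the sketch assumes that a Pulse always leaves the event reset. It also ignores corner cases, such as a Reset() racing with threads that are still waking up after a Permanent signal.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class Event {
public:
    enum class Mode { Pulse, Signal, Permanent };

    // The signaler chooses the mode; Signal is the default.
    void Set(Mode mode = Mode::Signal) {
        std::lock_guard<std::mutex> lock(mutex_);
        permanent_ = (mode == Mode::Permanent);
        if (permanent_) {
            cond_.notify_all();          // every thread passes until Reset()
        } else if (waiters_ > grants_) {
            ++grants_;                   // exactly one blocked thread may leave
            cond_.notify_all();
        } else {
            // Nobody to release: a Signal is remembered, a Pulse is lost.
            pending_ = (mode == Mode::Signal);
        }
    }
    void Reset() {
        std::lock_guard<std::mutex> lock(mutex_);
        permanent_ = false;
        pending_ = false;
    }
    // Consumes the event if it is signaled; never blocks (hypothetical helper).
    bool TryWait() {
        std::lock_guard<std::mutex> lock(mutex_);
        return Consume();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (Consume()) return;
        ++waiters_;
        cond_.wait(lock, [this] { return permanent_ || grants_ > 0; });
        --waiters_;
        if (grants_ > 0) --grants_;
    }
private:
    // Caller holds mutex_. A remembered Signal is consumed exactly once.
    bool Consume() {
        if (permanent_) return true;
        if (pending_) { pending_ = false; return true; }
        return false;
    }
    std::mutex mutex_;
    std::condition_variable cond_;
    bool permanent_ = false;
    bool pending_ = false;
    unsigned waiters_ = 0;  // threads currently blocked in Wait()
    unsigned grants_ = 0;   // single releases handed out but not yet consumed
};

// Single-threaded walk through the three modes.
void check_modes() {
    Event e;
    e.Set(Event::Mode::Pulse);
    assert(!e.TryWait());                // nobody was waiting: the pulse is lost
    e.Set();
    assert(e.TryWait());                 // the signal was remembered...
    assert(!e.TryWait());                // ...and consumed exactly once
    e.Set(Event::Mode::Permanent);
    assert(e.TryWait() && e.TryWait());  // stays signaled
    e.Reset();
    assert(!e.TryWait());
}

// A waiter is released whether it blocks before or after the signal.
bool wait_for_signal() {
    Event ready;
    bool passed = false;
    std::thread waiter([&] { ready.Wait(); passed = true; });
    ready.Set();
    waiter.join();
    return passed;
}
```

The grants_ counter is what lets Pulse and Signal release exactly one thread even though a condition variable may wake several.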

 

Source code:

http://cvs.sourceforge.net/viewcvs.py/msdk/include/event.h?rev=1.3&view=auto

 

Download MSDK:

          http://prdownloads.sourceforge.net/msdk/msdk-2.10.053.tar.gz?download (37K)

 

 

As I was writing this post, I discovered a bug and an enhancement:

  • Bug: possible race condition while setting the event state in line 186. To resolve it, line 186 should be moved between lines 152 and 153.
  • Enhancement: Replace the call to PulseEvent() on line 166 with SetEvent() followed by ResetEvent(), since the MSDN help advises against using PulseEvent(). Generally, that replacement is not equivalent. However, in MSDK that is fine because access to an Event instance is serialized.

 

 

Next are the extended synchronization primitives – Blocking Counter and Semi-Mutex.

 


MSDK (Multithreading SDK)

MSDK is a C++ template library released under the GNU Lesser General Public License (LGPL) http://www.gnu.org/licenses/lgpl.html, which means it is truly free for any purpose. Nothing in MSDK is patented. Since MSDK is a C++ template library, it is distributed only as source code (plus samples and documentation). The archive can be downloaded from:

          http://prdownloads.sourceforge.net/msdk/msdk-2.10.053.tar.gz?download (37K)

The source code may be browsed online at:

          http://cvs.sourceforge.net/viewcvs.py/msdk

 

MSDK builds and runs under Windows and Linux. Its purpose is to create a convenient OS-independent framework for C++ developers. MSDK uses Win32/MSVCRT directly under Windows, and Pthreads otherwise.

 

MSDK 2.10, released on 12/23/2004, includes:

Basic Primitives

·         Mutex      - exclusive lock

·         Event       - tri-state event: pulse, signal, and permanent

 

Extended Primitives

·         BlockingCounter   - blocks when trying to decrement from 0. A cornerstone in MSDK!

·         SemiMutex          - shared- and exclusive locks mixed

 

Thread Support

·         Message<T>                 - optimized (typed) buffer carrier

·         BlockingSequence<T>     - queue/stack that blocks when trying to pop from an empty sequence

·         Thread                        - the most important feature

 

Multithreading Patterns

·         ReadWrite           - single writer multiple readers

·         ThreadPool<T>    - thread pool

 

Process Support

·         Config                - config file read/write

·         Log<T>              - asynchronous log writer

·         Service              - Windows service-, Linux daemon stub

 

I will elaborate on each class in my upcoming posts. Then I’ll reveal my vision about the direction in which MSDK should develop.


Windows- and POSIX Threads

Writing a simple console program in C++ that does some computation, file input/output, and memory structure processing, and that also builds and runs under Windows and multiple UNIX flavors, is quite feasible – not too many #ifdefs are required. Unfortunately, when things start scaling up and concurrent processing, i.e. threading, is required, one hits a brick wall.

 

Windows

Threads became popular with the 32-bit (and hybrid) versions of Windows – Windows NT 3.1 and Windows 95. It may be arguable whether threads were implemented in Windows or VMS first but I don’t find that very important. Here is an interesting article that compares Windows NT and VMS:

http://www.windowsitpro.com/Windows/Articles/ArticleID/4494/pg/1/1.html

What I do find important is that threading is a key point in Windows’s architecture. Threads have been carefully designed and implemented for performance. And I bet the majority of the threading code worldwide is written against Windows threads.

 

Programming against Windows threads has one caveat: initializing the CRT (C Runtime) library in Visual C++. The CRT library has been around ever since the Microsoft C compiler was released for MS-DOS. It uses internal buffers for some of its functions. During the port to 32-bit Windows, those global buffers were converted to thread local storage as an easy way to achieve thread safety. The problem is that every thread that uses the CRT must initialize its own thread local storage. Therefore the CRT provides its own thread creation functions, which should be used in place of the native Win32 calls and which do the initialization internally. For a similar reason, MFC exposes its own thread creation API, but I won’t bother with it since MFC is not portable.

 

POSIX

The UNIX community’s response to Windows threads has been a standard specification known as POSIX Threads or Pthreads. Here is a link to the standard itself:

http://standards.ieee.org/reading/ieee/interp/1003-1c-95_int/index.html

and another link that provides a more readable interpretation of the spec:

http://www.llnl.gov/computing/tutorials/pthreads/

 

RedHat and SuSE (and perhaps most Linux distributors as well) ship an implementation called LinuxThreads with their 2.4 kernel distributions:

http://pauillac.inria.fr/~xleroy/linuxthreads/

which is replaced by NPTL in the 2.6 kernel distributions:

http://en.wikipedia.org/wiki/NPTL

It even turns out RedHat has an implementation of Pthreads for Windows:

http://sources.redhat.com/pthreads-win32/

 

Portable Threading Code

Conceptually the two sets of APIs, Win32 and Pthreads, are pretty close. However, the function signatures differ considerably. One option is to always program against the Pthreads API, but then there is a good chance of losing the native Win32 threading performance. Furthermore, both APIs are plain C, and as a C++ programmer I would like to program against a class library.

 

So I implemented such a class library that resolves to Win32/CRT threads under Windows and to Pthreads otherwise. I called it MSDK. The code that wraps the platform APIs is minimal. MSDK implements well-known synchronization primitives plus some new ones. It also implements some well-known threading patterns, plus classes for building portable services. That will be the topic of my next post.
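To give an idea of how thin such a wrapper can be, here is a minimal sketch of a thread class that resolves to _beginthreadex() under Windows and to pthread_create() otherwise. It is not the MSDK Thread class, which offers considerably more; the PortableThread name, its Start()/Join() methods, and the demo functions are hypothetical.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#include <process.h>
#else
#include <pthread.h>
#endif

class PortableThread {
public:
    typedef void (*Entry)(void*);

    PortableThread(Entry entry, void* arg)
        : entry_(entry), arg_(arg), started_(false), handle_() {}

    // Spawns the thread; returns false if the platform call fails.
    bool Start() {
#ifdef _WIN32
        // _beginthreadex rather than CreateThread, so the CRT sets up its per-thread data.
        handle_ = reinterpret_cast<HANDLE>(
            _beginthreadex(nullptr, 0, &PortableThread::Trampoline, this, 0, nullptr));
        started_ = (handle_ != nullptr);
#else
        started_ = (pthread_create(&handle_, nullptr, &PortableThread::Trampoline, this) == 0);
#endif
        return started_;
    }

    // Waits for the thread to finish and releases the platform handle.
    void Join() {
        if (!started_) return;
#ifdef _WIN32
        WaitForSingleObject(handle_, INFINITE);
        CloseHandle(handle_);
#else
        pthread_join(handle_, nullptr);
#endif
        started_ = false;
    }

private:
    // The platform-specific entry point forwards to the portable one.
#ifdef _WIN32
    static unsigned __stdcall Trampoline(void* self) {
        PortableThread* t = static_cast<PortableThread*>(self);
        t->entry_(t->arg_);
        return 0;
    }
#else
    static void* Trampoline(void* self) {
        PortableThread* t = static_cast<PortableThread*>(self);
        t->entry_(t->arg_);
        return nullptr;
    }
#endif

    Entry entry_;
    void* arg_;
    bool started_;
#ifdef _WIN32
    HANDLE handle_;
#else
    pthread_t handle_;
#endif
};

// Usage: run a function on another thread and wait for it.
void set_to_42(void* p) { *static_cast<int*>(p) = 42; }

int run_once() {
    int value = 0;
    PortableThread t(&set_to_42, &value);
    bool ok = t.Start();
    assert(ok);
    t.Join();
    return value;
}
```

The only platform-specific code is confined to Start(), Join(), and the trampoline; everything the caller sees is the same on both platforms.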


Introduction

Starting with this post I’m launching a series about multithreading where I’ll share my experience and my ideas about the future. I will not provide basic educational material. Instead, I’ll describe my work on primitives and patterns. If you are looking for a place to start learning, consider the following links:

http://en.wikipedia.org/wiki/Multithreading

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndllpro/html/msdn_threads.asp

 

What makes threads interesting is not spawning and terminating them but rather the constraints implied by their concurrent execution. Everything boils down to safe resource sharing (without losing too much performance). That is known as thread safety. Interestingly enough, thread safety implies a very small number of synchronization primitives and design patterns, which makes it a very attractive target for improvements. And that’s what I did – I designed and implemented a set of primitives and patterns that make multithreaded programming a commodity rather than a craft.

 

I’ll try to keep the posts short as opposed to writing long articles. I’ll explain how I got motivated to do that and, of course, I’ll discuss details of the implementation.


Active Data

1.   Passive Data

Data, as we know it today, resides quietly in designated data stores, and allows external services to hammer on it without losing its quality. Active logic is coded in services, not in data. It is services that decide how to transport, reproduce, and/or reshape data. Being passive, data is susceptible to an unlimited number of attacks of uncontrolled brutality. Here is an example:

 

“Secure” File Systems

Most file systems today support some sort of an access control to their files and folders. Although access control information is stored as part of the file system, enforcing it is an activity performed by the operating system, which is a service separate from the file system. Any unprivileged user can access every single file and folder on a Windows/Linux dual boot machine. Install this NTFS driver in Linux http://data.linux-ntfs.org and see what the SYSTEM account can do. Alternatively, install this Ext2/Ext3 driver in Windows http://www.fs-driver.org and feel like a Linux root.

 

The software industry recognized the need to activate data long ago: structs and functions were combined into classes with public and private members. Great! If that need is obvious for source code, why is it not obvious for compiled products? Data units must implement their own processing functionality! Here is a good example:

 

Magnetic Stripe vs. Smart Card

If an attacker gets hold of a magnetic-stripe card, all they need is a reading device and a software driver, and they can dump everything from the card (even if that takes a large number of attempts). A smart card may not be read – it may only be asked to respond to good requests. And, if the smart card detects a threat, it may stop responding, or even destroy itself. Today smart cards are standalone hardware devices but I believe one day all the data we produce and exchange will be protected like that.

 

2.   What Does It Mean for Data to Become “Smart”?

  • First, data will be produced and exchanged in Self-Contained Units of Data (SCUD). Each scud will carry sufficient code to perform all its public operations. That code must be executable on all major hardware/OS configurations. That doesn’t mean each Word document will embed WinWord.exe. Instead, a document will call a certified implementation of a standard WinWord API by a trusted vendor. Scuds are sandboxed based on their type, i.e. different sets of services are enabled for them based on their type.

 

  • Second, the content of every scud must be encrypted to protect its confidentiality, the scud as a whole must be hashed to prevent tampering with its content and/or code, and the scud may optionally be signed to guarantee its origin. Specifying a complete mechanism for protecting scuds is not a trivial problem, and is a good topic for another post. I’ll appreciate any help with that.

 

  • Third, services will become servants of scuds. Instead of performing actions on data, services will create a favorable environment that attracts scuds to perform their own operations.

 

3.   Consequences

3.1.          Privacy

People will have to get used to the idea of having (and using and guarding) a truly private key (not the one from their employer) in order to prove the origin of their scuds and to be able to receive confidential scuds without malicious actors along the path impersonating them and thus intercepting scuds. Scuds could be further programmed to destroy themselves in an unfriendly environment.

3.2.          Viruses

Viruses will have their renaissance. Since all scuds will be executable, distinguishing between a real message and a virus will be a real challenge. It will be up to the platforms to properly sandbox scuds.  

 

4.   Data Workflows

Having said all that, let’s see what some popular data workflows will look like:

4.1.          Data Transportation

Services no longer transport data packets from node A to node B. Instead, a service on node A creates a socket to its corresponding service on node B. The socket attracts some scuds to travel from node A to node B. There may be multiple open sockets from node A at any given moment. Each socket attracts a different set of scuds (to a different degree).
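As a rough illustration of "each socket attracts a different set of scuds to a different degree", here is a small sketch. The `Scud` and `Socket` classes, the numeric attraction score, and the `choose_socket` helper are all hypothetical names I am making up for this example; nothing here is an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Scud:
    destination: str   # where this scud wants to end up
    kind: str          # e.g. "mail" or "page"


@dataclass
class Socket:
    remote_node: str
    # Hypothetical scoring function: how strongly this socket attracts
    # a given scud. A score of 0.0 means "not attracted at all".
    attraction: Callable[[Scud], float]


def choose_socket(scud: Scud, sockets: Iterable[Socket]) -> Optional[Socket]:
    """The scud, not the service, decides: it picks the most attractive socket."""
    best = max(sockets, key=lambda s: s.attraction(scud), default=None)
    if best is None or best.attraction(scud) <= 0.0:
        return None  # nothing attractive enough; the scud stays where it is
    return best
```

For example, a socket to node B might score 1.0 for scuds destined for B and 0.1 for everything else, so a scud with no better option could still drift that way.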

4.2.          Web Browsing

A user creates a socket to a web server and requests a copy of a scud within the server’s scope. The web server asks the scud to replicate itself and attracts the copy through the socket to the browser. The copy itself will probably not be replicable or persistable. (The server may instead be configured to send the original scud: the “drop box” scenario.) If the request is for the result of a servlet, that servlet creates a scud dynamically, and the server attracts that dynamic scud through the socket to the browser.

4.3.          Email

Alice produces a mail scud that is automatically encrypted and hashed. She further signs it with her private key. Alice instructs the scud to take off towards Bob. The scud then tries to find a local mail service to use as a vehicle for the first leg. The local mail service on her machine creates a socket to the nearest mail server, and that socket attracts all outgoing mail scuds. Once at Alice’s mail server, the scud requests a socket towards Bob. A socket is created to Bob’s mail server that attracts the scud. After arriving at Bob’s mail server, the scud sits and waits to be attracted by a socket. When Bob logs onto his machine, his local mail service automatically creates a socket to the mail server. His user token in the socket attracts all mail scuds addressed to him. Thus the one from Alice makes it through the final leg.

4.4.          Permanent Email Address

In order to maintain a permanent email address where we could be reached regardless of employer and Internet service provider (ISP), today we have two options:

·         A free email server, e.g. Hotmail, Yahoo, Gmail, etc.

·         An email proxy, e.g. SourceForge.

Whichever option we choose, we choose it not on its merits but because it’s the lesser evil. Here is how it should be: Bob has a global identity. When Bob logs onto a computer, his session creates the most attractive environment for scuds targeted at his identity. His immediate ISP has the second most attractive environment, and so forth. User identity should propagate throughout the global network in a way similar to how DNS records propagate today. Thus, when Alice sends Bob an email, the scud follows the strongest attraction, which should ultimately lead to Bob’s session. Sending a message to Bob does not require knowing and specifying his mail service provider. The only thing needed is the public view of his global identity.
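The "follow the strongest attraction" idea can be sketched as a toy model. All of it is assumed for illustration: the network is a plain adjacency dictionary, attraction falls off as 1/(1 + hops) from Bob's session, and `propagate_identity` and `route` are hypothetical helpers. Real identity propagation would be far more involved.

```python
from collections import deque
from typing import Dict, List


def propagate_identity(links: Dict[str, List[str]],
                       session_node: str) -> Dict[str, float]:
    """Spread an identity outward from the session node, breadth-first.

    Attraction is assumed to be 1 / (1 + hops): 1.0 at the session itself,
    0.5 at the immediate ISP, and so on.
    """
    hops = {session_node: 0}
    queue = deque([session_node])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return {node: 1.0 / (1 + h) for node, h in hops.items()}


def route(links: Dict[str, List[str]],
          attraction: Dict[str, float],
          start: str) -> List[str]:
    """Let a scud climb towards ever stronger attraction until it cannot."""
    path = [start]
    current = start
    while True:
        here = attraction.get(current, 0.0)
        stronger = [n for n in links.get(current, ())
                    if attraction.get(n, 0.0) > here]
        if not stronger:
            return path  # local maximum; ideally the target's session
        current = max(stronger, key=lambda n: attraction[n])
        path.append(current)
```

Note that the sender's node never names a mail provider: the scud only compares attraction values at each hop, which is the point of the global identity.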

 

5.   Timeline

The trend is clear: object-oriented programming is unquestioned, and smart cards are very popular in Europe and slowly ramping up in the US. Today the world is still at a stage where information and influence are insufficient, and the problem for people is how to gather more up-to-date information on time. For the above vision to become reality, the world must enter another stage: one where people want to keep their information away not so much from individual hackers as from commercial telemarketers whose business is to intercept traffic and snoop on private information, and where people and network nodes choke on volumes of junk and start losing important information. We are not there yet, but we are headed that way.

 

Posted by Zlatko Michailov

Hello World!

My name is Zlatko Michailov. I am a program manager at Microsoft. I wrote code for a living for 13 years. Since I no longer do that, I let off steam in words.

 

I am passionate about OS-independent programming, multithreading, security, networks, and data. I’ll try to explain in my posts why I think those are tightly connected.

 

Bear with me.

 

 