Project Austin Part 6 of 6: Storage

George Mileka [MSFT]

October 29th, 20120 0

Hi, my name is George Mileka. I’m a developer on the C++ Libraries team. I have been working on the Project Code Name Austin for many months with Jorge, Eric, and Alan. To learn more about what the Project Code Name Austin is, you can read this great post by Jorge Pereira.

For Project Austin, we have used ESE (Extensible Storage Engine) as the storage engine. In this blog post, I will explain why we chose ESE, how ESE works, and finally what abstractions we have created around it for our own use.

Why use ESE?

When we started thinking about how to store the pages, the strokes, and the photos in the notebook, we came to realize we need to support the following scenarios:

The application may be shutdown unexpectedly due to Windows 8 life time management. Even though the application will get a chance to save its state in the Suspending event handler, that time cannot be guaranteed to be long enough to save all the necessary state because it is not controlled by the application. As a result:

We need a way to guarantee data consistency even if the application was shutdown unexpectedly.
We also need a fast way to save the data to minimize data loss in case of an unexpected shutdown.

We expect the data to grow significantly over time; to the degree that we cannot save/load the entire data set within a reasonable amount of time. As a result:

We need the ability to be selective about what we load or save.
We need the ability to cache some of the reads or writes.

Taking these requirements into consideration, we realized that implementing our own layer would be a significant amount of work in terms of time and complexity.

At the same time, Jorge pointed out that we can use a database engine since it really does everything we are looking for.

So – it is going to be a database engine. But which database engine?

We needed a database engine that is accessible from Windows Store apps and has a relatively easy deployment story. For the initial release, we have decided to not support remote storage or sync’ing across devices to scope the work.

A few database engines came to mind:

ESE is a database technology that has been shipping in Windows for a few releases now. A process can access the functionality as it would access many other Windows Win32 functionalities: include the header (esent.h), make the Win32 calls, and link with the import library (esent.lib). ESE is already being used by various major Microsoft products.

SQL ServerCompact Edition

SQL Server Compact Edition is similar to ESE, but additionally, it has a query processor. Unfortunately, it does not currently support Windows Store apps.

SQL Lite

SQL Lite is an open-source project that provides similar functionality to SQL Server Compact.

Talking to the ESE team, it turns out that they are committed to having their engine work for Windows Store applications. Also, the fact that ESE is already in box (Windows maintains and services it), made it more attractive to us over SQL Lite (we had a pre-determined set of queries to run, so that lack of a query processor was not a huge minus). So, we decided to use ESE.

How ESE Works

In this section I will explain the high-level concepts in ESE. The most complete source for documentation is, of course, MSDN. ESE interface is a flat C API. Whether you want to create tables, define indexes, insert data, update data, or run queries, you do that by filling a data structure and passing it to the appropriate API.

Creating Tables

We create a table by passing an instance of JET_TABLECREATE4 to JetCreateTableColumnIndex4 JET_TABLECREATE4 describes the table and its columns and indexes. Columns are described by an array of JET_COLUMNCREATE instances and indexes are described by an array of JET_INDEXCREATE3 instances.

Writing Data

Writing data (inserting or updating) is done by building an array of type JET_SETCOLUMN where each element points to the value to be written as well as the column the value should be written to. Then, we pass the array to the appropriate ESE APIs (see JetPrepareUpdate/JetSetColumns/JetUpdate2).

In case of updates, ESE uses a cursor to decide where to apply the new set of values. So, we need to position the cursor at the right record (or records), and then invoke the writing API.

Note that we can build an array of only the fields we want to target.

In case of inserts, positioning the cursor is not necessary since it is going to be a new record.

Selection and Iteration

Reading, updating, and deletion are common operations that we sometimes apply to a specific row or set of rows that share some criteria.

ESE APIs allows us to achieve this by first defining a range, and then iterating through it.

Setting up A Cursor for a Range

Keys

To define a range, we first need to understand the concept of a key. We can think of a key as a fixed pointer to a specific record in a table. The pointer is set by specifying values for one or more columns -where those columns are part of an index on that table.

For example, if we have:

A table with columns: id, name, address, phone, age
An index X1 on (id).
An index X2 on (id, age).

Then, a key can be defined as

“pointer to the first row of row set with id = 5 in X1”. Or
“pointer to the last row of row set with id = 7 in X1”

Partial Keys

As mentioned above, the pointer is set by specifying values for one or more columns – where those columns are part of an index on that table.

A key is considered partial if it does not specify values for all the columns of an index. If the key partial, matching happens against the specified columns only.

In the above example, we can define a partial key as follows:

“pointer to the first row of row set with ‘id = 5’ in X2” (note that we have omitted age). This key will locate entries with ‘id = 5’, and ignore the values of ‘age’. This capability is very useful when you do not want to match the whole key.

One important aspect here is that for an index (c1, c2, c3), we can have partial keys by omitting values for columns on the right only.

In other words, the following are valid (partial) keys:

Key on c1 (partial key)
Key on c1,c2 (partial key)
Key on c1, c2, c3 (not a partial key, but still a valid key)

The following are not valid keys:

Key on c2, c3 (cannot omit c1)
Key on c3 (cannot omit c1 or c2 if c3 is to be used)
Key on c1, c3 (cannot omit c2 if c3 is to be used)

Ranges

Now that we can define keys, we just need to define two of keys to define a range. In some cases, the criteria might be the same for both keys except that we want one key to point to the first row of the set, while we want the other key to point to the last row of the same set (for example, “get me all students with age=15”).

ESE allows us to make this distinction when we are creating the key. We basically tell it whether we want the key to be a start key (JET_bitFullColumnStartLimit) or an end key (JET_bitFullColumnEndLimit).

Cursors

A cursor is a pointer to a row targeted by the next operation ESE row-specific operation. By setting the keys (above), the cursor is also set to the same row as the start key. The cursor can then be moved by calling JetMove.

ESE Range APIs

To define the ranges as described above, the user needs to do the following:

Select the index the key will be built against by calling JetSetCurrentIndex4.
Build the start key by repeatedly calling JetMakeKey for each column.

The first call to a new key must specify the JET_bitNewKey flag. This how ESE knows we are building a new key against the same index in case we building a start and end keys for example.
Note that this API does not take the column id as input because it assumes the user is passing the values in the same order the columns are defined in the selected index. This guarantees that we are building keys that may omit only ‘right’ column values.
Set the key to the first row by calling JetSeek.
Build the end key by following the same steps as the start key. However, at the very end, set the range by calling JetSetIndexRange.

Iterating

Iterating involves the moving of the cursor (JetMove) from one record to the next in the returned result set. The cursor knows the start from the start key, and knows the end from the end key. Note that both the cursor and the result set are not exposed to the user.

Row Specific Operations

Typically, we start by defining the range as described above, and iterate on the result using JetMove. The following are examples of row-specific operations.

Reading Records

We can retrieve the values of the current row by calling JetRetrieveColumn for each column we want to read.

Updating Records

We update values in the current row by calling JetPrepareUpdate / JetSetColumns / JetUpdate2.

Deleting Records

We delete the current record by calling JetDelete.

Threading and Transactions

ESE supports access to the same database from multiple threads. The recommendation on MSDN is to create a session for each thread. This becomes handy when submitting transactions form multiple threads as the session id is what is used to associate the commit transaction to the matching begin transaction. Also, all calls in between are associated through the same session id.

In the Austin project, since all our begin/commit transaction pairs happen in the same function, it is sufficient to make those functions non-re-entrant and avoid race conditions. This means that Austin calls into ESE from different threads will be serialized. This simplifies our thread model significantly – however, should we ever need to avoid serializing ESE access, we’ll need to change this part of the implementation and create a session for each thread.

Abstracting ESE

Why abstract ESE?

As mentioned earlier in this blog post, ESE interface is flat C API. Below I’m listing some of the reasons that made us create a higher level abstraction on top of ESE flat C API:

Handle the memory allocation and de-allocation safely and avoid dealing with memory buffers at the application level.
Provide type/type size safety through templates and specialized templates.
Use some of the C++ standard structures and paradigms (vectors, and iterators) instead of inflexible C data structures.
Reduce the surface of the storage layer to the application for better productivity.
Provide thread safety given that we are not using multiple sessions.

So, we ended up creating an engine class which wraps the functionality of the ESE APIs. We also created a set of classes that represent tables, columns, indexes, and cells. Along with satisfying the requirements outlined above, the engine is capable of interacting with those data structures to perform various operations (like creating tables, inserting/updating data, finding data, data deletion, etc).

Tables

Creating tables follows the same logical sequence ESE defines.

Create a table description object.
Create column description objects – add them to an array – add the array to the table object.
Create index description objects – add them to an array – add the array to the table object.
Tell ESE to create a table using the table description object.

Table Description Object

std::wstring name = L"Photo";
std::shared_ptr<s::itable> table = s::createTable(name);

Column Description Object

std::shared_ptr<s::icolumn_typed<b::uint32>> photoIdColumn;
photoIdColumn = s::createColumn<b::uint32>(L"ID", s::column_type::type_int32);
table->addColumn(photoIdColumn);

Note: There are definitely lots of improvements to be made. For example, we should not be passing column_type::type_int32 to the createColumn function since the template parameter already has this information.

Index Description Object

std::shared_ptr<s::iindex_column> activeIndexColumn;
 activeIndexColumn = s::createIndexColumnDefinition(_activeColumn, true);
 
 std::shared_ptr<s::iindex> activeIndex;
 activeIndex = s::createIndex(L"ActivePhoto_Index", false, false);
 
 activeIndex->addIndexColumn(_activeIndexColumn);
 table->addIndex(activeIndex);

The Cell Object

While table, column, and index objects hold definitions – the cell object is an instance of a column value in the database.

The only way to create a cell is through a column. A cell holds the same type as that of the column which created it.

std::shared_ptr<s::icell_typed<b::uint32>> photoIdCell;
 photoIdCell = photoIdColumn->createCell(0); // initialize with 0

Whenever there is an operation that requires passing values to the engine, we use a set of pre-created cells and populate them with the values. Then, this set is passed to the engine which knows how to traverse them, identify the parent columns, and translate the data structure to ESE C-structures before making the API calls.

void photo_table::insert(b::uint32 photoId, b::uint32 pageId, bool active)
 {
 _photoIdCell->setValue(photoId);
 _pageIdCell->setValue(pageId);
 _activeCell->setValue(active);
 std::shared_ptr<std::vector<std::shared_ptr<s::icell>>> row;
 row->push_back(_photoIdCell);
 row->push_back(_pageIdCell);
 row->push_back(_activeCell);
 table->insert(row);
 }

Running Queries

Building a Query

To build a query, we need to specify which fields we are interested in (SELECT c1, c2), and then specify which criteria (set of values) need to be satisfied in the returned rows.

 // select columns we are interested in...
 std::shared_ptr<s::iresult> result = s::createResult();
 
 result->addColumn(_photoIdColumn);
 result->addColumn(_pageIdColumn);
 result->addColumn(_activeColumn);
 
 // set staging row entries as search criteria and execute…
 
 _photoIdCell->setValue(photoId);
 _pageIdCell->setValue(pageId);
 
 std::shared_ptr<std::vector<std::shared_ptr<s::icell>>> row;
 
 row->push_back(_pageIdCell);
 row->push_back(_photoIdCell); // note : omitting field would have
 
 // resulted in a partial key.
 
 // execute
 table->read(row, result); // where values = those in row

Reading the Result

The result is just a two dimensional array. One dimension is a vector of rows (iresult_row), and the second dimension is a vector of cells (icell). To read the content, we iterate through each row, and through each cell. For each cell, we identify which column it belongs to and cast it to extract the value.

 // iterator through the result...
 std::shared_ptr<std::vector<std::shared_ptr<s::iresult_row>>> rows = result->getRows();
 
 for (auto &row : *rows)
 {
 std::shared_ptr<std::vector<std::shared_ptr<s::icell>>> cells = row->getCells();
 
 for (auto &cell : cells)
 {
 if( cell->getColumnId() == _photoIdColumn->getId() )
 {
 std::shared_ptr<s::icell_typed<b::uint32>> typedCell = std::dynamic_pointer_cast<s::icell_typed<b::uint32>>(cell);
 b::uint32 readPhotoId == typedCell->getValue()
 }
 .
 .

Source Code

You can see how we implemented this functionality by looking at the source code files. The source code is available for download on our CodePlex page.

The following folders have the files relevant for this post:

baja\storage\db
journal\models\db

We have an improved version of the storage wrapper along with a minimal client application to demonstrate the functionality. We have uploaded the sources to prototypes/storage. It is much easier to understand how the storage works by looking at those files.