PatHelland's WebLog

  • Trip Report -- HPTS (High Performance Transaction Systems) Workshop -- Part 1 of the Trip Report

    Trip Report – HPTS (High Performance Transaction Systems) Workshop – Oct 25th-28th, Asilomar, Pacific Grove, CA.   (www.hpts.ws).    This is Part 1 of the trip report…  I am chiseling my way through my notes and expect this to take a couple more blog entries before I can describe everything I plan to cover.  Believe it or not, this 11 page “Part 1” is only taking us through Monday morning – there’s Monday afternoon, all Tuesday, and Wednesday morning yet to cover!

    Overview:

    HPTS was founded by Jim Gray, Dieter Gawlick, Andreas Reuter, and some other industry luminaries in 1985.  It is a workshop designed to bring together practitioners and academics working in super large, scalable, high throughput systems.  There is a lot of overlap between this community and the folks doing DBMS systems.  In the early days, the biggest challenges included getting 1000 transactions per second through a mainframe.  At the time, that felt like getting a man to land on the moon and come back safely.  Now, this includes huge web based sites like Amazon, eBay, Google, and Bing.  Our area of interest also includes integrating massively complex heterogeneous Enterprise and Internet (B2B) systems.  Of course, we are fascinated with hardware trends, networking, and all sorts of geeky stuff.

    I first attended HPTS in 1985.  It was a magic opportunity to be with legends in the industry.  Gene Amdahl drove up in his Rolls Royce and parked next to my beat up old van.  Mike Blasgen (System R), Mike Stonebraker, Bruce Lindsay, Mohan, Andy Heller, Dieter Gawlick, Hector Garcia-Molina, Irv Traiger, Phil Bernstein, Jim Gray, and many more folks who had become well known to me through their papers were there in the flesh and treating me (a 29 year old punk) with kindness and respect.  It was a magical 3 day gathering.

    Let me explain the Asilomar Conference Center (www.visitasilomar.com).  Asilomar sits right by the Pacific Ocean between Pebble Beach and Monterey in the Pine Trees and includes dunes leading right to the beach.   It was built as a YWCA around 1910 or so and contains a number of beautiful historic buildings in the arts and crafts style built by architect Julia Morgan.  In the 1950s, it was sold to the state of California which has operated it as a conference center (and built more buildings in the 1960s).  The rooms are rustic (no TVs or phones – crappy mattresses), the food is served in a large dining hall when the bell rings on the roof and it has a magic feel.   It is normal to walk out of your room, hear the surf, and see deer walking around.  I absolutely love the evenings by the real wood fireplaces.

    In 1985, Gray decided we could all go cheap and lined us up to sleep four in a room.  I shared a room with Dieter Gawlick (inventor of IMS/FastPath among other things), Hector Garcia-Molina, and Theo Haerder (a wonderful professor from Germany).   After the first night, Hector informed us we ALL snored too much and moved to stay in a hotel down the street.  For the next 8 or 10 gatherings, we were two to a room but now most of us pay a bit more to have a private room.

    I’ve attended HPTS every two years (for three days) since 1985.  I was program and general chair in 1989 and general chair in 2007.  Since Jim Gray’s disappearance, I’ve assumed the role of local arrangements chair and “godfather” – kinda’ watching out for the whole workshop.  This year, James Hamilton (now of Amazon) was general chair and Margo Seltzer (of Harvard) was the program chair.  We had a wonderful (but exhausting) three days and I will tell you about them soon.

    One of the traditions is that the folks leading HPTS wander around during the first two days trying to figure out who gets suckered into doing the work as the next workshop happens two years later.  I am thrilled to report that Mohan (from IBM) will be the general chair in 2011 and Mike Carey (now of UC Irvine) will be the program chair.  I will remain as local arrangements chair and “godfather”.

    So, every HPTS starts with Sunday dinner and a casual beer/wine afterwards.  Monday is all day presentations and after dinner we have poster sessions (volunteer 10 minute talks – we had 15 this year).  Tuesday is all day presentations and this time we had a 90+ minute panel after dinner.  Finally, we have sessions Wednesday morning and adjourn after lunch.  It is a long, exhausting, and exhilarating 3 to 4 days.

    Of course, the most important part of the workshop is the breaks and the opportunity to talk with people working in the same area as you are working.  In addition to the grizzled old luminaries, this year we went out of our way to get more graduate students and invited 16 of the brightest and best from Berkeley, Stanford, MIT, Harvard, Brown, and more.   I spent each of the three evenings staying up late talking by the fire with these wonderful young folks (and some other diehard old farts).   As I type this the following weekend, I am still tired from the sleep shortage!

    So, let’s get to the content:

    Session 1) The New Memory Hierarchy –
                                Session Chair: David Cheriton (Stanford)

    Steve Kleiman – Chief Scientist at NetApp:   “Flash on Compute Servers”

    Steve Kleiman is the Chief Scientist at NetApp, so you would expect him to be focused on network attached storage, and we were not disappointed.

    Steve started with a discussion of hardware trends (not all of which I was able to capture in my notes).  He observed that with multi-core and VMs, more bandwidth (BW) and IOs per Second (IOPS) are needed for each node.  Also, as a trend, Steve pointed out that flash memory has more than 5X the IOPS and less than 100 microseconds access time.

    The original case for network storage lies around the benefits of aggregating disk which include more random IOPS and bandwidth as well as the management simplicity of concentrating data into a centralized place.  It has been the case that a single application per host has somewhat limited ability to consume bandwidth.

    Considering flash in a networked storage environment, it does not have the same benefits of sharing.  Since there are enough IOPS in a flash drive to placate most apps, the equation changes.  Is network storage fast when the network latency is noticeable compared to flash?  We can do 5 SSDs to get 40,000 IOPS but will the latency be a problem?

    We will see flash on compute servers.   Servers can already accommodate flash, and it is cheap enough: 250GB costs less than 10% of the server cost.

    A new tier in the storage hierarchy:  Steve argues we will see storage split into two tiers: the IOPS tier and the capacity tier.  The IOPS-Tier will offer the best $/IOPS with low latency and will have sufficient capacity for the active data in most typical enterprise systems.  It will be the new “primary storage”.   The Capacity-Tier will offer the best $/GB – Deduplication, compression, and good serial performance.

    Steve then outlined three broad categories of Host-Based Flash usage:  DAS (which I think means “Direct Attached Storage” but my notes failed me), Primary Storage Service, and Cache.

    ·         Host-Based Flash: DAS:  Single point of failure, so you would use normal backup or application based replication (like Exchange or Oracle).  You can have boot files, VM, temp files.

    ·         Host-Based Flash: Primary Storage Service:  File or Block.  Requires mirroring or RAID to peers.  Allows capacity sharing.  Steve believes Microsoft and VMware are likely to compete here – the integration of this is complex.  Making reliable primary storage is hard and this may impact the roles of folks in IT.

    ·         Host-Based Flash: Cache:  File or block.  This will be a write-through cache.  If you do a write-back you will occasionally lose stuff.  Using a cache can allow for a centralized data management (in a remote network storage server).  This can help with keeping the current IT storage management roles by providing an automatic “tiering”.
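    The write-through versus write-back point can be made concrete with a small sketch.  This is my own illustration and not anything Steve showed: plain Python dicts stand in for the host flash and the remote network storage, and the class and method names are hypothetical.

```python
class WriteThroughCache:
    """Host-side cache in front of a backing store.

    Illustrative only: dicts stand in for host flash (cache) and for
    remote network storage (backing store).
    """

    def __init__(self, backing_store):
        self.backing = backing_store  # authoritative copy
        self.cache = {}               # host-local copy; may vanish at any time

    def write(self, key, value):
        # Write-through: update the backing store before acknowledging.
        self.backing[key] = value
        self.cache[key] = value

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.backing[key]  # fill on miss
        return self.cache[key]

    def lose_cache(self):
        # Simulates a host or flash failure.
        self.cache.clear()


class WriteBackCache(WriteThroughCache):
    def __init__(self, backing_store):
        super().__init__(backing_store)
        self.dirty = set()

    def write(self, key, value):
        # Write-back: acknowledge after updating only the cache.
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:
            self.backing[key] = self.cache[key]
        self.dirty.clear()

    def lose_cache(self):
        super().lose_cache()
        self.dirty.clear()  # unflushed writes are gone for good
```

    With the write-through version, losing the host cache costs only a refill from the backing store.  With the write-back version, any write acknowledged but not yet flushed disappears with the cache, which is the "occasionally lose stuff" case.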

    So, data management spans the IOPS tier and the Capacity tier.  You need to have automated data movement and global access.  It is impractical to have manual placement at huge scale.  It is going to be important for us to have a technology independent language to specify the desired properties for the data.  Say what properties you want and not how to get them.   New SLO (Service Level Objectives) for Max Latency, Bandwidth, Availability, etc.
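    To give a feel for "say what you want, not how to get it", here is a rough sketch of a declarative placement decision.  Everything in it is an assumption of mine: the `SLO` and `Tier` types, the `place` function, and especially the tier numbers, which are made up for illustration and are not NetApp's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Desired properties for a piece of data (hypothetical fields)."""
    max_latency_us: float
    min_bandwidth_mb_s: float


@dataclass(frozen=True)
class Tier:
    name: str
    latency_us: float
    bandwidth_mb_s: float
    dollars_per_gb: float


# Made-up characteristics, for illustration only.
TIERS = [
    Tier("iops-tier (host flash)", latency_us=100,
         bandwidth_mb_s=250, dollars_per_gb=1.00),
    Tier("capacity-tier (network disk)", latency_us=10_000,
         bandwidth_mb_s=100, dollars_per_gb=0.10),
]


def place(slo, tiers=TIERS):
    """Return the cheapest tier that satisfies the SLO, or None."""
    candidates = [
        t for t in tiers
        if t.latency_us <= slo.max_latency_us
        and t.bandwidth_mb_s >= slo.min_bandwidth_mb_s
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.dollars_per_gb)
```

    The caller states only the latency and bandwidth it needs.  Which tier the data lands on is the system's decision, so the data can move automatically as tiers are added or their characteristics change.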

    In summary, the new host-based flash is the new IOPS-Tier.  It will replace high-performance primary storage in the cloud.  This IOPS-Tier is not going to be economical as secondary or archival storage since the $/GB is higher than spinning media.   The Capacity Tier will be implemented as network storage.  It is optimized for the best $/GB.  Data management between these two tiers will be important.

    There was a lot of discussion around the SLO (Service Level Objectives).   Steve indicated that their vision was not a super-flexible set of knobs but more a small set of options for defining the data characteristics.  Margo offered that one aspect of what was important was the “value” of different parts of data which can, in turn, tell the system how hard to work to make the data remain available.

    Someone asked if there was any performance analysis for 3-tier storage (memory, IOPS-Tier, Capacity-Tier) over more traditional 2-tier storage (Memory and Disk).  Steve answered that there had not been an explicit performance analysis.

    Stonebraker observed that enterprises seem to like NAS/SAN but that the Cloud guys are using direct attached storage.  Are the forward looking guys walking from NAS/SAN?  Steve replied there are people very interested in NAS/SAN because it gives them a way to control the data and make servers stateless.  Adding in the IOPS-Tier is a simple extension.

    This was a fascinating and engaging presentation!

    Memory Technologies for Data Intensive Computing – Andy Bechtolsheim (Sun Microsystems)

    What an honor to get Andy to come and present to us.  Unfortunately, I didn’t get a chance to chat with him personally as he was only at the workshop for a short while.

    Andy is a well respected and talented hardware engineer who was a founder of both Sun Microsystems and of Arista Networks.  His slides are not going to be posted and I typed as aggressively as I could during his lightning fast presentation.  I would recommend you look both at my friend James Hamilton’s blog post of Andy’s talk at: http://perspectives.mvdirona.com/2009/10/26/AndyBechtolsheimAtHPTS2009.aspx and then read my notes attached below.  I have not attempted to coalesce the information between James’ notes and mine and there may be overlap.  James has done a nice job cleaning his up and I will focus on some of the other sessions in my editing.

    Here’s my rough notes:

    → Andy speaks VERY fast and I lost a lot of what was covered – The slides were very dense – Also, I apologize for some inaccuracies… These are RAW notes that I made during the fast paced presentation and some of them don’t completely make sense to me.

     

    Semiconductor Technology Roadmap

    Still increasing

    Number of devices on the chip still doubling

    Flash is increasing faster than logic and DRAM – Flash faster than Moore’s law

    Between 2010 and 2022, Flash should increase by 100X – After that (10nm) optical problems may become severe – don’t know after 2022

    <more content lost…>

    Emerging devices

    ·         Carbon-nano tubes

    ·         Graphene Nanoribbons

    ·         … <more that I lost…>

    Silicon Roadmap Predictions

    ·         128X increase in transistors by 2022

    ·         1000 cores per chip by 2022

    CPU Module Roadmap

    ·         IO bandwidth will not allow the same bandwidth per core.  IO going a little faster but number of pins hasn’t increased in 10 years.  

    ·         1000 cores would need several terabytes of bandwidth to memory.  Of course, this would need power and cooling

    ·         <more lost…>

    CPU Challenge

    ·         Power per core -- 

    ·         Clock rates not increasing – need lower power level per core

    ·         <more lost…>

    Power efficiency – Increasing frequency – a Square effect on the power

    ·         CPU power strategy –

    ·         Clock rates look flat

    ·         High clock rates are less power efficient

    ·         To reduce power, simplify architecture

    ·         <more lost…>

    Multi-Chip 3D packaging

    ·         Thru-Si via Stacking

    ·         Need to combine CPU and Memory on one module

    High Density 3D Multi-Chip Module


    Benefits of MCM Packaging

    ·         Enables much higher memory bandwidth

    ·         More channels, wider interfaces, faster I/O

    ·         Greatly reduces power

    Challenges with MCM packaging

    ·         Total Memory – 64 device <…??...>

    ·         With 64 GB/device – 4TB

    ·         Assuming 1K cores – 4GB/core

    MCM enables  Fabric I/O

    ·         2010 –1 32 Gb/direction

    ·         2020 1.73Tbps/direction

    ·         Mesh or higher radix fabric topologies


    Expected Link Data Rates

    ·         10 Gbps shipping today

    ·         2012 – 25 Gbps

    ·         2016 – 50 Gbps

    ·         2020 – 100 Gbps

    ExaScale Storage

    ·         Forget Hard Disks – Disk are not getting faster

    Flash Experience so far

    ·         Not easy to get to 1000000 IOPS

    ·         Writes are a problem

    ·         SAS/SATA interface not optimal

    ·         Direct PCI-Express interface looks more promising.  – Move FLASH closer to the CPU  -- Lower latency and more throughput

    Flash in the Memory Hierarchy

    ·         Flash is not random access – more like a block device – random access about 1000X slower than DRAM

    ·         Flash can be used as a stable storage device  -- writes are committed – Supercap magic behind the scenes.

    ·         Tremendous throughput and size – terabytes of capacity

    Flash Latency

    ·         Latency expected to drop by a factor of 10

    ·         Flash transfer rates will double each year for the next several years – 250 MB/sec in 2012

    Flash prices

    ·         2004 -- $100 / GB

    ·         2009 -- $1 / GB

    ·         2012 -- $.25 / GB

    ·         Prices are dropping dramatically –

    Flash Summary

    ·         FLASH will be very inexpensive, very dense

    ·         Costs falling by 50% per year

    ·         Access times falling by 50% per year

    Tech Summary

    ·         Moore’s Law for at least 10 more years

    ·         Frequency Gains  not happening

    ·         Flash will be interesting

    Software Challenges

    ·         Limits of App Parallelism

    ·         Need to exploit Parallelism

     

    Session 2) Using the New Memory Hierarchy
                                              Session Chair: David Cheriton (Stanford)

    Scaling-Out without Partitioning – Phil Bernstein and Colin Reid (Microsoft)

    This session was, in many ways, one of the most interesting at HPTS.  It was not new to me since I had been privileged to spend a number of hours talking to Colin (and a bit to Phil) about Hyder back at Microsoft.  Phil did a great job explaining a radical new approach to scale-out and serializable database!

    Hyder is a software stack for transactional record management.  It can offer full database functionality and is designed to take advantage of flash in a novel way.

    Most approaches to scale-out use partitioning and spread the data across multiple machines leaving the application responsible for consistency.

    In Hyder, the database is the log, no partitioning is required, and the database is multi-versioned.  Hyder runs in the App process with a simple high-performance programming model and no need for client server.  This avoids the expense of RPC.

    Hyder leverages some new hardware assumptions.  I/Os are now cheap and abundant.  Raw flash (not SSDs – raw flash) offers at least 10^4 more IOPS/GB than HDD.  This allows for dramatic changes in usage patterns.  We have cheap and high performance data center networks.  Large and cheap 64-bit addressable memories are available.   Also, with many-core servers, computation can be squandered and Hyder leverages that abundant computation to keep a consistent view of the data as it changes.

    The Hyder system has individual nodes and a shared flash storage which holds a log.  Appending a record to the log involves a send to the log controller and a response with the location in the log into which the record was appended.  In this fashion, many servers can be pushing records into the log and they are allocated a location by the log controller.  It turns out that this simple centralized function of assigning a log location on append will adjudicate any conflicts (as we shall see later).
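    As a mental model for the append path, here is a toy log controller.  It is a sketch under my own assumptions, not Hyder's controller: an in-memory list stands in for the shared flash, a lock stands in for whatever the real controller does to serialize appends, and the `LogController` name and its methods are hypothetical.

```python
import threading


class LogController:
    """Toy stand-in for the centralized log controller."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record):
        """Append a record and return the log position it was assigned."""
        with self._lock:
            position = len(self._records)
            self._records.append(record)
            return position

    def read(self, position):
        return self._records[position]


# Two "servers" pushing records into the same log.
log = LogController()
first = log.append({"server": "A", "payload": "intention-1"})
second = log.append({"server": "B", "payload": "intention-2"})
```

    The only thing the controller decides is the position of each record.  Because every append gets a distinct position, the log defines one total order that every server can agree on.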

    The Hyder stack comprises a persistent programming language like LINQ or SQL, an optimistic transaction protocol, and a multi-versioned binary search tree to represent the database state.  The Hyder database is stored in a log but it IS a binary tree.  So you can think of the database as a binary tree that is kept in the log and you find data by climbing the tree through the log.

    The Binary Tree is multi-versioned.  You do a copy-on-write creating new nodes and replace nodes up to the root.  The transaction commits when the copy-on-write makes it up to the root of the tree.
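    The copy-on-write idea is easy to show with a small immutable binary search tree.  This is a generic path-copying sketch and not Hyder's actual data structure: the tree is unbalanced, it lives in memory rather than in a log, and the `Node`, `insert`, and `lookup` names are mine.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root. Nodes on the path are copied; the rest is shared."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)  # same key: new version of this node


def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 2, "B")  # new version; v1 is untouched
```

    After the update, `v1` still answers with the old value and `v2` with the new one, and the untouched right subtree is shared between the two versions.  A reader holding the old root sees a stable snapshot, which is what makes the multi-versioning cheap.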

    For transaction execution, each server has a cache of the last committed state.  That cache is going to be close to the latest and greatest state since each server is constantly replaying the log to keep the local state accurate [recall the assumption that there are lots of cores per server and it’s OK to spend cycles from the extra cores].  So, each transaction running in a single server gets a pointer to a snapshot, reads from it, and generates an intention log record.  The server generates updates locally and appends them to the log (recall that an append is sent to the log controller which returns the log-id with its placement in the log).  Updates are copy-on-write climbing up the binary tree to the root.

    Log updates get broadcast to all the servers – everyone sees the log.  Changes to the log are only done by appending to the log.  New records and their addresses are broadcast to all servers.  In this fashion, each server can reconstruct the tail of the log.

    Now we get to transaction commit.  All machines are marching ahead rolling the log forward.  When an intention record is generated (indicating a proposed new root for the log), each machine in the cluster checks for a conflict.  If any of the records updated conflicts with previously committed work, the transaction is aborted; otherwise, it is committed.  Conflict in this case is the traditional read/write conflict, using the classic set of rules for various degrees of consistency.

    Because there is an append in the log controller which assigns order, each intention record will race to the log and the controller will pick a winner.  Because log ordering is deterministic, each of the server nodes can reliably come to the same conclusion about which transaction won the race and, hence commits rather than aborts in the face of conflict.  Conflict detection and transaction commit is done redundantly and predictably at each node.

    So, each node generates an intention log for a transaction it wants to do.  Every server checks for conflicts and then either aborts or commits.  The log order provides determinism.
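    Here is a simplified sketch of why log order gives every server the same answer.  It is my own reduction of the idea and not Hyder's meld algorithm: intention records carry explicit read and write sets of keys, the snapshot is identified by a log position, and the `Intention` and `replay` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intention:
    txn: str
    snapshot: int        # log position of the last record seen (-1 = empty log)
    reads: frozenset
    writes: frozenset


def replay(log):
    """Roll the log forward and return {txn: "commit" or "abort"}."""
    last_writer = {}  # key -> log position of the last committed write
    outcome = {}
    for position, rec in enumerate(log):
        touched = rec.reads | rec.writes
        # Conflict: someone committed a write to a touched key
        # after this transaction took its snapshot.
        conflict = any(last_writer.get(k, -1) > rec.snapshot for k in touched)
        if conflict:
            outcome[rec.txn] = "abort"
        else:
            outcome[rec.txn] = "commit"
            for k in rec.writes:
                last_writer[k] = position
    return outcome


log = [
    Intention("T1", -1, frozenset({"x"}), frozenset({"x"})),
    Intention("T2", -1, frozenset({"x"}), frozenset({"x"})),  # raced T1, lost
    Intention("T3", 0, frozenset({"y"}), frozenset({"y"})),
]
server_a = replay(log)
server_b = replay(log)
```

    `replay` depends only on the contents and order of the log, so any two servers that see the same log reach the same commit and abort decisions without talking to each other.  In this example T1 and T2 both started from the empty snapshot and both wrote `x`; T1 reached the log first, so T2 aborts everywhere.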

    Performance of Hyder:  The system scales without partitioning.  The system-wide throughput of update transactions is bounded by the update pipeline.  It is estimated this can perform 15K update transactions per second over 1Gb Ethernet and 150K update transactions per second over 10Gb Ethernet.  Conflict detection and merge can do about 200K txs per second.

    The abort rate on write-hot data is bounded by the transaction’s conflict zone.  This is determined by the transaction’s end-to-end latency.  This conflict zone is about 200 microseconds in the prototype.  It is estimated that this would handle about 1500 TPS (assuming all the transactions have conflicts).

    Major Technologies:  This is using flash as append only (and it is not using SSD flash but raw flash with a special controller).  The custom controller over raw flash has mechanisms for sync and fault tolerance.  Storage is striped with a self-adaptive algorithm for storage allocation.  It includes a fault tolerant algorithm for a totally ordered log.  Also, it has a fast algorithm for conflict detection and merging of records.

    Status:  Most of Hyder has been prototyped but there is a long way to go.  Phil Bernstein and Colin Reid are working on a paper and HPTS is the first time Hyder has been discussed publicly.

    Again, this was a wonderful talk with some fascinating and different ideas.  Great work!

    A Performance Puzzle:  B-Tree Insertions are Slow on SSDs (Bradley Kuszmaul – Tokutek – Research Scientist at MIT)

    So, this talk was about some surprises in SSD performance in supporting MySQL.  Much of the discussion was about the datasheets for some of the drives not matching the measured reality.

    Bradley had measured the value of running MySQL against SSDs and expected the SSDs (with higher IOPS) to get better B-tree performance.  Here’s my rough and raw notes:

    Intel X25E Specs

    ·         Read BW up to 250 MB/s

    ·         Write BW up to 170 MB/s

    ·         Random 4KB read rate – 35 KIO/s

    ·         Random 4KB

    ·         <lost some…>

    Measure Berkeley DB

    ·         Trending to 4500 writes per second

    Performance Model  --

    ·         Startup cost plus bandwidth cost

    ·         Looking at read performance as a function of block size

    ·         As block size gets bigger, bandwidth goes up

    Discussion → Margo points out that many Filesystems see large blocks and do read ahead.  That can perturb the results…

    When you read and write large blocks, Bradley sees Filesystems doing the right thing and keeping contiguous blocks adjacent.

    The datasheet for the drive is not matching the reality for larger block sizes

    Mixing reads and writes seems to impact the performance quite a bit…

    What Block Size to Use?

    ·         For point-queries, BTrees are insensitive to block size. 

    ·         For range queries, block size is important

    ·         Tension: Large block sizes make range queries faster; Small block sizes make exact match faster

    ·         Set block size so half of the time is accounted by seek, half by bandwidth

    ·         Reads: 50KB for SSD but 500KB-1000KB for rotating disk

    ·         Writes 10KB SSD , 500KB-1000KB for HDD

    ·         Mixed read/write 21KB for SSD
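    The "half seek, half bandwidth" rule falls straight out of the startup-cost-plus-bandwidth-cost model from the notes.  Here is a small sketch of the arithmetic.  The function names are mine, and the 200 microsecond startup cost is an assumption I picked for illustration; the 250 MB/s figure is the datasheet read bandwidth listed above.

```python
def transfer_time(block_bytes, startup_s, bandwidth_bytes_per_s):
    """Time to read one block: fixed startup cost plus time on the wire."""
    return startup_s + block_bytes / bandwidth_bytes_per_s


def effective_bandwidth(block_bytes, startup_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved at a given block size."""
    return block_bytes / transfer_time(block_bytes, startup_s, bandwidth_bytes_per_s)


def balanced_block_size(startup_s, bandwidth_bytes_per_s):
    """Block size at which transfer time equals the startup cost."""
    return startup_s * bandwidth_bytes_per_s


# Assumed startup cost of 200 microseconds with 250 MB/s read bandwidth.
ssd_block = balanced_block_size(startup_s=200e-6, bandwidth_bytes_per_s=250e6)
ssd_rate = effective_bandwidth(ssd_block, 200e-6, 250e6)
```

    Under those assumptions the balanced block size works out to about 50 KB, and at that size the drive delivers half of its peak bandwidth.  Bigger blocks push the effective bandwidth closer to the peak (good for range queries) at the cost of reading more than a point query needs.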

    Cache-Oblivious Approach

    ·         Use data structures that are fast for any block size

    ·         Also can speed insertions without slowing searches

    ·         Tokutek’s MySQL storage engine uses these ideas.

    Discussion:  James Hamilton: layering of software and release by release changes make it hard to get predictable behavior.   When all these layers are separately evolved, the performance behavior is very unpredictable.   Bradley: fragmentation starts messing with this, too.  James: all the layers interact and we need to measure by the workload and adapt.  The complexity makes performance very hard to predict.

    Implications of Storage Class Memories (SCMs) on Software Architectures (C. Mohan – IBM)

    Storage Class Memory (SCM)

     

    Mohan gave a very thought provoking talk on the implications of Storage Class Memories.  Where do we need them?  What good will they do?

    If we have non-volatile memory that is as easy to access as DRAM, will we have corruption problems?  Will we see a running system trash the non-volatile memory like it trashes DRAM?  One of the advantages of our current system design is that we have to pass a “sanity test” of an IO driver or messaging stack to get nastiness outside of a running system.  That dramatically reduces the chance of this happening.

    Anyway, Mohan’s talk was really thought provoking and enjoyable!  It was good to see my old friend back at HPTS after taking a stint in India for a few years.

    Raw notes:

    ·         Blurs the distinction between DRAM and Disk

    Industry SCM Activities

    ·         IBM

    ·         Intel/ST spun out Numonyx (FLASH and PCM)

    ·         Samsung, Numonyx sample PCM chips –

    ·         <lost some…>

    Speed/Volatility/Persistence

    ·         DRAM plus SCM – Fast and persistent

    ·         Persistent storage will not lose data

    Competing technologies for SCM

    ·         Phase Change RAM (most promising)

    SCM as a part of the storage stack

    ·         The distinction between RAM and Disk – the line in between is blurring

    ·         Memory-like and also Storage-Like

    SCM Design Triangle

    ·         Speed – endurance – Cost/bit

    ·         Speed with Write endurance is more memory-like

    Phase-Change RAM

    ·         When density increases, the failure rate goes up as bits flip more often

    ·         SCM is on Moore’s law

    PCM is fast

    ·         100-1000 ns

    Architecture

    ·         SCM may be used as Memory

    ·         SCM may, alternatively, be used as storage (through IO controllers)

    ·         Discussion → Pat:  Bugs in non-volatile memory can sometimes be preserved

    PCM Use Cases

    ·         PCM as Disk

    ·         PCM as Staging device

    ·         PCM as memory

    ·         PCM as extended memory

    Exploring DBMS as an exploiter of PCM

    ·         Should log records be written directly to PCM?

    ·         Should log be more like Disk (first to buffer and then forced to disk)?

    ·         PCM is not as reliable as disk → do you need to offload to a disk?

    ·         If you use Group Commit, it is unclear that PCM helps a DBMS

    PCM replaces DRAM – Buffer Pool in PCM

    ·         The PCM Buffer Pool access will be slower than DRAM

    ·         Writes will suffer more than reads!

    ·         Should we instead have a DRAM BP backed by a PCM BP?

    ·         Similar to DB2 z in parallel sysplex environment with BPs in coupling facility (CF)

    ·         But DB2 situation has well defined rules on when pages move from DRAM BP to CF BP

    ·         Variation was used in SafeRAM work at MCC in 1989

    Assume the whole DB fits in PCM

    ·         Apply old main memory DB design concepts directly?

    ·         Shouldn’t we leverage persistence specially?

    ·         Every bit change persisting may not be good!

    ·         Today’s failure semantics has flexibility on tracking changes to DB pages.

    ·         Memory overwrites will cause more damage

    ·         If every write assumed to be persistent as soon as write completes, then L1 and L2 caching can’t be leveraged.

    Assume whole DB fits in memory

    ·         Need to externalize periodically since PCM won’t have good endurance

    ·         If DB spans both DRAM and PCM, hard to figure out what goes where!

    What about Logging?

    ·         If PCM is persistent and whole DB is in PCM, do we need logging?

    ·         Of course it is needed to provide at least partial rollback even if data is being versioned (at least to track versions to invalidate)

     

    The end of Monday morning!  More soon… 

     

    -          Pat

     

  • Building Blocks and the Three Bears

    So, about a year and a half ago, I was working in Developer Division and there were lots of discussions about different kinds of components…  it seemed that everyone was talking past each other and meant really different things by the word “component”.  I decided to try to identify three different meanings of the word.   To avoid debates over the naming, I chose the really ridiculous names:  “Baby-Bear”, “Momma-Bear”, and “Papa-Bear” to describe the different things meant by different folks when they talked about a component… of course, these names were not meant as serious but they did lead to some illustrative discussions about the “Three Bears”.   Last week, Krzysztof Cwalina asked me to post a description of the “Three Bears” because it would be helpful for him in describing their project, MEF.   Here goes…

    The idea is that there are three different types of components. 

    • Baby-Bears are much like the classic object with inheritance, reference-passing, and synchronous interactions with their callers.  
    • Momma-Bears (closely related to what is provided by MEF), are comprised of multiple Baby-Bears.  While exhibiting synchronous call patterns, they do not support pass-by-reference nor inheritance.  They will, however, allow for support of dependency injection.
    • Papa-Bears are the biggest of the three and are most similar to a SOA Service.   A Papa-Bear will typically be composed of multiple Momma-Bears.   Papa-Bears interact with their partners asynchronously, are frequently remote, do not support pass-by-reference or inheritance, and may be offline.


    My interest has always been in the realm of the Papa-Bear and other projects have focused more in the realm of the Baby-Bear and the Momma-Bear.   I will leave it to Krzysztof to explain MEF (which I believe is more akin to the Momma-Bear type of component).

    Anyway, this is what I like to call “Building Blocks and the Three Bears”.

    - Pat

  • “Book Report” on the UC Berkeley Paper “Above the Clouds: a Berkeley View of Cloud Computing”

    As I mentioned in a previous blog post, I am very impressed with a recent paper from my friends at the Berkeley RAD Lab called: Above the Clouds: a Berkeley View of Cloud Computing.   It seemed to me that it would be important for these concepts to be better understood by a broad audience within Microsoft so I took the time to create a presentation covering the contents of the paper.  You can see the presentation I created covering the paper’s contents here.

    If you are interested in cloud computing, you want to understand these ideas.

    Also, I highly recommend the Berkeley blog on cloud computing available at:  http://abovetheclouds.cs.berkeley.edu.

    Thanks to the folks at Berkeley for their hard work and permission to show this to you.

    - Pat

  • Great Paper from UC Berkeley on Cloud Computing

    My friends at the UC Berkeley RAD Lab have written a wonderful paper on Cloud Computing.   It can be found (with a video interview and some other stuff) at http://abovetheclouds.cs.berkeley.edu.  I strongly recommend looking at their work!

    - Pat

     

  • Slides from SQL Service Broker Design Preview from August 2001

    I know this is a LONG time ago but these slides were never made public before and there's some interesting (and still relevant) stuff in this collection.  At the time, it was confidential but that is no longer the case.

    In August of 2001, the SQL Service Broker team did a design preview and I wrote five presentations totaling 280 slides.  I presented these over a 12 hour period (actually, that time also includes a sixth talk with another 60 slides on "Fiefdoms and Emissaries" but I'll push that out later).

    CAUTION:  These slides are from August of 2001 and SQL Service Broker shipped in late 2005.  Many things are changed!  Do NOT count on these slides for accuracy of the product but they can be a big help with inspiration.

    The following slide decks are now out in my new public skydrive folder.

    SQL Service Broker Overview

    An overview and introduction to SQL Service Broker.  This slide deck has the following major sections:

    • Introduction:  Motivates the challenges in loosely-coupled messaging, the advantages of deeply integrating messaging with the database, support for scaling and fault tolerance, and some scenarios for SQL Service Broker.
    • High-Level Architecture: This covers web services, brokers, message fragmentation, service environments (which provide a namespace for services), message-types, service-contracts, services, and service programs.
    • Programming Model:  This section covers an overview of the programming model, service instances and state management, dialogs, and again ties these programmatic concepts to the service instance.  Caution: the syntax has changed between this and the shipped product!
    • Relationship to Other Stuff: Here we visit management issues, interoperation, and look at products that create services.
    • Conclusion:  We revisit the scenarios and summarize the points made.

    The SSB overview talk has 87 slides and can be found here.

    SQL Service Broker Messaging

    This talk examines the message protocols used in SQL Service Broker.  Reminder:  This is not an accurate depiction of what is in the product!

    This slide deck has the following major sections:

    • Introduction:  This covers the concepts of reliable messaging, brief discussion of web services, Internet and Intranet issues, transport support, binary and lossless transformations, and a sketch of the protocol stack.
    • Adjacent Broker Protocol: The adjacent broker protocol talks to a nearby SQL Server to propagate messages and get them closer to their destination.  This section outlines the challenges at this level with delivery (sometimes tunneled through firewalls), lifetime management, and format negotiation.
    • Endpoint Broker Protocol: The endpoint broker protocol functions across the SQL Server engines that actually constitute the two endpoints for the communication (and it layers atop the adjacent broker protocol).  This section covers dialog endpoint state, fragmentation and sequence numbers, piggy-backed acknowledgements, and unsequenced ack messages.
    • Message Formats: This covers pieces of the messages, encryption challenges, and ASCII versus binary sections of the SOAP message.
    • Monolog Contracts: Monologs were not implemented but they are VERY interesting if you care about Pub-Sub.  This short section on Monologs covers some of the issues related to the messaging formats.  It discusses dialog contracts to implement monologs.
    • Web Services and SOAP Bubbles:  Some brief framing of the relationship of the SSB protocols to Web Services and SOAP.
    • Conclusion: a summary of the content covered.
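
    To make "fragmentation and sequence numbers" a bit more concrete, here is a minimal sketch of how a receiver can reassemble fragmented messages in order and compute a cumulative acknowledgement.  This is NOT the SSB protocol from the slides (or the product); the names fragment and Receiver and the tuple layout are made up purely for illustration.

```python
def fragment(message: bytes, size: int, first_seq: int = 0):
    """Split a message into (seq, is_last, payload) fragments (hypothetical layout)."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)] or [b""]
    return [(first_seq + n, n == len(chunks) - 1, c) for n, c in enumerate(chunks)]


class Receiver:
    """Reassembles fragments in sequence order, tolerating reordering and duplicates."""

    def __init__(self):
        self.expected = 0     # next in-order sequence number
        self.pending = {}     # out-of-order fragments, keyed by sequence number
        self.current = []     # payloads of the message being reassembled
        self.delivered = []   # complete messages, in order

    def receive(self, seq, is_last, payload):
        """Accept one fragment; return the cumulative ack (highest in-order seq, or -1)."""
        if seq >= self.expected:          # fragments already consumed are ignored
            self.pending[seq] = (is_last, payload)
        while self.expected in self.pending:
            last, data = self.pending.pop(self.expected)
            self.current.append(data)
            if last:
                self.delivered.append(b"".join(self.current))
                self.current = []
            self.expected += 1
        return self.expected - 1
```

    In a real dialog, the returned ack would typically ride along ("piggy-back") on a message already heading the other way rather than being sent on its own; that part is not modeled here.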

    The SSB messaging talk has 31 slides and can be found here.

    SQL Service Broker Performance

    This talk examines performance, scaling, and high-availability issues with SQL Service Broker.   Reminder:  This is not an accurate depiction of what is in the product!

    This slide deck has the following major sections:

    • Introduction 
    • Work Within a Service: This covers messages in and out of a service, service instance state management, data access from services, and scalable and highly-available ASP.NET apps.
    • Fault Tolerant Services: In this section, we cover designing services for fault tolerance, deploying services for fault tolerance, and more issues with fault tolerance.
    • Scaling and High-Performance: Here, we examine application issues for scale-up, execution of scale-up services, application issues for scale-out, and execution of scale-out services.
    • Conclusion

    The SSB performance talk has 46 slides and can be found here.

    SQL Service Broker Monologs

    One of my favorite talks, this describes the unshipped feature of SSB Monologs.  Monologs implement a very flexible form of pub-sub.  While they are an unimplemented feature in SSB, the serious implementer could follow the guidance in this talk and avoid a lot of difficulties.   Reminder:  This is not an accurate depiction of what is in the product (it has not shipped)!

    This slide deck has the following major sections:

    • Monolog Overview:  This covers "What's a Monolog?", programmatic interfaces, and broker-to-broker contracts.  Included are the notions of "History versus Currency", Monolog-Groups, and Subscriptions.
    • Programming with Monologs:  This section includes programming a publisher, requesting a subscription, and monolog delivery in a subscriber.
    • Some Special Monolog Features: A really cool concept called a Monolog-Checkpoint is covered as well as the lifetime of monologs.
    • Plumbing Monologs Over Dialogs:  We examine fan-out trees for scale, plumbing the fanout using dialogs, publisher agents, and subscriber agents.
    • Conclusion:  Here we examine content syndication and content aggregation and then summarize the talk.

    The SSB monologs talk has 53 slides and can be found here.

    SQL Service Broker Security

    Again, one of my favorite talks, this describes some of the special challenges with security that arise in a messaging and services system like SQL Service Broker.  Classic session-oriented security only works when messages are received in order.  That is just one of the challenges.   Reminder:  This is not an accurate depiction of what is in the product!

    This slide deck has the following major sections:

    • Security Behavior: This section covers identity, dialog authorization, monolog authorization, and auditing.
    • Encryption Concepts:  Here we examine symmetric key encryption, public key encryption, and some practical rules for using encryption.
    • Encrypting Dialogs:  In this section, we introduce a concept called "handshakes" which helps optimize the performance of certificate management in a dialog-based system.  We look at handshake certificates, messages and dialogs, and cryptographic guarantees.
    • Encrypting Monologs: Monologs are N-party communications.  We examine in this section why 2-party security works best and then how we can use a fan-out tree to implement a secure scalable publisher.
    • Conclusion

    The SSB security talk has 61 slides and can be found here.

    I hope these talks are fun as there are still lots of interesting concepts in them 8 years later.

    - Pat

  • Transaction Processing and the Barbarian Hordes

    So, I'm starting to figure out how to put stuff up on Windows Live Skydrive and I have a beginning out here.   I've only spent a few hours on this and there's lots more old stuff to add soon.  I will send another blog post when this is in better shape.

    I just couldn't resist sharing some fun, though...

    As I was sorting through old files, I came across a video of Transaction Processing and the Barbarian Hordes and placed it up for all to see.  This was my favorite moment on stage (at least so far) and was the source of a rare bout of stage fright before starting the presentation...

    This was at the Long Beach PDC in November of 1996.  The PDC was titled "The Renaissance of Distributed Computing" and I had a one hour General Session to introduce Microsoft Transaction Server.  We wanted to show that MTS was easier to install than a classic Transaction Processing Monitor.  So, my friend Greg Hope came out to do a 2-1/2 minute setup on MTS.  While this was happening, I told a five minute joke...

    There were 7000 people in the room and (I think) 10000+ on video feed.  They had NO idea what I was doing for at least three minutes.

    - Pat

  • A Wonderful Time at CIDR (Conference on Innovative Database Research)

    Sunday, January 4th through Wednesday, January 7th was the Conference on Innovative Database Research (CIDR)  at Asilomar. 

    First, let me talk for a minute about Asilomar. It is located in the forest right off the beach in Pacific Grove, CA (right next to Monterey).  From 1913 to 1956, it was a YWCA and has a number of magnificent and beautiful historic buildings.  In 1956, it was sold to the State of California which has run it as a conference center since then.  Asilomar has a deep and special place in my heart.  I've spent 3 days every two years there since 1985 at the HPTS (High Performance Transaction Systems) workshop.  This gathering has been attended by many wonderful friends in the transaction processing and database community.  Every job I've gotten since the 1980s has been through contacts forged at HPTS.  I chaired the HPTS workshop in 1989 and (believe it or not), my first grandson was born DURING the workshop (after the keynote, before the first session, and immediately after I ordered more coffee in a crisis for the first break).  I still smile at the location of the (now removed) pay phone where I continually called to find out how the birth was progressing.  Nick (my grandson) is now at Cal State San Jose.    It was at Asilomar that I last saw my dear friend, Jim Gray.  Now, I go two times each two years for CIDR (January of the odd years) and HPTS (October of the odd years).  The smells, the feel, the fireplaces, the wonderful Chapel Building (where we meet to present to each other), and the amazing participants offer me a rich combination of memories and current stimulus.

    So, this year's CIDR was a good one.  The agenda was packed.  There was a healthy combination of seasoned database folks and young grad students.  We listened to a lot of presentations and had a lot of stimulating conversations.  I saw a bunch of old friends and made a bunch of new ones.  Someone (who... me??) managed to bring some nice sipping whiskey each night and get a gang off to chat quietly by the fire and ensure we ended up sleep deprived but with a better understanding of our new and old friends.  It was a very, very nice event.

    On Tuesday evening, I was on a panel with my old friend from Tandem days, Shel Finkelstein.  We were joined by a friend of mine I met last year, Michael Armbrust, who is a graduate student at UC Berkeley studying in the Berkeley RAD Lab.  The panel was on the ways in which cloud-based systems are, in many cases, evolving towards availability over consistency.  I presented a summary of "Building on Quicksand".   Shel did a marvelous job presenting "Principles for Inconsistency".  Finally, Michael presented about his project at Berkeley called "SCADS: Scale-Independent Storage for Social Computing Applications" and then we had a joint Q&A.  You can find the papers for these talks at http://www-db.cs.wisc.edu/cidr/cidr2009/program.html and then navigating the program to Tuesday evening.

    The panel was a blast...  I've known Shel for almost 20 years and knew he would do a great job.  He did.  This was Michael's first conference but I was THRILLED with his talk.  He was clear, organized, and thought provoking.  An absolute home run for someone's first conference! 

    Finally, I want to thank Mike Franklin, Joe Hellerstein, and Gerhard Weikum (and many others) for all their hard work to make CIDR 2009 possible.

    - Pat

  • "Building on Quicksand" Paper for CIDR (Conference on Innovative Database Research)

    I just submitted the final (photo-ready) version of a new paper called "Building on Quicksand" to the Conference on Innovative Database Research.  This is a paper I coauthored with my friend, Dave Campbell.  We submitted an earlier version (to try to get into the conference) in October, I wrote a presentation on this for TechEd EMEA, and now we have a new and longer full paper complete.  I like this version best of all. 

    - Pat

    Building on Quicksand

    Reliable systems have always been built out of unreliable components.  Early on, the reliable components were small such as mirrored disks or ECC (Error Correcting Codes) in core memory. These systems were designed such that failures of these small components were transparent to the application. Later, the size of the unreliable components grew larger and semantic challenges crept into the application when failures occurred.

    Fault tolerant algorithms comprise a set of idempotent sub-algorithms.  Between these idempotent sub-algorithms, state is sent across the failure boundaries of the unreliable components.  The failure of an unreliable component can then be tolerated as a takeover by a backup, which uses the last known state and drives forward with a retry of the idempotent sub-algorithm.  Classically, this has been done in a linear fashion (i.e. one step at a time).
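
    As a rough illustration of the paragraph above (not code from the paper), here is a minimal sketch of idempotent steps driven forward from a checkpoint.  The Checkpoint object stands in for state that has been sent across the failure boundary; all of the names, and the crash_before_advance knob used to simulate a failure, are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checkpoint:
    """Last known state, assumed to live outside the primary's failure unit."""
    next_step: int = 0
    applied: set = field(default_factory=set)   # ids of steps whose effect is recorded
    balance: int = 0


def apply_step(cp: Checkpoint, step_id: int, amount: int) -> None:
    """Idempotent sub-algorithm: repeating an already-applied step does nothing."""
    if step_id in cp.applied:
        return
    cp.balance += amount
    cp.applied.add(step_id)


def run(cp: Checkpoint, steps: List[int],
        crash_before_advance: Optional[int] = None) -> bool:
    """Drive the steps forward one at a time; return True when all are done.

    crash_before_advance simulates a failure after a step's effect is recorded
    but before next_step is advanced, so whoever takes over retries that step.
    """
    while cp.next_step < len(steps):
        i = cp.next_step
        apply_step(cp, i, steps[i])
        if crash_before_advance == i:
            return False                  # the primary fails here
        cp.next_step = i + 1
    return True
```

    If the primary calls run(cp, [10, 20, 30], crash_before_advance=1) and dies, a backup can simply call run(cp, [10, 20, 30]): step 1 is retried harmlessly and the work finishes with a balance of 60.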

    As the granularity of the unreliable component grows (from a mirrored disk to a system to a data  center), the latency to communicate with a backup becomes unpalatable.  This leads to a more relaxed model for fault tolerance.  The primary system will acknowledge the work request and its actions without waiting to ensure that the backup is notified of the work.  This improves the responsiveness of the system because the user is not delayed behind a slow interaction with the backup.

    There are two implications of asynchronous state capture:

    1)   Everything promised by the primary is probabilistic.  There is always a chance that an untimely failure shortly after the promise results in a backup proceeding without knowledge of the commitment.  Hence, nothing is guaranteed!

    2)   Applications must ensure eventual consistency.  Since work may be stuck in the primary after a failure and reappear later, the processing order for work cannot be guaranteed.
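
    These two implications can be sketched in a few lines.  This is only a toy model under my own assumptions (the Replica and Primary classes and their methods are hypothetical, not from the paper): the primary acknowledges work before shipping it to the backup, and the work is applied in a way that is idempotent and order-independent so that late-arriving work still converges.

```python
class Replica:
    """Applies work idempotently; addition makes the result order-independent."""

    def __init__(self):
        self.seen = set()      # ids of work already applied
        self.total = 0

    def apply(self, work_id, amount):
        if work_id not in self.seen:
            self.seen.add(work_id)
            self.total += amount


class Primary(Replica):
    def __init__(self):
        super().__init__()
        self.unshipped = []    # acknowledged work the backup has not seen yet

    def accept(self, work_id, amount):
        self.apply(work_id, amount)
        self.unshipped.append((work_id, amount))
        return "ack"           # acknowledged WITHOUT waiting for the backup

    def ship(self, backup):
        """Asynchronously push acknowledged work to the backup."""
        for work_id, amount in self.unshipped:
            backup.apply(work_id, amount)
        self.unshipped.clear()
```

    If the primary acknowledges work "b" and fails before ship() runs, the backup carries on without knowing about "b" (the promise was only probabilistic).  When the old primary comes back and ships its stuck work, "b" arrives after newer work, and the totals still converge because apply() does not care about order or repeats.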

    Platform designers are struggling to make this easier for their applications.  Emerging patterns of eventual consistency and probabilistic execution may soon yield a way for applications to express requirements for a “looser” form of consistency while providing availability in the face of ever larger failures. As we will also point out in this paper, the patterns of probabilistic execution and eventual consistency are applicable to intermittently connected application patterns.  

    This paper recounts portions of the evolution of these trends, attempts to show the patterns that span these changes, and talks about future directions as we continue to “build on quicksand”.

  • An Interview with .NET Rocks while I Was at Barcelona Last Week

    Mark Dunn of .NET Rocks asked me to do an interview at TechEd EMEA last week.  We chatted about one of my talks called "If You Have Too Much Data, then Good Enough Is Good Enough".  The link to the interview is here.

    Thanks, Mark!  It was fun!

    - Pat


  • Channel 9 Just Posted an Interview with Me on the Drive to Many-Core Processors

    This was filmed at TechEd NA in Orlando last June.  I received the link to the video today.  It was a nice chat with Patrick Weikle about the inevitability of Many-Core Processors and how this will drive us to parallel computation.

    The link is here.


  • I'm Doing Five Talks at TechEd EMEA Developer (Barcelona -- November 10-14)

    Well, those folks at TechEd EMEA continue to be nice to me.  Looks like I get to do four talks for the architecture track AND the closing general session.  It is quite an honor to be told I can make up whatever I feel like and they will let me present it.

    Here are the titles and abstracts for the four talks for the architecture track:

    Building on Quicksand

    Quicksand is a mixture of sand and water that looks like it is solid but will slowly consume anyone that tries to walk over it.  This talk will argue that for many years our systems have built their reliability by layering on top of unreliable components and, effectively, built on top of quicksand.  Furthermore, the granularity of the unreliable components has been steadily increasing through the years and is gradually showing new and challenging semantics to the application developer.

    Fault tolerant algorithms are built by creating pieces that are idempotent or repeatable without harm.   To ensure these idempotent operations can be performed by components that are unreliable, a reliable design must capture the state ACROSS the failure units and keep going if something fails.  This has been done with circuitry, mirrored disks, multiprocessors, and more.  In the past, failures were usually masked from the running application.  Now, the size of the unreliable component is inexorably getting larger as the economics of our systems changes.  This is making it impossible for the underlying hardware and software platforms to hide the failures.  Increasingly, applications must design to cope with the failures of these unreliable components.

    This talk will sketch the theory of fault tolerance and show a number of examples of how systems have been designed through the years and how their evolution has gradually impacted application design.  Hopefully, this discussion will help application designers understand the new responsibilities they face in building reliable distributed systems.  It is possible to stay afloat on top of quicksand with sufficient care!

    When You Have Too Much Data, “Good Enough” is Good Enough

    Classic database management systems have very tight control over their data and over the correctness of the answers they provide.   A number of pressures are being placed on classic database systems:

    · We are awash in vast quantities of data
    · There are many sources of the data, it frequently has inconsistent schema, and its interpretation is up to… well… interpretation
    · Data is streaming into the system(s) at the same time as we are attempting to analyze it
    · The machines holding the data may not all be concurrently available

    This talk will examine some of these forces on classic database management systems and propose that there are a number of new ways to think about data.  We will explore a number of different dimensions in which the “classic” database world is being relaxed to meet the needs for our customers in a changing world.

    RIAs and Emissaries

    The emerging field of RIA (Rich Internet Applications) is a hybrid between the classic rich-client application and a browser.  Designed to provide the richness of the client environment without installation, RIA applications must be small and lightweight.  To provide safety, they must have strictly limited access to resources on the client machine.   Sandboxing of the client access limits the RIA app to per-app/per-user/per-machine state.   The state of the art is evolving in how sharing can be done within RIA environments across applications and across users while providing safe computing.

    The emissary design pattern is one where there is non-trusted code running outside of a service which is specifically designed to support interactions with the service.  Leveraging reference data published by the target service, the emissary prepares requests with a sophisticated enough understanding of the service’s expectations that the request is likely to be accepted and processed.  All of this is done while respecting the trust boundaries imposed by the target service.   This pattern is clarified in the talk by showing parallels to the usage of paper forms in the business processes that preceded the advent of computers.

    We will explore the use and evolution of the Rich Internet Application and discuss some of the fun challenges the industry faces as RIAs evolve.  Then, we will describe the emissary design pattern and show how RIAs are very well suited to working with their parent web environment using this pattern.

    This talk is simply the wild ideas of the presenter and does not reflect any product commitments.  It might, however, help you understand and build your applications differently.

    Metropolis: Buildings and Applications

    The Metropolis series of talks refers to a collection of presentations which examine cities and their evolution since being connected by the railroads and shows how Information Technology is undergoing the same transformations as IT shops are connected by the Internet.  This second talk in the series examines the parallels between buildings in the city and applications in the IT shop.

    Buildings come in many shapes, forms, and uses.  Some are built with major investments and the builders anticipate a long and vibrant lifetime for the building.  Others are built for quick turnover and short lives.  As buildings age, different aspects of the building have different lifetimes and amortizations.  It is common to see the roof, siding, plumbing, or wiring redone a number of times before the building is discarded.  We are also seeing an increasing trend towards buildings being designed for a class or category of occupant without knowing the specific occupant.  You see this in retail malls, office buildings, and more.  As time moves on, a larger percentage of buildings are designed for a generalized use and a smaller percentage customized for the buyer.

    Applications are following the same trend (although lagging the construction industry by a number of decades).   We see major investments in some pieces of software and then surprisingly quick-and-dirty implementations of others.  We see renovation of applications across releases with some aspects of the application remaining intact while other aspects may change many times over the lifetime of the code.  Finally, we are gradually seeing the creation of patterns or classes of applications and supporting infrastructure to ease their development.

    This talk will cover a number of parallels between buildings and applications.  We intend to show that the same forces that have shaped the building industry are shaping the development of applications.  It is possible to look to the building industry to tell us where applications will be going.

    Some (but not many) of you may have seen this talk a few years ago but it is still guaranteed to make you think differently about the composition and evolution of applications.

    And, finally, the closing general session at which the poor attendees have no alternative entertainment but listening to me!  The organizing folks offered me this spot and asked me what I wanted to present.  Since I was curious about data centers and felt REALLY uneducated, I had to select that as a topic.  This led me to the following title and abstract for the big presentation.

    Green Computing through Sharing
    Reducing both Cost AND Carbon

    Data centers consumed 1.5% of the total electricity in the US in 2006 and are on track to double as a percentage every five years. It is about 2% of the US total in 2008. Western Europe’s use is increasing at a slightly faster rate (from a slightly lower base percentage). The consumption of electricity within data centers is of significant financial and environmental importance.
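
    As a quick sanity check on those numbers, here is a back-of-the-envelope sketch.  It assumes smooth exponential growth between the quoted data points, which is my simplification and not something the underlying studies claim; the function name and parameters are made up for the example.

```python
def datacenter_share(year: float, base_year: int = 2006,
                     base_share: float = 1.5,
                     doubling_years: float = 5.0) -> float:
    """Percent of US electricity used by data centers, assuming the share
    doubles every `doubling_years` starting from `base_share` in `base_year`."""
    return base_share * 2 ** ((year - base_year) / doubling_years)


print(round(datacenter_share(2008), 2))   # 1.98, i.e. "about 2%"
print(datacenter_share(2011))             # 3.0 after one full doubling period
```

    So starting from 1.5% in 2006 and doubling every five years does land at roughly 2% in 2008.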

    Where the heck is all this power going? Why is the electrical load increasing so much? What can be done about it?

    This talk will examine both traditional and emerging data center designs. We will start by examining how a data center is laid out, constructed, and managed. We will show two emerging trends: the change to designing data centers for the optimization of power and the emergence of new economies of scale in data centers which is contributing to the drive towards cloud computing. Microsoft is actively moving to compete in the space of cloud computing as we are seeing at the PDC (Professional Developers Conference) a few weeks before TechEd EMEA Developer.

    Next, we will examine the sources of waste in the system today and examine why so many of our resources are underutilized. Because we are reluctant to share computing resources, they are left idle much of the time. Why is this currently the dominant choice? What can be done in the design of applications, systems, and data centers to make them more green (both carbon and cash)? What can developers do to make a difference?

    Anyway, I am super excited to be going to Barcelona and humbled by the trust and opportunity I've been given!  Hope to see you there!  If so, please come up and say hi!  In the meantime, I'm going to work on my slides...

    - Pat

  • First You Live, Then You Die, Live Now!

    I didn't think I wanted to blog about this but it's been rattling around inside of me so I thought I would share it!

    Life is short and life is precious... none of us know when our time is up and we all need to hold close what is dear to us.

    Monday evening, Sept 29th, I was away from home for the evening at a Microsoft internal conference.  At around 10:30PM, I checked the local newspaper and found out that a friend of mine, Keith Grinstein, had passed away the day before.  Coverage was in the Seattle Post-Intelligencer, Puget Sound Business Journal, and later in the Seattle Times.  A few days later, a nice article was in Crosscut.

    Keith was my neighbor and I really, really liked him and really, really like his wife, Claire.  We'd only been out to dinner once as the two couples and kept talking about wanting to set up another dinner.  I would see Keith at least once a week hanging around in the lobby of the building we share.  We all live on top of a hotel and there are folks working for the hotel down in the lobby all the time.  It is a busy and vibrant place.  Keith had been a successful entrepreneur and businessman.  He was a lawyer, did a stint as an aide to various politicians, worked in telecommunications (he was formerly CEO of NexTel amongst other things), was an early investor in Amazon, and much more.  He had set his life up the way he wanted it and was now doing venture capital work out of a small company he set up with some friends.  His office was one block away and he spent most mornings down in the "Studio" area off the hotel lobby where you can get breakfast.  He was there for a couple of hours each day, pacing around, doing business over his cell phone, and teasing, laughing, and smiling with whoever he would encounter.  Keith's wife, Claire, is a beautiful, vibrant, fun, and hard-working ophthalmologist.  Keith had the energy to be on countless boards of charitable organizations.  He was on the board for our homeowners' association.  One week before he passed away, Keith and Claire opened their home to 300-400 people on a condo tour to benefit the Pike Place Market.

    While Keith was certainly not my closest friend nor I his, it is so very hard to see such a vibrant, good, kind, productive, and funny man go so suddenly.  In our building, we have a community and it is hurting from the loss.  The wonderful people working in the lobby are so sad... they interacted with Keith almost every day.  Miles, one of the concierges, said "I miss Keith pacing back and forth across the lobby in his bare feet doing business over the cell phone in his loud and laughing voice".  Denny, the general manager of the hotel, said that two days before he passed away, Keith had his head stuck through the door of Denny's office making him laugh hysterically.  None of this surprises me.

    We couldn't go to Keith's funeral as my wife, Lisa, had sinus surgery last Monday (she is doing fine).  About two hours after the surgery, she was getting ready to go home and Lisa said "I couldn't be more thrilled with the surgery... I just wish we had been able to go to the funeral!".   The surgeon said, "What, Keith's funeral?".  We had not had any discussion of Keith with him but the doctor had served on a charity board with Keith and had known him for years.   I've heard reports that the gathering was standing room only with over 1000 people in attendance.  It would be amazing to say how many lives Keith had touched in his 48 years but I saw his spirit and that makes it all seem so plausible.

    What can you say when a wonderful person's life is cut short?  My dear friend, Jim Gray, frequently said, "First you live, then you die, live now!"  It is SO true!  I miss Keith...

    - Pat

  • A Wonderful Few Days at a Wedding in Chicago

    Blogging on the bus again!!  This was written on August 18th and 19th but it's taken a few weeks to get around to editing the photos...  No computer stuff in this blog, just personal fun!

    So, on Wednesday, August 13th, Lisa and I flew to Chicago to attend a wedding of a friend from our building in Seattle.  We met John at the building in which we live.  He recently fell in love with a wonderful lady named Helen and when we were invited, we had to say yes!   Neither of us had been to Chicago (I am NOT counting two work trips where I didn't get out of the conference rooms).   We were very excited to go and simply had the BEST time.

    The wedding was at the downtown Ritz-Carlton right next door to the John Hancock Tower.   As we were leaving home and walking out the front of the building to get to the Seattle airport, we saw John and Helen loading up to go!   After some hugs, we determined that they were on an Alaska Airlines flight landing 30 minutes before our United flight and we would see them at the hotel later in the evening since they were having dinner with Helen's parents who live in Chicago.   We landed, got to the hotel, unpacked, and had dinner in the hotel.  We stayed up late Wednesday night drinking and chatting with the almost newlyweds since we were the first to arrive into town.

    Helen is a beautiful Chinese lady originally from Beijing and an only child.  Her father came over to become a university professor (I think in material sciences) and started a successful company... the real American dream!   John is from Philadelphia and is the 11th of 12 children from an Irish family.   He has worked his butt off for years in the financial industry and has held amazing positions at very influential companies having climbed up solely with his wits, hard work, and honor.  It is a joy to have them both as friends.  They waited until later in life to marry since neither of them found their true love until recently.  Just watching them together (and knowing they are our neighbors) is so cool!


    Thursday, August 14th -- Walking, checking out the Chicago Spire, an amazing dinner, and partying with new friends!

    We were going to be attending the wedding on Saturday but had Thursday and Friday as a mini-vacation to acquaint ourselves with downtown Chicago.   The weather was perfect!  The town was alive and pumping.  We were in the middle of the Magnificent Mile of shopping on Michigan Avenue and we just had a blast.   After sleeping in on Thursday, we had to try Chicago Pizza!   We headed to Gino's East and were one of the first there for lunch (since this was our breakfast after sleeping in).  I wanted to try deep dish pizza.   Lisa wanted a thin crust... So, not wanting either of us to be disappointed, we ordered both!   It takes a LONG time for deep dish pizza...  When they both came, we were astonished that it is DEEP dish pizza!   It seemed like a lasagna more than like a pizza.  I actually liked Lisa's pizza better!   The locals told us later that it is always better the next day after sitting in the refrigerator all night.   I'm not convinced!


    Next, since Lisa and I are fascinated by cool buildings, we went to the presentation center for the Chicago Spire (also see).   This condo building will be 2000 feet tall and hold about 1100 homes in it!  It is a really novel architecture right where the Chicago River meets Lake Michigan.   It will punch up on the Chicago Skyline about 500 feet taller than the Sears Tower and be the tallest building in North America when it opens in 2012.  The construction is just about at ground level now.   Each of the 1100 (or so) homes has a different floor plan since the tower gets gradually smaller as it rises and the window (and structural) placement gradually rotates (about 2.5 degrees per floor).  It was a BLAST to see the project even though we are currently quite content with our condo home in Seattle that we are planning to move into next year (and the commute to Redmond from downtown Chicago would be harder than from downtown Seattle).  Still, it was fascinating checking it out and we will certainly remember on subsequent visits when it is a world famous landmark!

    After leaving the Chicago Spire offices, we walked to Navy Pier which is a wonderful public pier extending into Lake Michigan with amusement rides and boat tours and much more.  There were lots of people just having fun in the sun on a beautiful day.  Soon, the roar of the jet engines told us that the Blue Angels were in town!  Now, we had just spent an afternoon watching the Blue Angels less than two weeks ago!  Also, Lisa had bumped into the pilots at the Washington Athletic Club where we exercise.  The WAC is where they stay when in Seattle and Lisa had a wonderful conversation with these fine servicemen.  So, when we saw them flying over us again, it was just a LOT of fun to watch them practice on Thursday.

    Chicago-Aug08- 053 Chicago-Aug08- 029 Chicago-Aug08- 037 Chicago-Aug08- 039 Chicago-Aug08- 042 Chicago-Aug08- 043 Chicago-Aug08- 046 Chicago-Aug08- 047 Chicago-Aug08- 052

     

    By 3:30PM, we needed to start our walk back to the hotel as we had 6:00PM dinner reservations at Charlie Trotter's Restaurant and needed to get all dressed up!

    So, Lisa had heard about Charlie Trotter's and we were told it was one of the finest restaurants in the world.  Now, you never know how that should be interpreted but we have to say, it was phenomenal!  There was a seven course menu with wine pairing.  A coat and tie is required and it was my first opportunity to try out my new suit (since the old one from 18 months ago was too loose). 

    Chicago-080817- 001

    We had the most amazing food and had a wonderful time!   They have two seatings (6PM and 9PM) and by 8:30, we were heading back to the hotel.  Unfortunately, Lisa was getting a headache so she crashed for the night.  I shot a text message to the groom but the recently arrived gang was still at dinner so I walked and explored the beautiful city in the Magnificent Mile area on an absolutely lovely evening.  The place was abuzz with action and it felt SO good to walk and burn some calories.

    Soon, I was meeting some of John and Helen's friends and family and we were out partying!  Helen's friend Eva is currently living in Shanghai.  Most of John's friends are from Washington DC but his family (six brothers and five sisters) are from Philadelphia.  I met Mike and Karen (from Austin), Reza (from DC), TJ and Emily (from DC), and John's brother, Tony!   John, Tony, and I closed the place down and had a blast!

    Friday, August 15th -- Sushi Lunch, The Bus Tour, the Boat Tour, and Meeting more friends!

    So, I slept in until 10AM (Lisa had popped to life at around 8AM but let me sleep).  We started by tracking down sushi for my lovely (sushi-loving) wife.  We had a lovely lunch and then set out to figure out what to do.  We decided to hop onto one of those double-decker buses where you can sit on top, see a LOT, and get a sunburn.  We did all these!

    We rode around on the tour bus for two hours in the sun and saw SO many fascinating buildings, parks, the Chicago River, and more.  

    Chicago-080817- 003 Chicago-080817- 006 Chicago-080817- 014  Chicago-080817- 021  Chicago-080817- 115  Chicago-080817- 089 Chicago-080817- 101 Chicago-080817- 105 Chicago-080817- 106  Chicago-080817- 112 Chicago-080817- 114  Chicago-080817- 140  Chicago-080817- 161 Chicago-080817- 185 Chicago-080817- 218 Chicago-080817- 220 Chicago-080817- 224

    We decided to bail out near the Chicago River and walk down to Grant Park.   This would only lose us the last mile of the tour loop and that was the Magnificent Mile of shopping and we had already walked that.   As we crossed the river, we saw a tour boat on the river and decided that would be more fun and hiked over to Navy Pier (where we already knew they originated).  So, off we were on an "Architecture Tour" from a boat up the Chicago River.   This time, it was an hour in the sun and again, snapping photos like crazy!   Here's some of the best!

    Chicago-080817- 226 Chicago-080817- 234 Chicago-080817- 242 Chicago-080817- 245 Chicago-080817- 246 Chicago-080817- 247 Chicago-080817- 251 Chicago-080817- 255 Chicago-080817- 256 Chicago-080817- 261 Chicago-080817- 266 Chicago-080817- 277 Chicago-080817- 279 Chicago-080817- 282 Chicago-080817- 284 Chicago-080817- 288 Chicago-080817- 290 Chicago-080817- 291 Chicago-080817- 317 Chicago-080817- 292 Chicago-080817- 293 Chicago-080817- 310 Chicago-080817- 311

    We had promised to get back to the hotel by 5:30 to rendezvous with some of the folks flying in from other places.  That evening was the rehearsal dinner and, while we weren't in the wedding party, we were hanging out with some of them, their spouses, etc, while the rehearsal itself happened.  It was a BLAST.  Soon, we were running off with our new friend, Reza, to grab a steak a couple of blocks away while the wedding party did the dinner together.  We came back to the hotel later and had drinks with everyone until we bagged it at around 1AM.   Again, an absolutely LOVELY day of adventures, partying, and nice people.

    Chicago-Aug08- 068 Chicago-Aug08- 065 Chicago-Aug08- 066

    Saturday, August 16th -- A lovely lunch and an AMAZING Wedding!

    So, again we slept in and rested.  We got ourselves going at around 10:30 and knew we couldn't be too adventurous because we needed to be in formal clothing and ready for the wedding by 3:30.  We decided that the Chicago River was just TOO cool and so we hiked the mile down Michigan to the River and started exploring for a restaurant along the river to sit for lunch.  It took a little while but we found a wonderful place called Flatwater and had a GREAT lunch.  Walking back, we explored the hotel lobby for the new Trump Tower along the river.  Having an OPEN hotel while major construction is going on 90 stories above is really cool!  But there they are open while the building is being built!   We walked back and checked out the Fourth Presbyterian Church of Chicago which was only a block away from our hotel.  It is a beautiful 140 year old church!

    Chicago-080817- 322Chicago-080817- 331    Chicago-080817- 324

    Here's the building with our hotel, the Ritz-Carlton.

    Chicago-080817- 338

    And, right next door is the John Hancock Center.

    Chicago-080817- 341 Chicago-080817- 327

    Soon, it was back to a shower and getting the fancy clothes on.

    We showed up at around 3:30 for the 4:00PM wedding and it was gorgeous!  Our new friends were arriving and soon we saw the groom (our neighbor, John) and his six brothers.  There was a string quintet playing while everyone was seated and a trumpet player just waiting for the "Here Comes the Bride" procession... that's ALL the trumpet player did!

     Chicago-080817- 343 Chicago-080817- 346 Chicago-080817- 352 Chicago-080817- 356

    The wedding itself was short and beautiful!   Soon, we were outside the room for appetizers and cocktails while the wedding party had jillions of photos taken and the room was reset for dinner.  It was wonderful!

    Chicago-080817- 376 Chicago-080817- 359 Chicago-080817- 361 Chicago-080817- 364 Chicago-080817- 372 Chicago-080817- 374

    We sat at dinner with some of John's family (a sister, nephew, niece, and nephew's girlfriend).   Let me clarify... the nephew was 30, built like a mountain, and back from his second tour in Iraq.  They were fun!  There were also some of John's good friends from some of his jobs in the financial industry.  During the (lovely) meal, there were four musicians playing traditional Chinese music while we ate.  It was amazingly cool.

    Then, out came the THIRD group of musicians, a 9 piece orchestra to do the dance music.  Helen has spent years doing ballet and John and Helen did the most impressive "first dance" I have ever seen.  It was elegant, graceful, and charming.  The singer from the musicians said she had seen a LOT of first dances and that was the classiest.  Then dancing began, the wedding cake was served, and more and more dancing!   Lisa had to (temporarily) abandon her beautiful (but painful) high-heels in order to dance!

     Chicago-080817- 386 Chicago-080817- 388 Chicago-080817- 390 Chicago-080817- 393 Chicago-080817- 395 Chicago-080817- 398 Chicago-080817- 399

    Our new friend, Reza, caught the garter belt!

    Chicago-080817- 404 Chicago-080817- 407

    We managed to dance on and off until about 1AM.  My new suit DEFINITELY needed dry cleaning and we had a wonderful time.   It was just one of the classiest and most fun weddings I've ever been to.

    So, back to the room by 1AM... Packing... Too little sleep... Catching a ride to the airport by 9AM... Change planes in Denver and glad to be home!   Happy (and tired) to be at the gym by 6AM Monday morning... Between the cruise and Chicago, I need more time at the gym!

    A trip we won't forget!  Chicago is a great city!

    -- Pat

  • Confidence in the Cloud

    So, one morning a few weeks or so ago, my colleague, Ruwen Hess, stuck his head in my door and commented on an Article on Gmail and User Confidence he had just read.  He also pointed out a very interesting Article on the S3 failure from July 20th (which I had seen before).  Now, before I start commenting on the point of this blog post, I want to say that this is NOT meant as a slam on our competitors.   This is a complex and nascent space in which we are all learning.   I want to compliment my former company, Amazon, on the openness exhibited in their discussion of the event of July 20th.  It is a new thing for Amazon to explain when stuff happens and I think it is great for the industry and great for Amazon.  Google is just fighting the good fight to provide highly available services.  They are both fascinating companies and worthy competitors.  Still, there are some interesting aspects to this publicly available news.

    Of course, this leads my mind down too many different paths.  This was going to be a short post but I always get carried away!  Let me cover:

    • Some Observations about Reliable Process Pairs,
    • Less Is More,
    • N-Version Programming,
    • Availability Over Consistency,
    • Eventual Consistency,
    • Front-Ending the Cloud, and
    • It's Going To Be a Fun Ride!

    Some Observations about Reliable Process Pairs

    Let me start by discussing independence of failures (and the need to focus on it to achieve high availability).  During the 1980s, when I was developing system software at Tandem Computers, some of our important programs were process pairs.  In this scheme, the same software ran in two different processes on different computers within the distributed system connected via messaging.  The goal was that if only one of the computers (and, hence, processes) failed, the other would "take-over" and continue offering service.  Of course, it was essential to be able to restart a backup and fill it up with sufficient state to continue the computation.  Specifically, every failure and take-over created the need to restart the other process (which used to be the primary).  Restarting implies the filling of the new backup with enough state to be ready to take-over.

    During this time, I was privileged to be able to see a number of different implementations of process pairs within the Tandem System and, indeed, to do major development and implementation on a large process pair (in addition to a number of other pieces of software with different approaches to fault tolerance).   As I lived and breathed the triage of product support, crashes, dump analyses, patches, and major product upgrades, it became clear to me that there was a pattern.   Some process pairs were implemented to ship entire data structures.  Some were implemented to send very narrow and minimalist descriptions of an operational change in state (e.g. in the transaction manager: "We just advanced transaction-X to the stage of flushing the log records to disk").  In a different example, one process pair would send a message containing its entire address space of data to its partner whenever a change occurred!  Lots of lively debates raged about the best way to implement reliable process pairs.

    I observed that the implementations which sent the minimum amount of data seemed to be the most resilient.  There were many crash dump analyses that showed a process running along and it would trash some data structure.  Then, a checkpoint would copy the data structure to the backup process, including its corruption.  (A checkpoint is the message used to communicate state from the primary process to the backup process in a process pair.)    Soon, the corruption caused the primary process (and its computer) to crash.  Now, the backup (with the corruption received from the checkpoint) takes over to do the work of the process pair.  Before long, the backup gets tangled up because of the corruption in the data structure and IT falls over, too!   The entire fault tolerant system has succumbed to a single failure because there is insufficient isolation between the parts.
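
To make the contrast concrete, here is a minimal sketch in Python (nothing like the actual Tandem code). The class names, the stage names, and the "transaction stage" table are all hypothetical, chosen only to illustrate the two checkpointing styles: one backup accepts a copy of the primary's whole data structure, the other accepts only a narrow description of a state change and validates it before applying it.

```python
import copy


class FullStateBackup:
    """Backup that accepts a copy of the primary's entire data structure."""

    def __init__(self):
        self.stages = {}

    def checkpoint(self, primary_stages):
        # Whatever the primary holds, corruption included, is copied over.
        self.stages = copy.deepcopy(primary_stages)


class MinimalBackup:
    """Backup that accepts only a narrow description of one state change."""

    VALID_STAGES = ("active", "flushing", "committed")  # hypothetical stages

    def __init__(self):
        self.stages = {}

    def checkpoint(self, txn_id, stage):
        # The backup validates and applies the change itself, so a trashed
        # structure in the primary has no path into the backup's memory.
        if stage not in self.VALID_STAGES:
            raise ValueError(f"rejecting checkpoint with unknown stage {stage!r}")
        self.stages[txn_id] = stage


def demo():
    primary = {"txn-1": "active"}
    full, minimal = FullStateBackup(), MinimalBackup()

    # A legitimate change is checkpointed both ways.
    primary["txn-1"] = "flushing"
    full.checkpoint(primary)
    minimal.checkpoint("txn-1", "flushing")

    # A bug trashes the primary's entry for txn-1.  The next checkpoint is
    # about an unrelated transaction, yet the full copy carries the damage.
    primary["txn-1"] = "\x00garbage"
    primary["txn-2"] = "active"
    full.checkpoint(primary)
    minimal.checkpoint("txn-2", "active")

    return full.stages, minimal.stages
```

After `demo()`, the full-state backup holds the garbage entry for `txn-1` while the minimal backup still holds `"flushing"`. The narrow message is not magic (a bad message can still be sent), but the backup gets a chance to reject it and the blast radius is one field rather than the whole structure.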

    Less Is More

    Communicating less information within a message is usually best.  Every extra piece of state you send is one more thing that can carry corruption to the receiver!

    The failure of Amazon's S3 on July 20th is fascinating and, yet, very typical.   S3 uses a Gossip Protocol to spread state around the network without any master.   The knowledge is simply propagated to anyone that will listen and, very quickly, spreads around the network.   Unfortunately, there was a bit of poison introduced into the system.  Now, it just so happens that this was introduced because Amazon's MD5 hash protection of their messages did not cover some of the system status bits, but that is not my main point.   All (non-protected) data communications will occasionally have transmission errors and, unfortunately, the stuff being gossiped by S3 was not protected.  The article simply describes it as system status state.
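
Here is a small sketch of what "the hash did not cover some of the bits" can look like. This is NOT the S3 message format, which the article does not describe; the field names (`status`, `payload`), the `cover_status` switch, and the helper functions are assumptions made up for illustration. It uses Python's standard `hashlib.md5`.

```python
import hashlib


def digest(*fields: bytes) -> bytes:
    """MD5 over the given fields, each length-prefixed to keep boundaries clear."""
    h = hashlib.md5()
    for f in fields:
        h.update(len(f).to_bytes(4, "big"))
        h.update(f)
    return h.digest()


def make_message(status: bytes, payload: bytes, cover_status: bool) -> dict:
    """Build a message whose hash covers the payload and, optionally, the status."""
    covered = (status, payload) if cover_status else (payload,)
    return {
        "status": status,
        "payload": payload,
        "cover_status": cover_status,
        "md5": digest(*covered),
    }


def verify(msg: dict) -> bool:
    """Recompute the hash over the covered fields and compare."""
    if msg["cover_status"]:
        covered = (msg["status"], msg["payload"])
    else:
        covered = (msg["payload"],)
    return digest(*covered) == msg["md5"]


def flip_first_bit(data: bytes) -> bytes:
    """Simulate a single-bit transmission error in the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]
```

With `cover_status=False`, flipping a bit in `status` still leaves `verify` returning `True`, so the receiver happily believes (and could gossip onward) a corrupted status. With `cover_status=True`, the same flip makes `verify` return `False` and the message can be dropped.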

    I know that in these complex distributed systems, it is essential to keep somewhat accurate track of the state of the system.  What is not clear from the Amazon S3 article from July 20th is the shape and form of the system status information and why it did not settle out automatically after the corruption.   The article mentions that the system status state is given a higher priority and was starving out the real work as it was gossiping about the system state.

    The main point is that when you are communicating information, it is essential to keep the information flow as sparse as possible.

    While S3 is a brilliant system which is designed to continue functioning during a data center outage, it had a flaw (which I'm sure is now fixed) which allowed for some bad state to propagate to ALL the data centers.  To eliminate the problem, all the data centers were taken offline and then the entire system was restarted.

    Let me reiterate my deep respect for the engineering at both Amazon and Google in addition to what I see at Microsoft.  Still, as the industry matures, we will all learn by the school of hard knocks to keep the information we spread across failure units as crisp and concise as semantically possible.

    N-Version Programming

    I've never personally experienced it but have read about uses of N-Version Programming.   In this scheme, specifications of a program are drawn up and presented to N different teams, each of which will produce a version of the system.  Now, if you develop 3 versions of the system, you may then implement a voting scheme in which when there are two disparate answers, you pick the one with two votes.  The concept is to reduce the probability of a software bug causing the entire system to fail.  Of course, one of the challenges in this approach is the development of comprehensive and clear specifications.
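
The voting step can be sketched in a few lines of Python. Everything here is hypothetical: the toy "specification" (round `n` up to the next multiple of `m`), the three versions, and the deliberate bug in the third one are mine, picked only to show a 2-out-of-3 vote masking one faulty implementation.

```python
from collections import Counter


def version_a(n, m):
    return ((n + m - 1) // m) * m


def version_b(n, m):
    return n if n % m == 0 else n + (m - n % m)


def version_c(n, m):
    # Deliberately buggy: wrong whenever n is already a multiple of m.
    return (n // m + 1) * m


VERSIONS = (version_a, version_b, version_c)


def vote(versions, *args):
    """Run every version and return the answer held by a strict majority."""
    answers = []
    for version in versions:
        try:
            answers.append(version(*args))
        except Exception:
            # A version that crashes simply casts no vote.
            continue
    if not answers:
        raise RuntimeError("no version produced an answer")
    answer, count = Counter(answers).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among the versions")
    return answer
```

For example, `vote(VERSIONS, 12, 4)` returns `12` because the two correct versions outvote the buggy one, which answers `16`. Of course, if two teams share the same misunderstanding of the spec, the vote confidently picks the wrong answer.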

    The reason I think of N-Version Programming is that it involves the "Less Is More" principle.  To do N-Version Programming correctly, you need to isolate your development teams and prohibit any out-of-band communication.  If the teams communicate (other than through the specifications), it increases the chances of propagating some corruption.

    Availability Over Consistency

    I am not aware of any information from Google about the cause of their outage but, again, we as an industry are still learning how to keep ever more complex applications and services available.   The article cited above expresses the dismay of many users as a service they've grown to depend upon is simply not there for a while.

    In many cases, the user would gladly take a "good enough" answer NOW rather than wait for the "correct" answer LATER.  In fact, it is MORE common to see users just want to keep going.  Right now, I am typing most of this on the bus with Windows Live Writer inserting notes to myself about the hyperlinks to fix when I am back online...  It's great to just "keep going" even with a reduced experience!

    Amazon published a wonderful paper on  Dynamo at SOSP 2007.  This paper provides an excellent overview of the Dynamo storage system (which I had a role in encouraging from the sidelines -- I can't take credit for it).  Dynamo is in production with at least two running services and numerous fascinating techniques employed in its implementation.  I would encourage you to read the paper.

    The reason that I raise this here is that Dynamo provides availability over consistency.  In a distributed system, it has been proven that Brewer's CAP Conjecture is true... Hence, it is now called the CAP Theorem.   The idea is that you can have only two of Consistency, Availability, and Partition Tolerance.   You can have a consistent (and by this, the idea is a classic transactional ACID consistency) and partition tolerant system but it may not be available under some partitions.  Alternatively, you can have an available system which tolerates partitions but it won't have the classic notion of consistency.  Increasingly, I see applications designed with looser notions of consistency.  Most of the time, customers really want availability at the expense of classic consistency!  New means of expressing looser consistencies are emerging to provide availability even when failures occur!

    I learned over 20 years ago working in transactional systems to ASK a customer what their priority was when dealing with an outage.  Indeed, some customers wanted you to ensure that every transaction was correct before bringing the system online.  At first, it shocked me to learn that many, many customers wanted the system up even if the results of the work might be somewhat "less than perfect".  Indeed, I saw this in some banking systems with humongous amounts of money being pushed around.  It wasn't that they were doing anything wrong... if the system came up, most of the transfers could be accomplished and overnight interest gathered for them.  The funds involved were large enough that hundreds of bank workers would simply stay up all night and verify the accuracy of the work, cleaning up as necessary.  The timeliness mattered more than the accuracy!

    Again, more often than not, availability matters more than strict (classic) consistency!

    Eventual Consistency

    Also fascinating (at least to me) is the area of eventual consistency (and, again, this term implies a LOOSER form of consistency than classic ACID transactions).

    The basic idea is that it is OK for opinions to diverge across replicas.  What is needed is a system that coalesces when communication is reestablished between the replicas.  I talked about this in my talk The Irresistible Forces Meet the Movable Objects.   I am bringing this up because I am arguing that people really want availability over consistency.  When divergent changes are made while disconnected, they also want the messes to clean themselves up as much as possible.

    It is my opinion that we will be designing systems to support eventual consistency.  Part of this trend will be to back away from our traditional separation of storage (e.g. database) from applications.  In my opinion, we will evolve towards integrated solutions that combine the storage with the execution of the application in a way that will make it easier to redo the operation by the application and create eventual consistency across different executions of the app.  This is much easier to do than to cope with reorderability of write operations against a data store.
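
As a rough illustration of "redo the operation" rather than "reorder the writes", here is a minimal Python sketch. It is an assumption-laden toy, not a description of Dynamo or any real product: the `Op` and `Replica` names, the account example, and the choice of uniquely identified, commutative operations are all mine. Each replica remembers the operations it has seen; merging is a union of operations, and the state is recomputed by replaying them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    op_id: str    # unique id, so a redelivered operation is applied only once
    account: str
    delta: int    # positive = deposit, negative = withdrawal


@dataclass
class Replica:
    ops: dict = field(default_factory=dict)

    def apply(self, op: Op) -> None:
        # Idempotent: seeing the same operation twice changes nothing.
        self.ops.setdefault(op.op_id, op)

    def merge(self, other: "Replica") -> None:
        # Coalesce by taking the union of everything either side has seen.
        for op in other.ops.values():
            self.apply(op)

    def balances(self) -> dict:
        # Replay the operations in a deterministic order to derive the state.
        totals = {}
        for op_id in sorted(self.ops):
            op = self.ops[op_id]
            totals[op.account] = totals.get(op.account, 0) + op.delta
        return totals


def demo():
    a, b = Replica(), Replica()
    # The replicas diverge while disconnected.
    a.apply(Op("a-1", "checking", 100))
    b.apply(Op("b-1", "checking", -30))
    b.apply(Op("b-2", "savings", 50))
    # Communication is reestablished; merging twice is harmless.
    a.merge(b)
    b.merge(a)
    a.merge(b)
    return a.balances(), b.balances()
```

Both replicas end up with the same balances no matter which one merges first. The sketch sidesteps the hard part, which is operations that do not commute (or that the business would reject in hindsight), and that is exactly where application knowledge has to come in.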

    Front-Ending the Cloud

    So, I love and believe in cloud computing.  I think the entire space will grow and evolve.  The cost structures for many companies will make it very cost effective to use big data centers for hosting lots of their computation.  As I look at data center cost structures, it is clear that it is going to be a competitive business with many advantages to large data center managers with large economies of scale.  In a handful of years, most companies will look to offsite providers for their reliable servers.

    Still, I wonder if there will be a market for local front-end systems running a replica of a portion of the applications needed to keep the business going.   Not all the computation a company performs is "mission critical".  In other words, the company can function with a subset of its computation (at least for a short while).  Will there be an emerging space of loosely-coupled eventually consistent computation run either at the local company or at a different hosting company than the main cloud-purveyor?   If there is an outage of the cloud services, this front-end would keep the business going.  When reconnected, the results of the work would be shared and reconciled.
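
A toy version of such a front-end, in Python, might look like the following. The `Cloud` and `FrontEnd` classes and their methods are hypothetical stand-ins, and "reconcile" here is just replaying queued requests in order; a real system would need something closer to the eventual consistency discussed above.

```python
class Cloud:
    """Stand-in for the offsite service; 'available' simulates an outage."""

    def __init__(self):
        self.available = True
        self.processed = []

    def process(self, request):
        if not self.available:
            raise ConnectionError("cloud service unreachable")
        self.processed.append(request)


class FrontEnd:
    """Local front-end that keeps accepting work when the cloud is down."""

    def __init__(self, cloud):
        self.cloud = cloud
        self.pending = []  # work accepted locally but not yet reconciled

    def submit(self, request):
        # Always queue first so requests reach the cloud in arrival order.
        self.pending.append(request)
        try:
            self.reconcile()
            return "processed"
        except ConnectionError:
            return "accepted locally"

    def reconcile(self):
        # Forward queued work; if the cloud is still down, nothing is lost.
        while self.pending:
            self.cloud.process(self.pending[0])
            self.pending.pop(0)
```

During an outage `submit` returns `"accepted locally"` and the business keeps going; once the cloud is reachable again, `reconcile()` (or the next `submit`) drains the queue in order.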

    This is very much like the relationship between Outlook and Exchange.  Outlook does a LOT of its functionality while disconnected and that is of huge value.

    This may be an interesting economic point -- combining the cost savings of the cloud with independent failures of the front-end.

    It's Going To Be a Fun Ride!

    I am SO glad to be working in this space!  It seems to me this is a great time for distributed systems geeks!

    - Pat

  • The Emissary Design Pattern and RIAs (Rich Internet Applications)

    Here is a first draft of a new presentation.   I gave it a couple of months ago just after TechEd and thought I would share it as I try to write up some of my thoughts on RIAs.  I plan to rework this a bit more and present it again at TechEd Europe.   The talk is titled: "The Emissary Design Pattern and RIAs (Rich Internet Applications)"

    Abstract:

    The Emissary design pattern was first described in 1999 in the old "Fiefdoms and Emissaries" talk.  The concept of a "fiefdom" is very similar to what we today call a service in a Service Oriented Architecture.  The fiefdom is a separate trust sphere and transactional boundary.   An emissary is a prescriptive pattern for interacting with a service (or fiefdom) which leverages reference data and a deep understanding of the service to prepare requests for service and maximize the chance those requests will comply with the requirements of the service.   An emissary may be richly interactive and anticipate the validation requirements of the service.

    The emerging world of RIAs (Rich Internet Applications) is a fascinating blend of a classic smart client and a browser-based web application.   In a RIA app, client code runs in the browser but still must comply with the browser-enforced sandboxing and not cause harm to the host client machine.  Navigation, naming, linking, and much more are being defined in a fashion drawing from both the web style and the client style.  Many of the design issues with RIAs are under discussion today as support for these applications emerges.

    This talk examines both the emissary design pattern and the nascent space of Rich Internet Applications.  It motivates how one can look to the workflow patterns in our parents' use of paper forms to understand the possibilities of implementing user-centric workflow as shared replicated data.  The talk concludes with some preliminary concepts of a shared and declarative definition of the "paper form" model and its constraints and how these may someday be used in the automatic generation of emissary-based RIA clients.

    More RIA stuff soon!

    - Pat


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement