Welcome to MSDN Blogs Sign in | Join | Help

Scalability Notes

[Read -> Think -> Write]
Implementing Consistency - Protocols for Data Replication and Cache Coherence
Various concrete consistency models have been described in [1], now it's time to discuss how to implement these models.

Consistency semantic is divided into two categories in [1] - "Coherence and Replication are very similar concepts and deal with the same problem, but the former is often used in hardware system (for example, in SMP Cache system) and the later is often used in software system (for example, in distributed files system and database system)".

Some other differences:
1. Coherence deals with data item at client side, while Replication deals with data item at server side
2. Coherence deals with relicated data item that has a corresponding one in a reliable storage backend, but Replication doesn't have that.

We will discuss protocols for Data Replication and Cache Coherence in different sections.

Part 1 - Data Replication

To keep replicated data item consistent in distributed storage system, there are some core problems to think about: where and how to store various replicas, what, when and how to propagate update to replicas? how to keep consistency?

I - Replica Server & Replica Content Placement
- As close to client as possible to improve performance
- Adjust repica placement dynamicly according to access history

II - General Update
Propagation Problems

1. What to Propagate?
- Notification (Invalid Message)
- Command (operation & parameters)
- Data (updated region)
- Change Log (to accumulate several updates into one)

2. When to Propagate?
- Pull (each replica asks for updates)
- Push (primary replica send updates to all secondaries)

3. How to Propagate?
- Unicasting (send updates to each replica one by one)
- Multicasting (send update message to all replicas at one time)

III - Replication Protocols

A replication protocol aims to implement some specific consistency model, the detail of the protocol is highly dependent on the target model. Here we only describe protocols which implement models that are popular in practical systems.

1. Sequential Consistency

1.1 Primary Based Protocol
- all write operations to a data item x is served by one special replica called primary replica, this replica is responsible for updating other replicas, client only interact with this special replica.

1. 2. Replicated Write Protocol - write operations are sent to each replica to execute:
- Active Replication, a total order of all write operations is required in order to make each replica execute the same order of write commands.
- Quorum Based, write operations only need to be executed on part of all replicas before return. It use votes to prevent write-write confilict and write-read conflict:
  • Suppose each data item has N replicas
  • To read a value, client must contact with at least Nr replicas
  • To write a data item, client must contact with at least Nw replicas
  • To ensure no WW and WR conflicts, Nr + Nw > N and Nw + Nw > N should be statisfied
2. Eventual Consistency Protocol

Two requirements should be meet for this kind of protocol:
- All update operations to a data item should reach and be executed all replicas at some time
- These operatioins should be executed in the same order

Some popular methods to ensure these requirements are:
- Use write set and read set (use Nw, Nr to control how many porcesses should be involved in write or read operation), thus update operations are ordered.
- Expose data item version number to client, so when client accesses data, it can pass known latest version number to server to implement some client centric consistency model
- Limit update operation execution process, so write-write conflicts can be solved easily

Note:
- All upper protocols ignore process/node & communication failure, which would occur often in practical distributed system.
- Replication protocols than deal with failure, such as Paxos, Two-Phase Commit, are much more complicated.

Part 2 - Cache Coherence

Cache Coherence Protocol is used to keep client side replicas consistent in the context that a reliable data item exists in storage backend (it's true for smp cache like hardware system and for distributed file system cache like software system).

I. Core problems of Cache Coherence Protocol:

1. Coherence Detection Strategy
That is to say, when inconsistencies are actually detected. For example, in distributed database system, this detection can be performed:
- At the beginning of a transaction
- Parallel with the on going transaction but before it commits
- When transaction commits

2. Coherence Enforcement Strategy
That is to say, how all caches are kept consistent with each other. Generally, there are two methods
- Invalidating: if a data item is modified in one cache, other caches that hold the same data item will receive a invalidation notification.
- Propagating: if a data item is modified in one cache, the update is propagated to other caches that hold the same data item. So all replicas in cache are update to the same version.

3. Cache-Server Consistency
That is to say, how to keep data items in cache and in storage server consistent with each other. There are serveral policies:
- Use Read Only Cache, modification can only be made on items in storage server, cache pull these updates
- Write Back, modifications to data items are accumulated at cache side, and are written to storage server at some other time.
- Write Through, modifications are made at cache, and are also propagated to storage server. In SMP system, bus or directory may be used to serialize all operations. While in distributed file systems, an exclusive lock may be needed for a data cache that can perform modification operations to avoid write-write confliction.

II Implementing Cache Coherence

1. Implementation Mechanism

1.1 Directory Based - In a directory-based system, some information about the data being shared is placed in a common directory. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.[2] It is not as fast as snooping but more scalable.

1.2 Snooping - In such system, the individual caches monitor address lines for accesses to memory locations that they have cached. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates or update its own copy of the snooped memory location.[2] Since it needs broadcasting, not very scalable.

2. Protocols

The various cache coherence protocols use Directory or Snooping mechanism to solve the three core problems described above:

2.1 Invalidation Protocols
- Write-Once
- Berkeley
- Illinois/MESI

2.2 Propagating Protocols
- Firefly
- Dragon

Most of the cache protocols in multiprocessors are supporting sequential consistency model, while in software distributed shared memory more popular are models supporting release consistency or weak consistency.

[Reference]
[1] http://blogs.msdn.com/csliu/archive/2009/06/15/consistency-model-a-survey.aspx
[2] http://en.wikipedia.org/wiki/Cache_coherence
[3] http://www.nedprod.com/NedHAL/Cache%20Coherency%20solutions.html
[4] http://en.wikipedia.org/wiki/Memory_coherence
[5] Shared Memory Architecture, 2001, Chinese Higher Education Press, Weiwu Hu
([5] 共享存储系统体系结构, 2001, 中国高等教育出版社, 胡伟武)
Consistency Model - A Survey
Part I - What's Data Consistency Model and Why Should We Care?

Data Consistency Model
- it is a Semantic Contract between a data storage system and its user. Here, data storage system may refer to hardware system ( for example : memory sub-system inDSM, SMP, CMP computers) , or software systems (for example: distributed file system, web caching or distributed database).

Essentially, Consistency Model defines what value to return in a read operation.

The most natural semantic for storage system is - "read should return the last written value". It is intuitive in memory system with uniprocessor associated, where no concurrent access and no data replication exist.

But when concurrent client access and multiple replica exist, it's not easy to identify what "last write" means, mainly due to the lack of global time clock in parallel/distributed system.

So various data models are proposed to define various data semantic when Contention and Replication exists:
- Data Contention - In memory sub system of parallel computers, there may be multiple processors that access the same memory location at the same period of time, what's the semantic each processor should expect? (different process access the same data item)
- Data Replication - In distributed file/database system, each data block is replicated at multiple places, even access(read/write/modify) requests from the same client may involve more than one replicas, what's the semantic in client's perspective? (data item is replicated)
- Both Contention & Replication : What if multiple clients access replicated data blocks? What if there is a local cache for each processor in SMP system? (many processes access replicated data items)

Part II - The Various Models

There are many consistency models exist in computer architectures and distributed storage community. Why people invent so many semantics between storage and its clients? The reason is the trade off between Strict Semantic VS Available Optimization Space - strict (also simple and easy to understand, use and implement) semantic will limit the space that we can use to improve availability and do performance optimization. So each model has its advantages and drawbacks.

Before describe various models, we should clarify some concepts first:

Program Order
- the order of operations (on data items) that is implied by the order of program statements(or assembly instructions). It's the issuing order of operations that are from the same thread/process.

Write Atomicity - write operation performs as no concurrent/overlapped operations can occur. So the meaning of "Atomicity" here differs from that in DBMS field, where "Atomicity" means that all ops in a transaction will be performed or none of them will be performed.

Consistency VS Coherence/Replication, Consistency focus on the semantic of data store (a set of data item) for multiple client accessing, while Coherence/Replication only deal with the semantic (and how to implement it) of a replicated data item (multiple replica exist) when it is accessed by multiple clients. Coherence/Replication is just part of the Consistency Model. Coherence / Replication Protocol only describes how to implement part of the semantic of some specific data consistency model.

[Update@07/23: Coherence and Replication are very similar and deal with the same problem, but the former is used in hardware system (for example, cache coherence) and the later is used in software system (for example, in distributed files system and database system]

Following are various Consistency Model descriptions:

1. Atomic Consistency/Strict Consistency

It means that for a particular data item (byte, cache block and file region etc.), each read will return the latest update. I.E. events are visible to all clients as they occurs/issues. As we already said, due to the lack of global time clock in distributed/parallel system, it is very hard to implement if not impossible.

2. Sequential Consistency

This model is defined as - "The result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor appears in this sequence in the order specified by its program." - Leslie Lamport

This definition implies:
1. Events/Operations from the same process are executed in program order
2. Every client sees the same order of all events/updates, which is a sequential one

Goods:
- Ensures every body sees the same order, very good for data replication
- Preserves causality

Bads:
- Many interleavings are valid (two independent program with M mem ops and N mem ops will have (M + N)!/(M!N!) valid interleavings)
- Different runs of a program might act differently
- Execution order may be very different from issuing order

3. Causal Consistency

Events that are causally related must be seen by everybody in the same order. Unrelated ("concurrent") events can be seen in different orders.

The challenge in implementing this model lies in how to identify "causally related" events.

4. Eventual Consistency

If no updates take place for a long time (inconsistency window), all replicas will gradually become the latest version of the value.

In this model, if client reads data from different replica, it's hard to define a clear semantic about the return value. To resolve this problem, client-centric consistency is introduced. These models focus on the semantic in a single client's perspective.

Monotonic Reads - If a process reads the value of an object, any successive read operations on that object by that process will always return the same or more recent value.

Monotonic Writes - A write operation by a process on a data item x is completed before any successive write operation on x by the same process.

Read Your Writes - The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process. A closely related model is "Session Consistency". In this model, Read Your Write semantic only works within a session context and not cross session boundary.

Write Follow Reads - Any successive write operation by a process on a data item x will be performed on a replica of x that is up-to-date with the value most recently read by that process.

5. Weak Consistency

A storage system is said to support weak consistency if:

  1. All synchronization operations are seen by all processors in the same sequentially order.
  2. All data operations may be seen in different order on different processors. (But order of ops on the same data item is preserved)
  3. The set of both read and write operations in between different synchronization operations is the same in each processor.
An alternative definition is as:
  1. Accesses to synchronization variables are sequentially consistent.
  2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
  3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.
This model is weaker than sequential consistency model, since no order assumption can be made on data operations between consecutive synchronization operations.

The weak consistency leverage the fact that - between consecutive sync operations, no other process can use the data being written. So it's safe to reorder write operations on different mem locations. But the order of operations on the same location must be preserved to ensure a reasonable memory semantic.

6. Release Consistency

Release consistency is an extension of weak consistency that exploits the information about acquire, release, and nonsynchronization accesses.
  1. Before an ordinary LOAD or STORE access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed, and
  2. Before a release access is allowed to perform with respect to any other processor. all previous ordinary LOAD and STORE accesses must be performed, and
  3. Special accesses are processor consistent with respect to one another.
In Release Consistency Model, all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it, but a node must call acquire to get up-to-date values.

There are two kinds of coherence protocols that implement release consistency:
- eager, where all coherence actions are performed on release operations, and
- lazy, where all coherence actions are delayed until after a subsequent acquire

7. Entry Consistency

Like all variants of Release Consistency, it requires the programmer to use acquire and release at the start and end of each critical section, respectively. However, entry consistency requires each ordinary shared variable to be associated with some synchronization variable such as a lock or barrier.

- An acquire access of a synchronization variable S is not allowed to perform with respect to process Pi until all updates to the guarded shared data Ds have been performed with respect to process Pi.

- Before an exclusive mode access to a synchronization variable S by processor Pi is allowed to perform with respect to Pi, no other processor may hold S in non-exclusive mode.

- After an exclusive mode access to S has been performed, any processor's next non-exclusive mode access to S may not be performed until it is performed with respect to the owner of S.

8. Processor Consistency

- Writes done by a single processor are received by all other processors in the order in which they were issued, but
- Writes from different processors may be seen in a different order by different processors


The motivation behind this model is to better reflect the reality of networks in which the latency between different nodes can be different.

This model is also known as FIFO Consistency and Pipeline Random Access Memory Consistency.

Notes on Nex Steps:
1. Why People Invent Each Model?
2. What's Good and What's Bad for Each Model?

[Reference]
[1] Wiki on Consistency Model
[2] Summary on Consistency Model by CTO@Amazon
[3] Shared Memory Consistency Models: A Tutorial
[4] Memory Consistency Models
[5] Notes On Consistency
[6] Entry Consistency, Midway : Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors
[7] Release Consistency, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors
[8] Weak Consistency, Memory Access Buffering in Multiprocessors
[9] Memory Ordering in Modern Microprocessors (Part I, Part II)
[10] Wiki on Cache Coherence
Scalable I/O Models - In Sync & Async Way
I. What's Sync & Async I/O

  The concepts - "Asynchronous and Synchronous" come from the telecommunication domain. They mainly deal with timing - "Sync communication needs an external clock signal to coordinate the rhythm of sending/receiving sides, while in Async model, any timing required to recover data from the communication symbol is encoded within the symbols."[1]

  Sync/Async are introduced into computer science for a different meaning: "Async operations are those occurring independently of the issuing program flow, while in Sync model, requesting program flow must wait for the operations to acknowledge completion."[1]

  Particularly, "Asynchronous I/O is a form of I/O processing that permits other processing to continue before the transmission has finished"[2]; in Synchrounous I/O, the operation sender "must wait until the hardware has completed the physical I/O, so that it can be informed of the success or failure of the operation"[5].

  MSDN also has a great article about the Sync/Async concepts - Synchronous and Asynchronous I/O.

  NOTE: in the following sections, we use Async/Non-Blocking, Sync/Blocking alternatively and don't distinguish them. This convention is conformed to wikipedia and MSDN, but some popular textbooks(such as APUE ) differentiate these terms explicitly.

II. Why Invent Two I/O Models?

   The Sync I/O model is simple to understand and straightforward to use, but you should wait while the I/O is performing, thus the ultimate performance is limited.

  The basic motivation for Async I/O is to boost performance. In fact, there are two assumptions about this problem:
   1. I/O operations are time consuming, there IS a lot of time for CPU to execute other instructions while the I/O ops are going on.
  2. There are large amount of available instructions to execute when the results from I/O ops are not available.

   If these two assumptions are not correct, the performance of your code may not be improved even if you adopted the powerful Async I/O programming model.

III. How to do Scalable I/O Operations?

The challenges for scalable I/O model are:
1. What code to execute after I/O operations are issued (maybe in async/sync way)?
2. How to detect and What to do when I/O operations are completed?

Scalable I/O operations can be accomplished in either Sync or Async way:

- Scalable Sync I/O Model

  In Sync I/O Model, the API calls return only when the requested operation completes. But this doesn't mean that these calls must block. For example, suppose you want to read data from socket, if there are some data in the socket buffer before you issued the read request, the read() call will succeed without block.

  In order to avoid cpu idle while i/o operations are blocked, you may execute other tasks that is not related to the pending i/o, or avoiding i/o block by means of only issuing i/o operations at proper time. That is to say:

1. Multitasking - Code in other tasks get executed after I/O request is issued and code follows the I/O operations get executed when they are completed. (Task may take in the form of thread or process)

2. I/O Multiplexing - Use one thread to handle multiple sync I/O device handles.
  The key idea of this model is that - you call sync i/o functions only when you know that call will NOT block the calling thread (Because the data/status is ready even if you haven't issued any corresponding i/o operations, for example: connect(), read() on socket).
  You use some sync calls (such as select, poll, epoll) to check multiple device handles' status to know what i/o operations on them will not block.
   Essentially, I/O multiplexing provides a way to check the status of multiple device handles in one sync call. It only returns when some events happened on some handles. Thus you can handle multiple Sync I/O operations in one thread.

- Scalable Async I/O Model

   In Async Model, I/O operation call returns immediately after issuing and code that follows will get executed. The various models differs only on how to get I/O completion notification:

1. Polling - Use various polling function (such as HasOverlappedIoCompleted() on Windows) to poll whether some pending Async I/O is completed

2. Callback - Kernel will call some user provided call back functions when some i/o operations are completed.

3. Waiting - Kernel will signal some kernel event objects when i/o operations are completed. You just wait on these events to get completion notification.

4. Queuing - Kernel will place some information packet into a Queue when I/O operations completed. Client code should check the Queue for completion information.

IV. Async I/O on Win32 Platform

- Multithreading is supported
- I/O multiplexing is supported by select(), WSAAsyncSelect(), WSAEventSelect()
- Polling/Waiting is supported by means of Overlapped I/O
- Callback model is supported by means of Alertable I/O
- Queuing is supported by means of I/O Completion Port

V. Async I/O on Linux

- Multithreading is supported
- I/O multiplexing is supported by select(), poll(), epoll_xxx()
- Callback is supported by Linux AIO using system signal or notify function

NOTE:
- poll() improves select() in that there is no limitation on handle counts, and epoll_xxx() improves poll() in that data transfer between kernel and user space is reduced greatly and the complexity of the corresponding kernel logic is also reduced from O(N^3) to O(N).[17]

VI. Async I/O Model on .Net

- Callback is supported in the form of "Event-based Asynchronous Pattern"
- Polling/Waiting/Callback is supported in the form of "IAsyncResult Async Pattern"

NOTE:

1. Asynchronous Procedure Call mechanism is a great infrastructure for inter-thread/inter-process communication facility: Win32 API QueueUserAPC() can be used to queue APC entry to other thread's APC queue even in other process.
- The most famous usage of this facility is Gracefully Terminating Thread - if a thread is in alertable waiting state and you want to terminate it, you can post a dummy APC to it. That thread will then resume from wait status, execute this dummy APC. You can do graceful termination actions after that. [more detail can be found in chapter 10, "Windows via C/C++"]

2. Alertable I/O and Overlapped I/O are old techniques, use I/O Completion Port as much as you can, it's easy to use and also very powerful.

[Reference]

- General
1. Wikipedia explanation: Async vs Sync, Async
2. Asynchronous IO, http://en.wikipedia.org/wiki/Asynchronous_I/O
3. C10K Problem, http://www.kegel.com/c10k.html
4. MPI Async I/O, http://beige.ucs.indiana.edu/I590/node109.html
5. Sync/Async by Oracle GURU, http://www.ixora.com.au/notes/asynchronous_io.htm

- Windows
6. Mulithreaded Async IO V.S. IO Completion Port
7. IO Completion Port (IOCP Introduction, Inside IOCP)
8. Callback supports by Windows Kernel
9. Tips for Async I/O and IOCP
10. Async vs Sync I/O on Windows

- Linux
11. Async IO on Linux from Lighttpd
12. Ideas for Async-IO model from Squid
13. Kernel AIO support for Linux
14. Linux AIO Programming Introduction
15. Linux AIO Design Notes
16. epoll performance analysis
17. select, poll, epoll comparison
18. I/O Event Handling Under Linux

- .Net/Java
19. Asynchronous Programming Design Patterns @ MSDN
20. .Net Asynchronous File I/O
21. Call Sync Method in Async Way on .Net
22. Java Async I/O design notes

- Talk
23. http://www.slideshare.net/Arbow/asynchronous-io-programming
[This Presentation describes many reasonings and insights behind Async IO]
24. http://www.slideshare.net/directi/async-io-and-multithreading-explained
[Intresting and Intuitve explanation on Threading and Async IO]
Time and Order of Events in Distributed System
1. The Need for Logical Clock

One of the challenges in distributed system is the lack of global time clocks, it's very hard to timestamp events is different processes and order them globally.

To solve the "Time & Order of Events in Distributed System" problem, let's rethink what "Time" and "Order" means - essentially, Time is a property of events that is used to Order or Sequence them.

In distributed system, many events occur in different process independently, we can't and no need to find a deterministic order between two independent events. But are there dependent/related events? Yes, events are related when one causally affects another, for example through communication between processes.

Causal action determines order among events, and we must use some form of information to represent this kind of orders. This kind of information catches the causal relationship of events but don't need a real time/clock, it is called Logical Time or Logical Clock.

2. How to implement Logical Clock?

What we have in hand?
- Ordered Evends within a Process
- Causual Relationship between some Crosss-Process Events

With these information, we can define a relation ship called "Happend Before"
- 1. if Event E1 and E2 are from the same process, and E1 comes before E2, then E1 Happened Before E2
- 2. if E1 is sending of a message and E2 is receiving of the same message, then E1 Happened Before E2
- 3. if E1 Happened Before E2, and E2 Happened Before E3, then E1 Happened Before E3.

In essence, we have a partial order among the events in distributed system.

2.1 Lamport Clock

Lamport Clock is a mechanism that can extend a partial order into a total order that is consistent with the original partial order.

The algorithm is:
  1. Each process set a initial clock value(arbitary) on start up;
  2. A process increments its counter between two consecutive events in that process; (Send message and Receive message are all events)
  3. When a process sends a message(after doing step 2), it includes its counter value with the message;
  4. On receiving a message(after doing step 2), the receiver process sets its counter to be greater than the clock value in received message AND greater than or equal to its present clock value;
  5. If two events have the same clock value, use its associated process ID to determine the order between them.
After applying this algorithm, we can get a total order (each event pair can be compared in terms of order) of all the events in a distributed system.

The invariant of Lamport clock algorithm is that: if event E1 is Happened Before event E2, then Clock(E1) > Clock(E2).

But the drawback of Lamport Clock is that: if Clock(E1) > Clock(E2), we don't know whether E1 is happened before E2 or E2 is happened before E1. The root cause of this problem is that, Lamport Clock lost some information during the extending algorithm. In another word, the resulting total order is just one of those valid total orders.

Sometimes, we need to keep the whole partial order information. To this end, people invented so called Vector Clock.

2.2 Vector Clock

The main purpose of vector clock mechanism is to retain the complete partial order information(I.E. all possible total orders, not just one of them, as in Lamport Clock) in a logical clock system.

The basic observation is that, in Lamport Clock algorithm, we only retain the information that a receiving message event is causally afftected by the sending message event, but some information about the fact that the receiving message event is NOT causally impacted by some events in other processes is lost.

To fix this problem, we should not only use the clock value of the sending process, but also the clock value from other processes to identify all causal events from all processes in the whole distributed system.

So the basic idea is:
- Use a vector to represent event time/clock
- Each element stores the clock value for one process

Combining the upper idea and Lamport Clock algorithm, we can design a new algorithm to produce vector clock value for each events in the whole distributed system:
  1. Initially all clock values are set to zero;
  2. Before a process experiences an internal event, it increments its own clock value in the vector by at least one.
  3. Each time a process prepares to send a message, it conducts Step 2 and then sends its entire vector along with the message being sent.
  4. Each time a process receives a message, it conducts Step 2 and updates each element in its vector by (suppose the sending process is Ps):
   If local_vector[Ps] <= other_vector[Ps] then
    local_vector[Ps] = other_vector[Ps] + 1;
   for each element i other than Ps do
    local_vector[i] = MAX(local_vector[i], other_vector[i]);

From the algorithm we can see that, an element of the vector clock of an event means that this event only causally dependents on the events happened before that time in the corresponding process.

How to use Vector Clock obtained by the upper algorithm to infer event order?
- Let vClock(x) denote the vector clock of event x
- VC(x) < VC(y) \iff \forall z [VC(x)_z \le VC(y)_z] \and \exists z' [ VC(x)_{z'} < VC(y)_{z'} ]
[ In English: vClock(x) is less than vClock(y) if and only if at least one element in vClock(x) is strictly less that that of vClock(y), and other elements are less than or equal to those in vClock(y). ]
- X Happened Before Y <=> vClock(X) < vClock(Y)

In a word, X Happened Before Y if and only if at least one element in vClock(x) is strictly less that that of vClock(y), and other elements are less than or equal to those in vClock(y).

[Reference]


1. Order
 - Partial Order
 - Total Order

2. Logical Clock
 - Lamport Clock
 - Vector Clock

3. Time, Clock and Ordering of Events in a Distributed System by Leslie Lamport (1978)
4. Timestamps in message-passing systems that preserve the partial ordering by Colin J. Fidge (1988)
Memory Management in Native Code

Memory management is a core task in native world, careless usage of dynamic memory may cause the following problems:

- 1. Heap Fragment, this will introduce performance penalty since it breaks data locality
- 2. Memory Leak, it's a prgm correctness problem and a horrible defect for long-run software

Here I summarized some tips related to the two issues: Mem Optimization & Mem Correctness


Part I - Mem Optimization

General Principles
1. Ensure mem layout cohesion (aka. improve data locality)
2. Avoid frequent alloc/free (aka. batch mem ops, prefer few & bulk over large & small mem ops)

How to implement them?
- Redesign your data structure to make them live in large blocks
- Make unrelated data structure in different region
- Use memory pool to manage mem


Part II - Mem Correctness

One of the challenges when doing native code development is avoiding memory leak. It's so easy (also difficult to avoid) to forget releasing each memory block/object that has been allocated explicitly.

Types of Mem Leaks
1. Constant Leak, allocated mem are totally forgotten to release
2. Casual Leak, allocated mem are not release under some conditions
3. One-Time Leak , allocated mem is not released but that line of code only get executed once (for example, mem allocated in ctor of singleton objec)
4. Implicit Leak, mem blocks are hold too long (released too late in application life cycle, this kind of mem leak happens even in GC enabled language such as Java/.Net, for example, unused objects are still reachable through Root Set in GC)

To deal with Mem Leak problem, you have two choices:
- Avoid it
- Detect & Fix it

Sec. I - How to Avoid Mem Leak?

1. Adopt Resource Acquisition Is Initialization (RAII) Mechanism in C++

std::auto_ptr is a good choice if RAII is semantic right for your problem. If you want to ensure your object/mem get released whenever the control goes out of some scope (for example, multiple exit path, potential exception etc.), RAII can be used to solve your problem.

But it can't be passed as return value, can't be put in STL containers.

2. stack based allocation

_alloca() will allocated mem from stack rather than heap. The mem returned will be released when function returns.

But there is potential stack overflow exceptions, since stack is far more smaller than heap.

3. Reference Counting (aka, share/smart pointer)

Use some data structure to track how many owners are referencing the mem block or objects. When reference counting is zero, the mem/object is released.

std::tr1::shared_ptr and boost::shared_ptr are all based on Reference Counting and RAII concepts. They resolved the problem of not being able to be put in container, can't be passed as parameter and return value etc.

But, if your objects have cyclic reference, this mechanism doesn't work. The fundamental problem behind is that - the semantic of "useless", should be defined as "Can't Be Reached", not "No One References".

Another draw back of smart pointer style reference counting is that, it can't handle pointers that should be put into a union structure. Because union can't consists of any member fields that has user defined ctor/non-trivial default ctor/dtor/copy ctor etc.

4. Garbage Collection (yes, gc for C/C++)

Most modern GC uses Mark & Sweep algorithm to implement GC. The idea behind is that, GC has pointer list for all heap-allocated objects and a Root Set object pointer list. When garbage collection is triggered, it traverse from root set to find all reachable objects and mark them. Those unmarked objects are garbage that can be deleted. GC for C/C++ is a huge topic, [7] is a very good reference doc.

General purpose GC for C/C++ is difficult, but for your specific application requirement, it may not that challenge. According to my own GC implementation experience, the most difficult part is define your object ownership policy.

GC is great, but it still can't handle some "semantic garbage objects". That is to say, if you hold references to some objects that are in fact you will never use again, GC won't collect mem and other resources owned by these objects.

Essentially, Memory Management is all about Consistency of Ownership.
- Each mem block should have an owner
- Each mem block should have only 1 owner
- Only the owner of the mem block is responsible for its life cycle

So, the most important design principle about C/C++ memory management is - consider carefully about the ownership of an object/mem block: when and who should be responsible for releasing it.

Sec. 2 - How to Detect Mem Leak

1. Use Debug Version C RunTime library

1.1 Use _CrtDumpMemoryLeaks()

Step 1. include the following directives into each cpp source file
#define _CRTDBG_MAP_ALLOC
#include "crtdbg.h"
#include "stdlib.h"

Step 2. call _CrtDumpMemoryLeaks() at the line where you want to check memory leaks.

This method has a drawback that mem objects that are released after _CrtDumpMemoryLeaks() invocation will be treated as leaked mem. (This happens when mem is released in global object's dtor) It's a false negative.

1. 2 Use _CrtSetDbgFlag()

Add the following code at the entry point of your application

int nFlag = _CrtSetDbgFlag( _CRTDBG_REPORT_FLAG );
nFlag |= _CRTDBG_LEAK_CHECK_DF;
_CrtSetDbgFlag( nFlag );

This method don't have the drawback of 1.1, but you have no control when the mem leak action performs.

1. 3 Use CrtMemState

_CrtMemState cms1, cms2, cms3;
_CrtMemCheckpoint(&cms1);

/* code to check */

_CrtMemCheckpoint(&cms2);
if(_CrtMemDifference(&cms3, &cms1, &cms2))
{
_CrtMemDumpStatistics(&cms3);
}

This code will dump heap statistics info about the changes happened in the "code to check".

_CrtSetReportMode() can be used to control where to output these diagnose information.

_crtBreakAlloc / {,,msvcrtd.dll}_crtBreakAlloc / _CrtSetBreakAlloc() can be used to control the debug break condition.

More info on CRT mem debug routines, please see reference [1] and [2]

2. Monitor Process Working Set

The Win32 API GetProcessMemoryInfo() can query process working set size. You can use this api to check whether the working set size is changed after calling some suspicious functions.

It can't tell you where mem leak happens, but it a good way to write unit test to track mem leak problems.

3. Use Professional Tools

BoundsChecker
IBM Purify
LeakTracer(Linux)
Windows Leaks Detector
Valgrind(Linux)
MemWatch
Insure++
Visual Leak Detector
User Mode Dump Heap

Part III - Other Tips

1. use "#define SAFE_DELETE(ptr) if (ptr != NULL) { delete ptr; ptr = NULL; }" to avoid redeleting the same object.
2. remember to delete objects pointed by elements in container that contains pointers.
3. pair delete/new delete[]/new[] malloc/free correctly.
4. use "new (std::nothrow)" to eliminate exceptions raising in low mem situation.


NOTE: (Lessons learned from topic investigating)

When solving hard problems, Be Sure To:
1. Use well-known idioms and well-understood mechanisms
2. Keep things as simple as possible

Memory Management:
1. It's another subsystem/component of your whole system
2. Design this component with care
3. Avoid using new/delete directly

Techniques discussed here apply to not only memory blocks, but also any type of "resource" that needs explicit requesting/releasing.

[Reference]

Mem Leak
1. Mem Leak Detection http://www.ddj.com/cpp/204800654
2. Microsoft CRT Debug Routines http://msdn.microsoft.com/en-us/library/1666sb98(VS.71).aspx
3. Microsoft CRT Debug Tech http://msdn.microsoft.com/en-us/library/zh712wwf.aspx
4. Mem Debugger List http://en.wikipedia.org/wiki/Memory_debugger
5. Purify from IBM - Use Purify for C code
6. Mem Leak in Java/.Net http://www.agiledeveloper.com/articles/MemoryLeak092002.pdf

Garbage Collection
7. C/C++ GC from HP http://www.hpl.hp.com/personal/Hans_Boehm/gc/
8. GC for C/C++ http://blog.codingnow.com/2008/06/gc_for_c.html
9. How .Net GC works, GC in OO Language, Auto GC

Understanding Mem Management
10. C++ Mem Management http://www.slideshare.net/reachanil/c-memory-management
11. Inside Mem Management http://www.ibm.com/developerworks/linux/library/l-memory/
12. C++ Memory Management: From Fear to Triumph (Part 1, Part 2, Part 3)
13. http://www.cantrip.org/wave12.html
14. Mem Optimization http://www.codingnow.com/2008/memory_management.ppt
15. Mem Mgmt 4 Sys Coder http://www.enderunix.org/simsek/articles/memory.pdf

Misc
16. C++ smart pointers http://www.onlamp.com/lpt/a/6559
17. Mem Leak Definition
18. Is Mem Leak Ever OK?

Windows, Unix and ANSI C API Comparison
  The "Windows System Programming"(3E) has a great appendix about Windows, Unix, C Library API comparison. It's obvious that this appendix lacks of many APIs in Memory, DLL and Security related areas, but it is still very helpful for referencing when doing system programming on windows platform.

  Original content can be found here: http://my.safaribooksonline.com/0321256190/app02

1. I/O

1. 1 File & Directory


SubjectWindowsUNIXC LibraryComments
Console I/OAllocConsoleterminal I/ON/A
Console I/OFreeConsoleterminal I/ON/A
Console I/OReadConsolereadgetc, scanf, gets
Console I/OSetConsoleModeioctlN/A
Console I/OWriteConsolewriteputc, printf, puts
Directory MgtCreateDirectorymkdir*N/AMake a new directory
Directory MgtFindCloseclosedir*N/AClose a directory search handle
Directory MgtFindFirstFileopendir*, readdir*N/AFind first file matching a pattern
Directory MgtFindNextFilereaddir*N/AFind subsequent files
Directory MgtGetCurrentDirectorygetcwd*N/A
Directory MgtGetFullPathNameN/AN/A
Directory MgtGetSystemDirectoryWell-known pathnamesN/A
Directory MgtRemoveDirectoryrmdir, unlink*remove
Directory MgtSearchPathUse opendir, readdirN/ASearch for a file on a specified path
Directory MgtSetCurrentDirectorychdir*, fchdirN/AChange the working directory
Error HandlingFormatMessagestrerrorperror
Error HandlingGetLastErrorerrnoerrnoGlobal variable
Error HandlingSetLastErrorerrnoerrnoGlobal variable
File LockingLockFilefcntl (cmd=F_GETLK, ..)N/A
File LockingLockFileExfcntl (cmd=F_GETLK, ..)N/A
File LockingUnlockFilefcntl (cmd=F_GETLK, ..)N/A
File LockingUnlockFileExfcntl (cmd=F_GETLK, ..)N/A
File SystemCloseHandle (file handle)close*fcloseCloseHandle is not limited to files
File SystemCopyFileopen; read; write; closefopen; fread; fwrite; fcloseDuplicate a file
File SystemCreateFileopen*, creat*fopenOpen/create a file
File SystemDeleteFileunlink*removeDelete a file
File SystemFlushFileBuffersfsynchfflushWrite file buffers
File SystemGetFileAttributesstat*, fstat*, lstatN/A
File SystemGetFileInformationByHandlestat*, fstat*, lstatN/AFill structure with file info
File SystemGetFileSizestat*, fstat*, lstatftell, fseekGet length of file in bytes
File SystemGetFileTimestat*, fstat*, lstatN/A
File SystemGetFileTypestat*, fstat*, lstatN/ACheck for character stream device or file
File SystemGetStdHandleUse file desc 0, 1, or 2Use stdin, stdout, stderr
File SystemGetTempFileNameUse C librarytmpnamCreate a unique file name
File SystemGetTempFileName, CreateFileUse C librarytmpfileCreate a temporary file
File SystemGetTempPath/temp pathN/ADirectory for temp files
File SystemMoveFile, MoveFileExUse C libraryrenameRename a file or directory
File SystemCreateHardLinklink, unlink*N/AWindows does not support links
File SystemN/AsymlinkN/ACreate a symbolic link
File SystemN/AreadlinkN/ARead name in a symbolic link
File SystemN/A, ReadFile returns 0 bytesN/A, read returns 0 bytesfeofRest for end of file
File SystemN/A, use multiple ReadFilesreadvN/A, use multiple freadsScatter read
File SystemN/A, use multiple WriteFileswritevN/A, use multiple fwritesGather write
File SystemReadFilereadfreadRead data from a file
File SystemSetEndOfFile chsize*N/A
File SystemSetFileAttributesfcntlN/A
File SystemSetFilePointerlseekfseekSet file pointer
FileSystemSetFilePointer (to 0)lseek (0)rewind
File SystemSetFileTimeutime*N/A
File SystemSetStdHandleclose, dup*, dup2*, or fcntlfreopendup2 or fcntl
File SystemWriteFilewritefwriteWrite data to a file

1.2 Async I/O

SubjectWindowsUNIXC LibraryComments
Asynch I/OGetOverlappedResultN/AN/A
Asynch I/OReadFileExN/AN/AExtended I/O with completion routine
Asynch I/OSleepExN/AN/AAlertable wait
Asynch I/OWaitForMultipleObjects (file handles)poll, selectN/A
Asynch I/OWaitForMultipleObjectsExN/AN/AAlertable wait
Asynch I/OWriteFileExN/AN/AExtended I/O with completion routine
Asynch I/OWaitForSingleObjectExwaitpidN/AAlertable wait

2. Memory & DLL

SubjectWindowsUNIXC Library
Mapped FilesCreateFileMappingshmgetN/A
Mapped FilesMapViewOfFilemmap, shmatN/A
Mapped FilesMapViewOfFileExmmap, shmatN/A
Mapped FilesOpenFileMappingshmgetN/A
Mapped FilesUnmapViewOfFilemunmap, shmdt, shmctlN/A
Memory MgtGetProcessHeapN/AN/A
Memory MgtGetSystemInfoN/AN/A
Memory MgtHeapAllocsbrk, brk, or C librarymalloc, calloc
Memory MgtHeapCreateN/AN/A
Memory MgtHeapDestroyN/AN/A
Memory MgtHeapFreeUse C libraryfree
Memory MgtHeapReAllocUse C libraryrealloc
Memory MgtHeapSizeN/AN/A
Shared MemoryCloseHandle (map handle)shmctlN/A
Shared MemoryCreateFileMapping, OpenFileMappingshmgetN/A
Shared MemoryMapViewOfFileshmatN/A
Shared MemoryUnmapViewOfFileshmdtN/A
DLLsLoadLibrarydlopenN/A
DLLsFreeLibrarydlcloseN/A
DLLsGetProcAddressdlsynN/A
DLLsDllMainpthread_onceN/A

3. Process & Thread

3.1 Process


SubjectWindowsUNIXC LibraryComments
Process MgtCreateProcessfork (); execl ()*, system()N/AThere are 6 execxx functions
Process MgtExitProcess_exitexit
Process MgtGetCommandLineargv []argv []
Process MgtGetCurrentProcessgetpid*N/A
Process MgtGetCurrentProcessIdgetpid*N/A
Process MgtGetEnvironmentStringsN/Agetenv
Process MgtGetEnvironmentVariableN/Agetenv
Process MgtGetExitCodeProcesswait, waitpidN/A
Process MgtGetProcessTimestimes, wait3, wait4N/A
Process MgtGetProcessWorkingSetSizewait3, wait4N/A
Process MgtN/Aexecl*, execv*, execle*, execve*, execlp*, execvp*N/AWindows does not have a direct equivalent
Process MgtN/Afork, vforkN/AWindows does not have a direct equivalent
Process MgtN/AgetppidN/ANo parent/child relationships in Windows
Process MgtN/Agetgid, getegidN/ANo process groups in Windows
Process MgtN/AgetpgrpN/A
Process MgtN/AsetpgidN/A
Process MgtN/AsetsidN/A
Process MgtN/AtcgetpgrpN/A
Process MgtN/AtcsetpgrpN/A
Process MgtOpenProcessN/AN/A
Process MgtSetEnvironmentVariableputenvN/Aputenv is not part of the Standard C library
Process MgtTerminateProcesskillN/A
Synch: ProcessWaitForMultipleObjects (process handles)waitpidN/A
Synch: ProcessWaitForSingleObject (process handle)wait, waitpidN/A
TimersKillTimeralarm (0)N/A
TimersSetTimeralarmN/A
TimersSleepsleepN/A
TimersSleeppoll or select, no file descriptorN/A

3.2 Thread

SubjectWindowsUNIX/PthreadsComments
Thread MgtCreateRemoteThreadN/A
TLSTlsAllocpthread_key_alloc
TLSTlsFreepthread_key_delete
TLSTlsGetValuepthread_getspecific
TLSTlsSetValuepthread_setspecific
Thread MgtCreateThread, _beginthreadexpthread_create
Thread MgtExitThread, _endthreadexpthread_exit
Thread MgtGetCurrentThreadpthread_self
Thread MgtGetCurrentThreadIdN/A
Thread MgtGetExitCodeThreadpthread_yield
Thread MgtResumeThreadN/A
Thread MgtSuspendThreadN/A
Thread MgtTerminateThreadpthread_cancelpthread_cancel is safer
Thread MgtWaitForSingleObject(thread handle)pthread_join
Thread PriorityGetPriorityClasspthread_attr_getschedpolicy, getpriority
Thread PriorityGetThreadPrioritypthread_attr_getschedparam
Thread PrioritySetPriorityClasspthread_attr_setschedpolicy, setpriority, nice
Thread PrioritySetThreadPrioritypthread_attr_setschedparam
Note: Pthreads, while a part of all modern UNIX offerings, are available on non-UNIX systems as well.

3.3 Synchronization

SubjectWindowsUNIX/PthreadsComments
Synch: CritSecDeleteCriticalSectionUse mutexes to emulate critical sections. Some systems provide proprietary equivalents.C library is not applicable
Synch: CritSecEnterCriticalSectionC library is not applicable
Synch: CritSecInitializeCriticalSection
Synch: CritSecLeaveCriticalSection
Synch: EventCloseHandle (event handle)pthread_cond_destroy
Synch: EventCreateEventpthread_cond_init
Synch: EventPulseEventpthread_cond_signalManual-reset event
Synch: EventResetEventN/A
Synch: EventSetEventpthread_cond_broadcastAuto-reset event
Synch: EventWaitForSingleObject (event handle)pthread_cond_wait
Synch: EventWaitForSingleObject (event handle)pthread_timed_wait
Synch: MutexCloseHandle (mutex handle)pthread_mutex_destroy
Synch: MutexCreateMutexpthread_mutex_init
Synch: MutexReleaseMutexpthread_mutex_unlock
Synch: MutexWaitForSingleObject (mutex handle)pthread_mutex_lock
Synch: SemCreateSemaphoresemget
Synch: SemN/AsemctlWindows does not directly support all these options
Synch: SemOpenSemaphoresemget
Synch: SemReleaseSemaphoresemop (+)
Synch: SemWaitForSingleObject (semaphore handle)semop (-)Windows can wait for only one count

3.4 IPC

SubjectWindowsUNIXC LibraryComments
IPCCallNamedPipeN/AN/ACreateFile, WriteFile, ReadFile, CloseHandle
IPCCloseHandle (pipe handle)close, msgctlpcloseNot part of the Standard C library—see Stevens
IPCConnectNamedPipeN/AN/A
IPCCreateMailslotN/AN/A
IPCCreateNamedPipemkfifo, msggetN/A
IPCCreatePipepipepopenNot part of the Standard C library—see Stevens
IPCDuplicateHandledup, dup2, or fcntlN/AOr use file names CONIN$, CONOUT$
IPCGetNamedPipeHandleStatestat, fstat, lstat64N/A
IPCGetNamedPipeInfostat, fstat, lstatN/A
IPCImpersonateNamedPipeClientN/AN/A
IPCPeekNamedPipeN/AN/A
IPCReadFile (named pipe handle)read (fifo), msgsndN/A
IPCRevertToSelfN/AN/A
IPCSetNamedPipeHandleStateN/AN/A
IPCTransactNamedPipeN/AN/AWriteFile; ReadFile
IPCWriteFile (named pipe handle)write (fifo), msgrcvN/A
Misc.GetComputerNameunameN/A
Misc.SetComputerNameN/AN/A
SecuritySetNamedPipeIdentityUse directory sticky bitN/A

4. Misc

4.1 Security

SubjectWindowsUNIXComments
SecurityAddAccessAllowedAcechmod, fchmodC library does not support security

SecurityAddAccessDeniedAcechmod, fchmod
SecurityAddAuditAceN/A
SecurityCreatePrivateObjectSecurityN/A
SecurityDeleteAcechmod, fchmod
SecurityDestroyPrivateObjectSecurityN/A
SecurityGetAcestat*, fstat*, lstat
SecurityGetAclInformationstat*, fstat*, lstat
SecurityGetFileSecuritystat*, fstat*, lstat
SecurityGetPrivateObjectSecurityN/A
SecurityGetSecurityDescriptorDaclstat*, fstat*, lstat
SecurityGetUserNamegetlogin
SecurityInitializeAclN/A
SecurityInitializeSecurityDescriptorUmask
SecurityLookupAccountNamegetpwnam, getgrnam
SecurityLookupAccountSidgetpwuid, getuid, geteuid
SecurityN/Agetpwend, setpwent, endpwent
SecurityN/Agetgrent, setgrent, endgrent
SecurityN/ASetuid, seteuid, setreuid
SecurityN/ASetgid, setegid, setregid
SecurityOpenProcessTokengetgroups, setgroups, initgroups
SecuritySetFileSecuritychmod*, fchmod
SecuritySetPrivateObjectSecurityN/A
SecuritySetSecurityDescriptorDaclUmask
SecuritySetSecurityDescriptorGroupchown, fchown, lchown
SecuritySetSecurityDescriptorOwnerchown, fchown, lchown
SecuritySetSecurityDescriptorSaclN/A

4.2 Exception Handling

SubjectWindowsUNIXC Library
SEH_try – _exceptUse C library signalsUse C library signals
SEH_try – _finallyUse C library signalsUse C library signals
SEHAbnormalTerminationUse C library signalsUse C library signals
SEHGetExceptionCodeUse C library signalsUse C library signals
SEHRaiseExceptionUse C library signalssignal, raise
SignalsUse _finally blockUse C libraryatexit
SignalsUse C library or terminate processkillraise
SignalsUse C libraryUse C librarysignal
SignalsUse SEH, VEHsigemptysetN/A
SignalsUse SEH, VEHsigfillsetN/A
SignalsUse SEH, VEHsigaddsetN/A
SignalsUse SEH, VEHsigdelsetN/A
SignalsUse SEH, VEHsigismemberN/A
SignalsUse SEH, VEHsigprocmaskN/A
SignalsUse SEH, VEHsigpendingN/A
SignalsUse SEH, VEHsigactionN/A
SignalsUse SEH, VEHsigsetjmpN/A
SignalsUse SEH, VEHsiglongjmpN/A
SignalsUse SEH, VEHsigsuspendfN/A
SignalsUse SEH, VEHpsignalN/A
SignalsUse SEH, VEH, or C libraryUse C libraryabort
Note: Many UNIX vendors provide proprietary exception handling capabilities.

4.3 System Information & Time

SubjectWindows
UNIXC LibraryComments
System InfoGetDiskFreeSpaceN/AN/A
System InfoGetSystemInfogetrusageN/A
System InfoGetVersionunameN/A
System InfoGetVolumeInformationN/AN/A
System InfoGlobalMemoryStatusgetrlimitN/A
System InfoVarious defined constantssysconf, pathconf, fpathconfN/A
TimeGetSystemTimeUse C librarytime, gmtime
TimeSee ls program,
Use C libraryasctime
TimeCompareFileTimeUse C librarydifftimeCompare "calendar" times
TimeFileTimeToLocalFileTime, FileTimeToSystemTimeUse C librarylocaltime
TimeFileTimeToSystemTimeUse C librarygmtime
TimeGetLocalTimeUse C librarytime, localtime
TimeSee touch program,
Use C librarystrftime
TimeSetLocalTimeN/AN/A
TimeSetSystemTimeN/AN/A
TimeSubtract file timesUse C librarydifftime
TimeSystemTimeToFileTimeUse C librarymktime
SOA programming model on Windows Hpc Server

Conventionally, HPC/Parallel problems can be roughly divided into the following three categories[1][2]:

- Embarrassingly Parallel, for these applications, little or no effort is required to separate the problem into a number of small tasks that runs on one CPU core. No or very lightweight post processing is needed.

- Parametric/Data Parallel, these applications divides the input data into a number of completely independent parts. The same computation is undertaken on each part. And some kind of post processing after the computations is needed.

- Message Passing, these are those jobs that there are dependencies among various tasks and there are communications among these tasks. These jobs are not easy to be parallelized.

On Windows Hpc Server 2008, programming model for Embarrassingly Parallel(especially web based) applications is referenced as SOA Model, in which, client send requests to service broker, and service broker forward these requests to service instances. Service instance never talks to each other, and communication among these components are all service oriented(more specifically, WCF based).

[SOA Programming Model Workflow, from Microsoft]

[3] is a detailed documentation on this topic, but in this article, I will demo a SOA based PI(3.1415926535...) value calculation application using Monte Carlo method[6] in a real Windows Hpc Cluster.

When I say "real windows hpc cluster", I mean:
1. The cluster has multiple nodes(6 nodes: 1 head, 1 broker, 4 compute).
2. This cluster has dedicated AD/Network, which is totally different from client/dev machines and network env.
3. You use some server called "Boundary Server" to access the Hpc Cluster. The boundary server has NICs to connect with both cluster private network and corp/enterprise network.
4. You dev/debug your application on boundary server, not on cluster head node.(let head node focus on job requests serving)
5. Your corp network domain account is different from Hpc Cluster private network domain account.
6. In the following sections, I assume the cluster environment is already correctly set up.
[These environment assumptions are more complex than those in [3], but they are more similar to real production env.]

The Monte Carlo PI value calculation contains two parts - the server part and client part.

The server part is a pure WCF service
- you define interface/contract and implement the interface. The core logic is listed below:
1. it's a .net/c# class library project, and it is a typical WCF service application.
2. use local machine IP and current date time to hash out a random seed number. This will make the whole process more random.
3. use .NET build-in random generator to generate a serial random number to do many independent Monte Carlo experiments. The idea is generating a ranged random point and see whether it is located within a circular area with some fixed diameter.
PiCalcServer Core Logic
 1 public PiCalcResult Calc(UInt64 scale)
 2 {
 3     // use system time and machine ip to hash out the seed for random number generation
 4     long ticks = DateTime.Now.Ticks;
 5     IPHostEntry host = Dns.GetHostEntry(Dns.GetHostName());
 6     ticks += host.AddressList[0].GetAddressBytes()[0];
 7     ticks += host.AddressList[0].GetAddressBytes()[1] * 256;
 8     ticks += host.AddressList[0].GetAddressBytes()[2] * 256 * 256;
 9     ticks += host.AddressList[0].GetAddressBytes()[3] * 256 * 256 * 256;
10     Random rand = new Random((int)ticks);
11
12     // result init
13     PiCalcResult calcResult = new PiCalcResult();
14     calcResult.InCount = 0;
15     calcResult.OutCount = 0;
16
17     // do Monte Carlo exercise
18     Int32 x = 0, y = 0;
19     for (UInt64 i = 0; i < scale; ++i)
20     {
21         x = rand.Next(RAND_RANGE_MIN, RAND_RANGE_MAX);
22         y = rand.Next(RAND_RANGE_MIN, RAND_RANGE_MAX);
23
24         UInt64 d = (UInt64)Math.Round(Math.Sqrt((double)x * (double)x + (double)y * (double)y));
25         if (d <= RAND_RANGE_MAX)
26         {
27             calcResult.InCount++;
28         }
29         else
30         {
31             calcResult.OutCount++;
32         }
33     }
34
35     return calcResult;
36 }


After the WCF service implementation, you should deploy it to the Windows Hpc Cluster. This includes two steps:
1. Compose a service configuration file.(The PiCalcService.config file is contained in the source code package, see [3] for detailed fields explanation)
  <microsoft.Hpc.Session.ServiceRegistration>
    <service assembly="%CCP_HOME%App\PiCalcServer.dll"
             contract="PiCalcServer.IPiCalcServer"
             type="PiCalcServer.PiCalcServer"
             architecture="x86">
      <environmentVariables>
        <add name="PATH" value="%MY_SERVICES_HOME%Bin"/>
      </environmentVariables>
    </service>
  </microsoft.Hpc.Session.ServiceRegistration>

2. Copy Bin/Conf files to each compute node
clusrun xcopy /y \\FileServer\PrjDir\PiCalcServer.dll "c:\Program Files\Microsoft HPC Pack\App"
clusrun xcopy /y \\FileServer\PrjDir\PiCalcServer.Config "c:\Program Files\Microsoft HPC Pack\ServiceRegistration"

To see whether the deployment is successful:
1. go to StartMenu -> Hpc Pack -> Hpc Cluster Manager -> Diagnostics -> Tests -> SOA -> SOA Service Configuration Report and run this test.
2. Diagnostics -> Test Results. It will list the detailed results of test in step 1. If your deployment is successful, the report will tell you the service name/bin/interface/implementation and target arch.

The other part of the solution is the client application.
- It's both WCF client application and Hpc cluster application:
1. It's a normal .Net/C# console/winform application, which will call remote WCF service and Hpc scheduler service.
2. As normal WCF client application, you should use svcutil tool to generate the wcf client proxy class(async style is used here) and add it to your client application project.
svcutil PiCalcServer.dll
svcutil *.wsdl *.xsd /async /language:C# /out:PiCalcServerProxy.cs

3. When developing SOA application, you should create session with Hpc cluster, get the SOA broker service endpoint from the session and call WCF service from this endpoint.
4. Your whole client logic looks like: create session, divide computation task, send requests, collect the partial results from various sub-tasks and compute the final result.
PiCalcClient Core Logic
 1 //
 2 // Create a session object that specifies the head node to which to connect and the name of
 3 // the WCF service to use.
 4 //
 5 SessionStartInfo ssInfo = new SessionStartInfo(schedulerHost, serviceName);
 6 ssInfo.Username = clusterHeadUser;
 7 ssInfo.Password = clusterHeadPassword;
 8 ssInfo.ResourceUnitType = Microsoft.Hpc.Scheduler.Properties.JobUnitType.Core;
 9 ssInfo.MinimumUnits = 2;
10 ssInfo.MaximumUnits = 1000;
11
12 Console.WriteLine("Creating a session ...");
13 using (Session session = Session.CreateSession(ssInfo))
14 {
15     Console.WriteLine("Session creation done!");
16     Console.WriteLine("Session's Endpoint Reference:{0}", session.EndpointReference.ToString());
17     int nodes = session.ServiceJob.AllocatedNodes.Count;
18
19     //
20     // Binds session to the client proxy using NetTcp binding (specify only NetTcp binding). The
21     // security mode must be Transport and you cannot enable reliable sessions.
22     //
23     System.ServiceModel.Channels.Binding myTcpBinding = new NetTcpBinding(SecurityMode.Transport, false);
24     myTcpBinding.ReceiveTimeout = maxTimeOut;
25     myTcpBinding.SendTimeout = maxTimeOut;
26     PiCalcServerClient calcServerClient = new PiCalcServerClient(myTcpBinding, session.EndpointReference);
27     calcServerClient.ClientCredentials.Windows.ClientCredential.UserName = wcfClientUser;
28     calcServerClient.ClientCredentials.Windows.ClientCredential.Password = wcfClientPassword;
29
30     //
31     //  There is no way to get the accurate allocated core count, just assume each node has avgCoresPerNode cores.
32     //
33     timeBegin = DateTime.Now;
34     int taskCount = session.ServiceJob.AllocatedNodes.Count * avgCoresPerNode;
35     asyncCalcCount = taskCount;
36     for (int i = 0; i < taskCount; i++)
37     {
38         UInt64 scale = totalScale / (UInt64)taskCount;
39         calcServerClient.BeginCalc(scale, AsyncCalcCallback, new CalcReqContext(calcServerClient, i));
40     }
41     asyncCalcDone.WaitOne();
42     Console.WriteLine("All sub tasks done!");
43     timeEnd = DateTime.Now;
44
45     calcServerClient.Close();
46     Console.WriteLine("========================================");
47     Console.WriteLine("totalIn:{0}, totalOut:{1}", totalIn, totalOut);
48     Console.WriteLine("the mc pi value:{0}", (totalIn + 0.0) / (totalScale + 0.0) * 4);
49     Console.WriteLine("the total time used:{0}", (timeEnd.Value.ToFileTime() - timeBegin.Value.ToFileTime()) / (10 * 1000 * 1000));
50     Console.WriteLine("Please enter any key to continue...");
51     Console.ReadLine();
52 }

NOTE:
1. this is just the core logic, for full code, see the source code package.
2. the serviceName is defined as the service configuration file name without the .config extension. schedulerHost is defined as the machine name of the head node of the Hpc cluster.
3. clusterHeadUser/clusterHeadPassword is used to login to head node to submit jobs, while wcfClientUser/wcfClientPassword is used to login to compute node to access WCF services, both of them should be explicitly set in a real cluster environment. The two account are usually the same in most cluster environments, but not the same as your domain account that is used to login corp network.
4. if "Can't find file - Microsoft.Hpc.Sheduler.Store.dll" exception raised when running the client, install Windows HPC Pack Client Utilities. Only Windows HPC Pack SDK is not enough for developing/running Hpc applications.
5. it takes some time(about 1 minute in my env) to establish session with Hpc cluster.
6. you can see the job status, node head map etc in Hpc Cluster Manager while the application is running. The cluster manager is also helpful for investigation when error encountered.

Typical Client Application Console Output
Creating a session ...
Session creation done!
Session's Endpoint Reference:net.tcp://dit840-013:9087/broker/206
Sub Task[3] Done = In:263541326, Out:72002994
Sub Task[2] Done = In:263543455, Out:72000865
......
......
Sub Task[10] Done = In:263544094, Out:72000226
Sub Task[15] Done = In:263534741, Out:72009579
All sub tasks done!
============================================================
totalIn:4216696722, totalOut:1152012398
the mc pi value:3.14168387800455
the total time used:1244
Please enter any key to continue...

You can increase the exercise count to get more precise PI value, but it will consume more time.

full source code download
http://code4cs.googlecode.com/files/McPiCalc.zip

Some Personal Observations:
1. Windows Hpc Cluster provides convenient management tools and utilities, which makes deploying/managing middle-level(several hundreds of nodes) of computing cluster very easy.
2. Windows SOA programming model greatly simplified the development process of some specific kind of Hpc applications.
3. Windows Hpc build-in security feature add some complexity of the develop/deploy process and potential performance downgrade occurs if large amount of data movement happens among nodes. But these overhead results very little gains - dose security problem really matter in a private computing cluster?
4. Hpc SOA programming model is very similar to so called "web server farm" architecture. But as a general programming platform, the head/broker node fail-over problems are not solved in a very elegant and scalable way.
5. Windows Hpc scheduler is too general purpose, too centralized, which making the session creation very very time-consuming. This means that it takes SOA application much time to do init work.
6. Although it is called "SOA" and it uses popular "WCF" technology, the Hpc SOA architecture is completely not suitable for web applications(especially for scaling purpose). Microsoft describes the target scenario as "interactive application", which mainly includes Monte Carlo Problems, Ray Tracing, Excel Calculation Add-in and BLAST Searches.

[Reference]
[1]http://www.cs.mu.oz.au/498/notes/node39.html
[2]http://computing.llnl.gov/tutorials/parallel_comp/#Models
[3]Microsoft Official SOA doc
[4]submit jobs to head node in another AD
[5]Call WCF services hosted on other nodes with specific client credentials(domain username/password)
[6]Monte Carlo Method
How to Read Source Code

Part I - General Steps and Principles

1. Define a Clear Goal
- what's the purpose? to know how, to own components, to modify and extend?
- results driven, what's the final outcome?

2. Know it as Client User
- read user manual
- get an overall big picture
- know what it can do and what can't
- what is it suitable for and what not

3. Thinking Before Reading
- what if you design the whole system?
- what's the core challenge?
- write down your questions and concerns
- read with questions

4. Know the Architecture/Components
- know the overall architecture first
- devide the whole system into components
- identify what to focus, what to ignore
- use build file to identify component dependencies
- try building it

5. Read Specific Detail
- make a SMART plan
- focus on core logics
- ignore trivial parts
- start from entry point: main/wmain funtion
- identify thread creation/termination
- identify the main loop (server application)
- identify core data structure
- identify operations on core DS
- noting/documenting/charting down while reading

6. Producing Results
- big picture from user's perspective
- big picture from dev's perspective
- arch/logic for individual component
- summerize core data structure
- practice: build/deploy/use/debug/modify/hack the system
- comments on the implementation(what's good, what's bad, what learned)

7. Misc Tips
- read doc(user manual, design doc) before code
- get core data structure doc first if possible
- read the code in both static and dynamic(debug) way
- debug/step into a particular execution scenario
- read code iteratively, don't deep into detail in the beginning
- use interface/contract to separate concerns
- overall -> detail, but just detail on small areas
- leverage code comprehension tools to get static information
- print out core code and read them on real papers
- you can write unit test / use case for the software
- consider refactoring the code(kind of active reading) if unit tests are given
- if it's really hard to read, rewrite it!

Part II - Tools

One of the most frequent activities when reading code is navigating among those source codes. So tools that help navigating are important to improve the reading efficiency. Here are some popular tools for this purpose:

1. Source Code Index Generator
cscope http://cscope.sourceforge.net/
ctags http://ctags.sourceforge.net/

2. GUI Frontend for Index Generator
kscope http://kscope.sourceforge.net/
cbrowser http://cbrowser.sourceforge.net

3. Code Index Generating and Navigating
Source Navigator http://sourcenav.sourceforge.net/
Source Insight http://www.sourceinsight.com/
LXR http://lxr.linux.no/

4. For C Language ONLY
CXRef http://www.gedanken.demon.co.uk/cxref/
cflow http://www.gnu.org/software/cflow/

[Reference]
1. Code Comprehension Tool http://www.cs.ubc.ca/~murphy/cs319/index.html
2. Code Doc Generating Tool http://www.stack.nl/~dimitri/doxygen/links.html
3. Survey on Code Comp. Tools http://www.grok2.com/code_comprehension.html
4. Tips for Code Reading http://c2.com/cgi/wiki?TipsForReadingCode
5. Read VS Rewrite http://www.joelonsoftware.com/articles/fog0000000069.html

Scalability in Online Game and Virtual World
The latest issue of ACM Queue Magazine has an article titled "Scaling in Games & Virtual Worlds", which talks about SUN's efforts on the scalability problem in Online Game and Virtual World systems(A.K.A. MMORPG).

The author is a Distinguished Engineer in SUN, who had been involved in the scalable online game platform: DarkStar for about 2 years. He summarized many interesting points in this paper.

I - Unique Characteristics of Online Game:
1. Being Fun is the Prime Directive
2. Should be Easy to Learn, but Hard to Master
3. Client is Fat and Powerful
4. Network Latency is a Critical Factor

II - Current State of the Art of Techniques in Game & World:
1. Network protocol is simple and should be hold in one network packet
2. Server is artictured as doing small things fast with large scale concurrent requests
3. Predict & Adjust tricks are used to hide network latency
4. Servers act as player interaction hub and arbiter to avoid client cheating

III - Current Scaling Strategy:
1. Geographic Decomposition - Partitioned by Geography Information. For example, all activities in an island/country are processed by one physical server
2. Sharding - create noninteracting copies of parts of a world, players can only interact players within the same shard

The problems that SUN found are:
1. Chip Architecture is switching to multi-core style, which is suitable for parallel tasks
2. Server side game tasks are essentially parallelized: one thread for one player, interactions among users are relatively small, compared with all activies
3. Currently, most game implementation can't expoit parallel characteristic of multi-core cpu

SUN's proposal for next generation game development:
1. Hide concurrency and distribution for game logic developer
2. Server is architectured as event-based(maybe SEDA like, I guess) and one client request served by one server task
3. Communication is abstracted as Channel, physical address is hiden from developer
4. Data/State is moved from server memory to global data service, which is based on simplified DBMS technologies
5. Since little state in task logic and server mem, task imigration is possible, which enables hot load balance

The last valubale word in this paper: the darkstar is open sourced.

I think the observation and current tech description are the most interesting parts, the solution part is not that exciting, since the idea is somewhat not so innovative.

One personal observation is that, general platform will improve productivity of application developer, but with the price of performance. So, what's future of the darkstar project? It may be useful for small scale game, but when your players grows to M or 10M level, productivity is not the primary problem for dev team, but performance/stability is. So large scale applications always use ad-hoc solutions. But if darkstar is not used in REALLY LARGE scale environment, how can we evaluate the value of this system, how can we say the project succeed or fail?
The Beauty of Windows HPC Server 2008
1. Components of Windows HPC Server
- an x64 version based Windows Server 2008 Core
- a new and faster MS-MPI, which is based on MPICH2
- a scalable job scheduler for creation, execution and monitoring
- windows deployment service based provisioning service
- UI, cmd utilities and PowerShell cmdlets for cluster managing and monitoring

2. Architecture of Windows HPC Server



 
 
 
 
 
 
 
 
 
 

from Microsoft

Nodes
- Head Node
- Compute Node
- WCF Broker Node

Networks
- Interface
a. Winsock Direct
b. NetworkDirect
c. TCP/IP interface
- Physical
a. Ethernet
b. InfiniBand
c. Myrinet

Web Interface for Scheduling Service
- Basic Profile Web Service, open standard by OGF

3. What can we do with it?
Application Types
- Batch/Parallel : MPI
- Interactive/SOA : HPC WCF

Job Types
- Parallel Job: same bin, communication amnong tasks
- Parametric Sweep Job: same bin, no communication among tasks
- Task Flow Job: different bin, dependices(DAG) among tasks

System Management
- Cluster Configuration
- Monitoring & Reporing & Auditing
- Node Provisioning
- Diagnostics

Job Management
- Job Scheduler Configuration
- Job Creation and Definition
- Job Execution and Monitoring

[Reference]
- management
HPCS Command Tools
HPCS PowerShell Cmdlets

- programming
HPCS Job Scheduler Web Interface
SOA Programming on HPCS
MPI Programming on HPCS

- community
Hpc Show @ Channel9
HPCS @ MSDN
HPCS @ TechNet
Page view tracker