Various concrete consistency models have been described in [1], now it's time to discuss how to implement these models.
Consistency semantic is divided into two categories in [1] -
"Coherence and
Replication
are very similar concepts and deal with the same problem, but the
former is often used in hardware system (for example, in SMP Cache
system) and the later is often used in software system (for example, in
distributed files system and database system)".
Some other differences:
1. Coherence deals with data item at client side, while Replication deals with data item at server side
2.
Coherence deals with relicated data item that has a corresponding one
in a reliable storage backend, but Replication doesn't have that.
We will discuss protocols for
Data Replication and
Cache Coherence in different sections.
Part 1 - Data ReplicationTo
keep replicated data item consistent in distributed storage system,
there are some core problems to think about: where and how to store
various replicas, what, when and how to propagate update to replicas?
how to keep consistency?
I - Replica Server & Replica Content Placement- As close to client
as possible
to improve performance
- Adjust repica placement dynamicly according to access history
II - General Update Propagation Problems1. What to Propagate?
- Notification (Invalid Message)
- Command (operation & parameters)
- Data (updated region)
- Change Log (to accumulate several updates into one)
2. When to Propagate?
- Pull (each replica asks for updates)
- Push (primary replica send updates to all secondaries)
3. How to Propagate?
- Unicasting (send updates to each replica one by one)
- Multicasting (send update message to all replicas at one time)
III - Replication Protocols
A
replication protocol aims to implement some specific consistency model,
the detail of the protocol is highly dependent on the target model.
Here we only describe protocols which implement models that are popular
in practical systems.
1. Sequential Consistency
1.1 Primary Based Protocol
- all write operations to a data item x is served by one special
replica called primary replica, this replica is responsible for
updating other replicas, client only interact with this special replica.
1. 2. Replicated Write Protocol - write operations are sent to each replica to execute:
-
Active Replication, a total order of all write operations is required in order to make each replica execute the same order of write commands.
-
Quorum Based, write operations only need to be executed on part of all replicas before return. It use
votes to prevent write-write confilict and write-read conflict:
- Suppose each data item has N replicas
- To read a value, client must contact with at least Nr replicas
- To write a data item, client must contact with at least Nw replicas
- To ensure no WW and WR conflicts, Nr + Nw > N and Nw + Nw > N should be statisfied
2. Eventual Consistency ProtocolTwo requirements should be meet for this kind of protocol:
-
All update operations to a data item should reach and be executed all replicas at some time
- These operatioins should be executed in the
same orderSome popular methods to ensure these requirements are:
- Use write set and read set (use
Nw, Nr to control how many porcesses should be involved in write or read operation), thus update operations are ordered.
-
Expose data item version number to client, so when client accesses
data, it can pass known latest version number to server to implement
some client centric consistency model
- Limit update operation execution process, so write-write conflicts can be solved easily
Note:
- All upper protocols ignore process/node & communication failure, which would occur often in practical distributed system.
- Replication protocols than deal with failure, such as
Paxos,
Two-Phase Commit, are much more complicated.
Part 2 - Cache CoherenceCache
Coherence Protocol is used to keep client side replicas consistent in
the context that a reliable data item exists in storage backend (it's
true for smp cache like hardware system and for distributed file system
cache like software system).
I. Core problems of Cache Coherence Protocol:1. Coherence Detection StrategyThat is to say,
when inconsistencies are actually detected. For example, in distributed database system, this detection can be performed:
- At the beginning of a transaction
- Parallel with the on going transaction but before it commits
- When transaction commits
2. Coherence Enforcement StrategyThat is to say,
how all caches are kept consistent with each other. Generally, there are two methods
-
Invalidating: if a data item is modified in one cache, other caches that hold the same data item will receive a invalidation notification.
-
Propagating:
if a data item is modified in one cache, the update is propagated to
other caches that hold the same data item. So all replicas in cache are
update to the same version.
3. Cache-Server ConsistencyThat is to say, how to keep data items in cache and in storage server consistent with each other. There are serveral policies:
- Use
Read Only Cache, modification can only be made on items in storage server, cache pull these updates
-
Write Back, modifications to data items are accumulated at cache side, and are written to storage server at some other time.
-
Write Through,
modifications are made at cache, and are also propagated to storage
server. In SMP system, bus or directory may be used to serialize all
operations. While in distributed file systems, an exclusive lock may be
needed for a data cache that can perform modification operations to
avoid write-write confliction.
II Implementing Cache Coherence1. Implementation Mechanism1.1 Directory Based
- In a directory-based system, some information about the data being
shared is placed in a common directory. The directory acts as a filter
through which the processor must ask permission to load an entry from
the primary memory to its cache. When an entry is changed the directory
either updates or invalidates the other caches with that entry.[2] It
is not as fast as snooping but more scalable.
1.2 Snooping
- In such system, the individual caches monitor address lines for
accesses to memory locations that they have cached. When a write
operation is observed to a location that a cache has a copy of, the
cache controller invalidates or update its own copy of the snooped
memory location.[2] Since it needs broadcasting, not very scalable.
2. ProtocolsThe various cache coherence protocols use
Directory or
Snooping mechanism to solve the three core problems described above:
2.1 Invalidation Protocols-
Write-Once-
Berkeley-
Illinois/MESI
2.2 Propagating Protocols-
Firefly-
DragonMost of the cache protocols in multiprocessors are supporting
sequential consistency model, while in software distributed shared memory more popular are models supporting
release consistency or
weak consistency.
[Reference][1]
http://blogs.msdn.com/csliu/archive/2009/06/15/consistency-model-a-survey.aspx[2]
http://en.wikipedia.org/wiki/Cache_coherence[3]
http://www.nedprod.com/NedHAL/Cache%20Coherency%20solutions.html[4]
http://en.wikipedia.org/wiki/Memory_coherence[5] Shared Memory Architecture, 2001, Chinese Higher Education Press, Weiwu Hu
([5] 共享存储系统体系结构, 2001, 中国高等教育出版社, 胡伟武)
Part I - What's Data Consistency Model and Why Should We Care?
Data Consistency Model - it is a Semantic Contract
between a data storage system and its user. Here, data storage system
may refer to hardware system ( for example : memory sub-system inDSM,
SMP, CMP computers) , or software systems (for example: distributed
file system, web caching or distributed database).
Essentially
, Consistency Model defines what value to return in a read operation.The most natural semantic for storage system is - "read should return the
last
written value". It is intuitive in memory system with uniprocessor
associated, where no concurrent access and no data replication exist.
But when concurrent client access and multiple replica exist, it's not easy to identify what "
last write" means, mainly due to the lack of global time clock in parallel/distributed system.
So various data models are proposed to define various data semantic when
Contention and
Replication exists:
-
Data Contention
- In memory sub system of parallel computers, there may be multiple
processors that access the same memory location at the same period of
time, what's the semantic each processor should expect? (different
process access the same data item)
-
Data Replication
- In distributed file/database system, each data block is replicated at
multiple places, even access(read/write/modify) requests from the same
client may involve more than one replicas, what's the semantic in
client's perspective? (data item is replicated)
-
Both Contention & Replication
: What if multiple clients access replicated data blocks? What if there
is a local cache for each processor in SMP system? (many processes
access replicated data items)
Part II - The Various ModelsThere
are many consistency models exist in computer architectures and
distributed storage community. Why people invent so many semantics
between storage and its clients? The reason is the trade off between
Strict Semantic VS
Available Optimization Space
- strict (also simple and easy to understand, use and implement)
semantic will limit the space that we can use to improve availability
and do performance optimization. So each model has its advantages and
drawbacks.
Before describe various models, we should clarify some concepts first:
Program Order
- the order of operations (on data items) that is implied by the order
of program statements(or assembly instructions). It's the issuing order
of operations that are from the same thread/process.
Write Atomicity - write operation performs as no concurrent/overlapped operations can occur. So the meaning of "
Atomicity" here differs from that in DBMS field, where "Atomicity" means that
all ops in a transaction will be performed or
none of them will be performed.
Consistency VS
Coherence/Replication,
Consistency focus on the semantic of data store (a set of data item)
for multiple client accessing, while Coherence/Replication only deal
with the semantic (and how to implement it) of a replicated data item
(multiple replica exist) when it is accessed by multiple clients.
Coherence/Replication is just part of the Consistency Model.
Coherence / Replication Protocol only describes how to implement part of the semantic of some specific data consistency model.
[
Update@07/23: Coherence and
Replication
are very similar and deal with the same problem, but the former is used
in hardware system (for example, cache coherence) and the later is used
in software system (for example, in distributed files system and
database system]
Following are various Consistency Model descriptions:
1. Atomic Consistency/
Strict ConsistencyIt
means that for a particular data item (byte, cache block and file
region etc.), each read will return the latest update. I.E. events are
visible to all clients as they occurs/issues. As we already said, due
to the lack of global time clock in distributed/parallel system, it is
very hard to implement if not impossible.
2. Sequential ConsistencyThis
model is defined as - "The result of any execution is the same as if
the operations of all the processors were executed in some sequential
order and the operations of each individual processor appears in this
sequence in the order specified by its program." - Leslie Lamport
This definition implies:
1. Events/Operations from the same process are executed in program order
2. Every client sees the same order of all events/updates, which is a sequential one
Goods:
- Ensures every body sees the same order, very good for data replication
- Preserves causality
Bads:
-
Many interleavings are valid (two independent program with M mem ops
and N mem ops will have (M + N)!/(M!N!) valid interleavings)
- Different runs of a program might act differently
- Execution order may be very different from issuing order
3. Causal ConsistencyEvents that are
causally related must be seen by everybody in the same order. Unrelated ("concurrent") events can be seen in different orders.
The challenge in implementing this model lies in how to identify "causally related" events.
4. Eventual ConsistencyIf
no updates take place for a long time (inconsistency window), all
replicas will gradually become the latest version of the value.
In
this model, if client reads data from different replica, it's hard to
define a clear semantic about the return value. To resolve this
problem,
client-centric consistency is introduced. These models focus on the semantic in a single client's perspective.
Monotonic Reads
- If a process reads the value of an object, any successive read
operations on that object by that process will always return the same
or more recent value.
Monotonic Writes - A write operation by a process on a data item x is completed before any successive write operation on x by the same process.
Read Your Writes
- The effect of a write operation by a process on data item x will
always be seen by a successive read operation on x by the same process.
A closely related model is
"Session Consistency". In this model,
Read Your Write semantic only works within a session context and not cross session boundary.
Write Follow Reads
- Any successive write operation by a process on a data item x will be
performed on a replica of x that is up-to-date with the value most
recently read by that process.
5. Weak ConsistencyA storage system is said to support weak consistency if:
- All synchronization operations are seen by all processors in the same sequentially order.
- All
data operations may be seen in different order on different processors.
(But order of ops on the same data item is preserved)
- The set of both read and write operations in between different synchronization operations is the same in each processor.
An alternative definition is as:
- Accesses to synchronization variables are sequentially consistent.
- No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
- No
data access (read or write) is allowed to be performed until all
previous accesses to synchronization variables have been performed.
This model is weaker than sequential consistency model, since no order
assumption can be made on data operations between consecutive
synchronization operations.
The weak consistency leverage the
fact that - between consecutive sync operations, no other process can
use the data being written. So it's safe to reorder write operations on
different mem locations. But the order of operations on the same
location must be preserved to ensure a reasonable memory semantic.
6. Release ConsistencyRelease consistency is an extension of weak consistency that exploits the information about
acquire,
release, and
nonsynchronization accesses.
- Before an ordinary LOAD or STORE access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed, and
- Before a release
access is allowed to perform with respect to any other processor. all
previous ordinary LOAD and STORE accesses must be performed, and
- Special accesses are processor consistent with respect to one another.
In Release Consistency Model, all write operations by a certain node are seen by the other nodes after the former
releases the object and before the latter
acquire it, but a node must call
acquire to get up-to-date values.
There are two kinds of coherence protocols that implement release consistency:
-
eager, where all coherence actions are performed on release operations, and
-
lazy, where all coherence actions are delayed until after a subsequent acquire
7. Entry ConsistencyLike all variants of
Release Consistency, it requires the programmer to use
acquire and
release at the start and end of each critical section, respectively. However,
entry
consistency requires each ordinary shared variable to be associated
with some synchronization variable such as a lock or barrier.
- An
acquire access of a synchronization variable
S is not allowed to perform with respect to process
Pi until all updates to the guarded shared data
Ds have been performed with respect to process
Pi.
- Before an exclusive mode access to a synchronization variable
S by processor
Pi is allowed to perform with respect to
Pi, no other processor may hold
S in non-exclusive mode.
- After an exclusive mode access to
S has been performed, any processor's next non-exclusive mode access to
S may not be performed until it is performed with respect to the owner of
S.
8. Processor Consistency
- Writes done by a single processor are received by all other processors in the order in which they were issued, but
- Writes from different processors may be seen in a different order by different processorsThe
motivation behind this model is to better reflect the reality of
networks in which the latency between different nodes can be different.
This model is also known as
FIFO Consistency and
Pipeline
Random
Access
Memory
Consistency.
Notes on Nex Steps:
1. Why People Invent Each Model?
2. What's Good and What's Bad for Each Model?
[Reference][1]
Wiki on Consistency Model [2]
Summary on Consistency Model by CTO@Amazon[3]
Shared Memory Consistency Models: A Tutorial [4]
Memory Consistency Models [5] Notes On Consistency
[6] Entry Consistency, Midway : Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors
[7] Release Consistency, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors
[8] Weak Consistency, Memory Access Buffering in Multiprocessors
[9] Memory Ordering in Modern Microprocessors (Part I, Part II)
[10] Wiki on Cache Coherence
I. What's Sync & Async I/O The concepts - "
Asynchronous and
Synchronous" come from the telecommunication domain. They mainly deal with
timing -
"Sync communication needs an external clock signal
to coordinate the rhythm of sending/receiving sides, while in Async
model, any timing required to recover data from the communication
symbol is encoded within the symbols."[1]
Sync/Async are introduced into computer science for a different meaning:
"Async
operations are those occurring independently of the issuing program
flow, while in Sync model, requesting program flow must wait for the
operations to acknowledge completion."[1]
Particularly, "
Asynchronous I/O is a form of I/O processing that permits other processing to continue before the transmission has finished"[2]; in Synchrounous I/O, the operation sender "
must
wait until the hardware has completed the physical I/O, so that it can
be informed of the success or failure of the operation"[5].
MSDN also has a great article about the Sync/Async concepts -
Synchronous and Asynchronous I/O.
NOTE: in the following sections, we use
Async/Non-Blocking,
Sync/Blocking alternatively
and don't distinguish them. This convention is conformed to wikipedia
and MSDN, but some popular textbooks(such as
APUE ) differentiate these terms explicitly.
II. Why Invent Two I/O Models?
The Sync I/O model is simple to understand and straightforward to use,
but you should wait while the I/O is performing, thus the ultimate
performance is limited.
The basic motivation for Async I/O is to
boost performance. In fact, there are two assumptions about this problem:
1. I/O operations are time consuming, there
IS a lot of time for CPU to execute other instructions while the I/O ops are going on.
2. There are large amount of available instructions to execute when the results from I/O ops are not available.
If these two assumptions are not correct, the performance of your code
may not be improved even if you adopted the powerful Async I/O
programming model.
III. How to do Scalable I/O Operations?The challenges for scalable I/O model are:
1. What code to execute
after I/O operations are issued (maybe in async/sync way)?
2. How to detect and What to do
when I/O operations are completed?
Scalable I/O operations can be accomplished in either Sync or Async way:
- Scalable Sync I/O Model In Sync I/O Model, the API calls return
only when the requested operation completes.
But this doesn't mean that these calls must block. For example, suppose
you want to read data from socket, if there are some data in the socket
buffer before you issued the read request, the read() call will succeed
without block.
In order to avoid cpu idle while i/o operations are blocked, you may
execute other tasks that is not related to the pending i/o, or
avoiding i/o block by means of only issuing i/o operations at proper time. That is to say:
1.
Multitasking -
Code in other tasks get executed after I/O request is issued and code
follows the I/O operations get executed when they are completed. (Task
may take in the form of thread or process)
2.
I/O Multiplexing - Use one thread to handle multiple sync I/O device handles.
The key idea of this model is that - you call sync i/o functions only when you know that call will
NOT
block the calling thread (Because the data/status is ready even if you
haven't issued any corresponding i/o operations, for example:
connect(), read() on socket).
You use some sync calls (such as
select, poll, epoll) to check multiple device handles' status to know what i/o operations on them will not block.
Essentially, I/O multiplexing provides a way to check the status of
multiple device handles in one sync call. It only returns when some
events happened on some handles. Thus you can handle multiple Sync I/O
operations in one thread.
- Scalable Async I/O Model
In Async Model, I/O operation call returns immediately after issuing
and code that follows will get executed. The various models differs
only on how to get
I/O completion notification:
1.
Polling - Use various polling function (such as
HasOverlappedIoCompleted() on Windows) to poll whether some pending Async I/O is completed
2.
Callback - Kernel will call some user provided call back functions when some i/o operations are completed.
3.
Waiting - Kernel will signal some
kernel event objects when i/o operations are completed. You just wait on these events to get completion notification.
4.
Queuing - Kernel
will place some information packet into a Queue when I/O operations
completed. Client code should check the Queue for completion
information.
IV. Async I/O on Win32 Platform-
Multithreading is supported
-
I/O multiplexing is supported by
select(), WSAAsyncSelect(), WSAEventSelect()-
Polling/Waiting is supported by means of
Overlapped I/O-
Callback model is supported by means of
Alertable I/O-
Queuing is supported by means of
I/O Completion PortV. Async I/O on Linux-
Multithreading is supported
-
I/O multiplexing is supported by
select(), poll(), epoll_xxx()-
Callback is supported by
Linux AIO using system signal or notify function
NOTE:
-
poll() improves
select() in that there is no limitation on handle counts, and
epoll_xxx() improves
poll() in that
data transfer between kernel and user space is reduced greatly and the
complexity of the corresponding kernel logic is also reduced from O(N^3) to O(N).[17]
VI. Async I/O Model on .Net-
Callback is supported in the form of "
Event-based Asynchronous Pattern"
-
Polling/Waiting/Callback is supported in the form of "
IAsyncResult Async Pattern"
NOTE:
1.
Asynchronous Procedure Call mechanism is a great infrastructure for inter-thread/inter-process communication facility: Win32 API
QueueUserAPC() can be used to queue APC entry to other thread's APC queue even in other process.
- The most famous usage of this facility is
Gracefully Terminating Thread -
if a thread is in alertable waiting state and you want to terminate it,
you can post a dummy APC to it. That thread will then resume from wait
status, execute this dummy APC. You can do graceful termination actions
after that. [more detail can be found in chapter 10,
"Windows via C/C++"]
2.
Alertable I/O and
Overlapped I/O are old techniques, use
I/O Completion Port as much as you can, it's easy to use and also very powerful.
[Reference]- General
1. Wikipedia explanation:
Async vs Sync,
Async2. Asynchronous IO,
http://en.wikipedia.org/wiki/Asynchronous_I/O3. C10K Problem,
http://www.kegel.com/c10k.html4. MPI Async I/O,
http://beige.ucs.indiana.edu/I590/node109.html5. Sync/Async by Oracle GURU,
http://www.ixora.com.au/notes/asynchronous_io.htm- Windows
6.
Mulithreaded Async IO V.S. IO Completion Port7. IO Completion Port (
IOCP Introduction,
Inside IOCP)
8.
Callback supports by Windows Kernel9. Tips for
Async I/O and
IOCP10.
Async vs Sync I/O on Windows - Linux
11.
Async IO on Linux from Lighttpd12.
Ideas for Async-IO model from Squid13.
Kernel AIO support for Linux14.
Linux AIO Programming Introduction15.
Linux AIO Design Notes 16.
epoll performance analysis 17.
select, poll, epoll comparison 18.
I/O Event Handling Under Linux - .Net/Java
19.
Asynchronous Programming Design Patterns @ MSDN20.
.Net Asynchronous File I/O21.
Call Sync Method in Async Way on .Net22.
Java Async I/O design notes- Talk
23.
http://www.slideshare.net/Arbow/asynchronous-io-programming[This Presentation describes many reasonings and insights behind Async IO]
24.
http://www.slideshare.net/directi/async-io-and-multithreading-explained[Intresting and Intuitve explanation on Threading and Async IO]
1. The Need for Logical ClockOne
of the challenges in distributed system is the lack of global time
clocks, it's very hard to timestamp events is different processes and
order them globally.
To solve the "
Time & Order of Events in Distributed System" problem, let's rethink what "
Time" and "
Order" means - essentially,
Time is a property of events that is used to Order or Sequence them.In distributed system, many events occur in different process independently, we
can't and
no need
to find a deterministic order between two independent events. But are
there dependent/related events? Yes, events are related when one
causally affects another, for example through communication between
processes.
Causal action determines order among events,
and we must use some form of information to represent this kind of
orders. This kind of information catches the causal relationship of
events but don't need a real time/clock, it is called
Logical Time or
Logical Clock.
2. How to implement Logical Clock?What we have in hand?
- Ordered Evends within a Process
- Causual Relationship between some Crosss-Process Events
With these information, we can define a relation ship called "
Happend Before"
- 1. if Event E1 and E2 are from the same process, and E1 comes before E2, then E1
Happened Before E2
- 2. if E1 is sending of a message and E2 is receiving of the same message, then E1
Happened Before E2
- 3. if E1
Happened Before E2, and E2
Happened Before E3, then E1
Happened Before E3.
In essence, we have a
partial order among the events in distributed system.
2.1 Lamport ClockLamport Clock is a mechanism that can extend a
partial order into a
total order that is consistent with the original partial order.
The algorithm is:
- Each process set a initial clock value(arbitary) on start up;
- A
process increments its counter between two consecutive events in that
process; (Send message and Receive message are all events)
- When a process sends a message(after doing step 2), it includes its counter value with the message;
- On receiving a message(after doing step 2), the receiver process sets its counter to be greater than the clock value in received message AND greater than or equal to its present clock value;
- If two events have the same clock value, use its associated process ID to determine the order between them.
After
applying this algorithm, we can get a total order (each event pair can
be compared in terms of order) of all the events in a distributed
system.
The invariant of Lamport clock algorithm is that: if event E1 is
Happened Before event E2, then Clock(E1) > Clock(E2).
But
the drawback of Lamport Clock is that: if Clock(E1) > Clock(E2), we
don't know whether E1 is happened before E2 or E2 is happened before
E1. The root cause of this problem is that, Lamport Clock lost some
information during the extending algorithm. In another word, the
resulting total order is just
one of those valid total orders.
Sometimes, we need to keep the whole partial order information. To this end, people invented so called
Vector Clock.
2.2 Vector ClockThe main purpose of
vector clock mechanism is to retain the
complete partial order information(I.E. all possible total orders, not just one of them, as in Lamport Clock) in a logical clock system.
The basic observation is that, in Lamport Clock algorithm, we only retain the information that a receiving message event
is causally afftected by the sending message event, but some information about the fact that the receiving message event
is NOT causally impacted by some events in other processes is lost.
To
fix this problem, we should not only use the clock value of the sending
process, but also the clock value from other processes to identify all
causal events from all processes in the whole distributed system.
So the basic idea is:
- Use a vector to represent event time/clock
- Each element stores the clock value for one process
Combining
the upper idea and Lamport Clock algorithm, we can design a new
algorithm to produce vector clock value for each events in the whole
distributed system:
- Initially all clock values are set to zero;
- Before a process experiences an internal event, it increments its own clock value in the vector by at least one.
- Each time a process prepares to send a message, it conducts Step 2 and then sends its entire vector along with the message being sent.
- Each time a process receives a message, it conducts Step 2 and updates each element in its vector by (suppose the sending process is Ps):
If local_vector[Ps] <= other_vector[
Ps] then
local_vector[
Ps]
= other_vector[
Ps] + 1;
for each element i other than Ps do
local_vector[
i] =
MAX(local_vector[
i], other_vector[
i]);
From the algorithm we can see that, an element of the vector clock of an event means that this event only
causally dependents on the events happened
before that time in the corresponding process.
How to use Vector Clock obtained by the upper algorithm to infer event order?
- Let vClock(x) denote the vector clock of event x
-
[ In English: vClock(x) is less than vClock(y)
if and only if
at least one element in vClock(x) is strictly less that that of
vClock(y), and other elements are less than or equal to those in
vClock(y).
]- X
Happened Before Y <=> vClock(X) < vClock(Y)
In a word, X Happened Before Y if and only if at
least one element in vClock(x) is strictly less that that of vClock(y),
and other elements are less than or equal to those in vClock(y).
[Reference]
1. Order
- Partial Order
- Total Order
2. Logical Clock
- Lamport Clock
- Vector Clock
3. Time, Clock and Ordering of Events in a Distributed System by Leslie Lamport (1978)
4. Timestamps in message-passing systems that preserve the partial ordering by Colin J. Fidge (1988)
Memory management is a core task in native world, careless usage of dynamic memory may cause the following problems:
- 1. Heap Fragment, this will introduce performance penalty since it breaks data locality
- 2. Memory Leak, it's a prgm correctness problem and a horrible defect for long-run software
Here I summarized some tips related to the two issues: Mem Optimization & Mem Correctness
Part I - Mem Optimization
General Principles
1. Ensure mem layout cohesion (aka. improve data locality)
2. Avoid frequent alloc/free (aka. batch mem ops, prefer few & bulk over large & small mem ops)
How to implement them?
- Redesign your data structure to make them live in large blocks
- Make unrelated data structure in different region
- Use memory pool to manage mem
Part II - Mem Correctness
One of the challenges when doing native code development is avoiding memory leak. It's so easy (also difficult to avoid) to forget releasing each memory block/object that has been allocated explicitly.
Types of Mem Leaks
1. Constant Leak, allocated mem are totally forgotten to release
2. Casual Leak, allocated mem are not release under some conditions
3. One-Time Leak ,
allocated mem is not released but that line of code only get executed
once (for example, mem allocated in ctor of singleton objec)
4. Implicit
Leak, mem blocks are hold too long (released too late in application
life cycle, this kind of mem leak happens even in GC enabled language
such as Java/.Net, for example, unused objects are still reachable
through Root Set in GC)
To deal with Mem Leak problem, you have two choices:
- Avoid it
- Detect & Fix it
Sec. I - How to Avoid Mem Leak?
1. Adopt Resource Acquisition Is Initialization (RAII) Mechanism in C++
std::auto_ptr
is a good choice if RAII is semantic right for your problem. If you
want to ensure your object/mem get released whenever the control goes
out of some scope (for example, multiple exit path, potential exception
etc.), RAII can be used to solve your problem.
But it can't be passed as return value, can't be put in STL containers.
2. stack based allocation
_alloca() will allocated mem from stack rather than heap. The mem returned will be released when function returns.
But there is potential stack overflow exceptions, since stack is far more smaller than heap.
3. Reference Counting (aka, share/smart pointer)
Use
some data structure to track how many owners are referencing the mem
block or objects. When reference counting is zero, the mem/object is
released.
std::tr1::shared_ptr and boost::shared_ptr are all
based on Reference Counting and RAII concepts. They resolved the
problem of not being able to be put in container, can't be passed as
parameter and return value etc.
But, if your objects have cyclic
reference, this mechanism doesn't work. The fundamental problem behind
is that - the semantic of "useless", should be defined as "Can't Be Reached", not "No One References".
Another
draw back of smart pointer style reference counting is that, it can't
handle pointers that should be put into a union structure. Because
union can't consists of any member fields that has user defined
ctor/non-trivial default ctor/dtor/copy ctor etc.
4. Garbage Collection (yes, gc for C/C++)
Most modern GC uses Mark & Sweep
algorithm to implement GC. The idea behind is that, GC has pointer list
for all heap-allocated objects and a Root Set object pointer list. When
garbage collection is triggered, it traverse from root set to find all
reachable objects and mark them. Those unmarked objects are garbage
that can be deleted. GC for C/C++ is a huge topic, [7] is a very good
reference doc.
General purpose GC for C/C++ is difficult, but
for your specific application requirement, it may not that challenge.
According to my own GC implementation experience, the most difficult
part is define your object ownership policy.
GC is great, but it
still can't handle some "semantic garbage objects". That is to say, if
you hold references to some objects that are in fact you will never use
again, GC won't collect mem and other resources owned by these objects.
Essentially, Memory Management is all about Consistency of Ownership.
- Each mem block should have an owner
- Each mem block should have only 1 owner
- Only the owner of the mem block is responsible for its life cycle
So, the most important design principle about C/C++ memory management is - consider carefully about the ownership of an object/mem block: when and who should be responsible for releasing it.
Sec. 2 - How to Detect Mem Leak
1. Use Debug Version C RunTime library
1.1 Use _CrtDumpMemoryLeaks()
Step 1. include the following directives into each cpp source file
#define _CRTDBG_MAP_ALLOC
#include "crtdbg.h"
#include "stdlib.h"
Step 2. call _CrtDumpMemoryLeaks() at the line where you want to check memory leaks.
This
method has a drawback that mem objects that are released after
_CrtDumpMemoryLeaks() invocation will be treated as leaked mem. (This
happens when mem is released in global object's dtor) It's a false
negative.
1. 2 Use _CrtSetDbgFlag()
Add the following code at the entry point of your application
int nFlag = _CrtSetDbgFlag( _CRTDBG_REPORT_FLAG );
nFlag |= _CRTDBG_LEAK_CHECK_DF;
_CrtSetDbgFlag( nFlag );
This method don't have the drawback of 1.1, but you have no control when the mem leak action performs.
1. 3 Use CrtMemState
_CrtMemState cms1, cms2, cms3;
_CrtMemCheckpoint(&cms1);
/* code to check */
_CrtMemCheckpoint(&cms2);
if(_CrtMemDifference(&cms3, &cms1, &cms2))
{
_CrtMemDumpStatistics(&cms3);
}
This code will dump heap statistics info about the changes happened in the "code to check".
_CrtSetReportMode() can be used to control where to output these diagnose information.
_crtBreakAlloc / {,,msvcrtd.dll}_crtBreakAlloc / _CrtSetBreakAlloc() can be used to control the debug break condition.
More info on CRT mem debug routines, please see reference [1] and [2]
2. Monitor Process Working Set
The Win32 API GetProcessMemoryInfo()
can query process working set size. You can use this api to check
whether the working set size is changed after calling some suspicious
functions.
It can't tell you where mem leak happens, but it a good way to write unit test to track mem leak problems.
3. Use Professional Tools
BoundsChecker
IBM Purify
LeakTracer(Linux)
Windows Leaks Detector
Valgrind(Linux)
MemWatch
Insure++
Visual Leak Detector
User Mode Dump Heap
Part III - Other Tips
1. use "#define SAFE_DELETE(ptr) if (ptr != NULL) { delete ptr; ptr = NULL; }" to avoid redeleting the same object.
2. remember to delete objects pointed by elements in container that contains pointers.
3. pair delete/new delete[]/new[] malloc/free correctly.
4. use "new (std::nothrow)" to eliminate exceptions raising in low mem situation.
NOTE: (Lessons learned from topic investigating)
When solving hard problems, Be Sure To:
1. Use well-known idioms and well-understood mechanisms
2. Keep things as simple as possible
Memory Management:
1. It's another subsystem/component of your whole system
2. Design this component with care
3. Avoid using new/delete directly
Techniques
discussed here apply to not only memory blocks, but also any type of
"resource" that needs explicit requesting/releasing.
[Reference]
Mem Leak
1. Mem Leak Detection http://www.ddj.com/cpp/204800654
2. Microsoft CRT Debug Routines http://msdn.microsoft.com/en-us/library/1666sb98(VS.71).aspx
3. Microsoft CRT Debug Tech http://msdn.microsoft.com/en-us/library/zh712wwf.aspx
4. Mem Debugger List http://en.wikipedia.org/wiki/Memory_debugger
5. Purify from IBM - Use Purify for C code
6. Mem Leak in Java/.Net http://www.agiledeveloper.com/articles/MemoryLeak092002.pdf
Garbage Collection
7. C/C++ GC from HP http://www.hpl.hp.com/personal/Hans_Boehm/gc/
8. GC for C/C++ http://blog.codingnow.com/2008/06/gc_for_c.html
9. How .Net GC works, GC in OO Language, Auto GC
Understanding Mem Management
10. C++ Mem Management http://www.slideshare.net/reachanil/c-memory-management
11. Inside Mem Management http://www.ibm.com/developerworks/linux/library/l-memory/
12. C++ Memory Management: From Fear to Triumph (Part 1, Part 2, Part 3)
13. http://www.cantrip.org/wave12.html
14. Mem Optimization http://www.codingnow.com/2008/memory_management.ppt
15. Mem Mgmt 4 Sys Coder http://www.enderunix.org/simsek/articles/memory.pdf
Misc
16. C++ smart pointers http://www.onlamp.com/lpt/a/6559
17. Mem Leak Definition
18. Is Mem Leak Ever OK?
The "
Windows System Programming"(3E)
has a great appendix about Windows, Unix, C Library API comparison.
It's obvious that this appendix lacks of many APIs in Memory, DLL and
Security related areas, but it is still very helpful for referencing
when doing system programming on windows platform.
Original content can be found here:
http://my.safaribooksonline.com/0321256190/app02
1. I/O
1. 1 File & Directory| Subject | Windows | UNIX | C Library | Comments |
|---|
| Console I/O | AllocConsole | terminal I/O | N/A |
|
| Console I/O | FreeConsole | terminal I/O | N/A |
|
| Console I/O | ReadConsole | read | getc, scanf, gets |
|
| Console I/O | SetConsoleMode | ioctl | N/A |
|
| Console I/O | WriteConsole | write | putc, printf, puts |
|
| Directory Mgt | CreateDirectory | mkdir* | N/A | Make a new directory |
| Directory Mgt | FindClose | closedir* | N/A | Close a directory search handle |
| Directory Mgt | FindFirstFile | opendir*, readdir* | N/A | Find first file matching a pattern |
| Directory Mgt | FindNextFile | readdir* | N/A | Find subsequent files |
| Directory Mgt | GetCurrentDirectory | getcwd* | N/A |
|
| Directory Mgt | GetFullPathName | N/A | N/A |
|
| Directory Mgt | GetSystemDirectory | Well-known pathnames | N/A |
|
| Directory Mgt | RemoveDirectory | rmdir, unlink* | remove |
|
| Directory Mgt | SearchPath | Use opendir, readdir | N/A | Search for a file on a specified path |
| Directory Mgt | SetCurrentDirectory | chdir*, fchdir | N/A | Change the working directory |
| Error Handling | FormatMessage | strerror | perror |
|
| Error Handling | GetLastError | errno | errno | Global variable |
| Error Handling | SetLastError | errno | errno | Global variable |
| File Locking | LockFile | fcntl (cmd=F_GETLK, ..) | N/A |
|
| File Locking | LockFileEx | fcntl (cmd=F_GETLK, ..) | N/A |
|
| File Locking | UnlockFile | fcntl (cmd=F_GETLK, ..) | N/A |
|
| File Locking | UnlockFileEx | fcntl (cmd=F_GETLK, ..) | N/A |
|
| File System | CloseHandle (file handle) | close* | fclose | CloseHandle is not limited to files |
| File System | CopyFile | open; read; write; close | fopen; fread; fwrite; fclose | Duplicate a file |
| File System | CreateFile | open*, creat* | fopen | Open/create a file |
| File System | DeleteFile | unlink* | remove | Delete a file |
| File System | FlushFileBuffers | fsynch | fflush | Write file buffers |
| File System | GetFileAttributes | stat*, fstat*, lstat | N/A |
|
| File System | GetFileInformationByHandle | stat*, fstat*, lstat | N/A | Fill structure with file info |
| File System | GetFileSize | stat*, fstat*, lstat | ftell, fseek | Get length of file in bytes |
| File System | GetFileTime | stat*, fstat*, lstat | N/A |
|
| File System | GetFileType | stat*, fstat*, lstat | N/A | Check for character stream device or file |
| File System | GetStdHandle | Use file desc 0, 1, or 2 | Use stdin, stdout, stderr |
|
| File System | GetTempFileName | Use C library | tmpnam | Create a unique file name |
| File System | GetTempFileName, CreateFile | Use C library | tmpfile | Create a temporary file |
| File System | GetTempPath | /temp path | N/A | Directory for temp files |
| File System | MoveFile, MoveFileEx | Use C library | rename | Rename a file or directory |
| File System | CreateHardLink | link, unlink* | N/A | Windows does not support links |
| File System | N/A | symlink | N/A | Create a symbolic link |
| File System | N/A | readlink | N/A | Read name in a symbolic link |
| File System | N/A, ReadFile returns 0 bytes | N/A, read returns 0 bytes | feof | Rest for end of file |
| File System | N/A, use multiple ReadFiles | readv | N/A, use multiple freads | Scatter read |
| File System | N/A, use multiple WriteFiles | writev | N/A, use multiple fwrites | Gather write |
| File System | ReadFile | read | fread | Read data from a file |
| File System | SetEndOfFile | chsize* | N/A |
|
| File System | SetFileAttributes | fcntl | N/A |
|
| File System | SetFilePointer | lseek | fseek | Set file pointer |
| FileSystem | SetFilePointer (to 0) | lseek (0) | rewind |
|
| File System | SetFileTime | utime* | N/A |
|
| File System | SetStdHandle | close, dup*, dup2*, or fcntl | freopen | dup2 or fcntl |
| File System | WriteFile | write | fwrite | Write data to a file |
1.2 Async I/O | Subject | Windows | UNIX | C Library | Comments |
|---|
| Asynch I/O | GetOverlappedResult | N/A | N/A |
|
| Asynch I/O | ReadFileEx | N/A | N/A | Extended I/O with completion routine |
| Asynch I/O | SleepEx | N/A | N/A | Alertable wait |
| Asynch I/O | WaitForMultipleObjects (file handles) | poll, select | N/A |
|
| Asynch I/O | WaitForMultipleObjectsEx | N/A | N/A | Alertable wait |
| Asynch I/O | WriteFileEx | N/A | N/A | Extended I/O with completion routine |
| Asynch I/O | WaitForSingleObjectEx | waitpid | N/A | Alertable wait |
2. Memory & DLL| Subject | Windows | UNIX | C Library |
|---|
| Mapped Files | CreateFileMapping | shmget | N/A |
| Mapped Files | MapViewOfFile | mmap, shmat | N/A |
| Mapped Files | MapViewOfFileEx | mmap, shmat | N/A |
| Mapped Files | OpenFileMapping | shmget | N/A |
| Mapped Files | UnmapViewOfFile | munmap, shmdt, shmctl | N/A |
| Memory Mgt | GetProcessHeap | N/A | N/A |
| Memory Mgt | GetSystemInfo | N/A | N/A |
| Memory Mgt | HeapAlloc | sbrk, brk, or C library | malloc, calloc |
| Memory Mgt | HeapCreate | N/A | N/A |
| Memory Mgt | HeapDestroy | N/A | N/A |
| Memory Mgt | HeapFree | Use C library | free |
| Memory Mgt | HeapReAlloc | Use C library | realloc |
| Memory Mgt | HeapSize | N/A | N/A |
| Shared Memory | CloseHandle (map handle) | shmctl | N/A |
| Shared Memory | CreateFileMapping, OpenFileMapping | shmget | N/A |
| Shared Memory | MapViewOfFile | shmat | N/A |
| Shared Memory | UnmapViewOfFile | shmdt | N/A |
| DLLs | LoadLibrary | dlopen | N/A |
| DLLs | FreeLibrary | dlclose | N/A |
| DLLs | GetProcAddress | dlsyn | N/A |
| DLLs | DllMain | pthread_once | N/A |
3. Process & Thread
3.1 Process| Subject | Windows | UNIX | C Library | Comments |
|---|
| Process Mgt | CreateProcess | fork (); execl ()*, system() | N/A | There are 6 execxx functions |
| Process Mgt | ExitProcess | _exit | exit |
|
| Process Mgt | GetCommandLine | argv [] | argv [] |
|
| Process Mgt | GetCurrentProcess | getpid* | N/A |
|
| Process Mgt | GetCurrentProcessId | getpid* | N/A |
|
| Process Mgt | GetEnvironmentStrings | N/A | getenv |
|
| Process Mgt | GetEnvironmentVariable | N/A | getenv |
|
| Process Mgt | GetExitCodeProcess | wait, waitpid | N/A |
|
| Process Mgt | GetProcessTimes | times, wait3, wait4 | N/A |
|
| Process Mgt | GetProcessWorkingSetSize | wait3, wait4 | N/A |
|
| Process Mgt | N/A | execl*, execv*, execle*, execve*, execlp*, execvp* | N/A | Windows does not have a direct equivalent |
| Process Mgt | N/A | fork, vfork | N/A | Windows does not have a direct equivalent |
| Process Mgt | N/A | getppid | N/A | No parent/child relationships in Windows |
| Process Mgt | N/A | getgid, getegid | N/A | No process groups in Windows |
| Process Mgt | N/A | getpgrp | N/A |
|
| Process Mgt | N/A | setpgid | N/A |
|
| Process Mgt | N/A | setsid | N/A |
|
| Process Mgt | N/A | tcgetpgrp | N/A |
|
| Process Mgt | N/A | tcsetpgrp | N/A |
|
| Process Mgt | OpenProcess | N/A | N/A |
|
| Process Mgt | SetEnvironmentVariable | putenv | N/A | putenv is not part of the Standard C library |
| Process Mgt | TerminateProcess | kill | N/A |
|
| Synch: Process | WaitForMultipleObjects (process handles) | waitpid | N/A |
|
| Synch: Process | WaitForSingleObject (process handle) | wait, waitpid | N/A |
|
| Timers | KillTimer | alarm (0) | N/A |
|
| Timers | SetTimer | alarm | N/A |
|
| Timers | Sleep | sleep | N/A |
|
| Timers | Sleep | poll or select, no file descriptor | N/A |
|
3.2 Thread| Subject | Windows | UNIX/Pthreads | Comments |
|---|
| Thread Mgt | CreateRemoteThread | N/A |
|
| TLS | TlsAlloc | pthread_key_alloc |
|
| TLS | TlsFree | pthread_key_delete |
|
| TLS | TlsGetValue | pthread_getspecific |
|
| TLS | TlsSetValue | pthread_setspecific |
|
| Thread Mgt | CreateThread, _beginthreadex | pthread_create |
|
| Thread Mgt | ExitThread, _endthreadex | pthread_exit |
|
| Thread Mgt | GetCurrentThread | pthread_self |
|
| Thread Mgt | GetCurrentThreadId | N/A |
|
| Thread Mgt | GetExitCodeThread | pthread_yield |
|
| Thread Mgt | ResumeThread | N/A |
|
| Thread Mgt | SuspendThread | N/A |
|
| Thread Mgt | TerminateThread | pthread_cancel | pthread_cancel is safer |
| Thread Mgt | WaitForSingleObject(thread handle) | pthread_join |
|
| Thread Priority | GetPriorityClass | pthread_attr_getschedpolicy, getpriority |
|
| Thread Priority | GetThreadPriority | pthread_attr_getschedparam |
|
| Thread Priority | SetPriorityClass | pthread_attr_setschedpolicy, setpriority, nice |
|
| Thread Priority | SetThreadPriority | pthread_attr_setschedparam |
|
|
3.3 Synchronization| Subject | Windows | UNIX/Pthreads | Comments |
|---|
| Synch: CritSec | DeleteCriticalSection | Use mutexes to emulate critical sections. Some systems provide proprietary equivalents. | C library is not applicable |
| Synch: CritSec | EnterCriticalSection | C library is not applicable |
| Synch: CritSec | InitializeCriticalSection |
|
| Synch: CritSec | LeaveCriticalSection |
|
| Synch: Event | CloseHandle (event handle) | pthread_cond_destroy |
|
| Synch: Event | CreateEvent | pthread_cond_init |
|
| Synch: Event | PulseEvent | pthread_cond_signal | Manual-reset event |
| Synch: Event | ResetEvent | N/A |
|
| Synch: Event | SetEvent | pthread_cond_broadcast | Auto-reset event |
| Synch: Event | WaitForSingleObject (event handle) | pthread_cond_wait |
|
| Synch: Event | WaitForSingleObject (event handle) | pthread_timed_wait |
|
| Synch: Mutex | CloseHandle (mutex handle) | pthread_mutex_destroy |
|
| Synch: Mutex | CreateMutex | pthread_mutex_init |
|
| Synch: Mutex | ReleaseMutex | pthread_mutex_unlock |
|
| Synch: Mutex | WaitForSingleObject (mutex handle) | pthread_mutex_lock |
|
| Synch: Sem | CreateSemaphore | semget |
|
| Synch: Sem | N/A | semctl | Windows does not directly support all these options |
| Synch: Sem | OpenSemaphore | semget |
|
| Synch: Sem | ReleaseSemaphore | semop (+) |
|
| Synch: Sem | WaitForSingleObject (semaphore handle) | semop (-) | Windows can wait for only one count |
3.4 IPC| Subject | Windows | UNIX | C Library | Comments |
|---|
| IPC | CallNamedPipe | N/A | N/A | CreateFile, WriteFile, ReadFile, CloseHandle |
| IPC | CloseHandle (pipe handle) | close, msgctl | pclose | Not part of the Standard C library—see Stevens |
| IPC | ConnectNamedPipe | N/A | N/A |
|
| IPC | CreateMailslot | N/A | N/A |
|
| IPC | CreateNamedPipe | mkfifo, msgget | N/A |
|
| IPC | CreatePipe | pipe | popen | Not part of the Standard C library—see Stevens |
| IPC | DuplicateHandle | dup, dup2, or fcntl | N/A | Or use file names CONIN$, CONOUT$ |
| IPC | GetNamedPipeHandleState | stat, fstat, lstat64 | N/A |
|
| IPC | GetNamedPipeInfo | stat, fstat, lstat | N/A |
|
| IPC | ImpersonateNamedPipeClient | N/A | N/A |
|
| IPC | PeekNamedPipe | N/A | N/A |
|
| IPC | ReadFile (named pipe handle) | read (fifo), msgsnd | N/A |
|
| IPC | RevertToSelf | N/A | N/A |
|
| IPC | SetNamedPipeHandleState | N/A | N/A |
|
| IPC | TransactNamedPipe | N/A | N/A | WriteFile; ReadFile |
| IPC | WriteFile (named pipe handle) | write (fifo), msgrcv | N/A |
|
| Misc. | GetComputerName | uname | N/A |
|
| Misc. | SetComputerName | N/A | N/A |
|
| Security | SetNamedPipeIdentity | Use directory sticky bit | N/A |
|
4. Misc4.1 Security| Subject | Windows | UNIX | Comments |
|---|
| Security | AddAccessAllowedAce | chmod, fchmod | C library does not support security
|
| Security | AddAccessDeniedAce | chmod, fchmod |
| Security | AddAuditAce | N/A |
| Security | CreatePrivateObjectSecurity | N/A |
| Security | DeleteAce | chmod, fchmod |
| Security | DestroyPrivateObjectSecurity | N/A |
| Security | GetAce | stat*, fstat*, lstat |
| Security | GetAclInformation | stat*, fstat*, lstat |
| Security | GetFileSecurity | stat*, fstat*, lstat |
| Security | GetPrivateObjectSecurity | N/A |
| Security | GetSecurityDescriptorDacl | stat*, fstat*, lstat |
| Security | GetUserName | getlogin |
| Security | InitializeAcl | N/A |
| Security | InitializeSecurityDescriptor | Umask |
| Security | LookupAccountName | getpwnam, getgrnam |
| Security | LookupAccountSid | getpwuid, getuid, geteuid |
| Security | N/A | getpwend, setpwent, endpwent |
| Security | N/A | getgrent, setgrent, endgrent |
| Security | N/A | Setuid, seteuid, setreuid |
| Security | N/A | Setgid, setegid, setregid |
| Security | OpenProcessToken | getgroups, setgroups, initgroups |
| Security | SetFileSecurity | chmod*, fchmod |
| Security | SetPrivateObjectSecurity | N/A |
| Security | SetSecurityDescriptorDacl | Umask |
| Security | SetSecurityDescriptorGroup | chown, fchown, lchown |
| Security | SetSecurityDescriptorOwner | chown, fchown, lchown |
| Security | SetSecurityDescriptorSacl | N/A |
4.2 Exception Handling | Subject | Windows | UNIX | C Library |
|---|
| SEH | _try – _except | Use C library signals | Use C library signals |
| SEH | _try – _finally | Use C library signals | Use C library signals |
| SEH | AbnormalTermination | Use C library signals | Use C library signals |
| SEH | GetExceptionCode | Use C library signals | Use C library signals |
| SEH | RaiseException | Use C library signals | signal, raise |
| Signals | Use _finally block | Use C library | atexit |
| Signals | Use C library or terminate process | kill | raise |
| Signals | Use C library | Use C library | signal |
| Signals | Use SEH, VEH | sigemptyset | N/A |
| Signals | Use SEH, VEH | sigfillset | N/A |
| Signals | Use SEH, VEH | sigaddset | N/A |
| Signals | Use SEH, VEH | sigdelset | N/A |
| Signals | Use SEH, VEH | sigismember | N/A |
| Signals | Use SEH, VEH | sigprocmask | N/A |
| Signals | Use SEH, VEH | sigpending | N/A |
| Signals | Use SEH, VEH | sigaction | N/A |
| Signals | Use SEH, VEH | sigsetjmp | N/A |
| Signals | Use SEH, VEH | siglongjmp | N/A |
| Signals | Use SEH, VEH | sigsuspendf | N/A |
| Signals | Use SEH, VEH | psignal | N/A |
| Signals | Use SEH, VEH, or C library | Use C library | abort |
4.3 System Information & Time| Subject | Windows
| UNIX | C Library | Comments |
| System Info | GetDiskFreeSpace | N/A | N/A |
|
| System Info | GetSystemInfo | getrusage | N/A |
|
| System Info | GetVersion | uname | N/A |
|
| System Info | GetVolumeInformation | N/A | N/A |
|
| System Info | GlobalMemoryStatus | getrlimit | N/A |
|
| System Info | Various defined constants | sysconf, pathconf, fpathconf | N/A |
|
| Time | GetSystemTime | Use C library | time, gmtime |
|
| Time | See ls program,
| Use C library | asctime |
|
| Time | CompareFileTime | Use C library | difftime | Compare "calendar" times |
| Time | FileTimeToLocalFileTime, FileTimeToSystemTime | Use C library | localtime |
|
| Time | FileTimeToSystemTime | Use C library | gmtime |
|
| Time | GetLocalTime | Use C library | time, localtime |
|
| Time | See touch program,
| Use C library | strftime |
|
| Time | SetLocalTime | N/A | N/A |
|
| Time | SetSystemTime | N/A | N/A |
|
| Time | Subtract file times | Use C library | difftime |
|
| Time | SystemTimeToFileTime | Use C library | mktime |
|
Conventionally, HPC/Parallel problems can be roughly divided into the following three categories[1][2]:
- Embarrassingly Parallel,
for these applications, little or no effort is required to separate the
problem into a number of small tasks that runs on one CPU core. No or
very lightweight post processing is needed.
- Parametric/Data Parallel,
these applications divides the input data into a number of completely
independent parts. The same computation is undertaken on each part. And
some kind of post processing after the computations is needed.
- Message Passing,
these are those jobs that there are dependencies among various tasks
and there are communications among these tasks. These jobs are not easy
to be parallelized.
On Windows Hpc Server 2008, programming model for Embarrassingly Parallel(especially web based) applications is referenced as SOA Model,
in which, client send requests to service broker, and service broker
forward these requests to service instances. Service instance never
talks to each other, and communication among these components are all
service oriented(more specifically, WCF based).

[SOA Programming Model Workflow, from
Microsoft]
[3]
is a detailed documentation on this topic, but in this article, I will
demo a SOA based PI(3.1415926535...) value calculation application
using
Monte Carlo method[6] in a real Windows Hpc Cluster.
When I say "real windows hpc cluster", I mean:
1. The cluster has multiple nodes(6 nodes: 1 head, 1 broker, 4 compute).
2. This cluster has dedicated AD/Network, which is totally different from client/dev machines and network env.
3.
You use some server called "Boundary Server" to access the Hpc Cluster.
The boundary server has NICs to connect with both cluster private
network and corp/enterprise network.
4. You dev/debug your application on boundary server, not on cluster head node.(let head node focus on job requests serving)
5. Your corp network domain account is different from Hpc Cluster private network domain account.
6. In the following sections, I assume the cluster environment is already correctly set up.
[These environment assumptions are more complex than those in [3], but they are more similar to real production env.]
The Monte Carlo PI value calculation contains two parts - the server part and client part.
The server part is a pure WCF service- you define interface/contract and implement the interface. The core logic is listed below:
1. it's a .net/c# class library project, and it is a typical WCF service application.
2. use local machine IP and current date time to hash out a random seed number. This will make the whole process more random.
3.
use .NET build-in random generator to generate a serial random number
to do many independent Monte Carlo experiments. The idea is generating
a ranged random point and see whether it is located within a circular
area with some fixed diameter.
PiCalcServer Core Logic
1 public PiCalcResult Calc(UInt64 scale)
2 {
3 // use system time and machine ip to hash out the seed for random number generation
4 long ticks = DateTime.Now.Ticks;
5 IPHostEntry host = Dns.GetHostEntry(Dns.GetHostName());
6 ticks += host.AddressList[0].GetAddressBytes()[0];
7 ticks += host.AddressList[0].GetAddressBytes()[1] * 256;
8 ticks += host.AddressList[0].GetAddressBytes()[2] * 256 * 256;
9 ticks += host.AddressList[0].GetAddressBytes()[3] * 256 * 256 * 256;
10 Random rand = new Random((int)ticks);
11
12 // result init
13 PiCalcResult calcResult = new PiCalcResult();
14 calcResult.InCount = 0;
15 calcResult.OutCount = 0;
16
17 // do Monte Carlo exercise
18 Int32 x = 0, y = 0;
19 for (UInt64 i = 0; i < scale; ++i)
20 {
21 x = rand.Next(RAND_RANGE_MIN, RAND_RANGE_MAX);
22 y = rand.Next(RAND_RANGE_MIN, RAND_RANGE_MAX);
23
24 UInt64 d = (UInt64)Math.Round(Math.Sqrt((double)x * (double)x + (double)y * (double)y));
25 if (d <= RAND_RANGE_MAX)
26 {
27 calcResult.InCount++;
28 }
29 else
30 {
31 calcResult.OutCount++;
32 }
33 }
34
35 return calcResult;
36 }
After the WCF service implementation, you should deploy it to the Windows Hpc Cluster. This includes two steps:
1. Compose a service configuration file.(The PiCalcService.config file is contained in the
source code package, see [3] for detailed fields explanation)
<microsoft.Hpc.Session.ServiceRegistration>
<service assembly="%CCP_HOME%App\PiCalcServer.dll"
contract="PiCalcServer.IPiCalcServer"
type="PiCalcServer.PiCalcServer"
architecture="x86">
<environmentVariables>
<add name="PATH" value="%MY_SERVICES_HOME%Bin"/>
</environmentVariables>
</service>
</microsoft.Hpc.Session.ServiceRegistration>
2. Copy Bin/Conf files to each compute node
clusrun xcopy /y \\FileServer\PrjDir\PiCalcServer.dll "c:\Program Files\Microsoft HPC Pack\App"
clusrun xcopy /y \\FileServer\PrjDir\PiCalcServer.Config "c:\Program Files\Microsoft HPC Pack\ServiceRegistration"
To see whether the deployment is successful:
1.
go to StartMenu -> Hpc Pack -> Hpc Cluster Manager ->
Diagnostics -> Tests -> SOA -> SOA Service Configuration
Report and run this test.
2. Diagnostics -> Test Results. It will
list the detailed results of test in step 1. If your deployment is
successful, the report will tell you the service
name/bin/interface/implementation and target arch.
The other part of the solution is the client application. - It's both WCF client application and Hpc cluster application:
1. It's a normal .Net/C# console/winform application, which will call remote WCF service and Hpc scheduler service.
2.
As normal WCF client application, you should use svcutil tool to
generate the wcf client proxy class(async style is used here) and add
it to your client application project.
svcutil PiCalcServer.dll
svcutil *.wsdl *.xsd /async /language:C# /out:PiCalcServerProxy.cs
3.
When developing SOA application, you should create session with Hpc
cluster, get the SOA broker service endpoint from the session and call
WCF service from this endpoint.
4. Your whole client logic looks
like: create session, divide computation task, send requests, collect
the partial results from various sub-tasks and compute the final result.
PiCalcClient Core Logic
1 //
2 // Create a session object that specifies the head node to which to connect and the name of
3 // the WCF service to use.
4 //
5 SessionStartInfo ssInfo = new SessionStartInfo(schedulerHost, serviceName);
6 ssInfo.Username = clusterHeadUser;
7 ssInfo.Password = clusterHeadPassword;
8 ssInfo.ResourceUnitType = Microsoft.Hpc.Scheduler.Properties.JobUnitType.Core;
9 ssInfo.MinimumUnits = 2;
10 ssInfo.MaximumUnits = 1000;
11
12 Console.WriteLine("Creating a session ...");
13 using (Session session = Session.CreateSession(ssInfo))
14 {
15 Console.WriteLine("Session creation done!");
16 Console.WriteLine("Session's Endpoint Reference:{0}", session.EndpointReference.ToString());
17 int nodes = session.ServiceJob.AllocatedNodes.Count;
18
19 //
20 // Binds session to the client proxy using NetTcp binding (specify only NetTcp binding). The
21 // security mode must be Transport and you cannot enable reliable sessions.
22 //
23 System.ServiceModel.Channels.Binding myTcpBinding = new NetTcpBinding(SecurityMode.Transport, false);
24 myTcpBinding.ReceiveTimeout = maxTimeOut;
25 myTcpBinding.SendTimeout = maxTimeOut;
26 PiCalcServerClient calcServerClient = new PiCalcServerClient(myTcpBinding, session.EndpointReference);
27 calcServerClient.ClientCredentials.Windows.ClientCredential.UserName = wcfClientUser;
28 calcServerClient.ClientCredentials.Windows.ClientCredential.Password = wcfClientPassword;
29
30 //
31 // There is no way to get the accurate allocated core count, just assume each node has avgCoresPerNode cores.
32 //
33 timeBegin = DateTime.Now;
34 int taskCount = session.ServiceJob.AllocatedNodes.Count * avgCoresPerNode;
35 asyncCalcCount = taskCount;
36 for (int i = 0; i < taskCount; i++)
37 {
38 UInt64 scale = totalScale / (UInt64)taskCount;
39 calcServerClient.BeginCalc(scale, AsyncCalcCallback, new CalcReqContext(calcServerClient, i));
40 }
41 asyncCalcDone.WaitOne();
42 Console.WriteLine("All sub tasks done!");
43 timeEnd = DateTime.Now;
44
45 calcServerClient.Close();
46 Console.WriteLine("========================================");
47 Console.WriteLine("totalIn:{0}, totalOut:{1}", totalIn, totalOut);
48 Console.WriteLine("the mc pi value:{0}", (totalIn + 0.0) / (totalScale + 0.0) * 4);
49 Console.WriteLine("the total time used:{0}", (timeEnd.Value.ToFileTime() - timeBegin.Value.ToFileTime()) / (10 * 1000 * 1000));
50 Console.WriteLine("Please enter any key to continue...");
51 Console.ReadLine();
52 }
NOTE:
1. this is just the core logic, for full code, see the source code package.
2.
the serviceName is defined as the service configuration file name
without the .config extension. schedulerHost is defined as the machine
name of the head node of the Hpc cluster.
3.
clusterHeadUser/clusterHeadPassword is used to login to head node to
submit jobs, while wcfClientUser/wcfClientPassword is used to login to
compute node to access WCF services, both of them should be explicitly
set in a real cluster environment. The two account are usually the same
in most cluster environments, but not the same as your domain account
that is used to login corp network.
4. if "Can't find file -
Microsoft.Hpc.Sheduler.Store.dll" exception raised when running the
client, install Windows HPC Pack Client Utilities. Only Windows HPC
Pack SDK is not enough for developing/running Hpc applications.
5. it takes some time(about 1 minute in my env) to establish session with Hpc cluster.
6.
you can see the job status, node head map etc in Hpc Cluster Manager
while the application is running. The cluster manager is also helpful
for investigation when error encountered.
Typical Client Application Console Output
Creating a session ...
Session creation done!
Session's Endpoint Reference:net.tcp://dit840-013:9087/broker/206
Sub Task[3] Done = In:263541326, Out:72002994
Sub Task[2] Done = In:263543455, Out:72000865
......
......
Sub Task[10] Done = In:263544094, Out:72000226
Sub Task[15] Done = In:263534741, Out:72009579
All sub tasks done!
============================================================
totalIn:4216696722, totalOut:1152012398
the mc pi value:3.14168387800455
the total time used:1244
Please enter any key to continue...
You can increase the exercise count to get more precise PI value, but it will consume more time.
full source code download
http://code4cs.googlecode.com/files/McPiCalc.zipSome Personal Observations:1.
Windows Hpc Cluster provides convenient management tools and utilities,
which makes deploying/managing middle-level(several hundreds of nodes)
of computing cluster very easy.
2. Windows SOA programming model greatly simplified the development process of some specific kind of Hpc applications.
3.
Windows Hpc build-in security feature add some complexity of the
develop/deploy process and potential performance downgrade occurs if
large amount of data movement happens among nodes. But these overhead
results very little gains - dose security problem really matter in a
private computing cluster?
4. Hpc SOA programming model is very
similar to so called "web server farm" architecture. But as a general
programming platform, the head/broker node fail-over problems are not
solved in a very elegant and scalable way.
5. Windows Hpc
scheduler is too general purpose, too centralized, which making the
session creation very very time-consuming. This means that it takes SOA
application much time to do init work.
6. Although it is called
"SOA" and it uses popular "WCF" technology, the Hpc SOA architecture is
completely not suitable for web applications(especially for scaling
purpose). Microsoft describes the target scenario as "interactive
application", which mainly includes Monte Carlo Problems, Ray Tracing,
Excel Calculation Add-in and BLAST Searches.
[Reference]
[1]
http://www.cs.mu.oz.au/498/notes/node39.html[2]
http://computing.llnl.gov/tutorials/parallel_comp/#Models[3]
Microsoft Official SOA doc[4]
submit jobs to head node in another AD[5]
Call WCF services hosted on other nodes with specific client credentials(domain username/password)[6]
Monte Carlo Method
Part I - General Steps and Principles
1. Define a Clear Goal
- what's the purpose? to know how, to own components, to modify and extend?
- results driven, what's the final outcome?
2. Know it as Client User
- read user manual
- get an overall big picture
- know what it can do and what can't
- what is it suitable for and what not
3. Thinking Before Reading
- what if you design the whole system?
- what's the core challenge?
- write down your questions and concerns
- read with questions
4. Know the Architecture/Components
- know the overall architecture first
- devide the whole system into components
- identify what to focus, what to ignore
- use build file to identify component dependencies
- try building it
5. Read Specific Detail
- make a SMART plan
- focus on core logics
- ignore trivial parts
- start from entry point: main/wmain funtion
- identify thread creation/termination
- identify the main loop (server application)
- identify core data structure
- identify operations on core DS
- noting/documenting/charting down while reading
6. Producing Results
- big picture from user's perspective
- big picture from dev's perspective
- arch/logic for individual component
- summerize core data structure
- practice: build/deploy/use/debug/modify/hack the system
- comments on the implementation(what's good, what's bad, what learned)
7. Misc Tips
- read doc(user manual, design doc) before code
- get core data structure doc first if possible
- read the code in both static and dynamic(debug) way
- debug/step into a particular execution scenario
- read code iteratively, don't deep into detail in the beginning
- use interface/contract to separate concerns
- overall -> detail, but just detail on small areas
- leverage code comprehension tools to get static information
- print out core code and read them on real papers
- you can write unit test / use case for the software
- consider refactoring the code(kind of active reading) if unit tests are given
- if it's really hard to read, rewrite it!
Part II - Tools
One
of the most frequent activities when reading code is navigating among
those source codes. So tools that help navigating are important to
improve the reading efficiency. Here are some popular tools for this
purpose:
1. Source Code Index Generator
cscope http://cscope.sourceforge.net/
ctags http://ctags.sourceforge.net/
2. GUI Frontend for Index Generator
kscope http://kscope.sourceforge.net/
cbrowser http://cbrowser.sourceforge.net
3. Code Index Generating and Navigating
Source Navigator http://sourcenav.sourceforge.net/
Source Insight http://www.sourceinsight.com/
LXR http://lxr.linux.no/
4. For C Language ONLY
CXRef http://www.gedanken.demon.co.uk/cxref/
cflow http://www.gnu.org/software/cflow/
[Reference]
1. Code Comprehension Tool http://www.cs.ubc.ca/~murphy/cs319/index.html
2. Code Doc Generating Tool http://www.stack.nl/~dimitri/doxygen/links.html
3. Survey on Code Comp. Tools http://www.grok2.com/code_comprehension.html
4. Tips for Code Reading http://c2.com/cgi/wiki?TipsForReadingCode
5. Read VS Rewrite http://www.joelonsoftware.com/articles/fog0000000069.html
The latest issue of ACM Queue Magazine has an article titled
"Scaling in Games & Virtual Worlds", which talks about SUN's efforts on the scalability problem in Online Game and Virtual World systems(A.K.A.
MMORPG).
The author is a Distinguished Engineer in SUN, who had been involved in the scalable online game platform:
DarkStar for about 2 years. He summarized many interesting points in this paper.
I - Unique Characteristics of Online Game:1. Being Fun is the Prime Directive
2. Should be Easy to Learn, but Hard to Master
3. Client is Fat and Powerful
4. Network Latency is a Critical Factor
II - Current State of the Art of Techniques in Game & World:1. Network protocol is simple and should be hold in one network packet
2. Server is artictured as doing small things fast with large scale concurrent requests
3. Predict & Adjust tricks are used to hide network latency
4. Servers act as player
interaction hub and
arbiter to avoid client cheating
III - Current Scaling Strategy:1.
Geographic Decomposition - Partitioned by Geography Information. For
example, all activities in an island/country are processed by one
physical server
2. Sharding - create noninteracting copies of parts of a world, players can only interact players within the same shard
The problems that SUN found are:
1. Chip Architecture is switching to multi-core style, which is suitable for parallel tasks
2.
Server side game tasks are essentially parallelized: one thread for one
player, interactions among users are relatively small, compared with
all activies
3. Currently, most game implementation can't expoit parallel characteristic of multi-core cpu
SUN's
proposal for next generation game development:
1. Hide concurrency and distribution for game logic developer
2. Server is architectured as event-based(maybe
SEDA like, I guess) and one client request served by one server task
3. Communication is abstracted as Channel, physical address is hiden from developer
4. Data/State is moved from server memory to global data service, which is based on simplified DBMS technologies
5. Since little state in task logic and server mem, task imigration is possible, which enables hot load balance
The last valubale word in this paper: the
darkstar is open sourced.
I
think the observation and current tech description are the most
interesting parts, the solution part is not that exciting, since the
idea is somewhat not so innovative.
One personal observation
is that, general platform will improve productivity of application
developer, but with the price of performance. So, what's future of the
darkstar project? It may be useful for small scale game, but when your
players grows to M or 10M level, productivity is not the primary
problem for dev team, but performance/stability is. So large scale
applications always use ad-hoc solutions. But if darkstar is not used
in REALLY LARGE scale environment, how can we evaluate the value of
this system, how can we say the project succeed or fail?
1. Components of Windows HPC Server- an x64 version based Windows Server 2008 Core
- a new and faster MS-MPI, which is based on MPICH2
- a scalable job scheduler for creation, execution and monitoring
- windows deployment service based provisioning service
- UI, cmd utilities and PowerShell cmdlets for cluster managing and monitoring
2. Architecture of Windows HPC Server
Nodes
- Head Node
- Compute Node
- WCF Broker Node
Networks
- Interface
a. Winsock Direct
b. NetworkDirect
c. TCP/IP interface
- Physical
a. Ethernet
b. InfiniBand
c. Myrinet
Web Interface for Scheduling Service
- Basic Profile Web Service, open standard by OGF
3. What can we do with it?Application Types- Batch/Parallel : MPI
- Interactive/SOA : HPC WCF
Job Types- Parallel Job: same bin, communication amnong tasks
- Parametric Sweep Job: same bin, no communication among tasks
- Task Flow Job: different bin, dependices(DAG) among tasks
System Management- Cluster Configuration
- Monitoring & Reporing & Auditing
- Node Provisioning
- Diagnostics
Job Management- Job Scheduler Configuration
- Job Creation and Definition
- Job Execution and Monitoring
[Reference]
- management
HPCS Command Tools HPCS PowerShell Cmdlets- programming
HPCS Job Scheduler Web Interface SOA Programming on HPCSMPI Programming on HPCS- community
Hpc Show @ Channel9HPCS @ MSDNHPCS @ TechNet