Welcome to MSDN Blogs Sign in | Join | Help

One of the customer requests we had on Windows Server 2003 R2 was the ability to increase the maximum number of quotas in the FSRM Quota report. The current limit is 1000 quotas maximum. How do I set this limit to 20,000 for example?

Here is how - assuming that you have Powershell 1.0 installed, just run this script:

$gm=New-Object -com Srmsvc.SrmGlobalStoreManager

$s=[xml]$gm.GetStoreData("Settings", "ReportSettings")

$s.Save("\fsrm-reports-backup.xml")

$s.root.MaxQuotas="20000"

$gm.SetStoreData("Settings", "ReportSettings",$s.get_InnerXml())

One note, however. The script above uses an COM object that is internally used by FSRM to store its internal configuration. Use this COM object as your own risk as this is not a published API.

 

You might remember Statcounter.com as the site that came in the news a few weeks ago when Bing overtook Yahoo, due to the sudden interest immediately after launch. Back then, as many predicted, it didn’t last long, as people quickly switched back to their old search habits.

But I just checked the Statcounter site and noticed that Bing just got ahead Yahoo again. Granted, this time in North America region only. On last Friday (26 Jun) Bing had 8.74% query share, compared with Yahoo with 8.55%.

 

clip_image001

 

Of course, this situation probably won’t last long either, but there are a few observations to make:

1) The two graphs are getting closer and closer, and it’s probably safe to predict that they will start intersecting a lot more often in the next weeks. The same trend can be seen in the WW numbers as well.

2) There is a curious pattern when Bing traffic is higher on Wednesday and Friday (not sure why?)

Either way, it’s an interesting area to watch out in the next weeks to come …

Check it out:

http://www.bing.com/community/

Nice layout & content organization. Much better organized when comparing it with the typical forums/community sites that you see on Microsoft sites or MSDN (or other non-Microsoft ones)

Just found a new video demonstrating the new classification feature in Windows Server 2008 R2. Enjoy!

Special thanks to dawho1 who posted the video.

In my personal opinion, BING = Bing Is Not Google (to continue the tradition of recursive acronyms)

And it’s true. Bing attempts to be a decision engine, not just another search engine.

It is interesting that Google itself has an “I’m feeling Lucky” button, but using it simply means that you let the search engine make the decision for you. True, in some cases, this is good enough – but as a former Google user I don’t recollect using that button too much. I much prefer drilling into the results myself.

It is an eye opening experience to sit at our FCI booth and see customer after customer telling us their biggest problem with managing file servers today: lots of old data sitting on their file servers. When I tell them how that our classification feature solves this problem, it something that always brings a sincere smile on their face.

Creating a file expiration policy is super easy:

  1. Open FSRM management console
  2. Go to the "File Management Tasks" view
  3. Create a file management task to expire your files. You need to specify the following settings:
    • The source directory of files to be expired
    • The target directory (containing expired files)
    • A condition for expiration (such as files created ten years ago (or files that were not modified in the last year)
    • A schedule (say, weekly)

Here is an example of the Action tab:

image

That's it. One simple task to solve the "old files lying around" problem.

Now, a note to be added: the effect of this command will be the to move these expired files into the target location (while trying to keep the original path). One effect though is that the original files will "disappear" from the original location, which in rare cases it might cause confusion to the end users. If you are concerned about that problem, the solution is easy: as part of the file management task you can run your own custom "move" command which leaves in the original path a "stub" (a text file) explaining where the files have gone. Or, you can replace the original files with a symbolic link (or some other form of link) pointing to the target location.

To do this, you need to do a few things:

  1. In the File Management Task dialog, in the Action tab, change the task type from "File expiration" to "Custom". Several more options appear (such as the path to the custom script, the account the script will run under, etc)
  2. Add the [Source File path] macro to the script arguments
  3. Run the command as "local system" (so it will be able to perform the move operation).

Here is an example of how the task will look like:

custom1

The move_file_and_leave_link.cmd (located in c:\windows\system32) file is simple:

if exist "c:\protected\%~pnx1" @echo Target file already exists! & goto :EOF
md "c:\HSM\%~p1"
move %1 "c:\protected\%~p1"
mklink %1 "c:\protected\%~pnx1"

The last command (mklink) has the role of creating a symbolic link from the source to the target. That’s it!

More and more posts appear on File Classification - it's hard to keep track of them. Here is a few sample (if I missed anyone, sorry - it s not intentional)

(more to come)

As promised, I will add more technical information about File Classification Infrastructure (FCI) in a series of technical posts devoted to FCI architecture and internals.

We started the classification project with an ambitious vision: to provide a simple, open and comprehensive way to organize data at enterprise level, primarily for files stored on shares. The classification process has to be simple in order to be understood intuitively by any administrator under pressure of managing millions, maybe billions of files. It needs to be open (i.e. extensible) as we need to deal with a huge matrix of file formats, classification methods and types of management policies on top of classification. Finally, it needs to be comprehensive - by installing the latest version of Windows, administrators should be able to get enough value “out of the box” to get started in organizing their information and apply policies right away. 

Overview

Here is a quick introduction in classification, if you don’t know anything about it yet (By the way, a detailed overview of the problem space is presented here, and an functional overview is presented here. There is also a Microsoft Technet step-by-step guide for enabling classification and file management tasks here.)

In essence, classification allows you to assign metadata to every file on a file server. Specifically, classification deals with actionable file metadata, the kind of metadata that can be used to drive file management policies (such as encryption, protection, retention, expiration, etc). This metadata is just a set of classification properties (in the form Name=value) that is attached to files in some form or another, either through explicit or automatic classification. It is important to note that you (the user) are the one responsible in designing your own classification schema for your solution. By default, Windows does not provide (nor enforce) a predefined set of classification properties available by default.

Classification information can be “produced” in several different ways:

  1. Through automatic classification, where you need to define classification rules that indicate which property value will be assigned to each affected file. Classification rules have a scope (a set of affected namespaces) and also could have a “filtering condition” (think of it as a WHERE clause) for the files being classified. Finally, a rule also specifies the property to be assigned (so one rule is associated with only one property).
  2. Classification properties could be already residing in the file, when supplied by other property storage systems, or even by previous runs of the automatic classification mechanism. For example, an Office file downloaded from a Sharepoint document library already has extra embedded metadata that could be viewed as classification properties.
  3. You can also explicitly “set” a certain classification information using a certain API from scripts or LOB applications (using IFsrmClassificationManager::SetFileProperty). Since this mechanism is not rule-based, this method is subject to certain applicability limitations, depending on the scenario. (BTW, you can think of the “Set API” called to set a property on a file as equivalent with running a temporary classification rule that is created once, ran once against the file, and then deleted). 

Classification can be consumed in several ways:

  1. Classification properties can be used directly (from scripts or applications) through “Get” APIs (like IFsrmClassificationManager::EnumFileProperties or IFsrmClassificationManager::GetFileProperty). Such APIs will provide the correct set of properties depending on the set of rules defined ahead of time and/or depending on existing metadata that is already attached to the file.
  2. Classification properties could be used indirectly, as a filtering criteria, to selectively execute File management tasks – which are tasks that perform a certain action against a desired set of files. One example: you can use a file management task to encrypt any file that has been classified as “Confidentiality=High”.
  3. Classification properties are also used when generating “Files By Property” reports, such as reporting all files on a server grouped by their Confidentiality information.

Of course, producing/consuming always occurs in parallel – which makes classification look a little bit complex at the first sight.

The classification pipeline

The whole process is simple to understand if you use the classification pipeline as a simple underlying concept. The pipeline defines how the classification process is executed in each combination of the scenarios above that produce/consume classification:

image

As you can see, classification works in five stages on a file-by-file basis. Let’s drill down into each of them:

1) Discover Files

In this phase we “assemble” a list of files that will be run through the pipeline.

  • In the case of automatic classification, we will create a snapshot of the volume and scan the snapshot for any files that might have been changed or added since the last classification (FCI offers a set of heuristics for incremental classification, so we won’t need to rescan the whole world every time we are running an automatic classification job).
  • In the case of executing the EnumFileProperties()/GetFileProperty()/SetFileProperty() APIs, we will only provide the file in discussion.

After this step, we generate a stream of “property bag” objects (one per file) that is passed through the pipeline. You can think of the property bag object as an in-memory object holding the set of “Name=value” properties for a given file. Of course, NTFS metadata (file name, attributes, timestamps, etc) is extracted at this stage and inserted in the property bag.

2) Extract classification (retrieve existing classification from files)

In this phase, we try to extract existing properties from files, that have been previously stored by other FCI-unaware applications such as Microsoft Word or Sharepoint. Some of these mechanisms might parse the file content (such as in the case of Office files) or use other techniques (such as discovering classification information that has been “cached” from a previous classification run).

One important thing to note is that existing classification data might be extracted from different sources, and therefore we have a precise process of reconciliation of potentially conflicting metadata. This process is called aggregation – and it is one of the central concepts in the pipeline architecture. 

Extracting/storing property values is done through dedicated software components called storage modules. Each of these storage modules implements a standard COM interface to essentially receive a “stream” of property bags (from the discovery/scanning process) and enhance them in the way through extracted property values.

In Windows Server 2008 R2 we ship several storage module by default:

  • An Office 2007 storage module that persists properties within Office 2007 file formats (DOCX, XLSX, PPTX, etc) in a format compatible with Sharepoint.
  • Similarly, we ship a Office XP-2003 storage module for files in DOC, XLS, PPT, etc format.
  • The System Storage module (that persists properties in an alternate data stream attached to the file. This storage module has a dual role. First, it is used as a generic storage module for files that do not support embedded metadata (such as plain TXT files, or files with an unknown extension) and secondly, as a cache for previous classification runs on the same file. This second functionality is particularly important – as it guarantees that you have a way to cache properties for files which did not change in the meantime (assuming that the rules only act on file metadata or content). We will address the caching concept in detail later.

One more thing to note – you can also define your own storage modules that can store metadata either in the file content, or in an external database. It is recommended, though, that any storage module that parses file content should live in its own process with tightened privileges for security reasons (as parsing file content could be subject to buffer overruns).

3) Classify files

At this stage, we identify which rules affect the current file being processed, and then run each rule to produce new classification property values. Again, we use aggregation to reconcile potentially conflicting values coming from different rules (and potentially from the previous stage as well).

Rules are implemented with the help of a different type of pipeline modules called classifiers. A classifier implements a similar interface as a storage module – it accepts a stream of incoming property bags, and it outputs a stream of “enhanced” property bags containing extra properties discovered during classification.

A classifier can look at the file content, the file system metadata, external storage, or all of them to determine the new property values to be assigned. There are many “flavors” of classifiers which we will detail later.

A pipeline could contain one or several different classifiers. There are two classifiers shipped by default with Windows:

  • The Folder classifier – which simply assigns properties to files depending on their location on the disk, and
  • The Content classifier – which assigns properties to certain files depending on whether certain strings or regular expressions are successfully matched against the file content.

The content classifier can also operate on non-text files (such as Office documents) through the IFilter technology, which basically offers a way to “convert” a certain Office file into a simplified pure text format. One more exciting thing to note – Windows Server 2008 R2 ships with an IFilter-based OCR engine, so the content classifier will also work against text contained in TIFF scanned images! (If you have seen the TechEd keynote demo you know what I mean :-)

4) Store classification properties

This is the natural counterpart of step 2 and is implemented, not surprisingly, by the same storage modules. In this stage the basic philosophy is the following: we store properties in as many locations as possible so we can reconstruct it later.

Of course, we store the aggregated property values as mentioned before.

5) Apply policy based on classification

This is the most interesting of the pipeline steps, and the final stage in the classification pipeline. After all, the whole point of classification is to run some policies against the classification data. Classification in itself is not interesting – what’s truly interesting is the fact that you can run declarative policies on top of classification data.

Policies come in several flavors:

  • Expiration File Management Tasks – which have the role of selectively moving files, depending on their age and/or classification information
  • Custom File Management Tasks – which can execute a custom script against a certain category of files. Examples could include: migrating certain documents to Sharepoint, custom expiration, custom retention policies, encrypting/protecting sensitive data, HSM policies, deduplication policies, etc. Sky is the limit for your imagination of what can be implemented in a custom file management task
  • Automatic Classification – of course, running an automatic classification job can be viewed of a task of (1) saving classification properties for newly-classified files and (2) generating a report containing a summary of the job results.
  • The “Files By Property” report – this is a simple report containing a basic statistical distribution of files classified against a certain property.

That’s it for now – in next posts we will drill down even more in each of these phases.

(update – small typos)

Today it is an exciting day for all of us – we just announced our new classification platform at TechEd! It is a culmination of many months of team work, mostly in secret until now (which also explains also the silence on this blog).

For an overview of FCI (the File Classification Infrastructure) in Windows Server 2008 R2 you can start from here:

http://blogs.technet.com/filecab/archive/2009/05/11/windows-server-2008-r2-file-classification-infrastructure-managing-data-based-on-business-value.aspx

http://blogs.technet.com/filecab/archive/2009/05/11/classifying-files-based-on-location-and-content-using-the-file-classification-infrastructure-fci-in-windows-server-2008-r2.aspx

I will follow-up with an in-depth blog post about technical architecture around FCI. There is a ton of new information just waiting to be released sand we are very excited to talk about it.

Meanwhile, if you are at TechEd, don’t forget to stop by at our booth – we are in the “Core Infrastructure/WSV” area (look for the “File Services” booth)

Tom’s Hardware does it again!

http://www.tomshardware.com/reviews/256gb-samsung-ssd,2265-8.html

What is notable is (again) the relative power consumption between various SSDs – ranging from 0.5W (for Intel X25-E/M) to a whopping 4.6 W (for Soliddata).

Here is the summary table:

Manufacturer

Intel

Intel

Mtron

Mtron

Family

X25-E

X25-M

SSD MOBI

SSD MOBI

Model Number

SSDSA2SH032G1GN

SSDSA2MH080G1

MSD-SATA3525-032

MSD-SATA3525-064

Capacity

32 GB

80 GB

32 GB

64 GB

Other Capacities

64 GB

160 GB

16, 64, 128 GB

16, 32, 128 GB

Rational Speed (RPM)

Flash SLC

Flash MLC

Flash SLC

Flash SLC

Platter

-

-

-

-

Form Factor

2.5"

2.5"

2.5"

2.5"

Interface

SATA/300

SATA/300

SATA/150

SATA/150

Cache ( MB)

16 MB

16 MB

N/A

N/A

NCQ

Yes

Yes

No

No

Height

6.5 mm

6.5 mm

9.5 mm

9.5 mm

Weight

78 g

80 g

60 g

84 g

MTBF

2 Million Hours

1.2 Million Hours

1 Million Hours

1 Million Hours

Operating Temperature

0-70°C

0-70°C

0-70°C

0-70°C

Specified Idle Power ( Low Power)

0.06 W

0.06 W

N/A

N/A

Measured Idle Power ( Low Power)

0.5 W

0.5 W

1.1 W

1.1 W

Operating Shock ( 2 ms, read)

1000 G

1000 G

1500 G

1500 G

Web site

Intel

Intel

Mtron

Mtron

Warranty

   

3 years

3 years

 

       

 

Manufacturer

Samsung

Samsung

Solidata

Solidata

Super Talent

Family

SSD SATA 3.0Gbps 2.5"

PB22-J

X Series

X Series

MasterDrive OX

Model Number

MCCOE64G5MPP

MMDOE56G5MXP-0VB

X1-64

X2-128

FTM64GL25H

Capacity

64 GB

256 GB

64 GB

128 GB

64 GB

Other Capacities

32 GB

128 GB

128, 256 GB

256 , 512 GB

16, 32, 128 GB

Rotational Speed (RPM)

Flash MLC

Flash MLC

Flash MLC

Flash MLC

Flash MLC

Platter

-

-

-

-

-

Form Factor

2.5"

2.5"

2.5"

2.5"

2.5"

Interface

SATA/300

SATA/300

SATA/300

SATA/300

SATA/300

Cache (MB)

No

128 MB

N/A

N/A

N/A

NCQ

No

Yes

No

No

No

Height

9.5 mm

9.5 mm

9.5 mm

9.5 mm

9.5 mm

Weight

74 g

82 g

98 g

98 g

66 g

MTBF

2 Million Hours

1 Million Hours

2 Million Hours

2 Million Hours

1 Million Hours

Operating Temperature

0-70°C

0-70°C

0-70°C

0-70°C

0-70°C

Specified Idle Power ( Low Power)

0.24 W

0.19 W

3 W

3 W

N/A

Measured Idle Power (Low Power)

0.21 W

0.15 W

4.6 W

4.7 W

1.4 W

Operating Shock ( 2 ms, read)

1500 G

1500 G

N/A

N/A

1500 G

Web site

Samsung

Samsung

Solidata

solidata

Super Talent

Warranty

   

2 years

1 year

 

 

         

 

Manufacturer

Mtron

Mtron

PhotoFast

PhotoFast

Model Number

SSD PRO

SSD PRO

G-Monster

G-Monster-V2

Family

MSD-SATA3525-032

MSD-SATA3525-032

   

Capacity

32 GB

32 GB

32 GB

128 GB

Other Capacities

16, 64, 128 GB

16, 64, 128 GB

64, 128 GB

 

Rotational Speed (RPM)

Flash SLC

Flash SLC

Flash SLC

Flash MLC

Platter

-

-

-

-

Form Factor

2.5"

2.5"

2.5"

2.5"

Interface

SATA/150

SATA/150

SATA/300

SATA/300

Cache (MB)

N/A

N/A

N/A

N/A

NCQ

N

N

N/A

N/A

Height

9.5 mm

14.8 mm

9.5 mm

9.5 mm

Weight

82 g

196 g

92 g

82 g

MTBF

1 Million Hours

1 Million Hours

1.75 Million Hours

1.75 Million Hours

Operating Temperature

0-70°C

0-70°C

-10-70°C

-10-70°C

Specified Idle Power ( Low Power)

N/A

N/A

N/A

N/A

Measured Idle Power (Low Power)

1.1 W

1.5 W

0.8 W

2.4 W

Operating Shock ( 2 ms, read)

2000 G

2000 G

1500 G

1500 G

Website

Mtron

Mtron

PhotoFast

PhotoFast

Warranty

5 years   

5 years

1 year

1 year

In the past, I presented various ways to browse these mysterious device objects called "shadow copies". Shadow copies are static images in time (snapshots) of your volume contents, at some point in the past. These shadow copies are volumes on their own, with a file system namespace accessible through the regular Win32 APIs such as FindFirstFile/FindNextFile. For example the existing sample code in MSDN for these APIs that enumerates files on a real volume will work just fine on a shadow copy volume. In fact, that's how all backup applications are accessing shadow copy content today.

So, if these devices are real volumes, how can we view them in Explorer? It turns out that you can't view them by default - this is simply because these are volumes without an associated drive letter or root mount point. However, in XP or Windows Server (and Vista), you can still access these shadow copies by assigning them a drive letter using utilities like DOSDEV, or by doing tricks with the FOR command, etc. 

Now, if you have Vista, it is much simpler to access shadow copy devices directly from Explorer. The trick is to use a new feature called Symbolic Links: to access the contents of a shadow copy as a "directory", simply create a symbolic link to the device. Vista also includes a convenient command-line tool called MKLINK.EXE to create symbolic links, which makes this operation very easy.

Here is an example of accessing the contents of a shadow copy device. The first step is to enumerate shadow copies on the machine, using the VSSADMIN LIST SHADOW command. This will give us the devices and also a creation timestamp.

C:\Windows\system32>vssadmin list shadows |more
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001-2005 Microsoft Corp.

Contents of shadow copy set ID: {c72c8036-d563-43c8-b351-1994dfad580a}
   Contained 1 shadow copies at creation time: 2/23/2008 9:59:04 AM
      Shadow Copy ID: {f3727808-bea6-4b59-bef7-6849ee721709}
         Original Volume: (C:)\\?\Volume{3e83355f-7c0e-11dc-b416-806e6f6e6963}\
         Shadow Copy Volume: \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy4
         Originating Machine: Adi-Game-PC
         Service Machine: Adi-Game-PC
         Provider: 'Microsoft Software Shadow Copy provider 1.0'
         Type: ClientAccessibleWriters
         Attributes: Persistent, Client-accessible, No auto release, Differential, Auto recovered

Contents of shadow copy set ID: {0bf23f77-8461-4869-b391-da4d213940a5}
   Contained 1 shadow copies at creation time: 2/24/2008 4:00:24 AM
      Shadow Copy ID: {87d59b22-9e84-4d0d-81ca-2b565d6f7e55}
         Original Volume: (C:)\\?\Volume{3e83355f-7c0e-11dc-b416-806e6f6e6963}\
         Shadow Copy Volume: \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy5
         Originating Machine: Adi-Game-PC
         Service Machine: Adi-Game-PC
         Provider: 'Microsoft Software Shadow Copy provider 1.0'
         Type: ClientAccessibleWriters
         Attributes: Persistent, Client-accessible, No auto release, Differential, Auto recovered

C:\Windows\system32>mklink /d c:\shadowcopy \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy5\
symbolic link created for c:\shadowcopy <<===>> \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy5\

C:\Windows\system32>dir c:\shadowcopy
Volume in drive C has no label.
Volume Serial Number is 4A02-860C

Directory of c:\shadowcopy

12/14/2007  01:46 AM    <DIR>          Asi
01/15/2008  12:56 AM    <DIR>          bin
12/13/2007  11:59 PM    <DIR>          debuggers
12/13/2007  11:55 PM        17,644,031 dir.log
01/14/2008  11:41 PM    <DIR>          Downloads
01/01/2008  05:50 PM    <DIR>          dumps
12/30/2007  11:43 PM    <DIR>          garbage
01/08/2008  11:13 PM    <DIR>          Garmin
10/15/2007  09:03 PM    <DIR>          Intel
12/30/2007  11:59 PM    <DIR>          Program Files
01/27/2008  01:32 AM    <DIR>          Program Files (x86)
01/15/2008  12:17 AM    <DIR>          test
01/30/2008  06:52 AM    <DIR>          Users
12/14/2007  01:55 AM    <DIR>          WinDDK
02/13/2008  05:23 AM    <DIR>          Windows
02/21/2008  10:43 PM    <DIR>          Work
               1 File(s)     17,644,031 bytes
              15 Dir(s)  147,657,666,560 bytes free

 

That's it. Now I have a persistent link called c:\shadowcopy which points to the contents of the shadow copy device - which is the image of my C:\ drive at 4:00 AM (this is when my latest system restore point was created).

A new notes,though:

1) Make sure you use the "/D" option in MKLINK so you create a directory-based, not a file-based symbolic link

2) Make sure you append a backslash to the shadow copy device in the MKLINK command (marked in red above)

 

If this made you interested about shadow copies - note that you can create, enumerate and delete shadow copies programatically using either VB scripts that use the WMI API for shadow copy administration, or by using the VSS API (documented publicly on MSDN). Sample code is available in the Platform SDK as well.

I love this - if you have kids, and if you are a geek, you'll understand.

http://gizmodo.com/photogallery/microserveces08/1000446145

The title sums it all.

 

P.S. And, as a Christmas bonus, here is a nice chess puzzle.

image

In the above diagram you must add the two missing kings in such a way that White, who is on the move, can deliver immediate mate, i.e. mate in one move.

[source: ChessBase]

A review in ComputerWorld features the new Samsung's 64 GB Flash drive:

Samsung rates the drive with a read speed of 100MB/sec and write speed of 80 MB/sec, compared to 59MB/sec and 60MB/sec (respectively) for a traditional 2.5" hard drive. We ran several benchmarks. In our HD Tune performance tests, the drive turned in an average transfer speed of 28.6 MB/sec with .3ms average access time. Burst speed was 25.5MB/sec with 4.5% CPU utilization. HD Tach tests using both 8MB and 32MB blocks for testing measured the burst speed at 30.8MB/sec, the average sequential read speed at 28.0MB/sec. HD Tach won't measure write speeds on primary drives, so we turned to Fresh Diagnose, which reported an average write speed of 11.65 MB/sec.

With such a spectacular performance, it's easy to see why Flash drives will become the next top-of-the-line feature in any high-end laptop, maybe more important than a high-speed CPU.

I think that the problem stated in one of my earlier posts is one of the most fascinating puzzles I came across recently. Many people that got confronted with it said bluntly that the problem simply has no solution, otherwise it would contradict common sense, information theory, etc. But surprisingly, it does have a solution.

For X% strictly bigger than 50% the solution is known as the MJRTY algorithm, discovered a while back by R.S. Boyer and J.S. Moore (also called the Majority Vote algorithm). Although the algorithm is deceptively simple, its proof is not. the In the PDF cited above, the authors present a complete proof and also some interesting history around this non-trivial result.

The correct algorithm was also found by Lailin Chen - also presented in the post comments, but without a proof:

Well, Since we already knew there is one and only one who has more votes than any others, we can find him by letting the votes "fight" each other:

Picked the first one, say X1, count 1 for X1, then move on,  if X1 repeats, add count to X1.

When X1 doesn't repeat, reduce the count by one, if it reaches 0, picked up the current one (say X2), and do the some counting to X2. 

Keep doing this until the end. The last one left with a counting bigger than 0 is the leader:

X3  X1  X1  X2  X1  X3  X3  X3  X4

   ^

X3, 1

X3  X1  X1  X2  X1  X3  X3  X3  X4

         ^

       X3, 0

X3  X1  X1  X2  X1  X3  X3  X3  X4

               ^

            X1, 1

X3  X1  X1  X2  X1  X3  X3  X3  X4

                     ^

                   X1, 0

X3  X1  X1  X2  X1  X3  X3  X3  X4

                           ^

                          X1, 1

X3  X1  X1  X2  X1  X3  X3  X3  X4

                                 ^

                               X1, 0

X3  X1  X1  X2  X1  X3  X3  X3  X4

                                       ^

                                    X3, 1

X3  X1  X1  X2  X1  X3  X3  X3  X4

                                             ^

                                           X3, 2

X3  X1  X1  X2  X1  X3  X3  X3  X4

                                                    ^

                                                  X3, 1

So X3 is the winner.

Here is a more formal description of the algorithm:

- You have with two registers: an “candidate register” containing the name of some candidate (candidate), and a counter (counter).

- Candidate_register = NULL;

- Counter = 0

- For each new candidate:

     If (Candidate_register is NULL)

            Candidate_register = candidate

            Counter = 1

     Else If (candidate == candidate_register)

            Counter++;

     Else if (counter > 0)

            Counter --;

            If (counter == 0)

                Candidate_register == NULL

- At the end you have the correct candidate in the register.

For X < 50% we can adapt a more complex variation of the same algorithm which requires two passes instead of a single one. Instead of keeping a single register (keeping the last candidate) and its associated counter, we keep a number of K registers (with their associated counters) and run the same algorithm. The integer K is chosen such that 100/(K+1) < X. After this pass, we discovered K candidates but of course the candidate with the biggest counter might not be the real candidate. To find the top candidate we simply reset all counters to zero (but keep the candidates found) and re-run the same algorithm.

- Instead of maintaining one “element” register and one counter as in the main algorithm, you need to use K registers, each with K counters.

- First pass

     Initialize all candidate registers with NULL and counters with zero

     For each candidate in the log file

          If the candidate is in one of the K registers, increase its counter

          Else if there is one candidate register which is NULL, put it there and increase its counter with 1

          Else, reduce all counters by 1

               If some counters become zero, set their corresponding candidate registers to NULL

     At the end, each of the candidate registers contains every element with frequency > 100/(K+1). One of them is the winner

- Second pass

     Initialize all counters with zero, but keep the candidate registers filled with the values above.

     For each candidate in the log file

          If the candidate is in one of the K registers, increase its counter

     At the end, you have all the correct counters for all the candidate registers.

     Select the one that has the maximum count.

A variation of this second algorithm is briefly mentioned in this presentation although I could not find a formal article describing it.

What all these algorithms have in common? The fact that you know ahead of time that there is a winner, but you don't know which one. While the algorithms are simple to implement, this is a pretty strong condition that is not likely to be encountered in our real world. Maybe on a planet called Xen :-)

More Posts Next page »
 
Page view tracker