Previously, I had discussed what it took to use PGO on the Windows PHP build. That led to me building automated build scripts…
Automation as the root of all evil
"Anything that can be done for you, automatically, can be done to you, automatically." – David C. Wyland
First, I had to get the entire dependency stack into the mix. While some of the dependent libraries had VCProject files, some didn't. Worse, even if they had them, you couldn't tell with a degree of certainty that they were compiled with the same settings which would enable them to take advantage of PGO optimization. I began taking each project, updating (or creating, using the Trace and mkProject tools) the Visual C++ project files that would use the same settings as the rest, and eventually came up with a solution file that had 74 projects in it (some of the projects generated more than one binary).
Next, I had to actually automate the process of creating the vcproject files. Once you've got the right dependencies, the PHP build process cranks out over 30 binaries when you include the PHP extensions that get built as part of the core. After what seemed like a million compile-verify-tweak iterations, I had the tools that could generate VCProject files for the core PHP and all the extensions, provided it was all in the right place.
Next I wrote a .cmd batch script that went step-by-step, checking out the source, compiling the dependent libraries, building the PHP makefile, compiling PHP like the community did—and logging what it was doing, then switching to instrumentation, rebuilding the dependencies again, building the stack, PGO training it with test data and some applications (Wordpress, MediaWiki and phpBB) and then relinking it with optimization.
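The shape of that script is easy to sketch: a list of named steps run in order, with logging, that stops at the first failure. This is only an illustration of the idea, not the actual build script. The `runSteps` helper, the `ok` placeholder and the step bodies are all hypothetical.

```javascript
// Hypothetical sketch: run named build steps in order, log each one,
// and stop at the first failure. Step bodies are placeholders.
function runSteps(steps, log) {
    for (var i = 0; i < steps.length; i++) {
        var step = steps[i];
        log("[" + (i + 1) + "/" + steps.length + "] " + step.name);
        var exitCode = step.run();   // each step returns 0 on success
        if (exitCode !== 0) {
            log("FAILED: " + step.name + " (exit code " + exitCode + ")");
            return false;
        }
    }
    log("All steps completed");
    return true;
}

// Placeholder that stands in for real work (checkout, compile, ...).
function ok() { return 0; }

var buildSteps = [
    { name: "check out source", run: ok },
    { name: "compile dependent libraries", run: ok },
    { name: "generate the PHP makefile", run: ok },
    { name: "compile PHP (community-style build)", run: ok },
    { name: "rebuild everything with PGO instrumentation", run: ok },
    { name: "train with test data and applications", run: ok },
    { name: "relink with optimization", run: ok }
];

var logLines = [];
var succeeded = runSteps(buildSteps, function (line) { logLines.push(line); });
```

Keeping the steps as data means that adding another training application is one more entry in the list rather than another branch in the script.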
I got the .cmd script almost working, but it was fairly fragile. At that point I decided to switch batch scripting strategies, and in about a week, rewrote the batch script in JScript, which was far more flexible, and a lot more reliable.
What's next…
"The future always arrives too fast... and in the wrong order." –Alvin Toffler
During this process, I've tweaked the build process that is generated quite a bit, added in a few more applications to the PGO training which cranks the performance up more and more. Now, I can add in more scripts to assist with the training pretty trivially, but it still takes some effort to package up an entire application like MediaWiki or Wordpress and include it into the build process. Even once I've added in an application, I end up doing a whole slew of comparative testing to see what impact it has on the final executables.
As time goes forward, I'm sure there's more tweaking to be done, but in all likelihood, any significant performance gains are going to be the result of some modification of the PHP codebase itself.
Previously, I had talked about using PGO in the PHP build process. In order to use it I had to observe…
The Heisenberg build process
"A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it." – The First law of Mentat, quoted by Paul Atreides to Reverend Mother Gaius Helen Mohiam
Really, what I needed was a tool in two parts. The first would watch what happens during the build process, and the second would take that data and spit out some .vcproj files.
When I want to see what's happening on my own system I use ProcMon, a Sysinternals tool that monitors processes, what files they touch, what commands get executed, etc. I grabbed that and tried to watch what happens when you run NMake on the makefile when building PHP. It turns out that there are a few problems with that: ProcMon isn't very scriptable (making it tricky to automate), and even if it were, it chops off the command line in its log files when it's past a certain length.
I found nothing else that did quite what I needed, so I started thinking about how to write a tool that does the same thing. In the past I have used Detours (an API detouring library built by Microsoft Research) to build a couple quick-and-dirty snoop/debugging tools. Starting with a sample that came from the Detours library, I cobbled together a tool that would watch a process and its children, recording every file written or read, every command issued, and dump it into an XML file which I could process later.
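To make the idea concrete, here is a small sketch of what the second half of such a tool has to do with the recorded data: start from a binary, find the process that wrote it, and walk back to the source files. The record layout below is purely hypothetical (the post doesn't show the XML format), as are the function names.

```javascript
// Hypothetical shape for one traced process: the command line it ran and
// the files it read and wrote. These field names are illustrative only.
var trace = [
    { command: "cl.exe /c main.c", reads: ["main.c", "php.h"], writes: ["main.obj"] },
    { command: "cl.exe /c zend.c", reads: ["zend.c", "php.h"], writes: ["zend.obj"] },
    { command: "link.exe /OUT:php.exe main.obj zend.obj",
      reads: ["main.obj", "zend.obj"], writes: ["php.exe"] }
];

// Returns the traced process that wrote the given file, or null.
function findWriter(file, trace) {
    for (var i = 0; i < trace.length; i++) {
        for (var j = 0; j < trace[i].writes.length; j++) {
            if (trace[i].writes[j] === file) return trace[i];
        }
    }
    return null;
}

// For a binary, find the process that produced it, then the processes
// that produced its inputs, and collect the .c files those read.
function sourcesFor(binary, trace) {
    var sources = [];
    var producer = findWriter(binary, trace);
    if (!producer) return sources;
    for (var i = 0; i < producer.reads.length; i++) {
        var compile = findWriter(producer.reads[i], trace);
        if (!compile) continue;
        for (var j = 0; j < compile.reads.length; j++) {
            if (/\.c$/i.test(compile.reads[j])) sources.push(compile.reads[j]);
        }
    }
    return sources;
}

var phpSources = sourcesFor("php.exe", trace);
```

With the sample data, `phpSources` holds `main.c` and `zend.c`, which is the list of files a generated project for that binary would need to contain.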
Creating the project files
At the same time, I began working on a tool that would generate .vcproj files from the data gathered during the make process. I first tried just putting together a tool which assembled the .vcproj XML file from what I knew about the layout of the project file, but as the build got trickier, it was getting harder to make sure the XML came out the way that Visual Studio expected. I turned to the Visual Studio SDK to see if there were any COM objects I could use to manipulate project files. There were, but they aren't documented in great detail, and they were really designed to be used inside Visual Studio for automation. Having scoured the planet, I found some examples of using the VCProjectEngine to generate project files.
For a couple of weeks solid, I worked on the tool to generate project files, compiling, testing, tweaking, etc. I finally reached a point where I could generate a complete project file that would compile php.exe and php5.dll. Having finally arrived at this point, I built PHP using PGO instrumentation, ran the bench.php script from the PHP source directory, and then re-linked the project. This first time, I saw about an 18% improvement in speed over the previous version!
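For readers who haven't used PGO with Visual C++, the instrument / train / relink cycle can be sketched as three command lines. `/GL`, `/LTCG:PGINSTRUMENT` and `/LTCG:PGOPTIMIZE` are the Visual C++ switches involved. The object file names and the `pgoCommands` helper below are hypothetical, and the real PHP build passes many more options than this.

```javascript
// Sketch of the three PGO phases as command lines. Object and output file
// names are hypothetical. The objects are assumed to be compiled with /GL.
function pgoCommands(objects, output, trainingCommand) {
    var objs = objects.join(" ");
    return [
        // 1. link an instrumented binary
        "link /LTCG:PGINSTRUMENT /OUT:" + output + " " + objs,
        // 2. exercise the instrumented binary to record profile data
        trainingCommand,
        // 3. relink using the recorded profile
        "link /LTCG:PGOPTIMIZE /OUT:" + output + " " + objs
    ];
}

var commands = pgoCommands(["main.obj", "engine.obj"], "php.exe", "php.exe bench.php");
```

The compile step only happens once. The second link reuses the same objects and lets the linker make its decisions from the profile data gathered in step 2.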
That moment
"It ain't over 'til it's over, and maybe not then, either. " – Slovotsky's Law #29
Well, as anyone who's done software development will tell you, there's the moment when you finally get your program to do what you want under very controlled conditions, and then—quite some time later—there's the moment that you can give the fruits of that labor to someone else so they can do the same thing.
Now that I had passed the point where I'd finally proven that it was worth the effort to build a PGO-optimized version of PHP, I had to get it scripted so that it could be done in an automated fashion, not just on my computer, or a computer in our Lab.
In the final part, I wrap up with the automation of the build and look to where we might go next in PHP.
I had talked about getting started in building the PHP stack last time, now I’m taking it…
One step further
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." – Donald Knuth
A chance conversation I had last summer at OSCON with Trent Nelson—who was building Python on Windows—had planted the seeds of how to get PHP on Windows optimized further. Trent was using the PGO features of Visual Studio to generate Python binaries that run faster. Rather than spend a lot of time optimizing all the little bits of PHP itself, I thought that this would be an ideal way to improve the overall speed of PHP, provided I could find the right scenarios to train PHP with. Little did I know that finding the right scenarios wasn't the hardest part.
| | What is PGO? (from Wikipedia) Profile-guided optimization (PGO) is a compiler optimization technique in computer programming to improve program runtime performance. In contrast to traditional optimization techniques that solely use the source code, PGO uses the results of test runs of the instrumented program to optimize the final generated code. The compiler is used to access data from a sample run of the program across a representative input set. The data indicates which areas of the program are executed more frequently, and which areas are executed less frequently. All optimizations benefit from profile-guided feedback because they are less reliant on heuristics when making compilation decisions. |
Adding PGO to the existing build process
"I have not failed, I've just found 10,000 ways that won't work." – Thomas Edison
I had downloaded the source to the dependent libraries off the PHP wiki, checked out the PHP source code, and began the process of adding in PGO support to the existing build process. This proved to be extremely difficult. Even limiting the scope to just the core of PHP itself, without the dependent libraries, I ran into trouble trying to compile using PGO instrumentation and then re-linking after running some tests. The makefile that gets generated by the configure.js script (a JScript version of the automake configure script for the Windows platform) was just not built with what I had in mind.
I spent the better part of two weeks trying different approaches to tweaking the makefile so that I could use PGO to improve the PHP executable, but I kept running into roadblocks. Worse, the closer I got to a makefile that did what I wanted, the farther away from the current build process I was getting, and I wasn't sure that what I would end up with would even be close to what was being built today.
The long dark winter road
"Only the meek get pinched. The bold survive." – Ferris Bueller
I came to the conclusion that I'd have to build new Visual Studio project files from scratch. What worried me was that this would end up being a completely different build process and I'd never get the community to abandon what was already working, so I'd better be able to rebuild these new project files easily. I started looking (inside Microsoft and out) for any tools which generated Visual C++ project files. I found someone internally who had used some JScript to create project files from text files, but after some experimentation, I found this was nowhere near what I needed. What I really needed was a way to convert the generated Makefile into a .vcproj file, and not just 'wrap' it.
Once I found there was no such tool*, I began trying to figure out how to create one. I had this idea a few times in the last decade or so: watch how a program was compiled, and create a project file that does the same thing. Having tossed around the idea in my head before, I knew it wasn't going to be trivial, but without it, I couldn't do what needed to be done.
| * Let me tell you: you never want to think about writing a tool to parse out what a makefile does. It's rather like making a tool that tells you how sausage is made, in excruciating detail. Ugh. |
In Part III, I’ll talk about the trouble with observing the build process.
The last several months, I’ve been working very deeply with PHP—specifically—compiling the PHP core itself, and looking for avenues for optimization. This is the first of four posts about the journey I’ve been on with PHP.
I get started building PHP
"It is a bad plan that admits of no modification" – Publilius Syrus
I started working with building PHP itself about a year ago. Initially, I was trying to put together an environment to compile up the PHP stack so that I could do some debugging, and track down a few faults that we were encountering in some of the PHP applications that we were trying to modify to use the SQL Server PHP driver that the SQL Server team here at Microsoft was creating.
Once I began to work with the source code, I found out very quickly that on top of having a hard time recreating the exact same binaries that the community build process generated, there were a large number of dependent libraries that were available in binary-only form which were kept in a zip file that was passed around from developer to developer. That seemed a little odd for an open-source project but I can certainly understand that over time, unless someone is working hard to keep it all together, these things happen.
Around the same time, the community had started to invest time and effort to 'clean up' the dependencies for building PHP on Windows, and move towards supporting VC9 (Visual Studio 2008) as an officially supported compiler.
In order to help in this process, I built out some testing environments in our Lab, which would let me compile up PHP on Windows and Linux, in order to get decent and reliable test results which we could use to identify any shortcomings that we could address. This includes benchmarking not just the core PHP executable, but replicable and comparable testing of PHP applications such as Wordpress, MediaWiki, Gallery and phpBB.
PHP 5.3 on Windows: Not your father's PHP
"I'm looking for a lot of men who have an infinite capacity to not know what can't be done." – Henry Ford
For PHP 5.3, Pierre (and others) had gone out and found up-to-date versions of all the dependencies, brought them together, and managed to get them compiling with VC6 and VC9. They had posted these in binary and source form to the PHP Windows Internals site, which allows anyone to rebuild the PHP stack on Windows, and theoretically, get the same results as the 'official' build.
Jumping in at that point was much easier than it had been, as all you had to do was download the binaries of the libraries, check out the source code, and run a few commands at the command line, and presto you had your PHP executables.
At this point Pierre and I played around with the build flags on VC9 and found some settings that gave some pretty significant improvements to the speed of PHP vs. the speed of the VC6 version, and a lot of speed improvements vs. the old 5.2x line of PHP.
In Part II, I’ll talk about going one step further with optimization.
Now that I’ve shown how to build a cool .Format() method for strings, we can put it to good use in a lot of places.
In batch scripting, it’s really nice to be able to make nearly every call support replacement arguments in a consistent fashion. In cmd.exe batch scripts, we use %var% all over the place. In JScript batch scripting, we simply use {$VAR} instead, and put a little bit of code in the top of our functions to help out with that.
First, the one-stop-universal-arguments helper, which we can add to the source from before:
| Scripting.js |
function ArgsToArray(x) { return Array.prototype.slice.call(x);}
// FormatArguments must be passed either:
// one argument:
// containing the arguments object from the caller
// where the first argument of that should be the format string.
// == or ==
// just the format string.
//
// two arguments:
// the first argument is the format string
// the second argument is the argument collection from the caller.
function FormatArguments(args, moreargs) {
var result = "";
if (arguments.length == 1) {
if (typeof (args) == "object") {
args = ArgsToArray(args);
result = "" + (args.shift());
if (args.length == 1 && arguments[0].length > 0)
args = arguments[0];
return result.Format(args);
} else return ("" + args).Format();
} else if (arguments.length == 2) {
if (typeof (args) == "string" && typeof (moreargs) == "object") {
result = args;
args = ArgsToArray(moreargs);
args.shift();
return result.Format(args);
}
}
throw "Invalid Argument passed to FormatArguments";
}
|
FormatArguments() gives us the ability to do variable substitution in any function, in a very flexible way. You can simply use the function to give you a completed string:
| Test-4.js |
// use the first string as the format string,
// the rest are potential value substitutions
function Test1() {
var foo = FormatArguments(arguments)
WScript.echo( foo );
}
Test1("the path is {$PATH}");
// both arguments are used as format strings. any
// parameter substitution should use numbers starting
// at {1}, since the 'destfile' parameter is technically {0}
function Test2(srcfile, destfile) {
var srcfile = FormatArguments(srcfile, arguments);
var destfile = FormatArguments(destfile, arguments);
var msg = "Copy {0} to {1}".Format(srcfile, destfile);
WScript.echo(msg);
}
Test2("{$WINDIR}\\system32\\notepad.exe" , "{$USERPROFILE}\\desktop\\notepad.exe");
|
Knowing that, we can create a few functions that will use the FormatArguments function:
| Scripting.js |
// Some global objects we'll need
var WSHShell = WScript.CreateObject("WScript.Shell");
var WinShell = WScript.CreateObject("Shell.Application");
var procEnvironment=WSHShell.Environment("PROCESS")
var fso = new ActiveXObject("Scripting.FileSystemObject");
function Print() {
WScript.echo(FormatArguments(arguments));
}
function cd() {
WSHShell.CurrentDirectory = FormatArguments(arguments);
}
function pwd() {
return WSHShell.CurrentDirectory;
}
function erase(file) {
file = FormatArguments(arguments);
if (exists(file))
fso.DeleteFile(file);
}
function rmdir(folder) {
folder = FormatArguments(arguments);
if (folderExists(folder))
fso.DeleteFolder(folder, true);
}
function exists(file) {
return fso.FileExists(FormatArguments(arguments));
}
function folderExists(folder) {
return fso.FolderExists(FormatArguments(arguments));
}
function mkdir(folder) {
folder = FormatArguments(arguments);
if (!folderExists(folder))
fso.CreateFolder(folder);
}
|
Now, we can do some pretty nifty little batch scripting:
| test-5.js |
// Bootstrap Scripting Library
eval(new ActiveXObject("Scripting.FileSystemObject").OpenTextFile("Scripting.js", 1, false).ReadAll());
$ORIG_DIR = pwd();
if( exists("{$WINDIR}\\log.txt") ) {
Print("Deleting log file");
erase("{$WINDIR}\\log.txt");
}
mkdir("{$TEMP}\\tempdir");
|
Next time, we’ll see how we can call on command-line tools to do our bidding, and play with the results.
I’ve just finished tweakin’ out the PGO (Profile-Guided Optimization) build script for PHP on Windows to crank out the thread-safe version of PHP as well.
So, now you can test PHP 5.3 RC3-dev PGO optimized for Windows with Apache 2.2!
What’s the difference between thread-safe and non-thread-safe?
The non-thread-safe version of PHP should be used when there is a single request per instance of PHP--like, when you use FastCGI—a single PHP-CGI.EXE handles a single request at a time, and the Web Server spins off multiple instances of PHP-CGI.EXE to handle requests in parallel. Because each instance is in a separate process, there is no need to have all the thread-safety code in PHP.
The thread-safe version is required when you use PHP as a module—as you would in Apache on Windows—and the WebServer handles multiple requests in the same process.
Given the choice, the non-thread-safe version should be faster, and if you can, you should probably use that one.
You can read more about this at http://www.iis-aid.com/articles/my_word/difference_between_php_thread_safe_and_non_thread_safe_binaries
If you already use the thread-safe version of PHP on Windows, please download and test this version if you can—and send me some feedback!
Links to the latest PHP 5.3 Builds:
PHP 5.3-RC-dev snapshots:
http://windows.php.net/downloads/snapsoptimized/
Non-thread-safe:
http://windows.php.net/downloads/snapsoptimized/php-5.3-nts-win32-VC9PGO-x86-latest.zip
Thread Safe:
http://windows.php.net/downloads/snapsoptimized/php-5.3-ts-win32-VC9PGO-x86-latest.zip
Last time, I wrote about synthesizing an #include facility along with handling environment variables in a trivial way.
This time, let’s look at filling in a couple more gaps in JScript’s basic scripting functionality.
What’s wrong with String?
The String class in JavaScript/JScript is … ok. If you work in .NET enough, you’ll eventually get to the point where you’d like a couple of functions that seem to be missing. First is Trim, which I’ve always thought of as obvious:
| Scripting.js – String Prototypes |
//----------------------------------------------------------------------------
// String Prototypes
String.prototype.Trim = function() {
return (this || "").replace(/^\s+|\s+$/g, "");
}
|
Second, something I miss dearly from .NET is a string Format function. Now, I’ve seen a couple of these that were pretty simple, but I wanted to be able to do some really cool Format strings (and, since the rest of my scripting will rely on them quite a bit, I’m going for the gold here).
| Scripting.js – String Prototypes |
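String.prototype.Format = function() {
var args = (arguments.length == 1 && typeof (arguments[0]) == "object") ? arguments[0] : arguments;
var result = "" + this, z;
// replace each {...} token in turn; tokens that can't be resolved are
// temporarily rewritten as ??<...>?? so the loop doesn't match them again
while (z = /{(.*?)}/.exec(result))
try { result = result.replace(z[0], isNaN(z[1]) ? eval(z[1]) : (args[z[1]]||"??<" + z[1] + ">??" )); }
catch (x) { result = result.replace(z[0], "??<" + z[1] + ">??"); }
// restore every unresolved token (a plain-string replace only fixes the first one)
return result.replace(/\?\?</g, "{").replace(/>\?\?/g, "}");
}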
|
So, with this tasty little script in my pocket, I can do ad-hoc format strings that use one of a few types of replacement:
| test-3.js |
// simple parameter # replacement like .NET
var foo = "My Name is {0}. Please to meet you {1}".Format("Garrett", "Mr. Serack");
WScript.echo(foo);
// *global* variable replacement
ABC = "this is a test";
var bar = "'{ABC}' is really a replacement string".Format();
WScript.echo(bar);
// really, any legal expression in there is fine.
bar = "'{100+200/75}' is really a replacement value?".Format();
WScript.echo(bar);
// and if it doesn't match, just leave it alone...
bar = "'{teehee}' isn't really a replacement value.".Format();
WScript.echo(bar);
// even if it's just a number.
bar = "'{1}' isn't really an index.".Format();
WScript.echo(bar);
|
Which spits out the following:
| Output |
My Name is Garrett. Please to meet you Mr. Serack
'this is a test' is really a replacement string
'102.66666666666667' is really a replacement value?
'{teehee}' isn't really a replacement value.
'{1}' isn't really an index.
|
Basically, it just looks for anything in braces, and checks to see if it’s a number. If it is, it tries to substitute in the n’th parameter passed in. Otherwise it just does an eval() on it, which will replace expressions or global variables. If something goes haywire, it just leaves it alone (well, it does some switcheroo stuff to get past the while loop, but it puts it back in the end).
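If running eval() on whatever sits between the braces feels too adventurous for a given script, the same lookup idea works without it. The sketch below is an alternative, not the Format method from this post: `formatWith` is a hypothetical helper that resolves `{n}` from a positional list and `{NAME}` from an explicit lookup object, and leaves anything else untouched. The trade-off is that expressions like `{100+200/75}` are no longer evaluated.

```javascript
// Alternative sketch (not the Format method above): no eval(), names are
// resolved only from the object that the caller passes in.
function formatWith(template, positional, named) {
    return template.replace(/\{(.*?)\}/g, function (whole, key) {
        if (/^\d+$/.test(key)) {
            // numeric token: positional parameter, if one was supplied
            return (key in positional) ? String(positional[key]) : whole;
        }
        // named token: look it up, otherwise leave the token alone
        return Object.prototype.hasOwnProperty.call(named, key) ? String(named[key]) : whole;
    });
}

var greeting = formatWith("Hello {0}, {ABC} and {nope}", ["Garrett"], { ABC: "this is a test" });
```

Here `greeting` comes out as `Hello Garrett, this is a test and {nope}`, so unmatched tokens survive the same way they do with the eval-based version.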
Next time, we’ll see how to use this… everywhere.
As I mentioned a few days ago, I chose JScript to script the optimized PHP build process that I’ve built. JScript is in-box on pretty much every modern Windows operating system, and provides a great deal of flexibility and benefits for a scripting language:
- its syntax is C-like. Very tasty.
- it gives access to *a lot* of functionality via COM and WMI.
- if you know enough about Windows Scripting Host and JavaScript, you can accomplish darn near anything if you want it bad enough.
- JScript’s regular expressions. While not the universe’s most powerful, they are certainly an integral part of the language.
- Prototypes allow you to do things to classes that can significantly boost productivity.
It however does lack a few of the basic things that I’d like to see in a batch* scripting language:
- an #include mechanism.
- Easy interaction with environment variables.
- Interaction and leveraging of external processes
- Analogs to the built in command-line functions like DIR, MKDIR, ERASE, RMDIR, etc.
* And, by “batch” scripting, I mean the scripting of external commands and programs to automate something that would otherwise be done by hand.
I was thinking about all of this over the last few months and started experimenting. Along the way I came up with a basic library of functions that address the deficiencies in a rather clever way.
First, let’s fix the lack of #include
JScript (well, and VBScript) really didn’t do us any favors by not supplying us with the ability to reuse code in a simple fashion. (And yes, I know about .WSC components, and I’m not keen on how *that* turned out. Ask me about that again some time.)
Anyway, #including another JScript file is pretty easy if you know what to do. Not pretty, but easy:
//---[Test01.js]----------------------------------------------------------
// includes the scripting library
eval( new ActiveXObject("Scripting.FileSystemObject").OpenTextFile("Scripting.js", 1, false).ReadAll() );
The eval function gives us the ability to just run code that we pass in at runtime. This will give us a few little bumps along the way later, but for the most part, is pretty darn good.
How about those environment variables
The WScript.Shell object has some methods that let us get at environment variables, but I wouldn’t exactly consider them “Script Friendly”. So, the first thing I did was create some basic functionality for exposing the environment as global variables. (Some of how this gets useful comes a bit later.)
//----------------------------------------------------------------------------
// Global variables
var GLOBALS=this;
var WSHShell = WScript.CreateObject("WScript.Shell");
var procEnvironment=WSHShell.Environment("PROCESS")
GlobalInit();
// Loads the environment into UPPERCASE variables in the global namespace.
// each variable is prefixed with a $
function loadEnvironment() {
env = CollectionToStringArray(procEnvironment);
for(each in env) {
var v= env[each];
if(typeof(v)=='string')
if ((p = v.indexOf('=')) == 0 )
continue;
else
GLOBALS['$'+v.substring(0,p).toUpperCase()] = v.substring(p+1) ;
}
}
// Sets environment variables from all the string variables in the global namespace
// that start with a $
function setEnvironment() {
for(each in GLOBALS) {
var t = typeof(GLOBALS[each]);
if(t =='string' || t=='number') {
if( each.indexOf("$") == 0 ) {
if( IsNullOrEmpty(GLOBALS[each]) )
procEnvironment.Remove(each.substring(1));
else
procEnvironment(each.substring(1)) = GLOBALS[each];
}
}
}
}
// takes one of those funky-groovy COM collections and gives back a JScript
// array of strings.
function CollectionToStringArray(collection){
var result = new Array();
for( e = new Enumerator(collection); !e.atEnd(); e.moveNext() )
result.push(""+e.item());
return result;
}
// returns true if the string is null or empty
// Yeah, I was thinking of c# when I wrote this.
function IsNullOrEmpty(str) {
return (str || "").length == 0;
}
// Our function for bootstrapping the required environment.
function GlobalInit() {
loadEnvironment();
}
Now, we can easily access environment variables:
//---[Test02.js]----------------------------------------------------------
// includes the scripting library
eval( new ActiveXObject("Scripting.FileSystemObject").OpenTextFile("Scripting.js", 1, false).ReadAll() );
WScript.echo( "Path is :" + $PATH );
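The parsing step inside loadEnvironment() can be tried on its own, away from Windows Script Host. The sketch below is a stand-alone variant under a couple of assumptions: `loadInto` is a hypothetical name, the sample entries are made up, and it writes into an object that you pass in rather than the global namespace. Unlike the original, it also skips entries that contain no `=` at all.

```javascript
// Sketch of the parsing step only: turn "NAME=value" strings into
// $NAME properties on a target object. The sample entries are made up.
function loadInto(target, entries) {
    for (var i = 0; i < entries.length; i++) {
        var entry = entries[i];
        var p = entry.indexOf("=");
        // skip "=C:=C:\..." style entries and anything without a name
        if (p <= 0) continue;
        target["$" + entry.substring(0, p).toUpperCase()] = entry.substring(p + 1);
    }
    return target;
}

var env = loadInto({}, ["Path=C:\\Windows;C:\\Tools", "=C:=C:\\work", "TEMP=C:\\Temp"]);
```

After this runs, `env.$PATH` and `env.$TEMP` are set, and the odd entry that starts with `=` is ignored, which is what the `indexOf('=') == 0` check in the library is there for.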
Next time, I’ll show how I added code to execute and capture other external commands, and show a few cool functions that make playing in JScript a bit simpler.
Howdy,
I’ve been working for many months with Pierre Joye—well really, many people in the PHP community--on getting PHP to run faster on Windows.
Pierre has been working rapidly on upgrading libraries (Pierre pioneered the work to get PHP and its horde of dependent libraries updated and properly compiling on Windows), replacing old POSIX-emulation code with native calls, patching bugs, and about a million other things, all of which had a huge impact on performance and stability of PHP on Windows.
For my part, I’ve been spending my time behind the scenes by feeding information to Pierre that he needs, testing, analyzing, and finally by constructing a new build process that enables us to take advantage of some pretty sweet optimization technology in Visual Studio.
Starting today, you can find snapshot builds of PHP 5.3 that are built using my optimized build process on the windows.php.net site:
Current PGO Optimized Snapshot:
http://windows.php.net/downloads/snapsoptimized/php-5.3-nts-win32-VC9PGO-x86-latest.zip
A few Notes:
Over the course of the next couple of weeks, I’ll be explaining how this build process works, and making available the tools that make it all possible.
Only the non-thread-safe version is available, so you need to use FastCGI with IIS in order to use it.
Since this is a radically different build than the ones that had been traditionally used to create the Windows PHP binaries, you should download the binaries and test with them, but you probably should avoid using them in production just yet.
If you have any feedback about the builds, leave me a comment, or send me mail at
garretts@microsoft.com
In my earlier post on Windows Home Server's Drive Extender vs RAID, a lot of what I said was a good example of somethin' my pappy once told me:
"Good judgment comes from experience, and a whole lotta that comes from bad judgment."
Well, rest assured that I've had my share of bad judgement, as I'm sure others have, given some of the feedback I've received. Charlie pointed out to me some of the other really good reasons why you're better off just using Drive Extender rather than going with RAID5:
While RAID5 hardware is fast, it is also pretty much universally incompatible. In other words, if your RAID5 controller fails you will have to find an IDENTICAL controller to replace it or your array will be unreadable.
For those playing along at home, this is bad. If you think your data is valuable enough to keep around, and you're worried enough about failure that you're going to do something about it, think about what else will fail--not can, but will-- hard drives, controller cards, motherboards, ram, network adapters, power failure. I'd mention the possibility of meteor strikes, but ... I'm gonna play the odds with that one.
When I think about how each of those failing is going to affect my ability to even recover data, the controller card is a nasty wildcard. There is no standard layout for how each controller card uses all that storage, so a failed card has to be replaced with the exact same controller.
All drives in a RAID5 array have to be the same size.
Hey sure, no problem. I can get 750 GB drives the same size in the future... Uh, except there is no guarantee that the drives will be exactly the same size (even if they are labeled the same), and it's possible that other changes in the future could render them so obsolete that it could be hard to even source drives of a particular size.
Furthermore, as you progress into the future, with Drive Extender, you can simply buy any size of drive that fits your budget, and add it to the server. No worries about how big, what brand, how fast, etc.
Drive Extender gives you the flexibility to NOT duplicate files that don’t need to be duplicated (making it MORE efficient than RAID 5 in some cases).
And this is a really, really good point too. Sometimes, I want to have data that I really don't care as much about, or I'm less worried about it being backed up and online. With Drive Extender, you can mark some content as not requiring duplication. Like maybe TV shows I've recorded on Media Center. Or those videos of my in-law's trip to Cleveland. You know what I mean.
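A little arithmetic shows when the "MORE efficient" part kicks in. This is a back-of-the-envelope model with made-up helper names: RAID5 gives up one drive's worth of space to parity and treats every drive as if it were the smallest one, while a duplicating pool spends two gigabytes of raw space on each duplicated gigabyte. It ignores file system overhead and the requirement that the two copies land on different disks.

```javascript
// Rough capacity model. Sizes are in GB, and the function names are
// illustrative only.
function raid5Usable(driveSizesGB) {
    var smallest = driveSizesGB[0];
    for (var i = 1; i < driveSizesGB.length; i++) {
        if (driveSizesGB[i] < smallest) smallest = driveSizesGB[i];
    }
    // one drive's worth goes to parity; larger drives are cut down to size
    return (driveSizesGB.length - 1) * smallest;
}

// duplicatedFraction: share of the stored data with duplication turned on.
function poolUsable(driveSizesGB, duplicatedFraction) {
    var total = 0;
    for (var i = 0; i < driveSizesGB.length; i++) total += driveSizesGB[i];
    // every duplicated gigabyte consumes two gigabytes of raw space
    return total / (1 + duplicatedFraction);
}

var drives = [750, 750, 750, 750];
var raid5 = raid5Usable(drives);                   // 2250
var allDuplicated = poolUsable(drives, 1);         // 1500
var quarterDuplicated = poolUsable(drives, 0.25);  // 2400
```

With four matching 750 GB drives, RAID5 wins if everything is duplicated, but once only a quarter of the data needs duplication the pool comes out ahead. With mismatched drives (say 500, 750 and 1000 GB) RAID5 drops to 1000 GB because of the smallest-drive rule, while the pool still has all 2250 GB of raw space to work with.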
RAID5 rebuild times with a single drive failure are often as long as your 3.5 month ordeal of recovery!
Well that's certainly true--I took all summer to restore my data, and I had no idea until I got it back if I would.
RAID deals at the block level and thus knows nothing about the interesting file information (metadata, etc…). Thus it will NEVER be able to be smart about your storage. But Drive Extender is file based and tons of innovation can happen (will happen!) leveraging this fact.
This is the point that I think is even more important, and yet so many folks will either ignore or not really understand. RAID is a system of aggregating independent disks and presenting that larger volume to the OS as a single large drive. It works at a block level, meaning that your RAID system really knows nothing about files, just blocks. And because it's unaware of that, there are so many optimizations that can't be done with it. With file-level data redundancy, the system knows a lot more about it, and can begin to make decisions that would be much more beneficial. Think, down the line in a couple years, as I add a few more hard drives to my growing server--which already contains drives that are connected by Firewire, SATA, ATA and USB (*sigh*)--it's possible that the technology could realize that I rarely ever touch some kinds of data (or specifically, some types of data) and it could migrate those files to the slower media, and keep the faster media for things that need it.
Now, my momma said that I should always end by sayin' somethin' nice, so I'll leave ya with this:
So is there anything good about RAID5 or are you just some sort of once-scorned-RAID-hater?
I've said it before but it bears repeatin': RAID--in general--is great for performance, and RAID5 adds availability to that. My desktop at home has 1.4 terabytes of space, striped (that's RAID0). My desktop at work has 2.8 terabytes of storage, again, striped. Sure, sure, with four 750 gb drives, and using striped storage, I'm subject to failure at 4x the rate I was before (probably more, cowboy math ain't that great... :D). On those systems, I'm not leaving anything there that I can't afford to lose, either because I've backed it up, or it's recreate-able.
But WHOOOOOSH, it sure goes fast!
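For a rough feel of the cowboy math above, here is a minimal sketch. It assumes each drive fails independently with the same probability over some period; the 5% figure is made up for illustration and does not come from any real drive data.

```python
def array_failure_probability(p_drive: float, n_drives: int) -> float:
    """Chance that at least one of n independent drives fails.

    In a striped (RAID0) array, losing any one drive loses the whole array.
    """
    return 1.0 - (1.0 - p_drive) ** n_drives


p = 0.05  # hypothetical per-drive failure chance over some period
single = array_failure_probability(p, 1)
striped = array_failure_probability(p, 4)

print(f"one drive:      {single:.4f}")
print(f"4-drive stripe: {striped:.4f} ({striped / single:.2f}x)")
```

Under that independence assumption the multiplier comes out a little under 4x. Drives bought together and worked equally hard are probably not independent, which is one way the real number could end up higher.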
I use Windows Home Server at home to store *everything*... it's really quite a fantastic product. It has a feature called Drive Extender, which Wikipedia describes nicely:
Windows Home Server Drive Extender is a file-based replication system that provides three key capabilities:[12]
- Multi-disk redundancy so that if any given disk fails, data is not lost
- Arbitrary storage expansion by supporting any type of hard disk drive (Serial ATA, USB, FireWire etc.) in any mixture and capacity
- A single folder namespace (no drive letters)
Users (specifically those who configure a family's home server) deal with storage at two levels: Shared Folders and Disks. The only concepts relevant regarding disks is whether they have been "added" to the home server's storage pool or not and whether the disk appears healthy to the system or not.
Shared Folders have a name, a description, permissions, and a flag indicating whether duplication (redundancy) is on or off for that folder.
If duplication is on for a Shared Folder (which is the default on multi-disk Home Server systems and not applicable to single disk systems) then the files in that Shared Folder are duplicated and the effective storage capacity is halved. However, in situations where a user may not want data duplicated (e.g. TV shows that have been archived to a Windows Home Server from a system running Windows Media Center), Drive Extender provides the capability to not duplicate such files if the server is short on capacity or manually mark a complete content store as not for duplication.
Here at Microsoft, we have an internal mailing list for WHS, and every once in a while, someone asks one of the following questions:
Isn't RAID better than Drive Extender?
Why should I use Drive Extender instead of RAID?
Which RAID card should I buy?
How good is software RAID5?
I try to ignore those threads, but when the responses start coming in about the merits of RAID vs simply using DE, I end up getting itchy, and chime in. The topic came up again, this last weekend, and I recycled an old response, and it started looking like a good blog post... so here's the skinny.
First, the reason why you don’t want Software RAID 5
First, there’s a big gap between software RAID5 and hardware RAID5. Software RAID5 is slow. Damn Slow. Faster than that… maybe pretty damn slow. Not a great solution. You won’t be happy at the end of the day (see section below “Why you don’t want RAID 5.”)
Hardware RAID5 is fast. Zippity fast. So is how fast you will lose your data.
Why you don’t want RAID 5
RAID 5 is not about data integrity… it’s about performance and availability.
If you want your data to be safe, replicate it. Back it up. Put it in more than one place at a time.
If you use RAID5, you still need to back up your data. RAID5 is designed so that your data survives a single drive failure and stays available (but slower) until you get another drive in place, at which point it rebuilds the missing volume.
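To see why one failure is survivable and two are not, here is a minimal sketch of the XOR parity idea behind RAID5. It handles a single stripe with made-up four-byte blocks; a real controller also rotates parity across drives and works on much larger blocks, which this leaves out.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per data drive in a single stripe.
data = [b"PHP!", b"RAID", b"DISK"]
parity = xor_blocks(data)  # stored on a fourth drive

# The drive holding data[1] dies: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True

# A second drive dies before the rebuild finishes: one block plus parity
# only gives the XOR of the two missing blocks, not either block itself.
leftover = xor_blocks([data[0], parity])
print(leftover == data[1])  # False
```

Note that the rebuild has to read every surviving drive in full, which is where all the extra drive activity after a failure comes from.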
Here’s the kicker. What happens when a drive fails, and you are not there? If the system is in use, it’s going to get really really busy, and all of the drives in the array are going to get a lot of use.
When that hard drive fails (and you are not planning for IF but WHEN), and the others pick up the slack, the chances of losing a second drive go thru the roof. What will you lose if a second drive goes?
This is common, especially in a server/computer in a home environment, where the drives may not be busy most of the time.
One other contributing factor to multiple drive failures in RAID5 is that people tend to use the same brand of drives, often from the same batch (i.e., you bought them at the same time).
My personal experience with RAID5
I had a server running RAID5 at home. It ran perfectly for over a year (actually, close to two). One night after I went to bed, a drive failed. 3 minutes later another failed. This was a 2 terabyte RAID array.
I came down in the morning to my worst nightmare. Every bit of ‘valuable data’ I had in the world was now gone. In desperation I scoured the internet, and finally found a piece of software that (for $40!) could recreate every file that I still had data for, if a little slowly. I rushed out and bought 3 750gb drives, and started to restore everything I had lost. The restore process took 3 and a half MONTHS, running full time, around the clock. The good news is that I was able to get one of the failed drives spinning again, and I lost a total of one file.
What did I learn?
RAID5 doesn't back up my data. Sadly, I thought it was safer. Worse than that, it was actually less safe. A single drive failure would have meant nothing. Add another drive, and keep chugging. Potentially, it may have taken a few hours to rebuild the lost volume, but I could have been using it while it did.
A second drive failure would have meant I was offlined for the time it took to restore--If I actually had a backup. Still, not bad, considering that would have been less than the 3.5 months.
But a two drive failure (which is fairly likely)--without a backup--is a nightmare.
If you value your data, replicate. I now have a home server with 6 250gb drives and 3 750gb drives, and the data that I value is replicated. (and the really valuable data is foldershare’d to a friend’s house, and vice versa, giving us offsite backups too). Sure, it’s not as ‘space efficient’ but at least I can deal with a drive failure.
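The 'space efficient' trade-off can be put into rough numbers. The sketch below uses the drive mix just described and assumes a classic RAID5 layout, where every member is treated as being the size of the smallest drive. The 1000 GB / 1500 GB split between duplicated and non-duplicated data is hypothetical, and the function names are only for illustration.

```python
# Drive sizes in GB: six 250 GB drives and three 750 GB drives.
drives = [250] * 6 + [750] * 3


def raid5_usable(sizes):
    """Classic RAID5: each member counts as the smallest drive,
    and one drive's worth of space goes to parity."""
    return (len(sizes) - 1) * min(sizes)


def pool_raw_needed(duplicated_gb, unduplicated_gb):
    """File-level duplication: duplicated data is stored twice, the rest once."""
    return 2 * duplicated_gb + unduplicated_gb


raw = sum(drives)
print(raw, raid5_usable(drives))  # 3750 2000

# Hypothetical split: 1000 GB worth duplicating, 1500 GB not worth it.
needed = pool_raw_needed(1000, 1500)
print(needed, needed <= raw)  # 3500 True
```

With that split, the pool holds 2500 GB of files where the RAID5 layout would top out at 2000 GB. This ignores any overhead in Drive Extender itself and the requirement that the two copies of a file land on different disks, so treat it as a back-of-the-envelope comparison only.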
Raid 1 (mirroring) is the only RAID where a failure doesn’t increase drive activity drastically—well, reads are all going to one drive now, but if you had 8 drives mirrored in 4 sets, typical access won’t cause all the drives to get busier.
Raid 0 of course, is purely about speed. Half the safety at twice the speed. That’s what I use in my desktops. (where I want it fast. I of course back up anything that I’m not willing to lose to the server.)
What is my Advice?
Know this: if you are using hard drives, one day, you will experience a drive failure. Not 'might', but 'will'.
How you are affected depends on your choices.
Determine how valuable your data is.
Stop thinking about the price of the hard drives. Disk Space is very cheap. It got cheaper while you were reading this. It's the stuff you store that's not.
Are you planning for the inevitable, or playing the odds?
I can talk all day why DE is better than RAID, or why one particular strategy is better than the other. At the end of the discussion, you're still the one making your decision, and you're probably pretty smart. (You're reading my blog). Ask yourself: why are you doing what you are doing?
I'll leave the rest to your imagination.
I attended a session this morning called "PHP Taint Tool: It Ain't a Parser" by Luke Welling. Luke introduced a tool he's working on at OmniTI that is designed to assist in sniffing out where the potential for untrusted input is handled. From the session description:
... You want to see where untrusted input can propagate taint within the application. In complex logic that might mean chasing many possible execution paths. Using an automatic tool to try to follow these paths without running all possible input variations is called static analysis. ... The Taint tool allows the PHP engine to do as much as possible, then cuts in at the last stage to analyze the compiled opcodes and trace possible flow of execution.
The Taint tool presents opcodes in a readable way, making it clear what lines of source got compiled into specific opcodes. It also performs a static analysis on the code, following the opcodes to attempt to trace all possible code branches and mark lines that tainted data can be passed to.
Essentially, the tool uses the parts of the PHP engine to compile PHP code to opcodes, and then tracks where data comes and goes, and highlights the code that handles data that *could* be tainted--that is, input from the user either by POST or GET parameters. This provides a facility for a developer to identify the lines that they should closely review to ensure that they are not accidentally introducing security holes (like cross-site-scripting opportunities).
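To make the idea of tracking taint concrete, here is a toy sketch. It is not Luke's tool and does not touch PHP opcodes: it just walks a hypothetical straight-line program, represented as (line, assigned variable, variables read), and flags every line whose assignment may carry user input. All the names in it are invented for illustration.

```python
# Each statement: (line number, assigned variable, variables it reads).
# The program below is hypothetical; the PHP it stands for is in the comments.
PROGRAM = [
    (1, "name", ["_GET"]),      # $name = $_GET['name'];
    (2, "greeting", ["name"]),  # $greeting = "Hello " . $name;
    (3, "year", []),            # $year = 2008;
    (4, "footer", ["year"]),    # $footer = "(c) " . $year;
]
SOURCES = {"_GET", "_POST"}     # where untrusted input enters


def tainted_lines(program, sources):
    """Return line numbers whose assignment may carry untrusted data."""
    tainted = set(sources)
    flagged = []
    for line, target, reads in program:
        if tainted.intersection(reads):
            tainted.add(target)      # taint flows into the assigned variable
            flagged.append(line)
        else:
            tainted.discard(target)  # overwritten with clean data
    return flagged


print(tainted_lines(PROGRAM, SOURCES))  # [1, 2]
```

A real static analysis has to follow every possible branch rather than a single straight line, which is the hard part the session was about.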
Now, it's not-quite-ready for prime-time, but it's getting close, and the folks over at OmniTI intend to release it as open source when they are ready. When this gets released, I'll be really excited, as it looks like it could be really good for hunting down security holes.
I also attended Rasmus Lerdorf's (the Yahoo PHP guy) tutorial on "PHP: Architecture, Scalability, and Security", which was really quite good too. He demonstrated a tool (the name of which I can't remember now...grrr) that they have at Yahoo: he points it at a web page, and it starts throwing a large library of strings at it that may uncover security problems, but it does it from the client side. Unfortunately, he's not releasing it, not because he doesn't want to let folks find and fix their bugs, but because the release of such a tool could bring about Internet Armageddon--it would likely find exploitable problems in the vast majority of the Internet.
Both approaches to finding application holes are useful, and it's clear from both talks that this is still a really large problem that developers need to address.
(I've had a problem with spam comments; I'll be addressing that soon, so if you see comments turned off you can drop me an email: garretts...at...microsoft...dot...com)
This week I'm at OSCON in Portland, OR. I like what their site says about it:
"OSCON is the crossroads of all things open source, bringing together the best, brightest, and most interesting people to explore what's new, and to champion the cause of open principles and open source adoption across the computing industry."
It really is exactly that. It seems like I've met so many people here, and have had so many great conversations, it's like time slows right down, and the universe is conspiring to squeeze everything it can into just a few days.
I'm having a great time here, and with so much going on, I feel like a kid in a candy store. The biggest trouble I'm having is picking what sessions I want to attend, as there are just so many worthwhile ones. However, given the work I'm currently doing with PHP, I think I'll stick pretty close to the PHP related sessions for the most part.
The last couple of years, Microsoft has had a fair number of people here, and this year is no exception. I keep bumping into people I know... Hey, if you're reading this, and you see me, stop and say hello!
You can recognize me by my picture.
Ever notice how folks who blog sporadically (uh, like me!) always apologize for not blogging for a while, and then re-affirm their dedication to blogging regularly? And often, accompanying their apology, is also a reason. I was going to "Blame it on the Rain" but the very thought of quoting Milli Vanilli makes me shudder.
So, instead, Patty gets to explain it for me. Well, now that I think about it, it really doesn't explain anything. But I was listening to that song last night, and the lyrics stuck in my head.
..... Aaaaaanyway...
The worst part about not blogging for weeks on end is that I can't just ramble on as if you know what I've been up to for the last few weeks, but I'll try to catch ya up:
Over the last several weeks, I've been moving my focus from doing "Program Management" tasks to more "Software Developer" tasks. You see, during the last year, I've discovered that I'm a Developer. Deep down, that's what I do best. Focusing in that direction is already paying off, and I'm finding that I'm accomplishing far more than I had before.
So, rather than focus on simply facilitating, I've been actually compiling, debugging, coding... aaaahhh. It's so nice.
And the best part: all the work that I'm doing is dedicated to getting Apache and PHP working much better on the Windows platform. I may just possibly have the absolute best job at Microsoft.
Back in January, I invited the Apache Software Foundation to attend the Windows Server 2008 Application Compatibility Labs, here on our campus in Redmond. In order to get as many developers as possible to attend, we even paid for flights and accommodations for some members.
The week that Apache was here was so valuable for both groups--the product groups got to see and understand what some of the issues were that some of the Apache projects have run into, and the Apache folks were able to get their hands on the developers who built the system.
Bill Rowe and I had hammered out some details before I actually sent the invitation out. Along with posting it on some of the Apache Mailing Lists, I also posted the invitation on my own blog so that others could see what we're up to. And, as was to be expected, there was a wide variety of comments posted--both positive, and ... less positive.
My favorite though, was:
"Microsoft should go to Apache developers and see if Windows Server 2008 works correctly with Apache, not the other way around."
In some ways, that would have been somewhat impractical--when the Apache folks visited us, they had the opportunity to meet with engineers and program managers from many different groups, in addition to getting access to the hardware in the lab and the expertise of the folks who run that. For us to pick up the 20 or so people from the product groups that they actually met with, and drag them all out to all the locations where Apache developers are--which is pretty much everywhere--would not have been possible.
Still, I felt it would be more than valuable for me to go to ApacheCon, so that I had the opportunity to meet with Apache developers where they roam. When Bill was in Redmond, he invited me to the Apache Hackathon--the couple of days at the beginning of the conference when developers can hang out and code. So, a snappy 10hr flight later, here I am at ApacheCon in Amsterdam.
The Apache Foundation is an interesting community--or rather community of communities. It's not just one project (the http server is what most people think of when they hear Apache), but literally dozens of top level projects, and a whole bunch more in the 'incubator' (where baby projects are cultivated until it is clear that they will have ongoing support and development). The hackathon is just a large room with tables where folks can come in, sit down, open their laptops and start coding. It's actually a lot quieter than I imagined it would be. Naturally, the folks in communities tend to gravitate together and discuss their projects.
As I'm not really on any project, I've been bouncing around chatting up different groups, getting their perspective of their own little chunk of Apache. Most of the people I've talked to aren't surprised at all that I'm here--which is definitely a change from conferences a year ago--and are excited to hear about our efforts.
Now, for the funny thing. I booked my hotel a few weeks back, using the internal travel system here at Microsoft. The hotel that the conference is at was booked, so I looked for one nearby. Unfortunately, the tool doesn't let me search for hotels near another hotel, and I didn't know what else was close that I could search near (and my inability to read Dutch didn't help). So I used the tool to show me where the hotels were, switched to http://local.live.com to see how close each one was, and if it was close, switched back to the travel tool to check availability. There was not much available. ... I guess I was distracted while I was doing it, because I ended up booking a hotel right next to the airport, which is in no way close to the conference. I spent the night in that hotel, called the wonderful travel support folks, who found me a hotel where I needed to be, and moved there the next morning. Lesson learned: next time I travel to the Netherlands, I'm asking Hank to find me a hotel.