Aron Pilhofer acts as editor of interactive news technologies
at The New York Times, overseeing a
news-focused team of journalist/developers who build dynamic, data-driven
applications to enhance the Times' reporting online. He joined The Times in
2005. Previously, he was at the Center for Public Integrity in Washington, and
before that at Investigative Reporters and Editors (IRE.org).
In this interview, we discuss:
Robert: Could you take a moment to introduce yourself and to give us some background on
Aron: Sure. I wear a couple of different hats. At The
New York Times, I'm editor of interactive news, which is a team of
developers in the newsroom who are journalists. What we do is both editorially
and data-driven. We operate like a news desk, but we also were a technology
My other day job is on DocumentCloud, which is a nonprofit
funded by the Knight Foundation. I proposed a grant to fund it with Eric
Umansky and Scott Klein. We were awarded the grant, and we're entering our
second year right now.
The goal of the project
is to improve journalism by creating a site that allows journalists to analyze,
upload, share, and search public source documents that would be otherwise
extremely difficult to find or analyze.
It's an old issue in
journalism that journalists often cite documents that aren't available to the
reader. DocumentCloud lets the journalist post those source documents in a
public place, so the reader can go back to the source, just as a journalist
Robert: As far back as the '20s, though, guys like Walter Lippmann
argued that the public just isn't that interested in the details. Do you find that
people aside from journalists are benefiting from DocumentCloud?
Aron: No. Let me just explain a little bit what DocumentCloud is, how it started, and
why the answer is no. There's DocumentCloud the software, which is one part of
what we're building. It sort of sits on top of OpenCalais, which is an open API
that does entity extraction and semantic markup.
Think of it as a set of tools we're providing to journalists
to give them the ability to treat unstructured text more like structured data,
so they can find links between documents that they could not have found through
As an example, think of a case where you send through a
document that includes a reference to the CIA. CIA is meaningful to a human
being. You and I can look at that and go "Oh, that probably means the
Central Intelligence Agency." Or, in another context, it might be the
Culinary Institute of America. It's less clear to a traditional text search.
The Calais engine allows us, in an automated way, to go
through and say "OK, that's the Central Intelligence Agency. And by the
way, here's this other document also about the Central Intelligence Agency, and
both of them reference the same individual that you are curious about."
So, that's an example of some of the tools we're building with journalism in
mind. That's DocumentCloud the software.
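Editor's sketch: the linking step described above can be illustrated with a small Python example. It assumes an entity-extraction service has already resolved each mention to a canonical name; the document IDs, the entity sets, and both function names are hypothetical, and nothing here reflects the actual OpenCalais API or DocumentCloud's code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an entity-extraction step: each document ID maps to
# the set of resolved entities found in it. All names here are made up.
extracted = {
    "memo-001": {"Central Intelligence Agency", "John Doe"},
    "cable-017": {"Central Intelligence Agency", "John Doe", "State Department"},
    "menu-review": {"Culinary Institute of America"},
}

def build_entity_index(docs):
    """Map each entity to the set of document IDs that mention it."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index

def shared_entities(docs):
    """Return {(doc_a, doc_b): common entities} for every overlapping pair."""
    links = {}
    for a, b in combinations(sorted(docs), 2):
        common = docs[a] & docs[b]
        if common:
            links[(a, b)] = common
    return links

index = build_entity_index(extracted)
links = shared_entities(extracted)
print(sorted(index["Central Intelligence Agency"]))
print(links)
```

Because "CIA" has been resolved to two different canonical names, the two agency documents are linked to each other while the culinary one is not, which a plain text search on "CIA" would not distinguish.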
Then there's DocumentCloud the community, which is the other
piece of what we're trying to put together. Right now, it includes about 150
journalists and journalism organizations, with that number growing by leaps and
bounds. They're joining the community to use this tool to improve their
In order to join that community, you pretty much need to be a
journalist, by our definition. That is, you must be someone whose job, either
paid or unpaid, involves the acquisition, analysis, and ultimately publishing
of public source documents to benefit the public. Normally that means
government documents, and a lot of those documents are acquired through FOIA,
or they might exist on some other site.
Having said all that, we have been approached by any number
of non-journalism organizations, such as law firms. We've gotten the sense that
there is a need out there for sort of a lightweight document management tool,
and we may explore that as a potential revenue generator, but that isn't really
our main focus.
Robert: You talked about this idea of document management. One of the reasons that
self-publishing has been so popular has been the ease with which you can actually
publish to a platform. Can you talk a little bit about how DocumentCloud
removes some of the impediments traditionally associated with IT departments?
Aron: The genesis
of DocumentCloud was from a piece of software we developed at the Times called
DocumentViewer, which is a really straightforward piece of software. It will
take a PDF or a Word document, pretty much anything OpenOffice can open,
break it up, extract the text, make it searchable, and then publish it to
the web in an attractive way.
Our thinking going in was
that most news organizations, even the smallish ones, would want something
similar. So our original conception was that DocumentCloud would be sort of the
hub. We would want your metadata, but generally speaking, we thought that all
the member organizations would want this sort of viewer to be on their
hardware, behind their firewalls.
We could not have been more wrong about that, for both good
reasons and unfortunate ones. My perception is that newsrooms lack fundamental
technology to deal with documents, and that is sort of scary. The traditional
way that newsrooms deal with big document dumps is to split them up and have
people sit down with yellow legal pads and pens and highlighters.
That is the highest technology, really, that most newsrooms
currently employ. A lot of newsrooms don't have access to the simplest things,
like OCR. That's surprising to a lot of people, but it's true, and in this
little area of public source documents, we think we can help.
That's why we pivoted early on away from thinking about
DocumentCloud as a federated thing running on hundreds of websites, to a vision
where fundamentally it all goes through us. For the most part, we actually host
the documents on behalf of news organizations.
The way we made that simple and scalable was to make the
entire DocumentViewer portion static. What you see on the website is just HTML,
dynamically as a service.
All a news organization has to do is get a little embed code
from us, which they can embed anywhere they want in their CMS. They can put it
within a blank page on their own site, in a blog post, or whatever. It's really
simple and really straightforward.
Robert: Some of
these technologies like DocumentCloud coming out are pretty exciting. Can you
talk a little bit about some other ways that the cloud might be fundamentally shaping
Aron: My team
here at the Times couldn't do what we do without the cloud. We run everything
off of Amazon. On an election night, we can suddenly go from four or five
servers to 22 servers to handle all that traffic. A day later, we can just spin
back down to five servers. There's no way you could do that in a traditional IT
Robert: One concern that governments and corporations have about the cloud is where
data is stored. Typically they want or need the data to be stored within
their country's borders. But what's a drawback for some companies in this
scenario actually looks like an advantage for journalism. Is one benefit of the
cloud that it's possible to store any potentially embarrassing government
of the reach of that government?
Aron: That thought certainly has occurred to me, and I don't know that it's been
adjudicated anywhere, really. To
flip that idea on its head, consider that in the UK, there's this notion of
Crown copyright, where the public doesn't really own public documents and data.
It's sort of bizarre. For example, postal codes are
copyrighted under Crown copyright, and you have to pay a huge amount of money
to get boundaries of postal codes in the UK. I don't know what would happen if
somebody were to make that data publicly available on a server in the US. If
there were some assertion of Crown copyright, would that even apply
jurisdictionally to where that data is hosted?
It's a really good question, and I'm not sure I want to find
out, because this is sort of new territory for everybody. We're pretty cautious
about what we put up on the cloud and what we don't.
Robert: Looking at DocumentCloud, what was it that required something new to be built? I mean,
Microsoft has Office 365, and
Google obviously offers Google Docs.
There's also Scribd. What did you need
that you didn't find in these existing resources?
Aron: We looked
at all those options early on, and while in 2007, this field obviously wasn't
quite as crowded as it is now, none of them did what we wanted DocumentViewer
to do. DocumentViewer is more than just a way of putting a document online.
For example, it also allows you to do annotations, which is
kind of key from a journalistic standpoint. There's what we have come to refer
to as kind of a journalistic layer on top of a document.
A reporter can go into DocumentViewer, highlight a key paragraph,
click "Drag," and create an annotation. He or she can actually write
a couple of paragraphs to identify the significance of a particular sentence,
phrase, or paragraph and deep link into it.
That allows you to add a narrative to what is effectively a
piece of raw data, and say to the reader, "OK, here's the document that
we're basing our reporting on. But more than that, here are the key paragraphs, and
here's why they're key. Here's really what this means."
Scribd didn't do that. Docstoc didn't do that. There was
really no technology we could find that did it in a way that we thought
accomplished our goals. We also wanted something that wasn't Flash-based, which
Scribd at that time was.
Robert: That makes a lot of sense, particularly to support standards, when you consider all
of the form factors that you can use to access the Web. I imagine various
reporters want to use something like an iPad, a mobile phone, you name it.
Aron: Right. It's
not the world's greatest experience, but you can actually use DocumentViewer on
an iPhone. This is not an anti-Flash rant, or anything like that. It's just we
felt that the right technology for this was to stick to web standards, and what
we've come to refer to as HTML5.
Robert: On your blog, you've talked about how to use
Amazon EC2 behind the scenes. Can you explain how the elasticity of the cloud,
scaling up and down on demand, gets put to use by DocumentCloud?
Aron: Sure. It's
a big challenge. Document processing is a very CPU-intensive process, and so we
needed to be able to scale up rapidly when there's a big uploaded document, so
we did two things. One is that we have built and released a fairly lightweight
parallel processing library we call CloudCrowd. DocumentCloud has actually
released a number of open source libraries. We haven't released the entire
project, but that will come soon.
But the first piece was CloudCrowd, and that was sort of a
lightweight, Ruby-based parallel processing library, which allows us to quickly
add additional processing nodes if we get a 3,000 document dump from AP, which
actually happened last week.
Relatively easily, we can add two, three, four, or 100
servers to the processing pool and split that job up. It's basically a
MapReduce project at that point. So that's how the elasticity helps us on
DocumentCloud. The front end isn't as much of an issue, because once the
documents are actually rendered, it's 100% static content. We just serve those
off of S3.
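Editor's sketch: the split-and-merge pattern Aron describes can be shown in a few lines of Python. This is not CloudCrowd, which is a Ruby library whose API is not shown in this interview. Threads stand in for separate processing servers, word counting stands in for the real per-document work, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(text):
    """Map step: stand-in for per-document work (here, just count words)."""
    return len(text.split())

def process_batch(documents, workers=4):
    """Fan documents out to a pool of workers, then merge the results."""
    # Each worker thread plays the role of one processing node in the pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(process_document, documents))  # map
    return sum(counts)  # reduce

batch = ["first public record", "second record", "a third, longer public record"]
total_words = process_batch(batch, workers=3)
print(total_words)
```

Raising `workers` is the rough analogue of adding servers to the pool: the batch is the same, and only the number of documents handled at once changes.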
Robert: Can you
talk about how much you're processing and what you expect that to grow to?
Aron: It fluctuates, obviously, and it's pretty spiky, which is why we couldn't really
do this in a traditional environment. If you're building a data center, you
have to size it to the biggest spike you expect to have, which means you've got
a lot of time where you're sitting and idling with unused resources. Because we
don't need to worry about that, we can spin up 10, 20, or whatever at a time.
I think the most we've ever processed in a day is a few
thousand documents. And then there are certain days where it's just a few
dozen. We opened our beta this summer, and I think we're over 400,000 pages
now, closing in on 500,000.
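Editor's sketch: the sizing argument above can be made concrete with some arithmetic. The figures of 5 and 22 servers are the election-night numbers Aron gives for the Times; the 30-day month with a single peak day is an assumption made purely for illustration, as are the function names.

```python
HOURS_PER_DAY = 24

def server_hours_fixed(daily_demand):
    """Capacity sized to the biggest spike and left running every day."""
    peak = max(daily_demand)
    return peak * HOURS_PER_DAY * len(daily_demand)

def server_hours_elastic(daily_demand):
    """Capacity that follows demand, spun up and down day by day."""
    return sum(servers * HOURS_PER_DAY for servers in daily_demand)

# Assumed month: 29 ordinary days at 5 servers, one peak day at 22.
demand = [5] * 29 + [22]
fixed = server_hours_fixed(demand)
elastic = server_hours_elastic(demand)
idle_share = 1 - elastic / fixed  # fraction of fixed capacity left unused

print(fixed, elastic, round(idle_share, 2))
```

Under these assumptions, roughly three quarters of the fixed capacity would sit idle over the month, which is the waste that spinning servers up and down avoids.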
Robert: You mentioned already that DocumentCloud uses open source, and is itself open
Aron: It's MapReduce, but we don't use Hadoop. Our version of Hadoop is CloudCrowd.
Think of the old Apple ad: CloudCrowd is Hadoop for the rest of us. It's a
much simpler Ruby-based MapReduce library for doing parallel processing.
Robert: I definitely sense that investigative journalism is being cut from a lot of news
organizations, because it's expensive and time-consuming. At the same time,
computer-assisted reporting, which includes things like web scraping and data
mining, is on the rise and has actually led to Pulitzer Prize-winning stories.
Do you think that technology offers new hope to investigative journalism?
Aron: DocumentCloud, I think, is an example of how technology can be brought to
bear on that. As I said before, most journalists do serious document reporting
and analysis as a very analog process, and I think that the document piece is
just one tiny fragment.
Part of what a lot of computer-assisted reporting folks are
doing these days in newsrooms is acquiring the data and making it searchable,
so it's easier for non-technical journalists to work with. I think the smart
application of technology in newsrooms can be a force multiplier for shrinking
The Times obviously has made a significant commitment to
investigative reporting, which not every news organization has. Anyone who
reads a newspaper knows the industry is struggling, which is a very good reason
why newspaper staffs are shrinking. The way I see it is that technology can
help overcome some inefficiencies, which can help preserve journalistic
Robert: Thanks so much for your time. I greatly appreciate it.
Aron: You bet.