I recently sponsored an event in Manizales, Colombia, training biologists on .NET Bio and BioHPC, two projects that make computational research easier in the life sciences. As part of the training, Jarek Pillardy—the head of the Cornell Bioinformatics Facility (CBSU) at Cornell University—and some of his staff presented various aspects of BioHPC. I had the opportunity to sit down with Jarek, who is not only the developer of BioHPC but also a long-time user of the .NET Bio project. Here is a recap of that conversation.
Simon: You lead the CBSU—what activities does it support?
Jarek: CBSU is the Cornell University Bioinformatics Facility, and its mission is to support biological research with advanced computational infrastructure and bioinformatics tools and techniques. The facility’s main activities can be divided into maintaining extensive computational infrastructure configured for bioinformatics; providing easy access to the infrastructure through the web via BioHPC Web or interactively through BioHPC Lab; training, mainly through workshops and consulting; direct research collaborations, ranging from small projects to participating in major grants as co-principal investigators; and software and LIMS development.
Simon: What prompted you to develop BioHPC, and what does it do?
Jarek: BioHPC is our main way to deliver computational infrastructure to biologists. It is not easy for an experimental biologist to use computing tools directly and navigate the complicated maze of schedulers, command-line tools, data-storage methods, and other infrastructure. BioHPC simplifies both access to the infrastructure (through the web and interactively) and its management (hardware and software). We created BioHPC to make our life easier and to provide services for many more researchers. BioHPC Web gives users a simple way to submit data for processing and to manage jobs and data. BioHPC Lab is a tool to organize access to interactive machines, reserve time, and manage associated resources, like storage and computing time. For us, it provides a convenient platform to deliver computational resources (hardware and software combined) and a set of tools to manage them.
Simon: Do you have any plans to extend the capabilities of BioHPC in the future?
Jarek: BioHPC is constantly evolving to meet the changing needs in bioinformatics and adapt to technological changes. Currently, we are supporting a diverse array of local and remote clusters, but we are planning to add capacity to run computations in the cloud. We are in the final stages of adding Windows Azure to our supported computing infrastructure. We will also be adding new software.
Architectural overview—BioHPC schema
Simon: How do you see the Windows Azure cloud being used in bioinformatics?
Jarek: For direct research computing, I can see two main scenarios. First, there will be advanced users, running their own virtual machines. These probably will be a minority of users. Second, there will be researchers who access Azure resources via an intermediate service like BioHPC. This scenario will involve a lot of task-focused services (for example, analyzing population data, assembling and annotating sequences, or handling a particular software pipeline) running on Azure, with the end-user not even fully aware of that. Azure provides an opportunity to bring data closer to the computing infrastructure.
Simon: How has BioHPC been able to help the Colombian BIOS Center?
Jarek: I think BioHPC may deliver for them the same benefits it does for us: an easy-to-use tool that provides convenient access to infrastructure and simplified management. They are still in the process of setting up and organizing, and we are in close contact with them, providing consultation and help. BIOS’s mission for Colombian researchers is very similar to what our facility provides to Cornell, so our tools should be very useful to them. I hope they will be able to improve and expand BioHPC in order to meet their particular needs, which will make it much better.
As Jarek notes, BioHPC is a living, constantly evolving project, as is .NET Bio. If you’re a biological researcher, I encourage you to take a good look at these tools.
—Simon Mercer, Director of Health and Wellbeing, Microsoft Research Connections
“If I have seen further, it is by standing on the shoulders of giants.”—Sir Isaac Newton
Standing on the shoulders of giants is a metaphor we often use to describe how research advances. More than an aphorism, it is a mindset that we ingrain in students when they start graduate school: take the time to understand the current state of the art before attempting to advance it further. Having to justify, during your PhD defense, why you have reinvented the wheel is not a comfortable situation to be in. Moreover, the value of truly reproducible research is reinforced every time a paper is retracted because its results cannot be reproduced, or every time that promising academic research, such as the pursuit of important new drugs, fails to meet the test of reproducibility.
Of course, to truly learn from work that has preceded yours, you need access to it. How can you build on the latest research if you don’t know its details? Thankfully, open access (OA) is making it easier to find research papers, and Microsoft Research is committed to OA. Though it’s a good start, OA articles contain only words and pictures. What about the data, software, input parameters, and everything else needed to reproduce the research?
While research software provides the potential for better reproducibility, most people agree that we are some way from achieving this. It’s not just a matter of throwing your source code online. Even though tools such as GitHub provide excellent sharing and versioning, it is up to the researcher or developer to make sure the code can not only be re-run but also be understood by others. There are still technical issues to overcome, but the social ones are even harder to tackle. The development of scientific software and researchers’ selection of which software to use and reuse are intertwined. We at Microsoft Research are concerned with this: see “Troubling Trends in Scientific Software” in the May 17, 2013, issue of Science magazine.
Kenji Takeda talks about reproducible research and the cloud at CW14. Photo: Tim Parkinson, CC-BY
This year’s Collaboration Workshop (CW14), run by the Software Sustainability Institute (SSI), brought together like-minded innovators from a broad spectrum of the research world (researchers, software developers, managers, funders, and more) to explore the role of software in reproducible research. This theme couldn’t have been timelier, and I was excited to take part in this dynamic event again with a talk on reproducible research and the cloud. The “unconference” format, in which the agenda is driven by attendees’ participation, was perfect for exploring the many issues around reproducible research and software. So, too, was the eclectic make-up of the attendees, so unlike that at more conventional conferences.
Hack Day winners receive Windows 8.1 tablets for Open Source Health Check. Left to right: Arfon Smith (GitHub), Kenji Takeda (Microsoft Research), James Spencer (Imperial College), Clyde Fare (Imperial College), Ling Ge (Imperial College), Mark Basham (DIAMOND), Robin Wilson (University of Southampton), Neil Chue-Hong (Director, SSI), Shoaib Sufi (SSI)
Instead of leaving after two days, many participants stayed on for Hack Day, a hackathon that challenged them to create real solutions to problems surfaced at the workshop. Eight team leaders pitched their ideas to the crowd, and the researchers and software developers literally voted with their feet to join their favorite team. The diversity of ideas was impressive: they included scraping the web to catalogue scientific software citations, extending GitHub to natively visualize scientific data, and assessing research code quality online. We made sure that teams were able to use Microsoft Azure to quickly set up websites, Linux virtual machines, and processing back-ends to build their solutions.
Arfon Smith from GitHub and I served as judges, and we had a tough time choosing a winning project. After much back-and-forth, we awarded the honor to the Open Source Health Check team, which created an elegant and genuinely usable service that combines some of the best practices discussed during the workshop. Their prototype runs a checklist on any GitHub repository to make sure that it incorporates the critical components for reproducibility, including documentation, an explicit license, and a citation file. The team worked furiously to implement this, including deploying it on Microsoft Azure and integrating it with the GitHub API, to demonstrate a complete online working system.
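To make the checklist idea concrete, here is a minimal sketch of that kind of check in Python. It is not the team's implementation: it assumes a local checkout of a repository rather than the GitHub API, and the `CHECKLIST` table, the filename prefixes, and the `check_repo` function are all hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical checklist: item name -> lowercase filename prefixes that satisfy it.
# These prefixes are illustrative assumptions, not the rules the Hack Day team used.
CHECKLIST = {
    "documentation": ("readme",),
    "license": ("license", "licence", "copying"),
    "citation": ("citation",),
}


def check_repo(repo_dir):
    """Return {item: bool}, looking only at files in the top level of repo_dir."""
    names = [p.name.lower() for p in Path(repo_dir).iterdir() if p.is_file()]
    return {
        item: any(name.startswith(prefixes) for name in names)
        for item, prefixes in CHECKLIST.items()
    }


if __name__ == "__main__":
    # Demo on a throwaway directory that has a README and a license but no citation file.
    with tempfile.TemporaryDirectory() as demo_dir:
        (Path(demo_dir) / "README.md").write_text("How to run the analysis")
        (Path(demo_dir) / "LICENSE").write_text("License text goes here")
        for item, present in check_repo(demo_dir).items():
            print(f"{item}: {'ok' if present else 'missing'}")
```

A hosted service like the one described would presumably list a repository's files remotely instead of reading a local directory, but the pass/fail logic per checklist item could stay the same.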
Recomputation.org aims to make computational experiments easily reproducible decades into the future.
In addition to our role at CW14, Microsoft Research is delighted to be supporting teams working on new approaches to scientific reproducibility as part of our Microsoft Azure for Research program.
While we still have not achieved truly reproducible research, CW14 proved that the community is dedicated to improving the situation, and cloud computing has an increasingly important role to play in enabling reproducible research.
—Kenji Takeda, Solutions Architect and Technical Manager, Microsoft Research Connections
The Microsoft Faculty Summit celebrates the ongoing collaboration of Microsoft Research and the academic community, providing a forum for leading faculty members and Microsoft personnel to collectively discuss the future of computing and its applications in solving real-world problems. This productive partnership extends all the way back to the founding of Microsoft Research, so at this year’s summit, we are pleased to release Science@Microsoft, an e-book that commemorates our many years of fruitful teamwork.
Now, not to complain, but imagine the task that fell to me and my fellow editors—David Heckerman, Stephen Emmott, and especially Yan Xu and Kenji Takeda—reviewing years and years of research to select a handful of stories that encapsulate the irrepressible innovation, the remarkable collegiality, and the ground-breaking impact that have characterized the collaboration between Microsoft Research and leading academic researchers. It was almost as daunting as the original research. Well, not really, but it was challenging. Which stories would make the cut? What were the selection criteria? As David Heckerman observed, “Our challenge was to select a small number of stories that each represented a unique aspect of the new paradigm—the eigenstories, if you will.”
In the end, we focused on the last 10 years, choosing stories that demonstrate the breadth of our collaborative research and the potential of computer science to address some of the world’s most vexing problems. We believe these stories demonstrate the amazing power of technology to impact areas far afield from traditional computer science.
Within these pages, you will read about investigations into the genetic basis of human disease, the study of the heavens, and the design of three-dimensional objects. You’ll find accounts of basic research with practical outcomes: from protecting endangered wildlife to safeguarding consumers. You’ll see how Microsoft Researchers, working in concert with academic and government investigators, have tackled some of the most pressing issues of the twenty-first century, from climate change to the AIDS epidemic to world hunger. You’ll also discover equally valuable, if less headline-worthy, contributions to the publication of chemical information and the reuse of data from clinical studies. Still, choosing was difficult. In the words of Stephen Emmott, “It was virtually impossible to select, given the first-rate science characterizing all of the projects.” Above all, this collection demonstrates Microsoft Research’s commitment to applying computer science to basic research and our rich history of working with external researchers. These stories commemorate a great record of using computing technologies in the service of humankind.
Science@Microsoft is published under a Creative Commons license, and is available as a PDF at microsoft.com/scienceatmicrosoft. It is also offered as an e-book through the Amazon and Barnes & Noble online stores. So fire up your laptops or e-readers!
—Tony Hey, Vice President, Microsoft Research Connections