Microsoft Research’s various summer schools provide excellent opportunities to work with our academic partners to foster the next generation of computer scientists and breakthrough applications. So with great anticipation, I headed to Moscow for the 2014 summer school in Russia, which took place over the sunny days of July 30 to August 6.
The summer school is Microsoft Research’s largest annual event in Russia. Since its inception in 2009, it has tackled a wide array of cutting-edge computing topics, including:
This year’s event, which was co-sponsored by Lomonosov Moscow State University and Yandex (one of Russia’s leading Internet companies), focused on conducting research in the cloud. Our goal was to train a new generation of researchers to build cloud-based tools and services that will support scientific discovery in this age of “big data.”
Faculty, students, and staff of the 2014 summer school in Moscow
At past summer schools, we’ve focused on assembling a student body consisting primarily of young computer scientists. This year, we expanded the student selection to include researchers from any academic discipline who have an understanding of basic scientific data analysis and programming skills. More than 600 students and professionals, including graduate and advanced undergraduate students as well as young scientists and developers, applied. After a very selective process, 42 students, representing universities and research institutions in Russia, Kyrgyzstan, Azerbaijan, and Ukraine, were admitted to the seven-day course. The attendees included candidates doing trailblazing research in mathematics, computer science, geology, engineering, cryptology, space monitoring, photonics and optics, bioinformatics, and aero and plasma physics.
The faculty for the school was selected for their wide range of experience in research applications of cloud computing. Tony Hey from Microsoft Research gave the opening address and set the context for the rest of the week. Professor Geoffrey Fox of Indiana University (United States) covered topics related to scalable data analysis algorithms and the rapidly expanding open-source cloud software stack. Professor Sergey Berezin of Lomonosov Moscow State University talked about the challenges of building client-plus-cloud applications, describing the creation of desktop and web applications that use the cloud for analysis and data visualization. Professor Paul Watson of Newcastle University (United Kingdom) lectured on cloud workflows for scientific applications, hybrid cloud security, and cost models; and Sergey Bykov of Microsoft Research discussed a next-generation cloud-computing platform, the recently released Orleans cloud-programming toolkit.
The august faculty notwithstanding, the students proved to be the real stars of the school. We started them off with a one-day training on how to use Microsoft Azure and provided each student with a small Azure account to use for their projects. At the end of the first day, we asked them to form into small teams to build a cloud application. With only one full day and three days outside of class hours to work on their projects, the outcomes, which they presented on the final day of class, were simply astounding. The topics ranged from highly scientific areas, such as bioinformatics and satellite orbital guidance, to social networks and the Internet of Things. The students integrated ideas from the lectures and truly understood why the cloud is a very different paradigm of computing than any they had encountered in the past.
The faculty honored the best projects in four categories:
The faculty had a difficult time selecting the winners, as so many of the teams demonstrated creativity, collaboration, and amazing energy. Yandex provided gifts for the winners and hosted a lovely end-of-school party at their Moscow headquarters. Microsoft Research is already looking forward to next year’s summer school in Russia, where we will again strive to push computer science and research applications to new heights.
—Dennis Gannon, Director, Microsoft Research
Big data: it’s the hot topic these days, promising breakthroughs in just about every field, from medicine to marketing to machine learning and more. But for many of us, the problems of managing big data hit home when we confront the welter of digital photos and videos we have recorded with our smartphones and cameras. Multiply this by the number of people doing this around the world and it is a big problem. On the surface, it does not seem like an endeavor on the order of treating cancer (more on that later), but it is a colossal headache to organize, classify, search, and retrieve our multimedia content—and designing systems to do this at scale effectively is a huge challenge.
Thankfully, Professor Heiko Schuldt and Ivan Giangreco of the Databases and Information Systems (DBIS) Group at the University of Basel are working on a project to do just that, and a whole lot more. Their integrated system harnesses the power of the cloud, through Microsoft Azure, to understand and sort through the terabytes of data that make up multimedia content to find and return like objects.
The Basel team’s system combines the power of relational databases with the adaptability of information retrieval systems. It can store and handle any type of multimedia data, along with the features extracted from it. Once an algorithm for feature extraction is defined, the system automatically executes the extraction, storage, and indexing of both the feature data and the object itself. This approach efficiently carries out Boolean queries as well as searches that rank images by their feature similarity scores. In addition, it provides novel query paradigms and interfaces; for example, you can sketch an image or parts thereof and find images that are similar to your sketch.
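To make the idea of mixing Boolean queries with similarity ranking more concrete, here is a minimal Python sketch. It is not the Basel team's code: the `search` function, the tiny in-memory `IMAGES` list, the three-element feature vectors, and the choice of cosine similarity are all assumptions made purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(images, query_features, predicate, top_k=3):
    """Apply a Boolean filter first, then rank the survivors by similarity."""
    # Boolean part: keep only the images whose metadata satisfies the predicate.
    candidates = [img for img in images if predicate(img)]
    # Retrieval part: order the remaining images by similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda img: cosine_similarity(img["features"], query_features),
        reverse=True,
    )
    return [img["id"] for img in ranked[:top_k]]


# Hypothetical toy collection; real features would have many more dimensions.
IMAGES = [
    {"id": "beach.jpg", "year": 2014, "features": [0.9, 0.1, 0.0]},
    {"id": "forest.jpg", "year": 2014, "features": [0.1, 0.9, 0.1]},
    {"id": "sunset.jpg", "year": 2013, "features": [0.8, 0.2, 0.1]},
    {"id": "dunes.jpg", "year": 2014, "features": [0.7, 0.3, 0.0]},
]

# "Images from 2014 that look most like this query vector."
results = search(IMAGES, [1.0, 0.0, 0.0], lambda img: img["year"] == 2014, top_k=2)
print(results)
```

A system working at the scale of millions of images would rely on an index rather than scoring every candidate, but the division of labor is the same: the Boolean condition narrows the set, and the feature similarity orders what is left.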
It's exciting to see how this work has progressed since the Basel researchers attended our first European Microsoft Azure for Research training workshop at ETH Zurich last November. They successfully applied for an Azure Award, which got them up and running on the cloud within a few weeks. This allowed the team to quickly develop and deploy their system in a scalable way. Microsoft Azure is ideal as a fast, distributed storage and computing fabric for running the Basel team’s project, whose MapReduce-style program can grow as millions of images are added to the system. By moving to the cloud, the Basel researchers have been able to develop, deploy, and demonstrate the system, testing their ideas at scale on the 14 million images that comprise the ImageNet database. They presented this work at the IEEE International Congress on Big Data (BigData 2014).
Professor Schuldt explains how Azure has helped him with his research. "In large-scale image retrieval, both effectiveness and efficiency are essential requirements. Thanks to Microsoft’s support and the use of the Azure cloud, we have been able to successfully address the retrieval efficiency so that we can concentrate further on retrieval effectiveness, especially by developing novel search paradigms and user interfaces based, for instance, on gestures or sketches."
The Basel researchers are looking forward to tackling the even bigger Bing Clickture dataset, which contains 40 million images. They also plan to test the system on video content, in what they’re calling the IMOTION project, which will “multiply the challenges in terms of retrieval efficiency,” notes Professor Schuldt. Their next paper was presented at the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, and we're looking forward to seeing how the team continues to push the boundaries of big data by using Microsoft Azure.
Now back to that earlier comment about treating cancer. Approaches similar to those used by the Basel team’s project might, in fact, someday help us to better understand and treat cancer. The underlying computer science and cloud technologies could be used, for example, for managing and analyzing MRI scans of tumors.
The Basel team’s project is just one example of how easy it is to get up and running on the cloud and accelerate your research, especially by taking advantage of the Microsoft Azure for Research initiative, which offers not only training but also substantial grants of Azure storage and compute resources for qualified projects. Read about the initiative and our requests for proposals. Who knows? Maybe your project will be the next big thing in big data.
—Kenji Takeda, Solutions Architect and Technical Manager, Microsoft Research
The following blog is from guest contributor Paul Greenfield of CSIRO, Australia’s national science agency. He and his colleagues have developed a new correction tool to address the problem of DNA sequencing errors in biological and ecological research, and they have just released it to the research community worldwide.
—Simon Mercer, Director, Microsoft Research
The rapid development of next-generation DNA sequencing has revolutionized biological and ecological research in the last few years. The cost of DNA sequencing has fallen dramatically, and sequencing machines are becoming a standard piece of lab equipment. Low-cost sequencing is enabling researchers to uncover the gene differences that make some people more susceptible to diseases; to explore the genetic makeup of microbial communities from the human gut or the bottom of the ocean; and to rapidly identify the organism responsible for a life-threatening infection.
But while the costs of sequencing have plummeted, the accuracy of the data produced has improved only slowly: about 1 percent of the bases generated are still called incorrectly. The bioinformatics community has responded to this problem by building specialized error correction tools that use the inherent redundancy in sequence data to find and repair miscalls and other sequencing errors. Tests have shown that incorporating the best of these error-correction tools into standard bioinformatics analytical pipelines can result in much better quality genomes and more accurately called gene variants.
However, accurately correcting errors turns out to be a difficult problem, largely because of the repetitive and ambiguous nature of genomes. It is easy to correct simple substitution errors, such as when 50 sequence reads say that a given base is an A, and only the read being corrected says it’s a G. Such simple errors are well handled by downstream tools such as assemblers and aligners. The challenge is making the right correction when there are multiple plausible corrections (such as when 50 reads say A, 49 say G, and the read being corrected says T), as happens whenever reads fall across the end of a repeated region within a genome. Just to make things more challenging, this correction has to be done without any knowledge of the genomes being sequenced, and the only clues about which corrections are “right” come from the sequence data itself.
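The two situations described above can be illustrated with a small Python sketch. This is a deliberately naive majority-vote check, not the algorithm used by any real correction tool; the function name `classify_base` and the `min_support` and `dominance` thresholds are hypothetical values chosen for the example.

```python
from collections import Counter


def classify_base(read_base, other_bases, min_support=3, dominance=0.9):
    """Decide what to do with one base of a read, given the bases that
    other overlapping reads report at the same position.

    Returns ("keep", base), ("correct", base) or ("ambiguous", None).
    The thresholds are illustrative assumptions only.
    """
    counts = Counter(other_bases)
    total = sum(counts.values())
    if total == 0:
        # No evidence from other reads, so leave the base alone.
        return ("keep", read_base)
    if counts[read_base] >= min_support:
        # Enough other reads agree with this base to trust it.
        return ("keep", read_base)
    best, best_count = counts.most_common(1)[0]
    if best_count / total >= dominance:
        # One alternative clearly dominates: the easy substitution case.
        return ("correct", best)
    # Several plausible alternatives: the hard case near repeat boundaries.
    return ("ambiguous", None)


# 50 reads say A, only the read being corrected says G.
easy = classify_base("G", "A" * 50)
# 50 reads say A, 49 say G, and the read being corrected says T.
hard = classify_base("T", "A" * 50 + "G" * 49)
print(easy, hard)
```

The first call yields a correction to A, while the second is flagged as ambiguous because neither A nor G dominates. Counting votes at a single position cannot settle that second case; doing so requires looking at the rest of the read, which is the kind of context a simple tally like this one ignores.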
My colleagues and I at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) have just released a new error correction tool we’ve developed for use by the research community. We call it “Blue.” Blue is a high-performance C# application that runs natively on Windows systems, and under Mono on Linux and OS X. As we reported in a paper published in Bioinformatics, test results show that Blue is significantly faster than other available tools—especially on Windows—and is also more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected.
Another uncommon feature of Blue is that it can correct all three types of possible errors (substitutions, deletions, and insertions), making it suitable for use with data produced by the Roche 454 and Life Technologies Ion Torrent systems. Blue also allows for the correction of one set of reads with a consensus derived from another set of reads, and this capability has been used to correct small numbers of long (and expensive) Roche 454 reads with a consensus derived from a large file of cheaper (but shorter) Illumina reads. This “cross-correction” method has been used very effectively to improve the quality of several reference assemblies, ranging in size from bacteria to moths and grasses.
Blue and its associated tools can be downloaded from CSIRO Bioinformatics.
—Paul Greenfield, Research Group Leader, CSIRO, Division Computational Informatics