Big data: it’s the hot topic these days, promising breakthroughs in just about every field, from medicine to marketing to machine learning and more. But for many of us, the problems of managing big data hit home when we confront the welter of digital photos and videos we have recorded with our smartphones and cameras. Multiply this by the number of people doing this around the world and it is a big problem. On the surface, it does not seem like an endeavor on the order of treating cancer (more on that later), but it is a colossal headache to organize, classify, search, and retrieve our multimedia content—and designing systems to do this at scale effectively is a huge challenge.
Thankfully, Professor Heiko Schuldt and Ivan Giangreco of the Databases and Information Systems (DBIS) Group at the University of Basel are working on a project to do just that, and a whole lot more. Their integrated system harnesses the power of the cloud, through Microsoft Azure, to understand and sort through the terabytes of data that make up multimedia content to find and return like objects.
The Basel team’s system combines the power of relational databases with the adaptability of information retrieval systems. It can store and handle any type of multimedia data, including the features extracted from it. Once an algorithm for feature extraction is defined, the system automatically executes the extraction, storage, and indexing of both the feature data and the object itself. This approach efficiently carries out Boolean queries as well as searches that rank images by their feature similarity scores. In addition, it provides novel query paradigms and interfaces; for example, you can sketch an image or parts thereof and find images that are similar to your sketch.
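To make the combination of Boolean queries and similarity ranking concrete, here is a minimal Python sketch. It is not the Basel team's implementation: the `MediaObject` class, the toy histogram extractor, and the `search` function are all hypothetical names chosen for illustration, and a real system would use far richer features and an index rather than a linear scan.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MediaObject:
    """A stored object with its metadata and an extracted feature vector."""
    object_id: str
    metadata: dict
    features: list = field(default_factory=list)


def extract_features(pixels):
    """Toy extractor (an assumption): normalized 4-bin intensity histogram."""
    bins = [0, 0, 0, 0]
    for value in pixels:  # values are assumed to lie in 0..255
        bins[min(value // 64, 3)] += 1
    total = sum(bins) or 1
    return [count / total for count in bins]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(collection, predicate, query_features, k=3):
    """Filter by a Boolean predicate on metadata, then rank by feature distance."""
    candidates = [obj for obj in collection if predicate(obj.metadata)]
    candidates.sort(key=lambda obj: euclidean(obj.features, query_features))
    return candidates[:k]


collection = [
    MediaObject("img1", {"type": "photo", "year": 2014}, extract_features([10, 20, 30, 200])),
    MediaObject("img2", {"type": "photo", "year": 2013}, extract_features([250, 240, 230, 220])),
    MediaObject("img3", {"type": "video", "year": 2014}, extract_features([10, 20, 30, 40])),
    MediaObject("img4", {"type": "photo", "year": 2014}, extract_features([240, 250, 200, 10])),
]

# Boolean part: photos from 2014. Similarity part: closest histograms first.
query = extract_features([12, 18, 25, 210])
results = search(collection, lambda m: m["type"] == "photo" and m["year"] == 2014, query)
result_ids = [obj.object_id for obj in results]
```

In this sketch the Boolean predicate plays the role of the relational query, while the distance sort stands in for the information retrieval side. A sketch-based query would follow the same path, with the query features extracted from the user's drawing instead of from a photo.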
It's exciting to see how this work has progressed since the Basel researchers attended our first European Microsoft Azure for Research training workshop at ETH Zurich last November. They successfully applied for an Azure Award, which got them up and running on the cloud within a few weeks. This allowed the team to quickly develop and deploy their system in a scalable way. Microsoft Azure is ideal as a fast, distributed storage and computing fabric for running the Basel team’s project, whose MapReduce-style program can grow as millions of images are added to the system. By moving to the cloud, the Basel researchers have been able to develop, deploy, and demonstrate the system, testing their ideas at scale on the 14 million images that comprise the ImageNet database. They presented this work at the IEEE International Congress on Big Data (BigData 2014).
Professor Schuldt explains how Azure has helped him with his research. "In large-scale image retrieval, both effectiveness and efficiency are essential requirements. Thanks to Microsoft’s support and the use of the Azure cloud, we have been able to successfully address the retrieval efficiency so that we can concentrate further on retrieval effectiveness, especially by developing novel search paradigms and user interfaces based, for instance, on gestures or sketches."
The Basel researchers are looking forward to tackling the even bigger Bing Clickture dataset, which contains 40 million images. They also plan to test the system on video content, in what they’re calling the IMOTION project, which will “multiply the challenges in terms of retrieval efficiency,” notes Professor Schuldt. Their next paper was presented at the 37th International ACM-SIGIR Conference on Research and Development in Information Retrieval, and we're looking forward to seeing how the team continues to push the boundaries of big data by using Microsoft Azure.
Now back to that earlier comment about treating cancer. Approaches similar to those used by the Basel team’s project might, in fact, someday help us to better understand and treat cancer. The underlying computer science and cloud technologies could be used, for example, for managing and analyzing MRI scans of tumors.
The Basel team’s project is just one example of how easy it is to get up and running on the cloud and accelerate your research, especially by taking advantage of the Microsoft Azure for Research initiative, which offers not only training but also substantial grants of Azure storage and compute resources for qualified projects. Read about the initiative and our requests for proposals. Who knows? Maybe your project will be the next big thing in big data.
—Kenji Takeda, Solutions Architect and Technical Manager, Microsoft Research
The following blog is from guest contributor Paul Greenfield of CSIRO, Australia’s national science agency. He and his colleagues have developed a new correction tool to address the problem of DNA sequencing errors in biological and ecological research, and they have just released it to the research community worldwide.
—Simon Mercer, Director, Microsoft Research
The rapid development of next-generation DNA sequencing has revolutionized biological and ecological research in the last few years. The cost of DNA sequencing has fallen dramatically, and sequencing machines are becoming a standard piece of lab equipment. Low-cost sequencing is enabling researchers to uncover the gene differences that make some people more susceptible to diseases; to explore the genetic makeup of microbial communities from the human gut or the bottom of the ocean; and to rapidly identify the organism responsible for a life-threatening infection.
But while the costs of sequencing have plummeted, the accuracy of the data produced has improved only slowly: about 1 percent of the bases generated are still called incorrectly. The bioinformatics community has responded to this problem by building specialized error correction tools that use the inherent redundancy in sequence data to find and repair miscalls and other sequencing errors. Tests have shown that incorporating the best of these error-correction tools into standard bioinformatics analytical pipelines can result in much better quality genomes and more accurately called gene variants.
However, accurately correcting errors turns out to be a difficult problem, largely because of the repetitive and ambiguous nature of genomes. It is easy to correct simple substitution errors, such as when 50 sequence reads say that a given base is an A, and only the read being corrected says it’s a G. Such simple errors are well handled by downstream tools such as assemblers and aligners. The challenge is making the right correction when there are multiple plausible corrections (such as when 50 reads say A, 49 say G, and the read being corrected says T), as happens whenever reads fall across the end of a repeated region within a genome. Just to make things more challenging, this correction has to be done without any knowledge of the genomes being sequenced, and the only clues about which corrections are “right” come from the sequence data itself.
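The two cases above can be illustrated with a small Python sketch of majority voting at a single position. This is a deliberately naive illustration, not how Blue works: the function name `correct_base` and the `min_support` and `dominance` thresholds are assumptions made for the example.

```python
from collections import Counter


def correct_base(read_base, overlapping_bases, min_support=3, dominance=0.9):
    """Vote on one position of the read being corrected.

    overlapping_bases holds the bases that other reads report at this position.
    Returns (base, status), where status is 'kept', 'corrected', or 'ambiguous'.
    The thresholds are illustrative assumptions, not values from any real tool.
    """
    counts = Counter(overlapping_bases)
    if not counts:
        return read_base, "kept"  # no evidence either way
    if counts[read_base] >= min_support:
        return read_base, "kept"  # enough other reads agree with this read
    best_base, best_count = counts.most_common(1)[0]
    if best_count / sum(counts.values()) >= dominance:
        return best_base, "corrected"  # one alternative clearly dominates
    return read_base, "ambiguous"  # several plausible corrections


# Easy case: 50 reads say A, the read being corrected says G.
easy = correct_base("G", ["A"] * 50)

# Hard case: 50 reads say A, 49 say G, the read being corrected says T.
hard = correct_base("T", ["A"] * 50 + ["G"] * 49)
```

Simple voting resolves the first case but can only flag the second as ambiguous. Choosing between A and G there requires additional context from the rest of the read, which is where the difficulty described above comes from.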
My colleagues and I at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) have just released a new error correction tool we’ve developed for use by the research community. We call it “Blue.” Blue is a high-performance C# application that runs natively on Windows systems, and under Mono on Linux and OS X. As we reported in a paper published in Bioinformatics, test results show that Blue is significantly faster than other available tools—especially on Windows—and is also more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected.
Another uncommon feature of Blue is that it can correct all three types of possible errors (substitutions, deletions, and insertions), making it suitable for use with data produced by the Roche 454 and Life Technologies Ion Torrent systems. Blue also allows for the correction of one set of reads with a consensus derived from another set of reads, and this capability has been used to correct small numbers of long (and expensive) Roche 454 reads with a consensus derived from a large file of cheaper (but shorter) Illumina reads. This “cross-correction” method has been used very effectively to improve the quality of several reference assemblies, ranging in size from bacteria to moths and grasses.
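The idea of cross-correction can be sketched in Python with a table of k-mers (short substrings of length k) counted from the short reads and then used to repair a long read. This is a simplified illustration under stated assumptions, not Blue's algorithm: it handles only a single substitution per read, ignores insertions and deletions, and the function names, `k`, and `min_count` are all hypothetical.

```python
from collections import Counter


def build_kmer_counts(reads, k):
    """Count every k-mer that occurs in the trusted (short) reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def is_trusted(seq, counts, k, min_count):
    """A sequence is trusted if every k-mer in it is seen at least min_count times."""
    return all(counts[seq[i:i + k]] >= min_count for i in range(len(seq) - k + 1))


def cross_correct(long_read, counts, k, min_count=2):
    """Try every single-base substitution; accept it only if exactly one works."""
    if is_trusted(long_read, counts, k, min_count):
        return long_read
    candidates = set()
    for pos, original in enumerate(long_read):
        for base in "ACGT":
            if base == original:
                continue
            candidate = long_read[:pos] + base + long_read[pos + 1:]
            if is_trusted(candidate, counts, k, min_count):
                candidates.add(candidate)
    # Leave the read unchanged when no repair, or more than one repair, fits.
    return candidates.pop() if len(candidates) == 1 else long_read


# Short reads covering a made-up sequence, each observed twice.
short_reads = ["ACGTTGCA", "GTTGCAAG", "TGCAAGCT"] * 2
kmer_counts = build_kmer_counts(short_reads, k=4)

# A long read with one substitution error (A read as T at index 7).
corrected = cross_correct("ACGTTGCTAGCT", kmer_counts, k=4)
```

The exhaustive search over substitutions is only workable for short toy reads, but it shows the principle: the consensus comes from one set of reads and is applied to another.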
Blue and its associated tools can be downloaded from CSIRO Bioinformatics.
—Paul Greenfield, Research Group Leader, CSIRO, Division Computational Informatics
Summer Bridge students and their hosts at Microsoft
Experts agree that the next wave of innovation in computing requires diversity in the research and development teams who will create it. I believe that means expanding the pipeline of students entering computing. In particular, we need to get more girls into the pipeline, which is why I am so pleased to have had two amazing young women working with me as interns this summer: Veronica Catete, a third-year doctoral student at North Carolina State University, and Alka Pai, a senior at Tesla STEM High School in Redmond, Washington.
Veronica and Alka are enthusiastic about encouraging more young women to study and work in the computer sciences. To that end, they are developing a free, online computer science toolkit for middle-school girls as well as a course that teaches principles of computer science through game design. When they aren’t busy developing amazing tools, this dynamic duo is participating in events and activities that are designed to excite young people about the future of computer science.
I’d like to hand it over to Veronica and Alka, to discuss an event they hosted in July at Microsoft’s Redmond (Washington) campus. As you read their account, I encourage you to ask yourself how you, too, might help foster more diversity in computing. We all have an interest in promoting innovation in technology and computer science. Perhaps Veronica and Alka’s blog will inspire some ideas; if so, I’d love to hear from you!
—Rane Johnson-Stempson, Principal Research Director for Education and Scholarly Communication, Microsoft Research
On July 17, 13 students (10 girls and 3 boys) from the greater Seattle area came to Microsoft to explore the possibilities offered by careers in computing. These students are part of the Summer Bridge Program, an academic enrichment and college readiness project offered through the University of Washington Women’s Center. This program is designed for promising eighth-grade students who are interested in exploring science, technology, engineering, and mathematics—the so-called STEM fields.
We gave the students a tour of the Microsoft campus, highlighting several of the amazing projects underway here. The students started their day by exploring modeling and graphics by designing 3D models of Seattle’s iconic Space Needle, which they were able to print in the Microsoft Research hardware lab.
Working together, students build a model of the Seattle Space Needle.
During lunch, our visitors enjoyed a panel discussion from three of our high-school interns, Alisha Meherally, Arjun Narayan, and me (Alka). We discussed how we got started in computer science and what it’s like to work at Microsoft. We also offered our tips for finding opportunities to work in and learn about computer science outside the classroom. I think we surprised the students by admitting that all three of us entered computer science studies reluctantly—kicking and screaming, so to speak. But we hastened to add that now, having experienced the thrill of resolving software bugs and seeing computing’s potential for creative disruption, we are avid enthusiasts, deeply passionate about our work in computer science.
The Summer Bridge students then participated in a TouchDevelop workshop, where they used Windows 8 phones to write actual software code. Then we headed off to tour Microsoft’s state-of-the-art Cybercrime Center, where the students got up close and personal with the forensics lab and experienced, firsthand, the tools and techniques used to spot cyber crimes. For example, students Waltana Dewit, Yohannes Seghane, and Sarina Tran examined several supposed Microsoft products, working together to determine which were legitimate and which were counterfeit. “You have to look really hard to notice the differences,” said Yohannes. “If someone were to buy one of these from Amazon, I don’t think they would be able to tell.”
Looking for cyber crimes: students try to identify counterfeit software products.
Our visitors finished the day by touring Microsoft Research’s hardware lab. There they got to see the cool gadgets that the researchers use to prototype their ideas or fix a broken part.
The students were excited to see the potential of computer science careers to change the world, and they came away with a deeper understanding of why they should study STEM. They left with smiles on their faces, souvenirs in their pockets, and a world of opportunity ahead. “This place is amazing,” observed Ngocmi Ngo. “I’ve already decided that I want to work here, now I just have to wait until I’m a junior.”
That’s the spirit, Ngocmi.
—Veronica Catete and Alka Pai, Microsoft Research Interns