Big data—that buzzword seems to dominate information technology discussions these days. But big data is so much more than a clever catchphrase: it’s a reality that holds enormous potential. We now have the largest and most diversified volume of data in human history. And it’s growing exponentially: approximately 90% of today’s data has been generated within the past two years. The exploding science of big data is changing the IT industry and exerting a powerful impact on everyday life.
But what should big data science be, and where is it headed? These are the fundamental questions that have prompted Tsinghua University and Microsoft Research Asia to work together to establish a pioneering graduate course on Big Data Foundations and Applications. Turing Award winner and Tsinghua professor Andrew Chi-Chih Yao spent more than eight months developing the course, which launched in September 2014.
Turing Award winner and Tsinghua professor Andrew Chi-Chih Yao
Solidifying knowledge through academia-industry cooperation
On October 9, Hsiao-Wuen Hon, managing director of Microsoft Research Asia, delivered the course’s first lecture. Dr. Hon stressed that the importance of big data lies not only in its value in academic research but also in its application to real-world problems, which, he said, is why the academia-industry cooperation represented by the course is so critical.
“One of our purposes in launching this course with Tsinghua is to introduce Microsoft’s ideas to students, to let them get to know us better,” he explained. “Meanwhile, our top professional researchers can deepen their understanding of big data while teaching the students. So the course is not just about enhancing the students’ understanding of big data; it’s also about solidifying the researchers’ knowledge of big data.”
Hsiao-Wuen Hon, managing director of Microsoft Research Asia, delivered the course's first lecture.
Echoing the importance of the industry-academia connection, Professor Yao remarked, “Big data is an epoch-making subject. It has influenced all the other disciplines, including computer science and information technology. We should not only focus on the scientific research. Education development is also a new trend.”
Leading the forefront of big data science
Wei Chen, a senior researcher at Microsoft Research Asia, has been a visiting professor at Tsinghua University since 2007. He has helped design and launch several entry-level courses at Tsinghua, and he is a strong proponent of the new big data course.
“We certainly don’t expect this course to become a platform for [Microsoft’s] product promotion. Instead, it is being established to provide students with cutting-edge knowledge, to get them engaged in research and technology development, and to foster their ability to do research and experimentation,” he said.
Wei Chen, senior researcher at Microsoft Research Asia, talks with a student.
Professor Chen pointed out that while the course will provide an academic understanding of big data, it will also introduce students to real-life cases of Microsoft big data research and applications. In addition, students will have the opportunity to conduct experiments using Microsoft Azure, the company’s cloud-computing platform. He believes these practical, hands-on components distinguish this class from other big data courses.
Feeding the talent pipeline
Microsoft Research has a long tradition of collaborating with universities and has undertaken several initiatives to nurture the next generation of talented researchers. Since 2002, for instance, Microsoft Research Asia has hosted over 4,000 interns and carried out projects with more than 40 universities and institutes. The new big data course comes directly out of that tradition, and both Microsoft Research and Tsinghua University have high expectations for this collaboration. Professor Yao probably put it best, saying, “I believe this world-class course will give students a comprehensive understanding of big data and its knowledge structure, helping them reach their goals in future jobs and research.”
—Kangping Liu, Senior Research Program Manager, Microsoft Research Asia
Halloween 2013 brought real terror to an Austin, Texas, neighborhood, when a flash flood killed four residents and damaged roughly 1,200 homes. Following torrential rains, Onion Creek swept over its banks and inundated the surrounding community. At its peak, the rampaging water flowed at twice the force of Niagara Falls (source: USA Today).
While studying the flood site shortly afterwards, David Maidment, a professor of civil engineering at the University of Texas, ran into an old acquaintance, Harry Evans, chief of staff for the Austin Fire Department. Recognizing their shared interest in predicting and responding to floods, the two began collaborating on a system to bring flood forecasts and warnings down to the local level. The need was obvious: flooding claims more lives and costs more federal government money than any other category of natural disaster. A system that can predict local floods could help flood-prone communities prepare for and maybe even prevent catastrophic events like the Onion Creek deluge.
Soon, Maidment had pulled together other participants from academia, government, and industry to start the National Flood Interoperability Experiment (NFIE), with a goal of developing the next generation of flood forecasting for the United States. NFIE was designed to connect the National Flood Forecasting System with local emergency response and thereby create real-time flood information services.
The process of crunching data from the four federal agencies that deal with flooding (the US Geological Survey, the National Weather Service, the US Army Corps of Engineers, and the Federal Emergency Management Agency) was a burden for even the best-equipped physical datacenter, but not for the almost limitless scalability of the cloud. Maidment submitted a successful proposal for a Microsoft Azure for Research Award, which provided the necessary storage and compute resources via Microsoft Azure, the company’s cloud-computing platform.
Today, NFIE is using Microsoft Azure to perform the statistical analysis necessary to compare present and past data from flood-prone areas and thus build prediction models. By deploying an Azure-based solution, the NFIE researchers can see what’s happening in real time and can collaborate from anywhere, sharing data from across the country. The system has also proved to be easy to learn: programmers had their computer model, RAPID (Routing Application for Parallel computation of Discharge), up and running after just two days of training on Azure. Moreover, the Azure cloud platform provides almost infinite scalability, which could be crucial as the National Weather Service is in the process of expanding its forecasts from 4,500 to 2.6 million locations. Of course, the greatest benefits of this Azure-based solution accrue to the public, to folks like those living along Onion Creek, whose property and lives might be spared by the timely prediction of floods.
—Dan Fay, Director: Earth, Energy, and Environment, Microsoft Research
The forests that surround Campos do Jordão are among the foggiest places on Earth. With a canopy shrouded in mist much of the time, these are the renowned cloud forests of the Brazilian state of São Paulo. It is here that researchers from the São Paulo Research Foundation, better known by its Portuguese acronym, FAPESP, have partnered with Rafael Olivier, professor of ecology at the University of Campinas, in an ambitious effort to understand the climate and ecology of these spectacular woodlands. Their aptly named Cloud Forest Project has both conservation and practical goals, as it seeks to understand how to protect one of Brazil’s largest forested areas while learning to manage access to water and other natural resources more effectively.
The researchers want to unravel the impact of micro-climate variation in the cloud forest ecosystem. Essentially, they want to understand how the forest works—how carbon dioxide, water, nitrogen, and other nutrients cycle through plants, animals, and microorganisms in this complex ecosystem. To do so, they’ve placed some 700 sensors in 15 forest plots, locating the devices at levels throughout the forest, from beneath the soil to the top of the canopy.
The integration of such a vast number of sensor data streams poses difficult challenges. Before the researchers can analyze the data, they have to determine the reliability of the devices, so that they can eliminate data from malfunctioning ones. They also need to translate scientific questions into analysis of the time-series data streams—a process much more sophisticated than the traditional “open all the data in Excel spreadsheets” approach.
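To make the reliability step concrete, here is a minimal sketch of how malfunctioning sensors might be screened out before analysis. It is not the project's actual pipeline: the function names, the three checks (too many missing samples, physically implausible values, a flatlined signal), and every threshold are assumptions chosen for illustration.

```python
from math import isnan
from statistics import pstdev


def is_reliable(readings, low, high, max_missing_frac=0.2, min_spread=1e-6):
    """Return True if one sensor's time series passes three basic sanity checks.

    All thresholds are hypothetical defaults, not values from the project.
    """
    if not readings:
        return False
    valid = [r for r in readings if r is not None and not isnan(r)]
    # Check 1: reject sensors that dropped too many samples.
    if len(valid) < (1 - max_missing_frac) * len(readings):
        return False
    # Check 2: reject sensors reporting values outside the plausible range.
    if any(r < low or r > high for r in valid):
        return False
    # Check 3: reject sensors whose signal never changes (a stuck device).
    if pstdev(valid) < min_spread:
        return False
    return True


def filter_reliable(streams, low, high):
    """Keep only the sensor streams that pass is_reliable."""
    return {sid: vals for sid, vals in streams.items()
            if is_reliable(vals, low, high)}


# Hypothetical air-temperature streams in degrees Celsius.
streams = {
    "ok": [18.2, 18.5, 19.1, 18.9],
    "stuck": [20.0, 20.0, 20.0, 20.0],
    "spiky": [18.0, 250.0, 18.3, 18.1],
    "gappy": [18.0, None, None, 18.4],
}
kept = filter_reliable(streams, low=-10.0, high=50.0)
```

In this example only the stream labeled "ok" survives. A real deployment would likely add checks against neighboring sensors and calibration records, which this sketch leaves out.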
Consequently, the project scientists have collaborated with Microsoft Research to manage the data with help from the Microsoft Azure for Research project. Think of it as cloud to cloud: cloud forest data being managed and analyzed through the power of cloud computing. Essentially, it’s a parallel process with some researchers developing the sensors, power supplies, and data flow in the cloud forest; others working with computers to set up receptacles for those massive incoming data flows; and everyone striving to reach a level of confidence that new insights can be discovered and explored through the data.
Reliance on cyber infrastructure built on the Microsoft Azure cloud platform frees the researchers from purchasing and maintaining physical computers, saving time and money and eliminating the aggravation of learning how to be a computer system administrator. Moreover, the cloud-based system gives researchers the power to combine interrelated data to create “virtual sensors” that quantify things that cannot be measured readily by one type of sensor. For example, measuring fog is difficult and expensive with just one sensor, but the presence of fog can be inferred by combining data from temperature, sunlight, and humidity sensors.
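The fog example can be sketched in a few lines. The version below is an illustration under stated assumptions, not the researchers' method: it estimates the dew point from temperature and relative humidity using the Magnus approximation, then infers fog when the air is near saturation and measured sunlight is well below what a clear sky would deliver. The function names, the `light_fraction` input, and all three thresholds are hypothetical.

```python
from math import log

# Magnus approximation constants for dew point over water (degrees Celsius).
MAGNUS_A = 17.62
MAGNUS_B = 243.12


def dew_point_c(temp_c, rh_percent):
    """Estimate dew point (deg C) from air temperature and relative humidity."""
    gamma = log(rh_percent / 100.0) + MAGNUS_A * temp_c / (MAGNUS_B + temp_c)
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)


def fog_likely(temp_c, rh_percent, light_fraction,
               max_spread_c=1.0, min_rh=95.0, max_light=0.5):
    """Virtual fog sensor: combine three physical readings into one inference.

    light_fraction is measured sunlight divided by the expected clear-sky
    value for that time of day (an assumed, precomputed input).
    The three thresholds are illustrative guesses, not calibrated values.
    """
    spread = temp_c - dew_point_c(temp_c, rh_percent)
    return (spread <= max_spread_c
            and rh_percent >= min_rh
            and light_fraction <= max_light)


misty_morning = fog_likely(temp_c=12.0, rh_percent=99.0, light_fraction=0.2)
clear_afternoon = fog_likely(temp_c=25.0, rh_percent=50.0, light_fraction=0.9)
```

Here `misty_morning` evaluates to True and `clear_afternoon` to False. Requiring all three conditions at once is a deliberately conservative design choice; a calibrated virtual sensor would tune the thresholds against direct fog observations.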
Similar cloud-computing advantages are available in almost any research project that involves the collection, management, and analysis of big data. If that describes your research, you’ll want to check out the Microsoft Azure for Research project, especially its award program, which offers substantial grants of Microsoft Azure compute resources to qualified projects. Your research might not involve a cloud forest, but if it entails a forest of data, the Microsoft Azure cloud could be your ticket to a more productive and less costly project.
—Rob Fatland, Senior Research Program Manager, Microsoft Research