In keeping with our mission to collaborate with top academic and scientific researchers to foster innovations in scientific inquiry, Microsoft Research Connections was proud to sponsor the 2013 KDD Cup, arguably the world’s best-known competition in data mining. The winning teams were announced at KDD 2013, the 19th annual conference of ACM SIGKDD (the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining), which took place in Chicago in August. KDD is the premier event for researchers grappling with today’s data deluge, as it’s the only conference spanning big data, data mining, data science, and analytics, along with all the related algorithms, foundations, applications, and practices.
2013 KDD Cup challenge winners, Team Algorithm, from National Taiwan University
The 2013 KDD Cup challenge focused on the ability to search literature and to collect metrics around publications. This capability is essential to modern research, as academic and industry researchers increasingly rely on search to discover what has been published and by whom. The competition used a dataset of 250,000 authors and 2.5 million published papers, split into a labeled training set, a validation set for the leaderboard, and a test set. The competitors faced two tasks: first, a prediction task to determine whether an author had written a paper, and second, a name disambiguation task to identify duplicate author names in a dataset with name variants.
These tasks go to the heart of one of the main challenges of information extraction and curation in any people-centric dataset: resolving people-name ambiguity. In the scholarly publishing world, many authors publish under several variations of their own name, and to add to the complexity of discovery, different authors might share a similar or even the same name. As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. The KDD Cup task challenged participants to determine which papers in an author profile were truly written by a given author. Read the full parameters of the challenge.
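To make the name-ambiguity problem concrete, here is a minimal Python sketch that groups name variants under a crude key of last name plus first initial. It is an illustration only, not the method used by any KDD Cup team; the helper names `name_key` and `candidate_duplicates` and the sample author names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def name_key(name: str) -> tuple[str, str]:
    """Reduce a name to (last name, first initial), lowercased and accent-free."""
    # Strip accents so that "García" and "Garcia" compare equal.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Rewrite "Last, First" as "First Last".
    if "," in ascii_name:
        last, _, first = ascii_name.partition(",")
        ascii_name = f"{first} {last}"
    # Treat periods as separators so "M." becomes the token "m".
    tokens = ascii_name.replace(".", " ").lower().split()
    if not tokens:
        return ("", "")
    return (tokens[-1], tokens[0][0])


def candidate_duplicates(names: list[str]) -> list[list[str]]:
    """Return groups of names that share a key and so may be the same person."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample names, not taken from the competition data.
names = ["Maria Garcia", "M. Garcia", "García, María", "Mark Garcia", "Wei Chen"]
for group in candidate_duplicates(names):
    print(group)
```

The sketch shows both halves of the problem. The three spellings of one hypothetical author land in the same group, but so does "Mark Garcia", a different person with the same last name and initial. A key this coarse can only propose candidates; deciding which papers truly belong to which author takes further evidence such as co-authors, venues, or affiliations.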
The competition was fierce, with more than 800 teams from more than 40 different countries developing approximately 12,000 data-mining models over the course of a few months. The winning solution, created by Professor Chih-Jen Lin and Team Algorithm from National Taiwan University, was the product of outstanding teamwork: eighteen students and three teaching assistants actually designed a graduate course around the competition. Other winners included teams from the University of Illinois at Urbana-Champaign, Moscow State University, and FICO. Winners presented their solutions at a KDD Cup workshop and poster session at the conference. Moreover, solutions created for the competition resulted in 10 research papers that are available through the KDD Cup 2013 Workshop proceedings.
KDD Cup poster session participants at KDD 2013
On behalf of Microsoft Research Connections, I would like to thank the key collaborators who helped make this competition a success. The Microsoft Research Connections proposal for the KDD Cup challenge was selected after careful deliberation by 2013 KDD Cup chairpersons Claudia Perlich and Brian Dalessandro of Media6°. Partnering with me in designing the contest rules and evaluation criteria were Professors Martine DeCock of Ghent University and Senjuti Basu Roy of the University of Washington Tacoma, along with Ben Hamner and Will Cukierski of Kaggle. Swapna Savvana and Yitao Li from the University of Washington Tacoma helped with the logistics of the contest execution.
So congrats to the KDD Cup winners, and kudos to everyone who accepted the challenge. The many outstanding solutions showed great creativity, which is exactly what we’ll need as we move forward in this new world of data-intensive scientific inquiry.
—Vani Mandava, Senior Program Manager, Microsoft Research Connections
I recently sponsored an event in Manizales, Colombia, training biologists on .NET Bio and BioHPC, two projects that make computational research easier in the life sciences. As part of the training, Jarek Pillardy—the head of the Cornell Bioinformatics Facility (CBSU) at Cornell University—and some of his staff presented various aspects of BioHPC. I had the opportunity to sit down with Jarek, who is not only the developer of BioHPC but also a long-time user of the .NET Bio project. Here is a recap of that conversation.
Simon: You lead the CBSU—what activities does it support?
Jarek: CBSU is the Cornell University Bioinformatics Facility, and its mission is to support biological research with advanced computational infrastructure and bioinformatics tools and techniques. The facility’s main activities can be divided into maintaining extensive computational infrastructure configured for bioinformatics; providing easy access to the infrastructure through the web via BioHPC Web or interactively through BioHPC Lab; training, mainly through workshops and consulting; direct research collaborations, ranging from small projects to participating in major grants as co-principal investigators; and software and LIMS development.
Simon: What prompted you to develop BioHPC, and what does it do?
Jarek: BioHPC is our main way to deliver computational infrastructure to biologists. It is not easy for an experimental biologist to use computing tools directly and navigate the complicated maze of schedulers, command-line tools, data-storage methods, and other infrastructure. BioHPC simplifies access, both through the web and interactively, and management of the infrastructure (hardware and software). We created BioHPC to make our life easier and to provide services for many more researchers. BioHPC Web gives users a simple way to submit data for processing and for managing jobs and data. BioHPC Lab is a tool to organize access to interactive machines, reserve time, and manage associated resources, like storage and computing time. For us, it provides a convenient platform to deliver computational resources (hardware and software combined) and a set of tools to manage them.
Simon: Do you have any plans to extend the capabilities of BioHPC in the future?
Jarek: BioHPC is constantly evolving to meet the changing needs in bioinformatics and adapt to technological changes. Currently, we are supporting a diverse array of local and remote clusters, but we are planning to add capacity to run computations in the cloud. We are in the final stages of adding Windows Azure to our supported computing infrastructure. We will also be adding new software.
Architectural overview—BioHPC schema
Simon: How do you see the Windows Azure cloud being used in bioinformatics?
Jarek: For direct research computing, I can see two main scenarios. First, there will be advanced users, running their own virtual machines. These probably will be a minority of users. Second, there will be researchers who access Azure resources via an intermediate service like BioHPC. This scenario will involve a lot of task-focused services (for example, analyzing population data, assembling and annotating sequences, or handling a particular software pipeline) running on Azure, with the end-user not even fully aware of that. Azure provides an opportunity to bring data closer to the computing infrastructure.
Simon: How has BioHPC been able to help the Colombian BIOS Center?
Jarek: I think BioHPC may deliver for them the same benefits it does for us: an easy-to-use tool that provides convenient access to infrastructure and simplified management. They are still in the process of setting up and organizing, and we are in close contact with them, providing consultation and help. BIOS’s mission for Colombian researchers is very similar to what our facility provides to Cornell, so our tools should be very useful to them. I hope they will be able to improve and expand BioHPC to meet their particular needs, which will make it much better.
As Jarek notes, BioHPC is a living, constantly evolving project, as is .NET Bio. If you’re a biological researcher, I encourage you to take a good look at these tools.
—Simon Mercer, Director of Health and Wellbeing, Microsoft Research Connections
Today is a proud day for Microsoft Research Connections and our academic collaborators, as two of our research efforts have been named 2013 Laureates of IDG's Computerworld Honors Program. The Computerworld Honors Program, founded in 1988, recognizes organizations and individuals who have used information technology to promote positive social, economic, and educational change. The program judges reviewed more than 700 nominations this year to select 269 Laureates from 29 countries. Microsoft Research is being honored for our collaborative work on combatting scourges that affect millions around the world: pneumonia and HIV infection.
Working in collaboration with the University of Oxford, our research strives to make the pneumonia vaccine more effective.
Pneumonia persists as a leading cause of death in children worldwide, despite the availability of a vaccine. To be properly vaccinated against the disease, children must receive a series of three shots over a period of several months. The research for which we’re being honored, “Adjusting Pneumonia Vaccination Periods to Save Lives,” strives to make vaccination more effective by changing the timing of the shots. In collaboration with the University of Oxford, we have developed software that can be used to create and deploy clinical trial support infrastructure in a fraction of the time and at a fraction of the cost of conventional methods. The system can collect well-defined and standardized data from multiple sources; however, the greater benefit is its ability to combine data simply and efficiently, enabling large-scale data analysis. Such analysis is now being used by the Oxford Vaccine Group to evaluate the effectiveness of revised schedules of immunization.
Our HIV program involves support of the efforts of the Ragon Institute at Massachusetts General Hospital to create an effective HIV immunization agent.
HIV infection remains another prolific killer, taking the lives of approximately 5,000 people a day—despite the emergence of antiviral therapies that can control, but not cure, the disease. Until a cure is found, the best hope in controlling HIV infection lies in creating an effective vaccine. This is why our second honored program, “Uncovering New Ways the Human Immune System Fights HIV,” involves support for the efforts of the Ragon Institute at Massachusetts General Hospital to create an effective HIV immunization agent. In collaboration with South African healthcare workers, the Ragon Institute has recruited large numbers of South African HIV-positive patients, whose blood samples enable studies of the body’s defense mechanisms in the laboratory. Joining the Ragon Institute in this effort are the Centre for the AIDS Programme of Research in South Africa (CAPRISA) and the KwaZulu-Natal Research Institute for Tuberculosis and HIV (K-RITH). Microsoft Research is working with the Ragon Institute to quantify how the immune system attacks various fragments of HIV—data that we hope will, one day, lead to a vaccine.
These two global research programs capture the very essence of our mission at Microsoft Research Connections, which is to collaborate with the world’s top academic researchers and institutions to develop technologies that fuel data-intensive scientific research. They also help Microsoft Research improve “Big Data” algorithms to further advance Microsoft products. We look forward to the presentation at The Computerworld Honors Laureate Ceremony and Awards Gala at the Andrew W. Mellon Auditorium in Washington, D.C., on June 3, 2013.
—Tony Hey, Vice President of Microsoft Research Connections