Importing the Data with Power Query
Getting the right format with Power Query
Crawl, then walk
This year’s World Cup was filled with unbelievable surprises: Miroslav Klose’s record-breaking goal total, Germany’s thrashing of Brazil, Spain’s unfortunate crash-out, the United States’ resilience through the Group of Death, and moments of Messi magic. For the stat geek in all of us, Power BI Q&A made it easy to put the games in context and helped you make better predictions about the outcomes. If you didn’t have a chance to ask Q&A about the World Cup during the tournament, you can still give it a try here.
We heard from many of you how amazing it was to see the breadth of questions Q&A could answer about the World Cup. If you’d like a glimpse into how we can teach Q&A a whole new sport in a matter of a day, read on!
In this post we will discuss how we used the breadth of tools available in Excel and Power BI to bring this experience to you.
We partnered with Opta Sports, a leading provider of historical sports statistics, to give Q&A all of the facts you’d expect a sports statistician to know. The historical data was delivered as a series of XML files, with the statistics for each game organized in folders by year. Each file contained information on the teams, coaches, referee, players, goals, cards, substitutes, and game statistics. This isn’t the easiest format to import, but luckily Power Query has robust support for handling different data formats and folder structures.
We used the From Folder option in Power Query to import all the data at once. Each row was a path to an individual XML file. Power Query can also read the contents of an XML file: under the Content column, clicking “Binary” lets you expand out all the data within that XML file. After deleting and renaming a few columns, you end up with data for each game (see image below). As you can see, the value “Table” appears in a lot of cells. That’s because Opta’s XML feed was highly nested, which made the data easy to organize and navigate.
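The From Folder import boils down to a short Power Query (M) script. The following is a minimal sketch of the pattern, with a hypothetical folder path standing in for wherever the Opta files actually lived:

```
let
    // Hypothetical path to the per-year match-data folders
    Source = Folder.Files("C:\WorldCup\MatchData"),
    // Keep only the XML files, then parse each file's binary Content
    XmlOnly = Table.SelectRows(Source, each [Extension] = ".xml"),
    Parsed = Table.AddColumn(XmlOnly, "GameData", each Xml.Tables([Content])),
    // Drop the file-system columns we no longer need
    Result = Table.SelectColumns(Parsed, {"Name", "GameData"})
in
    Result
```

Because Folder.Files recurses into subfolders, the year-by-year folder structure is flattened into one table automatically; each nested table in the GameData column can then be expanded from the query editor.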
We had to use a different approach for the 2014 data since it was updated in real time from Opta as the games progressed. Using the From Web option in Power Query, it was easy to import the latest data from each game and refresh it whenever we wanted. If you’d like to learn more about how to connect to almost any data source with Power Query, check out Getting Started with Power Query Part I.
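Under the covers, the From Web option follows the same shape as the folder import: fetch the bytes, then parse them as XML. A rough sketch, with a placeholder URL in place of the real Opta feed endpoint:

```
let
    // Placeholder feed URL; the actual live Opta endpoint differs
    Response = Web.Contents("http://example.com/feeds/latest-match.xml"),
    Source = Xml.Tables(Response)
in
    Source
```

Because the query records the source URL rather than a snapshot of the data, a simple Refresh re-runs it and picks up the latest in-game statistics.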
Once we pulled in the data, we had to create a query for each table we wanted in our data model: games, goals, cards, referees, managers, stadiums, etc. The motivation behind creating a query for each table is that when we load each table to the data model, we can easily create the relational diagram required to make Q&A shine. Most queries involved expanding out nested tables and filtering columns to produce one query per table. One piece of Power Query functionality we’d like to point out is the Advanced Editor, which gives you full control over the transformation of your data. A full tour of the Advanced Editor is beyond the scope of this post, but we wanted to highlight one scenario: pivoting tables. In the Power Query editor, under the “View” tab, there is a button that launches the Advanced Editor (shown below):
Chris Webb wrote a great blog post on pivoting tables.
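To give a flavor of what a pivot looks like in the Advanced Editor, here is a small sketch using Table.Pivot on made-up per-game statistics in attribute/value form (the game names and stat columns are illustrative, not the actual Opta schema):

```
let
    // Illustrative attribute/value data, not the real Opta feed
    Source = Table.FromRecords({
        [Game = "GER v BRA", Stat = "Goals", Value = 7],
        [Game = "GER v BRA", Stat = "Shots", Value = 18],
        [Game = "ARG v NED", Stat = "Goals", Value = 0],
        [Game = "ARG v NED", Stat = "Shots", Value = 10]
    }),
    // Turn each distinct Stat value into its own column
    Pivoted = Table.Pivot(Source, List.Distinct(Source[Stat]), "Stat", "Value")
in
    Pivoted
```

The result has one row per game with Goals and Shots as separate columns, which is the shape the data model (and Q&A) wants.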
After each query was created for the particular tables we wanted, we loaded them to the data model for Power Pivot to consume.
When you load to the data model from Power Query, your tables are automatically added to Power Pivot. All we had to do in Power Pivot was match unique IDs to one another to create relationships, so we knew which players scored goals or received cards, and which referees, teams, players, and coaches were in a particular game. With the relationships in place, we could take the model to Power BI Q&A.
With our relational model created, we optimized the data by going through the steps outlined in the Demystifying Power BI Q&A series (Parts 1 | 2).
The model was now ready to be used with Power BI Q&A. However, we optimized it further using the new cloud modeling environment so we could ask a wider variety of questions. The cloud modeling environment is also an easy way to manage synonyms for your columns and tables, and since all the changes are saved to the cloud, everyone reaps the benefits.
Documentation on how to get started with the cloud modeling environment can be found here. As an example of how to make a wider variety of questions work, we created the following phrasing: “aggressive team has large average cards per game”. This phrasing allowed us to ask questions like:
You can click the links above to see how one good phrasing lets Q&A give really great answers!
As you can see, there’s a lot more to teaching Q&A than just understanding natural language. Finding the right data, choosing the right tables and relationships, and data modeling all contribute to a great experience in Q&A. Luckily, the full suite of Power BI tools like Power Query and Power Pivot make it easy to achieve powerful results in a few minutes.