The modern web is a vast, decentralized topology of web sites and services connected by an almost infinite number of links.   Fortunately, as the web has grown more complex, tools for understanding and leveraging it have kept pace.   The underlying technology for many of these tools is the web crawler.  Web crawlers work non-stop to crawl and index the web, giving us the freedom to access it while knowing very little about its overall topology and structure.    Web crawlers are also very good at discovering sites and information that we could not possibly find except by random chance.

 

By contrast, as the modern enterprise has trended towards becoming more web-like, the tools for understanding and leveraging the enterprise data topology have been almost nonexistent.   But before diving into why we believe that is a problem and what we intend to do about it, let’s examine the reasons why the modern enterprise has become a data jungle.

 

Fifteen to twenty years ago, as relational databases started to become the corporate standard for storing data, it was quite common for a company to have one central IT department that controlled most corporate data and knew all the “data flows” in the enterprise.   This was due in part to the fact that the expertise required to manage most data systems and tools was far above the skills a typical information worker possessed.   However, over the last decade, a number of interesting trends have turned this relationship upside down:

 

  • Acquiring and storing data is headed toward zero cost.
  • Exposing, moving and transforming data has become very easy and cost efficient.
  • Self-service everything – Personal databases and web portals, self-service business intelligence.  Often built and owned by non-developers.
  • Excel hell – The great proliferation of Excel (and Access, SharePoint) as the enterprise data management tool.

 

Now this isn't entirely a bad thing, not even close.   In fact, most organizations would say these are huge steps forward that have led to significant productivity gains.   However, it has also made even the simplest DBA and ETL developer tasks increasingly complex and error prone.   On top of that, it is almost impossible for information workers to know anything about enterprise data outside of their specific data silos.  For example, let’s say a spreadsheet (posted on a SharePoint site) pulls data from a database that was fed by some SSIS packages and an external web service.   If the information worker were savvy enough (and knew SQL), they could perhaps figure out that the data came from the database, but they could not determine its real origin without considerable help from their IT folks.
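
To make the spreadsheet example concrete, the data flows can be pictured as a small directed graph in which each asset points at the sources it reads from. The following Python sketch is only an illustration: the asset names and the `UPSTREAM` mapping are hypothetical, and nothing here reflects how the product stores or walks this information.

```python
from collections import deque

# Hypothetical data-flow edges: each asset maps to the sources it reads from.
UPSTREAM = {
    "sharepoint_spreadsheet": ["database"],
    "database": ["ssis_packages", "external_web_service"],
    "ssis_packages": [],
    "external_web_service": [],
}


def trace_origins(asset, upstream):
    """Return the farthest known sources (nodes with no upstream) feeding `asset`."""
    origins = set()
    seen = {asset}
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        # A node with no recorded upstream is an origin, unless it is the start itself.
        if not parents and node != asset:
            origins.add(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return origins


origins = trace_origins("sharepoint_spreadsheet", UPSTREAM)
```

The information worker in the example can see only the first hop (the database). Answering "where did this really come from?" requires the whole graph, which is exactly what no single person has today.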

 

But do the IT folks really have time to help the information workers?   Take, for example, the DBA who wants to do a standard DBA task like changing the data type of a column  (How many of you have had to go from smallint to int in your lifetime?).   Within the server there are excellent tools for figuring out what will break, but how about outside the database server?   Will the SSIS packages pulling data from the server need to be fixed?  How about the reports and cubes?    And does the DBA really know about everything that will break?  How about the Excel spreadsheet that pulls data from the server once a week?

 

When we ask DBAs how they handle these sorts of tasks, the answer we get most often is that they 1) do the best they can to investigate the impact of the change, 2) make the necessary changes to impacted systems, and 3) make the change and see what breaks.    Now for the English majors doing their CS101 final project, this is probably a fine way to make changes to their code – but in the enterprise this seems far from an acceptable solution.

 

Even if it isn’t acceptable, it is the reality for a lot of DBAs these days.    What makes this particularly nasty is that the strategy doesn’t really work, since there may be a lot of silent breaks.   Let’s say the DBA wants to retire a database server.   They do their homework and figure out all the affected systems and processes they know about.   Then they make the change, and a few SSIS packages they didn’t know about break.  Now the downstream server fed by those packages starts to have stale data, and the aforementioned information worker just knows something is wrong with the spreadsheet they use, which accesses data from the downstream server – “My Excel is broken!”.  Or even worse, the information worker doesn’t notice and makes a bunch of business decisions based on stale data!
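
The retired-server scenario is essentially an impact analysis over a dependency graph: everything reachable downstream of the server is affected, and the "silent breaks" are the affected systems the DBA did not know about. Here is a minimal Python sketch of that idea. All names (`DOWNSTREAM`, `impacted_by`, the asset labels) are hypothetical and chosen only to mirror the story above.

```python
# Hypothetical edges: each asset maps to the assets that consume its data.
DOWNSTREAM = {
    "retired_server": ["known_report", "unknown_ssis_package"],
    "unknown_ssis_package": ["downstream_server"],
    "downstream_server": ["excel_spreadsheet"],
}


def impacted_by(source, downstream):
    """Return every asset that directly or transitively consumes `source`."""
    impacted = set()
    stack = [source]
    while stack:
        node = stack.pop()
        for consumer in downstream.get(node, []):
            if consumer not in impacted:
                impacted.add(consumer)
                stack.append(consumer)
    return impacted


# What the DBA found by doing their homework (an assumption for this example).
known_to_dba = {"known_report"}

all_impacted = impacted_by("retired_server", DOWNSTREAM)
# Anything impacted that the DBA did not know about will break silently.
silent_breaks = all_impacted - known_to_dba
```

The traversal itself is trivial. The hard part in the enterprise is having the edges in the first place.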

 

So what exactly do we think we are building and how will this help?

 

One of the jokes on the team is that we are building the Marauder’s Map for the enterprise.  For those not familiar with Harry Potter, the Marauder’s Map was a magical map that showed the location of everyone in his school in real time.   Although the analogy breaks down a bit, at a high level we are trying to build a magical map of the enterprise data topology.   More importantly, we want to gather this information with minimal up-front setup or administration tasks for the user.    The vision is that you install the software, point our crawlers at your systems, and out comes a map of the enterprise’s metadata and the data flows between systems.

 

At a high level, this translates into three different buckets of components:

 

Crawlers – We will provide a number of crawlers for MSFT products (SQL Server, Excel, SharePoint, Reporting Services, Analysis Services, SSIS, etc.) that will extract metadata and enterprise dataflow information from the target sources to be indexed in the Barcelona Index Server.  For sources that can’t be crawled (for example, the dataflow is in an executable program, the crawler doesn’t have permissions, or a crawler doesn’t exist for the target domain), we will provide a declarative way of describing the metadata and dataflow information.  Finally, the crawler infrastructure will support auto-discovery of new target sources, further reducing the need for up-front costs and/or costly modeling.
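
To give a feel for what declaratively describing an uncrawlable source might involve, here is a purely hypothetical Python sketch. The actual declarative format is not specified in this post; the `DeclaredSource` and `DataFlow` types, their fields, and the example asset names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFlow:
    """One hand-declared movement of data (hypothetical shape)."""
    source: str
    target: str


@dataclass
class DeclaredSource:
    """Metadata and data flows described by hand for a source no crawler can read."""
    name: str
    kind: str
    columns: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)


# An executable whose internal data flow a crawler cannot inspect.
nightly_export = DeclaredSource(
    name="nightly_export.exe",
    kind="executable",
    columns={"order_id": "int", "amount": "decimal"},
    flows=[DataFlow(source="orders_db.dbo.Orders",
                    target="warehouse.dbo.FactOrders")],
)


def flow_edges(declared):
    """Flatten a declaration into (source, target) edges an index could ingest."""
    return [(flow.source, flow.target) for flow in declared.flows]


edges = flow_edges(nightly_export)
```

The point of the sketch is only that a hand-written declaration can yield the same kind of edges a crawler would harvest, so crawled and declared sources can sit side by side in one map.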

 

Barcelona Index Server – The Index will be the cache for all the harvested metadata and enterprise dataflow information.   The server will also expose an API for querying, augmenting, and annotating the metadata and dataflow information.
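
As a rough mental model of the three verbs above (query, augment, annotate), here is a small in-memory Python sketch. It is not the Index Server API, which is not defined in this post; the `MetadataIndex` class and every method name on it are assumptions made up for this example.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory stand-in for an index of harvested metadata (hypothetical)."""

    def __init__(self):
        self._assets = {}                      # asset name -> metadata dict
        self._flows = set()                    # (source, target) pairs
        self._annotations = defaultdict(list)  # asset name -> free-form notes

    def add_asset(self, name, **metadata):
        """Store harvested metadata, or augment what is already there."""
        self._assets.setdefault(name, {}).update(metadata)

    def add_flow(self, source, target):
        """Record that data moves from `source` to `target`."""
        self._flows.add((source, target))

    def annotate(self, name, note):
        """Attach a human-written note to an asset."""
        self._annotations[name].append(note)

    def consumers_of(self, name):
        """Query: which assets read directly from `name`?"""
        return sorted(t for s, t in self._flows if s == name)

    def describe(self, name):
        """Query: return metadata and annotations for one asset."""
        return {
            "metadata": dict(self._assets.get(name, {})),
            "annotations": list(self._annotations.get(name, [])),
        }


index = MetadataIndex()
index.add_asset("sales_db.dbo.Orders", kind="table")
index.add_asset("sales_db.dbo.Orders", owner="finance")  # augmentation
index.add_flow("sales_db.dbo.Orders", "weekly_report.xlsx")
index.annotate("sales_db.dbo.Orders", "Column types under review")
description = index.describe("sales_db.dbo.Orders")
```

A real server would of course persist this data and expose it over a service boundary; the sketch only shows how harvested metadata, later augmentation, and annotations can coexist for the same asset.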

 

Tools – For the initial version, we will be building a couple of tools.   First, there will be an admin experience for managing the crawlers and the Index Server.   Second, we are developing a DBA experience designed to help the DBA with tasks that require enterprise-wide knowledge of the data topology (for example, renaming a column, retiring a server, or figuring out where data comes from).

 

The below diagram illustrates the overall architecture.

 

 

Note – although we are designing the first iteration of the product to be a DBA/ETL developer solution, we believe that the long-term value will grow significantly beyond this.   Hence, from the start, the base platform for the product will be completely open.  For example, developers can plug in their own crawlers or metadata providers.  They can also access the harvested metadata and dataflow information via the query API.   Finally, we will support metadata augmentation and rich annotation (both from crawlers and via the server API), which will allow producers and consumers of the system to leverage the crawlers and Index Server in ways we haven’t even thought about.

 

To be continued …

 

One of our goals for Project Barcelona is customer-driven innovation.  In other words, we feel we have a good idea of the core product requirements, but we really want to work with the community on the design and feature prioritization.    At the end of the day, we want to build a set of tools that makes managing the modern enterprise data topology significantly easier – so we strongly believe we will need significant feedback before landing on the right design and feature set.  Hence, to accelerate the feedback loop, in addition to shipping a number of CTP releases, we plan on being very transparent about our design plans via this blog.   In the next few days I will be posting a more detailed proposal for that process, but I will say we plan on providing almost real-time updates of our internal design discussions, thereby allowing the community to give input on the design as we develop the features.

 

Andrew Conrad

Project Barcelona