This is Syed Aslam Basha here from Information security and risk management team.

During my initial stages at Microsoft, I had an opportunity to work on a data warehousing project. This write up is follow-up with the hands on experience I had with the project for over a year.

Let me give a brief history of the project, v3.0 was a simple on-line transaction processing (OLTP) solution developed with SQL 2005. The drawbacks with the system was

  • Huge wait time for the queries to execute and complete
  • Difficulty in maintenance
  • Difficulty in extending the functionalities
  • Limitation of amount of data storage

We went from OLTP in v3.0 to on-line analytical processing OLAP v4.0 aiming to overcome the drawbacks and also improve on the amount of data storage with SQL 2008 and use features like table partitioning, compression and others;

  • Less time to execute even complex queries
  • Performance improvements in terms of processing, data loading and querying data
  • Progress information
  • Effective usage of the available resources
  • Beautiful executive reports can be generated and shared with customers
  • Ease of maintenance and extendibility
  • Better analysis of data
  • Huge amount of storage of data 60 TB

The following are the definition and differences of OLTP and OLTA:

On-line transaction processing (OLTP): It’s normally used in day to day transaction applications, forms original source of data and involves on-line transactions insert, update and delete. The systems are fast in query processing and maintain data integrity.

On-line analytical processing (OLAP): Used in data mining solutions, source of data can come from OLTP DBs or log files, helps in analyzing data and make appropriate planning and decisions, multi-dimensional view of the business activities.

Let’s understand the basics of data warehousing terminologies involved;

Data warehousing is suitable for solutions which require analysis of huge sets of data. It will have star/snowflake schema, dimension tables, fact tables, rules and ETL tools.

Star schema: In star schema one fact table associated with one or more dimension tables you can visualize it as a star fact table being in the center and dimensions surrounding it. Star schema has fact table in normalized format and dimensions in de normalized format.

Snowflake schema: In snowflake schema one fact table is associated with one or more dimension tables and these dimension tables are associated with one or more dimension tables. In this both fact and dimensions are in normalized format.

We used star schema in our data warehouse solution.

Dimension tables: stores information like identity based on which we want to query the data warehouse fact tables example sales by country; here country becomes dimension all countries will be stored in the countries dimension table.

Fact tables: stores information like the facts which are essentially stores all transactions example sales or purchase transactions.

Rules: Before storing data we may want to apply certain business rule. We have Transformation rules, filter rules and lookup rules.

Transformation rule: Before storing data in to data warehouse we want to transform the data available example aggregate data before storing or remove version number.

Filter rule: Certain data we don’t want to store in data warehouse example invalid links or page not found links details which are present in source we don’t want in data warehouse use the filter rule.

Lookup rule: Update certain dimensions while loading data from source example updating dimension countries

The dimensions and rules needs to be defined by the business.

At high level any data warehouse will have the following components:

  • Source of data – source is upstream, it can be database or log files
  • Dimension tables data – data comes from feed store or from source
  • Fact tables data – data comes from source
  • Processing of data – extract data, apply rules, transform and load in data warehouse
  • Notification mails
  • Event log
  • Views to query data
  • Different user profiles DB to run adhoc queries using SQL 2008
  • Reports – reports forms the downstream for the data warehouse

There can be multiple levels or layers to achieve the same say data from source is copied using batch job tool and for processing and loading data we can use extract transform and load (ETL) tools example SSIS.

 

- Syed Aslam Basha (syedab@microsoft.com)

Microsoft Information Security Tools (IST) Team

Test Lead

---------------------------------------------------------

Please leave a comment if the blog post has helped you.