Saturday, March 24, 2012

Dimensional modeling is an important part of creating business intelligence out of raw data. To understand dimensional modeling, we first need to understand the difference between OLAP and OLTP. OLAP is online analytical processing, and is used for analysis. OLAP needs more historical information so that you can understand trends and prior data. OLTP is online transaction processing, where you record information and insert it into databases. OLTP involves a lot of data manipulation in terms of updating, inserting, and deleting rows. OLAP, on the other hand, only needs read access to the data, since there is no manipulation in OLAP.

Below is an image to describe the difference visually:

Source: http://datawarehouse4u.info/OLTP-vs-OLAP.html

Another key difference that this article mentions is that OLTP has a more normalized database structure, whereas OLAP will have less normalized relations. This brings us to dimensional modeling with various types of schema.

A star schema requires fact tables and dimension tables. A dimension categorizes data so that users can interact with the data warehouse in different ways. Dimensions are typically small, with not too many rows. A fact table holds the measures, or numeric facts, that we want to analyze, along with multiple foreign keys that connect to the various dimensions. Fact tables continue growing and can become very large, which makes processing time longer.
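To make this concrete, here is a minimal sketch of a star schema built with SQLite from Python. The sales data mart, table names, and columns are all hypothetical; the point is just the shape: small dimension tables with descriptive attributes, and a fact table holding numeric measures plus a foreign key per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: small, descriptive, one row per member.
cur.execute("""CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT, year INTEGER, month INTEGER)""")
cur.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT, category TEXT)""")

# Fact table: numeric measures plus a foreign key to each dimension.
cur.execute("""CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER, sales_amount REAL)""")

cur.execute("INSERT INTO dim_date VALUES (20120324, '2012-03-24', 2012, 3)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
cur.execute("INSERT INTO fact_sales VALUES (20120324, 1, 5, 49.95)")

# A typical star-join query: aggregate a measure by a dimension attribute.
row = cur.execute("""SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category""").fetchone()
print(row)  # ('Hardware', 49.95)
```

Notice that every query against the facts needs only one join per dimension, which is what keeps star-schema queries simple.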

Source: http://oracle-online-help.blogspot.com/2006/11/star-vs-snowflake-schema.html

A snowflake schema is a more complex version of a star schema in which the dimensions are more normalized. Fact tables connect to dimensions just as in a star schema, but dimension tables can also connect to further dimension tables, forming chains of relationships.
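Continuing the same hypothetical sales example, a snowflaked product dimension might move the category attribute into its own table, which is then reached through the product dimension. The extra join in the query below is the practical cost of the added normalization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The category attribute now lives in its own sub-dimension table.
cur.execute("""CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT)""")
cur.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key))""")
cur.execute("""CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL)""")

cur.execute("INSERT INTO dim_category VALUES (10, 'Hardware')")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 10)")
cur.execute("INSERT INTO fact_sales VALUES (1, 49.95)")

# The query now needs an extra hop: fact -> dimension -> sub-dimension.
row = cur.execute("""SELECT c.category_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_category c ON p.category_key = c.category_key
    GROUP BY c.category_name""").fetchone()
print(row)  # ('Hardware', 49.95)
```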

Source: http://datawarehouse4u.info/OLTP-vs-OLAP.html

Data warehouses can use a hybrid approach of star and snowflake schema if they want to normalize only certain dimensions. This can be beneficial so that the structure is not too complex, but allows the user to do the necessary analysis.

As illustrated, dimensional modeling is an important part of data warehousing and the schema defines the structure. Analysis is much better with a correct data warehouse instead of using the database structures to perform similar analysis. I have been using SQL Server 2005/2008 to develop data warehouses and specific cubes. I am currently looking into SQL 2012 to see the additional features and functionality that it offers. 

Tuesday, March 20, 2012


This week in Business Intelligence we had an introduction to the steps necessary to get to dashboards, reports, and various analytics. Cleaning the data is a very time-consuming component of data warehousing, but you need clean data in order to produce useful analysis. Recommendations based on bad data will not be useful.

We were tasked with using Ataccama to perform data profiling. However, there are many tools out there, such as Informatica's data profiling tool. More information can be found here:

http://www.informatica.com/us/data-profiling/

The slogan that they use is: "Increase Confidence in Your Enterprise Data with Informatica Data Profiling Solutions"

Data profiling is the process of examining the data in various tables or sources to better understand its characteristics. My primary findings were the inconsistent formats that people enter when there are no constraints, along with null values. These problems need to be addressed for future inputs, and corrected immediately for past inputs, before proceeding with the business intelligence effort.
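A toy profiling pass can illustrate both findings at once. The phone-number column below is made up for the example; the idea is to count nulls and bucket values by their format "shape", which quickly surfaces the inconsistent entry formats a tool like a profiler would flag.

```python
import re
from collections import Counter

# Hypothetical unconstrained phone-number column with nulls and mixed formats.
values = ["555-123-4567", "(555) 123-4567", None, "5551234567", None]

nulls = sum(1 for v in values if v is None)

def shape(v):
    # Reduce a value to its format: digits -> 9, letters -> A, punctuation kept.
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))

formats = Counter(shape(v) for v in values if v is not None)
print(nulls)    # 2
print(formats)  # three distinct formats, one value each
```

In a real profiling run the same pattern-frequency idea is applied across every column, which is why the step is so time consuming.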

Extract, transform, and load (ETL) is the process that takes source data, makes the necessary transformations and adjustments, and loads it into a target database. Here is a general picture:


ETL is a very important part of ensuring that you can do proper analysis. Once the data has been loaded, we can analyze it in different ways. The key findings come out of the analysis stage, yet that stage is a minor time requirement in the overall process. Analyzing dirty data will lead to incorrect decisions, and may be worse than having no business intelligence at all.
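The extract-transform-load steps described above can be sketched in a few lines. The CSV source and the cleanup rules (trimming whitespace, normalizing case, parsing numbers) are invented for the example; a real pipeline would use a dedicated ETL tool, but the three stages are the same.

```python
import csv
import io
import sqlite3

# Extract: a tiny in-memory CSV standing in for a real source file.
source = io.StringIO("name,amount\n  alice ,10.5\nBOB,3\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clean_sales (name TEXT, amount REAL)")

for record in csv.DictReader(source):
    # Transform: trim whitespace, normalize case, parse the number.
    name = record["name"].strip().title()
    amount = float(record["amount"])
    # Load: insert the cleaned row into the target table.
    conn.execute("INSERT INTO clean_sales VALUES (?, ?)", (name, amount))

rows = conn.execute(
    "SELECT name, amount FROM clean_sales ORDER BY name").fetchall()
print(rows)  # [('Alice', 10.5), ('Bob', 3.0)]
```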

Saturday, March 3, 2012

Network Graph Properties

Business Intelligence class continued the discussion of network analysis and the properties of a network graph. There are many properties of a network that we want to analyze in order to understand it and compare it with other networks. Structural properties are very important for this analysis.

Degree centrality is the number, or proportion, of links that a particular node has; in a directed graph we distinguish between in-degree and out-degree. The normalized degree is usually expressed as a number between 0 and 1: the node's number of links divided by the total number of nodes minus one. Below is an example of how to calculate this:
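The degree/(n - 1) formula can be worked through on a small made-up undirected graph. Node A below links to all three other nodes, so its normalized degree is 3/3 = 1.0, while the leaf nodes each score 1/3.

```python
# Small undirected example graph as an adjacency list (hypothetical data).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A"},
}

n = len(graph)
# Degree centrality: number of links divided by (n - 1).
degree_centrality = {node: len(neigh) / (n - 1)
                     for node, neigh in graph.items()}
print(degree_centrality["A"])  # 1.0
```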


Source: http://www.sscnet.ucla.edu/soc/faculty/mcfarland/soc112/cent-ans.htm

Another important measure is closeness centrality. It describes how close a particular node is to every other node based on shortest paths; essentially, it tells you how quickly you can reach everyone else when starting from that node. It is useful in areas such as information diffusion and disease transmission. Below is an example of how this is calculated:
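One common normalized form of closeness is (n - 1) divided by the sum of a node's shortest-path distances to all other nodes, so higher means closer to everyone. The sketch below computes it with a breadth-first search over a small invented graph.

```python
from collections import deque

# Hypothetical undirected graph: a path C - A - B - D.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B"},
}

def shortest_paths(graph, start):
    # Breadth-first search gives shortest hop-counts in an unweighted graph.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neigh in graph[node]:
            if neigh not in dist:
                dist[neigh] = dist[node] + 1
                queue.append(neigh)
    return dist

n = len(graph)
closeness = {
    v: (n - 1) / sum(d for u, d in shortest_paths(graph, v).items() if u != v)
    for v in graph
}
print(closeness["A"])  # 0.75: distances 1, 1, 2 sum to 4, and 3/4 = 0.75
```

The two middle nodes A and B score 0.75, while the endpoints C and D score only 0.5, matching the intuition that central nodes can reach everyone faster.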


Source: http://www.sscnet.ucla.edu/soc/faculty/mcfarland/soc112/cent-ans.htm

Eigenvector centrality is another important characteristic we can use for analysis. It takes a different view of importance: a node is important if it is connected to other important nodes, which is the idea behind Google's PageRank for webpages. A node's importance does not come from its own links alone; it comes from being connected to nodes that are themselves well connected. This is useful because it captures both how many nodes a person can reach directly and how well connected those contacts are.
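A standard way to compute eigenvector centrality is power iteration: repeatedly set each node's score to the sum of its neighbors' scores, then normalize, until the scores stabilize. The graph below is invented for the example; node A is linked to everyone, including the well-connected pair B and C, so it comes out on top.

```python
# Hypothetical undirected graph; A touches every node, B and C also link
# to each other, and D hangs off A alone.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

# Power iteration: each score becomes the sum of the neighbors' scores,
# rescaled each round so the values stay bounded.
scores = {v: 1.0 for v in graph}
for _ in range(100):
    new = {v: sum(scores[u] for u in graph[v]) for v in graph}
    norm = max(new.values())
    scores = {v: s / norm for v, s in new.items()}

top = max(scores, key=scores.get)
print(top)  # 'A'
```

D illustrates the key idea: it has only one link, but that link is to the most important node, so it still scores higher than an isolated node ever could.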

Gephi will be used throughout this course to visualize graphs. I was able to use Gephi for the upcoming assignment to analyze my Facebook friends and see how everyone is interrelated.

Below is my Facebook network, built with Gephi and the Netvizz app, ranked by degree.