Archive for Uncategorized

Datawarehouse Design Topics

Dimensional Data Model:

Dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique.

Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section explains the problem, and describes the three ways of handling this problem with examples.

Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section explains their differences and lists the steps for constructing each.

What is OLAP: Definition of OLAP.

MOLAP, ROLAP, and HOLAP:

What are these different types of OLAP technology?

This section discusses how they are different from the other, and the advantages and disadvantages of each.

Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let’s define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as “Q1 2001″ or “2001 Q1″).

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables.

Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.

What is Datawarehousing ??

A data warehouse is a repository storing integrated information for efficient querying and analysis. Information is extracted from heterogeneous sources as it is generated or updated. The information is then translated into a common data model and integrated with existing data at the warehouse. When a user query is submitted to the warehouse, the needed information is already there, with inconsistencies and differences already resolved. This makes it much easier and more efficient to run queries over data that originally came from different sources.

Key advantages of data warehousing include:

1.Since query execution does not involve data translation and communication with remote sources, complex queries can be executed easily and efficiently.
2.End users can use a single data model and query language.
3.System design becomes simpler. For example, there is no need to perform query optimization over heterogeneous sources, a very difficult problem faced by other approaches.
4.Information sources may be unreliable and may purge data. On the other hand, information at the warehouse is under the control of the warehouse users; it can be stored safely and reliably for as long as necessary.

Ralph states that a data warehouse is “a copy of transaction data specifically structured for query and analysis”. Two quibbles I have with Ralph’s definition are: 1) Sometimes non-transaction data are stored in a data warehouse – though probably 95-99% of the data usually are transaction data. 2) I say “querying and reporting” rather than “query and analysis” because the main output from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted “formal” reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis. – For some more information about why the transaction data are copied, you may want to see my essay The Case for Data Warehousing.

Data Warehousing Objects

Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas.

Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit.

Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to contain queries. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. Examples are customers or products.

Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.

Creating a New Fact Table

You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.

Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time.

Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Hierarchies:

Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies–one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business structures. For example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.
Below figure illustrates a common example of a sales fact table and dimension tables customers, products, promotions, times, and channels.

<Refer the doc for the diagram>

Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually goes back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what are the relevant dimensions.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card, and the supermarket would offer lower prices for certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.

What Level Within Each Dimensions To Include

Determining which part of hierarchy the information is stored along each dimension is a bit more tricky. This is where user requirement (both stated and possibly future) plays a major role.

In the above example, will the supermarket wanting to do analysis along at the hourly level? (i.e., looking at how certain products may sell by different hours of the day.) If so, it makes sense to use ‘hour’ as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then ‘day’ can be used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in the fact table, the granularity exercise is in essence figuring out the sweet spot in the trade off between detailed level of analysis and data storage.

Note that sometimes the users will not specify certain requirements, but based on the industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need of additional details. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the “dumping the lowest level of detail into the data warehouse” symptom, and only includes what is practically needed. Sometimes this can be more of an art than science, and prior experience will become invaluable here.

Types of Facts

There are three types of facts:

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date
Store
Product
Sales_Amount –> (Additive) Fact

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table — date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.

Say we are a bank with the following fact table:

Date
Account
Current_Balance –> (Semi Additive Fact)
Profit_Margin –> (Non Additive Fact)

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what’s the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

Types of Fact Tables:

Based on the above classifications, there are two types of fact tables:

• Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

• Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Slowly changing Dimension :

The “Slowly Changing Dimension” problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key Name State
1001 Christina Illinois

At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the “Slowly Changing Dimension” problem.

There are in general three ways =to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and how the data model and the data looks like for each of them. Finally, we compare and contrast among the three alternatives.

Type 1 Slowly changing Dimension

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key Name State
1001 Christina Illinois

After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table:

Customer Key Name State
1001 Christina California

Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:

Customer Key Name State
1001 Christina Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key Name State
1001 Christina Illinois
1005 Christina California

Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

Type 3 :
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:

Customer Key Name State
1001 Christina Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

• Customer Key
• Name
• Original State
• Current State
• Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key Name Original State Current State Effective Date
1001 Christina Illinois California 15-JAN-2003

Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.

There are three levels of data modeling. They are conceptual, logical, and physical. This section will explain the difference among the three, the order with which each one is created, and how to go from one level to the other.

Conceptual Data Model

Features of conceptual data model include:
• Includes the important entities and the relationships among them.
• No attribute is specified.
• No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model

Features of logical data model include:
• Includes all entities and relationships among them.
• All attributes for each entity are specified.
• The primary key for each entity specified.
• Foreign keys (keys identifying the relationship between different entities) are specified.
• Normalization occurs at this level.

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database. In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).

The steps for designing the logical data model are as follows:

1. Identify all entities.
2. Specify primary keys for all entities.
3. Find the relationships between different entities.
4. Find all attributes for each entity.
5. Resolve many-to-many relationships.
6. Normalization.

Physical Data Model

Features of physical data model include:
• Specification all tables and columns.
• Foreign keys are used to identify relationships between tables.
• Denormalization may occur based on user requirements.
• Physical considerations may cause the physical data model to be quite different from the logical data model.

At this level, the data modeler will specify how the logical data model will be realized in the database schema.

The steps for physical data model design are as follows:
1. Convert entities into tables.
2. Convert relationships into foreign keys.
3. Convert attributes into columns.
4. Modify the physical data model based on physical constraints / requirements.

What is OLAP ?

OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition to OLAP was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered that this particular white paper was sponsored by one of the OLAP tool vendors, thus causing it to lose objectivity. The OLAP Report has proposed the FASMI test, Fast Analysis of Shared Multidimensional Information. For a more detailed description of both Dr. Codd’s rules and the FASMI test, please visit the OLAP Report.

For people on the business side, the key feature out of the above list is “Multidimensional.” In other words, the ability to analyze metrics in different dimensions such as time, geography, gender, product, etc. For example, sales for the company is up. What region is most responsible for this increase? Which store in this region is most responsible for the increase? What particular product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis.

Depending on the underlying technology used, OLAP can be broadly divided into two different camps:
MOLAP and ROLAP.
A discussion of the different OLAP types can be found in the MOLAP, ROLAP, and HOLAP section.

Leave a Comment

Learn XML Part1

Keywords: root element, child element, attributes, contents, structure, semantics, style,

Basics:

The XML declaration has version and standalone attributes. An attribute is a name-value pair separated by an equals sign.

<?xml version=”1.0” standalone=”yes”?>

Every XML document begins with an XML declaration that specifies the version ofXML in use. In the above example, the version attribute says this document conforms to XML 1.0. The XML declaration may also have a standalone attribute that tells you whether or not the document is complete in this one file or whether it needs to import other files. In this example, and for the next several chapters, all documents will be complete unto themselves so the standalone attribute is set to yes.

Markup tags can have three kinds of meaning: structure, semantics, and style.

Structure divides documents into a tree of elements. It merely expresses the form of the document, without regard for differences between individual tags and elements.

Semantics relates the individual elements to the real world outside of the document itself. Semantic meaning exists outside the document, in the mind of the author or reader or in some computer program that generates or reads these files.

Style specifies how an element is displayed. Stylemeaning specifies how the content of a tag is to be presented on a computer screen or other output device. ( achieved thru style sheets)

Every good XML document must have a root element. This is an element that completely contains all other elements of the document.

An attribute is a name-value pair associated with an element.

Attributes vs elements

Attributes can’t hold structure well.
Elements allow you to include meta-meta-data (information about the information about the information).
Not everyone always agrees on what is and isn’t meta-data.
Elements are more extensible in the face of future changes

CSS styles only apply to element content, not to attributes.

Leave a Comment