Why build a Data warehouse?

Why build a Data warehouse?
 

Often when we talk about implementing BI, the first thing we need to look at is how and where is the data? Is it scattered? Is it easy to report on? With this we often need to know if we need to build a reporting database or a data mart or for that matter a data warehouse. We all know WHAT a data warehouse essentially is; here we are going to discuss the WHY.

Consolidate/Integrate Data:

Usually the data in an enterprise is scattered across the various applications and data sources that form the enterprise’s ecosystem. To aid better insights and business decisions, the business users need data from all the various subject areas to be accessible. DWH helps us to consolidate and integrate data from various sources and store in a single data model.

 

Cleanse data & Enhance data Quality:

DWH includes converting data from different sources into common formats, using common keys, structure, decimal/time formats. We can restructure the data and rename tables and fields so if makes more sense to the users. We can also use master data management to consolidate common data from various sources and build relationships.

 

Store Historical Data:

DWH is usually used to store huge amounts of historical data that the transactional systems do not /cannot retain. This helps analytics across different time periods, trend analysis.

 

Speed up Data Retrieval:

In a DWH, the data is structured in such a way that aids faster retrieval. The structure is designed for faster data access, drill down, filtering, and summarization. The transactional systems usually optimized for smaller transactions that inserts/update data in specific tables and access small amount of data. However, BI requires analyzing, aggregating a large amount of data quickly.DWH stores data to help such retrievals.

 

Unburden transactional Systems:

Executing resource intensive BI queries that access huge amount of data on transactional systems may affect these systems. It may even bring them down affecting business. Building a DWH separate from the transactional systems solves this problem.

 

Prepare / Pre-aggregate data:

Depending on the business needs, data can be pre-aggregated at various levels and stored in the DWH. This can save time and improve performance of the BI reports. The KPIs that help the business users can also be pre-calculated and stored. Users can then analyse these indicators across dimensions to gain knowledge and insights.

 

Support Self Service BI:

The DWH either uses dimensional model or easily feeds dimensional models, which can be used to build analytical cubes. These cubes can be used by business users to do Adhoc reporting, slice-n-dice the data and build their own reports.

 

Increased findability for Business users:

In a DWH, the data is stored in ways that is more meaningful to the business users. Also, the data relations can be easily established. The table names, field names are business user friendly. These make it easier for users to find the data they were looking for.

 

Helps Data mining / trend analysis / Predictive Analytics:

Once a DWH is created, it makes easier to analyse huge amounts of data and associations to uncover hidden patterns using data mining methodologies. The historical data in DWH can we analysed across time periods to discover trends / perform business predictions.
 
These are some major reasons / benefits of building a DWH. Though building DWH is a time consuming and complex process, it solves many problems and delivers effective BI.

Shraddha Tambe | Helical IT Solutions

About Analytical Databases

Analytics-iconAbout Analytical Databases 

Companies and Corporates that build Business Intelligence solutions for their business, often want to consolidate their data that is spread across various applications, for reporting and analysis purposes. For that, they build datawarehouses or data marts as the foundation of their enterprise-wide BI solutions. These DWs have been implemented using traditional relational databases and successfully so. But as organizations add more and more data and users to these systems, the scabiltity and performance become crucial. One of the useful contributors to solving the problems of scability / performance could be the use of Analytical Databases or Datawarehousing DBs.

 

What are these Analytical Databases ?

Analytics databases are optimized to be used for retrieval, aggregation and analytics. They deliver benefits in a BI environment because they are designed for high performance in this environment.

These Analytical databases, or analytical platforms, span a range of technology from appliances and columnar databases to shared nothing, massively parallel processing databases. The common thread among them is that most are read-only environments that deliver exceptional price/performance compared with general-purpose relational databases originally designed to run transaction processing applications. Analytical databases are now a central component in any BI ecosystem, providing the necessary performance and scalability to support large numbers of analytical users and growing volumes of both structured and unstructured data.

Below are a few differences between transactional and analytical DBs:

  • On-line transactional procesing (OLTP) databases are designed to host operational systems that run the business. Analytics databases step in when operational use of the data has ended, and the data are to be used for information and analysis.
  • OLTP-oriented databases, are designed for high performance in an on-line transactional environment, where additions, updates, and deletions are being continuously made to individual or small numbers of records in a random manner over time, to data scattered accross tables. In an analytics database, there are mostly bulk uploads of data, deletions and update are rare / infrequent.
  • Data integrity is critical to transactional data, OLTP databases are optimal for these.
  • OLTP databases support environments where many concurrent users are retrieving, adding, updating, and deleting data on realtime basis. Analytics databases, on the other hand, support environments where very many concurrent users are retrieving data but adds, updates, and deletes are done periodically, on scheduled basis.
  • In OLTP environments, retrievals are oriented towards single or few records where many or most columns from these records are required. In analystics databases, retrievals are oriented towards many or most records (statistical operations); few columns from these records are required.

 

What makes Analytical DBs faster and more scalable ?

Let’s look at some of the technologies I mentioned above to see how these make analytical DBs different. These are some but not all of the technologies that are used.

Columnar Databases :

A columnar database, also known as a column-oriented database, is a database management system (DBMS) that stores data in columns rather than in rows as relational DBMSs. The main difference between a columnar database and a traditional row-oriented database are centered around performance, storage necessities and schema modifying techniques. The goal of a columnar database is to efficiently write and read data to and from hard disk storage and speed up the time it takes to return a query. One of the main benefits of a columnar database is that data can be highly compressed allowing columnar operations — like MIN, MAX, SUM, COUNT and AVG— to be performed very rapidly. Another benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

Massively Parallel Processing:

MPP (massively parallel processing) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. An MPP system is also known as a “loosely coupled” or “shared nothing” system. MPP allows the warehouse to break up large data analytics jobs into smaller, more manageable chunks, which are then distributed to multiple processors. Analytics is run simultaneously – or in parallel – on each processor, then the results returned and synthesized.

In Memory analytics:

I don’t think I really need to elaborate on this one. They are databaes which load the source data into system memory in a compressed, non-relational format in an attempt to streamline the work involved in processing queries.

Online analytical processing (OLAP):

These are the databases, which store multidimensional “cubes” of aggregated data for analyzing information based on multiple data attributes.

 

This is a little bit about analytical databases and how they might add value to a DW/BI Solution.

Shraddha TambeHelical IT Solutions

References:
Slide share – determine right analytic db
http://databases.about.com/od/datamining/a/Analytics-Databases.htm
http://insideanalysis.com/2013/08/understanding-analytical-databases/