Data Analysis Techniques

How data needs to be analysed basically depends on the end goal and not on the source. That end is typically decision making based on the information derived from the data. Data analysis these days typically includes reporting, multidimensional analysis and data mining, which correspond to “Display”, “Analyse” and “Discover” respectively. Depending on the type of data analysis, the requirements on the source data may vary.

     If reporting is the objective, even the simplest data source can give good results. Query and reporting capability primarily consists of selecting the associated data elements, perhaps summarizing them and grouping them by some category, and presenting the results: retrieving the relevant data from the data warehouse, transforming it into the appropriate context, and displaying it in a readable format. Finally, the report is delivered to the end user in the desired output format, be it graph, pie chart or table, on the required output medium.
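
As a rough illustration, a typical report of this kind boils down to a simple aggregate query like the sketch below (the sales and product tables and their columns here are hypothetical):

-- Monthly revenue by product category (illustrative schema).
SELECT p.category,
       COUNT(*)      AS orders,
       SUM(s.amount) AS total_revenue
FROM   sales s
JOIN   product p ON p.product_id = s.product_id
WHERE  s.sale_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP  BY p.category
ORDER  BY total_revenue DESC;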

     If the objective is to perform multidimensional data analysis, a dimensional data model would be more appropriate. This type of analysis requires a data model that supports fast and easy access to the data on the basis of any of numerous combinations of analysis dimensions. For example, you may want to know how many units of a specific product were sold on a specific day, in a specific store, in a specific price range. Multidimensional analysis enables users to look at a large number of interdependent factors involved in a business problem and to view the data in complex relationships. End users are interested in exploring the data at different levels of detail, determined dynamically. These complex relationships are analyzed through an iterative process that includes drilling down to lower levels of detail or rolling up to higher levels of summarization and aggregation, as well as pivoting, where the user takes a different viewpoint on the results by changing the way the dimensions are arranged. Like query and reporting, multidimensional analysis continues until no more drilling down or rolling up is needed.
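
To make the drill-down / roll-up idea concrete, here is a minimal SQL sketch against a hypothetical star schema (sales_fact and store_dim are assumed names):

-- Roll-up: total units sold per region (summary level).
SELECT d.region, SUM(f.units) AS units_sold
FROM   sales_fact f
JOIN   store_dim  d ON d.store_id = f.store_id
GROUP  BY d.region;

-- Drill-down: the same measure broken out by store and day within one region.
SELECT d.store_name, f.sale_date, SUM(f.units) AS units_sold
FROM   sales_fact f
JOIN   store_dim  d ON d.store_id = f.store_id
WHERE  d.region = 'West'
GROUP  BY d.store_name, f.sale_date;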

     As said before, data mining is about “Discovery”. This discovery could take the form of finding significance in relationships between certain data elements, a clustering together of specific data elements, or other patterns in the usage of specific sets of data elements. After finding these patterns, the algorithms can infer rules, which can then be used to generate a model that predicts a desired behavior, identifies relationships among the data, discovers patterns, and groups clusters of records with similar attributes. Data mining is most typically used for statistical data analysis and knowledge discovery: statistical data analysis detects unusual patterns in data and applies statistical and mathematical modeling techniques to explain them. Data mining is data driven. There is a high level of complexity in the stored data and data interrelations in the data warehouse that is difficult to discover without data mining. Data mining offers new insights into the business that may not be discovered with query and reporting or multidimensional analysis, giving us answers to questions we had not thought to ask.

     Together, these data analysis techniques take a keen look into the data and turn it into quality information, which forms the basis of business intelligence and of profitable business decisions that support business growth across different sectors.

Helical Insight for ETL Tester (QA)

Let’s take an example: your ETL developer has just written an ETL script that loads data into the Data Warehouse, and a few million records got inserted.

For QA, testing all of those (millions of) records is extremely difficult and time consuming. Moreover, there are several other problems:

  • Some testers might not be well versed with writing complex SQL queries.
  • There might be multiple data sources involved in building the Data Warehouse.
  • Writing separate queries on multiple data sources and then combining them into one single result set is a sweat-breaking task.

Now think of Helical Insight as a friend that handles most of these tasks for you, allowing you to focus on critical thinking rather than staring at and comparing data all day. Wondering how Helical Insight helps you achieve this?

Here is your answer,

  • Helical Insight is capable of communicating with any of your data source(s), like CSV, Excel, MySQL, Oracle, SQL Server, Postgres etc.
  • Helical Insight has a drag-and-drop feature to prepare your test cases, without requiring prior knowledge of SQL. Helical Insight generates the SQL queries, including the joins between multiple entities. Because it understands your data structure, it generates the right query even if you are working with a different set of data source(s), a DW or a normalized database.
  • Using this drag-and-drop facility you can capture all your test cases in the form of reports. For example, a report that counts the records of a dimension can be saved and checked against the source system by creating a matching report there. In a migration project the counts should match exactly; otherwise we know something is wrong with the ETL script (see the sketch after this list).
  • Once you have saved these reports (test cases) you can re-execute them again and again to verify your data. In short, Helical Insight is a very useful tool with some really cool features: generating the SQL query for any database or data source(s), executing queries on any database or data source(s), and saving the reports / test cases / SQL queries for future use.
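
For reference, the checks such saved reports encode usually reduce to simple aggregate queries like the sketch below; the schema and table names are made up for illustration, and Helical Insight would generate equivalent SQL from the drag-and-drop report:

-- Test case 1: row count of a dimension should match the source system.
SELECT COUNT(*) AS dw_customer_count  FROM dw.dim_customer;
SELECT COUNT(*) AS src_customer_count FROM crm.customers;

-- Test case 2: a key measure should reconcile between source and DW.
SELECT SUM(order_amount) AS src_total FROM erp.orders;
SELECT SUM(sales_amount) AS dw_total  FROM dw.fact_sales;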

Staging DB in Datawarehousing – The WHYs and WHY NOTs

Whenever we talk about designing data warehouses, a staging database is one consideration we have to make, and many factors determine whether you choose to have one or not. When we start small, for example implementing small data marts, we may not see the need for a staging database, but as DW implementations grow in size and complexity, staging DBs have a lot of benefits to offer. Staging DBs are often the place to dump source data, the workspace where temporary intermediate tables are created for the transformation steps of the ETL, the place to keep what is good, to reject what is bad, and so on.

Let’s quickly look through the PROs and CONs of having a staging DB in a DW implementation:

 
The WHYs

Source data accumulation: If data is being pulled from multiple source systems and not all data is available at the same time, we may need to accumulate the raw data pulled at different times and then integrate/merge and load it.

Data transformation, cleansing, data alignment: Staging DB may be the workplace for the ETL to do the required cleansing, apply transformation rules and perform alignment of data from various systems.

Decoupling of Source from Datawarehouse: Staging DB helps decouple the Source systems from the DW. Source systems are accessed only by ETL jobs which pull data into staging, or they may even choose to push data there. This greatly reduces the amount of time that source systems are accessed by DW loading jobs, reducing contention too.

ETL Performance: Loading staging tables is simple as they are mostly exact replicas of the source tables, minus the indexes / foreign key relationships. As most of the joins, cleansing and calculations are done on staging tables and may be stored in intermediate temporary tables, loading the DW from the staging tables is also simpler. This greatly improves ETL performance as well as the quality of the DW data.
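
As a minimal sketch (with illustrative table and column names), the DW load from staging often reduces to a cleansed INSERT ... SELECT like this:

-- Staging table is a near-replica of the source; the DW load becomes a
-- straightforward cleansed insert.
INSERT INTO dw.dim_customer (customer_id, customer_name, city, country)
SELECT s.customer_id,
       TRIM(s.customer_name),
       COALESCE(s.city, 'UNKNOWN'),
       UPPER(s.country_code)
FROM   staging.customers s
WHERE  s.customer_name IS NOT NULL;   -- reject obviously bad rows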

Handling failure, debugging, restart, recovery: As staging DBs hold snapshots of source data, can have extensive logging, intermediate tables used for data transformations etc., they make it easier to debug errors and failures. In case of restarts after failures, the step of pulling the source data may not have to be redone.

Data reconciliation / pre/post load validations: This is very similar to the point above. Staging DBs are excellent for capturing auditing information, which can be used for reconciling data, validating and rejecting bad data, and making sure the data we loaded is correct and consistent with the source, with nothing missed out.

Change Data Capture: The staging DB is also used as one of the mechanisms for CDC. For this, the copy of the source data tables from the previous load is preserved until the next execution. The current and previous source data are then compared to identify changes. This is mostly used in scenarios where the source systems have no means of supporting change detection themselves.
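
A minimal sketch of this comparison, assuming current and previous snapshots of a customers table are kept in staging (table and column names are illustrative):

-- Inserts and updates: rows in the current extract that are new or changed.
SELECT cur.*
FROM   staging.customers_current cur
LEFT   JOIN staging.customers_previous prev
       ON prev.customer_id = cur.customer_id
WHERE  prev.customer_id IS NULL                 -- new rows
   OR  prev.customer_name <> cur.customer_name  -- changed rows (a NULL-safe
   OR  prev.city          <> cur.city;          -- comparison may be needed)

-- Deletes: rows in the previous snapshot that are missing from the current one.
SELECT prev.*
FROM   staging.customers_previous prev
LEFT   JOIN staging.customers_current cur
       ON cur.customer_id = prev.customer_id
WHERE  cur.customer_id IS NULL;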

Simpler, maintainable ETLs: With staging, we divide the Extract Transform Load process into separate steps. Combining the extraction, cleansing, transformation logic, pre-calculations and summarizations into one single ETL makes it very complex and also a single point of failure. With a staging step, the ETLs are smaller, simpler and easier to maintain.

Prevent orphan records / race conditions: When we fetch data from the source directly and load it into the DW, we may sometimes pull partial updates happening in the transactional system, or new data may be getting inserted as we pull, causing race conditions. With staging, the data from the source is pulled in one single go, preventing such problems.

Backup of source: Staging DBs, if archived / retained for a specific period of time, can act as a backup of the source data that was loaded into the DW.

 

The WHY NOTs

Increased processing time: data is first loaded into staging tables, transformed and then loaded into the DW, so the additional step adds some processing time.
Increased latency: source changes take more time to reflect in the DW. In real-time / near real-time DWs, staging is usually not included.
Environment requirements: additional database servers / storage space are required.

The overhead that staging imposes is quite small compared to the benefits we derive. But then again, it completely depends on what suits a given requirement; the points above may, however, help us in making that choice.

 
Shraddha Tambe | Helical IT Solutions

 
References:

1) http://data-warehouses.net/architecture/staging.html
2) http://dwbi.org/etl/etl/52-why-do-we-need-staging-area-during-etl-load
3) http://danlinstedt.com/allposts/datavaultcat/data-vault-and-staging-area/
4) http://www.dwh-club.com/dwh-bi-articles/change-data-capture-methods.html

KPI used by Ecommerce Companies in Business Intelligence

Keeping a tab on KPIs is a very good exercise and can help entrepreneurs understand the current state of the business and the way forward. The general KPIs used by ecommerce companies are listed below (of course there can be variations, and a lot also depends on the company’s business goals). These KPIs, presented in different charts, tables and visualizations, can be built using any Business Intelligence (BI) software. This blog talks about the KPIs used by ecommerce companies in Business Intelligence.

 

Ecommerce KPI

Generic Reports are
– Average number of items per purchase
– Percentage (%) of new customers to existing customers
– Frequency of sales transactions
– Average lifetime value of customers
– Shopping cart abandonment rate
– Checkouts per cart
– Average Days to Purchase
– Order session Percentage (%)
– Average Visits to Purchase
– Coupon conversion percentage
– Cart conversion rate
– Percentage (%) of returning customers
– Percentage (%) of canceled checkouts
– Average order size

Operational reports
– Stock Weeks Cover: how many units we have in stock and how long they will last
– Rate of Stock Turns
– Stock at COGS (cost of goods sold) value
– Stock at SRP (suggested retail price) value
– Dispatch costs per warehouses

Performance Related Reports:-
– Gross Profit
– Gross Margin Percentage (%)
– Return Rate and Return Rate Percentage (%)
– Average order value
– Average items per order
– Traffic: number of visits or visitors, depending on what you want to measure
– Conversion rate = Number of dispatched orders / Number of visitors (see the sample query after this list)
– Price Point
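
As an example of how such KPIs translate into queries, here is a rough sketch for the conversion rate and average order value, assuming hypothetical orders and web_sessions tables:

-- Conversion rate for January: dispatched orders / distinct visitors.
SELECT 100.0 * o.dispatched_orders / v.visitors AS conversion_rate_pct
FROM  (SELECT COUNT(*) AS dispatched_orders
       FROM   orders
       WHERE  status = 'DISPATCHED'
       AND    order_date >= '2024-01-01' AND order_date < '2024-02-01') o,
      (SELECT COUNT(DISTINCT visitor_id) AS visitors
       FROM   web_sessions
       WHERE  session_date >= '2024-01-01' AND session_date < '2024-02-01') v;

-- Average order value for the same period.
SELECT SUM(order_total) / COUNT(*) AS avg_order_value
FROM   orders
WHERE  order_date >= '2024-01-01' AND order_date < '2024-02-01';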

Thanks
Nikhilesh
Helical IT Solutions

Jasper BI And Pentaho Reports Performance Optimization

In any BI solution, we create a lot of reports and dashboards. Often the developers come across the problem that the solution is not performing well and the reports take a lot of time to load. In this blog we discuss the different approaches that can be used to optimize the performance.

a. Having good hardware: this ensures that the processing does not take much time.

b. Having good internet speed: this reduces latency.

c. Reports with many pages: let’s say we are generating a report that has 100 pages. The good part about viewing it via the BI tool in a web browser (Jaspersoft and Pentaho) is that we don’t need to load all the pages before showing it to the end user. As soon as the first pages are loaded the end customer can start viewing them, while the remaining pages keep loading and rendering. The same is not true for other formats: if a client wants to download the report in Excel/PDF/CSV format, the download is possible only once the entire report has been generated.

d. Caching at BI level: Jaspersoft and Pentaho both have a caching facility. We can increase the size of the cache memory, which leads to improved performance.

e. Caching at database level: if we feel that some of the reports / queries generally take a lot of time, we can create temp tables inside the DB and store the values required for those reports in these tables. This way we can have multiple levels of caching (application level and database level), for example:
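
A minimal sketch of this database-level caching (MySQL-style date formatting, illustrative names); the summary table would be refreshed by a scheduled job:

-- Pre-aggregate the expensive part of a report into a summary table.
CREATE TABLE rpt_monthly_sales AS
SELECT product_id,
       DATE_FORMAT(sale_date, '%Y-%m') AS sale_month,
       SUM(amount)                     AS total_amount
FROM   fact_sales
GROUP  BY product_id, DATE_FORMAT(sale_date, '%Y-%m');

-- The report query then becomes a cheap lookup on the summary table.
SELECT sale_month, SUM(total_amount) AS total_amount
FROM   rpt_monthly_sales
GROUP  BY sale_month;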

f. Optimization at query level: use SQL best practices, such as avoiding unnecessary nested queries and too many joins. For example:
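
For instance, replacing a correlated sub-query with a plain join (illustrative tables) usually performs much better:

-- Correlated sub-query, executed once per row ...
SELECT o.order_id,
       (SELECT c.customer_name
        FROM   customers c
        WHERE  c.customer_id = o.customer_id) AS customer_name
FROM   orders o;

-- ... versus a plain join, which the optimizer can handle far more efficiently.
SELECT o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;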

g. Database indexing, and controlling the number of users on the database, will also improve report execution performance. For example:
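
For example, indexing the columns the reports filter and join on (table and column names here are illustrative):

CREATE INDEX idx_fact_sales_date     ON fact_sales (sale_date);
CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);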

h. Remove unused variables / parameters / table columns within the report.

i. Maximum usage of the built-in functions available within Jasper and Pentaho reports for calculating totals / percentages / averages and for type conversions (e.g. string to integer) will also improve report performance.

j. Whenever a BI solution is developed directly against the transactional database, performance is compromised. Hence it is always suggested to have a separate reporting database (or data warehouse).

k. If there are multiple input parameters (like the ones present in the snapshot below), the report should refresh only when the submit button is clicked.
BI Report Performance
Otherwise, every parameter change fires queries against the database. Ideally, a report/dashboard should also not have too many input parameters.

Navin Dasari
Helical IT Solutions

Introduction to XSD – XML Schema Definition

This blog gives an introduction to XSD – XML Schema Definition.

XML Schema Definition:- An XML Schema describes the structure of an XML document and what a given XML document can contain, in the same way that a database schema describes the data that can be contained in a database (i.e. table structure, data types, constraints etc.). The XML Schema defines the shape, or structure, of an XML document, along with rules for data content. The purpose of an XML Schema is to define the legal building blocks of an XML document, just like a DTD, but XML Schemas are richer and more powerful than DTDs. XSD (XML Schema Definition), a Recommendation of the World Wide Web Consortium (W3C):

  • defines elements that can appear in a document
  • defines attributes that can appear in a document
  • defines which elements are child elements
  • defines the order of child elements
  • defines the number of child elements
  • defines whether an element is empty or can include text
  • defines data types for elements and attributes
  • defines default and fixed values for elements and attributes
  • XML Schemas are written in XML
  • XML Schemas support data types
  • XML Schemas support namespaces

NameSpace:-

  • XML Namespace is a mechanism to avoid name conflicts by differentiating elements or attributes within an XML document that may have identical names, but different definitions.
  • It solves the name conflict by using a prefix.

<code>
<h:table>
   <h:tr>
      <h:td>IT</h:td>
      <h:td>Management</h:td>
   </h:tr>
</h:table>
<f:table>
   <f:name>Playground</f:name>
   <f:width>100</f:width>
   <f:length>200</f:length>
</f:table>
</code>

The namespace for the prefix must be defined. The namespace is defined by the xmlns attribute in the start tag of an element. The namespace declaration has the following syntax: xmlns:prefix="URI".

<code>
<root>
   <h:table xmlns:h="http://www.w3.org/tr/html4">
      <h:tr>
         <h:td>IT</h:td>
         <h:td>Management</h:td>
      </h:tr>
   </h:table>
</root>
</code>

Advantages of XML Schemas Created Using XSD:-

  • XSD provides control over the type of data that can be assigned to elements and attributes.
  • XSD enables you to create your own data types. This feature enhances the flexibility of defining the structure of the XML document.
  • XSD enables you to specify restriction on data. For example, you can ensure that the content of an element is a positive integer value.
  • The syntax for defining an XSD is the same as the syntax used for xml documents. Therefore, it is easier to learn the syntax of an XSD.
  • XML schema content models can be used to validate mixed content.

XML schema is extensible. This enables you to reuse parts of a schema in another schema and derive custom data types from existing data types.

D3 DayHour Charts integration with Jaspersoft

This blog talks about D3 DayHour Charts integration with Jaspersoft

All the reports are developed using iReport 5.5 Professional and JasperReports Server 5.5.

As the HTML component of JasperReports Server does not load any external scripts by itself, we loaded the script in one of the decorator pages (a JSP page). The page is located at:

C:\Jaspersoft\jasperreports-server-5.5\apache-tomcat\webapps\jasperserver-pro\WEB-INF\decorators\decorator.jsp

In this page we include the scripts we want to load. We added the following code in the JSP page at line no 46:

<script type="text/javascript" language="JavaScript" src="${pageContext.request.contextPath}/scripts/d3.v3.min.js"></script>

The script to be added should be kept at location:

C:\Jaspersoft\jasperreports-server-5.5\apache-tomcat\webapps\jasperserver-pro\scripts

Meaning of DayHour Chart:-

This chart represents the behaviour of a particular measure during different hours of different days. Such graphs can be used to view variations in different situations.

DayHour Jaspersoft Integration

Integration with JasperServer:

The data which we use for developing the calendar view can be fetched from any database. The data fetched from the database is stored in a variable and is then accessed in the HTML component using the same variable. This process makes the report dynamic instead of static. A few parameters can also be added to the report and used in the query and/or the HTML component.

Generally, for this type of chart we pass a variable that contains the required data: date, hour and a value associated with that particular date and hour. The string is created in JSON format so that, when accessed in the script tag, it can easily be converted to a JSON object using the eval function.

Any variable/parameter can be accessed as shown below:

"<script> var arr = " + $V{variable1} + " </script>"

Parameter in query:

SELECT * FROM table_name
WHERE date BETWEEN $P{parameter1} AND $P{parameter2}

 

The sample code of static chart can be found at:

http://bl.ocks.org/tjdecke/5558084

The steps on how to integrate it with JasperServer were discussed in my previous blog (D3 Integrating with Jasperserver).

 

 

Avi Dawra

Helical IT Solutions

SCD Type 1 Implementation on Pentaho Data Integrator

This blog will talk about SCD Type 1 Implementation on Pentaho Data Integrator

Slowly Changing Dimension Type 1 does not preserve any historical versions of the data.
This methodology overwrites old data with new data, and therefore stores only the most current information. In this article let’s discuss the step-by-step implementation of SCD Type 1 using Pentaho.
Since old data is overwritten, the number of records stored in an SCD Type 1 dimension does not keep growing with every change.
Create the source and target tables in the database and create the database connections.
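
Conceptually, the transformation built in the steps below is equivalent to the following SQL sketch (MySQL/MariaDB-style multi-table UPDATE, illustrative table names); in Spoon the same logic is expressed with the Table Input, Database Lookup, Filter Rows, Table Output and Update steps:

-- SCD Type 1: overwrite changed attributes in the dimension ...
UPDATE dim_customer d
JOIN   src_customer s ON s.customer_id = d.customer_id
SET    d.customer_name = s.customer_name,
       d.city          = s.city
WHERE  d.customer_name <> s.customer_name
   OR  d.city          <> s.city;

-- ... and insert source rows that do not exist in the dimension yet.
INSERT INTO dim_customer (customer_id, customer_name, city)
SELECT s.customer_id, s.customer_name, s.city
FROM   src_customer s
LEFT   JOIN dim_customer d ON d.customer_id = s.customer_id
WHERE  d.customer_id IS NULL;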

Table Input Step:

SCD Pentaho

Drag and drop the Table Input step into the Spoon workspace, choose the connection, then click on Get SQL select statement.

Then click OK.

Database Lookup:-

Drag and drop the Database Lookup step. Double-click on the Database Lookup and set the following properties, then click OK:

SCD Pentaho 2

Filter Rows:-

Set the following properties of the Filter Rows step and click OK:

SCD Pentaho 3

Select Values:-

Before you work on Select Values, connect the Table Output step and specify all the database fields on the output step.

SCD Pentaho 4

Table Output :-

SCD Pentaho 5

Update:-

Set the following on the Update step and click OK:

SCD Pentaho 6

The transformation should look like the following figure:

SCD Pentaho 7

 

For any other query related to Pentaho Data Integrator, get in touch with us at Helical IT Solutions

Bidyut Kumar

Helical IT Solutions

Jaspersoft Report Scheduler

This blog will talk about how to use Jaspersoft Report Scheduler, its configuration etc.

1. Report Scheduler: it is an automated and recurring way to email reports to anyone in PDF, CSV, Excel, etc. format.

 

2. To send email, the server should be connected to an outbound email server, which has to be configured accordingly:

a. Go to this path: C:\Jaspersoft\jasperreports-server-5.5\apache-tomcat\webapps\jasperserver-pro\WEB-INF\js.quartz.properties

 

You will find the js.quartz.properties file like below screenshot:

Jaspersoft Report Scheduler 1

b. Change the highlighted values to your required input.

c. Restart the JasperReports Server.

 

3. Now go to the report and schedule the required job:

 

Below mentioned are the steps to be followed for doing the same:

Jaspersoft Report Scheduler 2

b. After this go to the “Create Schedule” tab, where you will see the following:

Jaspersoft Report Scheduler 3

Here you can set how frequently you want to schedule this report and for which timezone.

c. Then go to the “Output Options” tab in the same window. Here you have to specify in what format you want to send the report.

Jaspersoft Report Scheduler 4

d. Then go to the “Notifications” tab in this window. Specify here all the recipients to whom we want to send the reports.

Jaspersoft Report Scheduler 5

Now you have successfully scheduled the report!

4. If we want to stop the scheduler, go to this location:

C:\Jaspersoft\jasperreports-server-5.5\apache-tomcat\webapps\jasperserver-pro\WEB-INF\applicationContext-report-scheduling.xml

Jaspersoft Report Scheduler 6

Remove the init-method="start"

Restart the Jasperserver

Thanks

Rupam

Helical IT Solutions

MariaDB to MariaDB Data Replication Strategies


This blog will try to explain about the different MariaDB to MariaDB replication strategies which can be followed.

Replication Benefits

Replication has a number of benefits, some of which include:

  • Scalability: By having one or more slave servers, reads can be spread over multiple servers, reducing the load on the master. The most common scenario for a high-read, low-write environment is to have one master, where all the writes occur, replicating to multiple slaves, which handle most of the reads.
  • Lower hardware requirements: having a slave DB provides the freedom to play with the data without affecting the master database. We can have all the reporting etc. done on this database. Hence, a very high hardware requirement on the master database can be avoided.
  • Data analysis: Analyzing data may have too much of an impact on a master server, and this can similarly be handled on a slave server, while the master continues unaffected by the extra load.
  • Backup assistance:  Backups can more easily be run if a server is not actively changing the data. A common scenario is to replicate the data to slave, which is then disconnected from the master with the data in a stable state. Backup is then performed from this server.
  • Distribution of data: Instead of being connected to a remote master, it’s possible to replicate the data locally and work from this data instead.

 

 

Approach 1: MariaDB to MariaDB replication using a master-slave configuration.

 

Data Replication

Replication is a feature allowing the contents of one or more master servers to be mirrored on one or more slave servers.

We can exert control over which data to replicate. All databases, one or more databases, or tables within a database can all be selectively replicated as well.

The main mechanism used in replication is the binary log. If binary logging is enabled, all updates to the database (data manipulation and data definition) are written into the binary log as binlog events. Slaves read the binary log from each master in order to access the data to replicate.

In this particular approach, masters and slaves do not need to be in constant communication with each other. We can take the servers offline or disconnect them from the network, and when they come back, replication will continue where it left off. The slave in this case will be in read-only mode, and data copying from master to slave happens in near real time with minimal lag. Hence, on the BI and reporting front, we can have near real-time BI capabilities.
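
A minimal sketch of setting up this master-slave replication (the host name, user, password and binlog coordinates below are placeholders; the master needs server-id and log_bin set in its my.cnf, and the slave a different server-id):

-- On the master: create a replication user and note the binlog position.
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
SHOW MASTER STATUS;   -- note File and Position for the slave setup

-- On the slave: point it at the master and start replicating.
CHANGE MASTER TO
    MASTER_HOST     = 'master.example.com',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = 'repl_password',
    MASTER_LOG_FILE = 'mariadb-bin.000001',   -- values from SHOW MASTER STATUS
    MASTER_LOG_POS  = 4;
START SLAVE;
SHOW SLAVE STATUS;    -- Slave_IO_Running and Slave_SQL_Running should be 'Yes'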

 

 

Approach 2: Usage of third-party tools

The second approach which can be used for data replication from a master DB to a slave DB is the usage of third-party DB replication tools. One example which can be cited here is Tungsten Replicator, an open source third-party data replication tool. It works not only on MySQL but also on MariaDB. Tungsten can help solve problems like promoting masters easily from pools of slaves, replicating data between different database versions, replicating efficiently across sites, building complex topologies, and parallelizing data flow between servers. Tungsten Replicator runs equally well in cloud as well as locally hosted environments, and its users range from tiny start-ups to the largest web properties on the planet.

 

 

Approach 3: Using ETL with CDC (open source options)

The third approach which can be followed is to use an ETL tool. Open source ETL options include Talend Open Studio and Kettle (Pentaho Data Integration).

To implement the master-slave database configuration this way, we would implement CDC (Change Data Capture). With CDC we can identify, capture and deliver only the changes which are made to the master database, and the frequency of this can be configured. Change Data Capture is based on the publisher/subscriber principle: the publisher captures the change data and makes it available to the subscribers. Implementing CDC also helps reduce the workload on the ETL, since only the changed data is moved.
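
One simple variant, when the source tables carry a last-updated timestamp, is a watermark-based incremental extract run by the ETL; the table, column and watermark value below are assumptions for illustration:

-- Pull only the rows changed since the last successful ETL run.
SELECT *
FROM   src.orders
WHERE  last_updated > '2024-01-15 02:00:00'   -- watermark stored by the ETL
ORDER  BY last_updated;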

The advantages of implementing replication via ETL are that security measures such as data encryption can be applied, and while copying data we have options such as selecting which tables to replicate, how much data to replicate, and which rules to apply during replication.