Beyond Monitoring - Measuring the effectiveness of a Data Warehouse - II

The data warehouse should be considered a work in progress, always needing improvements. Part-2 of this article discusses the factors that affect complex data environments and how they can be managed and improved.


Part-1 of this article proposed three steps to improve and ensure performance of a data warehouse. Part-1 also focused on the first step - measuring the complex data environment.

The Next Steps – Manage and Improve
As large and complex data environments become an integral part of the way in which organizations regularly conduct their business, certain problems arise:

· Usability – Efficiency of access to data becomes an issue in the face of large amounts of data, and response times can go beyond acceptable limits.
· Predictability – The growth of the workload in the IT environment becomes unpredictable.
· Manageability – Undisciplined and unstructured growth in the volume and usage of data becomes very hard to manage.
· Cost – The amount of money spent on the IT data environment, especially database licenses and disk storage, becomes very significant.

The challenge here is to understand the underlying factors that drive these problems and how they can be managed and improved to ensure end-user satisfaction. Then, an organization can make informed decisions on how to improve the usability, predictability, manageability and cost of their data environments. Certainly, the metrics we discussed in Part-1 would guide us with historical information on the usage and effectiveness of the data warehouse.

The underlying factors at a broad level need to be looked at, in order to increase the effectiveness of the data warehouse. They are:
· Transaction timing 
· Data Usage patterns 
· Storage requirements 

Now let’s look at these underlying factors and at recommendations on how to manage and improve them.

Transaction Timing

Data Warehouse transactions have response times from sub-second up to many hours. Thirty minutes is not an unusual response time for a Data Warehouse transaction. Data Warehouse transactions may involve a few units of data or millions of units of data. And they may occur at unpredictable times of the day.

Because of these very fundamental differences between unpredictable data warehouse transactions and repetitive, static transactions, measuring and monitoring must be different as well.

Measuring Unpredictable Response Times

Measurements need to be made because response times are not the same every time; a query may have a faster response time during early hours of the day and the same query may take a longer time to execute during the post-lunch sessions. The end-users’ perception of response time is shaped by both measurements. Some will care about how long it takes to get any data. Some will care only about how long it takes to get all the data; for these types of applications, longer response times may be acceptable, as response time of a few seconds may not be expected.

Just because a response time is long, it does not mean it is a problem. The Data Warehouse management must determine what is acceptable to the end-user. By measuring response times by department, user, query type, or which tables are accessed, we can work with users to determine if the response times are a problem and changes can be explored. Maybe queries could be run outside of peak usage times or at night, when there is less usage. Perhaps there are network issues that can be addressed.
In the end, we must know:

1. the response time for first data returned, 
2. the response time for last data returned, and 
3. user expectations for the above two measurements, so that management and improvement can be addressed.

Measuring Unpredictable Occurrence Patterns
The amount of concurrent activity affects performance. Response time can vary throughout the day as the workload varies. A five-minute response at 8:00 a.m. turns into a 20-minute response at 2:30 p.m. An important aspect of managing the environment is to look at the differences in response times throughout the day. There are many factors that can affect these response times. Some items to consider are:

· How many queries were running at that time of day
· What types of queries were being run
· What was the actual SQL text the user submitted
· Who were the users
· What applications were in use

How to Manage and Improve

We can address improving access and performance of data environments once we have a comprehensive view of all daily activities. By understanding the activities of the day, and the time of day when they occurred, we can identify ways to manage the varying workload. When we identify peak periods and light periods, we may be able to shift some queries to the lighter times, smoothing out response times. Users may need additional training or new tools. 

Data Usage Patterns

Like transaction timing, data usage patterns are also fundamentally different between data warehouse and OLTP (On-line Transaction Processing) environments. In OLTP, for example, there are usually a finite number of pre-designed transactions, and they follow predictable access paths through the data. But there are an infinite variety of data warehouse transactions that may occur, and there is no predictable pattern to the complex access paths that may be used.

Because of the evolving nature of the data warehouse, the workload of the environment is one of inherent instability and unpredictability. Yet this workload requires management as much as a static, predictable operational workload.

Measuring Transactions with Unpredictable Purposes

Often times, the results drive the users to ask additional questions, adding to the volume of unpredictable queries. Ad hoc querying of a database makes it extremely difficult to tune the database, something that is easily done in a predefined environment. Also, whereas workload sampling is adequate for monitoring predefined environments, the unpredictability of data warehouse workloads requires that data usage tracking be continuous, so that we have a complete picture of who uses what data and how. This ensures that our data management decisions are based on a true understanding of how the business uses the data.

Comprehensive tracking of all queries submitted identifies the data important to the end-users and how the end-users are using data. Once usage becomes clear, then patterns can be understood and we can improve the data design. For example, if users are making queries that manually join and summarize two tables, adding a de-normalized summary table might be the answer.

How to Manage and Improve
To effectively manage data usage, we must be able to identify:
· What data is being used
· How much data is being returned to the end user after an access has been made
· What tables are frequently joined, and with what columns
· What columns are used for row selection
· Whether repetitive queries are being submitted

Once usage patterns are understood, we can take the following steps to improve access to the data and, ultimately, the performance of the query:
· Indices can be added
· Redundant units of data can be intelligently deleted
· The physical location of data can be optimized
· The merging of tables with like patterns of access can be accomplished
· Summary tables can be built
· Standard queries can be designed

In short, once the pattern of access and usage of data are understood, there are many ways in which the data can be manipulated in order to make data access most efficient. However, without knowing about the usage of the data, there is no way to intelligently reorganize it. Data usage tracking allows us to find patterns, so we can ensure that the data design keeps up with the changing needs of the business users.

Storage Requirements

Another major issue in the management of data warehouse systems is managing and controlling the volume of data. Clearly, these systems demand a large amount of historical data. In addition, since there is no way to accurately predict what data will be needed, it is necessary to store data at the lowest level of detail. Finally, summary data is stored as well as detail data. Summaries are stored to make some data warehouse transactions run faster. This requires space over and above the detailed data.

Measuring Unused Data

The data warehouse often grows at an astonishing rate. When these environments first grow beyond their hardware capacity, we generally expand disks, processors and database licenses. As they mature, it is natural for the pattern of growth to continue. But at some point in time, as growth continues, the question must be asked: “Is ALL the data actually going to be used?”
Periodically, the management needs to evaluate the data stored in the data warehouse and evaluate, economically and technologically, whether it makes sense to remove unused data, and reuse the space the data has been occupying, rather than to purchase new storage. In addition, by removing dormant or redundant data, user access times are improved.

How to Manage and Improve

Data usage metrics allows us to manage performance by finding out what data is actually being used and what data is not being used. By measuring and tracking data usage, we can remove data intelligently – that is, remove unused data for the purpose of reclaiming storage. Data usage tools can reduce the cost of additional storage, processors and licenses and improve user experiences in accessing data.

Summary

The key to efficiently building and managing data is to understand how the data is being used. By going beyond the internals and focusing on how users are interacting with the data, the management gains a comprehensive view of how business actually uses the data environment. By getting real-time and historical insight from external metrics – who is using what data, when, how, how long, how often and from what applications – regular improvements can be made that ensure peak performance of mission-critical applications across the enterprise. We can improve business performance, control costs, increase efficiency and effectiveness for users, achieve service level promises and enhance the user experience.

By including data usage tracking in our performance management strategy, we enable the business to react quickly in a competitive business climate, which ultimately ensures success both for the user community and for IT.

It isn’t always clear who should be responsible for measurement. Some organizations have groups dedicated to performance and these are the folks who normally have the primary responsibility. Depending on the specific measurement, the responsibility may fall to the DBA group, the Architecture Group, or to Capacity Planning. Measurement is usually not a full-time job but it is a job that cannot be forgotten or denied. Ultimately, the metrics reflect the focus of the organization, e.g. response time, and they should be agreed upon by all partners from the start.

We need to measure because the data warehouse should always be considered a work in progress, always needing improvements. The process should always be to measure, identify problems and opportunities, and take appropriate action to solve the problems and exploit the opportunities.

Finally, Management wants to know how things are going. They just spent millions for the data warehouse and they want to know if they are getting their money's worth. Are the users using the system, are they happy, and are they achieving the benefits they were expecting? Each organization should identify the metrics they will need and use, as they continually work to improve their own data warehouse. An understanding of the appropriate metrics, the responsibility for gathering the metrics, and the use of those metrics can make the difference between success and failure of the data warehouse project.

About Author:
Soumendra Mohanty is with the Data Warehousing/Business Intelligence Capability division of Accenture, India




Added on April 21, 2008 Comment

Comments

Post a comment

Your name:

Comment: