An answer to infrastructure questions

Do we buy more memory? When do we buy it? How long can we keep this resource at high utilization? These questions are usually answered with heavy assumptions that push us to the safe side: we over-provision and try to keep every infrastructure component below the thresholds we consider risky. Capacity Management exists to stop that.

What is "Capacity Management" in IT?

The main objective of Capacity Management is to "avoid waste".

The goal is to spend only the minimum necessary on IT infrastructure while still meeting your standards for quality of service and keeping risks at an acceptable level. In other words: avoid wasting money on IT infrastructure without negatively impacting the business.

Even shorter: balancing costs and risks.

The tiers of Capacity Management

An organization makes money by investing in creating IT services (which it can use or sell) that are built on IT infrastructure.

So here we have the 3 tiers of Capacity Management:

1. The business tier. It is about how the organization measures and maximizes profit through intelligent investment in IT. The main metric is money (cost, revenue, etc.).

2. The service tier. In this context, the terms "service" and "application" are interchangeable. This tier is about describing and measuring the services the organization builds and then uses or sells. At this level the metrics are the ones that make sense to users, and they measure the quality of the service. For a web application, for instance, we are talking about metrics such as the number of concurrent users that can be served, the latency a user experiences when completing a service request, or how many service requests can be completed in one hour.

3. The infrastructure tier. This tier focuses on IT components such as virtual or physical servers and network devices, and on the resources those components use, such as CPU, memory and disk storage. The metrics are component oriented, like the percentage of CPU or memory used and the total amount of CPU power or memory space.

These 3 tiers are tightly interconnected. A service must be built on a set of IT components. The capacity and utilization of the components “under the hood” explain how services perform, and depending on the expectations we have for service metrics, there may be a need to increase the power of our components. That translates into more expenditure on IT.

On the other hand, we could realize that our service quality is much better than needed, so we can afford to decommission or downgrade IT components so that we save money.

In summary, the relationship between these three tiers is tight but intricate. A single IT component may be acting as a bottleneck and causing poor service performance, and we should not expect changes in capacity at the infrastructure level to be reflected linearly in service performance. Understanding, dealing with, and even predicting this complex relationship between the 3 tiers is one of the main challenges in Capacity Planning.
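To make the bottleneck point concrete, here is a minimal sketch in Python. It assumes a deliberately simplified model in which a service can go no faster than its slowest component; the component names and the requests-per-hour figures are invented for illustration and do not come from any real system.

```python
# Simplified, hypothetical model: each component can sustain a certain number
# of service requests per hour, and the slowest one caps the whole service.
def service_throughput(component_limits):
    """Return the requests/hour the service can deliver under this model."""
    return min(component_limits.values())


# Invented figures, purely for illustration.
limits = {"cpu": 12000, "memory": 8000, "disk": 15000}

before = service_throughput(limits)

# Doubling CPU capacity changes nothing: memory is the bottleneck.
after_cpu_upgrade = service_throughput({**limits, "cpu": 24000})

# Doubling memory capacity helps, but only until CPU becomes the new limit,
# so a 100% capacity increase yields a 50% throughput increase.
after_memory_upgrade = service_throughput({**limits, "memory": 16000})

print(before, after_cpu_upgrade, after_memory_upgrade)
```

Even in this toy model, the same amount of extra capacity produces very different results depending on where it is added, which is why infrastructure metrics alone say little about service quality.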


From "Excel-based Capacity Management" to Report & Recommend & Predict & Plan

The traditional approach to Capacity Management (CM) has been to focus on the infrastructure tier, leaving aside the challenge of connecting it with the other 2 tiers.

The first "natural" step in CM is to make an inventory of IT resources and attach to each one a set of metrics that describe its utilization and capacity. Then we set thresholds for the different metrics, so that we get warnings or alerts when certain values are exceeded.

These alerts can be expressed as coloured cells in a table where rows are IT components and columns are meaningful metrics such as CPU capacity, CPU utilization, memory capacity, memory utilization, and more.
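The table of coloured cells described above can be sketched in a few lines of Python. This is only an illustration: the 70% and 90% thresholds, the colour names, and the `Component` and `build_report` names are assumptions made for the example, not values or tools prescribed by any standard.

```python
from dataclasses import dataclass

# Assumed thresholds, as fractions of capacity. Real values depend on how
# much risk an organization is willing to accept.
WARNING = 0.70
CRITICAL = 0.90


@dataclass
class Component:
    """One row of the inventory (hypothetical structure)."""
    name: str
    cpu_used: float  # fraction of CPU capacity in use, 0.0 to 1.0
    mem_used: float  # fraction of memory capacity in use, 0.0 to 1.0


def colour(utilization):
    """Map a utilization value to the cell colour used in the report."""
    if utilization >= CRITICAL:
        return "red"
    if utilization >= WARNING:
        return "amber"
    return "green"


def build_report(components):
    """Rows are components, columns are metrics, cells are colours."""
    return [
        {"component": c.name, "cpu": colour(c.cpu_used), "memory": colour(c.mem_used)}
        for c in components
    ]


# Invented inventory, for illustration only.
inventory = [Component("web-01", 0.93, 0.55), Component("db-01", 0.40, 0.78)]
for row in build_report(inventory):
    print(row)
```

Note that everything here lives at the infrastructure tier: the report says which components are busy, not whether users are getting the service quality they expect.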

Traditionally this has been stored and displayed in the form of Excel tables. As Gartner's latest report "Market Guide for Capacity Management Tools", May 2016, pointed out: "Many organizations depend on complex, but undocumented, homegrown Excel-based capacity management tools, with associated dependencies on the individual practitioners and authors of the tools. This business risk is rarely acknowledged by IT or infrastructure and operations (I&O) Leaders."

This is the basis of CM and the first of its 4 key steps, which we call Report. Reporting is about the present (plus all the historical information), and about describing it with meaningful metrics.

The reports generated this way have a purpose: to provide enough useful information to take action. From these Excel-style reports with their coloured cells we get hints about what to do next:

- See the components that are heavily used.

- Think about the steps we need to take so that those components with high utilization are upgraded (given more capacity).

- See the components that are underutilized.

- Think about what we can do with those under-used components. Usually this means migrating the processes they run to other elements of our infrastructure, and then either getting rid of them or reassigning them to support other components that are overloaded.

This second aspect or step is what we call Recommend. Recommending is about taking action now or in the very near future.
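A traditional Recommend step of this kind can be approximated with a simple rule over the reported utilization. The sketch below is an assumption-laden illustration: the 85% and 20% cut-offs, the action labels and the `recommend` function are hypothetical, and the rule looks only at infrastructure metrics.

```python
def recommend(utilization, high=0.85, low=0.20):
    """Suggest an action per component from its peak utilization.

    `utilization` maps a component name to a dict of metric -> fraction used.
    The `high` and `low` cut-offs are assumed values, not recommendations.
    """
    actions = {}
    for name, metrics in utilization.items():
        peak = max(metrics.values())
        if peak >= high:
            actions[name] = "upgrade"      # heavily used: give it more capacity
        elif peak <= low:
            actions[name] = "consolidate"  # under-used: migrate load, then retire or reassign
        else:
            actions[name] = "keep"
    return actions


# Invented utilization figures, for illustration only.
current = {
    "web-01": {"cpu": 0.93, "memory": 0.55},
    "batch-02": {"cpu": 0.08, "memory": 0.15},
    "db-01": {"cpu": 0.40, "memory": 0.78},
}
print(recommend(current))
```

The rule is easy to apply, but it has no notion of what an upgrade or a consolidation would do to service quality or to cost.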

As we can see, the traditional recommendation system for capacity management has been very limited. It has been based mainly on the infrastructure tier, and on weak assumptions about the impact that changes to our infrastructure have at the service and business levels.

Conclusion

In this post we've presented what Capacity Management is, along with its three tiers: business, service and infrastructure. We've also outlined traditional, infrastructure-level, "Excel-style" Capacity Management. In our next post we'll focus on the challenges and future of this discipline. We will do our best to predict that future and to bring the service and business tiers into play by applying machine learning, automating recommendations and doing "real-time" capacity planning.
