Data Quality User Guide

Introduction

Communications between the gateway and other processes occur in a dedicated communications thread, which is separate from the main data processing thread. Data updates from Netprobes are read and placed on in-memory queues (one queue for each metrics view) for subsequent processing.

This allows the gateway to provide more predictable and consistent service during periods of high load. Because the gateway responds to Netprobe network heartbeat requests in a timely manner, Netprobes do not disconnect from the gateway. The gateway can therefore control if and when disconnect events should occur.

When the gateway is overloaded, the rate of data input exceeds the rate at which the gateway can process it. In this situation, the data currently being processed from the data queues lags behind the head of the queue, meaning that monitoring data becomes stale. The gateway attempts to resolve this situation by temporarily suspending connections to one or more Netprobes.
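The way data age grows under overload can be illustrated with a small simulation. This is a minimal sketch and not the gateway's implementation: the function name `oldest_pending_age`, the one-second tick and the message rates are all assumptions chosen for illustration.

```python
from collections import deque


def oldest_pending_age(arrival_per_s, service_per_s, seconds):
    """Simulate one FIFO queue in 1-second ticks and return the age, in
    seconds, of the oldest update still waiting when the run ends."""
    queue = deque()  # receive time of each pending update
    for now in range(seconds):
        # Updates read from the network during this tick.
        queue.extend([now] * arrival_per_s)
        # Updates applied by the processing thread during this tick.
        for _ in range(min(service_per_s, len(queue))):
            queue.popleft()
    return seconds - queue[0] if queue else 0


print(oldest_pending_age(10, 10, 60))  # 0: processing keeps up
print(oldest_pending_age(12, 10, 60))  # 10: data is ten seconds stale
```

With equal rates the queue drains on every tick. When arrivals exceed the processing rate, the age of the oldest pending update keeps growing for as long as the overload lasts.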

Data Quality consists of the following elements:

  • A gateway algorithm to maintain a quality of service.
  • Periodic reporting of data quality information to the gateway log file.
  • Additional fields in the Probe Data plugin allowing monitoring, logging and alerting of data quality metrics.

Data quality algorithm

The gateway uses the following algorithm to maintain quality of service. The settings referred to below are configured in the Operating Environment > Data Quality section of the gateway setup file.

  1. The gateway monitors dataview updates to determine if the oldest pending update has become stale (as defined by the maxDataAgeMs setting). If this occurs, a probe connection will be suspended to reduce gateway load and restore timely data processing.
  2. The gateway determines which connection to drop based on the total CPU utilisation spent processing each connection's incoming data over the last minute. The connection with the highest CPU load is then suspended for a period (as defined by the connectionSuspensionDuration setting) before the gateway reconnects.
  3. Once a connection has been suspended, no further suspensions will occur until a grace period (as defined by the suspendGracePeriod setting) has elapsed, allowing time to evaluate the effect of the suspension on the quality of the data.
  4. Setup changes represent a special case where the data age metrics may spike in the gateway. During setup application no incoming data from netprobes is processed, leading to a backlog of updates to be applied. To avoid unnecessary netprobe suspensions, the algorithm is disabled during setup changes and for suspendGracePeriod seconds afterwards.
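The four steps above can be sketched as a small decision function. This is an illustration under stated assumptions, not the gateway's actual code: the class and method names are hypothetical, the unit of connectionSuspensionDuration is assumed to be seconds, and the numeric values in the usage lines are illustrative rather than product defaults.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Settings:
    max_data_age_ms: int            # maxDataAgeMs
    connection_suspension_s: float  # connectionSuspensionDuration (unit assumed)
    suspend_grace_period_s: float   # suspendGracePeriod (seconds)


@dataclass
class DataQualityMonitor:
    settings: Settings
    grace_until: float = 0.0
    suspended_until: Dict[str, float] = field(default_factory=dict)

    def on_setup_applied(self, now: float) -> None:
        # Step 4: stay disabled for suspendGracePeriod seconds after a setup change.
        self.grace_until = max(self.grace_until,
                               now + self.settings.suspend_grace_period_s)

    def check(self, now: float, oldest_age_ms: int,
              cpu_last_minute: Dict[str, float],
              applying_setup: bool = False) -> Optional[str]:
        """Return the connection to suspend now, or None."""
        # Steps 3 and 4: no suspensions during a setup change or a grace period.
        if applying_setup or now < self.grace_until:
            return None
        # Step 1: act only when the oldest pending update is stale.
        if oldest_age_ms <= self.settings.max_data_age_ms:
            return None
        # Step 2: pick the connected probe with the highest CPU cost.
        connected = {name: cpu for name, cpu in cpu_last_minute.items()
                     if self.suspended_until.get(name, 0.0) <= now}
        if not connected:
            return None
        victim = max(connected, key=connected.get)
        self.suspended_until[victim] = now + self.settings.connection_suspension_s
        self.grace_until = now + self.settings.suspend_grace_period_s
        return victim


monitor = DataQualityMonitor(Settings(5000, 150, 60))  # illustrative values only
load = {"probeA": 10.0, "probeB": 30.0}
print(monitor.check(now=10, oldest_age_ms=6000, cpu_last_minute=load))  # probeB
print(monitor.check(now=20, oldest_age_ms=6000, cpu_last_minute=load))  # None
```

The second call returns nothing because it falls inside the grace period started by the first suspension, which mirrors step 3.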

Gateway log file reporting

Data quality statistics are reported to the log file every 10 minutes. The reports will appear similar to the following output:

<Tue Sep 25 09:13:46> INFO: DataQuality Statistics 10 minute periodic summary
<Tue Sep 25 09:13:46> INFO: DataQuality Maximum data age : period - 0 ms, lifetime - 2007 ms
<Tue Sep 25 09:13:46> INFO: DataQuality Maximum queue data size: period - 9055 bytes, lifetime - 27123 bytes
<Tue Sep 25 09:13:46> INFO: DataQuality Maximum total data size: period - 9055 bytes, lifetime - 27123 bytes

Each statistical summary contains the worst case recorded in the last 10 minute period, as well as the worst case seen during the lifetime of the gateway process. The metrics reported on are described below.

Note: Only the maximum data age metric is used by the data quality algorithm.

  • Maximum data age: the age of the oldest message (over all queues) seen in the 10 minute period.
  • Maximum queue data size: the largest size of a single queue (including its contained data) seen in the 10 minute period.
  • Maximum total data size: the maximum memory consumed by all queues (including contained data) at a single point in time, seen during the 10 minute period.

This data can be useful for a historical analysis of gateway data quality performance. An example script is provided in the Appendix section to convert a gateway log into a .csv file for analysis in Microsoft Excel or other tools.

Analysing Data Age on a Gateway

Due to the highly configurable nature of the Geneos product, each gateway installation behaves and performs differently. This section describes methods for analysing the Gateway's performance in order to determine appropriate data quality settings for a given configuration.

A quick way to determine a threshold value is to look in the gateway log file for the most recent Maximum data age report (see Gateway log file reporting). The lifetime value gives the maximum age seen in the lifetime of the gateway process. Setting the threshold at something approximating this value should help prevent unnecessary Netprobe suspensions. If the value looks too high, see below for details on performing a more in-depth analysis.
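As a minimal sketch of this quick check, the following Python reads log lines and returns the lifetime value from the last Maximum data age report. The function name and the regular expression are assumptions based on the sample log output shown earlier; real log lines may be padded differently, so the pattern tolerates variable whitespace.

```python
import re

# Pattern derived from the sample report lines; whitespace is matched loosely.
AGE_LINE = re.compile(
    r"DataQuality\s+Maximum data age\s*:\s*period - (\d+) ms, lifetime - (\d+) ms"
)


def last_lifetime_data_age_ms(lines):
    """Return the lifetime maximum data age (ms) from the last report, or None."""
    result = None
    for line in lines:
        match = AGE_LINE.search(line)
        if match:
            result = int(match.group(2))
    return result


sample = [
    "<Tue Sep 25 09:13:46> INFO: DataQuality Statistics 10 minute periodic summary",
    "<Tue Sep 25 09:13:46> INFO: DataQuality Maximum data age : period - 0 ms, lifetime - 2007 ms",
    "<Tue Sep 25 09:13:46> INFO: DataQuality Maximum queue data size: period - 9055 bytes, lifetime - 27123 bytes",
]
print(last_lifetime_data_age_ms(sample))  # 2007
```

To run it against a real log, pass an open file object instead of the sample list, since iterating a file yields its lines.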

Data quality statistical reports can be extracted from the log file and used for analysis. The Appendix provides an example script to extract this data on a Unix system. The following graph shows data age figures collected from a gateway over a one week period. The graph illustrates areas of analysis.

[Figure data-quality5: data age collected from a gateway over a one week period]

  1. There is a spike in activity around Aug 26 15.40 on the data age graph. Referring back to the log file reveals this is related to a setup change made at that time. The suspendGracePeriod Data Quality setting should be checked to ensure it is large enough to cover setup changes and allow a return to stable behaviour without disconnection. Following the spikes the Gateway returned to a stable state.
  2. While running in a stable state we see a few spikes in data age after 02.20 on 28th Aug. The largest of these is below 10,000 ms or 10 seconds. This may make a good maximum data age value for this configuration.
  3. The data should be checked to ensure the maxDataAge is reasonable when running in a steady state. This value is the age of the oldest message in the queue. If this value is greater than 2000‑3000 ms during steady state this could indicate the Gateway is generally overloaded. The next step in this case is to refer to the performance tuning documentation to find areas of gateway configuration for tuning.
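The analysis described in the points above can be approximated in a few lines of Python. This is a sketch under assumptions: the function name, the `settle_s` window used to discard samples taken shortly after a setup change, and the sample figures are all hypothetical, and the median is used here only as one possible measure of steady-state behaviour.

```python
from statistics import median


def analyse_data_age(samples, setup_times, settle_s):
    """samples: list of (time_s, period_max_age_ms) taken from the log.
    setup_times: times (s) at which setup changes were applied.
    Returns (largest steady-state spike, median steady-state age) in ms,
    or None if no steady-state samples remain."""
    def near_setup(t):
        return any(start <= t <= start + settle_s for start in setup_times)

    # Ignore reports that fall inside the window after a setup change.
    steady = [age for t, age in samples if not near_setup(t)]
    if not steady:
        return None
    return max(steady), median(steady)


# Hypothetical figures: one setup-related spike and one steady-state spike.
samples = [(0, 150), (600, 200), (1200, 45000),
           (1800, 300), (2400, 9500), (3000, 250)]
print(analyse_data_age(samples, setup_times=[1000], settle_s=600))  # (9500, 250)
```

In this example the 45000 ms reading is attributed to the setup change and ignored. The largest remaining spike (9500 ms) is a candidate for the maximum data age threshold, and the median (250 ms) is well below the 2000 to 3000 ms range that could indicate an overloaded Gateway.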

The next two graphs show the maximum queue data size and maximum total data size statistics extracted from the same gateway log file.

These statistics can help characterise the type of data flowing through the gateway: whether it comes mainly from a single source or is more uniformly distributed. Comparing the graphs for total data size against data age can also give an indication of the incoming data rate the gateway is processing at the time.
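One rough way to put a number on this characterisation is to compare the two size statistics from the same report. The ratio below is an assumption made for illustration and is not a metric the gateway reports; the two maxima may also have occurred at different moments within the period, so the result is only indicative.

```python
def largest_queue_share(max_queue_bytes, max_total_bytes):
    """Fraction of the peak total queue memory that the single largest
    queue could account for. Returns None when no data was queued."""
    if max_total_bytes <= 0:
        return None
    return max_queue_bytes / max_total_bytes


# Values from the sample log report: one queue accounts for all queued data.
print(largest_queue_share(9055, 9055))   # 1.0
# Hypothetical values: data spread across several queues.
print(largest_queue_share(2000, 10000))  # 0.2
```

A share close to 1 suggests that a single metrics view dominates the queued data, while a small share suggests the data is spread more evenly.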

[Figures data-quality6 and data-quality7: maximum queue data size and maximum total data size]

FAQ

What is the gateway data quality feature?

The gateway data quality feature is intended to maintain a high quality of service by ensuring timely processing of incoming monitoring data. Data update messages from netprobes are queued with a timestamp when they are received. Metrics on the age are gathered periodically and made available to users in the Probe Data plugin sampler and Gateway log file reporting. The gateway data quality algorithm also makes use of these same statistics.

Please see the Introduction for a more detailed description of this feature.

What is the data quality algorithm?

The data quality algorithm helps maintain timely processing of monitoring data by suspending netprobes if the maximum data age exceeds a configured threshold. For a more detailed description please see the Data quality algorithm section of this document.

Settings controlling the algorithm (including settings to disable it entirely) can be found in the gateway setup file, in Operating Environment > Data Quality.

How do I know if some probes have been suspended?

When a probe is suspended, the data item icon will display a warning sign containing the "Pause" sign in Active Console 2. These icons will be visible in the entities view, probes view, state tree and list-views containing these data items.

Note: Icons require AC2 version GA2011.2.1-120822 or newer. AC2 versions prior to GA3.0.8 may display a warning sign containing "S".

[Figure data-quality8: suspended probe icon in Active Console 2]

When a netprobe is suspended, the following message will appear both in the Gateway log and event ticker:

<Tue Sep 25 14:35:56> INFO: ProbeManager Netprobe <ProbeName> (localhost:8098) Suspended

After the suspend period elapses the netprobe will be unsuspended (or resumed) with the following message logged to the gateway log file:

<Tue Sep 25 14:38:26> INFO: Translator::SuspendedConnections Resuming previously suspended connection 4 (localhost:8098).
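These two messages can be used to work out from a log file which Netprobes are currently suspended. The following is a minimal sketch: the function name and both regular expressions are assumptions derived from the two sample messages above, and connections are matched by their host:port string.

```python
import re

# Patterns based on the sample "Suspended" and "Resuming" log messages.
SUSPENDED = re.compile(r"ProbeManager Netprobe (.+?) \(([^()]+)\) Suspended")
RESUMED = re.compile(r"Resuming previously suspended connection \d+ \(([^()]+)\)")


def currently_suspended(lines):
    """Return {host:port: probe name} for probes suspended and not yet resumed."""
    suspended = {}
    for line in lines:
        match = SUSPENDED.search(line)
        if match:
            suspended[match.group(2)] = match.group(1)
            continue
        match = RESUMED.search(line)
        if match:
            suspended.pop(match.group(1), None)
    return suspended


log = [
    "<Tue Sep 25 14:35:56> INFO: ProbeManager Netprobe <ProbeName> (localhost:8098) Suspended",
]
print(currently_suspended(log))  # {'localhost:8098': '<ProbeName>'}
log.append(
    "<Tue Sep 25 14:38:26> INFO: Translator::SuspendedConnections "
    "Resuming previously suspended connection 4 (localhost:8098)."
)
print(currently_suspended(log))  # {}
```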

Why do some of my probes keep showing as disconnected?

These probes may be temporarily suspended by the data quality algorithm and not disconnected. Versions of AC2 prior to GA3.0.0 or GA2011.2.1‑120822 will show suspended probes as disconnected.

Check for the other notifications of suspended probes, such as the gateway log or event ticker output. These are described above under the question How do I know if some probes have been suspended?

What do I do if I see probes being suspended?

The feature is designed to alleviate temporary backlogs of data in the Gateway and ensure timely processing of data. The probe will reconnect automatically after a suspension period. If you see many repeated Netprobe suspensions on a particular Gateway this may indicate that:

  1. A particular set of probes is sending a lot of data and may be misconfigured.
  2. The data quality defaults may not be suitable for the specific configuration.
  3. The Gateway may be overloaded.

Please see the question How do I find the right maxDataAge threshold for my setup? below.

How do I find the right maxDataAge threshold for my setup?

The default values should be suitable for most gateway installations. If you see a large number of netprobes being suspended on a gateway, then the threshold may be too low or the gateway overloaded.

Please refer to the Analysing Data Age on a Gateway section for details on how to determine an appropriate threshold value. For overloaded gateways, please see the Gateway Performance Tuning documentation.

How can I monitor my gateway for stale data?

Data quality metrics are available from the Probe Data plugin. The 'maxDataAge' value and configured 'dataAgeLimit' are published as headline variables. Users can configure a rule to alert if the maxDataAge is greater than a specified threshold, or compare this value with the dataAgeLimit if desired.

The same view also shows metrics for each netprobe, indicating the relative percentage of Gateway load required to process incoming data for that probe.

How can I disable the data quality algorithm?

The "disableChecks" setting in the gateway setup Operating Environment > Data Quality section will (if activated) disable the data quality algorithm. When the algorithm is disabled no Netprobes will be suspended, although data quality metrics will still be generated.

What is the impact of disabling the data quality algorithm?

When running with data quality disabled, the gateway will not take remedial action against growing queues and stale data. Gateway plugin and log statistics about data quality will continue to be output. On a persistently overloaded gateway, message queues will continue to grow, increasing the memory utilisation of the gateway process.

Appendix

Awk script to extract data quality metrics from a gateway log file

  1. Copy the following script to your Unix machine as a file named convert.awk:
     # Print the CSV header once.
     BEGIN { print "Date,Data age,Queue size,Total size" }
     # Summary line: strip the angle brackets and emit month-day-time.
     /DataQuality +Statistics/ { sub(/</,""); sub(/>/,""); printf "%s-%s-%s,",$2,$3,$4 }
     # Metric lines: field 13 is the "period" value. " +" accepts any padding.
     /DataQuality +Maximum data age/ { printf "%s",$13 }
     /DataQuality +Maximum queue data size/ { printf ",%s",$13 }
     /DataQuality +Maximum total data size/ { printf ",%s\n",$13 }
  2. Run the script against the Gateway log file:
     awk -f convert.awk gateway2.log > dataquality.csv
  3. The file dataquality.csv can now be loaded into Microsoft Excel (or other tools) for statistical analysis.