Azure Redis Cache Test

The Azure Redis Cache is a high-performance caching service that provides in-memory data store for faster retrieval of data. It is based on the open-source implementation Redis cache. This ensures low latency and high throughput by reducing the need to perform slow I/O operations. It also provides high availability, scalability, and security.

Figure 1 shows the cache workflow.

Figure 1 : How the Azure Redis Cache works?

When a user uses an application, the application tries to read data from the cache. If the requested data is not available in the cache, the application gets the data from the actual data source. Then the application stores that data in the cache for subsequent requests. When the next request comes to an application, it retrieves data from the cache without going to the actual data source. This process improves the application performance because data lives in memory. Also, it increases the availability of an application in case of unavailability of the database.

If the cache does not serve its purpose - i.e., if the cache repeatedly fails to service application requests or is slow when at it - the performance of the application will significantly deteriorate. To ensure the high availability and optimal usage of the cache, administrators need to track requests to and responses from the cache, promptly detect cache misses and latencies, diagnose the reasons for the same, and eliminate those reasons quickly. This is where the Azure Redis Cache test helps!

For each Redis cache that is created for a target Azure subscription, this test tracks the status of that cache, and alerts administrators if any cache is in an Error/Unknown state. The test also monitors the requests to and responses from every cache. In the process, the test measures the time taken by each cache to process application requests. If any cache takes an unusually long time to service requests, the test notifies administrators of the same. Similarly, if any cache fails to service requests consistently, the same is brought to the attention of the administrator. Additionally, the test also reveals the probable reasons for these anomalies - is it because the cache is running out of memory? is it because the cache is blocking/rejecting connections to it? is it owing to heavy load on the cache? or is it because of a failed failover attempt? The test also reveals if the cache is sized with adequate bandwidth and CPU for its operations. Alerts are sent out if the cache is running low on such critical resources.

Target of the Test: A Microsoft Azure Subscription

Agent deploying the test: A remote agent

Output of the test: One set of results for every Azure Cache for Redis that is configured in each resource group of the target Azure subscription

Configurable parameters for the test
Parameters Description

Test Period

How often should the test be executed.

Host

The host for which the test is to be configured.

Subscription ID

Specify the GUID which uniquely identifies the Microsoft Azure Subscription to be monitored. To know the ID that maps to the target subscription, do the following:

  1. Login to the Microsoft Azure Portal.

  2. When the portal opens, click on the Subscriptions option (as indicated by Figure 2).

    Figure 2 : Clicking on the Subscriptions option

  3. Figure 3 that appears next will list all the subscriptions that have been configured for the target Azure AD tenant. Locate the subscription that is being monitored in the list, and check the value displayed for that subscription in the Subscription ID column.

    Figure 3 : Determining the Subscription ID

  4. Copy the Subscription ID in Figure 3 to the text box corresponding to the SUBSCRIPTION ID parameter in the test configuration page.

Tenant ID

Specify the Directory ID of the Azure AD tenant to which the target subscription belongs. To know how to determine the Directory ID, refer to Configuring the eG Agent to Monitor a Microsoft Azure Subscription Using Azure ARM REST API.

Client ID, Client Password, and Confirm Password

To connect to the target subscription, the eG agent requires an Access token in the form of an Application ID and the client secret value. For this purpose, you should register a new application with the Azure AD tenant. To know how to create such an application and determine its Application ID and client secret, refer to Configuring the eG Agent to Monitor a Microsoft Azure Subscription Using Azure ARM REST API. Specify the Application ID of the created Application in the Client ID text box and the client secret value in the Client Password text box. Confirm the Client Password by retyping it in the Confirm Password text box.

Proxy Host and Proxy Port

In some environments, all communication with the Azure cloud be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server and collects metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which the server listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none, indicating that the eG agent is not configured to communicate via a proxy, by default.

Proxy Username, Proxy Password and Confirm Password

If the proxy server requires authentication, then, specify a valid proxy user name and password in the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measures made by the test:
Measurement Description Measurement Unit Interpretation

Status

Indicates the current status of this cache.

 

The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value
Succeeded 1
Updating 2
Error 3
Unknown 0

Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current provisioning status . The graph of this measure however, is represented using the numeric equivalents only.

Use the detailed diagnosis to know the host name, port, and version of the cache.

Latency

Indicates the time taken by this cache to respond to requests.

Secs

A high value is indicative of a slow cache. This can adversely impact application performance.

Total number of command processed

Indicates the rate at which commands were processed by this cache.

Commands/sec

A consistent rise in the value of this measure is a sign of good health. On the other hand, a steady drop in the value of this measure hints at processing bottlenecks. In such a situation, look up the Memory used and CPU usage measures, and cache reads and writes to figure out if there are any resource contentions. A poorly sized cache can often be sluggish when responding to requests. You may want to consider increasing the resource allocations/limits for the cache, so that the cache has more processing power at its disposal.

Cache hit ratio

Indicates the ratio of failed cache lookups to the total number of requests received by this cache.

Percent

A value less than 80% is a cause for concern, as it implies that the cache has failed to service a large majority of requests to it. In such a situation, check the value of the Server load measure to see if there is any abnormal increase in load, causing the cache server to timeout without completing requests. This can cause cache misses. You can also check the Memory used and Memory fragmentation ratio measures to see if the cache has sufficient memory for storing data. Memory contention on the cache is one of the common causes for poor cache performance.

Memory used

Indicates the amount of memory used by this cache.

MB

If the value of this measure is close to that of the Maximum memory measure, it means that the cache is about to exhaust its memory allocation. Without enough memory, the cache will not be able to store data. This can result in cache misses, which in turn will affect application performance. To avoid a memory contention therefore, consider the following:

  • You can implement a cluster setup with multiple Redis nodes to enhance the memory capacity of the cache.

  • If the cache cluster is already in place, then you may want to reduce the amount of memory that is available for non-cache operations, so that more memory is available for caching. For that, reduce the maxmemory-reserved setting of the cluster.

  • The maxfragmentationmemory-reserved setting configures the amount of memory, in MB per instance in a cluster, that is reserved to accommodate for memory fragmentation. When you set this value, the Redis server experience is more consistent when the cache is full or close to full and the fragmentation ratio is high. When memory is reserved for such operations, it's unavailable for storage of cached . data. By reducing the value of this setting, you can make sure that more memory is available for caching.

  • Change the eviction policy

Memory fragmentation ratio

Indicates the percentage of memory in this cache that is fragmented.

Percent

Fragmentation is likely to be caused when a load pattern is storing data with high variation in size. For example, fragmentation might happen when data is spread across 1 KB and 1 MB in size. When a 1-KB key is deleted from existing memory, a 1-MB key cannot fit into it causing fragmentation. Similarly, if 1-MB key is deleted and 1.5-MB key is added, it cannot fit into the existing reclaimed memory. This causes unused free memory and results in more fragmentation.

The fragmentation can cause issues when:

  • Memory usage is close to the max memory limit for the cache, or

  • Used memory is higher than the Max Memory limit, potentially resulting in page faulting in memory

Evicted keys

Indicates the number of keys that have been evicted from this cache.

Number

If the cache is under severe memory pressure, you will want to see this measure report a high value. To increase the number of keys evicted, you may want to change the eviction policy. The default policy for Azure Cache for Redis is volatile-lru, which means that only keys that have a TTL value set are eligible for eviction. If no keys have a TTL value, then the system won't evict any keys. If you want the system to allow any key to be evicted if under memory pressure, then you may want to consider the allkeys-lru policy.

Blocked clients

Indicates the number of client connections to this cache that were blocked.

Number

Ideally, the value of this measure should be 0.

Connected clients

Indicates the number of clients currently connected to this cache.

Number

The maxclients setting governs the maximum number of connected clients allowed at the same time. If the value of this measure is equal to the maxclients setting, Redis closes all the new connections, returning a 'max number of clients reached' error.

Connected slaves

Indicates the number of slaves connected to this cache currently.

Number

 

Last interaction time between master and slave

Indicates the time since the master and slave last interacted.

Seconds

A high value for this measure could be a sign that there are issues in master-slave communication. If these issues persist, failover attempts may fail, and the whole purpose of an HA configuration for the cache will be defeated. Moreover, if slaves do not communicate with the master, then delays in data replication will become inevitable. Timely replication of data between the master and slaves is key to ensuring that the data replicas are always in sync with the master. If they are not, then the slaves may not be able to service cache requests effectively, when the master is down.

Total number of keys in database

Indicates the number of keys in this cache's database.

Number

 

Unix timestamp of last save to disk

Indicates when data in this cache was written to disk last.

Number

 

Number of changes to the database since last dump

Indicates the number of changes that have been written to this cache's database since the last database update happened.

Number

 

Rejected connections

Indicates the number of connections to this cache that were rejected.

Number

Ideally, the value of this measure should be 0. However, if this measure reports a non-zero value consistently, it could be because the maxclients setting is not set commensurate to the connection load on the cache.

The maxclients setting governs the maximum number of connected clients allowed at the same time. If the value of the Connected clients measure is equal to the maxclients setting, then new connections are closed/rejected.

Cache misses

Indicates the number of failed key lookups in this cache.

Number

A healthy, optimal cache is one that is capable of servicing all requests to it. Ideally therefore, the value of this measure should be 0. A high value for this measure indicates that the cache has failed to service requests to it. In such a situation, check the value of the Server load measure to see if there is any abnormal increase in load, causing the cache server to timeout without completing requests. This can cause cache misses. You can also check the Memory used and Memory fragmentation ratio measures to see if the cache has sufficient memory for storing data. Memory contention on the cache is one of the common causes for poor cache performance.

Link down time between master and slave

Indicates the duration for which the link between the master and slave was down.

Seconds

A high value is a cause for concern. The longer the link is down, the longer replication will be delayed. Also, failover attempts will also fail during this period, thus rendering the cache unavailable for servicing requests.

Memory used resident set size

Indicates the amount of memory used in resident set size.

MB

 

Cache hits

Indicates the number of successful key lookups in this cache.

Number

A high value is desired for this measure.

Cache miss ratio

Indicates the percentage of key lookups in this cache that failed.

Percent

A value close to 100% is a cause for concern, as it implies that the cache has failed to service almost all of the requests to it. In such a situation, check the value of the Server load measure to see if there is any abnormal increase in load, causing the cache server to timeout without completing requests. This can cause cache misses. You can also check the Memory used and Memory fragmentation ratio measures to see if the cache has sufficient memory for storing data. Memory contention on the cache is one of the common causes for poor cache performance.

Maximum connections

Indicates the maximum number of simultaneous connections that this cache is allowed to entertain.

Number

If the value of the Connected clients measure is equal to that of this measure, Redis closes all the new connections, returning a 'max number of clients reached' error.

Total commands

Indicates the total number of commands processed by this cache.

Number

 

Server load

Indicates the current load on this cache server.

Percent

High server load means the Redis server is busy and unable to keep up with requests, leading to timeouts.

Following are some options to consider for high server load.

  • Scale out to add more shards, so that load is distributed across multiple Redis processes. Also, consider scaling up to a larger cache size with more CPU cores.

  • Avoid client connection spikes

  • Identify and eliminate long running commands;

  • If your Azure Cache for Redis underwent a failover, all client connections from the node that went down are transferred to the node that is still running. The server load could spike because of the increased connections. You can try rebooting your client applications so that all the client connections get recreated and redistributed among the two nodes.

Get operations from the cache

Indicates the number of get operations from this cache.

Number

 

Set operations to the cache

Indicates the number of set operations from this cache.

Number

 

Data reads from the cache

Indicates the rate at which data was read from this cache.

KB/Sec

 

These measures are good indicators of the bandwidth used by the cache. Whenever there is a bandwidth contention, you can compare the value of these measures to know where the maximum bandwidth is spent - when reading from the cache? or when writing to it?

Data writes from the cache

Indicates the rate at which data was written to this cache.

KB/Sec

CPU usage

Indicates the percentage of CPU resources utilized by this cache.

Percent

A value close to 100% indicates excessive CPU usage. This can adversely impact cache performance. You may want to determine the root-cause of this excess, so that it can be removed and normalcy restored to the cache.

Use the detailed diagnosis of the Status measure to know the host name, port, and version of the cache.

Figure 4 : The detailed diagnosis of the Status measure reported by the Azure Redis Cache test