Data Deduplication Jobs Test

Data deduplication works by finding portions of files that are identical and storing just a single copy of the duplicated data on the disk. The technology required to find and isolate duplicated portions of files on a large disk is complex. Microsoft uses an algorithm called chunking, which scans the data on the disk and breaks it into chunks whose average size is 64 KB. These chunks are stored on the disk in a hidden folder called the chunk store. The actual files on the disk then contain pointers to individual chunks in the chunk store. If two or more files contain identical chunks, only a single copy of that chunk is placed in the chunk store, and all the files that share the chunk point to that same copy.
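
To make the pointer scheme concrete, here is a minimal sketch of a hash-indexed chunk store. It is a simplified, hypothetical model: it uses fixed-size 64 KB chunks and SHA-256 digests for brevity, whereas the actual Windows chunking algorithm produces variable-size chunks that merely average 64 KB.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed size for simplicity; real chunks vary, averaging 64 KB

chunk_store = {}  # digest -> chunk bytes; stands in for the hidden chunk store folder
file_table = {}   # file name -> ordered list of chunk digests (the file's "pointers")

def dedup_file(name, data):
    """Split a file into chunks and store each unique chunk only once."""
    pointers = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # identical chunks collapse to one copy
        pointers.append(digest)
    file_table[name] = pointers

def read_file(name):
    """Reassemble a file by following its pointers into the chunk store."""
    return b"".join(chunk_store[digest] for digest in file_table[name])
```

If two files contain an identical 64 KB run at the same chunk boundary, both of their pointer lists reference the same digest, and the bytes are held in the chunk store exactly once.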

Microsoft has tuned the chunking algorithm sufficiently that in most cases, users will have no idea that their data has been deduplicated. Access to the data is as fast as if the data were not deduplicated. For performance reasons, data is not automatically deduplicated as it is written. Instead, regularly scheduled deduplication jobs scan the disk, applying the chunking algorithm to find chunks that can be deduplicated. Data deduplication works through the following jobs:

Optimization

The Optimization job deduplicates by chunking data on a volume per the volume policy settings, (optionally) compressing those chunks, and storing chunks uniquely in the chunk store.

Garbage Collection

The Garbage Collection job reclaims disk space by removing chunks from the chunk store that are no longer referenced by any file, typically because the files that referenced them have since been modified or deleted. (A simplified sketch of this cleanup appears after this list.)

Integrity Scrubbing

The Integrity Scrubbing job identifies corruption in the chunk store caused by disk failures or bad sectors. When possible, Data Deduplication can automatically use volume features (such as mirror or parity on a Storage Spaces volume) to reconstruct the corrupted data. Additionally, Data Deduplication keeps backup copies of popular chunks (those referenced more than 100 times) in an area called the hotspot.

Unoptimization

The Unoptimization job, which is a special job that should only be run manually, undoes the optimization done by deduplication and disables Data Deduplication for that volume.
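
Continuing the hypothetical model from the earlier sketch, the Garbage Collection job can be pictured as a mark-and-sweep pass over the chunk store: chunks still referenced by at least one file are kept, and orphaned chunks are discarded. This illustrates the concept only; it is not the actual Windows implementation.

```python
def garbage_collect():
    """Remove chunks that no file references any more; return how many were removed."""
    # Mark: gather every digest still referenced by some file's pointer list.
    live = {digest for pointers in file_table.values() for digest in pointers}
    # Sweep: discard chunk store entries that no longer appear in any file.
    orphaned = [digest for digest in chunk_store if digest not in live]
    for digest in orphaned:
        del chunk_store[digest]
    return len(orphaned)
```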

Data Deduplication uses a post-processing strategy to optimize and maintain a volume's space efficiency, so it is important that Data Deduplication jobs complete successfully and without delay. If, for any reason, the jobs do not complete quickly, the job queue grows longer and longer, overloading the target host and slowing it down. When such an abnormality occurs, administrators need to know immediately how many jobs are waiting in the queue. The Data Deduplication Jobs test helps administrators in this regard!

This test monitors the Data Deduplication jobs on the target host and reports the number of jobs that are currently running and the number of jobs that are waiting in the queue. Using these metrics, administrators can instantly gauge the current workload on the host and detect an overload condition (if any).
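
As an illustration of how these two metrics could be gathered, the sketch below shells out to the Get-DedupJob PowerShell cmdlet (part of the Windows Deduplication module) and tallies jobs by state. The cmdlet is real, but the count_dedup_jobs helper and its parsing are assumptions for illustration; the eG agent's actual collection mechanism may differ, and the snippet only works on a Windows host with the Data Deduplication feature installed.

```python
import subprocess

def count_dedup_jobs():
    """Tally Data Deduplication jobs by state via Get-DedupJob (hypothetical helper)."""
    # Print one job state per line, e.g. "Running" or "Queued".
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "Get-DedupJob | ForEach-Object { $_.State.ToString() }"],
        capture_output=True, text=True, check=True,
    )
    states = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return {
        "Running jobs": states.count("Running"),  # current workload on the host
        "Queued jobs": states.count("Queued"),    # backlog; a growing value signals overload
    }
```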

This test is disabled by default. To enable the test, go to the enable/disable tests page using the menu sequence Agents -> Tests -> Enable/Disable, pick the desired Component type, set Performance as the Test type, choose this test from the DISABLED TESTS list, and click the << button to move it to the ENABLED TESTS list. Finally, click the Update button.

Target of the test : A Windows host

Agent deploying the test : An internal agent

Outputs of the test : One set of results for the target host being monitored.

Configurable parameters for the test

Test Period

How often the test should be executed.

Host

The host for which the test is to be configured.

Port

Refers to the port used by the specified host. By default, this is NULL.

Domain

Specify the name of the Windows domain to which the target host belongs.

Username

Here, enter the name of a valid domain user with login rights to the target host.

Password

Provide the password of the above-mentioned user in this text box.

Confirm password

Confirm the password by retyping it here.

DD Frequency

Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test

Running jobs

Description: Indicates the number of jobs that are currently running on the target host.

Unit: Number

Interpretation: This measure is a good indicator of the workload on the target host.

Queued jobs

Description: Indicates the number of jobs that are currently waiting in the queue on the target host.

Unit: Number

Interpretation: A steadily growing value indicates a lengthening job queue and a possible overload condition on the target host.