Office Server Search Gatherer Test

The MOSS 2007 Search feature is implemented using two MOSS services:

  • Indexing: Responsible for crawling content sources and building index files.
  • Searching: Responsible for finding all information matching the search query by searching the index files.

All searching is performed against the index files; if these files do not contain what the user is looking for, there will not be a match. So, the index files are critical to the success of the search feature of MOSS. The search functionality can be described in its simplest form as a Web page where the user defines his or her search query.

The index role can be configured to run on its own MOSS server, or run together with all the other roles, such as the Web service, Excel Services, and Forms Services. It performs its indexing tasks following this general sequence of steps:

  1. SharePoint stores all configuration settings for the indexing in its database.
  2. When activated, the index service will look in SharePoint's databases to see which content sources to index and what type of indexing to perform, such as full or incremental indexing.
  3. The index service will start a program called the Gatherer, which will try to open the content that should be indexed.
  4. For each information type, the Gatherer will need an Index Filter, or IFilter, that knows how to read the text inside that particular type of information. For example, to read an MS Word file, an IFilter for .DOC is needed.
  5. The Gatherer will receive a stream of Unicode characters from the IFilter. It will now use a small program called a Word Breaker; its job is to convert the stream of Unicode characters into words.
  6. However, some words are not interesting to store in the index, such as "the", "a", and numbers; the Gatherer will now compare each word found against a list of Noise Words. This is a text file that contains all words that will be removed from the stream of words (steps 5 through 7 are sketched in code after this list).
  7. The remaining words are stored in an index file, together with a link to the source. If that word already exists, only the source will be added, so one word can point to multiple sources.
  8. If the source was information stored in SharePoint, or a file in the file system, the index will also store the security settings for this source. This will prevent a user from getting search results that he or she is not allowed to open.
  9. Since the success of an indexing operation also depends on how the Gatherer program functions, administrators need to watch for irregularities in the Gatherer's behavior, so that such anomalies are detected promptly and corrected before they can stall the indexing process.
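
To make steps 5 through 7 more concrete, the following is a minimal Python sketch of word breaking and noise-word removal. It is an illustration only, not the Gatherer's actual implementation: the word breaker here simply splits on non-word characters, whereas MOSS uses language-aware word breakers, and the noiseenu.txt path is a hypothetical stand-in for the noise-word file on your server.

    import re

    def load_noise_words(path):
        # The noise-word list is a plain text file with one word per line
        # (English installations typically ship a file such as noiseenu.txt).
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def word_break(text):
        # A simplistic word breaker: split the stream of Unicode characters on
        # anything that is not a word character, and drop pure numbers.
        return [w.lower() for w in re.split(r"\W+", text) if w and not w.isdigit()]

    def index_document(text, source, noise_words, index):
        # Step 7: store each remaining word with a link back to its source; a
        # word that already exists simply gains another source.
        for word in word_break(text):
            if word not in noise_words:
                index.setdefault(word, set()).add(source)

    index = {}
    noise = load_noise_words("noiseenu.txt")   # hypothetical path to the noise-word file
    index_document("The quarterly report is ready.", "http://portal/docs/report.doc", noise, index)
    print(index)   # e.g. {'quarterly': {...}, 'report': {...}, 'ready': {...}}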

This test monitors the gatherer, and reports issues in its performance (if any).

Target of the test : A Microsoft SharePoint Server

Agent deploying the test : An internal agent

Outputs of the test : One set of results for each Microsoft SharePoint server that is being monitored

Configurable parameters for the test
Parameters Description

Test period

This indicates how often the test should be executed.

Host

The host for which the test is to be configured.

Port

The port at which the host server listens.

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Documents filtered

Indicates the number of documents filtered per second.

Documents/Sec

If this rate is decreasing over time, you should perform some troubleshooting to find out why your server is not filtering documents.

Look for memory issues, processor issues, network issues, or site hit frequency rules that slow the gatherer process.
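
If you want to spot-check this rate outside of the test, the underlying Windows performance counter can be sampled with the typeperf utility. The sketch below is a hedged example: the counter path \Office Server Search Gatherer\Documents Filtered Rate is an assumption based on the default MOSS 2007 counter set, so verify the exact object and counter names in Performance Monitor, and treat the 50% drop threshold as purely illustrative.

    import csv
    import subprocess

    # Assumed counter path; confirm the object and counter names in Performance Monitor.
    COUNTER = r"\Office Server Search Gatherer\Documents Filtered Rate"

    def sample_counter(counter, samples=12, interval=5):
        # typeperf prints CSV: a header row, then one "timestamp","value" row per sample.
        out = subprocess.run(
            ["typeperf", counter, "-sc", str(samples), "-si", str(interval)],
            capture_output=True, text=True).stdout
        values = []
        for row in csv.reader(out.splitlines()):
            if len(row) == 2:
                try:
                    values.append(float(row[1]))
                except ValueError:
                    pass    # skip the header row and empty samples
        return values

    rates = sample_counter(COUNTER)
    if rates and rates[-1] < 0.5 * rates[0]:
        print("Documents filtered rate has dropped sharply; check memory, CPU, "
              "network and site hit frequency rules.")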

Filtering threads

Indicates the current number of filtering threads in the system.

Number

 

Threads accessing the network

Indicates the number of threads currently waiting for a response from the filter process.

Number

These threads have sent or are sending their request to the remote data store and are either waiting for a response or consuming the response and filtering it. You can distinguish between actually waiting on the network and filtering the document by looking at a combination of CPU usage and network usage counters.

If this number is consistently high, then you are either network bound or bound by a "hungry" host. If you are not meeting your crawl freshness goals, you can either change your crawl schedules to minimize overlapping crawls, or look at the remote repositories you are crawling and optimize them for more throughput.
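
One way to make the CPU-versus-network comparison suggested above is to collect the Gatherer counter together with the standard processor and network counters in a single typeperf sample, as sketched below. The Gatherer counter path is an assumed name that should be verified on your server; the Processor and Network Interface paths are standard Windows counters.

    import csv
    import subprocess

    COUNTERS = [
        r"\Office Server Search Gatherer\Threads Accessing Network",   # assumed name
        r"\Processor(_Total)\% Processor Time",
        r"\Network Interface(*)\Bytes Total/sec",
    ]

    out = subprocess.run(["typeperf"] + COUNTERS + ["-sc", "1"],
                         capture_output=True, text=True).stdout
    for row in csv.reader(out.splitlines()):
        if len(row) > 1 and not row[0].startswith("(PDH"):
            print(row)    # timestamp followed by one value per expanded counter

    # A high thread count with low CPU and low network throughput points to a slow
    # ("hungry") host; high CPU alongside it suggests the threads are busy filtering.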

Active queue length

Indicates the number of documents currently waiting for robot threads.

Number

If the value of this measure is not 0, then all the filtering threads should be busy filtering documents.

Admin clients

Indicates the number of currently connected administrative clients.

Number

 

Reason to back off

A code describing why the gatherer service went into a back-off state.

Number

The values that this measure can take and the states they denote are available below:

0 - Up and Running.

1 - High system IO traffic.

2 - High notifications rate.

3 - Delayed recovery in progress.

4 - Due to user activity.

5 - Battery low.

6 - Memory low.

99 - Some internal reason.

During a back-off period, indexing is suspended. To manually back off the gatherer service, pause the search service. If the search service itself generates the back-off, an event will be recorded and the search service will be paused automatically. There is no automatic restart, so you must manually start the search service in order to end a back-off state. Note that there is little reason to start the search service until you have solved the problem that caused the back-off in the first place.
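
For scripted monitoring, the raw value of this measure can simply be translated into the states listed above. The sketch below encodes exactly that mapping and nothing more.

    # Back-off reason codes as documented above.
    BACK_OFF_REASONS = {
        0: "Up and running",
        1: "High system IO traffic",
        2: "High notifications rate",
        3: "Delayed recovery in progress",
        4: "Due to user activity",
        5: "Battery low",
        6: "Memory low",
        99: "Some internal reason",
    }

    def describe_back_off(code):
        return BACK_OFF_REASONS.get(int(code), "Unknown reason code: %s" % code)

    # Any value other than 0 means indexing is suspended; the search service must
    # be started manually once the underlying cause has been resolved.
    print(describe_back_off(6))    # -> "Memory low"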

Threads waiting for plug-ins

Indicates the number of threads currently waiting for plug-ins to complete an operation.

Number

These threads have the filtered documents and are processing them in one of several plug-ins. This is when the index and property store are created.

If you have a consistently high number for this counter, check the metrics reported by the Office Server Search Archival Plugin test for problem pointers.

Delayed documents

Indicates the number of documents that are currently delayed due to site hit frequency rules.

Number

If you have a plethora of rules and this number is steadily increasing over time, consider relaxing or simplifying your site hit frequency rules.

A very high number may indicate a conflict in the rules that the gatherer cannot resolve or follow with efficiency.

Idle threads

Indicates the number of threads that are currently waiting for documents.

Number

These threads are not currently doing any work and will eventually be terminated. If you consistently have more than Max Threads/Hosts idle threads, you can schedule an additional crawl. If this number is 0, then you are starved. Do not schedule another crawl in this time period; instead, analyze the durations of your crawls during this time to see whether they are meeting your freshness goals. If your goals are not being met, you should reduce the number of crawls.
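
The rule of thumb above can be written as a small check. Max Threads/Hosts is a crawler performance setting on your server; the sketch assumes you supply its configured value.

    def crawl_capacity_hint(idle_threads, max_threads_per_host):
        # Zero idle threads means the crawler is starved; more idle threads than
        # Max Threads/Hosts means there is headroom for an additional crawl.
        if idle_threads == 0:
            return "Starved: do not schedule another crawl in this time period."
        if idle_threads > max_threads_per_host:
            return "Headroom available: an additional crawl can be scheduled."
        return "Thread usage is within the expected range."

    print(crawl_capacity_hint(idle_threads=0, max_threads_per_host=4))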

Heartbeats

Indicates the number of heartbeats per second.

Heartbeats/Sec

A heartbeat occurs once every 10 seconds while the service is running. If the service is not running, there will be no heartbeat.
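
Since one heartbeat every 10 seconds corresponds to a rate of roughly 0.1 heartbeats per second, a sustained value of 0 for this measure is a simple indicator that the search service has stopped. The check below expresses that rule; the expected rate is derived only from the 10-second interval stated above.

    def search_service_appears_running(heartbeats_per_sec):
        # One heartbeat every 10 seconds is about 0.1 heartbeats/sec; a sustained
        # rate of 0 means the search service is not running.
        return heartbeats_per_sec > 0.0

    print(search_service_appears_running(0.0))    # -> False: no heartbeat, service is down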