Office Server Search Gatherer Test
The MOSS 2007 Search feature is implemented using two MOSS services:
- Indexing: Responsible for crawling content sources and building index files.
- Searching: Responsible for finding all information matching the search query by searching the index files.
All searching is performed against the index files; if these files do not contain what the user is looking for, there will be no match. The index files are therefore critical to the success of the MOSS search feature. In its simplest form, the search functionality is a Web page where the user defines his or her search query.
The index role can be configured to run on its own MOSS server, or together with all the other roles, such as the Web service, Excel Services, and Forms Services. It performs its indexing tasks following this general procedure (a simplified sketch of these stages follows the list):
- SharePoint stores all configuration settings for the indexing in its database.
- When activated, the index service will look in SharePoint's databases to see what content sources to index, and what type of indexing to perform, such as full or incremental indexing.
- The index service will start a program called the Gatherer, which will try to open the content that should be indexed.
- For each information type, the Gatherer will need an Index Filter, or IFilter, that knows how to read text inside this particular type of information. For example, to read a MS Word file, an IFilter for .DOC is needed.
- The Gatherer will receive a stream of Unicode characters from the IFilter. It will now use a small program called a Word Breaker; its job is to convert the stream of Unicode characters into words.
- However, some words are not interesting to store in the index, such as "the", "a", and numbers; the Gatherer will now compare each word found against a list of Noise Words. This is a text file that contains all words that will be removed from the stream of words.
- The remaining words are stored in an index file, together with a link to the source. If that word already exists, only the source will be added, so one word can point to multiple sources.
- If the source was information stored in SharePoint, or a file in the file system, the index will also store the security settings for this source. This will prevent a user from getting search results that he or she is not allowed to open.
- Since the success of an indexing operation also depends on how the Gatherer program functions, administrators need to watch for irregularities in the functioning of the gatherer, so that such anomalies are detected promptly and corrected before they can stall the indexing process.
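As an illustration only, the following Python sketch mimics the stages described above: IFilter text extraction, word breaking, noise-word removal, and updating an inverted index that maps each word to its sources. All names in it are hypothetical; the real Gatherer, IFilters, and word breakers are native SharePoint/Windows components, and this is not how they are actually invoked.

```python
# Illustrative sketch of the gatherer's text-processing stages described above.
# NOT SharePoint code; every name here is hypothetical.
import re
from collections import defaultdict

NOISE_WORDS = {"the", "a", "an", "and", "of"}   # stand-in for the noise-word text file

def ifilter_extract_text(source: str) -> str:
    """Stand-in for an IFilter: returns the Unicode text of one document."""
    # A real IFilter understands one file format (e.g. .DOC) and emits text chunks.
    return open(source, encoding="utf-8", errors="ignore").read()

def word_break(text: str) -> list[str]:
    """Stand-in for a word breaker: splits the character stream into words."""
    return re.findall(r"\w+", text.lower())

def index_source(index: dict[str, set[str]], source: str) -> None:
    """Add one content source to a simple inverted index (word -> set of sources)."""
    for word in word_break(ifilter_extract_text(source)):
        if word in NOISE_WORDS or word.isdigit():
            continue                      # noise words and numbers are dropped
        index[word].add(source)           # an existing word just gains another source link

inverted_index = defaultdict(set)
# index_source(inverted_index, r"\\server\share\example.doc")  # hypothetical source
```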
This test monitors the gatherer, and reports issues in its performance (if any).
Target of the test : A Microsoft SharePoint Server
Agent deploying the test : An internal agent
Outputs of the test : One set of results for the Microsoft SharePoint server that is being monitored
Parameters | Description |
---|---|
Test period | This indicates how often the test should be executed. |
Host | The host for which the test is to be configured. |
Port | The port at which the host server listens. |
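For illustration, the snippet below shows how these parameters might be recorded for one monitored server. The values are hypothetical examples, not defaults.

```python
# Hypothetical example of this test's configuration parameters.
gatherer_test_config = {
    "test_period": "every 5 minutes",      # how often the test should be executed
    "host": "moss-index01.example.local",  # SharePoint server being monitored (hypothetical name)
    "port": 80,                            # port at which the host server listens
}
```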
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Documents filtered | Indicates the number of documents filtered per second. | Documents/Sec | If this rate is decreasing over time, you should perform some troubleshooting to find out why your server is not filtering documents. Look for memory issues, processor issues, network issues, or site hit frequency rules that slow the gatherer process. |
Filtering threads | Indicates the current number of filtering threads in the system. | Number | |
Threads accessing the network | Indicates the number of threads currently waiting for a response from the filter process. | Number | These threads have sent or are sending their request to the remote data store and are either waiting for a response or consuming the response and filtering it. You can distinguish between actually waiting on the network and filtering the document by looking at a combination of CPU usage and network usage counters. If this number is consistently high, then you are either network bound or bound by a "hungry" host. If you are not meeting your crawl freshness goals, you can either change your crawl schedules to minimize overlapping crawls, or look at the remote repositories you are crawling and optimize them for more throughput. |
Active queue length | Indicates the number of documents currently waiting for robot threads. | Number | If the value of this measure is not 0, then all filtering threads should be busy filtering. |
Admin clients | Indicates the number of currently connected administrative clients. | Number | |
Reason to back off | A code describing why the gatherer service went into a back-off state. | Number | The values that this measure can take and the states they denote are: 0 - Up and running; 1 - High system IO traffic; 2 - High notifications rate; 3 - Delayed recovery in progress; 4 - Due to user activity; 5 - Battery low; 6 - Memory low; 99 - Some internal reason. During a back-off period, indexing is suspended. To manually back off the gatherer service, pause the search service. If the search service itself generates the back-off, an event will be recorded and the search service will be paused automatically. There is no automatic restart, so you must manually start the search service in order to end a back-off state. Note that there is little reason to start the search service until you have solved the problem that caused the back-off in the first place. |
Threads waiting for plug-ins | Indicates the number of threads currently waiting for plug-ins to complete an operation. | Number | These threads have the filtered documents and are processing them in one of several plug-ins. This is when the index and property store are created. If the value of this measure is consistently high, check the metrics reported by the Office Server Search Archival Plugin test for problem pointers. |
Delayed documents | Indicates the number of documents that are currently delayed due to site hit frequency rules. | Number | If you have a large number of rules and this number is steadily increasing over time, consider relaxing or simplifying your site hit frequency rules. A very high number may indicate a conflict in the rules that the gatherer cannot resolve or follow efficiently. |
Idle threads | Indicates the number of threads that are currently waiting for documents. | Number | These threads are not currently doing any work and will eventually be terminated. If you consistently have more than Max Threads/Hosts idle threads, you can schedule an additional crawl. If this number is 0, then you are thread-starved; do not schedule another crawl in this time period, and analyze the durations of your crawls during this time to see if they are meeting your freshness goals. If your goals are not being met, you should reduce the number of crawls. |
Heartbeats | Indicates the number of heartbeats per second. | Heartbeats/Sec | A heartbeat occurs once every 10 seconds while the service is running. If the service is not running, there will be no heartbeat. |
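Administrators who want to spot-check a few of these measures directly can read them from Windows performance counters. The minimal Python sketch below uses pywin32 (win32pdh) for this; the performance object name ("Office Server Search Gatherer") and counter names are assumptions based on the measures listed above and may differ on a given MOSS installation. The test itself collects these values internally and does not require any such script.

```python
# Minimal sketch: sample a few gatherer-related Windows performance counters.
# Assumes pywin32 is installed and the counter/object names below exist;
# both names and availability may vary by MOSS installation.
import time
import win32pdh

BACK_OFF_REASONS = {            # codes reported by the "Reason to back off" measure
    0: "Up and running",
    1: "High system IO traffic",
    2: "High notifications rate",
    3: "Delayed recovery in progress",
    4: "Due to user activity",
    5: "Battery low",
    6: "Memory low",
    99: "Some internal reason",
}

COUNTERS = [                    # assumed counter names within the gatherer object
    "Documents Filtered Rate",
    "Threads Accessing Network",
    "Idle Threads",
    "Reason to back off",
    "Heartbeats Rate",
]

query = win32pdh.OpenQuery()
handles = {}
for name in COUNTERS:
    path = win32pdh.MakeCounterPath(
        (None, "Office Server Search Gatherer", None, None, -1, name))
    handles[name] = win32pdh.AddCounter(query, path)

# Rate counters need two samples: collect, wait, collect again.
win32pdh.CollectQueryData(query)
time.sleep(1)
win32pdh.CollectQueryData(query)

for name, handle in handles.items():
    _, value = win32pdh.GetFormattedCounterValue(handle, win32pdh.PDH_FMT_DOUBLE)
    if name == "Reason to back off":
        print(name, "=", BACK_OFF_REASONS.get(int(value), "Unknown"))
    else:
        print(name, "=", value)

win32pdh.CloseQuery(query)
```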