Web Crawler Collection Test

The FAST Search Web crawler collects content from a set of defined Web sites, which can be internal or external.

The FAST Search Web crawler works, in many ways, like a Web browser downloading content from Web servers. But unlike a Web browser, which responds only to user input from the mouse or keyboard, the FAST Search Web crawler follows a set of configurable rules when it requests Web items. These rules govern, for example, how long to wait between requests for items and how long to wait before checking for new or updated items.

The main configuration concept in the FAST Search Web crawler is a "collection". Each crawl collection contains the configuration applicable to that collection, such as which start addresses and crawl rules to apply. A typical solution might have crawl collections such as Extranet or Blogs. The FAST Search Web crawler starts by comparing the start URL list against the include and exclude rules specified in the XML file containing the crawl collection's configuration. The start URL list is specified with either the start_uris or start_uri_files setting, and the rules via the include_domains and exclude_domains settings. Valid URLs are then requested from their Web servers at the request rate configured in the delay setting.
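
To make the rule matching concrete, here is a minimal Python sketch of how a start URL list might be filtered against include and exclude domain rules. The URLs and domain lists are invented for illustration, and the matching logic only approximates the crawler's actual behavior; in the real configuration these values come from the start_uris, include_domains and exclude_domains settings in the collection's XML file.

```python
from urllib.parse import urlparse

# Hypothetical values standing in for a crawl collection's XML settings:
# start_uris, include_domains and exclude_domains.
start_uris = ["http://intranet.example.com/", "http://archive.example.com/old/"]
include_domains = ["example.com"]
exclude_domains = ["archive.example.com"]

def matches(url, domains):
    """True if the URL's host is one of the listed domains or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)

def allowed(url):
    # A URL is queued for crawling only if it matches an include rule
    # and matches no exclude rule.
    return matches(url, include_domains) and not matches(url, exclude_domains)

crawl_queue = [u for u in start_uris if allowed(u)]
print(crawl_queue)  # only http://intranet.example.com/ survives the rules
```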

If a Web item is fetched successfully, it is parsed for hyperlinks and other meta-information, usually by an HTML parser built into the FAST Search Web crawler. The Web item's meta-information is stored in the FAST Search Web crawler meta-database, and the Web item content (the HTML body) is stored in the FAST Search Web crawler store. The hyperlinks are filtered against the crawl rules and used as the next set of URLs to download. This process continues until all reachable content has been gathered, until the refresh interval (refresh setting) elapses, or until another configuration parameter limiting the scope of the crawl is reached.
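
The fetch-parse-filter cycle described above can be approximated with a simple breadth-first loop. This is an illustrative Python sketch, not the crawler's implementation: LinkParser stands in for the built-in HTML parser, the store dictionary stands in for the crawler store and meta-database, and the delay_seconds and max_items parameters stand in for the delay setting and the scope-limiting parameters.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Stands in for the crawler's built-in HTML parser; collects hyperlinks."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(start_urls, allowed, delay_seconds=60.0, max_items=1000):
    queue, seen, store = deque(start_urls), set(start_urls), {}
    while queue and len(store) < max_items:   # max_items mimics a scope-limiting setting
        url = queue.popleft()
        try:
            body = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                          # skip items that cannot be fetched
        store[url] = body                     # stands in for the Web crawler store
        parser = LinkParser(url)
        parser.feed(body)
        for link in parser.links:             # filter hyperlinks against the crawl rules
            if allowed(link) and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)             # honor the request-rate ("delay") setting
    return store
```

This loop could be driven with the filtered queue from the previous sketch, for example crawl(crawl_queue, allowed, delay_seconds=1.0); the real crawler additionally repeats the whole cycle at every refresh interval to pick up new or updated items.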

To determine how efficiently the Web crawler functions, you need to know the load each crawl collection currently generates: how many documents are crawled, how large they are, and how quickly the crawler downloads them. The Web Crawler Collection test provides these insights and helps assess the Web crawler's efficiency.
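
As a rough sketch of how the test's figures combine into a load picture, the snippet below multiplies document counts by average document size and sums download rates across collections. The collection names and numbers are invented for illustration; they are not values reported by any real farm.

```python
# Hypothetical per-collection figures of the kind this test reports.
collections = {
    "Extranet": {"documents_downloaded": 1200, "avg_doc_size_mb": 0.4, "downloads_per_min": 90},
    "Blogs":    {"documents_downloaded": 300,  "avg_doc_size_mb": 0.1, "downloads_per_min": 25},
}

for name, m in collections.items():
    volume_mb = m["documents_downloaded"] * m["avg_doc_size_mb"]
    print(f"{name}: ~{volume_mb:.0f} MB crawled, {m['downloads_per_min']} downloads/min")

# Summing across collections approximates the crawler's total current workload.
total_rate = sum(m["downloads_per_min"] for m in collections.values())
print(f"Combined download rate: {total_rate} downloads/min")
```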

Target of the test : A FAST Search Server 2010 for SharePoint

Agent deploying the test : An internal agent

Outputs of the test : One set of results for every crawl collection configured in the FAST Search Server 2010 for SharePoint farm.

Configurable parameters for the test

Test period : How often should the test be executed.

Host : The host for which the test is to be configured.

Port : The port used by the specified host. By default, this is 13280.

Measurements made by the test

Active sites crawled

Description : Indicates the number of Web sites or Web links that are currently being crawled by this crawl collection.

Unit : Number

Interpretation : The sum of this measure across all collections serves as a good indicator of the crawler's current workload. If the number of Web sites or the total number of Web items to be crawled is large, the FAST Search Web crawler can be scaled by distributing it across multiple servers. Compare the value of this measure across collections to identify which collection is generating the highest load.

Current document download rate

Description : Indicates the rate at which documents are downloaded by this crawl collection.

Unit : Downloads/min

Interpretation : The crawler's overall download rate depends on the number of active sites that are busy.

Average document size

Description : Indicates the average size of the documents downloaded by this crawl collection.

Unit : MB

Interpretation : This is another good measure of the current load on the crawler.

Documents in web crawler store

Description : Indicates the number of documents downloaded by this crawl collection that are currently held in the Web crawler store.

Unit : Number

Interpretation : The FAST Search Web crawler stores crawled content locally on disk during crawling. The content is divided into two types: Web item content and metadata.

Documents deleted from document store

Description : Indicates the number of documents downloaded by this crawl collection that are currently deleted from the Web crawler store.

Unit : Number

Documents downloaded

Description : Indicates the number of documents that are currently downloaded by this crawl collection.

Unit : Number

Documents stored that were modified

Description : Indicates the number of stored documents that were currently modified by this crawl collection.

Unit : Number

Interpretation : The crawler periodically looks for changes in the Web sites/Web pages configured for crawling and writes these changes to the crawled items that already exist in the store. The Documents stored that were modified measure reports the number of items in the store that were currently updated with changes, while the Documents writes to web crawler store measure reveals how many such changes were written to the store.

Documents writes to web crawler store

Description : Indicates the number of current document writes to the FAST Search Web crawler store.

Unit : Number