To specify the language of content, what to do with rejected documents, and a content crawler tag:
Under Content Language, in the drop-down list, choose the language in which the majority of content that you want to import is written.
Under Rejected Documents, specify what to do with documents that do not successfully sort into a folder:
To import these documents anyway, choose Import into the Unclassified Documents folder.
Note: The Unclassified Documents folder is available to users with access to unclassified documents. To access unclassified documents, in the Directory menu, click Edit Directory and open the Unclassified Documents folder. You can also click Administration | Select Utilities | Access Unclassified Documents.
To avoid importing these documents, choose Do not import.
If you are editing an existing content crawler, you see additional options under Rejected Documents that allow you to specify what to do when this content crawler finds a previously rejected document. The definition of "previously rejected" depends on the option you chose in step 4b:
If you chose "by this Content Crawler," previously rejected documents include all documents rejected by this content crawler.
If you chose "from this Content Source," previously rejected documents include all documents rejected from this content source.
Specify what to do with previously rejected documents:
To have this content crawler try to import previously rejected documents, select Re-Import.
To avoid importing these documents, choose Do not import.
If absolutely necessary, you can delete the history of previously rejected documents. Again, the definition of "previously rejected" depends on the option you chose in step 4b. If you chose "from this Content Source" in step 4b, you are deleting the rejection history for all content crawlers that import documents from this content source. If you are still sure that you must delete the history of previously rejected documents, click Clear Rejection History.
Note: If a document does not sort into any folder but is placed into the Unclassified Documents folder, this does not count as being rejected. Rejected documents are documents that were not placed in any folder.
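The distinction between an unclassified document and a rejected document can be pictured with a short sketch. This is only an illustrative model, not portal code: the `Outcome` enum, the `classify_outcome` function, and its parameters are hypothetical names introduced here.

```python
from enum import Enum


class Outcome(Enum):
    IMPORTED = "imported into one or more matching folders"
    UNCLASSIFIED = "imported into the Unclassified Documents folder"
    REJECTED = "rejected (not placed in any folder)"


def classify_outcome(matching_folders, import_unclassified):
    """Hypothetical model of the Rejected Documents setting.

    matching_folders: folders the document sorted into (may be empty).
    import_unclassified: True models "Import into the Unclassified
    Documents folder"; False models "Do not import".
    """
    if matching_folders:
        return Outcome.IMPORTED
    if import_unclassified:
        # Placed in Unclassified Documents, so it does not count as rejected.
        return Outcome.UNCLASSIFIED
    return Outcome.REJECTED
```

Under this model, only the last branch adds a document to the rejection history that the previously rejected options act on.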
If you are editing an existing content crawler, you see the section Importing Documents. Under Importing Documents, specify whether to import only new documents. By default, this content crawler attempts to import only new documents (those that have not been previously imported by this content crawler or other content crawlers that access this same content source). You can change the content crawler setting to import multiple copies of each document, which might be useful while testing your content crawlers.
To import only new documents, select Import only new links; additional options then display. Otherwise, skip to step 5.
To specify what new links means:
To import only those documents that have not been previously imported by this content crawler, choose by this Content Crawler.
To import only those documents that have not been imported from the associated content source (either by this content crawler, another content crawler, or manually by a user), choose from this Content Source.
Note: The option you choose here affects your actions in step 3 and step 4f.
To refresh the previously imported documents as specified on the Document Settings page, select refresh them.
Generally, refreshing documents is the job of the Document Refresh Agent; refreshing documents slows the content crawler down. However, if you changed the document settings for this content crawler or changed the property mappings in the associated content types, refreshing documents updates these settings for the previously imported documents.
Note: If you are crawling an RSS feed, the refresh them option refreshes the properties (such as the title and description) with the values from the target documents, not the RSS feed. If you want to retain the properties from the RSS feed, do not select refresh them.
If you created additional folders or applied different filters to destination folders, select try to sort them into additional folders to sort the previously imported documents into new Knowledge Directory folders.
Another content crawler might have imported documents from the same content source but into different folders than the destination folders specified for this content crawler. Make sure you really want to re-sort those documents into the destination folders specified for this content crawler.
To re-import documents that were previously deleted (manually, due to expiration, or due to missing source documents), select regenerate deleted links. This might re-import documents that were at one time deemed inappropriate for your portal.
If absolutely necessary, you can delete the history of documents that have been deleted from the portal. "History" is defined by what you specified as new documents in step 4b:
If you chose "by this Content Crawler," the history includes all documents imported by this content crawler that have been deleted.
If you chose "from this Content Source," the history includes all documents imported from this content source that have been deleted. Therefore, you are deleting the history for all content crawlers that import documents from this content source.
If you are still sure that you must delete the record of documents deleted from the portal, click Clear Deletion History.
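The effect of the two scope options ("by this Content Crawler" and "from this Content Source") on what counts as a new link can be sketched as follows. This is a simplified illustration under stated assumptions, not the portal's implementation: `ImportRecord`, `is_new_link`, and the `"crawler"` and `"source"` scope values are hypothetical names, and the history is modeled as a plain list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportRecord:
    url: str
    content_source: str
    imported_by: str  # crawler name, or "manual" for an import by a user


def is_new_link(url, crawler, content_source, history, scope):
    """Return True if the document counts as new under the chosen scope.

    scope "crawler" models "by this Content Crawler";
    scope "source" models "from this Content Source".
    """
    if scope not in ("crawler", "source"):
        raise ValueError(f"unknown scope: {scope!r}")
    for record in history:
        same_document = (
            record.url == url and record.content_source == content_source
        )
        if not same_document:
            continue
        # With "source", any earlier import from the content source counts,
        # whoever performed it; with "crawler", only this crawler's imports do.
        if scope == "source" or record.imported_by == crawler:
            return False
    return True
```

In this model, a document that another content crawler already imported is still new under the "crawler" scope but not under the "source" scope, which is also why clearing history under the source-wide option affects every content crawler that uses the content source.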
To mark imported documents with a content crawler tag, type a tag in the Mark imported documents with the following Content Crawler Tag box. This tag is used to differentiate documents imported by this content crawler from those imported by another content crawler.
Under Runtime Configuration, set the following:
Maximum document-fetching threads - determines the maximum number of concurrent threads used to fetch content from the content source.
Maximum card-indexing threads - determines the maximum number of concurrent threads used in processing content once it has been crawled into the portal.
The allowable ranges for these fields are set in the portal configuration file. The values set here are also limited by the maximum threads allowable in the automation service used for the job associated with this content crawler.
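One way to picture how the two limits interact is the sketch below. It assumes that a value outside the configured range is refused and that a value above the automation service maximum is capped; the portal's actual behavior may differ. The `effective_threads` function and all of its parameter names are hypothetical.

```python
def effective_threads(requested, allowed_min, allowed_max, service_max):
    """Hypothetical model of the thread count a crawler job ends up using.

    allowed_min, allowed_max: range from the portal configuration file.
    service_max: maximum threads allowed by the automation service
    that runs the job for this content crawler.
    """
    if not allowed_min <= requested <= allowed_max:
        raise ValueError(
            f"{requested} is outside the allowed range "
            f"{allowed_min}-{allowed_max}"
        )
    # Assumption: the automation service limit acts as an upper cap.
    return min(requested, service_max)
```

For example, under these assumptions a setting of 10 with an allowed range of 1 to 20 and an automation service maximum of 8 would yield 8 threads.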
To display the page associated with this help topic: