About Content Crawlers

Create a content crawler to import content into your portal from external content repositories. A content crawler does not run on its own: to periodically search the external repository and import its content, you must run a job associated with the content crawler. For information about jobs, see About Jobs.

Note: Content crawlers depend on content sources. For information on content sources, see About Content Sources.

This topic discusses the following information:

Web Content Crawlers
Remote Content Crawlers
Content Web Services
Importing Document Security
Troubleshooting the Results of a Crawl

To learn how to create or edit administrative objects (including content crawlers), click here.

Web Content Crawlers

A Web content crawler allows users to import content from the Web into the portal.

To learn about the Web Content Crawler Editor, click one of the following editor pages:

Choose Content Source
Main Settings
Web Page Exclusions
Target Settings
Document Settings
Content Type
Advanced Settings
Set Job
Properties and Names
Security (only available when editing an object)
Migration History and Status (only available when editing an object)

Remote Content Crawlers

A remote content crawler allows users to import content from an external content repository into the portal.

Some crawl providers are installed with the portal and are readily available to portal users, but others must be installed and configured manually. For example, Oracle provides the following crawl providers:

Windows NT File (included with the portal software)
Documentum
Microsoft Exchange
Lotus Notes

Note: For information on obtaining crawl providers, refer to the Oracle Technology Network at http://www.oracle.com/technology/index.html. For information on installing crawl providers, refer to the Installation Guide for Oracle WebCenter Interaction (available on the Oracle Technology Network at http://www.oracle.com/technology/documentation/bea.html) or the documentation that comes with your crawl provider, or contact your portal administrator.

To create a remote content crawler:

Install the crawl provider on the portal computer or another computer.
Create a remote server.
Create a content Web service (discussed next).
Create a remote content source.
Create a remote content crawler.
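
The order of these steps matters because each object builds on the one created before it. The following is a minimal Python sketch of that ordering, not portal code. The object names come from the steps above; the `PREREQUISITES` table and both helper functions are hypothetical, and the sketch assumes each step depends only on the step directly before it.

```python
# Hypothetical model of the creation order listed above.
# Each key is an object to create; its value is what must exist first.
PREREQUISITES = {
    "crawl provider": None,
    "remote server": "crawl provider",
    "content Web service": "remote server",
    "remote content source": "content Web service",
    "remote content crawler": "remote content source",
}


def creation_order(target):
    """Return every step needed to reach `target`, earliest first."""
    steps = []
    current = target
    while current is not None:
        steps.append(current)
        current = PREREQUISITES[current]
    return list(reversed(steps))


def missing_prerequisites(target, existing):
    """Return the steps still to be done before `target` exists."""
    return [step for step in creation_order(target) if step not in existing]
```

For example, calling `missing_prerequisites("remote content source", {"crawl provider", "remote server"})` reports that the content Web service and the remote content source itself are still to be created.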

To learn about the Remote Content Crawler Editor, click one of the following editor pages:

Choose Content Source
Main Settings
Document Settings
Content Type
Advanced Settings
Set Job
Properties and Names
Security (only available when editing an object)
Migration History and Status (only available when editing an object)

The following crawl providers, if installed, add at least one extra page to the Remote Content Crawler Editor:

Windows NT File (included with the portal software)
Documentum
Microsoft Exchange
Lotus Notes

Content Web Services

Content Web services allow you to specify general settings for your remote content repository, leaving the target and security settings to be set in the associated remote content source and remote content crawler. This allows you to crawl multiple locations of the same content repository without having to repeatedly specify all the settings.
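
The layering described above can be pictured with a small Python sketch. It is an illustration only, not the portal's object model: the class names mirror the portal objects, but the fields, the setting keys (`timeout_seconds`, `login`, `start_location`), and the assumption that the content source holds the security settings while the crawler holds the target settings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ContentWebService:
    """General repository settings, defined once and shared."""
    name: str
    general_settings: dict = field(default_factory=dict)


@dataclass
class RemoteContentSource:
    """Based on one Web service; assumed here to add security settings."""
    name: str
    web_service: ContentWebService
    security_settings: dict = field(default_factory=dict)


@dataclass
class RemoteContentCrawler:
    """Crawls one location; assumed here to add target settings."""
    name: str
    content_source: RemoteContentSource
    target_settings: dict = field(default_factory=dict)

    def effective_settings(self):
        # Start from the shared settings, then layer on the parts that
        # are specific to this source and this crawler.
        source = self.content_source
        merged = dict(source.web_service.general_settings)
        merged.update(source.security_settings)
        merged.update(self.target_settings)
        return merged


# One Web service reused by two crawlers that target different locations.
shared = ContentWebService("docs-service", {"timeout_seconds": 60})
source = RemoteContentSource("docs-source", shared, {"login": "crawl_user"})
hr = RemoteContentCrawler("hr-crawler", source, {"start_location": "/hr"})
eng = RemoteContentCrawler("eng-crawler", source, {"start_location": "/engineering"})
```

In this sketch the two crawlers differ only in their target settings; the general settings are written once on the Web service and picked up by both.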

Note: You create content Web services on which to base your remote content sources. For information on content sources, see About Content Sources.

To learn about the Content Web Service Editor, click one of the following editor pages:

Main Settings
HTTP Configuration
Preferences
Advanced URL Settings
Advanced Settings
Authentication Settings
User Information
Debug Settings
Associated Objects (only available when editing an object)
Properties and Names
Security (only available when editing an object)
Migration History and Status (only available when editing an object)

Importing Document Security

Some remote content crawlers can automatically grant users access to the content they import. The Global ACL Sync Map tells these content crawlers how to import source document security.

For an example of how importing security works, see Importing Security Example.
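
As a rough illustration of the idea, the Python sketch below translates source ACL entries into portal names through a lookup table. It is not the portal's implementation. This topic does not describe the contents of the Global ACL Sync Map, so the map format (source domain to portal name prefix), the `DOMAIN\name` entry format, and every name used here are assumptions made for the example.

```python
# Hypothetical sync map: source-repository domain -> portal name prefix.
ACL_SYNC_MAP = {
    "CORP": "corpauth",
    "PARTNERS": "partnerauth",
}


def map_acl_entry(entry, sync_map):
    """Translate 'DOMAIN\\name' into 'prefix\\name', or None if unmapped."""
    domain, sep, name = entry.partition("\\")
    if not sep or not name:
        return None  # not in the assumed DOMAIN\name form
    prefix = sync_map.get(domain.upper())
    if prefix is None:
        return None  # no mapping for this domain, so nothing to grant
    return f"{prefix}\\{name}"


def import_acl(source_acl, sync_map):
    """Return portal names for every source entry that has a mapping."""
    mapped = (map_acl_entry(entry, sync_map) for entry in source_acl)
    return [name for name in mapped if name is not None]
```

Under these assumptions, entries from a domain that is missing from the map are dropped, which is one way imported documents could end up with fewer readers than expected.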

Troubleshooting the Results of a Crawl

If your content crawler does not import the expected content, check the following:

Make sure your folder filters are correctly filtering content. To learn about testing your filters, see the Testing Filters section on the Main Settings (Filter) page.
Make sure your content crawler did not place unwanted content into the target folder. If a document does not filter into any subfolders, your content crawler might place the document in the target folder. This behavior is determined by a setting on the Main Settings page of the Folder Editor.
Make sure the content crawler did not place content into the Unclassified Documents folder. If a document cannot be placed in any target folders or subfolders, your content crawler might place the document in the Unclassified Documents folder. This behavior is determined by a setting on the Advanced Settings page of the Content Crawler Editor. If you have the correct permissions, you can view the Unclassified Documents folder when you are editing the Directory or by clicking Administration | Select Utility | Access Unclassified Documents.
Make sure you have at least Edit access to the target folder.
For Web content crawlers, make sure that robot exclusion protocols and any exclusions or inclusions are not preventing your content crawler from importing the expected content. This behavior is determined by a setting on the Web Page Exclusions page of the Content Crawler Editor.
Make sure the authentication information specified in the associated content source allows the portal to access content.
Review the job history for additional information.
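
When you trace where a crawled document ended up, it can help to picture the folder fallback described in the checks above as a short decision chain. The Python sketch below is an interpretation of those checks, not portal code. The function and parameter names are hypothetical, and the final case (both settings off means the document is not imported) is an assumption that this topic does not state.

```python
UNCLASSIFIED = "Unclassified Documents"


def place_document(matching_subfolders, target_folder,
                   target_accepts_unfiltered, use_unclassified):
    """Return the folders a crawled document lands in (possibly none)."""
    if matching_subfolders:
        # The document passed the filters of at least one subfolder.
        return list(matching_subfolders)
    if target_accepts_unfiltered:
        # Folder Editor setting: keep unfiltered documents in the target.
        return [target_folder]
    if use_unclassified:
        # Content Crawler Editor setting: fall back to Unclassified Documents.
        return [UNCLASSIFIED]
    # Assumption: with both settings off, the document is not imported.
    return []
```

Reading the chain from top to bottom gives an order for troubleshooting: check the subfolder filters first, then the target folder setting, then the Unclassified Documents folder.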