To configure this content crawler to avoid importing unwanted Web pages into your portal:
By default, this content crawler follows the Web
server's recommendations about which pages might be of value to automated
crawlers. If you want to ignore these recommendations, clear the Obey the target site's robot exclusion protocols
check box.
In general, these recommendations help limit unwanted content from
being crawled into the portal. However, some sites offer very strict recommendations.
If your content crawler is not importing any content from a site, try
clearing this check box.
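The effect of this check box can be illustrated with Python's standard urllib.robotparser module. This is only a sketch of the robots exclusion protocol in general, not this content crawler's implementation; the robots.txt content, the user agent name example-crawler, and the obey_robots parameter are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# A very strict site might disallow everything.
STRICT_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed_urls(urls, robots_txt, obey_robots=True, user_agent="example-crawler"):
    """Return the URLs a crawler may fetch under the given robots.txt rules.

    obey_robots stands in for the check box: when False, the site's
    recommendations are ignored and every URL is returned.
    """
    if not obey_robots:
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://mycompany.com/index.html",
    "http://mycompany.com/private/plan.html",
]

print(allowed_urls(urls, ROBOTS_TXT))
print(allowed_urls(urls, STRICT_ROBOTS_TXT))
print(allowed_urls(urls, STRICT_ROBOTS_TXT, obey_robots=False))
```

With the strict rules, nothing is allowed while the recommendations are obeyed, which corresponds to the case where no content is imported from a site.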
By default, this content crawler saves the URLs of imported Web pages in the case used on the source Web site. To convert the URLs to lowercase, select Convert all URLs to lower case.
To avoid importing content from an area of a Web site or to avoid importing particular pages:
To specify an area to avoid, click Add exclusion filter; then, in the
text box, type the URL of the area of the Web site that you want to avoid.
You can use wildcard notation (*) to make the exclusion more general.
For example, to avoid crawling sales information from a site, you might
type http://mycompany.com*sales.
As a result, this content crawler would not import any pages from mycompany.com
that have "sales" anywhere in the URL.
Note: Wildcards are assumed on either side of your text.
For example, if you type sales, the content crawler will not import any pages
from any site accessible from
the target URL that have "sales" anywhere in the URL.
Important: If you list both exclusions and
inclusions (described in step 5), the exclusions apply only to the included pages. For example, if you
exclude sales and include http://mycompany.com, your content crawler
imports all pages from http://mycompany.com except
those that have "sales" anywhere in the URL.
To remove an exclusion filter, select it and
click the remove icon.
To select or clear all exclusion filter check boxes, select or clear the box to the left of Exclusion Filters.
By default, this content crawler does not crawl or import any pages specified in the exclusions. If your content crawler must follow links on an excluded page to reach pages that are not excluded and that should be imported, select Crawl excluded pages, but do not import them.
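The exclusion behavior described in this step can be sketched in Python. This is a minimal illustration under stated assumptions, not this content crawler's implementation: the link graph LINKS, the functions matches_filter and crawl, and the crawl_excluded parameter are hypothetical, and matching is assumed to be case-sensitive.

```python
import re
from collections import deque

# Hypothetical link graph standing in for a real site: page URL -> linked URLs.
LINKS = {
    "http://mycompany.com/": ["http://mycompany.com/sales/"],
    "http://mycompany.com/sales/": ["http://mycompany.com/docs/guide.html"],
    "http://mycompany.com/docs/guide.html": [],
}


def matches_filter(url, filter_text):
    """Return True if the URL matches the filter text.

    Each * in the filter matches any run of characters. Using re.search
    (rather than a full match) models the wildcards that are assumed on
    either side of the text.
    """
    pattern = ".*".join(re.escape(part) for part in filter_text.split("*"))
    return re.search(pattern, url) is not None


def crawl(start, links, exclusions, crawl_excluded=False):
    """Return the URLs that would be imported, in breadth-first order.

    crawl_excluded stands in for the "Crawl excluded pages, but do not
    import them" option: excluded pages are never imported, but their
    links are followed only when this is True.
    """
    imported = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        excluded = any(matches_filter(url, f) for f in exclusions)
        if not excluded:
            imported.append(url)
        elif not crawl_excluded:
            continue  # neither import the page nor follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return imported


exclusions = ["http://mycompany.com*sales"]
print(crawl("http://mycompany.com/", LINKS, exclusions))
print(crawl("http://mycompany.com/", LINKS, exclusions, crawl_excluded=True))
```

In this sketch the guide page is reachable only through the excluded sales page, so it is imported only when crawl_excluded is True.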
To limit your crawl to an area of a Web site or a particular page:
To specify where this content crawler may
crawl, click Add inclusion filter; then, in the text box, type the
URL of the area of the Web site to which you want to restrict your crawl.
Because Web sites can contain links to other sites, you might want to
use inclusions to keep your content crawler on a particular site. To avoid
crawling other sites, add the base URL of the site you want to crawl to
the inclusion list; for example, http://mycompany.com.
You can use wildcard notation (*) to make the inclusion more general.
For example, if you want to crawl only information on single sign-on (SSO),
you might type http://mycompany.com*sso.
As a result, this content crawler would import only pages from mycompany.com
that have "sso" anywhere in the URL.
Note: Wildcards are assumed on either side of your text.
For example, if you type sso, the content crawler will import any
pages from any site accessible
from the target URL that have "sso" anywhere in the URL.
Important: If you list both inclusions and
exclusions, the exclusions apply only to the included
pages. For example, if you include http://mycompany.com
and exclude sso, your content
crawler imports all pages from http://mycompany.com except
those that have "sso" anywhere in the URL.
To remove an inclusion filter, select
it and click the remove icon.
To select or clear all inclusion filter check boxes, select or clear the box to the left of Inclusion Filters.
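How inclusion and exclusion filters combine can be sketched in Python. This is an illustration only, not this content crawler's implementation: the function names are hypothetical, matching is assumed to be case-sensitive, and an empty inclusion list is assumed to mean that the crawl is not restricted.

```python
import re


def matches_filter(url, filter_text):
    """Return True if the URL matches the filter text.

    Each * matches any run of characters, and re.search models the
    wildcards assumed on either side of the text.
    """
    pattern = ".*".join(re.escape(part) for part in filter_text.split("*"))
    return re.search(pattern, url) is not None


def should_import(url, inclusions, exclusions):
    """Decide whether a page would be imported.

    A page must match at least one inclusion filter (if any are listed),
    and exclusions are then applied to the included pages.
    """
    included = not inclusions or any(matches_filter(url, f) for f in inclusions)
    excluded = any(matches_filter(url, f) for f in exclusions)
    return included and not excluded


inclusions = ["http://mycompany.com"]
exclusions = ["sso"]

for url in [
    "http://mycompany.com/docs/index.html",
    "http://mycompany.com/sso/setup.html",
    "http://othersite.example/docs/index.html",
]:
    print(url, should_import(url, inclusions, exclusions))
```

Here only the first URL would be imported: the second is included but then excluded because it contains "sso", and the third does not match the inclusion filter.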
To display the page associated with this help topic: