Chapter 3. Indexing

Table of Contents
3.1. Indexing in general
3.2. Supported HTTP response codes
3.3. Content-Encoding support
3.4. Stopwords
3.5. Clones
3.6. Specifying WEB space to be indexed
3.7. Aliases
3.8. Servers Table
3.9. External parsers
3.10. Other commands used in indexer.conf
3.11. Extended indexing features
3.12. Using syslog
3.13. Storing compressed document copies

3.1. Indexing in general

3.1.1. Configuration

First, you should configure DataparkSearch. Indexer configuration is covered mostly by the indexer.conf-dist file, which you can find in the etc directory of the DataparkSearch distribution. You may also take a look at the other *.conf samples in the doc/samples directory.

To set up the indexer.conf file, change to the /etc directory of your DataparkSearch installation, copy indexer.conf-dist to indexer.conf and edit it.

To configure the search front-ends (search.cgi and/or search.php3, or other), copy the search.htm-dist file in the /etc directory of your DataparkSearch installation to search.htm and edit it. See Section 8.3 for a detailed description.
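
For example, assuming the default installation prefix /usr/local/dpsearch (your paths may differ), the whole setup boils down to copying the two sample files and editing them:

    cd /usr/local/dpsearch/etc
    cp indexer.conf-dist indexer.conf
    cp search.htm-dist search.htm
    # now edit indexer.conf and search.htm to match your site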

3.1.2. Running indexer

Just run indexer periodically (once a week, a day, an hour...) to pick up the latest modifications on your web sites. You may also add indexer to your crontab, as in the sketch below.
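
The path to indexer in this crontab entry assumes the default /usr/local/dpsearch prefix, and the schedule (every night at 02:15) is only an example:

    # min hour day month weekday  command
    15    2    *   *     *        /usr/local/dpsearch/sbin/indexer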

By default, when called without any command line arguments, indexer reindexes only expired documents. You can change the expiration period with the Period command in indexer.conf. If you want to reindex all documents regardless of whether they have expired, use the -a option: indexer will then mark all documents as expired at startup.
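
For instance, a sketch of both approaches (the one-week Period value is only an example):

    # in indexer.conf: consider documents expired one week after indexing
    Period 7d

    # on the command line: reindex all documents, expired or not
    indexer -a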

When retrieving documents, indexer sends an If-Modified-Since HTTP header for documents that are already stored in the database. When indexer fetches a document, it calculates the document's checksum; if the checksum matches the one stored in the database, the document is not parsed again. The -m command line option prevents indexer from sending If-Modified-Since headers and makes it parse documents even if the checksum is unchanged. This is useful, for example, when you have changed your Allow/Disallow rules in indexer.conf and need to add pages that were disallowed earlier.
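
As a hedged example, after changing your Allow/Disallow rules you might force a full re-fetch and re-parse with:

    # ignore If-Modified-Since and stored checksums, reindex everything
    indexer -a -m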

If DataparkSearch retrieves a URL that returns an HTTP 301, 302 or 303 redirect status, it indexes the URL given in the Location: field of the HTTP header instead.

3.1.3. How to create SQL table structure

To create the SQL tables required for DataparkSearch, use indexer -Ecreate. Executed with this argument, indexer looks for a file containing the SQL statements needed to create all tables for the database type and storage mode given in the DBAddr command in indexer.conf. These files are looked up in the /share directory of the DataparkSearch installation, which is usually /usr/local/dpsearch/share/.
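
As a sketch, with a hypothetical MySQL database named "search" (the DBAddr value only illustrates the format):

    # in indexer.conf
    DBAddr mysql://user:password@localhost/search/

    # create the tables for that database type and storage mode
    indexer -Ecreate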

3.1.4. How to drop SQL table structure

To drop all SQL tables created by DataparkSearch, use indexer -Edrop. The file with the SQL statements required to drop the tables is looked up in the /share directory of the DataparkSearch installation.

3.1.5. Subsection control

indexer has -t, -u and -s options to limit its action to a part of the database. -t corresponds to a Tag limitation, -u is a URL substring limitation (with SQL LIKE wildcards), and -s limits URLs to a given HTTP status. Limit options within the same group are ORed, while options from different groups are ANDed. A few examples are sketched below.
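
The tag value and URL pattern in these examples are hypothetical:

    indexer -t news                       # only documents with Tag 'news'
    indexer -u http://www.example.com/%   # only URLs matching this SQL LIKE pattern
    indexer -s 404 -s 301                 # only documents with these HTTP statuses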

3.1.6. How to clear database

To clear the whole database, use 'indexer -C'. You may also delete only a part of the database by using the -t, -u and -s subsection control options.
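
For example, to delete only the documents of one site (the URL pattern is hypothetical):

    indexer -C -u http://www.example.com/%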

3.1.7. Database Statistics

If you run indexer -S, it shows database statistics, including the total and expired document counts for each status. The -t, -u and -s filters can be used in this mode too.
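
For instance, to see statistics for a single site only (the URL pattern is hypothetical):

    indexer -S -u http://www.example.com/%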

The meaning of status is:

  • 0 - new (not indexed yet) URL

If the status is not 0, it is an HTTP response code; some of the HTTP codes are:

  • 200 - "OK" (the URL was successfully indexed)

  • 206 - "Partial OK" (a part of the URL was successfully indexed)

  • 301 - "Moved Permanently" (redirect to another URL)

  • 302 - "Moved Temporarily" (redirect to another URL)

  • 303 - "See Other" (redirect to another URL)

  • 304 - "Not Modified" (the URL has not been modified since the last indexing)

  • 401 - "Authorization Required" (a login/password is required for this URL)

  • 403 - "Forbidden" (you have no access to this URL)

  • 404 - "Not found" (there were references to URLs that do not exist)

  • 500 - "Internal Server Error" (an error in a CGI script, etc.)

  • 503 - "Service Unavailable" (host is down, connection timed out)

  • 504 - "Gateway Timeout" (read timeout when retrieving document)

HTTP 401 means that the URL is password protected. You can use the AuthBasic command in indexer.conf to set a login:password pair for such URLs, as in the sketch below.
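
A minimal indexer.conf sketch with a hypothetical protected area; AuthBasic is meant to be placed before the Server command it applies to:

    AuthBasic somelogin:somepassword
    Server http://www.example.com/protected/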

HTTP 404 means that one of your documents contains an incorrect reference (a reference to a resource that does not exist).

Take a look at the HTTP documentation for further explanation of the different HTTP status codes.

Status codes of the form 2xxx are not in the HTTP specification; they correspond to documents marked as clones, where xxx is one of the status codes described above.

3.1.8. Link validation

Started with the -I command line argument, indexer displays URL and referrer pairs. This is very useful for finding bad links on your site. Do not use HoldBadHrefs 0 in indexer.conf for this mode. You may use the subsection control options -t, -u and -s in this mode as well. For example, indexer -I -s 404 will display all 'Not found' URLs along with the referrers where the links to those bad documents were found. With the relevant indexer.conf commands and command line options, you can use DataparkSearch specifically for site validation purposes; a sketch follows.
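
A hedged sketch of such a validation run (the 30-day value assumes HoldBadHrefs accepts the same time format as Period):

    # in indexer.conf: keep information about bad links instead of dropping it
    HoldBadHrefs 30d

    # list all 'Not found' URLs together with the pages that refer to them
    indexer -I -s 404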

3.1.9. Parallel indexing

It is possible to run several indexers simultaneously with the same indexer.conf file. We have successfully tested 30 simultaneous indexers with a MySQL database. By default, indexer marks the documents it selects for indexing as expired 4 hours in the future, to avoid double indexing of the same URL by different indexer processes. However, this does not give a 100% guarantee of avoiding such duplication. You may instead use the multi-threaded version of indexer with any SQL back-end that supports several simultaneous connections; the multi-threaded version uses its own locking mechanism.
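
For example, several indexer processes can simply be started against the same configuration, or, assuming the multi-threaded build accepts an -N option for the number of indexing threads, a single process can run several threads:

    # three indexer processes sharing one indexer.conf
    indexer & indexer & indexer &

    # or, with the multi-threaded version (-N is an assumption here)
    indexer -N 3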

It is not recommended to use the same database with different indexer.conf files! One process could add documents that another deletes, and indexing may never finish.

On the other hand, you may run several indexer processes with different databases using ANY supported SQL back-end.