3.11. Extended indexing features

3.11.1. News extensions

To enable News extensions, do these steps:

With News extensions enabled, the indexer tries to detect a Parent-ID for each article indexed and also puts the (Parent-ID, ID) pairs into the links table.

3.11.2. Indexing SQL database tables (htdb: virtual URL scheme)

DataparkSearch can index SQL database text fields using the so-called htdb: virtual URL scheme.

Using the htdb:/ virtual scheme, you can build a full-text index of your SQL tables as well as index your database-driven WWW server.

Note: You must have a PRIMARY key on the table you want to index.
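
For example, the "messages" table used throughout this section could be declared like this (a sketch; the field types are assumptions):

CREATE TABLE messages (
  id   INT NOT NULL,
  msg  TEXT,
  PRIMARY KEY (id)
);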

3.11.2.1. HTDB indexer.conf commands

Five indexer.conf commands provide HTDB support: HTDBAddr, HTDBList, HTDBLimit, HTDBDoc and HTDBText.

HTDBAddr is used to specify the database connection. Its syntax is identical to that of the DBAddr command.
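
For example (the credentials and database name here are placeholders):

HTDBAddr mysql://user:password@localhost/forum/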

HTDBList is an SQL query that generates the list of all URLs which correspond to records in the table, using the PRIMARY key field. You may use either absolute or relative URLs in the HTDBList command.

For example:

HTDBList "SELECT concat('htdb:/',id) FROM messages"
    or
HTDBList "SELECT id FROM messages"

The HTDBLimit command may be used to specify the maximum number of records fetched in one SELECT operation. It allows reducing memory usage when indexing big tables. For example:

HTDBLimit 512

HTDBDoc is a query that fetches a single record from the database using the PRIMARY key value.

The HTDBList SQL query is used for all URLs which end with a '/' sign. For other URLs, the SQL query given in HTDBDoc is used.

Note: the HTDBDoc query must return a FULL HTTP response, including headers. So you can build a very flexible indexing system by giving different HTTP status codes in the query. Take a look at the HTTP response codes section of the documentation to understand indexer's behavior when it receives different HTTP statuses.
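
For instance, a hedged sketch of an HTDBDoc query that answers "404 Not Found" for records flagged as deleted (the "deleted" column is an assumption):

HTDBDoc "SELECT CASE WHEN deleted=1 \
THEN 'HTTP/1.0 404 Not Found\\r\\n\\r\\n' \
ELSE concat('HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) END \
FROM messages WHERE id='$1'"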

If the HTDBDoc query returns no result, or returns several records, the HTDB retrieval system generates "HTTP 404 Not Found". This may happen at reindex time if a record has been deleted from your table since the last reindexing. You may use HoldBadHrefs 0 to delete such records from the DataparkSearch tables as well.

You may use several HTDBDoc/HTDBList commands in one indexer.conf, each with its corresponding Server commands.
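
For example, a hedged sketch with two HTDB sources in one indexer.conf (the table and field names are assumptions; each HTDBList/HTDBDoc pair is placed before the Server command it is meant for):

# URLs will look like htdb:/messages/N, so $1 is 'messages' and $2 is the id
HTDBList "SELECT id FROM messages"
HTDBDoc "SELECT concat('HTTP/1.0 200 OK\\r\\n','Content-type: text/plain\\r\\n','\\r\\n', msg) \
FROM messages WHERE id='$2'"
Server htdb:/messages/

# URLs will look like htdb:/articles/N
HTDBList "SELECT id FROM articles"
HTDBDoc "SELECT concat('HTTP/1.0 200 OK\\r\\n','Content-type: text/plain\\r\\n','\\r\\n', body) \
FROM articles WHERE id='$2'"
Server htdb:/articles/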

HTDBText <section> is a query to fetch raw text data from the database using the PRIMARY key value collected via the HTDBList command. The <section> parameter specifies the section name used for storing this data. This query may return as many rows as required. You may specify several HTDBText commands per Server or Realm command. For example:

DBAddr mysql://foo:bar@localhost/database/?dbmode=single

HTDBAddr mysql://foofoo:barbar@localhost/database/

HTDBList "SELECT DISTINCT topic_id FROM messages"

HTDBText body "SELECT raw_text\
FROM messages WHERE topic_id='$1'"

Server htdb:/

It's possible to specify both HTDBDoc and HTDBText commands for one Server or Realm command. HTDBText commands are processed first.
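
For example, a sketch that combines both for one Server command (the "subject" field is an assumption):

HTDBList "SELECT id FROM messages"
HTDBText title "SELECT subject FROM messages WHERE id='$1'"
HTDBDoc "SELECT concat('HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) \
FROM messages WHERE id='$1'"
Server htdb:/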

3.11.2.2. HTDB variables

You may use PATH parts of the URL as parameters of both HTDBList and HTDBDoc SQL queries. The parts are available as $1, $2, ... $n, where the number is the position of the PATH part:

htdb:/part1/part2/part3/part4/part5
         $1    $2    $3    $4    $5

For example, suppose you have this indexer.conf command:

HTDBList "SELECT id FROM catalog WHERE category='$1'"

When the htdb:/cars/ URL is indexed, $1 will be replaced with 'cars':

SELECT id FROM catalog WHERE category='cars'

You may use long URLs to provide several parameters to both HTDBList and HTDBDoc queries. For example, htdb:/path1/path2/path3/path4/id with the query:

HTDBList "SELECT id FROM table WHERE field1='$1' AND field2='$2' and field3='$3'"

This query will generate the following URLs:

htdb:/path1/path2/path3/path4/id1
...
htdb:/path1/path2/path3/path4/idN

for all values of the field "id" which are in HTDBList output.
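
A matching HTDBDoc query could then use the trailing path part as the id as well (a sketch; "body" is an assumed field name):

HTDBDoc "SELECT concat('HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
body) \
FROM table WHERE field1='$1' AND field2='$2' AND field3='$3' AND id='$5'"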

It's possible to specify a regex-based pattern to map the URL into HTDB variables for the HTDBDoc and HTDBText commands:

HTDBText header "SELECT header FROM news WHERE section=$1 AND article=$2" "^/section/([0-9]+)/art/([0-9]+)\.html"

In this case, the specified regex pattern is matched against the full path and filename of the URL.

For the HTDBText command it is possible to use search template meta-variables (for example, $(DP_ID), $(URL), etc.) to form the SQL query. E.g.:

HTDBText hint "SELECT hint FROM hints WHERE url = '$(url)'"

3.11.2.3. Creating full text index

Using the htdb:/ scheme you can create a full-text index and use it further in your application. Let's imagine you have a big SQL table which stores, for example, web board messages in plain text format, and you want to build an application with a message search facility. Let's say the messages are stored in the "messages" table with two fields, "id" and "msg": "id" is an integer primary key, and "msg" is a big text field containing the messages themselves. Using a usual SQL LIKE search may take a long time to answer:

SELECT id, msg FROM messages WHERE msg LIKE '%someword%'

Using the DataparkSearch htdb: scheme you can create a full-text index on the "messages" table. Install DataparkSearch in the usual order. Then edit your indexer.conf:

DBAddr mysql://foo:bar@localhost/search/?dbmode=single

HTDBAddr mysql://foofoo:barbar@localhost/database/

HTDBList "SELECT id FROM messages"

HTDBDoc "SELECT concat(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) \
FROM messages WHERE id='$1'"

Server htdb:/

After start, indexer will insert the 'htdb:/' URL into the database and will run the SQL query given in HTDBList. It will produce the values 1, 2, 3, ..., N as a result. Those values will be considered as links relative to the 'htdb:/' URL. A list of new URLs in the form htdb:/1, htdb:/2, ..., htdb:/N will be added into the database. Then the HTDBDoc SQL query will be executed for each new URL. HTDBDoc will produce an HTTP document for each record in the form:

HTTP/1.0 200 OK
Content-Type: text/plain

<some text from the 'msg' field here>

This document will be used to create a full-text index using words from the 'msg' field. Words will be stored in the 'dict' table, assuming that we are using the 'single' storage mode.

After indexing you can use DataparkSearch tables to perform search:

SELECT url.url 
FROM url,dict 
WHERE dict.url_id=url.rec_id 
AND dict.word='someword';

Since the DataparkSearch 'dict' table has an index on the 'word' field, this query will be executed much faster than queries which use SQL LIKE search on the 'messages' table.

You can also use several words in search:

SELECT url.url, count(*) as c 
FROM url,dict
WHERE dict.url_id=url.rec_id 
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;

Both queries will return 'htdb:/XXX' values in the url.url field. Your application then has to strip the leading 'htdb:/' from those values to get the PRIMARY key values of your 'messages' table.
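
For example, in MySQL the prefix can be stripped directly in the query ('htdb:/' is 6 characters long, so the key starts at position 7):

SELECT substring(url.url FROM 7) AS id
FROM url, dict
WHERE dict.url_id=url.rec_id
AND dict.word='someword';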

3.11.2.4. Indexing SQL database driven web server

You can also use the htdb:/ scheme to index your database-driven WWW server. It allows creating indexes without having to invoke the web server while indexing, so it is much faster and requires fewer CPU resources than indexing directly from the WWW server.

The main idea of indexing a database-driven web server is to build the full-text index in the usual order. The only difference is that search must produce real URLs instead of URLs in the 'htdb:/...' form. This can be achieved using the DataparkSearch aliasing tools.

The HTDBList command generates URLs in the form:

http://search.site.ext/board/message.php?id=XXX

where XXX is a "messages" table primary key value.

For each primary key value the HTDBDoc command generates a text/html document with HTTP headers and content like this:

<HTML>
<HEAD>
<TITLE> ... subject field here .... </TITLE>
<META NAME="Description" Content=" ... author here ...">
</HEAD>
<BODY> ... message text here ... </BODY>
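
A hedged sketch of an HTDBDoc command that could produce such a response, assuming the "messages" table has "subject" and "msg" fields:

HTDBDoc "SELECT concat(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/html\\r\\n',\
'\\r\\n',\
'<HTML><HEAD><TITLE>', subject, '</TITLE></HEAD>',\
'<BODY>', msg, '</BODY></HTML>') \
FROM messages WHERE id='$1'"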

At the end of doc/samples/htdb.conf we wrote three commands:

Server htdb:/
Realm  http://search.site.ext/board/message.php?id=*
Alias  http://search.site.ext/board/message.php?id=  htdb:/

The first command tells indexer to execute the HTDBList query, which will generate a list of messages in the form:

http://search.site.ext/board/message.php?id=XXX

The second command allows indexer to accept such message URLs, using string match with the '*' wildcard at the end.

The third command replaces the "http://search.site.ext/board/message.php?id=" substring in the URL with "htdb:/" when indexer retrieves documents with messages. It means that "http://search.site.ext/board/message.php?id=xxx" URLs will be shown in search results, but the "htdb:/xxx" URL will be fetched and indexed instead, where xxx is the PRIMARY key value, the ID of the record in the "messages" table.

3.11.3. Indexing binaries output (exec: and cgi: virtual URL schemes)

DataparkSearch supports the exec: and cgi: virtual URL schemes. They allow running an external program. This program must return its result on stdout. The result must be a standard HTTP response, i.e. HTTP response headers followed by the document's content.

For example, when indexing both cgi:/usr/local/bin/myprog and exec:/usr/local/bin/myprog, indexer will execute the /usr/local/bin/myprog program.
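
For example, a minimal sketch of such a program (a hypothetical /usr/local/bin/myprog) that just prints an HTTP header block and a plain-text body to stdout:

#!/bin/sh
# Emit a minimal HTTP response: status line, headers, empty line, then the body.
echo "HTTP/1.0 200 OK"
echo "Content-Type: text/plain"
echo ""
echo "Hello from myprog"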

3.11.3.1. Passing parameters to cgi: virtual scheme

When executing a program given in the cgi: virtual scheme, indexer emulates that the program is running under an HTTP server. It creates a REQUEST_METHOD environment variable with the value "GET" and a QUERY_STRING variable according to the HTTP standards. For example, if cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e is being indexed, indexer creates QUERY_STRING with the value a=b&d=e. The cgi: virtual URL scheme allows indexing your site without having to invoke the web server, even if you want to index CGI scripts. For example, you have a web site with static documents under /usr/local/apache/htdocs/ and CGI scripts under /usr/local/apache/cgi-bin/. Use the following configuration:

Server http://localhost/
Alias  http://localhost/cgi-bin/	cgi:/usr/local/apache/cgi-bin/
Alias  http://localhost/		file:/usr/local/apache/htdocs/

3.11.3.2. Passing parameters to exec: virtual scheme

indexer does not create a QUERY_STRING variable as in the cgi: scheme. Instead, it builds a command line with the argument given in the URL after the ? sign. For example, when indexing exec:/usr/local/bin/myprog?a=b&d=e, this command will be executed:

/usr/local/bin/myprog "a=b&d=e" 

3.11.3.3. Using exec: virtual scheme as an external retrieval system

The exec: virtual scheme can also be used as an external retrieval system: it allows using protocols which are not supported natively by DataparkSearch. For example, you can use the curl program, available from http://curl.haxx.se/, to index HTTPS sites.

Put this short script into /usr/local/dpsearch/etc/ under the name curl.sh.

#!/bin/sh
# Fetch the URL given as the first argument; -i makes curl print the HTTP headers too.
/usr/local/bin/curl -i "$1" 2>/dev/null

This script takes the URL given as a command-line argument and executes the curl program to download it. The -i argument tells curl to output the result together with the HTTP headers.

Now use these commands in your indexer.conf:

Server https://some.https.site/
Alias  https://  exec:/usr/local/dpsearch/etc/curl.sh?https://

When indexing https://some.https.site/path/to/page.html, indexer will translate this URL to

exec:/usr/local/dpsearch/etc/curl.sh?https://some.https.site/path/to/page.html

execute the curl.sh script:

/usr/local/dpsearch/etc/curl.sh "https://some.https.site/path/to/page.html"

and take its output.

3.11.4. Mirroring

You may specify a path to the root directory to enable site mirroring:

MirrorRoot /path/to/mirror

You may also specify the root directory for mirrored documents' headers; indexer will then store HTTP headers on the local disk too:

MirrorHeadersRoot /path/to/headers

You may specify the period during which earlier mirrored files will be used while indexing, instead of downloading them again:

MirrorPeriod <time>

It is very useful when you do some experiments with DataparkSearch, indexing the same hosts, and do not want to generate much traffic from/to the Internet. If MirrorHeadersRoot is not specified and headers are not stored on the local disk, then the default Content-Types given in AddType commands will be used. The default value of MirrorPeriod is -1, which means that mirrored files are not used.

<time> is given in the form xxxA[yyyB[zzzC]] (spaces are allowed between xxx and A, between yyy and B, and so on), where xxx, yyy, zzz are numbers (they can be negative!) and A, B, C can be one of the following:

		s - second
		M - minute
		h - hour
		d - day
		m - month
		y - year

(these letters are the same as in strptime/strftime functions)

Examples:

15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and six months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any letter, it is assumed that the time is given in seconds (this behavior is for compatibility with versions prior to 3.1.7).

The following command will force using local copies for one day:

MirrorPeriod 1d

If your pages are already indexed and you re-index with -a, indexer will check the headers and only download files that have been modified since the last indexing. Thus, all pages that are not modified will not be downloaded and therefore not mirrored either. To create the mirror you need to either (a) start again with a clean database or (b) use the -m switch.

You can actually use the created files as a full-featured mirror of your site. However, be careful: indexer will not download a document that is larger than MaxDocSize. If a document is larger, it will be only partially downloaded. If your site has no large documents, everything will be fine.
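
Putting it together, a minimal mirroring setup in indexer.conf might look like this (the paths and the MaxDocSize value are just examples):

MirrorRoot        /usr/local/dpsearch/var/mirror
MirrorHeadersRoot /usr/local/dpsearch/var/headers
MirrorPeriod      1d
MaxDocSize        4194304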

3.11.5. Data acquisition

With the ActionSQL command you can execute SQL queries with document-related data while indexing. The syntax of the ActionSQL command is as follows:

ActionSQL [add | update | delete] <section> <pattern> <sql-template> [<dbaddr>]
where <section> is the name of the document section checked for a match against the regex pattern <pattern>. If a match is found, the <sql-template> is filled in with the regex meta-variables $1-$9 as well as with search template meta-variables (for example, $(Title), $(Last-Modified), etc.) to form an SQL query, which is executed on the first DBAddr defined in the indexer.conf file. If the optional <dbaddr> parameter of the ActionSQL command is set, a new connection is established according to this DBAddr and the SQL query is executed over this connection.

One of the options add, update or delete specifies when this command is executed: on indexing of a new document, on reindexing of a document, or on deletion of a document. If none of these options is specified, the add option is assumed by default.

Thus you can use ActionSQL commands to mine and collect data from pages while indexing. For example, the following commands collect phone numbers (in Russian local notation) along with the titles of the pages where these phone numbers have been discovered:

ActionSQL add body "\(([0-9]{3})\)[ ]*([0-9]{3})[- \.]*([0-9]{2})[- \.]*([0-9]{2})" "INSERT INTO phonedata(phone,title,id)VALUES('+7$1$2$3$4','$(title)',$(dp_id))"
ActionSQL update body "\(([0-9]{3})\)[ ]*([0-9]{3})[- \.]*([0-9]{2})[- \.]*([0-9]{2})" "UPDATE phonedata SET phone='+7$1$2$3$4',title='$(title)' WHERE id=$(dp_id)"
ActionSQL delete url "." "DELETE FROM phonedata WHERE id=$(dp_id)"
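
The examples above assume that the phonedata table already exists; a minimal sketch of a matching definition (the field types are assumptions):

CREATE TABLE phonedata (
  phone VARCHAR(16),
  title VARCHAR(255),
  id    INT NOT NULL,
  PRIMARY KEY (id)
);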