DataparkSearch Engine 4.53
Reference manual
Copyright © 2003-2009 OOO DataPark
Copyright © 2001-2003 Lavtech.com corp.
This project is dedicated to Noémie.
Chapter 1. Introduction
DataparkSearch is a full-featured web search engine. It consists of two parts. The first part is an indexing mechanism (the indexer). The indexer walks over hypertext references and stores the words and new references it finds in the database. The second part is a CGI front-end that provides the search service using the data collected by the indexer.
DataparkSearch was cloned from the 3.2.16 CVS version of mnoGoSearch on 27 November 2003 as DataparkSearch 4.16. The first release of mnoGoSearch took place in November 1998. The search engine was named UDMSearch until October 2000, when the project was acquired by Lavtech.Com Corp. and renamed mnoGoSearch.
The latest change log of DataparkSearch can be found on our website.
1.1. DataparkSearch Features
Main DataparkSearch features are as follows:
1.2. Where to get DataparkSearch
Check for the latest version of DataparkSearch at http://www.dataparksearch.org/, as well as at Google Code: http://code.google.com/p/dataparksearch/.
DataparkSearch is also available in the FreeBSD ports collection (see www.freshports.org/www/dpsearch) and in the T2 Linux SDE.
DataparkSearch's source is available via SVN at Google Code:
    svn checkout http://dataparksearch.googlecode.com/svn/trunk/ dataparksearch-read-only
1.3. Disclaimer
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING file for details.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
1.4. Authors
Maxim Zakharov
1.4.1. Contributors
Michael Kynast
Jean-Gerard Pailloncy: Testing on OpenBSD.
Amit Joshi: Testing on CentOS, packaging for Debian, some ideas to improve the scalability for several PCs and for using several DBAddr commands.
mnoGoSearch developers and contributors
Chapter 2. Installation
2.1. SQL database requirements
Note that if you want to compile DataparkSearch with one of the supported SQL databases, that database must already be installed before you install DataparkSearch. It is possible to use DataparkSearch with several SQL databases. You should also have sufficient permissions to create a new database or to write to an existing one.
MySQL notes: If you want to build DataparkSearch with MySQL, release 4.1 or later is required. The libz library must be installed from the zlib-devel RPM to compile DataparkSearch with MySQL successfully.
PostgreSQL notes: If you want to build DataparkSearch with PostgreSQL, release 7.3.x or later is required. PostgreSQL 8.1 is recommended for better performance.
iODBC notes: iodbc-2.50.22a is known to work.
unixODBC notes: unixODBC-1.7 is known to work.
InterBase notes:
FreeTDS notes: version 0.52 is known to work with MS SQL 7.0.
Oracle8 notes: 8.0.5.X is known to work.
Oracle8i notes: 8.1.6 R2 EE is known to work.
2.2. Supported operating systems
We use GNU Autoconf, so it is possible to compile and use DataparkSearch on almost every modern UNIX system with a C compiler without any modifications. We develop the software on FreeBSD 5.x using PostgreSQL 8.1.
Systems on which DataparkSearch is currently known to have been successfully compiled and tested are:
We hope DataparkSearch will work on other Unix platforms as well. Please report successful platforms to maxime@maxime.net.ru.
NFS notes: Problems have been reported when running DataparkSearch over NFS v4 on Linux 2.6.17. However, everything works on this system when NFS v3 is used.
2.3. Tools required for installation
You need the following tools to build and install DataparkSearch from source:
2.4. Installing DataparkSearch
2.5. Possible installation problems
If the above information doesn't help you, please feel free to contact the DataparkSearch mailing list.
2.6. Quick usage tour
Before running indexer for the first time, you need to specify the web space to index (see Section 3.6). Basically, if you want to index one site, you should put a Server command similar to the following into your indexer.conf file:
    Server http://www.server.ext/
Run the indexer to index your data and write URL data:
    sh$ /usr/local/dpsearch/sbin/indexer -W
2.7. Installation registration
If you use DataparkSearch to build search on a publicly accessible web site, you may register this site on our users page.
Chapter 3. Indexing
3.1. Indexing in general
3.1.1. Configuration
First, you should configure DataparkSearch. Indexer configuration is covered mostly by the indexer.conf-dist file. You can find it in the etc directory of the DataparkSearch distribution. You may also take a look at other *.conf samples in the doc/samples directory.
To set up the indexer.conf file, change directory to the DataparkSearch installation /etc directory, copy indexer.conf-dist to indexer.conf and edit it.
To configure search front-ends (search.cgi and/or search.php3, or other), you should copy the search.htm-dist file in the /etc directory of the DataparkSearch installation to search.htm and edit it. See Section 8.3 for a detailed description.
3.1.2. Running indexer
Just run indexer once a week (a day, an hour ...) to find the latest modifications in your web sites. You may also add indexer to your crontab. By default, indexer being called without any
command line arguments reindexes only expired documents. You can change the expiration period with the Period indexer.conf command. If you want to reindex all documents regardless of whether they are expired or
not, use Retrieving documents, indexer sends
If-Modified-Since HTTP header for documents that are already stored in the database. When indexer gets the next document, it calculates the document's checksum. If the checksum is the same as the old checksum stored in the database, it does not parse the document again. indexer
If DataparkSearch retrieves a URL that redirects with HTTP status 301, 302 or 303, it indexes the URL given in the Location: field of the HTTP header instead.
3.1.3. How to create SQL table structure
To create the SQL tables required for DataparkSearch functionality, use indexer -Ecreate. Executed with this argument, indexer looks up a file containing the SQL statements necessary for creating all SQL tables for the database type and storage mode given in the DBAddr indexer.conf command. Files are looked up in the /share directory of the DataparkSearch installation, which is usually /usr/local/dpsearch/share/.
3.1.4. How to drop SQL table structure
To drop all SQL tables created by DataparkSearch, use indexer -Edrop. The file with the SQL statements required to drop the tables is looked up in the /share directory of the DataparkSearch installation.
3.1.5. Subsection control
indexer has the -t, -u and -s options to limit an action to only a part of the database. -t corresponds to the 'Tag' limitation, -u is a URL substring limitation (SQL LIKE wildcards), and -s limits URLs to those with the given HTTP status. All limit options in the same group are ORed, and options in different groups are ANDed.
3.1.6. How to clear database
To clear the whole database, use 'indexer -C'. You may also delete only a part of the database by using the -t, -u, -s subsection control options.
3.1.7. Database Statistics
If you run indexer -S, it shows database statistics, including the count of total and expired documents for each status. The -t, -u, -s filters are usable in this mode too.
The meaning of status is:
If the status is not 0, then it is an HTTP response code. Some of the HTTP codes are:
HTTP 401 means that this URL is password protected. You can use the AuthBasic command in indexer.conf to set login:password for such URL(s).
HTTP 404 means that you have an incorrect reference in one of your documents (a reference to a resource that does not exist).
Refer to HTTP-specific documentation for further explanation of the different HTTP status codes.
Status codes 2xxx are not in the HTTP specification; they correspond to documents marked as clones, where xxx is one of the status codes described above.
3.1.8. Link validation
When started with the -I command line argument, indexer displays pairs of a URL and its referrer. This is very useful for finding bad links on your site. Don't use the HoldBadHrefs 0 command in indexer.conf for this mode. You may use the subsection control options -t, -u, -s in this mode. For example, indexer -I -s 404 displays all 'Not found' URLs with the referrers where links to those bad documents were found. By setting the relevant indexer.conf commands and command line options, you may use DataparkSearch specifically for site validation purposes.
3.1.9. Parallel indexing
MySQL and PostgreSQL users may run several indexers simultaneously with the same indexer.conf file. We have successfully tested 30 simultaneous indexers with a MySQL database. Indexer uses the MySQL and PostgreSQL locking mechanisms to avoid double indexing of the same URL by different indexers. Parallel indexing in the same database is not implemented for other back-ends yet. You may, however, use the multi-threaded version of indexer with any SQL back-end that supports several simultaneous connections. The multi-threaded indexer version uses its own locking mechanism.
It is not recommended to use the same database with different indexer.conf files! The first process could add something that the second then deletes, and this may never stop.
On the other hand, you may run several indexer processes with different databases with ANY supported SQL back-end.
3.2. Supported HTTP response codes
This section describes how DataparkSearch processes different HTTP codes. Pseudo-language is used for the explanation.
3.3. Content-Encoding support
The DataparkSearch engine supports HTTP compression (content encoding). Compression can have a major impact on the performance of HTTP transactions. The only way to obtain higher performance is to reduce the number of bytes transmitted. Using content encoding to receive a server's response can reduce the traffic by half or more.
The HTTP 1.1 (RFC 2616) specification contains four content encoding methods: gzip, deflate, compress, and identity.
When content encoding is enabled, DataparkSearch's indexer sends the string Accept-Encoding: gzip,deflate,compress to the server in the HTTP headers. If the server supports any of the gzip, deflate or compress encodings, it sends a gzipped, deflated or compressed response.
To compile DataparkSearch with HTTP content encoding support, the zlib library is required.
To enable HTTP content encoding support, configure DataparkSearch with the following option:
    ./configure --with-zlib
Use this option along with all the other necessary ones.
3.4. Stopwords
Stopwords are the most frequently used words, i.e. words which appear in almost every document searched. Stopwords are filtered out prior to index construction, which reduces the total size of the index without any significant loss in search quality.
3.4.1. StopwordFile command
Load stop words from the given text file. You may specify either an absolute file name or a name relative to the DataparkSearch /etc directory. You may use several StopwordFile commands.
    StopwordFile stopwords/en.sl
You must use the same set of StopwordFile commands in indexer.conf and search.htm (searchd.conf if searchd is used).
3.4.2. Format of stopword file
You may create your own stopword lists. As an example, you may take the English stopword file etc/stopwords/en.sl. At the beginning of the list, please specify the following two commands:
    Language: en
    Charset: us-ascii
Then the list of stopwords follows, one word per line. Each word is written in the character set specified above by the Charset: command.
You may use the optional Match: command to specify a pattern; any word matching it is treated as a stopword. E.g.:
    Match: regex ^\$##
According to this command, any word beginning with $## will be considered a stopword. Options of the Match: command are the same as for Allow (see Section 3.10.14). Arguments are in the character set specified by the Charset: command. Regular expressions are limited at the moment (e.g. intervals aren't supported).
3.5. Clones
Clones are documents having equal values of Hash32 on all document sections. Identical copies of the same document always have equal values of Hash32. This allows duplicate documents in a collection to be eliminated. However, if only the title section is defined in sections.conf, all documents with different bodies but identical titles will be considered clones.
3.5.1. DetectClones command
    DetectClones yes/no
Allow/disallow clone detection and elimination. If allowed, indexer will detect the same documents under different locations, such as mirrors, and will index only one document from each group of such equal documents. "DetectClones yes" also reduces space usage. Default value is "yes".
    DetectClones no
3.6. Specifying WEB space to be indexed
When indexer tries to insert a new URL into the database or to index an existing one, it first checks whether this URL has a corresponding Server, Realm or Subnet command given in indexer.conf. URLs without a corresponding Server, Realm or Subnet command are not indexed. By default, URLs which are already in the database and have no Server/Realm/Subnet commands will be deleted from the database. This may happen, for example, after removing some Server/Realm/Subnet commands from indexer.conf.
These commands have the following format:
    <command> [method] [subsection] [CaseType] [MatchType] [CmpType] pattern [alias]
Mandatory parameter Optional parameter
Use optional
The optional parameter CaseType specifies the case sensitivity of string comparison. It can take one of the following values: case for case sensitive comparison, or nocase for case insensitive comparison.
The optional parameter CmpType specifies the type of comparison and can take two values: Regex and String. String wildcards is the default. You can use the ? and * signs in URLMask parameters; they mean "one character" and "any number of characters" respectively. For example, if you want to index all HTTP sites in the .ru domain, use this command:
    Realm http://*.ru/*
Regex comparison type takes a regular expression
as its argument. Activate regex comparison type using
    Realm Regex ^http://.*\.ru/
The optional parameter MatchType specifies the match type. Possible values are Match and NoMatch, with Match as the default. Realm NoMatch has the reverse effect. It means that URL that does not match given
    Realm NoMatch http://*.com/*
Optional
3.6.1. Server command
This is the main command of the indexer.conf file. It is used to add servers or their parts to be indexed. This command also tells indexer to insert the given URL into the database at startup.
E.g. the command Server http://localhost/ allows indexing of the whole http://localhost/ server. It also makes indexer insert the given URL into the database at startup.
You can also specify a path to index a server subsection: Server http://localhost/subsection/. It also tells indexer to insert the given URL at startup.
3.6.2. Realm command
The Realm command is a more powerful means of describing the web area to be indexed. It works almost like the Server command but takes a regular expression or string wildcards as its
3.6.3. Subnet command
The Subnet command is another way to describe the web area to be indexed. It works almost like the Server command but takes string wildcards or a network specified in CIDR presentation format as its
    Subnet 192.168.*.*
In the case of a network specified in CIDR presentation format, you may specify the subnet in the forms a.b.c.d/m, a.b.c, a.b, a:
    Subnet 192.168.10.0/24
You may use the "NoMatch" optional argument. For example, if you want to index everything except the 195.x.x.x subnet, use:
    Subnet NoMatch 195.*.*.*
3.6.4. Using different parameters for a server and its subsections
Indexer looks for "Server" and "Realm" commands in order of their appearance. Thus, if you want to give different parameters to e.g. a whole server and its subsection, you should add the subsection line before the whole server's. Imagine that you have a server subdirectory which contains news articles. Surely those articles are to be reindexed more often than the rest of the server. The following combination may be useful in such cases:
    # Add subsection
    Period 200000
    Server http://servername/news/
    # Add server
    Period 600000
    Server http://servername/
These commands give a different reindexing period for the /news/ subdirectory compared with the period of the server as a whole. indexer will choose the first "Server" record for http://servername/news/page1.html because it matches and was given first.
3.6.5. Default indexer behavior
The default behavior of indexer is to follow links that have a corresponding Server/Realm command in the indexer.conf file. It also jumps between servers if both of them are present in indexer.conf, either directly in a Server command or indirectly in a Realm command. For example, there are two Server commands:
    Server http://www/
    Server http://web/
When indexing http://www/page1.html, indexer WILL follow the link http://web/page2.html if it is found. Note that these pages are on different servers, but BOTH of them have a corresponding Server record.
If one of the Server commands is deleted, indexer will remove all expired URLs from this server during the next reindexing.
3.6.6. Using indexer -f <filename>
The third scheme is very useful when running indexer -i -f url.txt. You may maintain the required servers in url.txt. When a new URL is added to url.txt, indexer will index the server of this URL during the next startup.
3.6.7. ServerDB, RealmDB, SubnetDB and URLDB commands
    URLDB pgsql://foo:bar@localhost/portal/links?field=url
These commands are equal to the Server, Realm, Subnet and URL commands respectively, but take their arguments from a field of the specified SQL table.
In the example above, URLs are taken from database
3.6.8. URL command
    URL http://localhost/path/to/page.html
This command inserts given
3.7. Aliases
DataparkSearch has alias support, making it possible to index sites taking information from another location. For example, if you index a local web server, it is possible to take pages directly from disk without involving your web server in the indexing process. Another example is building a search engine for a primary site and using its mirror while indexing. There are several ways of using aliases.
3.7.1. Alias indexer.conf command
Format of the "Alias" indexer.conf command:
    Alias <masterURL> <mirrorURL>
E.g. you wish to index http://search.mnogo.ru/ using the nearest German mirror http://www.gstammw.de/mirrors/mnoGoSearch/. Add these lines to your indexer.conf:
    Server http://search.mnogo.ru/
    Alias http://search.mnogo.ru/ http://www.gstammw.de/mirrors/mnoGoSearch/
search.cgi will display URLs from the master site http://search.mnogo.ru/, but indexer will take the corresponding pages from the mirror site http://www.gstammw.de/mirrors/mnoGoSearch/.
Another example: you want to index everything in the udm.net domain, and one of the servers, for example http://home.udm.net/, is stored on the local machine in the /home/httpd/htdocs/ directory. These commands will be useful:
    Realm http://*.udm.net/
    Alias http://home.udm.net/ file:/home/httpd/htdocs/
Indexer will take home.udm.net from the local disk and index the other sites using HTTP.
3.7.2. Different aliases for server parts
Aliases are searched in the order of their appearance in indexer.conf. So, you can create different aliases for a server and its parts:
    # First, create alias for example for /stat/ directory which
    # is not under common location:
    Alias http://home.udm.net/stat/ file:/usr/local/stat/htdocs/
    # Then create alias for the rest of the server:
    Alias http://home.udm.net/ file:/usr/local/apache/htdocs/
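The effect of alias ordering can be illustrated with a small Python sketch. It is only a model of the behavior described above, not DataparkSearch code: the function name resolve_alias and the rule that a URL with no matching alias is returned unchanged are assumptions made for the illustration.

```python
# Aliases in the order they would appear in indexer.conf:
# the more specific /stat/ prefix comes first.
ALIASES = [
    ("http://home.udm.net/stat/", "file:/usr/local/stat/htdocs/"),
    ("http://home.udm.net/", "file:/usr/local/apache/htdocs/"),
]


def resolve_alias(url, aliases=ALIASES):
    """Return the location to fetch for url, using the first matching alias.

    Hypothetical helper: the first (master, mirror) pair whose master prefix
    matches wins, mirroring "searched in the order of their appearance".
    """
    for master, mirror in aliases:
        if url.startswith(master):
            return mirror + url[len(master):]
    # Assumption: without a matching alias the URL is fetched as is.
    return url


print(resolve_alias("http://home.udm.net/stat/index.html"))
print(resolve_alias("http://home.udm.net/about.html"))
```

If the two entries were listed in the opposite order, the general http://home.udm.net/ prefix would match first and the /stat/ alias would never be reached, which is why the more specific alias has to come first.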
3.7.3. Using aliases in Server commands
You may specify the location used by indexer as an optional argument of the Server command:
    Server http://home.udm.net/ file:/home/httpd/htdocs/
3.7.4. Using aliases in Realm commands
Aliases in the Realm command are a very powerful feature based on regular expressions. The idea of the alias implementation in the Realm command is similar to how PHP
Use this syntax for Realm aliases:
    Realm regex <URL_pattern> <alias_pattern>
Indexer searches the URL for matches to URL_pattern and builds a URL alias using alias_pattern. alias_pattern may contain references of the form $n, where n is a number in the range 0-9. Every such reference is replaced by the text captured by the n'th parenthesized pattern. $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.
Example: your company hosts several hundred users with their domains in the form www.username.yourname.com. Every user's site is stored on disk in "htdocs" under the user's home directory: /home/username/htdocs/.
You may write this command into indexer.conf (note that the dot '.' character has a special meaning in regular expressions and must be escaped with the '\' sign when a literal dot is meant):
    Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*) file:/home/$2/htdocs/$4
Imagine indexer processes the page http://www.john.yourname.com/news/index.html. It will build patterns from $0 to $4:
    $0 = 'http://www.john.yourname.com/news/index.html' (whole pattern match)
    $1 = 'http://www.'
    $2 = 'john'
    $3 = '.yourname.com/'
    $4 = 'news/index.html'
Then indexer will compose the alias using the $2 and $4 patterns:
    file:/home/john/htdocs/news/index.html
and will use the result as the document location to fetch it.
3.7.5. AliasProg command
You may also specify the AliasProg command for aliasing purposes. AliasProg is useful for major web hosting companies which want to index their web space taking documents directly from disk without having to involve the web server in the indexing process. The document layout may be too complex to describe using an alias in the Realm command. AliasProg is an external program that takes a URL and writes one string with the appropriate alias to stdout. Use $1 to pass the URL on the command line.
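As a rough illustration of what such an external program might look like, here is a minimal Python sketch. It reuses the www.username.yourname.com layout from the Realm example; the script itself, the name make_alias, and the choice to echo the URL back unchanged when it does not match are all assumptions for the illustration, not behavior documented for DataparkSearch.

```python
import re
import sys

# Hypothetical layout taken from the Realm example above:
# http://www.<user>.yourname.com/<path> is stored in /home/<user>/htdocs/<path>
USER_SITE = re.compile(r"^http://www\.([^./]+)\.yourname\.com/(.*)$")


def make_alias(url):
    """Map a hosted user's URL to its on-disk location."""
    m = USER_SITE.match(url)
    if m is None:
        # Assumption: URLs outside the hosted layout are returned unchanged.
        return url
    user, path = m.groups()
    return "file:/home/%s/htdocs/%s" % (user, path)


if __name__ == "__main__":
    # The URL is expected as the first command line argument ($1 in AliasProg).
    if len(sys.argv) > 1:
        print(make_alias(sys.argv[1]))
```

Such a script would be referenced from AliasProg with $1 as its argument, so that the URL being processed is passed on the command line and the single line printed to stdout is taken as the alias.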
For example, this AliasProg command uses the 'replace' command from the MySQL distribution and replaces the URL substring http://www.apache.org/ with file:/usr/local/apache/htdocs/:
    AliasProg "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:/usr/local/apache/htdocs/"
You may also write your own very complex program to process URLs.
3.7.6. ReverseAlias command
The ReverseAlias indexer.conf command allows URL mapping before a URL is inserted into the database. Unlike the Alias command, which triggers mapping right before a document is downloaded, the ReverseAlias command triggers mapping after the link is found.
    ReverseAlias http://name2/ http://name2.yourname.com/
    Server http://name2.yourname.com/
All links with the short server name will be mapped to links with the full server name before they are inserted into the database.
One possible use is cutting various unnecessary strings like PHPSESSION=XXXX.
E.g. cutting from a URL like http://www/a.php?PHPSESSION=XXX, when PHPSESSION is the only parameter. The question mark is deleted as well:
    ReverseAlias regex (http://[^?]*)[?]PHPSESSION=[^&]*$ $1$2
Cutting from a URL like http://www/a.php?PHPSESSION=xxx&.., i.e. when PHPSESSION is the first parameter but other parameters follow it. The '&' sign after PHPSESSION is deleted as well. The question mark is not deleted:
    ReverseAlias regex (http://[^?]*[?])PHPSESSION=[^&]*&(.*) $1$2
Cutting from a URL like http://www/a.php?a=b&PHPSESSION=xxx or http://www/a.php?a=b&PHPSESSION=xxx&c=d, where PHPSESSION is not the first parameter. The '&' sign before PHPSESSION is deleted:
    ReverseAlias regex (http://.*)&PHPSESSION=[^&]*(.*) $1$2
3.7.7. ReverseAliasProg command
ReverseAliasProg is a command similar to both the AliasProg command and the ReverseAlias command. It takes arguments like AliasProg but maps the URL before inserting it into the database, like the ReverseAlias command.
3.7.8. Alias command in search.htm search template
It is also possible to define aliases in the search template (search.htm).
The Alias command in search.htm is identical to the one in indexer.conf; however, it is active during searching, not indexing.
The syntax of the search.htm Alias command is the same as in indexer.conf:
    Alias <find-prefix> <replace-prefix>
For example, there is the following command in search.htm:
    Alias http://localhost/ http://www.mnogo.ru/
Search returned a page with the following URL:
    http://localhost/news/article10.html
As a result, the $(DU) variable will be replaced NOT with this URL:
    http://localhost/news/article10.html
but with the following URL (the result of processing with Alias):
    http://www.mnogo.ru/news/article10.html
3.8. Servers Table
DataparkSearch has the ServerTable indexer.conf command. It allows loading server and filter configuration from an SQL table.
3.8.1. Loading servers table
When ServerTable mysql://user:pass@host/dbname/tablename[?srvinfo=infotablename] is specified, indexer will load server information from the given tablename SQL table, and will load server parameters from the given infotablename SQL table. If the srvinfo parameter is not specified, parameters will be loaded from the srvinfo table. Check the structure of the server and srvinfo tables in the create/mysql/create.txt file. If there is no structure example for your database, take it as an example.
You may use several ServerTable commands to load server information from different tables.
3.8.2. Servers table structure
The servers table consists of all the fields necessary to describe server parameters. Field names have corresponding indexer.conf commands. For example, the "period" field corresponds to the "Period" indexer.conf command. Default field values are the same as the default indexer.conf parameters.
The "gindex" field corresponds to the "Index" command. The name is slightly changed to avoid using an SQL reserved word.
For a description of several fields, see Section 9.3.
3.8.3. Flushing Servers Table
Flush server.enabled to inactive for all server table records. Use this command to deactivate all commands in the server table before loading new ones from indexer.conf or from another server table.
3.9. External parsers
The DataparkSearch indexer can use external parsers to index various file types (MIME types).
A parser is an executable program which converts one of the MIME types to text/plain or text/html. For example, if you have PostScript files, you can use the ps2ascii parser (filter), which reads a PostScript file from stdin and writes ASCII to stdout.
3.9.1. Supported parser types
Indexer supports four types of parsers that can:
3.9.2. Setting up parsers
3.9.3. Avoid indexer hang on parser execution
To avoid an indexer hang on parser execution, you may specify the amount of time in seconds allowed for parser execution in your indexer.conf with the ParserTimeOut command. For example:
    ParserTimeOut 600
Default value is 300 seconds, i.e. 5 minutes.
3.9.4. Pipes in parser's command line
You can use pipes in a parser's command line. For example, these lines will be useful to index gzipped man pages from local disk:
    AddType application/x-gzipped-man *.1.gz *.2.gz *.3.gz *.4.gz
    Mime application/x-gzipped-man text/plain "zcat | deroff"
3.9.5. Charsets and parsers
Some parsers can produce output in a charset other than the one given in the LocalCharset command. Specify the charset to make indexer convert the parser's output to the proper one. For example, if your catdoc is configured to produce output in the windows-1251 charset but LocalCharset is koi8-r, use this command for parsing MS Word documents:
    Mime application/msword "text/plain; charset=windows-1251" "catdoc -a $1"
3.9.6. DPS_URL environment variable
When executing a parser, indexer creates the DPS_URL environment variable with the URL being processed as its value. You can use this variable in parser scripts.
3.9.7. Some third-party parsers
3.9.8. libextractor library
DataparkSearch can be built with the libextractor library. Using this library, DataparkSearch can index keywords from files of the following formats: PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF.
To build DataparkSearch with the libextractor library, install the library, and then configure and compile DataparkSearch.
The relationship between libextractor keyword types and DataparkSearch section names is given below:
Table 3-1. Relationship between libextractor's keyword types and DataparkSearch section names
If a section name from the list above is not specified in sections.conf, the value of the corresponding keyword is written as
3.10. Other commands used in indexer.conf
3.10.1. Include command
You may include another configuration file at any place in indexer.conf using the Include <filename> command.
Absolute path, if <filename> starts with "/":
    Include /usr/local/dpsearch/etc/inc1.conf
Relative path otherwise:
    Include inc1.conf
3.10.2. DBAddr command
The DBAddr command is a URL-style database description. It specifies the options (type, host, database name, port, user and password) for connecting to an SQL database. It should be used before any other commands. You may specify several DBAddr commands; in this case DataparkSearch will merge the results from every database specified. The command has global effect for the whole config file. Format:
    DBAddr <Type>:[//[User[:Pass]@]Host[:Port]]/DBName/[?[dbmode=mode]{&<parameter name>=<parameter value>}]
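Because the DBAddr argument follows ordinary URL syntax, its components can be picked apart with a generic URL parser. The Python sketch below only illustrates how the parts of the format line map onto a concrete address; it is not how DataparkSearch itself parses the command, and the helper name parse_dbaddr and the returned dictionary layout are assumptions for the illustration.

```python
from urllib.parse import urlsplit, parse_qs


def parse_dbaddr(addr):
    """Split a DBAddr-style URL into its documented components.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(addr)
    return {
        "type": parts.scheme,            # <Type>, e.g. mysql or pgsql
        "user": parts.username,          # User
        "password": parts.password,      # Pass
        "host": parts.hostname,          # Host
        "port": parts.port,              # Port (None when omitted)
        "dbname": parts.path.strip("/"),  # DBName
        # dbmode and other parameters from the query string
        "params": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }


info = parse_dbaddr("mysql://foo:bar@localhost/dpsearch/?dbmode=single")
print(info["type"], info["host"], info["dbname"], info["params"])
```

The address used here is the example address from this section; only the parsing code is added.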
You may use CGI-like encoding for
Currently supported
MySQL and PostgreSQL users can specify the path to a Unix socket when connecting to localhost:
    mysql://foo:bar@localhost/dpsearch/?socket=/tmp/mysql.sock
If you are using PostgreSQL and do not specify a hostname, e.g. pgsql://user:password@/dbname/, then PostgreSQL will not work via TCP but will use the default Unix socket.
dbmode parameter. You may also select the database mode of word storage.
When "
stored parameter. Format: stored=StoredHost[:StoredPort]. This parameter specifies the host, and the port if given, where the stored daemon is running, if you plan to use document excerpts and cached copies.
cached parameter. Format: cached=CachedHost[:CachedPort]. Use cached at the given host, and port if specified. It is required for cache storage mode only (see Section 5.2). Each indexer will connect to cached at the given address at startup.
charset parameter. Format: charset=DBCharacterSet. This parameter can be used to specify the database connection charset. The charset specified by DBCharacterSet should be equal to the charset specified by the LocalCharset command.
label parameter. Format: label=DBAlabel. This parameter may be used to assign a label to the DBAddr command. So, if you pass
Example:
    DBAddr mysql://foo:bar@localhost/dpsearch/?dbmode=single
3.10.3. VarDir command
You may choose an alternative working directory for cache mode:
    VarDir /usr/local/dpsearch/var
3.10.4. NewsExtensions command
Whether to enable news extensions. Default value is no.
    NewsExtensions yes
3.10.5. SyslogFacility command
This is used if DataparkSearch was compiled with syslog support and you don't like the default value. The argument is the same as used in the syslog.conf file. For the list of possible facilities, see syslog.conf(5).
    SyslogFacility local7
3.10.6. Word length commands
You may change the default length range of words stored in the database. By default, words with a length in the range from 1 to 32 are stored.
    MinWordLength 1
    MaxWordLength 32
3.10.7. MaxDocSize command
This command specifies the maximal document size. Default value is 1048576 (1 megabyte). Takes global effect for the whole config file.
    MaxDocSize 1048576
3.10.8. MinDocSize command
This command is used to only check URLs with a content size less than the specified value. Default value is 0. Takes global effect for the whole config file.
    MinDocSize 1024
3.10.9. IndexDocSizeLimit command
Use this command to specify the maximal amount of data stored in the index per document. Default value is 0, which means no limit. Takes effect until the next IndexDocSizeLimit command.
    IndexDocSizeLimit 65536
3.10.10. URLSelectCacheSize command
Select the number of targets to index at once. Default value is 1024.
    URLSelectCacheSize 10240
3.10.11. URLDumpCacheSize command
Select this number of URLs at once when writing cache mode indexes, preloading URL data or calculating the Popularity Rank. Default value is 100000.
    URLDumpCacheSize 10240
3.10.12. UseCRC32URLId command
Switch on or off the ID generation for URLs using HASH32. Default value is "no".
    UseCRC32URLId yes
Switching it on speeds up indexing a bit, but a small number of collisions is possible.
3.10.13. HTTPHeader command
You may add desired headers to indexer's HTTP request.
You should not use the "If-Modified-Since" and "Accept-Charset" headers; these headers are composed by indexer itself. The "User-Agent: DataparkSearch/version" header is sent too, but you may override it. The command has global effect for the whole configuration file.

HTTPHeader "User-Agent: My_Own_Agent"
HTTPHeader "Accept-Language: ru, en"
HTTPHeader "From: webmaster@mysite.com"

3.10.14. Allow command

Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to allow URLs that match (or do not match) the given argument. The first three optional parameters describe the type of comparison. Default values are Match, NoCase, String.
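As a rough illustration of how the three optional parameters combine, here is a minimal Python sketch. It is not DataparkSearch code: the function names are hypothetical, and it assumes that a String pattern must cover the whole URL with '*' standing for several characters and '?' for one (the wildcard meanings this manual gives for string matches), while a Regex pattern may match anywhere in the URL.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '*' / '?' wildcard pattern into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # '*' stands for several characters
        elif ch == "?":
            parts.append(".")    # '?' stands for exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def rule_applies(url, arg, match=True, case=False, kind="String"):
    """Return True if a rule with these comparison options selects the URL.

    match=False models NoMatch, case=True models Case, and kind is
    "String" or "Regex". Defaults mirror Match, NoCase, String.
    """
    flags = 0 if case else re.IGNORECASE
    if kind == "String":
        # Assumption: wildcard patterns are compared against the whole URL.
        found = re.fullmatch(wildcard_to_regex(arg), url, flags | re.DOTALL) is not None
    else:
        # Assumption: regex patterns may match anywhere in the URL.
        found = re.search(arg, url, flags) is not None
    return found if match else not found
```

With the defaults, `rule_applies("http://example.com/index.HTM", "*.htm")` is true because the comparison ignores case; passing `case=True` makes the same call false.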
Examples:

# Allow everything:
Allow *
# Allow everything but .php .cgi .pl extensions case insensitively using regex:
Allow NoMatch Regex \.php$|\.cgi$|\.pl$
# Allow .HTM extension case sensitively:
Allow Case *.HTM

3.10.15. Disallow command

Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to disallow URLs that match (or do not match) the given argument. The meaning of the first three optional parameters is exactly the same as for the Allow command. You can use several arguments for one Disallow command. Takes global effect for the config file.

Examples:

# Disallow URLs that are not in udm.net domains using "string" match:
Disallow NoMatch *.udm.net/*
# Disallow any except known extensions and directory index using "regex" match:
Disallow NoMatch Regex \/$|\.htm$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
# Exclude cgi-bin and non-parsed-headers using "string" match:
Disallow */cgi-bin/* *.cgi */nph-*
# Exclude anything with '?' sign in URL. Note that '?' sign has a
# special meaning in "string" match, so we have to use "regex" match here:
Disallow Regex \?

3.10.16. CheckOnly command

CheckOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Indexer will use the HEAD instead of the GET HTTP method for URLs that match (or do not match) the given expressions. This means that the file is only checked for existence and is not downloaded. Useful for zip, exe, arj and other binary files. Note that you can disallow those files with commands given below. You may use several arguments for one CheckOnly command. Useful, for example, for searching through the URL names rather than the contents (a la FTP-search). Takes global effect for the config file.
Examples:

# Check some known non-text extensions using "string" match:
CheckOnly *.b *.sh *.md5
# or check ANY except known text extensions using "regex" match:
CheckOnly NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$

3.10.17. HrefOnly command

HrefOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Use this command to scan an HTML page for "href" attributes of tags without indexing the contents of the page, for URLs that match (or do not match) the given argument. The command has global effect for the whole configuration file.

When indexing large mailing list archives, for example, the index and thread index pages (like mail.10.html, thread.21.html, etc.) should be scanned for links but should not be indexed:

HrefOnly */mail*.html */thread*.html

3.10.18. CheckMp3 command

CheckMp3 [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer will download only a small part of the document and try to find MP3 tags in it. On success, indexer will parse the MP3 tags; otherwise it will download the whole document and parse it as usual.

Note: this works only with servers which support the HTTP/1.1 protocol. The "Range: bytes" header is used to download the MP3 tag.

CheckMp3 *.bin *.mp3

3.10.19. CheckMp3Only command

CheckMP3Only [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer, as with the CheckMP3 command, will download only a small part of the document and try to find MP3 tags. On success, indexer will parse the MP3 tags; otherwise it will NOT download the whole document.

CheckMP3Only *.bin *.mp3

3.10.20. IndexIf command

IndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to allow indexing if the value of <section> matches the given <arg>.

Example:

IndexIf regex Title Manual
IndexIf body "*important detail*"

3.10.21. NoIndexIf command

NoIndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to disallow indexing if the value of <section> matches the given <arg>.

Example:

NoIndexIf regex Title Sex
NoIndexIf body *xxx*

3.10.22. HoldBadHrefs command

HoldBadHrefs <time>

How much time to hold URLs with erroneous status before deleting them from the database. For example, if a host is down, indexer will not delete pages from this site immediately and search will use the previous content of these pages. However, if the site does not respond for a month, it is probably time to remove these pages from the database. For <time> format see description of Period command in Section 3.10.26.

HoldBadHrefs 30d

3.10.23. DeleteOlder command

DeleteOlder <time>

How much time to hold URLs before deleting them from the database. For example, when indexing news sites, you may automatically delete old news articles after the specified period. For <time> format see description of Period command in Section 3.10.26. Default value is 0, which means "do not check". You may specify several DeleteOlder commands, for example, one for every Server command.

DeleteOlder 7d

3.10.24. UseRemoteContentType command

UseRemoteContentType yes/no

This command specifies whether the indexer should get the content type from the HTTP server headers (yes) or from its AddType settings (no). If set to "no" and the indexer could not determine the content type from its AddType settings, it will use the HTTP header. Default: yes

UseRemoteContentType yes

3.10.25. AddType command

AddType [String|Regex] [Case|NoCase] <mime type> <arg> [<arg>...]

This command associates filename extensions (for services that don't automatically include them) with their mime types. Currently the "file:" protocol uses these commands. Use the optional first two parameters to choose the comparison type.
Default type is "String" "Case" (case insensitive string match with '?' and '*' wildcards for one and several characters respectively).

AddType image/x-xpixmap *.xpm

3.10.26. Period command

Period <time>

Set the reindex period. <time> is in the form 'xxxA[yyyB[zzzC]]' (spaces are allowed between xxx and A and yyy and so on), where xxx, yyy, zzz are numbers (can be negative!). A, B, C can be one of the following:

s - second
M - minute
h - hour
d - day
m - month
y - year

(these letters are the same as in strptime/strftime functions).

Examples:

15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and six months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any character, it is assumed that the time is given in seconds.

Can be set many times before the Server command and takes effect until the end of the config file or until the next Period command.

Period 7d

3.10.27. PeriodByHops command

PeriodByHops <hops> [ <time> ]

Set the reindex period on a per-<hops> basis. The format for <time> is the same as for Period. Can be set many times before the Server command and takes effect until the end of the config file or until the next PeriodByHops command with the same <hops> value. If the <time> parameter is omitted, this undefines the previously defined value.

If no PeriodByHops command is specified for a given <hops> value, the value defined in the Period command is used.

3.10.28. ExpireAt command

ExpireAt [ A [ B [ C [ D [ E ]]]]]

This command allows specifying the exact expiration time for documents. May be specified on a per Server/Realm basis and takes effect until the end of the config file or until the next ExpireAt command. ExpireAt specified without any arguments disables the previously specified value.

A - stands for minute, may be * or 0-59;
B - stands for hour, may be * or 0-23;
C - stands for day of month, may be * or 1-31;
D - stands for month, may be * or 1-12;
E - stands for day of week, may be * or 0-6, 0 is Sunday.
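To make the five ExpireAt fields concrete, here is a minimal Python sketch that checks whether a given moment satisfies a field list. It only illustrates the field meanings listed above; how indexer actually schedules expiration is not shown here. The function names are hypothetical, and the sketch assumes that each field is either * or a single number and that omitted trailing fields behave like *.

```python
from datetime import datetime


def field_matches(spec, value):
    """A field matches if it is '*' or equals the value as an integer."""
    return spec == "*" or int(spec) == value


def expire_at_matches(args, moment):
    """Check a moment against 'A B C D E' (minute hour mday month wday)."""
    fields = args.split()
    if len(fields) > 5:
        raise ValueError("ExpireAt takes at most five fields")
    # Assumption: omitted trailing fields are treated like '*'.
    fields += ["*"] * (5 - len(fields))
    minute, hour, mday, month, wday = fields
    # Python counts Monday as 0; the manual counts Sunday as 0.
    sunday_based_wday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(mday, moment.day)
        and field_matches(month, moment.month)
        and field_matches(wday, sunday_based_wday)
    )
```

For instance, `expire_at_matches("30 4 * * 0", datetime(2009, 3, 1, 4, 30))` is true under these assumptions, since 1 March 2009 was a Sunday.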
The ExpireAt command has higher priority than the Period or PeriodByHops command.

3.10.29. UseDateHeader command

UseDateHeader yes|no

Use the Date header if no Last-Modified header is sent by the remote web server. Default value: no.

3.10.30. MaxHops command

MaxHops <number>

Maximum distance in "mouse clicks" from the start URL. Default value is 256. Can be set multiple times before the "Server" command and takes effect until the end of the config file or until the next MaxHops command.

MaxHops 256

3.10.31. TrackHops command

TrackHops yes|no

This command enables or disables hops tracking when reindexing. Default value is no. If enabled, the hops value for a URL is recalculated when reindexing. Otherwise the hops value is calculated only once, when the URL is inserted into the database.

TrackHops yes

3.10.32. MaxDepth command

MaxDepth <number>

Maximum directory depth of a URL. Default value is 16. Can be set multiple times before the "Server" command and takes effect until the end of the config file or until the next MaxDepth command.

MaxDepth 2

3.10.33. MaxDocsPerServer command

MaxDocsPerServer <number>

Limits the number of documents retrieved from a Server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of pages will be indexed from one server during this run of indexer. Can be set multiple times before the Server command and takes effect until the end of the config file or until the next MaxDocsPerServer command.

MaxDocsPerServer 100

3.10.34. MaxHrefsPerServer command

MaxHrefsPerServer <number>

Limits the number of hrefs accepted from a Server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of hrefs will be picked up from one server during this run of indexer. Can be set multiple times before the Server command and takes effect until the end of the config file or until the next MaxHrefsPerServer command.

MaxHrefsPerServer 100

3.10.35. MaxNetErrors command

MaxNetErrors <number>

Maximum network errors for each server.
Default value is 16. Use 0 for an unlimited number of errors. If there are too many network errors on some server (server is down, host unreachable, etc.), indexer will make no more than <number> attempts on that server.

MaxNetErrors 16

3.10.36. ReadTimeOut command

ReadTimeOut <time>

Connect timeout and stalled connections timeout.
For <time> format see description of Period command in Section 3.10.26.

ReadTimeOut 30s

3.10.37. DocTimeOut command

DocTimeOut <time>

Maximum amount of time indexer spends downloading one document.
For <time> format see description of Period command in Section 3.10.26.

DocTimeOut 1m30s

3.10.38. NetErrorDelayTime command

NetErrorDelayTime <time>

Specifies the document processing delay time if a network error has occurred.
For <time> format see description of Period command in Section 3.10.26.

NetErrorDelayTime 1d

3.10.39. Cookies command

Cookies yes/no

Enables or disables the support for HTTP cookies. The command may be used several times before the Server command and takes effect until the end of the config file or until the next Cookies command. Default value is "no".

Cookies yes

3.10.40. Robots command

Robots yes/no

Allows or disallows using robots.txt and <META NAME="robots" ...> exclusions.

Robots yes

3.10.41. RobotsPeriod command

By default, robots.txt data is held in the SQL database for one week. You may change this period using the RobotsPeriod command:

RobotsPeriod <time>

For <time> format see description of Period command in Section 3.10.26.

RobotsPeriod 30d

3.10.42. CrawlDelay command

Use this command to specify the default pause in seconds between consecutive fetches from the same server. This is similar to the crawl-delay command in the robots.txt file, but can be specified in the indexer.conf file on a per-server basis. If no crawl-delay value is specified in robots.txt, the value of CrawlDelay is used. If crawl-delay is specified in robots.txt, then the maximum of CrawlDelay and crawl-delay is used as the interval between consecutive fetches.

3.10.43. Section command

Section <string> <number> <maxlen> [strict] [ <pattern> <replacement> ]

# Standard HTML sections: body, title
Section body 1 256
Section title 2 128
# strict tokenization for URL
Section url 3 0 strict
# regex-pattern for a section
Section GoodName 4 128 "<h1>([^<]*)</h1>" "<b>GoodName:</b> $1"

3.10.44. HrefSection command

HrefSection <string> [ <pattern> <replacement> ]

# Standard HTML sections: body, title
HrefSection link
HrefSection NewLink "<newlink>([^<]*)</newlink>" "$1"

3.10.45. FastHrefCheck command

The "FastHrefCheck yes" command is useful to speed up indexing when you have a huge list of Server/Realm/Subnet commands, as it disables checking hrefs against the server list during parsing.

3.10.46. Index command

Index yes/no

Use "Index no" to prevent indexer from storing words into the database. Useful, for example, for link validation. Can be set multiple times before the Server command and takes effect until the end of the config file or until the next Index command. Default value is "yes".

Index no

3.10.47. ProxyAuthBasic command

ProxyAuthBasic login:passwd

Use HTTP proxy basic authorization. Can be used before every Server command and takes effect only for the next Server command!
It should also be placed before the Proxy command.

Examples:

ProxyAuthBasic somebody:something

3.10.48. Proxy command

Proxy your.proxy.host[:port]

Use a proxy rather than connecting directly. One can index ftp servers when using a proxy. The default port value, if not specified, is 3128 (Squid). If the proxy host is not specified, a direct connection will be used. Can be set before every Server command and takes effect until the end of the config file or until the next Proxy command. If no Proxy command is specified, indexer will use a direct connection.

Examples:

# Proxy on atoll.anywhere.com, port 3128:
Proxy atoll.anywhere.com
# Proxy on lota.anywhere.com, port 8090:
Proxy lota.anywhere.com:8090
# Disable proxy (direct connect):
Proxy

3.10.49. AuthBasic command

AuthBasic login:passwd

Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command!

Examples:

AuthBasic somebody:something

# If you have password protected directory(-ies), but whole server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/

3.10.50. ServerWeight command

ServerWeight <number>

Server weight for Popularity Rank calculation (see Section 8.5.3). Default value is 1.

ServerWeight 1

3.10.51. OptimizeAtUpdate command

OptimizeAtUpdate yes

Specifies the word index optimization strategy. Default value: no. If enabled, this saves disk space but slows down indexing. May be placed in indexer.conf and cached.conf.

3.10.52. SkipUnreferred command

SkipUnreferred yes|no|del

Default value: no. Use this command to skip reindexing of, or to delete, unreferred documents. An unreferred document is a document with no links to it. This command requires the links collection to be enabled (see Section 8.5.3).

3.10.53. Bind command

Bind 127.0.0.1

You may use this command to specify the local IP address if your system has several network interfaces.

3.10.54. ProvideReferer command

ProvideReferer yes

Use this command to provide the Referer: request header for HTTP and HTTPS connections.

3.10.55. LongestTextItems command

LongestTextItems 4

Use this command to specify the number of longest text items to index.

3.10.56. MakePrefixes command

With the MakePrefixes yes command you can instruct indexer to automatically produce all prefixes of the words indexed. This is suitable, for example, for making search suggestions.

3.11. Extended indexing features

3.11.1. Indexing SQL database tables (htdb: virtual URL scheme)

DataparkSearch can index SQL database text fields using the so-called htdb: virtual URL scheme.

Using the htdb:/ virtual scheme you can build a full text index of your SQL tables as well as index your database driven WWW server.
3.11.1.1. HTDB indexer.conf commands

Five indexer.conf commands provide HTDB. They are HTDBAddr, HTDBList, HTDBLimit, HTDBDoc and HTDBText.

HTDBAddr is used to specify the database connection. Its syntax is identical to that of the DBAddr command.

HTDBList is an SQL query to generate the list of all URLs which correspond to records in the table using the PRIMARY key field. You may use either absolute or relative URLs in the HTDBList command. For example:

HTDBList "SELECT concat('htdb:/',id) FROM messages"
or
HTDBList "SELECT id FROM messages"

The HTDBLimit command may be used to specify the maximal number of records in one SELECT operation. It allows reducing memory usage when indexing big data tables. For example:

HTDBLimit 512

HTDBDoc is a query to get only a certain record from the database using the PRIMARY key value.

The HTDBList SQL query is used for all URLs which end with the '/' sign. For other URLs the SQL query given in HTDBDoc is used.
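The choice between the two queries can be sketched in a few lines of Python. This is an illustration only, not the indexer's implementation: the function names are hypothetical, and the placeholder filling follows the $1, $2, ... $n rule given in the HTDB variables section of this manual. Leaving a placeholder untouched when the URL has too few PATH parts is an assumption, and no SQL escaping is attempted.

```python
import re


def choose_htdb_query(url, htdb_list, htdb_doc):
    """URLs ending with '/' use the HTDBList query, all others HTDBDoc."""
    return htdb_list if url.endswith("/") else htdb_doc


def fill_path_parts(query, url, scheme="htdb:/"):
    """Replace $1, $2, ... $n in the query with the PATH parts of the URL."""
    path = url[len(scheme):] if url.startswith(scheme) else url
    parts = [part for part in path.split("/") if part]

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if 1 <= index <= len(parts):
            return parts[index - 1]
        # Assumption: a placeholder without a matching part is left as is.
        return placeholder.group(0)

    return re.sub(r"\$(\d+)", substitute, query)
```

For example, `fill_path_parts("SELECT id FROM catalog WHERE category='$1'", "htdb:/cars/")` yields the query with `category='cars'`.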
If HTDBDoc returns no result, or the query returns several records, the HTDB retrieval system generates "HTTP 404 Not Found". This may happen at reindex time if the record was deleted from your table since the last reindexing. You may use HoldBadHrefs 0 to delete such records from the DataparkSearch tables as well.

You may use several HTDBDoc/List commands in one indexer.conf with corresponding Server commands.

HTDBText <section> is a query to get raw text data from the database using the PRIMARY key value collected via the HTDBList command. The <section> parameter specifies the name of the section used for storing this data. This query may return as many rows as required. You may specify several HTDBText commands per Server or Realm command.

DBAddr mysql://foo:bar@localhost/database/?dbmode=single
HTDBAddr mysql://foofoo:barbar@localhost/database/
HTDBList "SELECT DISTINCT topic_id FROM messages"
HTDBText body "SELECT raw_text\
 FROM messages WHERE topic_id='$1'"
Server htdb:/

It is possible to specify both HTDBDoc and HTDBText commands for one Server or Realm command. HTDBText commands are processed first.

3.11.1.2. HTDB variables

You may use PATH parts of the URL as parameters of both HTDBList and HTDBDoc SQL queries. All parts are to be used as $1, $2, ... $n, where the number is the number of the PATH part:

htdb:/part1/part2/part3/part4/part5
      $1    $2    $3    $4    $5

For example, you have this indexer.conf command:

HTDBList "SELECT id FROM catalog WHERE category='$1'"

When the htdb:/cars/ URL is indexed, $1 will be replaced with 'cars':

SELECT id FROM catalog WHERE category='cars'

You may use long URLs to provide several parameters to both HTDBList and HTDBDoc queries. For example, htdb:/path1/path2/path3/path4/id with query:

HTDBList "SELECT id FROM table WHERE field1='$1' AND field2='$2' and field3='$3'"

This query will generate the following URLs:

htdb:/path1/path2/path3/path4/id1
...
htdb:/path1/path2/path3/path4/idN

for all values of the field "id" which are in the HTDBList output.

3.11.1.3. Creating full text index

Using the htdb:/ scheme you can create a full text index and use it further in your application. Let's imagine you have a big SQL table which stores, for example, web board messages in plain text format. You also want to build an application with a message search facility. Let's say messages are stored in the "messages" table with two fields, "id" and "msg". "id" is an integer primary key and the big text field "msg" contains the messages themselves. A usual SQL LIKE search may take a long time to answer:

SELECT id, msg FROM messages WHERE msg LIKE '%someword%'

Using the DataparkSearch htdb: scheme you can create a full text index on the "messages" table. Install DataparkSearch in the usual way. Then edit your indexer.conf:

DBAddr mysql://foo:bar@localhost/search/?dbmode=single
HTDBAddr mysql://foofoo:barbar@localhost/database/
HTDBList "SELECT id FROM messages"
HTDBDoc "SELECT concat(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) \
FROM messages WHERE id='$1'"
Server htdb:/

After start, indexer will insert the 'htdb:/' URL into the database and will run the SQL query given in HTDBList. It will produce the values 1, 2, 3, ..., N as the result. Those values will be considered as links relative to the 'htdb:/' URL. A list of new URLs in the form htdb:/1, htdb:/2, ... , htdb:/N will be added into the database.
Then the HTDBDoc SQL query will be executed for each new URL. HTDBDoc will produce an HTTP document for each record in the form:

HTTP/1.0 200 OK
Content-Type: text/plain

<some text from 'msg' field here>

This document will be used to create the full text index using words from the 'msg' field. Words will be stored in the 'dict' table, assuming that we are using the 'single' storage mode.

After indexing you can use the DataparkSearch tables to perform a search:

SELECT url.url FROM url,dict WHERE dict.url_id=url.rec_id AND dict.word='someword';

As the DataparkSearch 'dict' table has an index on the 'word' field, this query will be executed much faster than queries which use SQL LIKE search on the 'messages' table.

You can also use several words in a search:

SELECT url.url, count(*) as c
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;
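The multi-word query can be tried out on a toy copy of the two tables. The following Python sketch uses an in-memory SQLite database purely for illustration; the table layout is reduced to the columns the query touches, and the helper names and the whitespace tokenizer are assumptions rather than how indexer fills 'url' and 'dict'. It also strips the leading 'htdb:/' from each result to recover the primary key.

```python
import sqlite3


def build_toy_index(docs):
    """Create minimal 'url' and 'dict' tables from {url: text} pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute("CREATE TABLE dict (url_id INTEGER, word TEXT)")
    conn.execute("CREATE INDEX dict_word ON dict (word)")
    for rec_id, (url, text) in enumerate(docs.items(), start=1):
        conn.execute("INSERT INTO url VALUES (?, ?)", (rec_id, url))
        # Assumption: one row per distinct lower-cased word per document.
        for word in sorted(set(text.lower().split())):
            conn.execute("INSERT INTO dict VALUES (?, ?)", (rec_id, word))
    return conn


def search(conn, words):
    """Rank documents by how many of the given words they contain."""
    marks = ",".join("?" for _ in words)
    rows = conn.execute(
        "SELECT url.url, count(*) AS c FROM url, dict "
        "WHERE dict.url_id = url.rec_id AND dict.word IN (" + marks + ") "
        "GROUP BY url.url ORDER BY c DESC",
        list(words),
    ).fetchall()
    prefix = "htdb:/"
    # Cut the leading 'htdb:/' to get the primary key of the source table.
    return [(url[len(prefix):], count) for url, count in rows]
```

Calling `search(conn, ["some", "word"])` on such a toy index returns pairs of primary key and match count, with documents containing both words listed first.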
Both queries will return 'htdb:/XXX' values in the url.url field. Your application then has to cut the leading 'htdb:/' from those values to get the PRIMARY key values of your 'messages' table.

3.11.1.4. Indexing SQL database driven web server

You can also use the htdb:/ scheme to index your database driven WWW server. It allows creating indexes without having to invoke your web server while indexing, so it is much faster and requires less CPU resources than direct indexing from the WWW server.

The main idea of indexing a database driven web server is to build the full text index in the usual way. The only difference is that search must produce real URLs instead of URLs in the 'htdb:/...' form. This can be achieved using the DataparkSearch aliasing tools.

The HTDBList command generates URLs in the form:

http://search.mnogo.ru/board/message.php?id=XXX

where XXX is a primary key value of the "messages" table.

For each primary key value the HTDBDoc command generates a text/html document with HTTP headers and content like this:

<HTML>
<HEAD>
<TITLE> ... subject field here .... </TITLE>
<META NAME="Description" Content=" ... author here ...">
</HEAD>
<BODY> ... message text here ... </BODY>

At the end of doc/samples/htdb.conf we wrote three commands:

Server htdb:/
Realm http://search.mnogo.ru/board/message.php?id=*
Alias http://search.mnogo.ru/board/message.php?id= htdb:/

The first command tells indexer to execute the HTDBList query, which will generate a list of messages in the form:

http://search.mnogo.ru/board/message.php?id=XXX

The second command allows indexer to accept such message URLs using a string match with the '*' wildcard at the end.

The third command replaces the "http://search.mnogo.ru/board/message.php?id=" substring in the URL with "htdb:/" when indexer retrieves documents with messages. It means that "http://search.mnogo.ru/board/message.php?id=xxx" URLs will be shown in the search result, but the "htdb:/xxx" URL will be indexed instead, where xxx is the PRIMARY key value, the ID of the record in the "messages" table.

3.11.2. Indexing binaries output (exec: and cgi: virtual URL schemes)

DataparkSearch supports the exec: and cgi: virtual URL schemes. They allow running an external program. This program must write its result to stdout. The result must conform to the HTTP standard, i.e. an HTTP response header followed by the document's content.

For example, when indexing both cgi:/usr/local/bin/myprog and exec:/usr/local/bin/myprog, indexer will execute the /usr/local/bin/myprog program.

3.11.2.1. Passing parameters to cgi: virtual scheme

When executing a program given in the cgi: virtual scheme, indexer emulates the program running under an HTTP server. It creates the REQUEST_METHOD environment variable with the "GET" value and the QUERY_STRING variable according to HTTP standards. For example, if cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e is being indexed, indexer creates QUERY_STRING with the a=b&d=e value. The cgi: virtual URL scheme allows indexing your site without having to invoke web servers even if you want to index CGI scripts. For example, you have a web site with static documents under /usr/local/apache/htdocs/ and with CGI scripts under /usr/local/apache/cgi-bin/. Use the following configuration:

Server http://localhost/
Alias http://localhost/cgi-bin/ cgi:/usr/local/apache/cgi-bin/
Alias http://localhost/ file:/usr/local/apache/htdocs/

3.11.2.2. Passing parameters to exec: virtual scheme

indexer does not create the QUERY_STRING variable as in the cgi: scheme. It creates a command line with the argument given in the URL after the ? sign. For example, when indexing exec:/usr/local/bin/myprog?a=b&d=e, this command will be executed:

/usr/local/bin/myprog "a=b&d=e"

3.11.2.3. Using exec: virtual scheme as an external retrieval system

The exec: virtual scheme can be used as an external retrieval system. It allows using protocols which are not supported natively by DataparkSearch. For example, you can use the curl program, which is available from http://curl.haxx.se/, to index HTTPS sites.
Put this short script into /usr/local/dpsearch/bin/ under the name curl.sh.

#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null

This script takes a URL given as a command line argument and executes the curl program to download it. The -i argument tells curl to output the result together with HTTP headers.

Now use these commands in your indexer.conf:

Server https://some.https.site/
Alias https:// exec:/usr/local/dpsearch/bin/curl.sh?https://

When indexing https://some.https.site/path/to/page.html, indexer will translate this URL to

exec:/usr/local/dpsearch/bin/curl.sh?https://some.https.site/path/to/page.html

execute the curl.sh script:

/usr/local/dpsearch/bin/curl.sh "https://some.https.site/path/to/page.html"

and take its output.

3.11.3. Mirroring

You may specify a path to a root directory to enable mirroring of sites:

MirrorRoot /path/to/mirror

You may also specify a root directory for the headers of mirrored documents; indexer will then store HTTP headers to the local disk too.

MirrorHeadersRoot /path/to/headers

You may specify a period during which earlier mirrored files will be used while indexing instead of real downloading.

MirrorPeriod <time>

This is very useful when you do some experiments with DataparkSearch, indexing the same hosts, and do not want much traffic from/to the Internet. If MirrorHeadersRoot is not specified and headers are not stored to the local disk, then the default Content-Types given in AddType commands will be used. The default value of MirrorPeriod is -1, which means do not use mirrored files.

<time> is in the form xxxA[yyyB[zzzC]] (spaces are allowed between xxx and A and yyy and so on), where xxx, yyy, zzz are numbers (can be negative!).
A, B, C can be one of the following:

s - second
M - minute
h - hour
d - day
m - month
y - year

(these letters are the same as in strptime/strftime functions)

Examples:

15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and six months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any character, it is assumed that the time is given in seconds (this behavior is for compatibility with versions prior to 3.1.7).

The following command will force using local copies for one day:

MirrorPeriod 1d

If your pages are already indexed, when you re-index with -a, indexer will check the headers and only download files that have been modified since the last indexing. Thus, all pages that are not modified will not be downloaded and therefore not mirrored either. To create the mirror you need to either (a) start again with a clean database or (b) use the -m switch.

You can actually use the created files as a full featured mirror of your site. However, be careful: indexer will not download a document that is larger than MaxDocSize. If a document is larger, it will be only partially downloaded. If your site has no large documents, everything will be fine.

3.11.4. Data acquisition

With the ActionSQL command you can execute SQL queries with document related data while indexing. The syntax of the ActionSQL command is as follows:

ActionSQL <section> <pattern> <sql-template> [<dbaddr>]

where <section> is the name of the document section to check for a match of the regex pattern <pattern>. If a match is found, the <sql-template> is filled with the regex meta-variables $1-$9 as well as with search template meta-variables (for example, $(Title), $(Last-Modified), etc.) to form an SQL query, which is executed in the first DBAddr defined in the indexer.conf file. If the optional <dbaddr> parameter of the ActionSQL command is set, a new connection is made according to this DBAddr and the SQL query is executed in this connection.
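The template filling step can be sketched in Python as follows. This is only an approximation of the behaviour described above, not DataparkSearch code: the function name is hypothetical, Python's `re` module stands in for whatever regex flavour indexer uses, meta-variable names are looked up exactly as written, unknown variables become empty strings, and values are inserted without any SQL escaping.

```python
import re

# Matches either $1..$9 or $(Name) in an SQL template.
PLACEHOLDER = re.compile(r"\$(?:([1-9])|\(([^)]+)\))")


def fill_sql_template(pattern, template, section_text, doc_vars):
    """Return the filled SQL query, or None if the pattern does not match."""
    match = re.search(pattern, section_text)
    if match is None:
        return None

    def substitute(placeholder):
        if placeholder.group(1) is not None:
            index = int(placeholder.group(1))
            # Assumption: groups missing from the pattern become empty strings.
            if index <= match.re.groups:
                return match.group(index) or ""
            return ""
        # Assumption: unknown meta-variables become empty strings.
        return doc_vars.get(placeholder.group(2), "")

    return PLACEHOLDER.sub(substitute, template)
```

Given a pattern with four capture groups and a template containing `'+7$1$2$3$4'` and `'$(title)'`, the function joins the captured digit groups into one phone number and substitutes the page title from `doc_vars`.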
Thus you can use ActionSQL commands to mine and collect data from pages while indexing. For example, the following command collects phone numbers (in Russian local notation) along with the titles of the pages where these phone numbers have been discovered:

ActionSQL body "\(([0-9]{3})\)[ ]*([0-9]{3})[- \.]*([0-9]{2})[- \.]*([0-9]{2})" "INSERT INTO phonedata(phone,title)VALUES('+7$1$2$3$4','$(title)')"

3.12. Using syslog

DataparkSearch indexer uses syslog to log its messages. Different verbose levels can be specified with the -v option or by the LogLevel command in config files:

Table 3-2. Verbose levels
You may use the -l option to suppress logging to stdout/stderr when running indexer via crontab. Without the -l option the log is sent both to stdout/stderr and to log files. If you do not like such behavior, run configure with the --disable-syslog flag and recompile indexer. Compiled without syslog support, indexer uses only stdout/stderr.

Syslog uses different facilities to separate log messages. The indexer's default facility is LOCAL7. The facility can be changed during configure with the --enable-syslog=LOG_FACILITY option. LOG_FACILITY should be one of the standard facilities, usually listed in the /usr/include/sys/syslog.h header file.

The facility helps to separate DataparkSearch messages from others. You can modify /etc/syslog.conf to tell syslog how to treat DataparkSearch messages. For example:

# Log all messages from DataparkSearch to separate file
local7.* -/var/log/DataparkSearch.log

Another example:

# Send all DataparkSearch messages to host named central
# Syslog on central should be configured to allow this
local7.* @central

By default all messages are logged to /var/log/messages as well. DataparkSearch could populate this file with a large number of messages. To avoid this, add local7.none or local7.!* (ignore any messages from local7 facility) to your 'catch-all' log files. For example:

#
# Some `catch-all' logfiles.
#
*.=info;*.=notice;*.=warn;\
auth,authpriv.none;\
cron,daemon.none;\
mail,news.none;\
local7.!* -/var/log/messages
Please take a look at the syslogd(8) and syslog.conf(5) man pages for more detailed information about syslog and its configuration.

3.13. Storing compressed document copies

In DataparkSearch it is possible to store compressed copies of indexed documents. Copies are stored and retrieved by the stored daemon, which is installed into the sbin directory of the DataparkSearch installation (default: /usr/local/dpsearch/sbin).

To enable archiving of document copies without using stored, place the DoStore yes command in your indexer.conf file instead of configuring the stored daemon.

Stored document copies are retrieved by means of the storedoc.cgi CGI script. It requests a saved copy of a document from stored, then the copy is displayed in the user's web browser with search keywords highlighted.

To enable stored support, compile DataparkSearch with zlib support:

./configure --with-zlib <other arguments>

You may use the Store and NoStore commands to allow or disallow storing certain files by pattern. The arguments of those commands are exactly the same as for the Allow command (see Section 3.10.14). All documents are stored by default if support for stored is enabled.

3.13.1. Configure stored

To start using stored, please do the following:
3.13.2. How stored works

After you have successfully configured stored, the indexer passes downloaded documents to the stored daemon. After that, stored compresses the received documents and saves them.

3.13.3. Using stored during search

To enable displaying stored documents during search, do the following:
This is how stored works during search, if everything configured correctly:
3.13.4. Document excerptsstored is also used to make documents excerpts for search results. You can use ExcerptSize command in search.htm template to specify average excerpt size in characters; value by default: 256. With ExcerptPadding command you can specify average number of characters is taken before and after a search word in excerpts; value by default: 40. With ExcerptMark command you can alter the marking character sequence which delimits excerpt chunks; value by default: " ... " (a space, a dots, a space). You may switch off document excerpts (but retain ability to show stored copies) with DoExcerpt no command in your search template. Chapter 4. DataparkSearch HTML parser4.1. Tag parserTag parser understands the following tag notation:
4.2. Special characters
indexer understands the following special HTML characters:
4.3. META tags
Indexer's HTML parser currently understands the following META tags. Note that "HTTP-EQUIV" may be used instead of "NAME" in all entries.
4.4. Links
The HTML parser understands the following links:
However, you can specify the list of HTML tags to be skipped in new href lookup with the SkipHrefIn command.
HrefSkipIn "img, link, script"

4.5. Comments
4.6. Body patterns
If you need to index only part of a page, for example to exclude navigation, ads, etc., you may use the BodyPattern command to specify a pattern that extracts the content of a page for indexing. For example:
BodyPattern "<!--content-->(.*)<!--/content-->" "$1"
This pattern extracts the content between the special comments, and only that content is indexed for the page.
You may specify several BodyPattern commands, but only the first match is applied to a page. These patterns are tried against all indexed pages. Beware: a huge number of body patterns may hurt indexing speed.

4.7. Sub-documents
The sub-documents are: frames, iframes and embedded objects (flash tubes in general); temporary redirects (often used to place cookies or to redirect to a page depending on the language preferences of the user); versions of the same page in different languages obtained with content negotiation.
The indexing of sub-documents is controlled by two commands. The SubDocLevel command sets the maximum nesting level of a sub-document to be indexed. The default value is 0, which disables sub-document indexing. The SubDocCnt command sets the maximum number of sub-documents to be indexed at all nesting levels (this command mainly prevents endless cycles of pages nested into each other). The default value is 5.

Chapter 5. Storing data
5.1. SQL storage types
5.1.1. General storage information
DataparkSearch stores every word found in any defined section of a document. The number of times a word appears in the document does not affect its weight. However, whether the word appears in more important parts of the document (title, description, etc.) is taken into account.

5.1.2. Various modes of words storage
DataparkSearch currently supports the following word storage modes: "single", "multi", "crc", "crc-multi", "cache". The default mode is "cache". The mode is selected with the dbmode parameter of the DBAddr command. Examples:
DBAddr mysql://localhost/search/?dbmode=single
DBAddr mysql://localhost/search/?dbmode=multi
DBAddr mysql://localhost/search/?dbmode=crc
DBAddr mysql://localhost/search/?dbmode=crc-multi

5.1.3. Storage mode - single
When "single" is specified, all words are stored in one table with the structure (url_id,word,weight), where url_id is the ID of the document, referenced by the rec_id field in the "url" table. The word field has a variable-length char(32) SQL type.

5.1.4. Storage mode - multi
If "multi" is selected, words are placed in 13 different tables depending on their lengths. The structures of these tables are the same as in "single" mode, but a fixed-length char type is used, which is usually faster in most databases. This usually makes "multi" mode faster than "single" mode.

5.1.5. Storage mode - crc
If "crc" mode is selected, DataparkSearch stores 32 bit integer word IDs calculated by the HASH32 algorithm instead of the words. This mode requires less disk space and is faster than the "single" and "multi" modes. DataparkSearch relies on the fact that HASH32 calculates nearly unique checksums for different words. According to our tests, only 250 pairs of words share the same HASH32 value in a list of about 1,600,000 unique words. Most of these pairs (>90%) contain at least one misspelled word. Word information is stored in the structure (url_id,word_id,weight), where word_id is the 32 bit integer ID calculated by the HASH32 algorithm. This mode is recommended for big search engines.

5.1.6. Storage mode - crc-multi
When "crc-multi" mode is selected, DataparkSearch stores HASH32 word IDs in several tables with the same structure as in "crc" mode, distributed by word length as in "multi" mode. This mode is usually the fastest and is recommended for big search engines.

5.1.7. SQL structure notes
Please note that we develop DataparkSearch with PostgreSQL as the back-end and often have no possibility to test each version with all the other supported databases.
So, if there is no table definition in the create/your_database directory, you may find the PostgreSQL definition for the same table and adapt it for your back-end. PostgreSQL table definitions are always up to date.

5.1.8. Additional features of non-CRC storage modes
"single" and "multi" modes support substring search. Since "crc" and "crc-multi" do not store the words themselves and use integer values generated by the HASH32 algorithm instead, substring search is not possible in these modes.

5.2. Cache mode storage
5.2.1. Introduction
The cache word storage mode is able to index and search quickly through several million documents.

5.2.2. Cache mode word indexes structure
The main idea of cache storage mode is that the word index and URL sorting information are stored on disk rather than in the SQL database. Full URL information, however, is kept in the SQL database (tables url and urlinfo). The word index is divided into the number of files specified by the WrdFiles command (default value is 0x300). URL sorting information is divided into the number of files specified by the URLDataFiles command (default value is 0x300).
The word index is located in files under the /var/tree directory of the DataparkSearch installation. URL sorting information is located in files under the /var/url directory of the DataparkSearch installation.

5.2.3. Cache mode tools
Two additional programs, cached and splitter, are used in cache mode indexing.
cached is a TCP daemon which collects word information from indexers and stores it on your hard disk. It can operate in two modes: as the old cachelogd daemon, which only logs data, and in the new mode, in which cachelogd and splitter functionality are combined.
splitter is a program that creates fast word indexes from the data collected by cached. Those indexes are used later in the search process.

5.2.4. Starting cache mode
To start "cache mode" follow these steps:
5.2.5. Optional usage of several splitters
splitter has two command line arguments, -f [first file] -t [second file], which limit the range of files used. If no parameters are specified, splitter processes all prepared files. You can limit the file range with the -f and -t keys, giving their parameters in hex notation. For example, splitter -f 000 -t A00 will create word indexes using files in the range from 000 to A00. These keys make it possible to run several splitters at the same time, which usually builds the indexes faster. For example, this shell script starts four splitters in the background:
#!/bin/sh
splitter -f 000 -t 3f0 &
splitter -f 400 -t 7f0 &
splitter -f 800 -t bf0 &
splitter -f c00 -t ff0 &

5.2.6. Using run-splitter script
There is a run-splitter script in the /sbin directory of the DataparkSearch installation. It helps to execute all three index building steps in sequence.
"run-splitter" has these two command line parameters:
run-splitter --hup --split
or a short version:
run-splitter -k -s
Each parameter activates the corresponding index building step. run-splitter executes all three steps of index building in the proper order:
In most cases just run the run-splitter script with all the -k -s arguments. Using those flags separately, each corresponding to one index building step, is rarely required.
run-splitter has optional parameters, -p=n and -v=m, to specify the pause in seconds after each log buffer update and the verbosity level respectively. n is the number of seconds (default value: 0), m is the verbosity level (default value: 4).

5.2.7. Doing search
To start using search.cgi in "cache mode", edit your search.htm template as usual and add "cache" as the value of the dbmode parameter of the DBAddr command.

5.2.8. Using search limits
To use search limits in cache mode, add the appropriate Limit command(s) to your indexer.conf (or cached.conf, if cached is used) and to search.htm or searchd.conf (if searchd is used).
Limit prm:type [SQL-Request [DBAddr]]
To use, for example, search limits by tag, by category and by site, add the following lines to search.htm or to indexer.conf (searchd.conf, if searchd is used).
Limit t:tag
Limit c:category
Limit site:siteid
where t is the name of the CGI parameter (&t=) for this constraint and tag is the type of constraint.
Instead of tag/category/siteid in the example above you can use any of the values from the table below:
Table 5-1. Cache mode predefined limit types
If the second, optional, parameter
Limit prm:strcrc32 "SELECT label, rec_id FROM labels" pgsql://u:p@localhost/sitedb/
where prm is the name of the limit and also the name of the CGI parameter used for this limit; strcrc32 is the type of limit, in this case a string. Instead of strcrc32 it is possible to use any of the following limit types:
Table 5-2. SQL-based cache mode limit types
With third, optional, parameter
It's possible to omit optional parameters
Limit prm:strcrc32

5.3. DataparkSearch performance issues
The cache mode is the fastest DataparkSearch storage mode. Use it if you need maximal search speed.
If your /var directory does not change after indexing has finished, you may disable file locking with the "ColdVar yes" command placed in search.htm (or in searchd.conf, if searchd is used). This saves the time otherwise spent on file locking.
Using the UseCRC32URLId yes command (see Section 3.10.12) speeds up indexing, but a small number of collisions is possible, especially on a large database.

5.3.1. searchd usage recommendation
If you plan to use ispell data, synonym or stopword lists, it is recommended to set up the searchd daemon to speed up searches (see Section 5.4). The searchd daemon preloads all these data and lists and holds them in memory. This reduces the average search query execution time.
Also, searchd can preload url info data (20 bytes per indexed URL) and cache mode limits (4 or 8 bytes per URL, depending on the limit type). This reduces the average search time.

5.3.2. Memory based filesystem (mfs) usage recommendation
If you use cache storage mode and have enough RAM, you may place the /usr/local/dpsearch/var directory on a memory based filesystem (mfs). This speeds up both indexing and searching.
If you do not have enough RAM to fit /usr/local/dpsearch/var, you may instead place any of the /usr/local/dpsearch/var/tree, /usr/local/dpsearch/var/url or /usr/local/dpsearch/var/store directories on a memory filesystem.

5.3.3. URLInfoSQL command
For dbmode cache, you may use the URLInfoSQL no command to disable storing URL Info in the SQL database. With this command, however, you will be unable to use limits by language and by Content-Type.

5.3.4. MarkForIndex command
By default, DataparkSearch marks all URLs selected for indexing as indexed for 4 hours. This prevents simultaneous indexing of the same URL by different running indexer instances. For huge installations, however, this feature can take some time to process. You may switch off this marking with "MarkForIndex no" in your indexer.conf file.

5.3.5. CheckInsertSQL command
By default, DataparkSearch tries to insert data into the SQL database regardless of whether it is already present there. On some systems this produces error log entries. To avoid such errors, you may enable an additional check that the data being inserted is new by specifying the CheckInsertSQL yes command in your indexer.conf.

5.3.6. MySQL performance
MySQL users may declare DataparkSearch tables with
With it indexes are processed only in memory and written to disk as a last resort, on the FLUSH TABLES command or at mysqld shutdown. This can take minutes, and an impatient user may kill -9 the MySQL server and break the index files by doing so. Another downside is that you should run myisamchk on these tables before you start mysqld, to ensure that they are okay in case something killed mysqld in the middle.
Because of this, we did not include this table option in the default table structure. However, as the key information can always be generated from the data, you should not lose anything by using

5.3.7. Post-indexing optimization
This article was supplied by Randy Winch
I have some performance numbers that some of you might find interesting. I'm using RH 6.2 with the 2.2.14-6.1.1 kernel update (allows files larger than 2 gig) and mysql 2.23.18-alpha. I have just indexed most of our site using mnoGoSearch 3.0.18:
mnoGoSearch statistics
Status Expired Total
-----------------------------
200 821178 2052579 OK
301 797 29891 Moved Permanently
302 3 3 Moved Temporarily
304 0 7 Not Modified
400 0 99 Bad Request
403 0 7 Forbidden
404 30690 100115 Not found
500 0 1 Internal Server Error
503 0 1 Service Unavailable
-----------------------------
Total 852668 2182703
I optimize the data by dumping it into a file using
The performance is wonderful. My favorite test is searching for "John Smith". The optimized database version takes about 13 seconds. The raw version takes about 73 seconds.
Search results: john : 620241 smith : 177096
Displaying documents 1-20 of total 128656 found

5.3.8. Asynchronous resolver library
Using c-ares, an asynchronous resolver library (dns/c-ares in the FreeBSD ports collection), allows each indexing thread to perform DNS queries without blocking. Please note that this also increases the number of concurrent queries to your DNS server.

5.4. SearchD support
5.4.1. Why using searchd
5.4.2. Starting searchd
To start using searchd:
To suppress output to stderr, use the -l option. The output will then go through syslog only (provided syslog support was not disabled during installation with --disable-syslog). If syslog is disabled, it is possible to direct stderr to a file:
/usr/local/dpsearch/sbin/searchd 2>/var/log/searchd.log &
searchd, just like indexer, can be given a configuration file as an argument, e.g. with a path relative to the /etc directory of the DataparkSearch installation:
searchd searchd1.conf
or with an absolute path:
searchd /usr/local/dpsearch/etc/searchd1.conf

5.5. Oracle notes
5.5.1.
5.5.1.1. Why Oracle?
Oracle is a powerful, tunable, scalable and reliable industrial RDBMS. It provides some functionality which is absent in simple freeware RDBMS like MySQL and PostgreSQL, such as: transaction support, concurrency and consistency, data integrity, partitioning, replication, cost-based and rule-based optimizers, parallel execution, redo logs, RAW devices and many other features.
Although Oracle is a very functional database, additional qualities like reliability impose some overhead. Along with its many advantages, Oracle has some disadvantages. For example, great tunability requires a more experienced DBA, and redo log support provides great reliability against instance and media failures but requires a more efficient disk system.
I think you should select Oracle as the database for DataparkSearch if you want to search through hundreds of megabytes or several gigabytes of information, reliability is one of your primary concerns, you need high availability of the database, and you are ready to pay more for hardware and an Oracle DBA to achieve better quality of service.

5.5.1.2. DataparkSearch+Oracle8 Installation Requirements
In order to install DataparkSearch with Oracle RDBMS support you must meet the following requirements:
5.5.1.3. Currently supported/tested platforms
Oracle versions:
Operating systems:
Oracle Server may be run on any platform supporting TCP/IP connections. I see no difficulty in porting the DataparkSearch Oracle driver to any commercial or freeware Unix system; any contribution is appreciated.

5.5.2. Compilation, Installation and Configuration
5.5.2.1. Compilation
Oracle 8.0.5.X and Linux RedHat 6.1
./configure --with-oracle8=oracle_home_dir
make
make install
If you have any trouble, try to put CC = i386-glibc20-linux-gcc in src/Makefile; this is an old version of the gcc compiler for glibc 2.0.

5.5.2.2. Installation and Configuration
Check whether Oracle Server and Oracle Client work properly.
First, check that the DataparkSearch service is accessible:
[oracle@ant oracle]$ tnsping DataparkSearch 3
TNS Ping Utility for Linux: Version 8.0.5.0.0 - Production on 29-FEB-00 09:46:12
(c) Copyright 1997 Oracle Corporation. All rights reserved.
Attempting to contact (ADDRESS=(PROTOCOL=TCP)(Host=ant.gpovz.ru)(Port=1521))
OK (10 msec)
OK (0 msec)
OK (10 msec)
Second, try to connect to Oracle Server with svrmgrl and check whether the DataparkSearch tables were created:
[oracle@ant oracle]$ svrmgrl command='connect scott/tiger@DataparkSearch'
Oracle Server Manager Release 3.0.5.0.0 - Production
(c) Copyright 1997, Oracle Corporation. All Rights Reserved.
Oracle8 Release 8.0.5.1.0 - Production
PL/SQL Release 8.0.5.1.0 - Production
Connected.
SVRMGR> SELECT table_name FROM user_tables;
TABLE_NAME
------------------------------
DICT
DICT10
DICT11
DICT12
DICT16
DICT2
DICT3
DICT32
DICT4
DICT5
DICT6
DICT7
DICT8
DICT9
PERFTEST
ROBOTS
STOPWORD
TAB1
URL
19 rows selected.
Check the library paths in /etc/ld.so.conf:
[oracle@ant oracle]$ cat /etc/ld.so.conf
/usr/X11R6/lib
/usr/lib
/usr/i486-linux-libc5/lib
/usr/lib/qt-2.0.1/lib
/usr/lib/qt-1.44/lib
/oracle8/app/oracle/product/8.0.5/lib
This file should contain the line oracle_home_path/lib to ensure that DataparkSearch is able to open libclntsh.so, the shared Oracle Client library.
Make a symbolic link:
ln -s /oracle8/app/oracle/product/8.0.5/network/admin/tnsnames.ora /etc
Correct the indexer.conf file
You should specify
Setting up search.cgi
Copy the file /usr/local/dpsearch/bin/search.cgi to apache_root/cgi-bin/search.cgi. Then add two lines to Apache's httpd.conf file:
SetEnv ORACLE_HOME /oracle8/app/oracle/product/8.0.5
Correct search.htm to provide the DBName, DBUser and DBPass information.
search.cgi should work now.

Chapter 6. Subsections
6.1. Tags
A tag is a special parameter which can be assigned to a set of documents. The main purpose of tags is to join a number of documents into one group and then, while searching, to select a group of documents to search through.
You can use the Tag command of indexer.conf to give a tag value to a server or server subset. While searching you can specify a tag value to search through documents whose tag matches the given parameter with

6.1.1. Tag command
Tag <string>
Use this field for your own purposes, for example to group several servers together. During search you will be able to limit the URLs searched by their tags. Can be set multiple times before the Server command and takes effect until the end of the config file or until the next Tag command. The default value is an empty string.

6.1.2. TagIf command
TagIf <tag> [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]
Mark document by <tag> tag, if the value of
Example
TagIf Docs regex Title Manual

6.1.3. Tags in SQL version
The tag type is CHAR. The CHAR type allows some nice features. You can use the '_' and '%' LIKE wildcards in the tag parameter when searching. This lets a tag, like a category, support the idea of nesting. For example, documents with the tag value "AB" can be found with both the "A%" and "AB" tag limits.
Tags also give a way to make a URL a member of multiple tag selections. Playing with LIKE wildcards you can easily create two or more groups. For example, tag "ABCDE" is a member of at least these selections:
_BCDE
A_CDE
AB_DE
ABC_E
ABCD_
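To get a feel for which tag limits select which tags, the LIKE wildcard rules can be tried outside the database. The sketch below is an illustration only, not DataparkSearch code: it translates a LIKE pattern into a Python regular expression ('%' matches any run of characters, '_' matches exactly one). The function names are hypothetical, and real databases may differ in case sensitivity and escaping rules.

```python
import re


def like_to_regex(pattern):
    """Translate an SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def tag_matches(tag, limit):
    """Return True if the tag value is selected by the LIKE-style tag limit."""
    return like_to_regex(limit).fullmatch(tag) is not None


if __name__ == "__main__":
    for limit in ["A%", "AB", "_BCDE", "ABC_E", "A_"]:
        for tag in ["AB", "ABCDE"]:
            print(limit, tag, tag_matches(tag, limit))
```

Because '%' also matches an empty run, a limit such as "A%" selects the tag "A" itself as well as every tag that starts with it, which is what gives tags their nesting behaviour.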
6.2. Categories
There is a categories editor written in Perl. You can find it in the perl/cat_ed/ subdirectory of the DataparkSearch installation.
Categories are similar to the tag feature, but nested, so you can have one category inside another and so on.
Basic points:
You can also set up symlinks, i.e. categories that are actually links to other categories. The link database field is used for that. In a symlink the last two characters should be @@. In the example above Moto->BMW is a link to Auto->BMW.
First notice that the category in the server table is set to be 11 characters long. This means you can use any valid character to keep track of categories. If you are going to keep a category tree of any size, then I would suggest using the category editor. But anyway, here's how it works.
You can use either the tag column or the category column in the server for the same thing. Or you can categorize a site in two different ways. For example, you could keep track of sites that are owned by a certain company and then categorize them as well. You could use the tag option to keep track of ownership and use the category option for categories. What I explain for the category option applies to the tag option as well.
A category can be broken down any way you choose. But for it to work with the category editor, I believe for now, you have to use two characters for each level. If you use the category editor you have the choice of a hex number going from 0-F or a base-36 number going from 0-Z. Therefore a top-level category like 'Auto' would be 01. If it has a subcategory like 'Ford', then it would be 01 (the parent category) followed by the id of 'Ford', which we will give 01. Put those together and you get 0101. If 'Auto' had another subcategory named 'VW', then its id would start with 01 because it belongs to the 'Auto' category, followed by 02 because it's the next subcategory. So its id would be 0102. If VW had a subcategory called 'Engine', then its own id would start at 01 again, and it would be prefixed with the 'Auto' id 01 and the 'VW' id 02, making it 010201.
If you want to search for sites under that category, then you pass cat=010201 in the URL. So create a select box and give it options like this:
<OPTION value="01">AUTO
<OPTION value="0101">Ford
and so on...
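The two-characters-per-level scheme described above can be sketched in a few lines of Python. This is an illustration only and assumes the hex variant of the numbering; the function names category_id and ancestors are hypothetical and are not part of DataparkSearch or its category editor.

```python
def category_id(levels):
    """Build a category id from per-level numbers, two hex digits per level."""
    parts = []
    for number in levels:
        if not 0 <= number <= 0xFF:
            raise ValueError("each level must fit into two hex digits")
        parts.append(format(number, "02X"))
    return "".join(parts)


def ancestors(category):
    """List the id of every enclosing category, from the top level down."""
    return [category[:end] for end in range(2, len(category) + 1, 2)]


if __name__ == "__main__":
    engine = category_id([1, 2, 1])   # Auto -> VW -> Engine
    print(engine)
    print(ancestors(engine))
```

For the example tree, Auto -> VW -> Engine yields 010201, and its enclosing categories are 01 (Auto) and 0102 (VW), which is why every id under 'Auto' starts with 01.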
6.2.1. Category command
Category <string>
You may distribute documents between nested categories. A category is a string in hex number notation. You may have up to 6 levels with 256 members per level. An empty category means the root of the category tree. Take a look at Section 6.2 for more information.
# This command means a category on first level:
Category AA
# This command means a category on 5th level:
Category FFAABBCCDD

6.2.2. CategoryIf command
CategoryIf <category> [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]
Mark document by <category> category, if the value of
Example
CategoryIf 010F regex Title "JOB ID"

Chapter 7. Languages support
7.1. Character sets
7.1.1. Supported character sets
DataparkSearch supports almost all known 8 bit character sets as well as some multi-byte charsets including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, as well as UTF-8. Some multi-byte character sets are not supported by default, because their conversion tables are rather large and increase the size of the executable files. See the configure parameters to enable support for these charsets.
DataparkSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.
Table 7-1. Language groups
7.1.2. Character sets aliases
Each charset is recognized by a number of its aliases. Web servers can return the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 are the same charset. The search engine understands the following charset name aliases:
Table 7-2. Charsets aliases
7.1.3. Recoding
indexer recodes all documents to the character set specified by the LocalCharset command in your indexer.conf file. Internally, recoding is implemented using Unicode. Please note that if a character can't be converted directly from one charset to another, DataparkSearch uses an HTML numeric character reference to escape it (i.e. the form &#NNN; where NNN is the character's Unicode code). Thus, with any LocalCharset you do not lose any information about the indexed documents, but the size of the database you get after indexing depends on the LocalCharset chosen.

7.1.4. Recoding at search time
You may display search results in any charset supported by DataparkSearch. Use the BrowserCharset command in search.htm to select the charset for search results. This charset may be different from the LocalCharset specified. All recoding is done automatically.

7.1.6. Automatic charset guesser
DataparkSearch has an automatic charset and language guesser. It currently recognizes more than 100 different charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so called "language map" files, one for each language-charset pair. They are installed under the /usr/local/dpsearch/etc/langmap/ directory by default. Take a look there to check the list of currently provided charset-language pairs. The guesser works fine for texts longer than 500 characters. Shorter texts may not be guessed well.

7.1.6.1. LangMapFile command
Load a language map for the charset and language guesser from the given file. You may specify either an absolute file name or a name relative to the DataparkSearch /etc directory. You may use several LangMapFile commands.
LangMapFile langmap/en.ascii.lm

7.1.6.2. Build your own language maps
To build your own language map, use the dpguesser utility. In addition, you need to collect a file with language samples in the desired charset.
To create a new language map, use the following command:
dpguesser -p -c charset -l language < FILENAME > language.charset.lm
You can also use the dpguesser utility to guess a document's language and charset with the existing language maps. To do this, use the following command:
dpguesser [-n maxhits] < FILENAME
For some languages, several different charsets may be in use. To convert from one charset supported by DataparkSearch to another, use the dpconv utility.
dpconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile
You may also specify the -e switch for dpconv to use HTML escape entities for input, and the -E switch to use them for output.
By default, both the dpguesser and dpconv utilities are installed into the /usr/local/dpsearch/sbin/ directory.
DataparkSearch can update language and charset maps automatically while indexing, if the remote server supplies an exactly specified language and charset with its pages. To enable this function, specify the following command in your indexer.conf file:
LangMapUpdate yes
By default, DataparkSearch uses only the first 8192 bytes of each indexed file to detect language and charset. You may change this value with the GuesserBytes command. Use a value of 0 to use all the text of the indexed document.
GuesserBytes 16384

7.1.7. Default charset
Use the RemoteCharset command in indexer.conf to choose the default charset of indexed servers.

7.1.8. Default Language
You can set the default language for Servers by using
DefaultLang <string>
Default language for server. Can be used if you need language restriction while searching.
DefaultLang en

7.1.9. LocalCharset command
Defines the charset which will be used to store data in the database. All other character sets will be recoded into the given charset. Take a look at Section 7.1 for a detailed explanation of how to choose a LocalCharset depending on the languages used on your site(s). This command should be used once and takes global effect for the config file. Take a look at the documentation to check the whole list of supported charsets.
The default LocalCharset is iso-8859-1 (latin1).
LocalCharset koi8-r

7.1.10. ForceIISCharset1251 command
This option is useful for users who deal with Cyrillic content and broken (or misconfigured?) Microsoft IIS web servers, which tend not to report the charset correctly. This is a really dirty hack, but if this option is turned on, it is assumed that all servers which report themselves as 'Microsoft' or 'IIS' have content in the Windows-1251 charset. This command should be used only once in the configuration file and takes global effect. Default: no
ForceIISCharset1251 yes

7.1.11. RemoteCharset command
RemoteCharset <charset>
RemoteCharset iso-8859-5

7.1.12. URLCharset command
URLCharset <charset>
URLCharset KOI8-R

7.1.13. CharsToEscape command
CharsToEscape "\"&<>![]"
Use this command in your search template to specify the list of characters to escape for $&(x) search template meta-variables.

7.2. Making multi-language search pages
Original idea instructions by Craig Small
It is often required to allow for different languages which means different search.htm files depending on what language users have set in their browser. Further installation should be done in three steps.
7.2.1. How does it work?
So what happens if the user wants, say, German? Well, there is no search.de.cgi (search.de.php), so the first part of DirectoryIndex fails and it tries the second one, search.php. OK, they get the page in English, but it's better than a 404.
This does work, but you may need some more Apache fiddling to get negotiation to work, because I am testing this on a server that already has it set up, so I may have missed something.

7.2.2. Possible troubles
You may get some language negotiation problems caused by:
The Apache team is working on workarounds for most of these, if possible. For a reasonably heavily used web site you can expect an email about it once a week or so.

7.3. Segmenters for Chinese, Japanese, Korean and Thai languages
Chinese, Japanese, Korean and Thai writing has no spaces between the words of a phrase, unlike western languages. Thus, while indexing documents in these languages, phrases additionally need to be segmented into words.
Sometimes a text in Chinese, Japanese, Korean or Thai is typed with a space between every character for better appearance. In this case, you may use the "ResegmentChinese yes", "ResegmentJapanese yes", "ResegmentKorean yes" or "ResegmentThai yes" commands to index text typed this way. With resegmenting enabled, all spaces between characters are removed and the whole text is then segmented again using DataparkSearch's segmenters (see below).

7.3.1. Japanese language phrase segmenter
For Japanese language phrase segmenting, either ChaSen, a morphological system for the Japanese language, or MeCab, a Japanese morphological analyser, is used. Thus, you need one of these systems installed before configuring and building DataparkSearch.
To enable Japanese language phrase segmenting use

7.3.2. Chinese language phrase segmenter
For Chinese language phrase segmenting, a frequency dictionary of Chinese words is used. The segmenting itself is done by a dynamic programming method that maximizes the cumulative frequency of the produced words.
To enable Chinese language phrase segmenting, you need to enable support for Chinese charsets when configuring DataparkSearch and specify the frequency dictionary of Chinese words with the LoadChineseList command in the indexer.conf file.
LoadChineseList [charset dictionaryfilename]
By default, the GB2312 charset and the mandarin.freq dictionary are used.
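The idea of frequency-based segmenting by dynamic programming can be illustrated with a short Python sketch. This is not the DataparkSearch implementation, only a minimal example under stated assumptions: the dictionary is a plain mapping from word to frequency, characters not covered by any dictionary word become one-character words with score 0, and the toy frequencies and the max_len limit are invented for the illustration.

```python
def segment(text, freq, max_len=4):
    """Split text into words so that the sum of their dictionary frequencies is maximal."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i]: best total score for text[:i]
    back = [0] * (n + 1)               # back[i]: start index of the last word in text[:i]
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                score = freq[word]
            elif end - start == 1:
                score = 0              # unknown single character is kept as its own word
            else:
                continue               # unknown multi-character strings are not words
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words = []
    pos = n
    while pos > 0:                     # walk the back pointers from the end of the text
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


if __name__ == "__main__":
    toy_freq = {"北京": 50, "大学": 40, "北京大学": 120, "大学生": 30, "生": 5}
    print(segment("北京大学生", toy_freq))
```

With these toy frequencies the split 北京大学 + 生 scores 125, which beats 北京 + 大学生 (80) and 北京 + 大学 + 生 (95), so the first split is returned.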
7.3.3. Thai language phrase segmenter
For Thai language phrase segmenting, a frequency dictionary of Thai words is used. The segmenting itself is done as for the Chinese language.
To enable Thai language phrase segmenting, you need to specify the frequency dictionary of Thai words with the LoadThaiList command in the indexer.conf file.
LoadThaiList [charset dictionaryfilename]
By default, the tis-620 charset and the thai.freq dictionary are used.

7.3.4. Korean language phrase segmenter
For Korean language phrase segmenting, a frequency dictionary of Korean words is used. The segmenting itself is done as for the Chinese language.
To enable Korean language phrase segmenting, you need to specify the frequency dictionary of Korean words with the LoadKoreanList command in the indexer.conf file.
LoadKoreanList [charset dictionaryfilename]
By default, the euc-kr charset and the korean.freq dictionary are used.
7.4. Multilingual servers support

Some web servers can handle language negotiation for documents. In this case, several copies in different languages exist for one URL. To index all pages of such servers, use the VaryLang command. It specifies a list of languages separated by spaces. These languages will be used when indexing URLs that have multi-language versions. Usage example:

VaryLang "ru en fr"

The indexer will fetch all document copies in Russian, English and French.

Chapter 8. Searching documents

8.1. Using search front-ends

8.1.1. Performing search

Open your preferred front-end in a Web browser:

http://your.web.server/path/to/search.cgi

To find something, just type the words you want to find and press the SUBMIT button. For example: mysql odbc. DataparkSearch will find all documents that contain word

To find a phrase, simply enclose it in quotes. For example: "uncontrollable sphere".

8.1.2. Search parameters

DataparkSearch front-ends support the following parameters given in the CGI query string. You may use them in the HTML form on the search page.

Table 8-1. Available search parameters
8.1.3. Changing different document parts weights at search time

It is possible to pass the "wf" HTML form variable to search.cgi. The "wf" variable represents weight factors for specific document parts. Currently the body, title, keywords, description and url parts, crosswords, and user-defined META and HTTP headers are supported. Take a look at the "Section" part of indexer.conf-dist.

To be able to use this feature, it is recommended to set different section IDs for different document parts in the "Section" indexer.conf command. Currently up to 256 different sections are supported.

Imagine that we have these default sections in indexer.conf:

Section body 1 256
Section title 2 128
Section keywords 3 128
Section description 4 128

The "wf" value is a string of hex digits ABCD. Each digit is a factor for the corresponding section weight. The rightmost digit corresponds to section 1. For the sections configuration given above:

D is a factor for section 1 (body)

Examples:

wf=0001 will search through body only.

By default, if the "wf" variable is omitted from the query, all section factors are 1, which means all sections have the same weight.

By default, DataparkSearch uses fast relevance calculation. In this case, only zero and non-zero values of the "wf" variable have an effect (this only allows specified sections to be included in or excluded from search results).
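As an illustration of the digit layout, the following Python sketch assembles a "wf" string from per-section factors. The helper name build_wf and the choice to give unlisted sections a factor of 0 (excluded) are assumptions made for this example; they are not part of DataparkSearch.

```python
def build_wf(factors, num_sections):
    """Build a "wf" value: one hex digit per section, rightmost digit = section 1.

    factors maps a section ID to a factor in the range 0..15.
    Sections missing from the mapping get 0 (assumption: treat them as excluded).
    """
    digits = []
    # Emit the highest section ID first so that section 1 ends up rightmost.
    for section_id in range(num_sections, 0, -1):
        factor = factors.get(section_id, 0)
        if not 0 <= factor <= 15:
            raise ValueError("factor for section %d must fit in one hex digit" % section_id)
        digits.append(format(factor, "X"))
    return "".join(digits)


# With the four sections above: body=1, title=2, keywords=3, description=4.
print(build_wf({1: 1}, 4))         # 0001 (body only)
print(build_wf({1: 1, 2: 15}, 4))  # 00F1 (body with factor 1, title with factor F)
```

The resulting string would be passed as the wf CGI parameter, for example in a hidden form field.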
To enable full support for dynamic section weight, you need to specify

8.1.4. Using front-end with an shtml page

When using a dynamic shtml page containing SSI that calls search.cgi, i.e. search.cgi is not called directly as a CGI program, it is necessary to override Apache's SCRIPT_NAME environment attribute so that all the links on search pages lead to the dynamic page and not to search.cgi.

For example, when an shtml page contains the line <!--#include virtual="search.cgi" -->, the SCRIPT_NAME variable will still point to search.cgi, not to the shtml page.

To override the SCRIPT_NAME variable we implemented a DPSEARCH_SELF variable that you may add to Apache's httpd.conf file. search.cgi checks the DPSEARCH_SELF variable first and then SCRIPT_NAME.

Here is an example of using the DPSEARCH_SELF environment variable with Apache's SetEnv/PassEnv httpd.conf commands:

SetEnv DPSEARCH_SELF /path/to/search.cgi
PassEnv DPSEARCH_SELF

8.1.5. Using several templates

It is often required to use several templates with the same search.cgi. There are several ways to do it. They are given here in the order in which search.cgi detects the template name.
8.1.6. Search operators

The operator allin<section>:, where <section> is the name of a section defined in the sections.conf file (or in any DataparkSearch configuration file by the Section command) with a non-zero section number (see Section 3.10.43), limits the search domain for a query word to the specified section. This operator differs from limiting the search domain with the &wf= CGI variable in that the limit is imposed only on the query words specified after the operator.

For example, if you have the following commands in the sections.conf file

Section body 1 256
Section title 2 128
Section url 3 0 strict

then you can use the following operators in a search query: allinbody:, allintitle: and allinurl:.

The query computer allintitle: science finds the documents that contain the word "science" in the title and the word "computer" in any document section.

8.1.7. Advanced boolean search

If you want more advanced results you may use the query language. You should select the "bool" search mode in the search form.

DataparkSearch understands the following boolean operators:

AND or & - logical AND. For example, "mysql & odbc" or "mysql AND odbc" - DataparkSearch will find any URLs that contain both "mysql" and "odbc".

NEAR - NEAR operator, identical to the AND operator, but true only if both words are within 16 words of each other. For example, "mysql NEAR odbc" - DataparkSearch will find any URLs that contain both "mysql" and "odbc" within 16 words of each other.

ANYWORD or * - ANYWORD operator, identical to the AND operator, but true only if exactly one word (any word) stands between the two operands and the left operand precedes the right operand. For example, "mysql * odbc" - DataparkSearch will find any URLs that contain both "mysql" and "odbc" with any one word between them, for example, any document with the phrase "mysql via odbc".

OR or | - logical OR. For example, "mysql | odbc" or "mysql OR odbc" - DataparkSearch will find any URLs that contain the word "mysql" or the word "odbc".

NOT or ~ - logical NOT. For example, "mysql & ~ odbc" or "mysql AND NOT odbc" - DataparkSearch will find URLs that contain the word "mysql" and do not contain the word "odbc". Note that ~ only excludes the given word from the results. The query "~ odbc" will find nothing!

() - group command to compose more complex queries. For example "(mysql | msql) & ~ postgres".

The query language is simple and powerful at the same time. Just treat a query as an ordinary boolean expression.

8.1.8. The Verity Query Language, VQL

Only the prefix variant of the Verity Query Language is supported by DataparkSearch. In addition, only the following subset of VQL operators is supported:

Table 8-2. VQL operators supported by DataparkSearch
8.1.9. How search handles expired documents

Expired documents are still searchable with their old content.

8.2. mod_dpsearch module for Apache httpd

Since version 4.19, DataparkSearch also provides the mod_dpsearch.so module for the Apache web server.

8.2.1. Why using mod_dpsearch
8.2.2. Configuring mod_dpsearch

To enable this extension, add

LoadModule dpsearch_module libexec/mod_dpsearch.so
AddModule mod_dpsearch.c
<Ifmodule mod_dpsearch.c>
DataparkSearchdConf /usr/local/dpsearch/etc/modsearchd.conf
<Location /search>
SetHandler dpsearch
DataparkSearchTemplate /usr/local/dpsearch/etc/modsearch.htm
</Location>
<Location /storedoc>
SetHandler dpstoredoc
DataparkStoredocTemplate /usr/local/dpsearch/etc/modstoredoc.htm
</Location>
</IfModule>

There are three configuration directives supported by this module:

8.3. How to write search result templates

DataparkSearch users can customize search results (the output of search.cgi). You may do so by providing a template file, search.htm, which should be located in the /etc/ directory of the DataparkSearch installation. The template file is an ordinary HTML file divided into sections. Keep in mind that you can simply open the template file in your favorite browser to get an idea of how the search results will look.
Each section begins with a <!--sectionname--> delimiter and ends with a <!--/sectionname--> delimiter, each of which should reside on a separate line.

Each section consists of HTML formatted text with special meta symbols. Every meta symbol is replaced by its corresponding string. You can think of meta symbols as variables that take their appropriate values when search results are displayed.

The format of variables is the following:

$(x) - plain value
$&(x) - HTML-escaped value with search words highlighted.
$*(x) - HTML-escaped value.
$%(x) - value escaped to be used in URLs
$^(x) - search words highlighted.
$(x:128) - value truncated to the first 128 bytes, if longer.
$(x:UTF-8) - value written in UTF-8 charset. You may specify any charset supported.
$(x:128:right) - value truncated to the last 128 bytes, if longer.

8.3.1. Template sections

The following section names are defined:

8.3.1.1. TOP section

This section is included first on every page. You should begin this section with <HTML><HEAD> and so on. It is also the natural place to provide a search form.

The following special meta symbols may be used in this section:

$(self) - argument for FORM ACTION tag
$(q) - a search query
$(cat) - current category value
$(tag) - current tag value
$(rN) - random number (here N is a number)

If you want to include some random banners on your pages, please use $(rN). You should also place a string like "RN xxxx" in the 'variables' section (see below), which will give you a range 0..xxxx for $(rN). You can use as many random numbers as you want. Example: $(r0), $(r1), $(r45) etc.

A simple top section looks like this:

<!--top-->
<HTML>
<HEAD>
<TITLE>Search Query: $(q)</TITLE>
</HEAD>
<BODY>
<FORM METHOD=GET ACTION="$(self)">
<INPUT TYPE="hidden" NAME="ul" VALUE="">
<INPUT TYPE="hidden" NAME="ps" VALUE="20">
Search for: <INPUT TYPE="text" NAME="q" SIZE=30 VALUE="$&(q)">
<INPUT TYPE="submit" VALUE="Search!"><BR>
</FORM>
<!--/top-->

There are some variables defined in FORM.
lang limits results by language. The value is a two-letter language code.

<SELECT NAME="lang">
<OPTION VALUE="en" SELECTED="$(lang)">English
.....
</SELECT>
ul is the URL filter. It allows you to limit results to a particular site or section. For example, you can put the following in the form

Search through:
<SELECT NAME="ul">
<OPTION VALUE="" SELECTED="$(ul)">Entire site
<OPTION VALUE="/manual/" SELECTED="$(ul)">Manual
<OPTION VALUE="/products/" SELECTED="$(ul)">Products
<OPTION VALUE="/support/" SELECTED="$(ul)">Support
</SELECT>

to limit your search to a particular section.

The expression SELECTED="$(ul)" in the example above (and in all the examples below) allows the selected option to be reproduced on subsequent pages. When the search front-end finds that expression, it prints the string SELECTED only if the given OPTION VALUE is equal to that variable.

ps is the default page size (i.e. how many documents to display per page).

q is the query itself.

pn is ps*np. This variable is not used by DataparkSearch, but may be useful, for example, in an <!INCLUDE CONTENT="..."> directive if you want to include results produced by another search engine.

The following variables concern advanced search capabilities:
8.3.1.2. BOTTOM section

This section is always included last in every page, so you should provide all the closing tags that have counterparts in the top section. Placing this section at the end of the template file is not obligatory, but doing so lets you view the template as an ordinary HTML file in a browser to get an idea of how it will look.

Below is an example of a bottom section:

<!--bottom-->
<P>
<HR>
<DIV ALIGN=right>
<A HREF="http://www.maxime.net.ru/">
<IMG SRC="dpsearch.gif" BORDER=0 ALT="[Powered by DataparkSearch search engine software]">
</A>
</DIV>
</BODY>
</HTML>
<!--/bottom-->

8.3.1.3. RESTOP section

This section is included just before the search results. It is a good place to provide some general information about the search results. You can do so by using the following meta symbols:
Below is an example of a 'restop' section:

<!--restop-->
<TABLE BORDER=0 WIDTH=100%>
<TR>
<TD>Search<BR>results:</TD>
<TD><small>$(WE)</small></TD>
<TD><small>$(W)</small></TD>
</TR>
</TABLE>
<HR>
<CENTER>
Displaying documents $(first)-$(last) of total <B>$(total)</B> found.
</CENTER>
<!--/restop-->

8.3.1.4. RES section

This section is used for displaying information about each document found. The following meta symbols are used:
Here is an example of a res section:

<!--res-->
<DL><DT>
<b>$(Order).</b><a href="$(URL)" TARGET="_blank">
<b>$(Title)</b></a> [<b>$(Score)</b>]<DD>
$(Body)...<BR>
<b>URL: </b>
<A HREF="$(URL)" TARGET="_blank">$(URL)</A>($(Content-Type))<BR>
$(Last-Modified), $(Content-Length) bytes<BR>
<b>Description: </b>$(meta.description)<br>
<b>Keywords: </b>$(meta.keywords)<br>
</DL>
<UL>
$(CL)
</UL>
<!--/res-->

8.3.1.5. CLONE section

The contents of this section are included in the result in place of the $(CL) meta symbol for every document clone found. This is used to list all URLs with the same contents (mirrors etc.). You can use the same $(D*) meta symbols here as in the 'res' section. Of course, some information about a clone, such as $(DS), $(DR) and $(DX), will be the same, so there is little use in placing it here.

Below is an example of a 'clone' section.

<!--clone-->
<li><A HREF="$(DU)" TARGET="_blank">$(DU)</A> ($(DC)) $(DM)
<!--/clone-->

8.3.1.6. RESBOT section

This section is included just after the last 'res' section. It usually holds a navigation bar that lets the user go to the next or previous results page.

This is an example of a 'resbot' section:

<!--resbot-->
<HR>
<CENTER>
Result pages: $(NL)$(NB)$(NR)
</CENTER>
<!--/resbot-->

The navigator is a complex thing and is therefore constructed from the following template sections:

8.3.1.7. navleft, navleft_nop section

These are used for printing the link to the previous page. If that page exists, <!--navleft--> is used; on the first page there is no previous page, so <!--navleft_nop--> is used.

<!--navleft-->
<TD><A HREF="$(NH)"><IMG...></A><BR>
<A HREF="$(NH)">Prev</A></TD>
<!--/navleft-->

<!--navleft_nop-->
<TD><IMG...><BR>
<FONT COLOR=gray>Prev</FONT></TD>
<!--/navleft_nop-->

8.3.1.8. navbar0 section

This is used for printing the current page in the page list.

<!--navbar0-->
<TD><IMG...><BR>$(NP)</TD>
<!--/navbar0-->

8.3.1.9. navright, navright_nop section

These are used for printing the link to the next page. If that page exists, <!--navright--> is used; on the last page <!--navright_nop--> is used instead.

<!--navright-->
<TD>
<A HREF="$(NH)"><IMG...></A>
<BR>
<A HREF="$(NH)">Next</A></TD>
<!--/navright-->

<!--navright_nop-->
<TD>
<IMG...>
<BR>
<FONT COLOR=gray>Next</FONT></TD>
<!--/navright_nop-->

8.3.1.10. navbar1 section

This is used for printing the links to the other pages in the page list.

<!--navbar1-->
<TD>
<A HREF="$(HR)">
<IMG...></A><BR>
<A HREF="$(NH)">$(NP)</A>
</TD>
<!--/navbar1-->

8.3.1.11. notfound section

As its name implies, this section is displayed when no documents are found. It usually contains a short message saying so and perhaps some hints on how to make the search less restrictive.

Below is an example of a notfound section:

<!--notfound-->
<CENTER>
Sorry, but search hasn't returned results.<P>
<I>Try to compose less restrictive search query or check spelling.</I>
</CENTER>
<HR>
<!--/notfound-->

8.3.1.12. noquery section

This section is displayed when the user submits an empty query. Below is an example of a noquery section:

<!--noquery-->
<CENTER>
You haven't typed any word(s) to search for.
</CENTER>
<HR>
<!--/noquery-->

8.3.1.13. error section

This section is displayed when an internal error occurs while searching, for example when the database server is not running. You may use the following meta symbol:

$(E) - error text.

Example of an error section:

<!--error-->
<CENTER>
<FONT COLOR="#FF0000">An error occurred!</FONT>
<P>
<B>$(E)</B>
</CENTER>
<!--/error-->

8.3.2. Variables section

There is also a special variables section, in which you can set up some values for search.
The special variables section usually looks like this:

<!--variables
DBAddr mysql://foo:bar@localhost/search/?dbmode=single
VarDir /usr/local/dpsearch/var/
LocalCharset iso-8859-1
BrowserCharset iso-8859-1
TrackQuery no
Cache no
DetectClones yes
HlBeg <font color="blue"><b><i>
HlEnd </i></b></font>
R1 100
R2 256
Synonym synonym/english.syn
ResultContentType text/xml
Locale fr_FR.ISO_8859-1
-->
The VarDir command specifies a custom path to the directory in which indexer stores data when used with cache mode. By default, the /var directory of the DataparkSearch installation is used.

LocalCharset specifies the charset of the database. It must be the same as LocalCharset in indexer.conf.

BrowserCharset specifies which charset will be used to display results. It may differ from LocalCharset. All template variables that correspond to data from the search result (such as document title, description, text) will be converted from LocalCharset to BrowserCharset. The contents of the template itself are not converted; they must be in BrowserCharset.

Use "Cache yes/no" to enable/disable the search results cache.

Use "DetectClones yes/no" to enable/disable clone detection.

Use "GroupBySite yes/no" to enable/disable grouping results by url.site_id.

The HlBeg and HlEnd commands are used to configure search results highlighting. Found words will be surrounded by those tags.

There is an Alias command in search.htm that is similar to the one in indexer.conf, but it affects only search results and has no effect on indexing. See Section 3.7 for details.

R1 and R2 specify ranges for the random variables $(R1) and $(R2).

The Synonym command is used to load the specified synonyms list. The synonyms file name is either absolute or relative to the /etc directory of the DataparkSearch installation.

The DateFormat command is used to change the output format of the Last-Modified date. Use strftime function meta-variables for your own format string.
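A format string can be tried out before it is put into the template. The sketch below uses Python's time.strftime, which accepts the same C-style conversion specifiers as the strftime function; the helper name and the sample format are only illustrations, and locale-dependent specifiers such as %b or %A may render differently on your system.

```python
import time


def preview_date_format(fmt, epoch_seconds):
    """Render a UNIX timestamp (UTC) with a strftime-style format string."""
    return time.strftime(fmt, time.gmtime(epoch_seconds))


# Numeric specifiers only, so the result does not depend on the locale.
print(preview_date_format("%Y-%m-%d %H:%M", 0))  # 1970-01-01 00:00
```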
"Log2stderr yes/no" command is used to enable error logging to stderr. ResultsLimit command is uses to limit maximum number of results shown. If searchd is used, this command may be specified in searchd.conf. ResultContentType command is uses to specify Content-Type header for results page. Default value: text/html. Locale command is uses to specify LC_ALL locale settings for search results output. Default value: unspecified (uses the value specified before in system settings). With MakePrexixes yes command you can instruct to extend a search query automatically by producing all prefixes of query words. This is suitable, for example, for making search suggestions.(See also Section 3.10.56>) 8.3.3. Includes in templatesYou may use <!INCLUDE Content="http://hostname/path"> to include external URLs into search results. WARNING: You can use <!INCLUDE> ONLY in the following template sections: <!--top--> This is an example of includes usage: 8.3.4. Conditional template operatorsDataparkSearch supports conditional operators in search templates: <!IF, <!ELSE, <!ENDIF, <!ELIF, <!ELSEIF, <!SET, <!COPY, <!IFLIKE, <!IFREGEX, <!ELIKE, <!EREGEX, <!ELSELIKE, <!ELSEREGEX. <!IF NAME="Content-Type" Content="application/pdf"> <img src="pdf.png"> <!ELIF NAME="Content-Type" Content="text/plain"> <img src="text.png"> <!ENDIF> It's possible to use nested conditional operators. This gives more power for search template construction. See samples in etc/search.htm-dist file. 8.3.5. Security issuesWARNING: Since the template file contains such info as password, it is highly recommended to give the file proper permissions to protect it from reading by anyone but you and search program. Otherwise your passwords may leak. 8.4. Designing search.htmlThis section is assuming that you are using the CGI front end. 8.4.1. How the results page is createdThe file etc/search.htm consists of a number of blocks delimited by HTML comments that start with <!--comment--> and end with <!--/comment-->. 
The <!--variables--> block is only used by search.cgi. The other blocks form part of the results output, depending on the situation.

The <!--top--> and <!--bottom--> blocks are always returned to the user as the top and bottom parts of the output respectively.

There are three series of <!--restop-->, <!--res--> and <!--resbot--> blocks. The first series is returned to users who have requested long results (the default), the second to those who have requested short results, and the third to those who have requested results as URLs only. All three blocks must be present in search.htm. Furthermore, there is a series of navigation blocks and the blocks <!--notfound-->, <!--noquery--> and <!--error-->. The latter are returned occasionally instead of results.

Any HTML that is outside the pre-defined blocks in search.htm is completely ignored.

Thus, the output of search.cgi will always be something like this:

top
restop              top             top          top
res           or    notfound   or   error   or   noquery
resbot              bottom          bottom       bottom
(navigation)
bottom

The navigation part is built in the same way, with the elements that pertain to each results page. For example, <!--navleft--> and <!--navright--> are used to link to the previous and next results pages, while <!--navXXX_nop--> is used when there are no more pages in one or either direction.

8.4.2. Your HTML

The simplest HTML is provided ready for use in etc/search.htm-dist. It is advisable to use it until your back-end works fine.

Once you decide to add bells and whistles to your search, you have two options. One is to keep the simple design of search.htm, but make it part of a frame set. This way you can add elements such as menus in one frame and keep the output of search.htm in another.

The other option is to incorporate your entire design in search.htm. If you fully understand the "blocks" system described above, this should not be too difficult.
The most important factor is to keep track of elements that need to be opened in one block and closed in another.

For example, you might want a page in tables that looks like this:

 ----------------------------------
|            top table             |
|..................................|
|       .                          |
|left   .                          |
|       .                          |
|       .        main table        |
|table  .                          |
|       .                          |
|       .                          |
 ----------------------------------

If you are planning to put your results in the main table, you can put all the HTML code in the <!--top--> block of search.htm, up to and including the opening of the main table (<table><tr><td>). If you then put the closing of the main table and the closing tags of the page in the <!--bottom--> block (</td></tr></table></body></html>) and leave all other blocks unformatted, you will have the design of your choice and all your results in the right place.

In a more complicated design, where you want to format results individually, you can apply the same method as long as you keep track of the opening and closing of HTML elements. You must either open and close them in the same block, or make sure that any possible combination of blocks will result in properly opened and closed HTML tags.

What you cannot do without editing the source code is change the order in which the blocks are parsed. Taking the above example, let's assume that you want your page to look like this:

 ----------------------------------
|  logo         banner ads         |
|..................................|
|        .                         |
|choices .                         |
|        .                         |
|        .       results           |
|search  .                         |
|button  .                         |
|        .                         |
 ----------------------------------

To get this, you need to have everything except the results and navigation in the <!--top--> block, since that is the only block that can draw the page even if there are no results at all. In this case your search.htm would look like this:

<!--variables-->
[your configuration]
<!--/variables-->
<!--top-->
<html>
<body>
<table>
<tr>
<td colspan="2">[logo, banner ads]</td>
</tr>
<tr>
<td>[search form]</td>
<td>
<!--/top-->
[all other blocks in search.htm except "bottom"]
<!--bottom-->
[closing elements like the DataparkSearch link
and a link to the webmaster]
</td>
</tr>
</table>
</body>
</html>
<!--/bottom-->
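Because every block must be closed by a matching <!--/name--> delimiter, a template can be checked mechanically before it is deployed. The following Python sketch is a simplified stand-in for the real template parser, not DataparkSearch code: the function names are hypothetical, it only recognizes delimiters of the exact form <!--name--> and <!--/name-->, and it therefore skips a variables block written as a single multi-line comment.

```python
import re

# A block is <!--name--> ... <!--/name-->; the body may span several lines.
BLOCK_RE = re.compile(r"<!--(\w+)-->(.*?)<!--/\1-->", re.DOTALL)


def split_blocks(template):
    """Return (name, body) pairs for every properly closed block, in file order."""
    return [(m.group(1), m.group(2)) for m in BLOCK_RE.finditer(template)]


def unclosed_blocks(template):
    """Return the names of blocks that are opened but never closed."""
    opened = set(re.findall(r"<!--(\w+)-->", template))
    closed = set(re.findall(r"<!--/(\w+)-->", template))
    return sorted(opened - closed)


sample = (
    "<!--top-->\n<html><body>\n<!--/top-->\n"
    "this text is outside any block\n"
    "<!--error-->\n$(E)\n<!--error-->\n"   # closing delimiter lacks the slash
    "<!--bottom-->\n</body></html>\n<!--/bottom-->\n"
)

print([name for name, _ in split_blocks(sample)])  # ['top', 'bottom']
print(unclosed_blocks(sample))                     # ['error']
```

A list of pairs is returned rather than a dictionary because the same block name (restop, res, resbot) may legitimately appear several times in one template.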
The individual blocks can be formatted individually, as long as that formatting is closed within each block. Thus, nothing stops you from doing things like

<!--error-->
<table>
<tr><td bgcolor="red">
<font color="#ffffff">
[error variables]
</font>
</td></tr>
</table>
<!--/error-->

as long as such formatting is opened and closed properly within the same block.

8.4.3. Forms considerations

Most modern browsers can handle forms that stretch over different tables, but writing such forms is against all standards and is bad HTML. Unless you really can't avoid it, don't do it.

For example,

<table>
<tr><td>
<form>
<input type="text" name="something">
<input type="radio" name="button1">
<input type="radio" name="button2">
</form>
</td></tr>
</table>

is fine, but

<table>
<tr><td>
<form>
<input type="text" name="something">
</td></tr>
</table>
<table>
<tr><td>
<input type="radio" name="button1">
<input type="radio" name="button2">
</form>
</td></tr>
</table>
is not.

Note that the input forms in search.htm can be changed at will. The default is drop-down menus, but nothing stops you from using radio buttons, hidden inputs or even text boxes. For instance, where search.htm says

Results per page:
<SELECT NAME="ps">
<OPTION VALUE="10" SELECTED="$(ps)">10
<OPTION VALUE="20" SELECTED="$(ps)">20
<OPTION VALUE="50" SELECTED="$(ps)">50
</SELECT>

you can very well substitute

<input type="radio" name="ps" value="10" checked="$(ps)">
<input type="radio" name="ps" value="20" checked="$(ps)">
<input type="radio" name="ps" value="50" checked="$(ps)">

which will result in three radio buttons instead of a drop-down menu, with "20" as the default and exactly the same functionality. What you obviously cannot do is provide multiple-choice inputs like <input type="checkbox"> or <select multiple>.

Note that you can also use the format if you want to set defaults other than the pre-defined ones and not allow the user to change them.

8.4.4. Relative links in search.htm

It is worth mentioning that search.htm is parsed from your cgi-bin directory. The position of this directory in relation to your document root is determined by the web server, independently of its actual position in the file system. It is almost invariably http://your_document_root/cgi-bin/ . Since search.cgi lives in cgi-bin, any links to images etc. in search.htm will assume cgi-bin as the base directory. Therefore, if you have a file system structure like the correct relative link from search.cgi to images in img/ would still be <img src="../img/image.gif"> despite the fact that it doesn't match the file system structure.

8.4.5. Adding Search form to other pages

To add a search form to any of your pages, place the following code where you would like the form to be displayed:

8.5. Relevance

8.5.1. Ordering documents

By default, DataparkSearch sorts results first by relevance and second by popularity rank.

8.5.2. Relevance calculation

When indexing, DataparkSearch divides every document into sections. A section is any part of the document; for HTML documents, for example, this may be the TITLE or the META Description tag.

In addition to sections, some document factors are also taken into account in the relevance calculation: the average distance between query words, the number of query word occurrences, the position of the first occurrence of a query word, and the difference between the distribution of query word counts and the uniform distribution.

When searching, DataparkSearch compares every document found against an "ideal" document. The "ideal" document has the query words in every defined section and also has predefined values of the additional factors.

Since section definitions are located only in the indexer.conf file, use the NumSections command in searchd.conf or in search.htm to specify the number of sections used. By default, this value is 256. Note, however, that NumSections does not affect document ordering, only the relevance value.

Table 8-3. Configure-time parameters to tune relevance calculation (switches for configure)
8.5.2.1. A full method of relevance calculation.

Let x be the weighted sum of all sections. The weights for these sections are defined by

8.5.2.2. A fast method of relevance calculation.

Let x be the number of bits used in the weighted values of all defined sections. Let y be the weighted sum of differences between the additional factors of the document found and the corresponding values of the "ideal" document. And let xy be the number of bits in which the weighted section values of the "ideal" document differ from those of the document found. The document relevance is then calculated as: ( x - xy ) / ( x + y ).

8.5.3. Popularity rank

DataparkSearch supports two methods of popularity rank calculation. The method used in previous versions is called "Goo", and the new method is called "Neo". By default, the Goo method is used. To select the desired PopRank calculation method, use the PopRankMethod command:

PopRankMethod Neo

You need to enable links collection with the CollectLinks yes command in your indexer.conf file for the Neo method and for full functionality of the Goo method. This slows indexing down a little. By default, links collection is not enabled.

By default, only intersite links (i.e. links from a page on one site to a page on another site) are taken into account in the popularity rank calculation. If you place the PopRankSkipSameSite no command in the indexer.conf file, indexer takes all links into account for this purpose.

You may assign an initial value for the page popularity rank using the DP.PopRank META tag (see Section 4.3).

8.5.3.1. "Goo" popularity rank calculation method

The popularity rank calculation is made in two stages. At the first stage, the value of
By default, the value of If you place
If you place
If you place
This method treats all pages as neurons and the links between pages as links between neurons, so an error back-propagation algorithm can be used to train this neural network. The popularity rank of a page is the activity level of the corresponding neuron. See the short description of The Neo Popularity Rank for web pages. You may use
By default, the Neo Popularity Rank is calculated along with indexing. To speed up indexing, you may postpone the Popularity Rank calculation using the PopRankPostpone command:

PopRankPostpone yes

You may then calculate the Neo Popularity Rank after indexing in the same way as for the Goo method, i.e.:

indexer -TR

8.5.4. Boolean search

Please note that when performing a boolean search for two or more words, you have to enter operators (&, |, ~, AND, OR, NOT, NEAR, ALL, etc.). That is, it is necessary to enter a & book instead of a book. See also Section 8.1.7.

8.5.5. Crosswords

This feature assigns the words between <a href="xxx"> and </a> to the document the link leads to as well. To enable Crosswords, use the CrossWords yes command in indexer.conf and search.htm, and define the crosswords section in the sections.conf file.

8.5.6. The Summary Extraction Algorithm (SEA)

The Summary Extraction Algorithm (SEA) builds a summary from the three or more most relevant sentences of each indexed document, provided the document consists of six or more sentences. To enable this feature, add this command to your sections.conf file:

Section sea x y

where x is the section number and y is the maximum length of this section's value; leave y as 0 if you do not want to show the summary in result pages. If you specify a non-zero y, you may use the $(sea) meta-variable in your search template to show the summary in result pages.

Related configuration directives:

The SEASentenceMinLength command specifies the minimal length of a sentence to be used in summary construction by the SEA. Default value: 64.

The SEASentences command specifies the maximal number of sentences with a length greater than or equal to the value set by the SEASentenceMinLength command that are used for summary construction in the SEA. Default value: 32. Since the cost of summary construction with the SEA grows non-linearly (this affects only indexing), you may adjust this value according to the desired indexing performance.

With the SEASections command you can specify the list of document sections used to construct the SEA summary. By default, only the "body" section is used for SEA summary construction.

SEASections "body, title"

This algorithm of automatic summary construction is based on the ideas of Rada Mihalcea described in the paper: Rada Mihalcea and Paul Tarau, An Algorithm for Language Independent Single and Multiple Document Summarization, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.

Differences in DataparkSearch's SEA:
After indexing the document collection with this section defined, you may use the $(sea) meta-variable in your template to show the summary for a search result.

8.6. Search queries tracking

DataparkSearch supports query tracking. When performing a search, the front-end uses the qtrack table to store the query words, the client IP address, the number of documents found and the current UNIX timestamp in seconds, and the qinfo table to store all search parameters.

To enable tracking, add the trackquery parameter to the DBAddr command (see Section 3.10.2) in your search template. For example:

DBAddr pgsql://user:pass@localhost/search/?dbmode=cache&trackquery
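
With tracking enabled, the collected data can be aggregated per query. The sketch below is a minimal Python illustration and is not part of DataparkSearch: it uses an in-memory SQLite table as a stand-in for qtrack, reduced to the two columns (qwords and found) that the summary query in this section relies on. The real table has more columns and lives in the database named in your DBAddr command; the function name summarize_queries and the sample rows are assumptions made for the example.

```python
import sqlite3


def summarize_queries(rows):
    """Aggregate (qwords, found) pairs the way the qtrack summary query does.

    `rows` is an iterable of (qwords, found) tuples. The two-column table
    created here is a simplified stand-in, not the real qtrack schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE qtrack (qwords TEXT, found INTEGER)")
    conn.executemany("INSERT INTO qtrack (qwords, found) VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT qwords, count(*), sum(found), avg(found) "
        "FROM qtrack GROUP BY qwords ORDER BY qwords"
    )
    result = cursor.fetchall()
    conn.close()
    return result


# Hypothetical sample data: two searches for one phrase, one for another.
SAMPLE = [("search engine", 120), ("search engine", 80), ("ispell", 7)]
summary = summarize_queries(SAMPLE)

for qwords, times, total, average in summary:
    print(f"{qwords!r}: asked {times} time(s), {total} hits in total, {average:.1f} on average")
```

Each result row gives the query words, how often they were asked, and the total and average number of documents found. Queries with a low average are candidates for synonym or acronym lists.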
Query tracking is useful for gathering usage statistics for your search engine. To produce a summary of search queries you may execute, for example, this SQL statement:

SELECT qwords,count(*),sum(found),avg(found) FROM qtrack GROUP BY qwords;

8.7. Search results cache

The search results cache allows search.cgi to respond very quickly to recently used queries, as well as to a user's navigation through the pages of the same result.

The search results cache is disabled by default. You may use the Cache yes command in search.htm to enable results caching. If you use searchd, add the "Cache yes" command to the searchd.conf file.

The search cache is located in the $PREFIX/var/cache/ subdirectory, where $PREFIX is the DataparkSearch installation base directory. Each result is stored in a separate file.

By default, the search results cache is not deleted automatically. You have to delete it each time after the indexer has run, to avoid displaying stale cached results. Alternatively, you may specify a refresh period for the search results cache using the HoldCache command:

HoldCache <time>

For the <time> format, see the description of the Period command in Section 3.10.26.

HoldCache 3h

8.8. Fuzzy search

8.8.1. Ispell

When DataparkSearch is used with ispell support enabled, it automatically extends the search query with all grammatical forms of the query words. E.g. the search front-end will try to find the word "test" if "testing" or "tests" is given in the search query.

8.8.1.1. Two types of ispell files

DataparkSearch understands two types of ispell files: affixes and dictionaries. An ispell affix file contains rules for words and has approximately the following format:

Flag V:
E > -E, IVE # As in create > creative
[^E] > IVE # As in prevent > preventive
Flag *N:
E > -E, ION # As in create > creation
Y > -Y, ICATION # As in multiply > multiplication
[^EY] > EN # As in fall > fallen

An ispell dictionary file contains the words themselves and has the following format:

wop/S
word/DGJMS
wordage/S
wordbook
wordily
wordless/P

8.8.1.2. Using Ispell

To make DataparkSearch support ispell, you must specify the Affix and Spell commands in the search.htm file. The format of the commands:

Affix [lang] [charset] [ispell affix file name]
Spell [lang] [charset] [ispell dictionary filename]

The first parameter of both commands is a two-letter language abbreviation. The second is the charset of the ispell files. The third is the file name. File names are relative to the DataparkSearch /etc directory. Absolute paths can also be specified.
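
To make the affix rule format from Section 8.8.1.1 more concrete, the following Python sketch applies a single suffix rule of the form condition > [-strip,] append to a word. It is an illustration only and does not show how DataparkSearch itself loads or applies affix files. The function names are hypothetical, and only the forward direction (base word to derived form) is shown.

```python
import re


def parse_suffix_rule(rule):
    """Split 'COND > [-STRIP,] APPEND  # comment' into (regex, strip, append)."""
    rule = rule.split("#", 1)[0]             # drop the trailing comment
    cond, _, action = rule.partition(">")    # split at the first '>'
    cond = cond.replace(" ", "").upper()
    action = action.strip()
    strip = ""
    if action.startswith("-"):               # optional '-STRIP,' part
        strip_part, _, action = action.partition(",")
        strip = strip_part[1:].strip().upper()
    # The condition describes the END of the word, so anchor it with '$'.
    return re.compile(cond + "$"), strip, action.strip().upper()


def apply_suffix_rule(word, rule):
    """Return the derived word form, or None if the rule does not apply."""
    pattern, strip, append = parse_suffix_rule(rule)
    word = word.upper()                      # the sample rules are upper case
    if not pattern.search(word):
        return None
    if strip:
        if not word.endswith(strip):
            return None
        word = word[: -len(strip)]           # remove the stripped ending
    return word + append


# The two rules of "Flag V" from the sample affix file above.
RULES_V = [
    "E > -E, IVE   # As in create > creative",
    "[^E] > IVE    # As in prevent > preventive",
]

# Only the rule whose condition matches the word ending yields a form.
forms = [apply_suffix_rule("prevent", rule) for rule in RULES_V]
print(forms)
```

A dictionary entry carrying the V flag would be expanded by trying every rule under that flag and keeping the forms for which the condition matches.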
If you use searchd, add the same Affix and Spell commands to searchd.conf.

When DataparkSearch is used with ispell support, it is recommended to use searchd, especially when several languages are supported. Otherwise the start-up time of search.cgi increases.

8.8.1.3. Customizing dictionary

Your site may contain some rare words that are not in the ispell dictionaries. In that case, the entry with the longest matching suffix is taken to produce the word forms. You can also list such words in a plain text file with the following format (one word per line):

rare.dict:
----------
webmaster
intranet
.......
www
http
---------

You may also use ispell flags in this file (for ispell flags, refer to the ISpell documentation). This avoids having to write the same word with different endings into the rare words file, for example "webmaster" and "webmasters". You may pick a word that follows the same inflection rules from an existing ispell dictionary and simply copy its flags. For example, the English dictionary has this line:

postmaster/MS

So, webmaster with the MS flags will probably be OK:

webmaster/MS

Then copy this file to the /etc directory of DataparkSearch and add it with the Spell command in the DataparkSearch configuration:

During the next reindexing of all documents, the new words will be treated as correctly spelled words. Only the really incorrect words will remain.

8.8.1.4. Where to get Ispell files

You may find ispell files for many languages at this page.

For Japanese there are quasi-ispell files suitable for use with DataparkSearch only. You may get this data from our web site or from one of our mirrors. See Section 1.2.

8.8.1.5. Query words modification

Quffix [lang] [charset] [ispell-like suffix file name]

The Quffix command is similar to the Affix command described above, except that its rules apply to the query words, but not to the normal word forms as is done for the Affix command.
The file loaded with this command must contain only suffix rules (in terms of ispell affix files). This command is suitable, for example, for specifying rules to switch from one part of speech to another in Russian, where appropriate.

8.8.2. Aspell

With Aspell support compiled in, it is possible to automatically extend the search query with spelling suggestions for the query words. To enable this feature, you need to install Aspell on your system before building DataparkSearch. Then place the AspellExtensions yes command into your indexer.conf and search.htm (or into searchd.conf, if searchd is used) files to activate it.

Automatic spelling suggestion for search query words takes place only if

8.8.3. Synonyms

DataparkSearch also supports synonym-based fuzzy search.

Synonym files are installed into the etc/synonym subdirectory of the DataparkSearch installation. Large synonym files need to be downloaded separately from our web site or from one of our mirrors; see Section 1.2.

To enable synonyms, add commands like Synonym <filename> to the search.htm search template, e.g.:

Synonym synonym/english.syn
Synonym synonym/russian.syn

File names are relative to the etc directory of the DataparkSearch installation, or absolute if they begin with /.

If you use searchd, add the same commands to searchd.conf.

You may create your own synonym lists. You may take the English synonyms file as an example. At the beginning of the list, specify the following two commands:

Language: en
Charset: us-ascii
Optionally, you may specify the following command in the list:

Thesaurus: yes

This command enables thesaurus mode for the synonyms list. In this mode, only words on the same line are treated as synonyms.

8.8.4. Accent insensitive search

Since version 4.17, DataparkSearch also supports accent-insensitive search. To enable this extension, use the AccentExtensions command in your search.htm (or in searchd.conf, if searchd is used) to automatically produce accent-free copies of the query words, and in your indexer.conf config file to produce accent-free copies of words to store in the database.

AccentExtensions yes

If the AccentExtensions command is placed before the Spell and Affix commands, accent-free copies of that data will also be loaded automatically.

8.8.5. Acronyms and abbreviations

Since version 4.30, DataparkSearch also supports fuzzy search based on acronyms and abbreviations.

Acronym files are installed into the etc/acronym subdirectory of the DataparkSearch installation.

To enable acronyms, add commands like Acronym <filename> to the search.htm search template, e.g.:

Acronym acronym/en.fido.acr
Acronym acronym/en.acr

File names are relative to the etc directory of the DataparkSearch installation, or absolute if they begin with /.

If you use searchd, add the same commands to searchd.conf.

You may create your own acronym lists. You may take the English acronyms file as an example. At the beginning of the list, specify the following two commands:

Language: en
Charset: us-ascii
Also, you can extend queries with special comments that specify regular expression modifications. E.g.:

#* regex last "([0-9]{2})[- \.]?([0-9]{2})[- \.]?([0-9]{2})" "+78622$1$2$3"

This specifies a transformation from a widely used format of local phone numbers, 99-99-99, into the canonical format, +78622XXXXXX, so that phone numbers become searchable regardless of the format in which they were written. The last option here means that regex processing stops after this rule has been applied.

Please send your own acronym files to

Chapter 9. Miscellaneous

9.1. Reporting bugs

When reporting bugs, please specify the DataparkSearch version and give us as much information about your problem as possible. Details such as your platform and OS, the database version, and database statistics like the number of URLs in the database or the record counts of the various tables are very helpful in finding and fixing bugs.

Please submit bug reports using our Bug Reporting System at the DataparkSearch web site. Please do not send reports to the mailing list or to the authors' personal addresses!

9.1.1. Currently known bugs

Use the DataparkSearch Bug Reporting System to view statistics on active and fixed bugs. You can also use this system to submit new bug reports and proposals for new features or improvements.

9.1.2. Core dump reports

If indexer or search.cgi
dies while running and produces a core file, it would be very helpful if you
sent us the gdb (The GNU Debugger) output. To do this, please take the
following steps.

Make sure you have DataparkSearch built with

Run the GNU Debugger with the executable as the first argument and the core file as the second:

gdb indexer indexer.core

Some information about the crash location will appear:

Core was generated by `indexer'.
Program terminated with signal 8, Floating point exception.
Reading symbols from /usr/lib/libc.so.3...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0 0x80483f3 in main () at indexer.c:4
4 printf("%d",0/0);
Then type the thread apply all backtrace command:

(gdb) thread apply all bt
#0 0x80483f3 in main () at indexer.c:4
#1 0x804837d in _start ()

Send us either both the first and the second output, or just a screenshot of the gdb session.

9.2. Using libdpsearch library

The libdpsearch library is available for use in third-party applications. You can easily add search to your own application using the library and include files installed in the /lib and /include directories of DataparkSearch. Each application that uses libdpsearch must include the dpsearch.h header file.

9.2.1. dps-config script

When compiled with one of the supported SQL back-ends, libdpsearch requires some dependent libraries, for example libmysqlclient. You can find the dps-config script in the /bin directory of the DataparkSearch installation. This script helps to take the required dependencies into account. The dps-config script accepts several options on its command line. By default, dps-config outputs all available options:
Usage: ./dps-config [OPTIONS]
Options:
[--version]
[--libs]
[--cflags]
When executed with

# ./dps-config --libs
-lm -L/usr/local/mysql/lib/mysql -lmysqlclient \
-L/usr/local/dpsearch/lib -ldpsearch

So you may insert dps-config

cc myprog.c -o myprog `dps-config --libs`

9.2.2. DataparkSearch API

There is no detailed description of the DataparkSearch API yet, because the API is currently under rapid development and may change significantly from version to version. You may use search.c as an example of an application that uses the libdpsearch library.

9.3. Database schema

The full database schema used by DataparkSearch is defined in the corresponding SQL scripts for database creation, located under the create subdirectory.

Table 9-1.
Other server parameters are stored in

Table 9-2. Values of several server parameters in
Appendix A. Donations

If you like the DataparkSearch Engine and want to encourage further development, feel free to make a donation (at Kagi) to support this project. Any donation is gratefully appreciated.

The following individuals have made donations to support DataparkSearch development, and deserve credit for it: