3.6. Specifying WEB space to be indexed

When indexer tries to insert a new URL into the database, or to index an existing one, it first checks whether the URL has a corresponding Server, Realm or Subnet command in indexer.conf. URLs without a corresponding Server, Realm or Subnet command are not indexed. By default, URLs that are already in the database but have no corresponding Server/Realm/Subnet command are deleted from the database. This may happen, for example, after some Server/Realm/Subnet commands are removed from indexer.conf.

These commands have the following format:

    <command> [method] [subsection] [CaseType] [MatchType] [CmpType] pattern [alias]

The pattern parameter is mandatory; the parameters in square brackets are optional.
Use optional
The optional CaseType parameter specifies the case sensitivity of string comparison. It can take one of the following values: case (case insensitive comparison) or nocase (case sensitive comparison).

The optional CmpType parameter specifies the type of comparison and can take two values: Regex and String. String wildcards are the default comparison type. You can use the ? and * signs in URLMask parameters; they mean "one character" and "any number of characters" respectively. For example, to index all HTTP sites in the .ru domain, use this command:

    Realm http://*.ru/*

The Regex comparison type takes a regular expression as its argument. Activate the regex comparison type using Regex, for example:

    Realm Regex ^http://.*\.ru/

The optional MatchType parameter specifies the match type. The possible values are Match and NoMatch, with Match as the default. Realm NoMatch has the reverse effect: a URL that does not match the given pattern corresponds to the command. For example:

    Realm NoMatch http://*.com/*

3.6.1. Server command

This is the main command of the indexer.conf file. It is used to add servers or their parts to be indexed, and it also tells indexer to insert the given URL into the database at startup. For example, the command Server http://localhost/ allows indexing of the whole http://localhost/ server. You can also specify a path to index only a subsection of the server: Server http://localhost/subsection/. In both cases indexer inserts the given URL into the database at startup.
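The String and Regex comparison types and the NoMatch modifier described in section 3.6 can be approximated in a few lines of Python. This is only an illustrative sketch, not indexer's actual code: the function name url_matches and its parameters are hypothetical, and Python's fnmatch and re modules stand in for whatever matching indexer performs internally. CaseType handling is deliberately left out.

```python
import fnmatch
import re


def url_matches(url, pattern, cmp_type="String", match_type="Match"):
    """Return True if url corresponds to a command with the given pattern.

    Hypothetical helper: approximates String (wildcard) and Regex
    comparison plus the NoMatch inversion. Not indexer's real code.
    """
    if cmp_type == "Regex":
        # The regular expression is searched for in the URL; anchors such
        # as ^ must be written explicitly in the pattern.
        hit = re.search(pattern, url) is not None
    else:
        # String wildcards: ? is one character, * is any number of characters.
        # Assumption: fnmatch also treats [...] specially, indexer may not.
        hit = fnmatch.fnmatchcase(url, pattern)
    # NoMatch reverses the result of the comparison.
    return hit if match_type == "Match" else not hit


print(url_matches("http://www.example.ru/index.html", "http://*.ru/*"))
print(url_matches("http://www.example.ru/", r"^http://.*\.ru/", cmp_type="Regex"))
print(url_matches("http://www.example.org/", "http://*.com/*", match_type="NoMatch"))
```

All three calls print True: the first two URLs are in the .ru domain, and the third is outside .com, so it corresponds to the NoMatch pattern.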
3.6.2. Realm command

The Realm command is a more powerful means of describing the web area to be indexed. It works almost like the Server command but takes a regular expression or string wildcards as its argument.

3.6.3. Subnet command

The Subnet command is another way to describe the web area to be indexed. It works almost like the Server command but takes string wildcards or a network specified in CIDR presentation format as its argument. For example:

    Subnet 192.168.*.*

For a network in CIDR presentation format, you may specify the subnet in the forms a.b.c.d/m, a.b.c, a.b or a:

    Subnet 192.168.10.0/24

You may use the optional NoMatch argument. For example, to index everything except the 195.x.x.x subnet, use:

    Subnet NoMatch 195.*.*.*

3.6.4. Using different parameters for a server and its subsections

indexer looks for Server and Realm commands in order of their appearance. Thus, if you want to give different parameters to, say, a whole server and one of its subsections, add the subsection line before the whole server's line. Imagine that a server has a subdirectory containing news articles. Those articles should surely be reindexed more often than the rest of the server. The following combination may be useful in such cases:
    # Add subsection
    Period 200000
    Server http://servername/news/
    # Add server
    Period 600000
    Server http://servername/

These commands give the /news/ subdirectory a different reindexing period from that of the server as a whole. indexer will choose the first Server record for http://servername/news/page1.html, because it matches and was given first.

3.6.5. Default indexer behavior

The default behavior of indexer is to follow links that have a corresponding Server/Realm command in the indexer.conf file. It also jumps between servers if both of them are present in indexer.conf, either directly in a Server command or indirectly in a Realm command. For example, there are two Server commands:
    Server http://www/
    Server http://web/

When indexing http://www/page1.html, indexer WILL follow a link to http://web/page2.html if it finds one. Note that these pages are on different servers, but BOTH of them have a corresponding Server record. If one of the Server commands is deleted, indexer will remove all expired URLs from that server during the next reindexing.

3.6.6. Using indexer -f <filename>

The third scheme is very useful when running indexer -i -f url.txt. You can maintain the required servers in url.txt. When a new URL is added to url.txt, indexer will index the server of this URL during its next startup.

3.6.7. ServerDB, RealmDB, SubnetDB and URLDB commands

    URLDB pgsql://foo:bar@localhost/portal/links?field=url

These commands are equivalent to the Server, Realm, Subnet and URL commands respectively, but take their arguments from the specified field of an SQL table. In the example above, URLs are taken from the url field (field=url) of the database table given in the address.

3.6.8. URL command

    URL http://localhost/path/to/page.html

This command inserts the given URL into the database.
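To illustrate the first-match rule from section 3.6.4, here is a minimal Python sketch of how the order of Server lines decides which Period applies to a URL. It is not indexer's implementation: the helper pick_period and the list-of-pairs representation of the configuration are hypothetical, and simple prefix matching is assumed for Server URLs.

```python
def pick_period(url, commands):
    """Return the Period of the first Server entry whose URL prefixes url.

    commands is a list of (server_url, period) pairs in the order in which
    they appear in the configuration. Hypothetical helper; prefix matching
    is an assumption made for this sketch.
    """
    for server_url, period in commands:
        if url.startswith(server_url):
            return period
    return None  # no corresponding command: the URL would not be indexed


commands = [
    ("http://servername/news/", 200000),  # subsection listed first
    ("http://servername/", 600000),       # whole server listed second
]

print(pick_period("http://servername/news/page1.html", commands))  # 200000
print(pick_period("http://servername/about.html", commands))       # 600000
print(pick_period("http://other/", commands))                      # None
```

If the two entries were listed in the opposite order, the whole-server entry would match first and the news page would get the period 600000, which is why the subsection line has to come first.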