3.6. Specifying WEB space to be indexed

When indexer tries to insert a new URL into the database, or to index an existing one, it first checks whether the URL has a corresponding Server, Realm or Subnet command in indexer.conf. URLs without a corresponding Server, Realm or Subnet command are not indexed. By default, URLs that are already in the database but have no corresponding Server/Realm/Subnet command are deleted from the database. This may happen, for example, after some Server/Realm/Subnet commands have been removed from indexer.conf.

These commands have the following format:

    [Server | Realm | Subnet] [method] [subsection] [CaseType] [MatchType] [CmpType] pattern [alias]

The mandatory pattern parameter specifies a URL, a part of a URL, or a pattern to compare against.

The optional method parameter specifies the document action for this command. It may take the values Allow, Disallow, HrefOnly, CheckOnly, Skip, CheckMP3 and CheckMP3Only. The default value is Allow.

Use the optional subsection parameter to specify the server's checking behavior. Its value must be one of nofollow, page, path, site or world; the default is path.

The optional CaseType parameter specifies the case sensitivity of string comparison. It can take one of the following values: case for case-sensitive comparison, or nocase for case-insensitive comparison.

The optional CmpType parameter specifies the type of comparison and can take two values: Regex and String. String (wildcard) comparison is the default. In String patterns you can use the ? and * signs, which mean "one character" and "any number of characters" respectively. For example, to index all HTTP sites in the .ru domain, use this command:

    Realm http://*.ru/*

The Regex comparison type takes a regular expression as its argument and is activated with the Regex keyword. For example, you can describe everything in the .ru domain using the Regex comparison type:

    Realm Regex ^http://.*\.ru/

The optional MatchType parameter specifies the match type. The possible values are Match and NoMatch, with Match as the default. NoMatch has the reverse effect: a URL that does not match the given pattern corresponds to this command. For example, use this command to index everything outside the .com domain:

    Realm NoMatch http://*.com/*

The optional alias argument allows very complicated URL rewriting, more powerful than the other aliasing mechanisms. See Section 3.7 for an explanation of alias usage. Alias works only with the Regex comparison type and has no effect with the String type.

3.6.1. Server command

This is the main command of the indexer.conf file. It is used to add servers, or parts of servers, to be indexed. It also tells indexer to insert the given URL into the database at startup.

For example, the command

    Server http://localhost/

allows indexing of the whole http://localhost/ server and makes indexer insert that URL into the database at startup. You can also specify a path to index only a subsection of a server:

    Server http://localhost/subsection/

This command, too, tells indexer to insert the given URL at startup.
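
The String and Regex comparison types and the NoMatch modifier described in section 3.6 can be illustrated with a small Python sketch. This is not indexer's actual implementation; the function names wildcard_to_regex and url_corresponds are hypothetical, and the choice to anchor String patterns at both ends of the URL is an assumption made for the illustration.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate a String-type pattern into a regular expression.

    '*' stands for any number of characters, '?' for exactly one character.
    Every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def url_corresponds(url, pattern, cmp_type="String", match_type="Match",
                    case_sensitive=True):
    """Return True if the URL corresponds to a command with these parameters.

    Assumption: String patterns must cover the whole URL, while Regex
    patterns are searched for anywhere unless they carry their own anchors.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    if cmp_type == "Regex":
        hit = re.search(pattern, url, flags) is not None
    else:
        hit = re.fullmatch(wildcard_to_regex(pattern), url, flags) is not None
    # NoMatch reverses the result of the comparison.
    return hit if match_type == "Match" else not hit


if __name__ == "__main__":
    print(url_corresponds("http://www.example.ru/index.html", "http://*.ru/*"))
    print(url_corresponds("http://www.example.org/", "http://*.com/*",
                          match_type="NoMatch"))
```

With this sketch, a URL in the .ru domain corresponds to both the wildcard pattern http://*.ru/* and the regular expression ^http://.*\.ru/, while a NoMatch command with http://*.com/* corresponds to every URL that is not in the .com domain.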

3.6.2. Realm command

The Realm command is a more powerful means of describing the web area to be indexed. It works almost like the Server command, but takes a regular expression or string wildcards as its pattern parameter and does not insert any URL into the database for indexing.

3.6.3. Subnet command

The Subnet command is another way to describe the web area to be indexed. It works almost like the Server command, but takes string wildcards or a network in CIDR presentation format as its pattern argument, which is compared against the IP address instead of the URL.

In the string wildcard format, the argument may contain the * and ? signs, which mean "any number of characters" and "one character" respectively. For example, to index all HTTP sites in your local subnet, use this command:

    Subnet 192.168.*.*

In CIDR presentation format, you may specify the subnet in the forms a.b.c.d/m, a.b.c, a.b or a:

    Subnet 192.168.10.0/24

You may use the optional NoMatch argument. For example, to index everything outside the 195.x.x.x subnet, use:

    Subnet NoMatch 195.*.*.*

3.6.4. Using different parameters for a server and its subsections

indexer looks up Server and Realm commands in the order of their appearance. Thus, if you want to give different parameters to, for example, a whole server and one of its subsections, you should put the subsection line before the line for the whole server. Imagine that you have a server subdirectory that contains news articles. Those articles should be reindexed more often than the rest of the server. The following combination may be useful in such cases:

    # Add subsection
    Period 200000
    Server http://servername/news/
    # Add server
    Period 600000
    Server http://servername/

These commands give the /news/ subdirectory a different reindexing period from that of the server as a whole. indexer will choose the first Server record for http://servername/news/page1.html, because that record matches and was given first.

3.6.5. Default indexer behavior

The default behavior of indexer is to follow links that have a corresponding Server/Realm command in the indexer.conf file. It also jumps between servers if both of them are present in indexer.conf, either directly in a Server command or indirectly in a Realm command. For example, suppose there are two Server commands:

    Server http://www/
    Server http://web/

When indexing http://www/page1.html, indexer WILL follow a link to http://web/page2.html if it finds one. Note that these pages are on different servers, but BOTH of them have a corresponding Server record. If one of the Server commands is deleted, indexer will remove all expired URLs of that server during the next reindexing.

3.6.6. Using indexer -f <filename>

The third scheme is very useful when running indexer -i -f url.txt. You may maintain the required servers in url.txt. When a new URL is added to url.txt, indexer will index the server of this URL during the next startup.

3.6.7. ServerDB, RealmDB, SubnetDB and URLDB commands

    URLDB pgsql://foo:bar@localhost/portal/links?field=url

These commands are equivalent to the Server, Realm, Subnet and URL commands respectively, but take their arguments from a field of the specified SQL table. In the example above, URLs are taken from the url field of the links table in the portal database.

3.6.8. URL command

    URL http://localhost/path/to/page.html

This command inserts the given URL into the database. It is useful for adding several entry points to one server. It has no effect if the URL is already in the database.
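
Returning to the ordering rule of section 3.6.4, the following Python sketch shows why the subsection line has to come before the line for the whole server. It is only an illustration under simplifying assumptions: the names SERVER_RULES and reindex_period are hypothetical, and each Server pattern is treated as a plain URL prefix, which is a simplification of what indexer actually does.

```python
from typing import List, Optional, Tuple

# (Server pattern, Period value) pairs, in the order in which they
# appear in the example configuration from section 3.6.4.
SERVER_RULES: List[Tuple[str, int]] = [
    ("http://servername/news/", 200000),
    ("http://servername/", 600000),
]


def reindex_period(url: str,
                   rules: List[Tuple[str, int]] = SERVER_RULES) -> Optional[int]:
    """Return the Period of the first rule whose pattern matches the URL.

    Assumption: a rule matches when the URL starts with its pattern.
    Returns None when no rule matches, i.e. the URL would not be indexed.
    """
    for prefix, period in rules:
        if url.startswith(prefix):
            return period
    return None


if __name__ == "__main__":
    print(reindex_period("http://servername/news/page1.html"))
    print(reindex_period("http://servername/about.html"))
```

If the two rules were listed in the opposite order, the rule for the whole server would match http://servername/news/page1.html first and the shorter period for /news/ would never be used.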