DataparkSearch Engine 4.54: Reference manual

Chapter 7. Languages support

Table of Contents
7.1. Character sets
7.2. Making multi-language search pages
7.3. Segmenters for Chinese, Japanese, Korean and Thai languages
7.4. Multilingual servers support

7.1. Character sets

7.1.1. Supported character sets

DataparkSearch supports almost all known 8-bit character sets as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and UTF-8. Some multi-byte character sets are not supported by default, because their conversion tables are rather large and increase the size of the executable files. See the configure parameters to enable support for these charsets.

DataparkSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.

Table 7-1. Language groups

Language group	Character sets
Arabic	cp864, ISO-8859-6, MacArabic, windows-1256
Armenian	armscii-8
Baltic	cp775, ISO-8859-13, ISO-8859-4, windows-1257
Celtic	ISO-8859-14
Central European	cp852, ISO-8859-16, ISO-8859-2, MacCE, MacCroatian, MacRomania, windows-1250
Chinese Simplified	GB2312, GBK
Chinese Traditional	Big5, Big5-HKSCS, cp950, GB-18030
Cyrillic	cp855, cp866, cp866u, ISO-8859-5, KOI-7, KOI8-C, KOI8-R, KOI8-U, MacCyrillic, windows-1251
Georgian	georgian-academy, georgian-ps, geostd8
Greek	cp869, cp875, ISO-8859-7, MacGreek, windows-1253
Hebrew	cp862, ISO-8859-8, MacHebrew, windows-1255
Icelandic	cp861, MacIceland
Indian	MacGujarati, tscii
Iranian	ISIRI3342
Japanese	EUC-JP, ISO-2022-JP, Shift_JIS
Korean	EUC-KR
Lao	cp1133
Nordic	cp865, ISO-8859-10
South Eur	ISO-8859-3
Tajik	KOI8-T
Thai	cp874, ISO-8859-11, MacThai
Turkish	cp1026, cp857, ISO-8859-9, MacTurkish, windows-1254
Unicode	sys-int, UTF-16BE, UTF-16LE, UTF-8
Vietnamese	VISCII, windows-1258
Western	cp437, cp500, cp850, cp860, cp863, IBM037, ISO-8859-1, ISO-8859-15, MacRoman, US-ASCII, windows-1252

7.1.2. Character sets aliases

Each charset is recognized under a number of aliases, because web servers can report the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 all name the same charset. The search engine understands the following charset name aliases:

Table 7-2. Charset aliases

Charset	Aliases
armscii-8	armscii-8, armscii8
Big5	big-5, big-five, big5, bigfive, cn-big5, csbig5
Big5-HKSCS	big5-hkscs, big5_hkscs, big5hk, hkscs
cp1026	1026, cp-1026, cp1026, ibm1026
cp1133	1133, cp-1133, cp1133, ibm1133
cp437	437, cp437, ibm437
cp500	500, cp500, ibm500
cp775	775, cp775, ibm775
cp850	850, cp850, cspc850multilingual, ibm850
cp852	852, cp852, ibm852
cp855	855, cp855, ibm855
cp857	857, cp857, ibm857
cp860	860, cp860, ibm860
cp861	861, cp861, ibm861
cp862	862, cp862, ibm862
cp863	863, cp863, ibm863
cp864	864, cp864, ibm864
cp865	865, cp865, ibm865
cp866	866, cp866, csibm866, ibm866
cp866u	866u, cp866u
cp869	869, cp869, csibm869, ibm869
cp874	874, cp874, cs874, ibm874, windows-874
cp875	875, cp875, ibm875, windows-875
cp950	950, cp950, windows-950
EUC-JP	cseucjp, euc-jp, euc_jp, eucjp, ujis, x-euc-jp
EUC-KR	cseuckr, euc-kr, euc_kr, euckr
GB-18030	gb-18030, gb18030
GB2312	chinese, cn-gb, csgb2312, csiso58gb231280, euc-cn, euc_cn, euccn, gb2312, gb_2312-80, iso-ir-58
GBK	cp936, gbk, windows-936
georgian-academy	georgian-academy
georgian-ps	georgian-ps
geostd8	geo8-gov, geostd8
IBM037	037, cp037, csibm037, ibm037
ISIRI3342	isiri-3342, isiri3342
ISO-2022-JP	csiso2022jp, iso 2022-jp, iso-2022-jp
ISO-8859-1	cp819, csisolatin1, ibm819, iso 8859-1, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin-1, latin1
ISO-8859-10	csisolatin6, iso 8859-10, iso-8859-10, iso-ir-157, iso8859-10, iso_8859-10, iso_8859-10:1992, l6, latin-6, latin6
ISO-8859-11	iso 8859-11, iso-8859-11, iso8859-11, iso_8859-11, iso_8859-11:1992, tactis, thai, tis-620, tis620
ISO-8859-13	iso 8859-13, iso-8859-13, iso-ir-179, iso8859-13, iso_8859-13, l7, latin-7, latin7
ISO-8859-14	iso 8859-14, iso-8859-14, iso-ir-199, iso8859-14, iso_8859-14, iso_8859-14:1998, l8, latin-8, latin8
ISO-8859-15	iso 8859-15, iso-8859-15, iso-ir-203, iso8859-15, iso_8859-15, iso_8859-15:1998, l9, latin-0, latin-9, latin0, latin9
ISO-8859-16	iso 8859-16, iso-8859-16, iso-ir-226, iso8859-16, iso_8859-16, iso_8859-16:2000
ISO-8859-2	csisolatin2, iso 8859-2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin-2, latin2
ISO-8859-3	csisolatin3, iso 8859-3, iso-8859-3, iso-ir-109, iso8859-3, iso_8859-3, iso_8859-3:1988, l3, latin-3, latin3
ISO-8859-4	csisolatin4, iso 8859-4, iso-8859-4, iso-ir-110, iso8859-4, iso_8859-4, iso_8859-4:1988, l4, latin-4, latin4
ISO-8859-5	csisolatincyrillic, cyrillic, iso 8859-5, iso-8859-5, iso-ir-144, iso8859-5, iso_8859-5, iso_8859-5:1988
ISO-8859-6	arabic, asmo-708, csisolatinarabic, ecma-114, iso 8859-6, iso-8859-6, iso-ir-127, iso8859-6, iso_8859-6, iso_8859-6:1987
ISO-8859-7	csisolatingreek, ecma-118, elot_928, greek, greek8, iso 8859-7, iso-8859-7, iso-ir-126, iso8859-7, iso_8859-7, iso_8859-7:1987
ISO-8859-8	csisolatinhebrew, hebrew, iso 8859-8, iso-8859-8, iso-ir-138, iso8859-8, iso_8859-8, iso_8859-8:1988
ISO-8859-9	csisolatin5, iso 8859-9, iso-8859-9, iso-ir-148, iso8859-9, iso_8859-9, iso_8859-9:1989, l5, latin-5, latin5
KOI-7	iso-ir-37, koi-7, koi7
KOI8-C	cskoi8c, koi8-c, koi8c
KOI8-R	cskoi8r, koi8-r, koi8r
KOI8-T	cskoi8t, koi8-t, koi8t
KOI8-U	cskoi8u, koi8-u, koi8u
MacArabic	macarabic
MacCE	cmac, macce, maccentraleurope, x-mac-ce
MacCroatian	maccroation
MacCyrillic	maccyrillic, x-mac-cyrillic
MacGreek	macgreek
MacGujarati	macgujarati
MacHebrew	machebrew
MacIceland	macisland
MacRoman	csmacintosh, mac, macintosh, macroman
MacRomania	macromania
MacThai	macthai
MacTurkish	macturkish
Shift_JIS	csshiftjis, ms_kanji, s-jis, shift-jis, shift_jis, sjis, x-sjis
sys-int	sys-int
tscii	tscii
US-ASCII	ansi_x3.4-1968, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii
UTF-16BE	utf-16, utf-16be, utf16, utf16be
UTF-16LE	utf-16le, utf16le
UTF-8	utf-8, utf8
VISCII	csviscii, viscii, viscii1.1-1
windows-1250	cp-1250, cp1250, ms-ee, windows-1250
windows-1251	cp-1251, cp1251, ms-cyr, ms-cyrl, win-1251, win1251, windows-1251
windows-1252	cp-1252, cp1252, ms-ansi, windows-1252
windows-1253	cp-1253, cp1253, ms-greek, windows-1253
windows-1254	cp-1254, cp1254, ms-turk, windows-1254
windows-1255	cp-1255, cp1255, ms-hebr, windows-1255
windows-1256	cp-1256, cp1256, ms-arab, windows-1256
windows-1257	cp-1257, cp1257, winbaltrim, windows-1257
windows-1258	cp-1258, cp1258, windows-1258
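
The alias table can be read as a lookup from any reported name to one canonical charset. The following is a minimal Python sketch of that idea, not DataparkSearch's own code: it copies only three rows of Table 7-2, the names ALIASES and canonical_charset are hypothetical, and matching names case-insensitively is an assumption (the table lists aliases in lower case only).

```python
# Hypothetical excerpt of Table 7-2: canonical charset name -> known aliases.
ALIASES = {
    "ISO-8859-2": ["csisolatin2", "iso 8859-2", "iso-8859-2", "iso-ir-101",
                   "iso8859-2", "iso_8859-2", "iso_8859-2:1987", "l2",
                   "latin-2", "latin2"],
    "KOI8-R": ["cskoi8r", "koi8-r", "koi8r"],
    "UTF-8": ["utf-8", "utf8"],
}

# Invert the table once so that each alias points at its canonical name.
LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}


def canonical_charset(name):
    """Return the canonical charset for a reported name, or None if unknown.

    Assumption: names are compared case-insensitively after trimming spaces.
    """
    return LOOKUP.get(name.strip().lower())


if __name__ == "__main__":
    for reported in ("latin2", "ISO8859-2", "utf8", "no-such-charset"):
        print(reported, "->", canonical_charset(reported))
```

Inverting the table up front keeps each lookup to a single dictionary access, however many aliases a charset has.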

7.1.3. Recoding

indexer recodes all documents to the character set specified by the LocalCharset command in your indexer.conf file. Internally, recoding is implemented using Unicode. If a character cannot be converted directly from one charset to another, DataparkSearch escapes it as an HTML numeric character reference (that is, in the form &#NNN;, where NNN is the character's Unicode code). Thus, whichever LocalCharset you choose, you do not lose any information about the indexed documents, but the size of the database you get after indexing depends on the LocalCharset you select.
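
The effect of this escaping on stored size can be illustrated outside DataparkSearch. The sketch below uses Python's built-in codecs as a stand-in for the engine's recoder (an assumption for illustration only); the "xmlcharrefreplace" error handler produces the same &#NNN; form for characters the target charset cannot represent. The function name recode and the sample string are hypothetical.

```python
def recode(text, local_charset):
    """Encode text into local_charset, escaping unrepresentable characters
    as decimal HTML numeric character references (&#NNN;)."""
    return text.encode(local_charset, errors="xmlcharrefreplace")


# A mixed Cyrillic/Latin sample string.
sample = "Привет, world"

# iso-8859-1 has no Cyrillic letters, so each one becomes a 7-byte reference.
as_latin1 = recode(sample, "iso-8859-1")

# koi8-r covers Cyrillic, so every character is stored as a single byte.
as_koi8r = recode(sample, "koi8-r")

if __name__ == "__main__":
    print(as_latin1, len(as_latin1))
    print(as_koi8r, len(as_koi8r))
```

No information is lost in either case, but the escaped form is several times longer, which is why a LocalCharset that matches the main language of the indexed documents tends to give a smaller database.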

7.1.4. Recoding at search time

You may display search results in any charset supported by DataparkSearch. Use the BrowserCharset command in search.htm to select the charset for search results. This charset may differ from the LocalCharset specified. All recoding is done automatically.

7.1.5. Document charset detection

indexer detects the document character set in this order:

The HTTP header "Content-type: text/html; charset=xxx"
The tag <META NAME="Content-Type" CONTENT="text/html; charset=xxx">
This variant can be switched off with the command GuesserUseMeta no in your indexer.conf.
The default from the "Charset" field in Common Parameters
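
The precedence above can be sketched as a short fallback chain. This is an illustrative Python sketch under stated assumptions, not the indexer's implementation: the function detect_charset, its parameters and the two regular expressions are hypothetical, and the use_meta flag only mimics the effect of GuesserUseMeta no.

```python
import re

# Matches "charset=xxx" inside a Content-type header value.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

# Matches charset=xxx inside the CONTENT attribute of a META tag.
META_RE = re.compile(
    r"<meta[^>]+content\s*=\s*[\"'][^\"']*charset\s*=\s*([\w.:-]+)",
    re.IGNORECASE,
)


def detect_charset(content_type_header, html, default_charset, use_meta=True):
    """Return the charset using the documented precedence:
    header first, then META tag (unless disabled), then the default."""
    if content_type_header:
        match = CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower()
    if use_meta:
        match = META_RE.search(html)
        if match:
            return match.group(1).lower()
    return default_charset


if __name__ == "__main__":
    page = '<META NAME="Content-Type" CONTENT="text/html; charset=koi8-r">'
    print(detect_charset("text/html; charset=windows-1251", page, "iso-8859-1"))
    print(detect_charset("text/html", page, "iso-8859-1"))
    print(detect_charset("text/html", page, "iso-8859-1", use_meta=False))
```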

7.1.6. Automatic charset guesser

DataparkSearch has an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. They are installed under the /usr/local/dpsearch/etc/langmap/ directory by default. Look there for the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters. Shorter texts may not be guessed well.
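
To give a feel for the technique, here is a minimal Python sketch of N-gram-based text categorization in its commonly described form: each text is reduced to a ranked list of its most frequent character n-grams, and a document is assigned to the profile with the smallest "out-of-place" rank distance. This is not DataparkSearch's code. The language map file format and the exact scoring used by the guesser are not shown here, and all names, sample texts and parameter values below are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent character n-grams (1..max_n) to its rank."""
    # Join words with "_" so that word boundaries take part in the n-grams.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def guess(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Tiny illustrative training samples; real language maps need far more text.
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
RU_SAMPLE = "быстрая коричневая лиса прыгает через ленивую собаку и кошка сидит на ковре"
PROFILES = {"en": ngram_profile(EN_SAMPLE), "ru": ngram_profile(RU_SAMPLE)}

if __name__ == "__main__":
    print(guess("the dog and the cat", PROFILES))
    print(guess("собака и кошка", PROFILES))
```

Because the ranking only stabilizes once enough n-grams have been counted, short inputs yield noisy profiles, which is consistent with the note above that short texts may not be guessed well.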

7.1.6.1. LangMapFile command

Loads a language map for the charset and language guesser from the given file. You may specify either an absolute file name or a name relative to the DataparkSearch /etc directory. You may use several LangMapFile commands.

LangMapFile langmap/en.ascii.lm

7.1.6.2. Build your own language maps

To build your own language map, use the dpguesser utility. You also need to collect a file of language samples in the desired charset. To create a new language map, use the following command:

        dpguesser -p -c charset -l language < FILENAME > language.charset.lm

You can also use the dpguesser utility to guess a document's language and charset using existing language maps. To do this, use the following command:

        dpguesser [-n maxhits] < FILENAME

Some languages may be written in several different charsets. To convert from one charset supported by DataparkSearch to another, use the dpconv utility.

        dpconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile

You may also specify the -e switch for dpconv to use HTML escape entities for input, and the -E switch to use them for output.

By default, both the dpguesser and dpconv utilities are installed into the /usr/local/dpsearch/sbin/ directory.

DataparkSearch can update language and charset maps automatically while indexing, provided the remote server explicitly specifies the language and charset of its pages. To enable this function, add the following command to your indexer.conf file:

LangMapUpdate yes

By default, DataparkSearch uses only the first 512 bytes of each indexed file to detect language and charset. You may change this value with the GuesserBytes command. Use a value of 0 to use all the text of the indexed document.

GuesserBytes 16384

7.1.7. Default charset

Use the RemoteCharset command in indexer.conf to choose the default charset of indexed servers.

7.1.8. Default Language

You can set the default language for servers with the DefaultLang indexer.conf variable. This is useful when restricting a search by URL language.

DefaultLang <string>

Sets the default language for a server. Use it if you need to restrict searches by language.

DefaultLang en

7.1.9. LocalCharset command

Defines the charset used to store data in the database. All other character sets are recoded into the given charset. See Section 7.1 for a detailed explanation of how to choose a LocalCharset depending on the languages used on your site(s). This command should be used once and takes global effect for the config file. See the documentation for the full list of supported charsets. The default LocalCharset is iso-8859-1 (latin1).

LocalCharset koi8-r

7.1.10. ForceIISCharset1251 command

This option is useful for users who deal with Cyrillic content and broken (or misconfigured?) Microsoft IIS web servers, which tend not to report the charset correctly. This is a really dirty hack, but when this option is turned on, all servers that identify themselves as 'Microsoft' or 'IIS' are assumed to serve content in the Windows-1251 charset. This command should be used only once in the configuration file and takes global effect. Default: no

ForceIISCharset1251 yes

7.1.11. RemoteCharset command

RemoteCharset <charset>

<charset> is the default character set for the servers in the following Server, Realm or Subnet command(s). It is required only for "bad" servers that do not send charset information in the header "Content-type: text/html; charset=some_charset" and do not have <META NAME="Content-Type" CONTENT="text/html; charset=some_charset"> in their pages. It can be set before every Server, Realm or Subnet command and takes effect until the end of the config file or until the next RemoteCharset command. The default value is iso-8859-1 (latin1).

RemoteCharset iso-8859-5
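
The scoping rule (a RemoteCharset applies to every following Server, Realm or Subnet command until the next RemoteCharset or the end of the file) can be illustrated with a small Python sketch that walks a config text top to bottom. It is a simplified model, not the indexer's parser: the function name charset_per_target is hypothetical, the example URLs and subnet are made up, and the sketch assumes the simplest command form, in which the last word on the line is the URL or pattern.

```python
def charset_per_target(config_text, default="iso-8859-1"):
    """Return (target, charset) pairs showing which RemoteCharset is in
    effect for each Server, Realm or Subnet line."""
    current = default  # documented default until a RemoteCharset is seen
    result = []
    for raw in config_text.splitlines():
        parts = raw.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "RemoteCharset" and len(parts) > 1:
            current = parts[1]
        elif parts[0] in ("Server", "Realm", "Subnet") and len(parts) > 1:
            # Assumption: the last word is the URL or pattern argument.
            result.append((parts[-1], current))
    return result


# Made-up example configuration for illustration only.
EXAMPLE = """
Server http://www.example.com/
RemoteCharset iso-8859-5
Server http://ru.example.com/
Subnet 192.168.1.*
RemoteCharset windows-1251
Server http://win.example.com/
"""

if __name__ == "__main__":
    for target, charset in charset_per_target(EXAMPLE):
        print(target, charset)
```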

7.1.12. URLCharset command

URLCharset <charset>

<charset> is the character set for the URL argument in the following Server, Realm or URL command(s). This command specifies the character set only for the arguments of the commands that follow and has no effect on charset detection for indexed pages. It has lower priority than RemoteCharset. It can be set before every Server, Realm or URL command and takes effect until the end of the config file or until the next URLCharset command. The default value is ISO-8859-1 (latin1).

URLCharset KOI8-R

7.1.13. CharsToEscape command

CharsToEscape "\"&<>![]"

Use this command in your search template to specify the list of characters to escape for $&(x) search template meta-variables.
