DataparkSearch Engine is a full-featured open source web-based search
engine released under the GNU General Public License and designed to organize search
within a website, a group of websites, an intranet or a local system.
Key features
- Support for http, https, ftp, nntp and news URL schemes.
- htdb virtual URL scheme for indexing SQL databases.
- Indexes text/html, text/xml, text/plain, audio/mpeg (mp3) and image/gif mime types natively.
- Support for external parsers for other document types, including Microsoft Word, Excel, RTF, PowerPoint, Adobe Acrobat PDF and Flash.
- Can index multilingual sites using content negotiation.
- Can search all of the word forms using ispell affixes and dictionaries.
- Synonym, acronym and abbreviation query expansion based on editable dictionaries, specified by language and charset.
- Stop-words, synonyms and acronyms lists.
- Options to query with all words, all words near each other, any words, or Boolean queries. A subset of VQL
(Verity Query Language) is supported.
- Popularity Rank based on a neural network model.
- Results can be sorted by relevancy (using vector calculation), by popularity rank, either "Goo" (adding weight
for incoming links) or "Neo" (neural network model), by last modified time, and by
"importance" (a combination of relevancy and popularity rank).
- Supports a wide range of character sets, with automated character set and language detection.
- Offers an accent insensitive search option.
- Provides phrase segmenting (tokenizing) for Chinese, Japanese, Korean and Thai.
- Includes an indexer and a web CGI front-end, as well as a search module for Apache web server (mod_dpsearch).
- Handles Internationalized Domain Names (IDN).
- Summary Extraction Algorithm automatically summarizes each document in several sentences.
- Uses If-Modified-Since for efficient transfer of only changed files.
- Can tweak URLs with session IDs and other weird formats, including some JavaScript link decoding.
- Can perform parallel and multi-threaded indexing for faster updating.
- Flexible update scheduling, including options for checking some sections of a site more frequently.
- Handles basic authentication (user name and password) and cookies.
- Stores a compressed text version of the documents for extracting and viewing.
- Can specify a default character set and language for a server or subdirectory, or a list of possible languages.
- Noindex tags: <!--UdmComment-->, <NOINDEX> and <!--noindex-->. Google's special comments
<!-- google_ad_section_start -->, <!-- google_ad_section_start(weight=ignore) --> and <!-- google_ad_section_end -->
are also treated as tags to include/exclude content.
- Can specify a content body tag.
- Spellchecking for query words with aspell.
- Flexible options and commands to customize search result pages.
- Effective caching significantly reduces search times.
- Query logging stores the query, query parameters and the number of results found.
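
The If-Modified-Since feature listed above relies on a standard HTTP conditional request. The following is a minimal Python sketch of that general mechanism, not DataparkSearch's own implementation: the function names `conditional_headers` and `fetch_if_changed` and the example URL are hypothetical and chosen only for illustration.

```python
from email.utils import formatdate
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def conditional_headers(last_fetch_epoch):
    """Build the header that asks a server to skip unchanged documents.

    last_fetch_epoch is the Unix timestamp of the previous successful fetch
    (a hypothetical value an indexer would keep per URL).
    """
    # HTTP dates are always expressed in GMT.
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}


def fetch_if_changed(url, last_fetch_epoch):
    """Return the document body, or None if the server answers 304 Not Modified."""
    request = Request(url, headers=conditional_headers(last_fetch_epoch))
    try:
        with urlopen(request) as response:
            return response.read()
    except HTTPError as error:
        if error.code == 304:
            # The stored copy is still current, so nothing is transferred.
            return None
        raise


# Building the header needs no network access.
print(conditional_headers(0))
```

Calling `fetch_if_changed("http://example.com/", last_fetch_epoch)` would return `None` whenever the server reports that the page has not changed since the given time, so only modified documents are transferred and re-indexed.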
Documentation
DataparkSearch documentation is included in each release and
snapshot distribution, in the doc subdirectory. It is also available online in
English (PDF, 1,471,090 bytes)
and in Russian.
You can use our
forum to ask about DataparkSearch. Or you may subscribe to DataparkSearch group at Google Groups:
groups.google.com/group/dataparksearch/.
You can also share your experience of using DataparkSearch in DataparkSearch's
collaborative documentation (wiki).
DataparkSearch's ChangeLog (also available as an RSS feed);
PAD file.
Download
Latest DataparkSearch version released: dpsearch-4.52.tar.bz2,
2,175,210 bytes, 25.04.2009, 16:06 MSK
You may try the latest snapshot: dpsearch-4.53-04112009.tar.bz2,
2,132,267 bytes, 04.11.2009, 02:22 MSK
dpsearch-spell-ja.tgz,
68,705 bytes, 09.11.2004, 01:06 MSK -
Quasi-ispell data for Japanese. THIS IS NOT VALID ISPELL DATA.
Can be used only with DataparkSearch 4.27 or later. All data are in the EUC-JP charset.
Additional Data
- Frequency dictionaries
- Traditional Chinese,
730,641 bytes, 04.09.2008, 12:10 MSK
- Mandarin,
394,634 bytes, 04.09.2008, 12:11 MSK
- Korean,
30,624 bytes, 04.09.2008, 12:11 MSK
- Korean, EUC-KR charset, 246,038 bytes
- Thai,
126,572 bytes, 04.09.2008, 12:11 MSK
- Synonym lists
- English,
774,663 bytes, 04.09.2008, 12:11 MSK
- German,
131,880 bytes, 04.09.2008, 12:11 MSK
- Italian,
166,684 bytes, 04.09.2008, 12:11 MSK
- Polish,
92,158 bytes, 04.09.2008, 12:11 MSK
- Russian,
73,968 bytes, 04.09.2008, 12:11 MSK
- Acronym and abbreviation lists
- English biomedical acronyms and abbreviations,
7,084 bytes, 04.09.2008, 12:10 MSK
- Other code
Mirrors
Bugs
You may see all open or new bug reports or post your bug
reports at Google Code's home.
Sample sites
- 43°N 39°E (PgSQL, cache mode, searchd is used.
SQL-server: PIII 670MHz, 512M RAM, IDE SATA 120Gb. Searching PC: Celeron 2.25GHz, 1G RAM, IDE UDMA100.
1'004'109 pages, 1'268'964 sites, 27.06 Gbytes indexed).
Test search in Chinese,
Test search in Japanese,
Test search in Korean,
Test search in Thai.
- All Sochi's Internet.
- News Lookup Service (MySQL, cache mode, searchd is not used).
- DataparkSearch Engine usage location map
Donate
If you use DataparkSearch and find it useful,
or want to encourage further development, feel free to make a donation
(at Kagi) to support this project. Any amount is gratefully appreciated.
DataparkSearch's Awards
DataparkSearch's blog feed