Scaling

Overview

Dataparksearch can scale to millions of documents while still accepting an incoming feed of new documents. This is possible because the search part can be separated from the indexing part, and because one installation can use multiple indexes.

Architecture

The key to scaling the search engine installation is to note the different requirements and possibilities:

  • The number of searches handled can be increased by separating the search engine part from the indexer and deploying multiple search engines.
  • An incoming feed of documents can be handled alongside a large collection (in the millions) by having multiple indexes:
    • one or more very large but static or near-static indexes that hold the bulk of the documents, and
    • a small index into which the incoming feeds are dropped.

The rest of this section describes how to implement this configuration. The configuration has been tested with one machine acting as the indexer with 3 indexes and a cluster of machines performing the searches.

All this requires very careful and co-ordinated configuration of dpsearch on the various machines. This section describes the general concept. For the actual implementation we have created a set of scripts that ensure that the configuration is consistent and works.

Separation of Indexer and Search Engine

In practice (provided that the configuration files are co-ordinated carefully) the search engine portion and the indexer of dpsearch are isolated from each other. The key was to create a shared drive holding all the index files. We accomplished this by adding sufficient disk space to the indexer machine and sharing it with the search engines over NFS. We did run into an issue with locking (the machines need to lock the files while updating or reading): the NFS locks seemed to be unreliable.
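A minimal sketch of the NFS sharing described above. The host names indexer and search1, the export options, and the use of /idxdata as the mount point on the search machines are assumptions for illustration, not details of the original setup. Both commands need root.

```shell
# On the indexer: add an entry like this to /etc/exports
# (read-only export to one hypothetical search machine):
#   /idxdata  search1(ro,sync,no_subtree_check)
# then re-read the export table:
exportfs -ra

# On each search machine: mount the export read-only.
mount -t nfs -o ro indexer:/idxdata /idxdata
```

This does not address the lock reliability issue mentioned above; it only shows the sharing itself.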

Installation

I will describe the installation on a Debian system (Etch or later); it should not be difficult to translate the instructions to another system. This document does not cover the installation and configuration of Apache etc. Please refer to the installation instructions for those packages.

Shared directory (/idxdata)

In order to separate the indexer from the search engine you need to make the index areas available to the indexer and the search engine. The following describes a working layout. The layout makes certain assumptions. The configuration scripts provided enforce these conditions. You will have to examine the layout and the scripts carefully if you change the conditions. The key assumptions are:

  1. The search engines will only access the index read-only (Dataparksearch configuration: ColdVar Yes)
  2. Each collection has two copies of the index: one updated by the indexer, and the other an idle, fixed copy used by the search engine.
  3. The indexing scripts use rsync to sync up the two copies as needed and HUP the search engines to force them to update internally cached information.
  • The indexer needs to have very fast access to it (we made it a local drive and shared it over NFS).
  • The search machines need access to the disk. We mounted the NFS export from the indexer. Read only access is sufficient from the search machines.
  • The shared directory contains only one subdirectory called dpsearch. The rest of the directory structures are created by the configuration scripts automatically.
  • /idxdata/dpsearch has one subdirectory per collection. Our configuration has 3 collections (full, daily, tiny) and thus we have three subdirectories:
    • /idxdata/dpsearch/full
    • /idxdata/dpsearch/daily
    • /idxdata/dpsearch/tiny
  1. <collection> directory (/idxdata/dpsearch/<collection>)
    • contains folders and data used by the collection
    • folders:
    1. /idxdata/dpsearch/<collection>/add
      • contains the *.lst files listing the URLs to be inserted into the collection.
      NOTE: a *.lst file is a file containing a list of URLs
    2. /idxdata/dpsearch/<collection>/delete
      • contains the *.lst files listing the URLs to be deleted from the collection.
    3. /idxdata/dpsearch/<collection>/var.<collection>1
      • one of the VarDirs used by the collection
    4. /idxdata/dpsearch/<collection>/var.<collection>2
      • another of the VarDirs used by the collection
    • symbolic links
    1. /idxdata/dpsearch/<collection>/var.<collection>
      • a link to one of the VarDirs of the collection. This link is used by the indexer (/app/dpsearch/etc/common-full.conf)
    2. /data/dpsearch/<collection>/varsrch.<collection>
      • a link to one of the VarDirs of the collection. This link is used by the search machine (/data/dpsearch/searchdb.conf)
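The layout above can be sketched as a small shell helper. This is an illustration only, not the configuration scripts referred to in this section: the function names are hypothetical, the choice of which VarDir each link points at initially is an assumption, and for simplicity both links are created under the same root (the listing above shows the search link under /data/dpsearch).

```shell
#!/bin/sh
# Sketch only: build the per-collection layout under a given root.

make_collection_layout() {
    root=$1        # e.g. /idxdata/dpsearch
    coll=$2        # e.g. full, daily or tiny
    base="$root/$coll"

    # Folders for URL lists and the two VarDirs.
    mkdir -p "$base/add" "$base/delete" \
             "$base/var.${coll}1" "$base/var.${coll}2" || return 1

    # Assumed starting state: indexer uses copy 1, search uses copy 2.
    ln -sfn "var.${coll}1" "$base/var.$coll"
    ln -sfn "var.${coll}2" "$base/varsrch.$coll"
}

# Bring the search copy in line with the indexer copy using rsync.
# Sending HUP to the search engines afterwards is left to the caller.
sync_search_copy() {
    base=$1/$2
    rsync -a --delete "$base/var.${2}1/" "$base/var.${2}2/"
}
```

For example, make_collection_layout /idxdata/dpsearch daily creates the add, delete, var.daily1 and var.daily2 folders and the two links for the daily collection.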


Debian Specific Notes

Dataparksearch Package

The dpsearch application is not distributed as a Debian package, so it needs to be built as one. The instructions below assume you have built Debian packages from source before.

  • Install all the prerequisites as needed.
  • Untar the source code into a work directory (say ~/dpsearch/dpsearch-4.50)
  • Now go to the source directory (cd ~/dpsearch/dpsearch-4.50)
  • Edit the debian rules file (vi debian/rules) and make sure that the database enabled matches your choice. In particular look for the use of
    • --with-pgsql for Postgres or
    • --with-mysql for MySQL
  • Build a debian package
fakeroot dpkg-buildpackage -uc -us -sa
  • The ready-to-install package should be available as ~/dpsearch/dpsearch/dpsearch_4.50i386.deb
  • Copy the package to the indexer and search machines as needed and install using the command
dpkg -i dpsearch_4.50i386.deb
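Before starting the build it can help to confirm which database flag debian/rules actually contains. The helper below is a hypothetical sketch (list_db_flags is not part of dpsearch); it only searches a rules file for the two flags mentioned above and assumes a grep that supports -o, as on Debian.

```shell
#!/bin/sh
# Print each database configure flag found in the given rules file, once.
list_db_flags() {
    # -e is needed because the pattern itself starts with "--".
    grep -o -E -e '--with-(pgsql|mysql)' "$1" | sort -u
}
```

Run from the source directory as list_db_flags debian/rules. If both flags are printed, or neither, the rules file probably needs editing before the build.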

OS Configuration

Dataparksearch is quite resource intensive; in particular it needs a large number of open files and other resources. The file limits depend on the settings for WrdFiles etc. The following needs to be added or changed:

  • /etc/security/limits.conf
*          soft    memlock         320
*          hard    memlock         320
*          soft    nofile         20240
*          hard    nofile         20240
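
The limits in /etc/security/limits.conf are applied when a session starts, so they take effect only after logging in again. A small sketch for checking the open-file limit from the shell that will run dpsearch; the function name check_nofile is hypothetical, and 20240 is simply the value from the listing above.

```shell
#!/bin/sh
# Succeed if the current soft open-file limit is at least the wanted value.
check_nofile() {
    want=$1
    have=$(ulimit -n)
    # Some systems report "unlimited" instead of a number.
    [ "$have" = "unlimited" ] && return 0
    [ "$have" -ge "$want" ]
}
```

For example, check_nofile 20240 || echo "open file limit too low" prints a warning when the new limit has not been picked up.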