Appendix A. mnoGoSearch change history

Table of Contents
Changes in 3.3
Changes in 3.2

Changes in 3.3

Changes in 3.3.0 (06 March 2007)

  • Cluster support was added. A typical cluster consists of several database machines and a single front-end machine. The front-end machine receives HTTP requests from a user's browser, forwards search queries to the database machines over HTTP, receives a limited number of top-ranked search results from every database machine (in a simple XML format based on the OpenSearch specification), then parses and merges the results, orders them by score, and renders them with an HTML template. This approach distributes the CPU- and disk-intensive operations across the database machines in parallel, leaving only the simple merge and HTML template processing to the front-end machine. As of version 3.3.0, mnoGoSearch allows up to 256 database machines to be joined into a single cluster.
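
    The merge step performed by the front-end machine can be illustrated with a minimal Python sketch. It is not the mnoGoSearch implementation: the function name, the dictionary layout of a hit, and the sample URLs and scores are all assumptions made for illustration. The sketch assumes each database machine returns its results already sorted by descending score.

```python
import heapq
from itertools import islice


def merge_node_results(node_results, limit=10):
    """Merge per-node hit lists into one list of the best `limit` hits.

    Assumption: every input list is already sorted by descending score,
    as a database machine would return its top results.
    """
    merged = heapq.merge(*node_results, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, limit))


# Hypothetical responses from two database machines.
node_a = [{"url": "http://a/1", "score": 0.92}, {"url": "http://a/2", "score": 0.40}]
node_b = [{"url": "http://b/1", "score": 0.75}, {"url": "http://b/2", "score": 0.10}]

top = merge_node_results([node_a, node_b], limit=3)
print([hit["url"] for hit in top])
```

    Because each machine only sends a short, pre-sorted list, the front-end work stays proportional to the number of results displayed rather than to the size of the index.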

  • node.xml-dist, an XML template for a cluster database machine, is now installed into the /etc directory.

  • "DBAddr http://hostname/search.cgi/node.xml" search.htm command was added, to specify the URL of a cluster database machine's XML interface.

  • "DBAddr file:///path/to/node.xml" search.htm command was added, to specify a static XML search response. This is mostly for test purposes.

  • Two cluster types were implemented: a merge cluster, which joins results from several independent databases, each created by its own indexer.conf, and a distributed cluster, which is created by a single indexer.conf, with indexer automatically distributing the search index between the database machines.

  • The default distribution type was changed from "remainder" to "quotient". Thus, for an indexer.conf with three DBAddr commands, distribution is done as follows:

    • URLs with seed 0..85 go to the first DBAddr

    • URLs with seed 86..170 go to the second DBAddr

    • URLs with seed 171..255 go to the third DBAddr

    This distribution style simplifies manual redistribution of an existing clustered database when adding a new DBAddr (i.e. a new database machine). Future releases will provide an automatic tool for redistribution when adding and deleting machines in an existing cluster, as well as more configuration commands to control distribution.
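
    The exact formula is not given here. One mapping that yields three near-equal contiguous seed ranges like those listed above is floor(seed * N / 256), with the older style presumably corresponding to seed mod N. The Python sketch below assumes exactly that; the function names are illustrative and not part of mnoGoSearch.

```python
def quotient_node(seed, nodes):
    """Assumed "quotient" style: contiguous seed ranges per database machine."""
    return seed * nodes // 256


def remainder_node(seed, nodes):
    """Assumed older style: seeds are interleaved across database machines."""
    return seed % nodes


def seed_ranges(nodes):
    """Return the (first, last) seed assigned to each machine under quotient_node."""
    ranges = {}
    for seed in range(256):
        node = quotient_node(seed, nodes)
        first, _ = ranges.get(node, (seed, seed))
        ranges[node] = (first, seed)
    return [ranges[node] for node in range(nodes)]


print(seed_ranges(3))
print(seed_ranges(4))
```

    Under this assumption, going from three to four machines keeps every range contiguous (0..63, 64..127, 128..191, 192..255), so only a slice of seeds at each boundary has to move, whereas an interleaved mapping would reshuffle most seeds.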

  • The maximum number of words collected from a document was changed from 64K words per document to 64K words per section: positions are now enumerated per section, starting from the beginning of each section separately.

  • "SaveSectionSize yes/no" indexer.conf and search.htm command was added. When SaveSectionSize is set to yes, indexer stores additional information about section sizes, making it possible to generate better score values, as well as to do "exact section match" searches. Default value is "yes".

  • Relevancy improvement: "WordDensityFactor num" search.htm command was added. Num is a number in the range 0..255 to specify impact of word frequency on the result score. This feature works with "SaveSectionSize yes". The default value is 25.

  • Exact section match syntax was added:

                
        title="Apache web server"
    
    This feature works with "SaveSectionSize yes".

  • "WordFormFactor num" search.htm command was added to give more weight to the word forms originally written in the search query and less weight to word forms generated using ispell dictionaries and synonyms. Num is a number in the range 0..255. The default value is 255, which gives the same weight to the original and generated forms. 0 means maximum effect, i.e. the weight of a generated word form is much smaller than the weight of the original word form.

  • The performance of the excerpt generation code was improved. Excerpt generation from CachedCopy is now about 6-12% faster.

  • Using URL and Tag limits is now possible with "indexer -Eblob", e.g.:

                
    ./indexer -Eblob -u "%subdir%"
    ./indexer -Eblob -t tag
    
    This is to generate a search index over a subset of all documents collected during crawling.

  • Using "Limit" command is also possible with "indexer -Eblob", e.g.:

    indexer.conf command:

                
    Limit subdir "SELECT rec_id FROM url WHERE url LIKE '%/subdir/%'"
    

    command line:

                
    ./indexer -Eblob --fl=subdir
    

  • "ResultContentType type" search.htm command was added to specify Content-Type header generated by search.cgi. The default value is "text/html".

  • "Dehyphenate yes/no" search.htm command was added. When "Dehyphenate yes" is specified, searching for "peace-making" will also return documents containing "peacemaking".
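
    How the dehyphenated form is produced internally is not described here. As a rough illustration only, the effect can be pictured as expanding a hyphenated query word into an additional joined variant; the helper below is hypothetical and is not mnoGoSearch code.

```python
def query_variants(word):
    """Return the query word plus its joined form when it contains hyphens.

    Hypothetical helper: models only the visible effect of "Dehyphenate yes".
    """
    variants = [word]
    if "-" in word:
        joined = word.replace("-", "")
        if joined:  # skip the degenerate case of a word made only of hyphens
            variants.append(joined)
    return variants


print(query_variants("peace-making"))
```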

  • Clone template variables were changed: clones are now returned in the same row with the document itself, using CloneN prefix, e.g.: $(Clone0.URL). The "<!--clone-->" search.htm section and the $(CL) variable are not supported anymore.

  • DetectClones is now "no" by default, for performance purposes.

  • "CollectLinks yes/no" indexer.conf command was added. The default value is "no", which improves indexing performance by not populating the "links" table. As a side effect, PopRank calculation is not possible in the default configuration. If PopRank is important for your installation, specify "CollectLinks yes" in indexer.conf.

  • Default sort order was changed from "RP" (score, then popularity) to "R" (score). This change improves search performance for the installations where PopRank is not important.

  • Indexer now honors <a rel="nofollow"> tags. Thanks to Jeff Veit for the contribution.

  • A simplified format of HTDBDoc command was added:

                
    HTDBDoc "SELECT title, body FROM docs WHERE id=$2"
    
    SQL column names are associated with "Section" names.

  • It's now possible to specify wf as a parameter of the DBAddr search.htm command, which is useful when merging two or more databases, to give a higher score to results coming from a desired database.

                
    DBAddr mysql://root@localhost/db1/?wf=FFFF
    DBAddr mysql://root@localhost/db2/?wf=1111
    DBAddr mysql://root@localhost/db3/?wf=1111
    

  • MaxResults parameter was added for DBAddr, which is useful for adding a limited number of sponsored links at the top of search results:

    
    DBAddr mysql://root@localhost/avd/?wf=FFFF&MaxResults=1
    DBAddr mysql://root@localhost/db1/?wf=1111
    DBAddr mysql://root@localhost/db2/?wf=1111
    

  • $(DBOrder) template variable was added to display the original order of a document in its database result, before multiple DBAddr search results were merged into the final result. It is equal to $(Order) when using only a single DBAddr command in search.htm.

  • FOR template operator was added. Loop limits can be both constants:

    
    <!FOR NAME="a" FROM="10" TO="20">a=$(a)<!ENDFOR>
    
    and variables that were previously set, for example by the SET operator:
    
    <!SET NAME="from" CONTENT="80">
    <!SET NAME="to" CONTENT="90">
    <!FOR NAME="a" FROM="$(from)" TO="$(to)">a=$(a)<!ENDFOR>
    

  • "[no title]" is not added automatically anymore: an empty string is printed instead. The IF template operator can be used to reproduce the 3.2.x behaviour:

    
    <!IF NAME="title" CONTENT="">[no title]<!ELSE>$&(title)<!ENDIF>
    

  • Various indexing and search performance improvements were made.

  • Fixed: indexer did not work with MySQL-5.1.15-GPL.

  • "indexer -?" now prints its help page to stdout instead of stderr.

  • A "#version" record is now put into the table "bdict" when running "indexer -Eblob". Its value is the mnoGoSearch version ID; for example, mnoGoSearch 3.3.0 stores the string "30300".
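
    Only the single example 3.3.0 → "30300" is given. A plausible reading is that the minor and patch numbers each occupy two decimal digits; the Python sketch below rests on that assumption, and both function names are illustrative rather than part of mnoGoSearch.

```python
def version_id(major, minor, patch):
    """Encode a version as a string, assuming two digits each for minor and patch."""
    return f"{major}{minor:02d}{patch:02d}"


def parse_version_id(value):
    """Decode a version ID string back into (major, minor, patch), same assumption."""
    number = int(value)
    return number // 10000, number // 100 % 100, number % 100


print(version_id(3, 3, 0))
print(parse_version_id("30300"))
```

    A tool reading the "#version" record could use such a decoding to check which release built the index before searching it.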

  • A preliminary implementation of DBMode=rawblob in search.htm was added. This mode is designed for searching directly from the table "bdicti" without having to run "indexer -Eblob", and is intended for use with small search databases as a replacement for DBMode=single. In future releases it will also be reused for real-time index updates, to avoid running "indexer -Eblob" when only a small number of documents have changed.