Deutsch English
Home
About tdbengine
Newsletter
Download
Helpware
Forum
Chat
Documentation
Installation
Configuration
First Steps
Runtime Switches
Text files
System functions
Database functions
Index functions
Full-text Indexing
Memos and BLOBs
Semaphore concept
Error codes
Hints for PHP developers
Basic Course
Function reference
HOWTO...?
Projects
Links
Benchmarks
Bug Reporting
Support request
 
Home    Overview    Search    Impressum    Contact    Members
Full-text indexing
Other than the usual indexing, with which the contents of separate fields as a whole are retrieved from the database and put in a special, ordered structure, the field contents during the full-text indexing is divided further into separate words. These words are put in a table of its own, and a relation from this table (the glossary) and the source table build the full-text index finally.

Is there any need for this at all?


Now you can object that a full-text index is superfluous, since the data array can also be traversed sequentially after all. The serial search has even the advantage that you can search for word elements.

It is true! The serial search, by which each data block is read and decimated according to relevant information, replaces each form of the index search (or another with rich trickery). And when the reading of the data and the processing of the information would not require time, you could really spare the development of other search functions. But unfortunately this is not reality.

It is even a lot worse. So even if the computer speed is increased by some factor, the index search remains always limited to small information arrays only, when the search result must be presented within a given time.

I see!
The full-text indexing replaces, regarding the search for words or word components, the serial search in databases. It opens even the door to the latter in larger data arrays, because it returns its results in very short time independent from the size of the database. For example, all search engines on the internet were unthinkable at all without full-text indexing.

I'd like to make one more step forward and to claim that you, after you have used the full-text indexing for the first time, can't imagine how you could have gotten along earlier at all without this function.

Full-text index with the tdbengine
The VDP offers some of the full-text functions of the tdbengine in the program layer. However the whole functional range can be discovered when programming EASY only.

The pre-conditon for the full-text indexing of a table in the presentation of two tables:

Index word table, which contains a character chain in the first field and is automatically numbered
Relation table between the output table and the index word table.

Since these tables are needed not infrequently, the tdbengine offers two special functions in addition to the general functionMakeDB() for its creation:

GenList() creates a new index word table
GenRel() creates a new relation table

GenList(FileName,CharacterNumber,Mode)

Doing so, a new and empty table FileName is created. If no mode or mode=0 is given, the structure of this table is as follows:

[STRUCTURE]
field_1=Word,STRING,CharacterNumber
field_2=Current_Number,AUTO,1
[INDEX]
id=Word,Current_Number

There is a little difference for the mode=1:

[STRUCTURE]
field_1=Word,STRING.CharacterNumber
field_2=HowOften,NUMBER,2
field_3=Current_Number,AUTO,1
[INDEX]
id=Word,Current_Number,HowOften

GenRel(Table1,Table2,FileName)

It creates an empty relation table with the following structure

[STRUCTURE]
field_1=R-Table1,Link,Table1
field_2=R-Table2,Link,Table2

Expert's advise:The output table must neither be numbered automatically, nor contain a relation field for index word table!

Normally both tables are neither read nor written by usual table functions. Because thats what the special full-text functions of the tdbengine are responsible for:


Recreating full-text index

ScanRecs(D,I,R,Fields(Fields),ExtABC,Cut,ContraIndex,Mode,Step,MasksField,DynContraIndex) : REAL

Don't be afraid of this function monster: All parameters are optional with the exception of the first four!

D Table number of the output table
I Table number of the index word table
R Table number of the relation table

Fields() is a special function, by which the information specifying which fields are taken in the full-text index is transferred to the function. The field combination can be both statically and dynamically carried out. In the case of the CGI-programming the dynamical version is possible only, because no tables are opened during translation time after all here.

Statical version: Fields(Field1,Field2,Field3...)

Dynamical version: Fields("Field1,Field2,Field3...")

There are all fields of the table (with exception of BLOBs) possible. Date, time and number fields are converted into strings. Having selection fields the corresponding text constant is taken into the full-text.

There is an important exception with string fields: If a string field begins with the "#"-character, the rest of the string is interpreted as a path to an external text document and this is taken into the full-text index. This version of the tdbengine supports external ASCII, ANSI and HTML-documents.

In addition the contents of ADL-linked records can be written via the L-field notation and the R-field notation in the full-text index.

Example: A simple full-text indexing for an address table addr.dat

VAR D,I,R,fc,N : REAL
.EC 1
?fc:=GenList("i-adr.dat",40,1)<>0/Halt .. Error by creating the index word table
?fc:=GenRel("addr","i-addr","r-addr.rel")<>0/Halt .. Error by creating the RELATION table
IF D:=OpenDB("addr.dat")
  IF I:=OpenDB("i-addr.dat","",0,15)
    IF R:=OpenDB("r-addr.rel","",0,15)
      PrimFile(D)
      N:=ScanRecs(D,I,R,Fields("FirstName,LastName,Street,Place,Note"))
    ELSE
      .. Error by opening the RELATION table
    END
    CloseDB(I)
  ELSE
    .. Error by opening the index word table
    CloseDB(D)
  END
ELSE
  .. Error by opening the output table
END

This example contains all essential components including drafted error processing.

You must by aware of following:

  • in the sequence by the opening the tables.
  • Primary table must be changed before ScanRecs(), so as the field identificators of the output table are recognized.

And it takes place by the full-text indexing:

At the beginning both index word table and relation table are emptied. Then all data blocks of the output table are read. The fields, which are given under function Fields(), are scanned by each data block, i.e. these are decompose into separate character chains, which consist of letters only. Each of these words is entered in the index word table now, if it is not available already there. Finally, an entry is made in the relation table with a corresponding reference to the data block of the output table and with the related block in the index word table.

The optional Parameters:
ExtABC

is a string, which must be additionally related as word components to the letters. For example, if the zip-code must be put in the full-text index, so "0123456789" must be here.

Advise: The hyphen must not be taken as component of a word (with the exception of special cases), so as the separate components of a (with hyphen) compound word can not be found by the full-text search further (i.e. easy and fast).

By the next both parameters the restriction of the full-text index is concerned. In many cases it will not to take insufficiently informative words as  "and", "the", etc. in the full-text index.


Cut

If a value more than 0 is given here, then words, which occur often than Cut, are not taken in the full-text index.


ContraIndex

It is the name of an external text file containing such words, which are not taken in the full-text index. Thereby each word is placed in an own line and therefore a sorting is not required.


The next parameter defines the basic behavior of tdbengine on creation of the full-text index. The separate modes are simply added:

Mode

0        The index word table is recreated
1        An existing index word table is used and new words are entered only
2        An existing index word table is used, but none of new words are entered
4 *)    The unconditionally required file are created only
8        HTML-tags are skipped and HTML special characters are translated according to ASCII-format
16      is reserved
32      external texts are presented in the ASCII-format (otherwise ANSI-format)

*) This mode is conditionally compatible with relational system of the tdbengine. But it can be always used with full advantage, when the access is made through the full-text functions to the index word table and the relation table. Furthermore, the required memory space is reduced by far less than 50% and the search speed is significantly increased.


Step

This parameter must be set to a value other than 0 only, when really very large bulk of data is processed. The case in point is the memory space, which is needed by the tdbengine during the indexing of the operating system. In this case a lot of memory is given by an increased processing speed. Indeed a too high memory requirement can lead to uploading a part of it from the operating system, what leads further to a massive performance fall! But let us suppose that you have a 256 Mbytes system, where a few of memory-hungry processes are running, then, for example, you can set the value of 100000000 (=100 Millions) here.

Advise: Change this value only, when the indexing time really goes beyond.


MaskField

Thereby the number of a field of the output file is concerned, which stores a 16-bit integral number (NUMBER,2). The contents of this field is taken in the full-text index, or rather in the relation table (more rather in the IN2 of the relation table), and can be regarded by the full-text search. There is a special case, if it is set to -1 here, since the mask values can be directly (separated by colon) given in the field combination by function Fields. Then the analysis is performed by the full-text index in such manner that a number is given here and it is linked with a binary AND of the mask, which is stored in the full-text index. Further only such linkages will be found, in which the binary AND gives a value unequaled 0.

It sounds complicated and is complicated indeed, but it opens great possibilities. There is an example. Let us suppose that you have an address data base, in which different types of the addresses are stored: personal addresses, company addresses, authorities, etc. You store the type of the addresses as identificator in an integer field:

1    personal
2    company
4    authorities
8    scientific institutions
...

For addresses, which fit into more than only one categorie, you simply add these values. Now, if you give the field number of this field by the full-text indexing, you can already make any restriction concerning the categories by the search and therefore you need not read a single data block!

Alternatively you can also determine mask numbers for the analysis of a mask field directly by function Fields. In this case the mask field number has the value -1, and the mask numbers are directly given after the field names. There is again an example:

ScanRecs(D,I,R,Fields("FirstName:1, LastName:2, Street:4, Country:8, ZIP:8, Place:8, Note:16"),"",0,"",0,0,-1)

A full-text index is also built here. However in the relation table, if the first name is regarded by the just investigated field, the mask number 1 is taken to each entry, for the field name - the number 2, and so on. Then by the full-text search you can purposefully search in separate fields or fields combinations (or, as matter of course, also in all fields simultaneously)!

DynContraIndex

The case in point is a dynamical, field-oriented contra index, which is transported in its turn through the function Fields. This means that the fields, which are presented here, are read by each data block firstly and then a contra index is built from this. Indeed this contra index is referred to this data block only, and it begins at the beginning by next one.

Error processing

The parameters enable a very flexible full-text indexing of any data arrays. Indeed a table function is regarded by ScanRecs(), with which just two tables are simultaneously changed and the system is correspondingly scrupulous by the observance of the regular syntax. The function determines, if none of error occurs, the number of the entries in the relation table. Otherwise a runtime error is generated, which you can intercept with .EC 1.

The most often occurring error:

50: (unknown identificator) There is at least one false data field in function Fields
58: (none of table) At least one of three required tables is not opened
64: (none of ADL-table) The index word table is not automatically numbered
77: (none of relation table) The case in point is an invalid relation table by the third parameter

If the function result is negative, none of error is generated though, but the function is interrupted nevertheless.

-115   Too little rights for the function (The relation file could not be emptied)
-56     (, waits) The fields in function Fields must be separated by comma

Indexing separate data blocks
The function ScanRecs() indexes a complete table. But it is also possible to put a single record into an existing full-text index. The function ScanRec() indexes the current record of a table.

ScanRec(D,I,R,Fields(Fields),ExtABC,Cut,ContraIndex,Mode,Step,MaskField,DynContraIndex) : REAL

The parameters are exactly the same as by the function ScanRecs(). Indeed following modes are supported only:

2    None of new words are put in the index word table
4    The unconditionally required information is stored only

Important: The mode 4 can be used only, if the full-text index was created by the function ScanRecs() in this mode.

The function result determines the number of the new added linkages or generates one of the following error codes:

-56    In Fields() one (at least) comma is absent
-6      the ID index of the index word table could not be opened

Deleting full-text index for single records
It is also possible to remove single records from an existing full-text index. Thereby a complete dynamical management of the full-text index is achievable.

UnScanRec(D,I,R,Fields(Fields),ExtABC,Cut,ContraIndex,Mode,Step,MaskField,DynContraIndex) : REAL

The parameters are exactly the same as for the function ScanRec(). The function result determines the number of the deleted linkages (in positive case) or generates the error code.

Searching in the full-text index
There are two functions for the analysis of a full-text index: MarkTable() and MarkBits(). There are nearly the same parameters:

MarkTable(D,I,SearchString,ExtABC,Ops,MaskNumber,R1,R2,...) : REAL
MarkBits(D,I,SearchString,ExtABC,Ops,MaskNumber,BitField,R1,R2,...) : REAL

D                    number of the output table
I                      number of the index word table
Expression      search string
ExtABC           as for ScanRecs()
Ops                 logical operators
MaskNumber    integral number
BitField            variable of the type TBits[xxx]
R1, R2...          handle of the relation table

The function result determines the number of the found records (in positive case) or generates one of the following error codes:

-64      invalid index word table
-77      invalid relation table
-56      illegal bracketing in the search string
-45      invalid character in the search string

The last two errors can be defined in detail with TDB_ErrorMsg() and TDB_ErrorOfs().

For the SearchString the case in point is a string with following structure:

SearchString ::= Expression
Expression ::=  Term { Or-operator term}
Term        ::=  Factor { And-operator factor}
Factor     ::=   SearchWord | "(" Expression ")"

A search word is a string of letters, the characters from ExtABC as well as the special characters "?" and "*". Thereby "?" stands as substitute for exactly one of any character, "*" - for arbitrarily many characters.

Following operators are predetermined (when an empty string is transferred in Ops):

Or-operator          +  (concatenation set, logical OR)
And-operators       ,  (cut set, logical AND)
                             -  (cut with complementary set, logical NAND )

Examples:

Meier  finds all data blocks, which contain the word Meier in any upper or lower case spelling.
Mei*er finds the data blocks containing the words, which begin with Mei and end with er.
Hans, Meier in the data block must contain both the word Hans and the word Meier.
Hans+Meier finds the data blocks, in which occur Hans or Meier or both.
Hans-Meier finds the data blocks, in which occur Hans though, but not Meier.


Ops

In the parameter Ops the parameters can be anew determined. Therefore a string must be transferred with exactly three characters here. The first character is the operator for logical AND, then the second corresponds to the logical AND and finally the third - logical NAND.

MaskNumber

The setting of a mask number is advisable only, when a mask field or mask constant were already used by the full-text indexing. The number given here and the number stored in the linkage are linked to a binary AND. There is found such records only, for which the binary AND returns a value other than 0.

Special case: The mask number 0 determines all linkages.

Example:

The full-text indexing is performed with following index description: Fields(LastName:1,FirstName:2,Street:4)

Then the search for Hans gives mask number as follows

0 In (at least) one of the fields the word is contained
1 The word is contained in the field LastName
2 The word is contained in the field FirstName
3 The word is contained in the field LastName or in the field FirstName (or in both)
4 The word is contained in the field Street
5 The word is contained in the field LastName or in the field Street (or in both)
...
7 The word is contained (at least) in one of the fields LastName, FirstName or Street.


BitField

In this parameter the both search functions are different. MarkTable() stores all hits directly in the internal marking list of the table. MarkBits() instead stores the hits in a bit field. Further it is advantageous, when the search is repeatedly made and the separate search results are added up to a general result.

Example:

VAR Hit, Temp : TBITS[1000000]

MarkBits(...,Hit,...)
MarkBits(...,Temp,...)
BitAnd(Hit,MARKS) (determines the cut set FROM both searches)
BitOr(Hit,MARKS) (determines the concatenation set FROM both searches)

Advise: Since EASY (at present time) does not permit dynamical dimensioning of fields, the bit field must be dimensioned plenty large enough. On the other hand the fields must not be also overdimensioned to excess, so as a fast processing is carried out only, if the fields are fitted into the working memory (and are not uploaded by the operating system). The memory capacity in bytes can be easily calculated by dividing the field size by 8. A bit field occupies approximately 120 Kbytes for a table with 1000000 (=1 million) data blocks.

Expert's advise: A bit field, as we use it here, can be also regarded as characteristic set function, i.e. as a set mapping of all data blocks of a table in the set {0,1}. The case in point is a very efficient implementation of (partial-)sets, since then the set functions are directly carried out (as fundamental machine operations). MarkTable() also operates internally with a bit field and maps the found set in the marking list at the end of the function.

There are several possibilities to access the hit of a search:


Mapping to the marking list

After PutMarks(D,Hit) the hits are available in the marking list and can be processed as usual further.


Direct access to the bit field

i:=0
WHILE i<FileSize(D) DO
  IF Hit[i] THEN
     ..Processing of the data block with the block number i
  END
  i:=i+1
END


The relation tables

Both search functions can (simultaneously) search in several relations (between the output table and the index word table). Although one relation table is normally given only.

A script for two relation tables can look close as follows: A table for a text archive contains a number of data in usual data fields and a reference to a text document, which is represented in the HTML-format. In this case it is advisable to create an own relation table for the full-text index through the external text references.


Example: Document archive

We proceed from following structure of a table


[STRUCTURE]
field_1=Author,STRING,60
field_2=Title,STRING,120
field_3=Synopsis,MEMO
field_4=Document,STRING,60

In the field Document references are put as #/home/kern/doc/example_001.html.

Through this table a full-text index was created with

ScanRecs(...,Fields(Author:1,Title:2,Synopsis:4,Document:8),"",0,"",4+8)

Mode 4 (only required files) + 8 (supporting of external HTML-files)

Our search form has following structure:

<form action=... method="post">
Search word: <Input type="text" name="search word" size="40"><br>
Search IN: <Input type="checkbox" name="cb_author" value="1"> Author
<Input type="checkbox" name="cb_title" value="2"> Title
<Input type="checkbox" name="cb_synopsis" value="4"> Synopsis
<Input type="checkbox" name="cb_document" value="8"> Document
</form>

In doing so we allow an entry of a search word to the user and give him the choice of areas, where the word must be searched.

PROCEDURE Document-Search
VAR D,I,R,Mode : REAL
...
..all three tables are opened here
..
Mode:=Val(CGIGetParam(cb_author))+VAL(CGIGetParam(cb_title))
Mode:=Mode+VAL(CGIGetParam(cb_synopsis))+VAL(CGIGetParam(cb_dokument))
MarkTable(D,I,CGIGetParam("search word"),"","",Mode,R)
...
..giving hit
..
ENDPROC





tdbengine Anwendungen im Web:

Open-Source Web CMS


Open-Source Bug-Tracking


Free wiki hosting

Open-Source Wiki-System

Kostenloses Foren-Hosting

Dišt mit tdbengine 8-)

tdbengine chat
irc.tdbengine.org
#tdbengine

   Copyright © 2003-2004 tdb Software Service GmbH
   Alle rechte vorbehalten. / All rights reserved
   Last changed: 06.05.2004


ranking-charts.de

Programmers Heaven - Where programmers go!