Welcome to MSDN Blogs Sign in | Join | Help

What is an index anyway?

Foxpro’s legendary speed is due in part to its index technology.

 

About 14 years ago, the speed of Foxpro was demonstrated to live audiences around the country. A table (places.dbf) of almost 3 million streets in the US was used as the demo data (about 200 megabytes: quite large for a state of the art 386 with a 1 Gig hard disk.)  An audience member was asked on what street she lived, and Fox would query the street name in something like .01 seconds. In 1991, PC Magazine awarded FoxPro Applications Software of the Year.

 

Q: How do you search a 200 megabyte file for a single record?

A: Use an efficient compressed index.

 

Accessing the disk is orders of magnitude more expensive in time than accessing memory, so the goal is to keep as much in memory as possible, thus reading the disk as little as possible. Data compression helps to keep more data in memory, even if there is computation cost in compressing/uncompressing.

 

The index of a book is a table with 2 columns: the 1st column lists keywords from the text in alphabetical order and the 2nd has the page numbers on which the keywords occur.

 

A FoxPro index is very similar: An index on the key “streetname” can be thought of as just a table with 2 columns: the streetname values and the record # where the streetname occurs.  Using the index to find data would then just be a search of this 2 column index table. This table could be searched using a binary search and would be fairly fast: however, it could be made faster.

 

Imagine the entries for this table. Because they’re in order, there will typically be a lot of repetition:

 

Name                Rec #

Smith                22

Smithson          44

Smithson          11

Smithsonian      14

 

The first 5 letters of these are the same, so these can be shortened to:

 

0          “smith”

5          “son”

8          “”

8          “ian”

 

Each entry contains an integer indicating how many letters of the prior word to use as the first letters of the current entry. (This encoding is how I was able to compress 2 English spelling dictionaries into a tiny DLL.)

 

At about 15 megabytes, the index used (places.cdx) is a tree structure of about 30,000 nodes that is only 7 levels deep. (each node is 512 bytes: 512 * 30,000 = 15 megs). For 3 million streets, that’s an average of almost 100 records per node, or about 5 bytes per record. The entire index is so compressed that it can fit in memory.

 

Use a binary editor (like VFP\tools\hexedit\Hexedit.APP or Visual Studio (File->Open->Open With->Binary Editor) to examine a CDX file. You can see that the data is structured in 512 (0x200) byte segments or nodes. Not coincidentally, this is also exactly the size of a disk sector.

 

A VFP CDX file is a data structure that

  • is compact to minimize disk access.
  • Is Balanced: All records can be reached with roughly the same cost: any leaf can be reached by a path through roughly the same number of interior nodes.
  • Minimizes Insertion/deletion costs
  • Has nodes that contain left/right pointers to sibling nodes, so traversing data in indexed order (BROWSE, SCAN) does not require going back to the root node.

 

The sample code creates a sample table with 3 character fields: name, address and zip. Then an index is created on each field. The Names are generated as 10000N00001, 20000N00002, etc. and the Addresses are 10000A00001, etc. This data doesn’t compress well, but is uniform for easier analysis..

 

The sample then decodes the CDX and displays the data found in the CDX file. It does not read the original DBF table at all.

 

The output shows that each index node is identified by its address within the file. Using hexadecimal makes it easier to read because each file offset is a multiple of 0x200. It also shows that each node has a left and right node pointer. These make it efficient to traverse records without going back to the root.

 

The interior nodes are not compressed. You can read the uncompressed keys right in the raw hex dump of the file.

 

Each line is indented by the node depth, and contains several columns. The ixnode columns are:

  1. the node depth (distance from the root)
  2. the Node ID: a multiple of 0x200
  3. the node type: Tagnode or Ixnode
  4. the node attribute
  5. node left pointer, or 0xFFFFFFFF if none
  6. node right pointer, or 0xFFFFFFFF if none
  7. The # of key/value pairs in this particular node (nKeys)

·         For interior nodes:. (not compressed). nKeys pairs of (key and Node ID of child node)

·         For leaf nodes: nKeys pairs of (key and Record #  )

 

This code creates a test index with 3 index tags.

 

CREATE TABLE test (name c(20), address c(30), zip c(11))

INDEX on UPPER(name) FOR 7 < 8 AND SECONDS()>0 TAG name     && have a FOR expression that is always true for demo

INDEX on UPPER(address) TAG address

INDEX on UPPER(zip) TAG zip

INSERT INTO test (name,address,zip) VALUES ("fred  ","52 anywhere","z11111")

INSERT INTO test (name,address,zip) VALUES ("barney","23 anywhere","z22222")

INSERT INTO test (name,address,zip) VALUES ("wilma ","33 anywhere","z33333")

INSERT INTO test (name,address,zip) VALUES ("betty ","43 anywhere","z44444")

FOR i = 5 TO 10   && try 100, 1000, 10000

      INSERT INTO test (name,address,zip) VALUES (padr(i,5,"0")+"n"+PADL(i,5,"0"),padr(i,5,"0")+"a"+PADL(i,5,"0"),padr(i,5,"0")+"z"+PADL(i,5,"0"))

ENDFOR

 

The output from running the Index decoder program

 

0 0x00000000 Tagnode 0x000000E0 0x00000400 

  1 0x00000400 Ixnode 3 0xFFFFFFFF 0xFFFFFFFF 3  ADDRESS 0x00000C00  NAME 0x00000600  ZIP 0x00001200 

    2 0x00000C00 Tagnode 0x00000060 0x00001000  UPPER(address)

      3 0x00001000 Ixnode 7 0xFFFFFFFF 0xFFFFFFFF 10 10000A00010 10     23 ANYWHERE 2      33 ANYWHERE 3      43 ANYWHERE 4      50000A00005 5      52 ANYWHERE 1      60000A00006 6      70000A00007 7      80000A00008 8      90000A00009 9     

    2 0x00000600 Tagnode 0x00000068 0x00000A00  UPPER(name) FOR  7<8.AND.SECONDS()>0

      3 0x00000A00 Ixnode 7 0xFFFFFFFF 0xFFFFFFFF 10 10000N00010 10     50000N00005 5      60000N00006 6      70000N00007 7      80000N00008 8      90000N00009 9      BARNEY 2      BETTY 4      FRED 1      WILMA 3     

    2 0x00001200 Tagnode 0x00000060 0x00001600  UPPER(zip)

      3 0x00001600 Ixnode 3 0xFFFFFFFF 0xFFFFFFFF 10 10000Z00010 10     50000Z00005 5      60000Z00006 6      70000Z00007 7      80000Z00008 8      90000Z00009 9      Z11111 1      Z22222 2      Z33333 3      Z44444 4     

 

You can see that the first NodeID is 0x0000. It points to NodeID = 0x400, which contains the tag names and NodeIDs of each tag root.

Each tag root (0xC00, 0x600, 0x1200), in turn, contains the NodeID of any child nodes, as well as the index expression and any FOR condition. In this small sample there are no interior nodes:, each of these child nodes are leaf nodes, which contain the actual key/record # pairs.

 

For complete source code of the index decoder (about 250 lines): click here. (The sample will not work in VFP8: it requires the new VFP9 CTOBIN features for decoding binary numbers.)

 

Experiment with changing the data, index expressions, field sizes and record count. Examine your own indexes.

This code only works with character string index expressions.

 

 

47315 

 

 

Published Wednesday, January 05, 2005 6:40 PM by Calvin_Hsia
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# re: What is an index anyway?

Friday, January 07, 2005 2:20 PM by Garrett Fitzgerald
Calvin, where does the "2rs" parameter for CTOBIN() come from? The help isn't clear....

# re: What is an index anyway?

Friday, January 07, 2005 3:36 PM by CalvinH [MS]
the "2" means 2 bytes. "r" means don't reverse the bytes, and "s" means don't toggle the sign bit.
For more details on CTOBIN(), see the included sample. Start VFP9, Tools->Task Pane. Choose Solution Samples, New in VFP9, “BINTOC binary conversion”

# re: What is an index anyway?

Friday, January 07, 2005 11:55 PM by Garrett Fitzgerald
Ah, wrong version of the help file. Got it now. Thanks!

# re: What is an index anyway?

Wednesday, February 02, 2005 4:25 PM by Dave Sonsalla
Great explanation, but I'm confused as to the use of the masks, specifically, from bytes 14 through 23 of the node record. I have created a sample index and experimented, but I am at a loss to put the index keys back in their original format before compression.

I am accessing some Foxpro information for a report via Delphi. I've written the native Delphi code to read multiple database files without incorporating the BDE, but ran into a roadblock once I hit the compressed indexes (uncompressed indexes are simple to understand via looking at them with a hex editor.)

Could you clear up the mystery for me?

Dave

# Index creation performance issues

Thursday, September 27, 2007 8:48 PM by Calvin Hsia's WebLog

In this post: Index creation performance question I asked: Is it faster to fill a table with values,

# Good site

Wednesday, January 23, 2008 3:01 PM by Vilyamxu

<a href= http://index1.tolant.com >best wsternhotels</a>

# Nachtrag zu Wortfragmentsuche | hilpers

Tuesday, January 20, 2009 10:33 AM by Nachtrag zu Wortfragmentsuche | hilpers

# A Discounter Introduces Reductions: Multiple Anagrams

Sunday, February 15, 2009 1:04 AM by Calvin Hsia's WebLog

Many moons ago, I was playing with spelling dictionaries (see What is an index anyway? ) and anagrams.

# Calvin Hsia s WebLog What is an index anyway | Outdoor Ceiling Fans

# Calvin Hsia s WebLog What is an index anyway | Outdoor Ceiling Fans

Leave a Comment

(required) 
required 
(required) 

  
Enter Code Here: Required
 
Page view tracker