Merging Indexes - Doing my head in...
I wish I could see the code behind this, because I suspect it is not as efficient as it could be.
I imagine the table structures of the master and supplemental indexes are the same, so the same indexer code can be used for both.
But the merging seems to take forever. Is it reindexing those books again from scratch?
If you have the data in the supplemental index, nothing should have changed... surely you could merge the indexes much faster and more efficiently by using SQL to copy the existing data into the master index:
INSERT INTO MasterIndex (F1,F2,...)
SELECT F1 as F1s,F2 as F2s
FROM SuplementIndex
WHERE condition
We would then be talking minutes rather than hours IMO
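To make the idea concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module. The table names and the three columns (term, resource_id, position) are made up for illustration; I have no idea what the real index looks like, or whether it is stored in SQL tables at all.

```python
import sqlite3

def merge_supplemental(conn):
    """Copy every row of the (hypothetical) supplemental table into the master table.

    Returns the number of rows added to MasterIndex.
    """
    before = conn.execute("SELECT COUNT(*) FROM MasterIndex").fetchone()[0]
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO MasterIndex (term, resource_id, position) "
            "SELECT term, resource_id, position FROM SupplementalIndex"
        )
    after = conn.execute("SELECT COUNT(*) FROM MasterIndex").fetchone()[0]
    return after - before

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MasterIndex (term TEXT, resource_id INTEGER, position INTEGER);
    CREATE TABLE SupplementalIndex (term TEXT, resource_id INTEGER, position INTEGER);
""")
conn.executemany("INSERT INTO MasterIndex VALUES (?, ?, ?)",
                 [("hope", 1, 10), ("the", 1, 0)])
conn.executemany("INSERT INTO SupplementalIndex VALUES (?, ?, ?)",
                 [("hope", 2, 5), ("faith", 2, 9)])

copied = merge_supplemental(conn)
total = conn.execute("SELECT COUNT(*) FROM MasterIndex").fetchone()[0]
print(copied, total)
```

The copy is a single set-based statement, which is why I would expect it to be quick if the data really were laid out this way.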
Never Deprive Anyone of Hope.. It Might Be ALL They Have
Comments
-
But the merging seems to take forever. Is it reindexing those books again from scratch?
Yes. It is not merging the indexes; it is reindexing, and the program should say so...
0 -
If you have the data in the supplemental index, nothing should have changed... surely you could merge the indexes much faster and more efficiently by using SQL to copy the existing data into the master index:
INSERT INTO MasterIndex (F1,F2,...)
SELECT F1 as F1s,F2 as F2s
FROM SuplementIndex
WHERE condition
We would then be talking minutes rather than hours IMO
For whatever reason it is not that simple. From what I remember of the Beta discussions, they are working on improving the merging process to do what you are describing: a true merge of data rather than destroying the data and starting over. Not sure of a timetable though.
0 -
We agree that merging indexes is a good solution, and we've got someone working on it. (Our recent hire Dr. Peter Venable: http://www.logos.com/press/releases/oracle-senior-application-developer-moves-to-logos-bible-software)
So far it looks like we'll be able to merge a supplemental index into an existing index in half the time it takes to reindex. Peter hopes to have this shippable within a month or so. (We didn't do this in the first release because we didn't want to invest the coding time, and in the early days we were constantly changing the logic of the indexing system, which always requires a full rebuild anyway.)
Unfortunately half is still a long time for large indexes. But it's a hard problem -- you're building indexes in the gigabytes.
While we do use SQL for a few things, we don't use it to store the "postings" file of the inverted index for the text. (Nobody does.) SQL is designed for databases that are constantly updated, and full-text indices are more often static. Moreover, they're huge and read in a linear fashion; using custom data stores reduces the page overhead of SQL, allows you to do run-compression for lists of hits, etc. (When the word "the" appears every 40-50 characters, you can do cool things like using less than a whole byte to store just the offset from this occurrence to the next. In many cases you can use 1-2 bytes, or less, to store a hit, where in SQL and a full database you'd be storing a large record of many fields, with 32-bit numbers, etc. Define a SQL record to represent a single occurrence of a word, and you'll see it uses many more bytes than the word itself.)
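As a rough illustration of the point about storing offsets instead of full records, here is a small Python sketch. It stores the gap between successive hits using a simple byte-aligned variable-length code and compares the size with plain 32-bit positions. This is a textbook technique chosen for brevity, not a description of the actual Logos index format; bit-level codes that use less than a byte per hit are more involved. All function names here are invented for the example.

```python
import struct

def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # final byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers

def compress_postings(positions):
    """Store a sorted list of hit positions as gaps from one hit to the next."""
    if not positions:
        return b""
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    """Rebuild absolute positions by summing the gaps."""
    positions, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        positions.append(total)
    return positions

# A common word appearing once every 45 characters, 1000 times.
positions = list(range(0, 45 * 1000, 45))
packed = compress_postings(positions)
fixed = struct.pack(f"<{len(positions)}I", *positions)  # one 32-bit number per hit
print(len(packed), len(fixed))
```

Because every gap here is under 128, each hit costs one byte instead of four, and that is before counting the extra fields and page overhead a database row would add.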
Database optimization is important, and we've got someone doing that now on our intranet and extranet, to speed up customer record lookups, upgrade pricing calculations, and even our data synchronization back-end.
If you're curious about this area, "Managing Gigabytes" by Witten, Moffat, and Bell is a great technical overview. I liked it a lot, and have found many papers by the authors to be very useful. I even visited Moffat in Australia and spoke to one of Witten's graduate student seminars in Waikato, New Zealand.
(Yes, I'm over-answering the question and showing-off. :-) But I want to make the point that we are "all over this problem" and working hard and smart to address it as best possible. And I really like any excuse to get into geeky detail, so feel free to keep the questions or suggestions coming. We're not offended in the least, and we know we've still got lots to learn -- part of the reason we are hiring people like Dr. Venable.)
0 -
We agree that merging indexes is a good solution, and we've got someone working on it. (Our recent hire Dr. Peter Venable: http://www.logos.com/press/releases/oracle-senior-application-developer-moves-to-logos-bible-software)
When I recently read about his hire I figured he was going to be given some of this.
(Yes, I'm over-answering the question and showing-off. :-) But I want to make the point that we are "all over this problem" and working hard and smart to address it as best possible. And I really like any excuse to get into geeky detail, so feel free to keep the questions or suggestions coming.
That's great, really. I like the geeky detail - brag away Bob!
Now I have to think up a really hard question that only Bob can answer.....
Sarcasm is my love language. Obviously I love you.
0 -
Bob, I never doubted you were on top of it, and the explanation makes sense. I would settle for a 50% saving right here, right now (10 hours is a bit painful). I am thinking of going Platinum, but am dreading the extra indexing time.
Never Deprive Anyone of Hope.. It Might Be ALL They Have
0