Question for Bob Pritchett ..... How?

Page 1 of 1 (13 items)
This post has 12 Replies | 8 Followers

Posts 342
Stephen Miller | Forum Activity | Posted: Thu, Sep 22 2016 5:24 AM

Bob,

As a Logos user from the days of Logos 1 when 50 books was a complete library, I have a question.

Years ago I imagined you slaving away in a garage somewhere, manually scanning books into the latest $100 OCR program, and then doing your computer magic to "Logosise" the resource.

I was wondering if you would like to give us a little history of how you digitised resources in the past, what OCR programs you used, and how you do it now.

I don't expect any trade secrets to be revealed. I am not a spy for a competitor, just curious.

Thx

Stephen Miller

Sydney, Australia

Posts 992
LogosEmployee
Kyle G. Anderson | Forum Activity | Replied: Thu, Sep 22 2016 7:04 AM

Bob isn't involved in the current conversion process these days but I can speak to it briefly.

  1. I could speak to OCR programs in the "old days" but I would have to do some digging.
  2. Ideally we convert from exportable files. This might be a PDF, INDD, RTF, or even an audio file that gets transcribed. At the end of the resource creation process we run a diff between the original files and the final files to make sure nothing was altered during our work.
  3. In cases where we do not have exportable files, we've gone completely away from using OCR programs to convert files. I don't recall the exact dates off the top of my head but I'd estimate it has been four to five years. Now we double key books. In this process two different people independently type a book. When they're finished, we run a diff on the two files and reconcile any differences by comparing against the print in question. While more expensive, the output is better. Industry standard is 99% accuracy for this process. (Of course it's dependent on the source material. I recall a recent case where the output for a Latin text of Jerome was quite bad. Unfortunately this was due to a print edition that was lacking. A user alerted us to this and pointed us to a better quality PDF and we were able to fix it.)
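
For readers curious what the diff-and-reconcile step in item 3 might look like, here is a minimal sketch in Python. It is not the actual tooling; the function name `find_discrepancies` and the sample sentences are hypothetical, and it assumes each typist's output is available as plain text. It uses the standard library's `difflib` to list every place where the two keyings disagree, so a reviewer can check each one against the print.

```python
import difflib


def find_discrepancies(keying_a: str, keying_b: str):
    """Return the word-level spans where two independent keyings disagree.

    Hypothetical helper for illustration only.
    """
    words_a = keying_a.split()
    words_b = keying_b.split()
    # autojunk=False so frequent words are not ignored in long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    discrepancies = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append({
                "position": a_start,  # word index in the first keying
                "typist_a": " ".join(words_a[a_start:a_end]),
                "typist_b": " ".join(words_b[b_start:b_end]),
            })
    return discrepancies


# Made-up sample: the second typist misreads "et" as "ct".
typist_a = "In principio erat Verbum et Verbum erat apud Deum"
typist_b = "In principio erat Verbum ct Verbum erat apud Deum"

for item in find_discrepancies(typist_a, typist_b):
    print(item)
```

A diff of this kind only surfaces places where the two keyings disagree. If both typists misread the same character in the same way, the two files match and nothing is flagged.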

I hope that helps. 

Posts 353
Virgil Buttram | Forum Activity | Replied: Thu, Sep 22 2016 10:18 AM

I recall reading on these boards that, given the error rate with even the best OCR, once the manual effort to correct those errors is factored in, the double key method is actually better than OCR-and-correct; less time and/or less money, I forget which.

Posts 342
Stephen Miller | Forum Activity | Replied: Thu, Sep 22 2016 10:28 PM

Kyle,

Thanks for the modern info. I am amazed that the resources are copied by hand ..... have we gone back to the medieval monasteries?????

Love others to fill in more info.

Stephen

Posts 273
Greg F | Forum Activity | Replied: Thu, Sep 22 2016 11:20 PM

Kyle G. Anderson:
Of course it's dependent on the source material. I recall a recent case where the output for a Latin text of Jerome was quite bad. Unfortunately this was due to a print edition that was lacking. A user alerted us to this and pointed us to a better quality PDF and we were able to fix it.

I believe you're referring to the Augustine Loeb volume, Kyle, not the Jerome, which is still waiting to be cleaned up. ;)

Are you sure that the Latin texts were double-keyed as you describe? I find it hard to believe that two different typists would both type in "ct" instead of "et" twenty-five times in the one Jerome volume alone. (You can see what I mean by doing an inline search for " ct ", or "dc", which should be " et " and "de" respectively.) Granted, the Augustine text was worse. :)
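
The same kind of search can be run outside the application on an exported text. The following is a rough sketch, not a feature of any particular program: the `SUSPECT_WORDS` table and the `flag_suspect_words` function are hypothetical names, the sample text is made up, and whole-word matching is an assumption (it avoids flagging the "ct" inside words such as "sanctus").

```python
import re

# Hypothetical table of misreadings mentioned above: "ct" for "et", "dc" for "de".
SUSPECT_WORDS = {"ct": "et", "dc": "de"}


def flag_suspect_words(text: str):
    """Return (line number, word found, likely intended word) for each hit."""
    # \b restricts matches to whole words only.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in SUSPECT_WORDS) + r")\b"
    )
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            found = match.group(1)
            hits.append((line_number, found, SUSPECT_WORDS[found]))
    return hits


# Made-up sample text for illustration.
sample = "Pater ct Filius\nverba dc caelo\nsanctus et bonus"

for hit in flag_suspect_words(sample):
    print(hit)
```

The matching here is case-sensitive, so a capitalised "Ct" at the start of a sentence would need its own entry in the table.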

Posts 6402
DAL | Forum Activity | Replied: Fri, Sep 23 2016 1:24 AM

Hmm ...

Posts 353
Virgil Buttram | Forum Activity | Replied: Fri, Sep 23 2016 5:09 AM

Stephen Miller:

Kyle,

Thanks for the modern info. I am amazed that the resources are copied by hand ..... have we gone back to the medieval monasteries?????

Love others to fill in more info.

Stephen

Given that the copyists are typing on a computer keyboard, nothing medieval about it. The human brain remains a better processor for many cognitive tasks, and optical character recognition (aka "reading") is one of those.

Posts 992
LogosEmployee
Kyle G. Anderson | Forum Activity | Replied: Fri, Sep 23 2016 6:03 AM

Greg F:

Kyle G. Anderson:
Of course it's dependent on the source material. I recall a recent case where the output for a Latin text of Jerome was quite bad. Unfortunately this was due to a print edition that was lacking. A user alerted us to this and pointed us to a better quality PDF and we were able to fix it.

I believe you're referring to the Augustine Loeb volume, Kyle, not the Jerome, which is still waiting to be cleaned up. ;)

Are you sure that the Latin texts were double-keyed as you describe? I find it hard to believe that two different typists would both type in "ct" instead of "et" twenty-five times in the one Jerome volume alone. (You can see what I mean by doing an inline search for " ct ", or "dc", which should be " et " and "de" respectively.) Granted, the Augustine text was worse. :)

Yes. Places that do projects like this key exclusively off of character recognition. While they may have an excellent grasp of the nuances of English, that may not be true for languages like Latin or Greek, which are read by a much, much, much smaller section of the population. I distinctly recall the book you are mentioning. I know enough Latin to be able to say "that's not right" but was astounded to discover that in the print that we used the "e's" looked exactly like "c's". If anything I put the fault on us for not catching that in the print to begin with. Believe me, there's some truly awful print out there and we've had to reject a great deal of it for use.

Posts 13359
Forum MVP
Mark Barnes | Forum Activity | Replied: Fri, Sep 23 2016 6:50 AM

Kyle G. Anderson:
I know enough Latin to be able to say "that's not right" but was astounded to discover that in the print that we used the "e's" looked exactly like "c's".

I wonder if you were looking at a book that had been OCR'd and typeset by another person before you got it. I once bought a print book on Amazon. It was a modern reprinting of an old book, long out of print. But it had been OCR'd and re-typeset, presumably by an automatic process, and it was almost unreadable in places, particularly in non-English passages.

Here's what the frontispiece says (which wasn't made clear before purchase, of course):

Posts 341
Abram K-J | Forum Activity | Replied: Fri, Sep 23 2016 9:03 AM

Not sure I'd feel confident paying someone who offered to "proof read"....

Abram K-J: Pastor, Writer, Freelance Editor, Youth Ministry Consultant
Blog: Words on the Word

Posts 26492
Forum MVP
MJ. Smith | Forum Activity | Replied: Fri, Sep 23 2016 12:12 PM

Abram K-J:
paying someone who offered to "proof read".

Hey, my sister did that for an educational software firm for many years ...

Orthodox Bishop Hilarion Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."

Posts 3938
abondservant | Forum Activity | Replied: Fri, Sep 23 2016 12:30 PM

MJ. Smith:

Abram K-J:
paying someone who offered to "proof read".

Hey, my sister did that for an educational software firm for many years ...

proof read vs proofread ;)

I was one for a few years in an advertising firm.

You can't tell it by my typing in general however...

L2 lvl4, L3 Scholars, L4 Scholars, L5 Platinum,  L6 Collectors. L7 Baptist Portfolio. L8 Baptist Platinum.

Posts 341
Abram K-J | Forum Activity | Replied: Fri, Sep 23 2016 1:34 PM

Yes

Abram K-J: Pastor, Writer, Freelance Editor, Youth Ministry Consultant
Blog: Words on the Word
