Industrial Grade OCR

Page 1 of 1 (16 items)
This post has 15 Replies | 2 Followers

Posts 259
John C Connell Jr. | Forum Activity | Posted: Thu, May 12 2016 5:38 AM

We have all run across OCR mistakes and reported them as typos. It got me wondering about the OCR capabilities of Faithlife, Kindle, and other electronic book publishers.

Are industrial OCR capabilities significantly better than the OCR programs available to us?

Be strong and courageous. . . for the LORD your God is with you wherever you go.

Posts 932
Justin Gatlin | Forum Activity | Replied: Thu, May 12 2016 7:17 AM

So OCR is actually a very difficult problem. There is not a huge distinction at this point between software levels (although machine learning is exciting on this front). I believe it is a time/accuracy balance. A computer can be configured to make a closer analysis for higher accuracy, but there is a point of diminishing returns. Should Logos spend twice as long to get a text which is half a percent more accurate?

Posts 259
scooter | Forum Activity | Replied: Thu, May 12 2016 7:26 AM

Hi, John:  The world is full of acronyms.  I did not know what OCR meant till I looked up the article Justin had included in his post.

Posts 13399
Mark Barnes | Forum Activity | Replied: Thu, May 12 2016 7:54 AM

John C Connell Jr.:
Are industrial OCR capabilities significantly better than the OCR programs available to us?

The software isn't much different, but the hardware can be: https://www.logos.com/features/bookscanner. Software that's for a specific hardware setup is usually more accurate, because the software can compensate more exactly for page curvature etc. if it knows exactly how things were scanned.

Many of the older resources with OCR errors will have been done on the setup linked above.

Posts 259
John C Connell Jr. | Forum Activity | Replied: Thu, May 12 2016 8:43 AM

Thank you Justin and Mark.  Exactly the information I was seeking.

Be strong and courageous. . . for the LORD your God is with you wherever you go.

Posts 1272
LogosEmployee
Kyle G. Anderson | Forum Activity | Replied: Thu, May 12 2016 10:38 AM

FWIW, I should probably look at getting that page pulled since it's no longer relevant.

For approximately the last three to four years we have, generally speaking, converted text in one of two ways:

  1. For resources that do not exist as an exportable file, we don't use OCR. We double key the book, run a diff, and then consult the print.
  2. For resources that exist as an exportable file, we run a diff of the final files against the original exported files.
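
As a rough illustration of the double-key-and-diff step, here is a minimal Python sketch using the standard library's difflib. The function name `keying_disagreements` and the sample strings are hypothetical and are not the actual tooling described in this post; the only idea shown is that the places where two independent keyings disagree are the places worth checking against the print.

```python
import difflib

def keying_disagreements(first_keying, second_keying):
    """Return (text_from_first, text_from_second) pairs where two keyings differ."""
    a = first_keying.split()
    b = second_keying.split()
    # Compare word by word; autojunk=False keeps frequent words from being ignored.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return disagreements

# Hypothetical sample: two typists key the same line, one makes a slip.
typist_one = "for the LORD your God is with you wherever you go"
typist_two = "for the LORD your Gad is with you wherever you go"
for left, right in keying_disagreements(typist_one, typist_two):
    print(f"check against print: {left!r} vs {right!r}")
```

Two typists rarely make the same mistake in the same place, so most slips show up as a disagreement. A slip that both typists share would go unnoticed by this comparison.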

In my experience resources with a higher proportion of typos tend to be resources made in 2011 or earlier.

Posts 259
John C Connell Jr. | Forum Activity | Replied: Thu, May 12 2016 2:55 PM

Kyle,

Back confused again.  What does it mean to "double key the book?"  

Be strong and courageous. . . for the LORD your God is with you wherever you go.

Posts 29123
Forum MVP
MJ. Smith | Forum Activity | Replied: Thu, May 12 2016 3:05 PM

Two people enter the text via keyboard; there results are compared.

Orthodox Bishop Hilarion Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."

Posts 244
Colin | Forum Activity | Replied: Thu, May 12 2016 3:09 PM

MJ. Smith:

Two people enter the text via keyboard; there results are compared.

Two people enter the text via keyboard; their results are compared.  

Ooh MJ, I guessed you were using a deliberate typo to illustrate the process ;)

Posts 259
John C Connell Jr. | Forum Activity | Replied: Thu, May 12 2016 3:11 PM

Wow, I had no idea people were retyping these resources rather than using OCR.  This is totally surprising to me.

Be strong and courageous. . . for the LORD your God is with you wherever you go.

Posts 13399
Mark Barnes | Forum Activity | Replied: Thu, May 12 2016 3:18 PM

John C Connell Jr.:
Wow, I had no idea people were retyping these resources rather than using OCR.  This is totally surprising to me.

It used to be OCR, so some older resources (the ones with lots of typos) will have been done with OCR.

Posts 1272
LogosEmployee
Kyle G. Anderson | Forum Activity | Replied: Thu, May 12 2016 3:26 PM

Mark Barnes:

John C Connell Jr.:
Wow, I had no idea people were retyping these resources rather than using OCR.  This is totally surprising to me.

It used to be OCR, so some older resources (the ones with lots of typos) will have been done with OCR.

Correct. As advanced as OCR is, it still doesn't do well on really old PDFs (which is often what we have). For example, it can easily mistake "e" for "c" or "cl" for "d". This isn't surprising. Sometimes it's even hard for a human to tell the difference, and that is with the best print available.
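
To make those confusions concrete, here is a small Python sketch. The helper names, the confusion list, and the toy vocabulary are all assumptions made for illustration, not anything described in this thread; it generates the alternative readings of a word under the two confusions mentioned above and keeps the ones found in a known word list.

```python
# Confusion pairs taken from the examples above; a real list would be much longer.
CONFUSIONS = [("e", "c"), ("c", "e"), ("cl", "d"), ("d", "cl")]

def alternative_readings(word):
    """Yield words produced by applying one confusion at one position."""
    seen = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate not in seen:
                seen.add(candidate)
                yield candidate
            start = word.find(wrong, start + 1)

def plausible_corrections(word, vocabulary):
    """Return the alternative readings that are real words in the vocabulary."""
    return sorted(c for c in alternative_readings(word) if c in vocabulary)

# Toy vocabulary, purely hypothetical.
vocabulary = {"the", "close", "dose", "dear"}
print(plausible_corrections("thc", vocabulary))   # a non-word with one likely fix
print(plausible_corrections("dose", vocabulary))  # a real word that may be a misread "close"
```

The second call shows why a spell check alone is not enough: "dose" is a valid word, so a misreading of "close" would pass unnoticed unless someone compares it with the print.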

We found we got better results at a better price by double keying the resource and running a diff than by running OCR and proofreading.

Posts 621
Dave Thawley | Forum Activity | Replied: Fri, May 13 2016 5:29 PM

Mark Barnes:

John C Connell Jr.:
Wow, I had no idea people were retyping these resources rather than using OCR.  This is totally surprising to me.

It used to be OCR, so some older resources (the ones with lots of typos) will have been done with OCR.

I was part of a team once contracted to OCR a couple hundred old documents. After a while we found there were so many errors that we ended up typing a lot of it in again. It took hundreds of hours when we were expecting it to take a week.

Posts 451
Paul | Forum Activity | Replied: Fri, May 13 2016 10:50 PM

Thanks for the explanation Kyle - it makes me more understanding of the difficulties involved in producing some of the books. It also illustrates that it's hard work!

Posts 1
Jerry Thompson | Forum Activity | Replied: Thu, May 26 2016 7:41 AM

Can anyone tell me how to produce an iota subscript with the Logos Greek Keyboard? I didn't want to interrupt this forum thread, but I can't find anywhere to ask the question. I felt sure someone among you would know the answer. Thanks, JT

Posts 22855
Forum MVP
Graham Criddle | Forum Activity | Replied: Thu, May 26 2016 7:50 AM

Hi Jerry - and welcome to the forums

Jerry Thompson:

Can anyone tell me how to produce an iota subscript with the Logos Greek Keyboard? I didn't want to interrupt this forum thread, but I can't find anywhere to ask the question. I felt sure someone among you would know the answer. Thanks, JT

I would cross-post this in the Logos 6 forum - https://community.logos.com/forums/124.aspx 

Graham
