Original Reddit post

I’m working on a project to digitise some old books for my church. I thought this would be a simple task for AI, but I’m having a lot of difficulty. Does anyone have expertise with this who could advise, please?

Situation: I have a lot of old books on church history, theology, clerical memoirs, etc. They’re all out of print and out of copyright, but otherwise good-quality scholarship that I’d like to make more easily available. They currently exist only as hard copies or as PDF image scans. The layouts aren’t always straightforward: there is single-column and sometimes double-column text, footnotes, headings, quotes in Latin, and other anomalies. Here is an example page: https://preview.redd.it/50uoc1yfgwjg1.png?width=434&format=png&auto=webp&s=d391c4dec2c90d6561b4642fdbea22a00a418ee6

I want to extract the text and create good-quality, clean, modern, searchable PDF text documents.

What I’ve tried: Before trying AI, I ran OCR on the PDFs and exported the text to MS Word. This didn’t work: the formatting was a mess and needed a huge amount of manual correction. Next I tried uploading whole books to both ChatGPT and Gemini and asking them to extract the text. This didn’t work either, as the books were too large to process in one go. Then I tried extracting smaller sections, 5-10 pages at a time. That worked better, but it is quite time-consuming. The current book I’m working on is 900 pages, so this is a lot of fiddly work.

The problems: Even when I do get the AIs to extract text, it’s a constant battle to make them extract it verbatim and not summarise. Their default approach is to give me a commentary on the issues described in the book rather than the verbatim text. Even when I use a prompt that explicitly says not to summarise or comment, it still happens. Sometimes it’s quite difficult to spot: 90% of a section will be extracted verbatim, but a couple of paragraphs here and there will be paraphrased instead.
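To make the paraphrasing problem concrete, here is a rough sketch of one way the paraphrased paragraphs might be flagged automatically. It assumes the raw OCR text of the same pages is still available as a reference (messy formatting doesn’t matter for this), and it checks how many three-word sequences from each AI-extracted paragraph also occur in that OCR text. The function names, the 3-word window, and the 0.6 threshold are all hypothetical choices for illustration, not something taken from a tested workflow.

```python
import re


def normalise(text):
    """Lowercase and split into words, so spacing and punctuation noise is ignored."""
    return re.findall(r"\w+", text.lower())


def shingles(words, n=3):
    """Return the set of overlapping n-word sequences in a word list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_paraphrased(ocr_text, llm_text, n=3, threshold=0.6):
    """Return (score, paragraph) pairs for AI paragraphs that look paraphrased.

    score is the fraction of the paragraph's n-word sequences that also
    appear somewhere in the raw OCR text. Verbatim text should score close
    to 1.0 (OCR errors lower it a little); paraphrased text scores much lower.
    The threshold is an assumed starting point and would need tuning.
    """
    ocr_shingles = shingles(normalise(ocr_text), n)
    flagged = []
    # Paragraphs in the AI output are assumed to be separated by blank lines.
    for para in re.split(r"\n\s*\n", llm_text):
        para_shingles = shingles(normalise(para), n)
        if not para_shingles:
            continue  # fewer than n words: too short to judge
        score = len(para_shingles & ocr_shingles) / len(para_shingles)
        if score < threshold:
            flagged.append((round(score, 2), para.strip()))
    return flagged
```

Running this on each 5-10 page section would give a short list of suspect paragraphs to check by hand against the scan, instead of rereading the whole section. It would not catch footnotes that were dropped entirely, since it only scores the text the AI did return.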
I’ve also had problems with footnotes. The AI is extremely good (surprisingly so) at recognising which text is a footnote and excluding it from the main body of the text. But it generally doesn’t extract the footnotes at all, which requires extra steps to correct. ChatGPT and Gemini have both had similar issues with this.

Does anyone have any advice, or has anyone found a working solution for similar tasks? Thanks

Originally posted by u/Dr_Bumfluff_Esq on r/ArtificialInteligence