Although millions of books are scanned and put online every year, making old documents and texts available on the web is a difficult and painstaking process.
Project IMPACT – which stands for Improving Access to Text – is focused on making the process easier.
Project IMPACT director Hildelies Balk explained: “The problem with turning an historic document into a machine-readable text is that it is so very old. Everything is different from a modern document: it has old fonts, old words and a very difficult layout.”
Once scanned, such documents are left full of errors, because computers struggle to read old texts with strange layouts, fonts and spellings.
Clemens Neudecker, technical manager for European projects at Koninklijke Bibliotheek, showed us one example: “This is the Principia Mathematica by Isaac Newton. You see actually what we call shine-through, that is, ink from the opposite page which is just shining through the paper. You see that the paper is warped, and you can also see here there is this long ‘s’ also in use, which can very easily be confused with an ‘f’.”
Researchers at the National Library of the Netherlands have spent four years in a European project to improve software tools to read old books.
Balk said: “We improved software for image enhancement, optical character recognition, post-correction of the document and language technology to make it more accessible.”
“Text that is not fully digital is virtually invisible. Everyone is used to going into a search engine, and looking for a word, and if they don’t find this it basically isn’t there for them.”