
Author Archives: lorenzogerbi

Although millions of books are scanned and put online every year, making old documents and texts available on the web is a difficult and painstaking process.

Project IMPACT – which stands for Improving Access to Text – is focused on making the process easier.

Project IMPACT director Hildelies Balk explained: “The problem with turning an historic document into a machine readable text is that it is so very old, everything is different from a modern document, it has old fonts, old words and a very difficult layout.”

Once scanned, old texts are left full of errors, because computers struggle to read strange layouts, fonts and spellings.

Clemens Neudecker, technical manager for European projects at Koninklijke Bibliotheek, showed us one example: “This is the Principia Mathematica by Isaac Newton. You see actually what we call shine through, that is ink from the opposite page which is just shining through the paper, you see that the paper is warped, and you can also see here there is this long ‘s’ also in use, which can very easily be confused with an ‘f’.”

Researchers at the National Library of the Netherlands have spent four years in a European project to improve software tools to read old books.

Researcher Hildelies Balk said: “We improved software for image enhancement, optical character recognition, post-correction of the document and language technology to make it more accessible.”

“Text that is not fully digital is virtually invisible. Everyone is used to going into a search engine, and looking for a word, and if they don’t find this it basically isn’t there for them.”

Read the entire article and watch the video at this link: Euronews

 


Text digitization in the cultural heritage sector started in earnest in 1971, when the first Project Gutenberg text, the United States Declaration of Independence, was keyed into a file on a mainframe at the University of Illinois. The Thesaurus Linguae Graecae began in 1972. The Oxford Text Archive was founded in 1976. The ARTFL Project was founded at the University of Chicago in 1982. The Perseus Digital Library started its development in 1985. The Text Encoding Initiative started in 1987. The Women Writers Project started at Brown University in 1988. The University of Michigan’s UMLibText project was started in 1989. The Center for Electronic Texts in the Humanities was established jointly by Princeton University and Rutgers University in 1991. Sweden’s Project Runeberg went online in 1992. The University of Virginia EText Center was also founded in 1992. These projects focused on keyed-in text structured with markup, ASCII or SGML at the time, transitioning to HTML and later to XML.

Read more here: http://blogs.loc.gov/digitalpreservation/2012/12/before-you-were-born-we-were-digitizing-texts/

The story with the growth in e-book readers was somewhat different from the story with tablet computers. Ownership of e-readers among women grew more than among men. Those with more education and higher incomes also lead the pack when it comes to e-book ownership, but the gap between them and others isn’t as dramatic. For instance, 19% of those in households earning $30,000-$50,000 have e-book readers. They are 12 percentage points behind those in households earning $75,000 or more in e-book reader ownership. The gap between those income levels on tablet ownership is 20 percentage points.

Source: The Dec. 2011 and Jan. 2012 results shown here are from three new surveys by the Pew Research Center’s Internet & American Life Project. The Dec. 2011 results are from a survey of 2,986 people age 16 and older conducted November 16-December 21, 2011. The survey was conducted in English and Spanish and on landline and cell phones. The margin of error is +/- 2 percentage points. The Jan. 2012 results are from a combination of two surveys, one conducted January 5-8, 2012 of 1,000 adults age 18 and older and the other conducted January 12-15, 2012 among a sample of 1,008 adults. The overall margin of error in the combined Jan. 2012 dataset is +/- 2.4 percentage points. The January surveys were conducted on landline and cell phones. They were conducted only in English.
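To get a feel for where margins of error like these come from, here is a minimal sketch of the textbook formula for a simple random sample at 95% confidence. It is not Pew's method: the function name `simple_margin_of_error` is made up for illustration, and the formula assumes an unweighted sample and the worst case of a 50% share. The published margins are somewhat larger than what this gives, which would be consistent with an adjustment for survey weighting, though the source does not say so.

```python
import math

def simple_margin_of_error(sample_size, share=0.5, z=1.96):
    """Textbook margin of error for a proportion from a simple random sample.

    share=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    Hypothetical helper: the survey's own calculation is not given in the source.
    """
    return z * math.sqrt(share * (1.0 - share) / sample_size)

# Sample sizes taken from the source note above.
dec_2011 = simple_margin_of_error(2986)         # Dec. 2011 survey
jan_2012 = simple_margin_of_error(1000 + 1008)  # two Jan. 2012 surveys combined

print(f"Dec. 2011: +/- {dec_2011 * 100:.1f} points")
print(f"Jan. 2012: +/- {jan_2012 * 100:.1f} points")
```

With these sample sizes the simple formula gives roughly 1.8 and 2.2 percentage points, against the reported 2 and 2.4.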

Download the full survey here: Pew_Tablets-and-e-readers-double-1.23.2012

Extract:

[Screenshot, 2013-01-14, 23.16.13]

12% of e-book readers have borrowed an e-book from a library. Those who use libraries are pretty heavy readers, but most are not aware they can borrow e-books.

Download the full survey here: PIP_Libraries_and_Ebook_Patrons 6.22.12

Source: Dec. 2011 results are from a survey of 2,986 people ages 16 and older conducted November 16-December 21, 2011. N for print book readers in the past 12 months = 2,295. N for e-reader owners in the past 12 months = 793. N for audiobook listeners in the past 12 months = 415. The survey was conducted in English and Spanish and on landline and cell phones.

Some extracts:

[Screenshots, 2013-01-14: 22.57.29, 23.01.56, 23.05.29, 23.07.12]

The population of e-book readers is growing. In the past year, the number of those who read e-books increased from 16% of all Americans ages 16 and older to 23%. At the same time, the number of those who read printed books in the previous 12 months fell from 72% of the population ages 16 and older to 67%.

Overall, the number of book readers in late 2012 was 75% of the population ages 16 and older, a small and statistically insignificant decline from 78% in late 2011.

The move toward e-book reading coincides with an increase in ownership of electronic book reading devices. In all, the number of owners of either a tablet computer or e-book reading device such as a Kindle or Nook grew from 18% in late 2011 to 33% in late 2012. As of November 2012, some 25% of Americans ages 16 and older own tablet computers such as iPads or Kindle Fires, up from 10% who owned tablets in late 2011. And in late 2012 19% of Americans ages 16 and older own e-book reading devices such as Kindles and Nooks, compared with 10% who owned such devices at the same time last year.

Source: Most recent data from Pew Research Center Internet & American Life Project Library Services survey. October 15-November 10, 2012. N=2,252 Americans ages 16 and older. Interviews were conducted in English and Spanish and on landline and cell phones. Margin of error is +/- 2.3 percentage points for the total sample.

Download the full survey here: PIP_Reading and ebooks_12.27

Some extracts:

[Screenshots, 2013-01-14: 23.33.23, 23.36.01, 23.37.54]

The Google Ngram Viewer is a phrase-usage graphing tool which charts the yearly count of selected n-grams (letter combinations), words, or phrases, as found in over 5.2 million books digitized by Google Inc (up to 2008). The words or phrases (or ngrams) are matched by case-sensitive spelling, comparing exact uppercase letters, and plotted on the graph if found in 40 or more books. The Ngram tool was released in mid-December 2010.
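The counting described above can be illustrated with a small sketch. This is not Google's implementation: the function names `word_ngrams` and `yearly_ngram_counts`, the whitespace tokenization, and the input format of (year, text) pairs are all assumptions made for the example. It treats an n-gram as a run of n words, matches case-sensitively, and keeps only n-grams found in at least `min_books` books (40 in the description above).

```python
from collections import defaultdict

def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_ngram_counts(books, n, min_books=40):
    """Count n-grams per year across books given as (year, text) pairs.

    Matching is case-sensitive, so "The cat" and "the cat" are different
    n-grams. An n-gram is kept only if it occurs in at least min_books books.
    Splitting on whitespace is an assumption; real tokenization is more involved.
    """
    counts = defaultdict(lambda: defaultdict(int))  # n-gram -> year -> occurrences
    books_containing = defaultdict(int)             # n-gram -> number of books

    for year, text in books:
        grams = word_ngrams(text.split(), n)
        for gram in grams:
            counts[gram][year] += 1
        for gram in set(grams):  # each book counts once toward the threshold
            books_containing[gram] += 1

    return {
        gram: dict(by_year)
        for gram, by_year in counts.items()
        if books_containing[gram] >= min_books
    }

# Tiny made-up corpus with a lowered threshold so the filter is visible.
sample_books = [
    (1900, "The cat sat"),
    (1900, "the cat ran"),
    (1901, "The cat slept"),
]
print(yearly_ngram_counts(sample_books, n=2, min_books=2))
```

In the toy corpus only "The cat" survives the threshold, since the lowercase "the cat" appears in a single book.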

The word-search database was created by Google Labs, based originally on 5.2 million books, published between 1500 and 2008, containing 500 billion words in American English, British English, French, German, Spanish, Russian, or Chinese. Italian words are counted by their use in other languages. A user of the Ngram tool has the option to select among the source languages for the word-search operations.

Info: http://books.google.com/ngrams/info

Search engine: http://books.google.com/ngrams

TED Talk about Google Ngram Viewer: http://www.ted.com/talks/what_we_learned_from_5_million_books.html