Jim Trageser's Hot on the Web - Bringing history into the future

Bringing history into the future

This column originally ran in ComputorEdge on October 20, 2006
(Issue 2442, A Joke a Day)

In mid-September, the New York Times – which has long had its electronic archives dating back to 1981 online – announced that it had completed the monumental task of making every article ever published in the paper available on its Web site.

That goes back to 1851, when the paper first began publishing.

Because the Times was not yet saving its articles in electronic format before 1981, the new archive section is delivered in PDF format – a photograph of the article (more on that below). Not as good as a searchable text document, but assuredly better than flying to New York and going through microfilm at the public library.

Interestingly, when you do a search of the archives from 1851-1980, you get a headline and the first paragraph of each story, to help you decide whether that's the article you want or not.

Now, it's not free. But Times subscribers get access, and those who don't want the daily paper but want access to the archives (and other for-pay items on the Web site, like the opinion page) can subscribe to the TimesSelect package for just under $50 a year.

And those doing research who don't normally visit the Times can purchase any single article for $4.95.

Best of all, there's a free two-week trial period of TimesSelect. Just remember to cancel in those two weeks or your credit card will get dinged.

The scope of the project

When you look at how the Times presents its new archival materials, the reality of how much labor was involved is pretty overwhelming.

As mentioned, every article since 1981 was already available in the archives (again, for a price).

They took 130 years' worth of daily newspapers and, first, photographed each article (from microfilm, microfiche or original isn't clear – but the PDFs I looked at were pretty darn clean). But a simple photo of an article is useless for a searchable archive. So they added the above-mentioned headline and first paragraph for each article – all of which had to be typed in by hand. Then they added key words for the searches, and tied all three (PDF, first paragraph/headline, key search words) into a database entry.
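To make that three-part database entry a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how the Times actually built its system: the `ArchiveEntry` class, the `search` function, the file paths and the sample articles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One scanned article: the image plus the hand-entered text used for search."""
    pdf_path: str          # hypothetical location of the scanned article
    headline: str          # typed in by hand
    first_paragraph: str   # typed in by hand
    keywords: set = field(default_factory=set)  # added search terms


def search(entries, term):
    """Return entries whose headline, first paragraph or keywords mention term."""
    term = term.lower()
    hits = []
    for entry in entries:
        text = f"{entry.headline} {entry.first_paragraph}".lower()
        keywords = {k.lower() for k in entry.keywords}
        if term in text or term in keywords:
            hits.append(entry)
    return hits


# Invented sample data, purely for illustration.
entries = [
    ArchiveEntry(
        pdf_path="scans/example-0001.pdf",
        headline="Example Headline About a Railroad",
        first_paragraph="An invented first paragraph used only for this sketch.",
        keywords={"railroad", "transport"},
    ),
    ArchiveEntry(
        pdf_path="scans/example-0002.pdf",
        headline="Example Headline About an Election",
        first_paragraph="Another invented paragraph.",
        keywords={"politics"},
    ),
]

for hit in search(entries, "railroad"):
    print(hit.headline, "->", hit.pdf_path)
```

The point of the sketch is that the PDF itself is never searched. Only the typed-in headline, first paragraph and key words are, which is why all that text had to be entered by hand.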

It simply had to have cost the Times several million dollars to do this.

Obviously, they're convinced that the above pricing structure will allow the archive to pay for itself over time.

But it's still a pretty nifty example of using technology to make historical documents – in this case, newspaper articles – available long into the future.

Google Books

As large and complex a project as the Times archive was, the Google Books project is that much bigger.

Already under way while not yet complete – heck, it will never be complete – Google Books is no less than an attempt to have every book ever written available forever.

Similar to Project Gutenberg, yet different from it, Google Books involves scanning in every page of every book they can access and presenting each book as a series of PDFs. (Project Gutenberg has each book it presents – all older books whose copyright has expired – as a text file, meaning each one has to be typed in by volunteers.)

But you can also search Google Books to find specific passages in a work – meaning these are searchable PDFs.

While Google has focused its Books project initially on older, out of print books whose copyright has expired, it has also signed deals with large libraries (including the University of California and Harvard) to scan in their entire collections and make them available to the public.

This has had book publishers in an uproar. Scanning in books from the 1700s and 1800s that are long in the public domain and yet may be out of print (not every book was a classic, of course) is unquestionably good, since it helps more people access the large pool of knowledge they contain. But publishers are concerned that libraries may be allowing Google to scan in expensive textbooks or other works still protected under copyright.

Google's response has been that the Google Books site does not allow someone to read a copyrighted book straight through, but only to search these more recent books for relevant passages needed for a quote or to confirm a footnote.

With lawsuits filed and more on the way, Google Books will at least serve the useful purpose of forcing the courts to clarify copyright law in the age of the Internet.

In the meantime, untold students and volunteers are scanning in untold pages of untold books at dozens of major libraries around the globe.
