I have often used the National Library of Australia's Australian Newspapers website for my research.
It is a marvelous resource.
"As of 13 October 2009 there are 816,461 pages consisting of 8,522,700 articles available to search"
It mainly works through some type of automated scanning (usually of microfilm of hard copy newspapers) and then through use of Optical Character Recognition to provide searchable text.
The Australian Newspapers website and the systems behind it are still under development. One known problem is that the microfilms from which the scans are taken are sometimes hard to read, which affects the quality of the OCR text, which in turn degrades the searchability of the text and hence of the newspaper articles.
For my needs, it has also been problematic that the Sydney Morning Herald and Melbourne's The Age newspapers have not thus far been included in the searchable newspapers online.
Although we now know that the Sydney Morning Herald is on the way:
"The Sydney Morning Herald 1831-1954 has now been digitised and is awaiting OCR processing. There are 600,000 pages containing approximately 6 million articles to be processed in 2009 - 2010, which will be made available progressively."
I gather that it is a very expensive process.
Today, I found out that Google, through Google News, has an archive service which makes many historical (old) newspapers searchable. The quality of the on-screen scanned image of the newspapers is far superior to that of newspapers of a similar age shown on the NLA's Australian Newspapers site.
Compare this image from a 1941 The Canberra Times on Australian Newspapers
with
this image from a 1941 Sydney Morning Herald on Google News Archive.
See what I mean?
A few months ago, for another piece of research about the Siege of Ladysmith, I went to the State Library of New South Wales to view their microfilm copies of the Sydney Morning Herald for November 1899. They were very hard to read and interpret.
But Google News Archive now has some Sydney Morning Heralds from November 1899 available, for example this.
Let me tell you, there is no comparison. It is much easier to read Google's online version than the microfilms at the State Library of New South Wales. Hard copies of such newspapers are no longer available to general researchers due to their fragile condition.
Google has this problem licked.
I suspect that Google is not scanning from microfilm, and is scanning from original old newspapers. Somehow. From somewhere.
And the results are outstanding.
I wonder if the National Library of Australia should reconsider their Australian Newspapers project and outsource it to Google?
The NLA has already permitted Google News Archive to index results from the Australian Newspapers website.
I think it's time to take it a step further. Let Google take it over. Cheaper. Better result.
You're right, it's a technically superior solution. But Google have poor form when it comes to digitisation:
http://weblogs.swarthmore.edu/burke/2009/10/13/digital-search-i-google-poisons-the-well/
Given that this stuff is out of copyright, I'd probably rather have a second-best solution that we can ensure will be freely available to all, than let Google suck everything into its gaping maw and possibly start charging Australians for something they used to get for free.
Posted by: Brett | Wednesday, 14 October 2009 at 11:15
Brett, good points. And the Swarthmore piece has some too.
A recent edition of the New Yorker has a pretty good deconstruction of the Google organization by Ken Auletta. Abstract here:
http://www.newyorker.com/reporting/2009/10/12/091012fa_fact_auletta
where the points are made that it is essentially a commercial entity.
However my point is that Google is making the "stuff" available. And more easily readable. And more easily searchable. In this case the "stuff" is old newspapers being put online.
But the hard copies of the newspapers are still there in the off site storage of the State Library of New South Wales, still owned by the people of New South Wales. And still available on microfilm for free at many places (for but two examples, the State Library of New South Wales; closer to home for you, The Baillieu.)
So, Australians will still have it available for free just as it is today. But maybe not for free online forever. But we don't have it free online at the moment anyway.
Our taxes are paying for the digitisation of the newspapers at the NLA, along with very generous support from Vincent Fairfax. The NLA is being squeezed by the efficiency dividends required by the present government. Some of that saving could be made by getting Google on board.
Anyone who does not want to pay a fee which Google may impose in the future can return to the status quo as we know it today. Provided our libraries don't start throwing away newspapers to save space.
Posted by: Bob Meade | Wednesday, 14 October 2009 at 11:49
Absolutely, somebody has to pay for it, somewhere. I don't dispute that. But the danger is that, as Burke says, we're at a critical point where we can go either open content or monopoly paywall. If we focus on penny-pinching arguments now we'll end up with the latter, and the same arguments will prevent us from reinventing the wheel at that point. (And yes, I do use microfilm newspapers and sometimes hardcopies in libraries. But libraries *will* ditch physical resources for digital ones eventually, because they are all being squeezed.) Seeing as we are already inventing a perfectly serviceable wheel in the shape of the NLA project, I'd like us to continue with that for the time being.
Posted by: Brett | Thursday, 15 October 2009 at 07:42
OK, Brett, understood. I'm not convinced though that the NLA wheel is perfectly serviceable. I think it is managing to roll down the hill, but needs many human hands (the volunteers) to correct the OCR text.
Google seems to be scanning from clearer copies and possibly using superior OCR technology, thus increasing searchability.
When the NLA get their Sydney Morning Herald content online then we'll be able to do a like for like comparison - that is if Google continue to use their own content.
Posted by: Bob Meade | Sunday, 18 October 2009 at 16:25
It looks like the Google pages have been scanned in greyscale but the NLA scanning was bitonal. As far as I know the NLA shipped the OCR work to India and it seems to give average results only.
A local council here in Qld has put their local historic paper up on the web:
www.nambour-chronicle.com
You can text search online but have to download the page to see where your hits occur. The OCR seems to give better results than the NLA's. I wonder what package they used, and whether the NLA's Indian subcontractors need to update their software. Does anyone know anything about this paper and what was done?
Posted by: Greg Smith | Tuesday, 03 November 2009 at 17:21