Web data needs to be sorted by people we trust. Step forward the librarians, says Stephen Pinfield.
For years we have relied on librarians to catalogue books. On ledgers and cards, then on microfiche and computers, library catalogues have enabled readers to get at the material they need. The activity of putting these catalogues together went, until recently, under the self-confident name of "bibliographic control".
Since the rise of the Internet, talk of control seems to have waned. But librarians are still cataloguing. They have just added the Internet to the list of things to catalogue. Library catalogues now often cite web sites as well as books. As many catalogues are themselves delivered on the Web, they provide a direct clickable link to other Internet information sources.
At the same time, catalogue records have been given a new name: metadata. Metadata is data about data. It is data which describes an information source and provides access to it. Schemes are being developed to produce metadata for Web pages and other networked resources. Perhaps the best known of these is Dublin Core, which provides a set of simple headings (title, creator, subject ...) for a description of the resource. On Web pages the author normally puts this information in a <META> tag at the top of the page. While this makes it invisible to the user, the idea is that it will be picked up by automated harvester programs which search the Web to find new pages. When you use a search engine such as AltaVista or Lycos you are not searching the Web itself but a database gathered by one of these harvesters. If these programs can understand metadata and store it in their databases in a structured way under specific headings, then it is hoped that searching for Web pages will ultimately become easier.
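As a rough illustration of the harvesting step just described, the sketch below shows how a program might pull Dublin Core headings out of a page's META tags and store them under named headings. The sample page, the class name MetaHarvester and the function harvest are invented for the example, and the "DC." prefix is one common convention for writing Dublin Core in HTML; none of this describes how AltaVista or Lycos actually work.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the DC.* names follow a common convention
# for embedding Dublin Core headings in HTML META tags.
SAMPLE_PAGE = """
<html><head>
<title>Hybrid libraries</title>
<meta name="DC.title" content="Hybrid libraries">
<meta name="DC.creator" content="A. Librarian">
<meta name="DC.subject" content="library catalogues; metadata">
<meta name="generator" content="SomeEditor">
</head><body><p>Visible text of the page.</p></body></html>
"""


class MetaHarvester(HTMLParser):
    """Collects Dublin Core META tags into a dictionary of headings."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name") or ""
        content = fields.get("content") or ""
        if name.lower().startswith("dc."):
            # Store under the heading name without the "DC." prefix.
            self.record[name[3:].lower()] = content


def harvest(page):
    """Return the Dublin Core headings found in one page."""
    parser = MetaHarvester()
    parser.feed(page)
    return parser.record


record = harvest(SAMPLE_PAGE)
for heading, value in sorted(record.items()):
    print(f"{heading}: {value}")
```

The harvester keeps only the three Dublin Core headings and ignores the unrelated "generator" tag, so the stored record is structured by heading rather than being a bag of words from the page.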
So, if metadata is created by Web page authors themselves and incorporated into search engine databases automatically, do we still need librarians to catalogue Web pages? There is a practice known as "Web page stuffing" which already causes problems for people looking for information on the Web. It involves adding lots of commonly searched-for words to a Web page, whether or not they are relevant to the page itself, in order to increase the number of people who come to the page. Sometimes these words are added in such a way as to make them invisible to the human surfer but not to the automated harvester. They may, for example, be written in the same colour as the background of the page. Often the same word is added many times in order to affect the relevance-ranking calculations of search engines. The more times a word occurs in a page, the more likely it is to appear high on a list of hits when that word is searched for, and the more likely it is to be accessed by users.
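To see why repetition pays off, consider a deliberately naive ranking that scores a page by how often the search word occurs in it. The sketch below is an assumption made for illustration (real search engines use more elaborate calculations), and the two pages and the function names score and rank are invented.

```python
import re
from collections import Counter


def score(page_text, query):
    """Naive relevance: how many times the query word occurs in the page."""
    words = re.findall(r"[a-z]+", page_text.lower())
    return Counter(words)[query.lower()]


def rank(pages, query):
    """Return page names ordered from highest to lowest score."""
    return sorted(pages, key=lambda name: score(pages[name], query), reverse=True)


# Hypothetical pages: one honest, one stuffed with a repeated word.
pages = {
    "honest": "A guide to library catalogues and how a catalogue is compiled.",
    "stuffed": "Buy our gadgets. " + "catalogue " * 20,
}

ranking = rank(pages, "catalogue")
print(ranking)
```

Under this scoring the stuffed page outranks the honest one even though it has nothing to say about catalogues, which is exactly the behaviour that stuffing exploits.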
If this is what happens with Web page stuffing, just think what it will be like when we have metadata stuffing. This warning was sounded recently by Clifford Lynch of the Coalition for Networked Information in the United States. The problem with self-created metadata is that it can be cynically fixed. Rather than providing an accurate description of a Web resource, it might include terms which would simply make it more likely for people to want to look at it.
The only way to avoid biased metadata is to involve a trusted third party in its production. This could be done either by getting an agency to approve the metadata on the site, or by having a third party produce and store the metadata elsewhere. Both of these are possible and are already being carried out to a greater or lesser extent.
The trusted third party must be someone who can create accurate metadata which will help information users to get at the resources they need. The librarian fits the bill. Apart from the library catalogues of individual institutions, Web subject gateways such as SOSIG (for social sciences, http://sosig.ac.uk/) and OMNI (for medicine, http://omni.ac.uk/) are good examples of librarians cataloguing the Web.
Librarians have a proven track record in this field. They have established sophisticated techniques for the precise description of material and are experienced in managing metadata of all kinds. They are used to acting as intermediaries between users and information, and are trusted to do so. In the networked environment, librarians can continue this role working with a wide range of traditional and virtual library resources. The librarian will remain a valued metadata creator.
Stephen Pinfield is project leader of the BUILDER hybrid library project at the University of Birmingham.
CATALOGUES OF TRUST
They can never catalogue the Web.
No. But good stuff eventually gets catalogued and some sites are keen to catalogue their own material to make it more accessible.
Too many catalogues could be as bad as too few.
The idea is that they should work the same way. Catalogue information - metadata - will be expressed in RDF, a language that Web browsers and other software will understand.
So I get to a Web site and the metadata tells me it has all kinds of wonderful things. Why should I believe it?
You might believe it if you knew the site was catalogued by skilled, independent information professionals. Digital signatures will show that metadata comes from a trusted source.
Digital signatures. How do those work?
A digital signature is encrypted data which proves the authorship of a message. Digital signatures are created with a secret key known only to the author, but anyone can read them using a public key. Hard to fake, easy to check.
But that will not work. I could copy a signature and stick it on the end of a bogus message.
They have thought of that. The signature is specific to the message and becomes invalid if the message is altered in any way.
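The idea in the last few answers can be sketched in a few lines. What follows is a toy, textbook-style RSA signature built on two well-known primes; the key values, the sample label and the function names digest, sign and verify are assumptions chosen for illustration. It is far too weak for real use, where a vetted cryptographic library would be the sensible choice, and it is not the scheme used by PICS.

```python
import hashlib

# Toy key pair (illustrative assumption, NOT secure): two Mersenne primes.
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # secret exponent, known only to the signer


def digest(message):
    """Hash the message and reduce it to a number smaller than N."""
    h = hashlib.sha256(message.encode("utf-8")).digest()
    return int.from_bytes(h, "big") % N


def sign(message):
    """Create a signature using the secret exponent D."""
    return pow(digest(message), D, N)


def verify(message, signature):
    """Check a signature using only the public values E and N."""
    return pow(signature, E, N) == digest(message)


# Hypothetical metadata label signed by a trusted cataloguer.
label = "title=Hybrid libraries; creator=A. Librarian"
signature = sign(label)

genuine_ok = verify(label, signature)
# Same signature copied onto an altered message.
forged_ok = verify(label + "; subject=FREE PRIZES", signature)

print("genuine label accepted:", genuine_ok)
print("altered label accepted:", forged_ok)
```

Because the signature is computed from a hash of the message, it matches only the exact text that was signed. Copying it onto a different message makes the check fail, which is the point made in the answer above.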
So digitally-signed metadata is something to look forward to.
It is here already. PICS (Platform for Internet Content Selection) is a labelling system for Web pages, originally designed to protect children from unsuitable material. A PICS label can carry a signature proving that the labelling was carried out by a trusted third party. The specification, including signatures, was finalised in May. The same principles will be extended to more complex metadata formats such as Dublin Core, designed for finding information rather than filtering it.
An Irish standard?
International. But a lot of the work was done in Dublin, Ohio, in the United States.
What about the search engines? Will treasures they retrieve always be mingled with dross?
It depends on the quality of the metadata and how the search engines handle it. Trustworthy metadata should help to sharpen up the automatic indexing done by the search engines. Lies and hype posing as metadata could fill your hit-lists with irrelevant rubbish. But search engine operators will be able to protect themselves against misinformation or "stuffing" by telling their automatic harvester programs to ignore metadata that does not carry a trusted signature.
Where can I find out about this?
The World Wide Web Consortium, www.w3.org, is the international body for standards and technologies on the Web.