Kicking up the dust in a chaotic storehouse

October 11, 1996

Researchers are collaborating to bring order to the chaos of the Web.

People find things on the network in different ways. The popular search engines - Alta Vista, Lycos, and so on - are heavily used and are a preferred way into resources for many. They use a vacuum-cleaner approach: they suck up text from Web sites world-wide and index it in large databases.
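As a rough illustration of this suck-and-index approach (and not of how any particular engine is implemented), here is a minimal sketch in Python: fetch a page, break its text into words, and record which pages each word appears on. The URL is a placeholder; real engines also strip markup, rank results and store their indexes at a vastly larger scale.

```python
# A minimal sketch of the "vacuum-cleaner" approach: fetch pages,
# split their text into words, and build an inverted index mapping
# each word to the pages it appears on. The URL is a placeholder.
import re
from urllib.request import urlopen

def build_index(urls):
    index = {}  # word -> set of URLs on which it appears
    for url in urls:
        text = urlopen(url).read().decode("utf-8", errors="replace")
        words = re.findall(r"[a-z]+", text.lower())
        for word in set(words):
            index.setdefault(word, set()).add(url)
    return index

index = build_index(["http://example.org/"])
print(sorted(index.get("example", set())))
```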

They may be comprehensive, but they pose well-known retrieval problems. The user is often overwhelmed by results: the items most relevant to his or her needs may be hidden far down a long list. There is often not enough detail to judge whether any particular resource is relevant or usable. The user may have to sift through much dust to come across one jewel.

These services are the subject of significant development effort. However, other approaches are also gaining currency, based on fuller description of network resources. This article is about some approaches to the creation and use of "metadata".

A growing number of sites rely on manual effort to create resource descriptions. Good examples are the subject information gateways now being developed in the United Kingdom as part of the Electronic Libraries (eLib) programme of the Joint Information Systems Committee. (For a list of these projects and more details see http://ukoln.ac.uk/elib/lists/anr.html). These are organised in various ways but share certain characteristics, including a fondness for striking acronyms:

Adam for art, design and related areas; Sosig for social sciences; Omni for medical sciences; Eevl for engineering; IHR-Info for history; Biz/ed for business. In some ways, they are like early abstracting and indexing services. Subject specialists create records which include keywords, brief descriptions, classification, and data about the technical characteristics of a resource. Resources are screened according to selection and quality criteria before being added to the database.

These fuller descriptions should make searches more precise (improve the jewels-to-dust ratio) and allow users to judge better whether a resource is of interest. They also allow fielded searching: users may search on, or qualify by, particular attributes such as author or title. While these services may rely on automatic robots to assist in the identification of potentially relevant resources, they rely on human intervention for selection and further description. A service with a different scale and focus, but which also creates a database of resource descriptions, is OCLC's Netfirst, hosted on behalf of the UK academic community by Niss (http://www.niss.ac.uk/).
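By way of illustration, here is a minimal sketch of fielded searching over such records. The records and field names are invented for the example; they stand in for the fuller descriptions the gateways create.

```python
# A sketch of fielded searching over gateway-style records: each
# record is a set of attribute-value pairs, and a query may qualify
# by a particular field rather than matching anywhere in the record.
# The records shown are invented for illustration.
records = [
    {"title": "Social Science Starting Points", "author": "Smith",
     "keywords": "sociology, research methods", "format": "HTML"},
    {"title": "Engineering Resource Guide", "author": "Jones",
     "keywords": "mechanics, materials", "format": "PDF"},
]

def fielded_search(records, field, term):
    """Return records whose given field contains the search term."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]

print(fielded_search(records, "author", "smith"))
```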

These services create metadata, that is, data which describes the attributes of other data, in this case a network resource. The metadata created by these projects supports several functions: location, discovery, and selection.

A familiar example of metadata is the library catalogue record, which describes the attributes of a book in the collection. However, these general discovery services typically do not use formats as complex as Marc, the format used for interchange of library bibliographic data. Some use home-grown formats. Some of the eLib services use IAFA/Whois++ templates, which are simple attribute-value structures. Templates exist for various objects: user, document, service, organisation, and others.
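As a rough illustration of the attribute-value style of such templates, here is a sketch of a DOCUMENT-type record and a small parser for it. The layout follows the general pattern of IAFA-style templates rather than any particular service's practice, and the record itself is invented.

```python
# A sketch of an IAFA-style attribute-value template and a minimal
# parser for it. Field names follow the general pattern of such
# templates; the record described is invented for illustration.
RAW = """\
Template-Type: DOCUMENT
Title:         Guide to Social Science Gateways
Description:   A short annotated list of starting points.
URI:           http://example.org/guide.html
"""

def parse_template(raw):
    record = {}
    for line in raw.splitlines():
        if ":" in line:
            field, _, value = line.partition(":")
            record[field.strip()] = value.strip()
    return record

print(parse_template(RAW)["Title"])
```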

There has been much interest in another simple format for Internet resource description, the Dublin Core, which comprises 13 elements (title, author, format, and so on). It was developed at a multidisciplinary workshop organised by OCLC and the National Center for Supercomputing Applications (which developed the Mosaic Web browser) in Dublin, Ohio, last year. This format was refined at a workshop organised by OCLC and Ukoln (the UK Office for Library and Information Networking) in Warwick last April. Also of interest is the decision of Netscape to use SOIF (Summary Object Interchange Format) in its Catalog Server product. SOIF is the internal format of Harvest, a set of tools for the extraction and indexing of data from network resources.
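A Dublin Core description is, in spirit, another small set of attribute-value pairs. The sketch below shows a handful of elements applied to an invented resource; the element names beyond those the article cites (title, author, format) are illustrative assumptions, and the full thirteen-element set is not reproduced here.

```python
# A sketch of a simple Dublin Core style description. Only a subset
# of the thirteen elements is shown, and the resource is invented.
DC_ELEMENTS = {"Title", "Author", "Format", "Date", "Identifier"}

record = {
    "Title": "The Image of the City in Modernist Fiction",
    "Author": "A. N. Author",
    "Format": "text/html",
    "Identifier": "http://example.org/essay.html",
}

# A simple format invites simple validation: reject unknown elements.
unknown = set(record) - DC_ELEMENTS
if unknown:
    raise ValueError(f"not recognised elements: {unknown}")
for element, value in record.items():
    print(f"{element}: {value}")
```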

A third set of services is emerging. Domain-specific initiatives are exploring frameworks for the encoding of documents, data sets and other resources. Particular domains include archives, electronic texts in the humanities, museums, social science data sets and geospatial data, among others. Typically, these initiatives are developing SGML document type definitions and paying special attention to metadata requirements. In some cases they look at metadata separately; in others it is part of a larger framework, considered alongside the encoding of "information content" itself.

Their needs include, but go beyond, the basic descriptive data mentioned above. They are also interested in providing data about the best way to extract the information content of the resource, now and in the future; in its provenance, archival history and various forms of intellectual responsibility; in representing levels of relationship between "collections" and the objects in them; in particular domain-specific attributes; and so on. An example of one relevant approach is the Text Encoding Initiative, which is making a growing number of literary texts available in electronic form. It includes metadata in a "header" prefaced to each text. Another example is the Encoded Archival Description in the archival community. There are other initiatives in all the domains mentioned. These approaches to metadata can be distinguished from basic descriptive formats like the Dublin Core by their fullness, structure, specialism and context of use.
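To give a flavour of the header idea, here is a heavily abridged sketch of a TEI-style header prefaced to an encoded text, with a few lines of Python to pull the title out of it. Real TEI headers are far richer, and the example text is invented.

```python
# A sketch of the TEI idea of a metadata "header" prefaced to an
# encoded text. The header below is heavily abridged and invented;
# real headers also carry source, provenance and encoding details.
import re

TEXT = """<TEI.2>
<teiHeader><fileDesc><titleStmt>
<title>Mrs Dalloway: an electronic edition</title>
</titleStmt></fileDesc></teiHeader>
<text>...the body of the encoded work...</text>
</TEI.2>"""

match = re.search(r"<title>(.*?)</title>", TEXT, re.DOTALL)
print(match.group(1) if match else "no title found")
```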

It is recognised that in an indefinitely large resource space, effective management of networked information will increasingly rely on effective management of metadata. The need for metadata services is already clear in the Internet environment, where search engines work on unstructured data of uncontrolled origin. As the Net develops into a mixed information economy deploying multiple application protocols and formats, this need will be all the greater.

Metadata is not only key to finding relevant material; by establishing the technical or business frameworks in which resources can be used, it will enable people to make effective use of the resources they discover. It will become increasingly important with the spread of "middleware" services which shield the user from the burden of having to interact individually with multiple diverse information repositories. A user may be interested in resources on a particular topic published after a certain date, costing less than a certain amount, in a certain language, and prefer certain geographical locations or physical formats.

Without appropriate metadata, such searches will be very expensive or impossible to carry out, even with significant manual intervention. Effective agent software, for example, will benefit from structured metadata. However, we are not yet looking at achieved solutions; much remains to be done. Suppose that, as a first step, authors or site administrators were to create metadata for their own resources and embed it in their documents in a consistent way. This would benefit the Web indexes, which could suck in data with more predictable content and structure. It might also benefit the selective gateway-type services, which could use it as a head start for their records.

A convention for embedding metadata in HTML was proposed in a break-out group at a World Wide Web Consortium (W3C) workshop earlier this year. The group included the Dublin Core/Warwick Framework metadata representatives, major Web search vendors, the W3C and others. This work aims to provide a solution which does not depart from HTML practice and which does not compromise the behaviour of browsers or the existing Web "robots". It has been taken up by researchers and developers brought together at the Warwick metadata workshop, and it is hoped that agreement will be reached shortly. Ukoln will be promoting strategies for embedded metadata based on the outcome of this work.
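As a sketch of what such embedding might look like, suppose the convention settles on HTML META tags carrying prefixed element names. The "DC." prefix and the page below are illustrative assumptions, not the agreed convention; the few lines of Python show how an index robot might harvest the embedded fields.

```python
# A sketch of metadata embedded in an HTML document head, and a
# small extractor for it. The META-tag convention shown (a "DC."
# prefix on element names) is an assumption for illustration only;
# the actual convention was still under discussion at the time.
from html.parser import HTMLParser

PAGE = """<html><head>
<title>An Essay</title>
<meta name="DC.Title"  content="The Image of the City">
<meta name="DC.Author" content="A. N. Author">
<meta name="DC.Date"   content="1996">
</head><body>...</body></html>"""

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name", "")
            if name.startswith("DC."):
                self.metadata[name[3:]] = attrs.get("content", "")

extractor = MetaExtractor()
extractor.feed(PAGE)
print(extractor.metadata)
```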

The variety discussed suggests that there will not be a monolithic metadata format that meets all needs. There will be specialised metadata formats to meet the needs of particular communities, and formats for particular types of metadata (description, terms and conditions, and so on). For example, the Dublin Core is a simple format focusing on basic descriptive data. Attempts to enhance it to cover a variety of putative requirements would conflict with its intended purpose.

It is undesirable either that there be a single format for resource description or that a single format be indefinitely expanded to accommodate all future requirements. In this context, the Warwick workshop proposed the Warwick Framework: a container architecture for the aggregation and interchange of metadata objects. It would allow applications to bundle different packages of metadata: a Dublin Core package alongside a richer Marc record, say, or a package describing the terms and conditions under which a resource is available. Work has begun on testing the concept using Mime and SGML containers. The July/August issue of D-Lib Magazine at http://www.ukoln.ac.uk/dlib/ has information about the Dublin Core and the Warwick Framework.
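Since Mime is one of the containers being tested, the container idea can be sketched with standard Mime tooling: a multipart message carrying two metadata packages for the same resource. The package labels and contents below are invented; the Warwick Framework itself does not prescribe them.

```python
# A sketch of the container idea behind the Warwick Framework: a
# MIME multipart message bundling two metadata packages for the same
# resource, one simple (Dublin Core style) and one for terms and
# conditions. The labels and contents are invented for illustration.
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

container = MIMEMultipart()
container["Subject"] = "Metadata for http://example.org/essay.html"

dc = MIMEText("Title: The Image of the City\nAuthor: A. N. Author\n")
dc.add_header("Content-Description", "dublin-core-package")
container.attach(dc)

terms = MIMEText("Access: free for educational use\n")
terms.add_header("Content-Description", "terms-and-conditions-package")
container.attach(terms)

print(container.as_string())
```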

Finally, how does it all fit together? A user with an interest in the image of the city in modernist fiction will find resources of interest in several collections: image banks, library catalogues, network archives, and so on. Identifying all the relevant resources and interacting with each of them independently will be tedious. A number of projects should help to provide the invisible glue that will hide this diversity.

Ukoln is working with Bristol and Loughborough universities on the Roads project to provide the eLib subject-based services with a distributed framework based on Whois++, an Internet search and retrieve protocol. This will allow the autonomously managed services to be accessed as a navigable unit.
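At the wire level, a Whois++ search is a short exchange over a TCP connection. The sketch below shows how a client might put an attribute-qualified query to a server; the host name is a placeholder and the query syntax is simplified (the protocol is specified in RFC 1835).

```python
# A sketch of querying a Whois++ server over TCP, as a Roads-style
# gateway might be searched. The host is a placeholder and the query
# is simplified; see RFC 1835 for the protocol itself.
import socket

def whoispp_query(host, query, port=63):  # 63 is the whois++ port
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode("ascii"))
        chunks = []
        while True:  # read until the server closes the connection
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")

# Hypothetical server and attribute-qualified search:
print(whoispp_query("whoispp.example.org", "title=modernism"))
```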

At a higher level, federating solutions will be required which also work across other search and retrieve protocols (Z39.50, LDAP and others) to access heterogeneous metadata stores. The Roads partners are also cooperating in a major European Union Telematics for Research project, Desire, one of whose aims is to investigate appropriate links between subject gateways and robot-based Web indexes.

An important component is expected to be the use of generic metadata, perhaps the Dublin Core, to provide an intermediate layer between diverse richer metadata models. For example, the JISC-funded Arts and Humanities Data Service wishes to provide unified access to the catalogues of its partners, each of which operates in a particular subject area and may take a different approach to its own metadata. Each partner might create Dublin Core descriptions which could be searched together, and which point back to richer domain-specific descriptions. The Warwick Framework may have a role in exchanging metadata between applications. Ukoln and the Arts and Humanities Data Service are to look more closely at these issues in a series of workshops over the next year. Metadata referring to network information resources will need to be integrated at some level with the large, highly fragmented metadata resource for the print literature. There may also be metadata about people, about courses, about research departments and about other objects.
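The mapping from a richer record down to a generic description of this kind is often called a crosswalk. Here is a minimal sketch of the idea; the field names on both sides are invented, and real mappings involve far more judgement than a table lookup.

```python
# A sketch of the "intermediate layer" idea: a crosswalk that maps a
# richer, domain-specific record down to a small Dublin Core style
# description, keeping a pointer back to the full record. The field
# names on both sides are invented for illustration.
CROSSWALK = {              # domain field -> generic element
    "work_title":   "Title",
    "creator_name": "Author",
    "date_issued":  "Date",
}

def to_dublin_core(rich_record, record_uri):
    dc = {"Identifier": record_uri}  # points back to the full record
    for field, element in CROSSWALK.items():
        if field in rich_record:
            dc[element] = rich_record[field]
    return dc

rich = {"work_title": "City Images", "creator_name": "A. N. Author",
        "date_issued": "1923", "provenance": "Deposited 1994"}
print(to_dublin_core(rich, "http://example.org/records/42"))
```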

Lorcan Dempsey is director of Ukoln which is funded by the JISC and the British Library Research and Innovation Centre. Further details about Ukoln can be found at http://www.ukoln.ac.uk/
