Parasites at the root of records

An Introduction to Database Systems

September 8, 1995

A Kuhnian crisis is a point eventually reached by all theories that attempt to describe the world. It is where the supporters of a theory are forced to extend their originally simple idea to encompass more and more situations, while providing less and less additional explanation. The question is, has relational database theory reached this crisis point?

The motivation behind the idea of the database in the late 1960s and early 1970s was that any large business would have many departments that kept the same (or similar) information on file. The probability of the information being inconsistent or out of date across departments was high, and managers would therefore be making decisions based on misinformation. Further, because of the cost and delicacy of mass storage at the time, it seemed sensible to have a central data "base" that was looked after by computer experts in air-conditioned rooms where all the information was stored centrally. The data could then be kept up to date and accessed by all departments.

The notion of files and records in a computer was drawn from the physical concept of the filing cabinet. The limitations of the computer model reflected the physical limitations of paper stored in cardboard folders in drawers. This hierarchical structure was retained in the computer model partly because of the physical limitations of the original mass storage media - paper tape, cards and magnetic tape - and partly because of the limitations imposed by the commercial programming language COBOL.

The database and the limited computer model raised several issues. One issue was the problem of implementing file structures that reflected the constraints of the information being stored, and to do so for very large quantities of data. Another issue was the problem of accessing information from file structures that were not necessarily designed for storing that particular information explicitly. Each department would organise its filing system along the lines of access in which it was interested. For example, the purchasing department would organise according to suppliers and the stores department according to parts. But if the purchasing staff wished to find from their own files all the suppliers selling a particular part, then they would have to exhaustively examine every one of their records. This is because they have to know a supplier in order to access a record.
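The access bias described above can be sketched in a few lines of Python. This is only an illustration: the supplier and part codes, and the names `purchasing_files`, `parts_from` and `suppliers_of`, are invented for the example and do not come from the book.

```python
# Hypothetical purchasing-department files, keyed by supplier.
# Each "record" lists the parts that supplier sells.
purchasing_files = {
    "S1": ["A", "B"],
    "S2": ["A"],
    "S3": ["C"],
}

def parts_from(supplier):
    """The access path the files were designed for: one direct lookup."""
    return purchasing_files[supplier]

def suppliers_of(part):
    """The question the files were not designed for: every record is examined."""
    return sorted(
        supplier
        for supplier, parts in purchasing_files.items()
        if part in parts
    )

direct = parts_from("S1")      # single keyed access
scanned = suppliers_of("A")    # exhaustive scan of all records
```

The first question costs one access; the second costs one access per supplier record, however few suppliers actually sell the part.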

The response to these problems (and others) was to set up a committee whose members represented a world-wide set of interests ranging from large computer manufacturers to major users. This was CODASYL (Conference on Data Systems Languages) which made a proposal based upon random access mass storage (that is, the disk) and an extension of COBOL. At about this time E. F. Codd, then working for IBM in the United States, proposed the idea of using the mathematical concept of a relation to represent files. His greatest insight was to link this representation with the process of normalisation; a process aimed specifically at decomposing existing hierarchical file structures that are biased for a particular access strategy into a set of simpler non-hierarchical file structures that showed no access bias. Ideal for databases. Codd's British colleague, Chris Date, in conjunction with the British Computer Society (Peter King and his Advanced Computing Specialist Group) set up a working party called Understanding Codd Papers. The purpose of this group was to inform the community of Codd's ideas and to translate the mathematical concepts into a language that a computer user could understand.

For purposes of explanation, at this time, the relation was likened to a table of data or, for the computer literate, was treated as equivalent to a computer file. Many of the arguments for normalisation were justified in these terms with particular emphasis on how it overcomes the problems of uniformity of accessing, the maintenance of consistency and the reduction of processing. So users then perceived the idea of a Relational Database as an alternative to the CODASYL proposals, and this perception was encouraged by E. F. Codd and his colleagues, since this was an easy route for the acceptance of their ideas. Such was the origin of the first edition of An Introduction to Database Systems.

Having played a very minor part in these early beginnings (I was a member of the BCS UCP working party, representing ICL), I was interested to see how these ideas might have changed over the past 25 years. What follows is a comment on these developments as reflected through the sixth edition.

The structure of the book ranges from an overview of database management systems aimed at (say) first-year students on an information technology course, through a detailed section on the relational model and database design, to the introduction of data protection and object-oriented systems - the last being de rigueur in current IT books.

The initial definition of a database system is given as: "Essentially nothing more than a computerised record-keeping system. The database itself can be regarded as a kind of electronic filing cabinet."

This seems a very restricted, somewhat old-fashioned view and one that is misleading considering the history of the database. The relation, on the other hand, is defined as a table where the rows are "logical" records and the columns "fields". Note that the nomenclature chosen is that of the computer file.

This definition of a relation leads onto a sequence of its properties, some of which are later abandoned or possibly suspect (for example, attributes are atomic, attributes are unordered) with a re-emphasis that a relation is equivalent to a file, tuples to records and so on. It is true that these statements are carefully hedged with phrases such as "logical, not physical", "disciplined" and "occurrence, not type" but the damage to the reader's understanding has been done (and often irretrievably) in that the distinction between a relation and a computer file is lost.

The root of the problem is that the original notion of a relation was distorted by E. F. Codd and his followers at the inception of its use as a computational data model; a distortion that encouraged its passage into the community. Nevertheless, the model was a great insight and is of considerable importance for the analysis of a working environment.

To recap, the "meaning" of a mathematical relation can be interpreted in two quite distinct ways. The first is to consider it as equivalent to a sentence about the world with blanks; it is a sentence that is understood through the normal process of language. The interpretation of these sentences is referred to as the "intension" of the relation.

The second way of interpreting relations is through the reference of the symbols (the predicate statement) to an object in the world (the set of True combinations). This interpretation is referred to as the "extension" of the relation and it is the extension that governs the logic and proofs behind relational theory. It is that extension that declares that the three relational operations (select, project and join) combined with set operations are complete; that is, they provide all the manipulations ever needed.
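As a rough sketch of the extensional reading, a relation can be modelled in Python as a fixed set of positional tuples, with select, project and join each returning a new set. The function names, the equi-join on a single column position and the sample data are assumptions made for illustration, not definitions taken from the book.

```python
# A relation's extension: an immutable set of tuples. Columns are
# identified by position only, as the extensional view requires.

def select(rel, predicate):
    """Keep the tuples for which predicate(tuple) is true."""
    return frozenset(t for t in rel if predicate(t))

def project(rel, positions):
    """Keep only the listed column positions; duplicates collapse."""
    return frozenset(tuple(t[i] for i in positions) for t in rel)

def join(left, right, left_pos, right_pos):
    """Equi-join on one column; the matched right-hand column is dropped."""
    return frozenset(
        l + tuple(v for j, v in enumerate(r) if j != right_pos)
        for l in left
        for r in right
        if l[left_pos] == r[right_pos]
    )

# Hypothetical data: (supplier, part) and (part, description).
supplies = frozenset({("S1", "A"), ("S1", "B"), ("S2", "A")})
parts = frozenset({("A", "bolt"), ("B", "nut")})

sells_a = project(select(supplies, lambda t: t[1] == "A"), [0])
described = join(supplies, parts, 1, 0)
```

None of the three functions alters its argument; each result is a different set, and so, on the extensional reading, a different relation.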

However, the extensional interpretation means that a relation can never be changed (since this changes the meaning), that the position of a value set (column) is significant and that the level of manipulation through these operations is restricted to the relation itself - the individual values cannot be manipulated through these operations alone. Why should this be bad news for databases? Is it even relevant?

It depends what you expect from a theory. If we take the pragmatic view of C. S. Peirce (1839-1914) then a theory or hypothesis is that set of statements, either formal or informal, which increases the probability of predicting an event or outcome from a given situation; it makes life less surprising. A mathematical theory should provide us with the confidence that certain absolutes hold. In particular, it should provide a set of strong inference procedures that allow us to reason about a particular mathematical object; in this case the object is the relation.

The procedures given, in the first instance, are the set and relational operations and we should with confidence assume that these are complete. However, new procedures have since been created. These are procedures that are not just shorthand for the initial set; for example, the procedure "divide" that allows sets of values in columns to be used for selection as in: "find all suppliers that sell both part A and part B". There are also "insert", "update" and "delete" operations that have no place within the original extensional view.
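The "divide" query quoted above can be illustrated with a minimal Python sketch. It assumes a binary relation of (supplier, part) pairs; the name `divide` follows the review, but the signature and data are invented for the example.

```python
def divide(rel, divisor):
    """Return the first-column values paired with every value in divisor.

    rel is a set of (x, y) pairs; divisor is a set of y values.
    """
    candidates = {x for x, _ in rel}
    return frozenset(
        x for x in candidates
        if all((x, y) in rel for y in divisor)
    )

# Hypothetical (supplier, part) pairs.
supplies = frozenset({("S1", "A"), ("S1", "B"), ("S2", "A")})

# "Find all suppliers that sell both part A and part B."
sells_both = divide(supplies, {"A", "B"})
```

Here only S1 qualifies, because S2 has no pairing with part B.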

Date took the intensional definition of a relation in editions 1 to 5. Unfortunately, the mathematics of intensional objects (intuitionistic reasoning) is not fully understood and I know of no work that has fully explored the concept of the intensional definition of a relation. We cannot have any confidence that the operations are complete nor be able to predict the behaviour of intensional relations. However, in edition 6 the relation has been converted into a "relation variable". That is, the meaning of a relation is now what? - the set of all sets for which a given predicate might at some time be True? This seems to be becoming a little awkward and perhaps not useful. The relational theory is being constructed, in response to empirical demands from database users, from semi-formal insights; insights that are influenced strongly by the database language SQL.

Perhaps we need to consider the database from a different standpoint. Perhaps we should consider the relation as a function. The concept of functional dependency that describes a constraint between two sets of objects is the single most important idea to emerge from relational database theory. It is important because it is the elementary unit of a theory, hypothesis or method. The recognition of a functional dependency between two user-made distinctions in the world (for example, two attributes, entities, objects, processes or concepts) provides us with an atomic theory (or hypothesis) between them. The set of functional dependencies represents a theory of the users' work domain. The notion of Truth, which seems to cause Date some anxiety, does not enter into this discussion. What matters is the usefulness of the theory or hypothesis for prediction. The better the prediction the better the hypothesis.
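One way to picture a functional dependency as a testable hypothesis is the following Python sketch. It checks a proposed dependency against a sample of observations; the helper name `holds`, the attribute names and the rows are all invented for illustration. A sample can refute the hypothesis but cannot establish it for the whole work domain.

```python
def holds(rows, lhs, rhs):
    """Check whether the dependency lhs -> rhs survives the observed rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    Returns False as soon as one lhs value is seen with two rhs values.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical observations from a purchasing department.
rows = [
    {"supplier": "S1", "city": "London", "part": "A"},
    {"supplier": "S1", "city": "London", "part": "B"},
    {"supplier": "S2", "city": "London", "part": "A"},
]

supplier_fixes_city = holds(rows, ["supplier"], ["city"])   # not refuted
city_fixes_supplier = holds(rows, ["city"], ["supplier"])   # refuted
supplier_fixes_part = holds(rows, ["supplier"], ["part"])   # refuted
```

In this sample, knowing the supplier predicts the city, so that hypothesis stands for now; the other two are contradicted by the observations.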

The concept of normalisation depends upon the recognition of functional dependencies between user-made distinctions (for example, attributes and entities). The real value of normalisation is that it provides a framework within which an analyst can reduce the users' work domain into a collection of primitive sets of entities (or objects and so on). This does not mean that the analyst starts with one huge unnormalised relation (where would it come from anyway?) that requires step by step decomposition as described by the normalisation process. It means that the analyst looks for simple entities/objects/concepts that conform to a normalised form; a process that is far more natural than decomposition. Here is the real usefulness of relational database theory. So why does Date insist on pursuing the outdated decomposition approach to normalisation and, worse still, on using the Chen Entity/Relationship diagrams that lock out useful description of the higher order constraints? The "relationship" is a damaging parasite that lives off the backs of relations. Every relationship can always be replaced by a relation. The relation can then have properties and take part in the full design process. Relationships should be dumped, except perhaps as some primitive initial tool for communicating with the user.

The initial distortion of E. F. Codd as presented by C. J. Date has not been corrected and has remained the underlying flaw that has created an industry of amelioration. This industry has generated vast amounts of auxiliary hypotheses to shore up the foundations of the relational database concept by the addition of exceptions, special cases and new provisions. It is an industry that has been the source of considerable cost to business, because it has encouraged bad analysis that has led to the failure of information technology to respond to users' expectations. It has only been through the art and experience of good systems analysts that these disasters have not been worse. Without these distortions the theory was both unacceptable and unusable by the community of users; but with them, as a basis for a formal theory, it is unworkable. The initial mistaken stance of the early relational database community, the collection of ad hoc tools that have been adopted and the long-term intellectual commitment to these ideas have created an unwieldy structure. It is a structure that is creating untold harm and needs to be revised.

Tom Addis is professor of computer science, University of Portsmouth.

An Introduction to Database Systems

Author - C.J. Date
ISBN - 0 201 82458 2
Publisher - Addison Wesley
Price - £24.95
Pages - 839
