Issues raised by polymorphism in relation land

I continue to ponder how to merge the Relational Model (RM) and Object Oriented (OO) worlds in order produce a best-of language. The task is a challenge due to the deep and fundamental differences between OO and the RM. One approach I am exploring is to utilize the notion of multiple logical representations to allow both to co-exist without compromise. Doing this, though, means pinning down canonical isomorphisms between the two models.

One important concept from the OO world is polymorphism. In OO land, we identify a useful abstract concept like "Contact" and build a class around it. We then build classes for more specific types of contacts such as Organization and Person. A challenge for me has been to identify a clearly obvious equivalent to this design in relation land. This has proven quite elusive because there are several relational designs that seem to qualify. The design that seems most commonly used involves a "type" table of some sort followed by tables corresponding to each class. For the running example, a ContactType table, a Contact table, and of course the tables Organization and Person. Continuing the OO design, and for the purpose of introducing polymorphism into the discussion, let's give the Contact an abstract Name property. The Organization class then might just implement the name property as "pass-through" to some field, while the Person class might combine first, middle, and last name fields.

class Contact
{
abstract string GetName();
}

class Person : Contact
{
string FirstName;
string MiddleInitial;
string LastName;
override string GetName()
{
return FirstName + MiddleInitial + LastName;
}
}

class Organization : Contact
{
string OrganizationName;
override string GetName()
{
return OrganizationName;
}
}
In the working relational design, arriving at polymorphic behavior is not so natural. If we could assume that the relational system for which we are designing has "code" as a data type, a GetName column could be placed in the ContactType table. To view contacts with their respective names, the Contact and ContactType tables could be joined and the GetName routine invoked in an extension operation.

table ContactType
{
ID : ContactTypeID;
Description : Description;
GetNameOperator : operator(AID : ContactID) : String;
}

table Contact
{
ID : ContactID;
Type_ID : ContactTypeID;
key { ID };
reference Contact_ContactType ...
}

table Person
{
Contact_ID : ContactID;
FirstName : String;
MiddleInitial : String;
LastName : String;
key { Contact_ID };
reference Person_Contact ...
}

table Organization
{
Contact_ID : ContactID;
OrganizationName : String;
key { Contact_ID };
reference Organization_Contact ...
}

constraint MutualIncl ...

constraint MutualExcl ...

view ContactView
Contact
join (ContactType { ID Type_ID, GetNameOperator })
add { GetNameOperator(ID) Name }
{ ID, Name };
This might provide similar behavior, but doesn't seem as natural. Why not? I am exploring this question; here are some somewhat random observations:

  • ContactType is constant like the other propositions formed by the application's "code". There will always be two rows (Organization and Person) in the ContactType table, whereas the other tables contain variable information. It may not seem to harm anything to use a variable to represent a constant, but this is effectively placing code with data, perhaps forming a part of the reason for the apparent strangeness. As an aside, a table constant could be defined in the Relational Model (RM) as is through a table constrained to equal a table literal, but it seems that a true table constant would be a natural feature. I have often wanted enumerations and/or constants in D4.
  • A Contact is an abstract entity, therefore it doesn't make sense to have just a Contact; a specific type of contact is required. With several somewhat awkward constraints, a relational design can enforce the notion of mutual inclusion and mutual exclusion [shown by D4 tutorials], but the system still lacks the "is a" information that is know to an OO system. Without this information, the system cannot automate this commonly occurring pattern. For instance, if a user attempts to add a contact, rather than furnishing a complicated error after the user has filled out the contact form, a more useful behavior would be to allow the user to choose the type of contact to add.
  • Strictly speaking, the Contact table is redundant unless it introduces it's own non-key attributes. The Contact table could exist as a view based on the union of Organization and Person. Given a sophisticated relational system, this could logically behave identically to the shown relational design. Given a really sophisticated relational system it could even physically behave identically, though Physical Data Independence (PDI) is a topic for another day. Given these seemingly equivalent designs, what is one to conclude about abstract types?
  • Key values and object references have different semantics. Consider these points where a relational system differs from an OO one: Having a contact ID value doesn't guarantee having a contact row. Operators can be defined against a contact ID, but operators must access global state to actually refer to the referenced row(s). If a contact's key is compound, operators against a contact must take multiple arguments. In short, the relational design uses explicit, surrogate values to reference global logical entities, whereas an OO design uses local implicit references.
  • The OO design encapsulates the notions of Contact, Organization, and Person into a scalar form. The relational design leaves these entities as the nebulous product of various components (tables, references, constraints, operators, etc.). The types have not truly been described to the relational system as logical entities. For better and worse, the relational design yields complete transparency. This transparency lends itself to perspective independence and the analytical abilities that follow. On the other hand, encapsulation has also proven itself the great simplifier.
  • What does it mean to have operators as a type in a relational system? What higher order issues does this raise?
  • Code is obviously data, but not all data exists at the same level of discourse. As mentioned previously, the ContactType table seems at the meta-data level with say, the Person table. This is a similar relationship to the one that exists between the objects in the database and the reflective "system catalog" table variables that describe them. So why does the ContactType table in the design exist at the same level as the rest of the "data" while other application operators and such go with the "catalog"? Is there some general mechanism needed in a relational system to describe meta-data?

More recently, rather than trying to tie the two while leaving the RM as is, I've been considering how the RM might be different with a more "liberal" type system. For instance, imagine something like this:

type Contact
{
ID : ContactID;
abstract GetName() : String;
}

type Person : Contact
{
FirstName : ProperName;
MiddleInitial : NameInitial;
LastName : ProperName;
override GetName()
{
return FirstName + MiddleInitial + LastName;
}
}

type Organization : Contact...

type Entity //..for whom we are tracking contacts
{
Contacts : table { Contact : Contact; key Contact.ID };
...
}

The "database", would then simply be a shared variable ("instance") of type Entity (call it Database rather than Entity if you wish). I will not pretend that such a proposal doesn't deeply affect many of the assumptions made by the RM; regardless I see no harm in exploring them so long as it is not done in ignorance of the implications. So what are some of the implications of the above example?

  • A key is defined on an expression, rather than an attribute list. This might seem like heresy, but under some constraints (functional, local, deterministic) I don't see this as a problem. Note that an expression as a key might make functional dependency inference tricky under certain circumstances, but the cases that are currently facilitated by an attribute list would be trivial. This change might be a useful generalization regardless of the other ideas shown.
  • The single attribute in the Contacts table is declared as the abstract type Contact. In the style of OO this means that the value, or "runtime" type, of each row may differ. At a glance the Person and Organization entities may seem to contain more information than the declared type (Contact) allows for, but as I discussed in a prior post, an alternative view is that there are fewer in the set of more specialized types, but the representations are more specific.
  • How about relational operators? How are we to do useful operations on a table with a single attribute? One option is to have the system provide alternate representations of the same type. In the case of the example, the system could automatically provide a representation of a table that looks like a table containing the unencapsulated contact. Such alternative representations might be more useful in light of operations such as projection. At the same time, generalizing some relational operators to support expressions rather than just columns may not be a bad idea either. For example:
    select Contact join MailingList
    by Contact.GetName() = MailingList.Name;
    (is this still an equi-join?)

Representations may or may not provide a way to avoid all of this and keep the RM and OO perspectives independent. Still pondering...

Comments

Unknown said…
What about the following proposal.

There are three tables, contact, organization and person, and we loose the contactType table, which is redundant, because the fact that there is a person record associate with a contact means the contact is a person, the same goes for organization.

table Contact
{
ID : ContactID;
Type_ID : ContactTypeID;
key { ID };
reference Contact_ContactType ...
}

table Person
{
Contact_ID : ContactID;
FirstName : String;
MiddleInitial : String;
LastName : String;
key { Contact_ID };
reference Person_Contact ...
}

table Organization
{
Contact_ID : ContactID;
OrganizationName : String;
key { Contact_ID };
reference Organization_Contact ...
}

I wil use an relation language whe use to express a relational expression. The relevant operators are
relation.union(relation): combines the tuples from two table in one table, only possible if both table headings are equal
relation.extend(fielname, expression): extend the table with a calculated field
projectIn(fieldlist): get only the selected fields in the result table

The following expression does the trick:
Person.projectIn(contact_ID).extend('name', FirstName + MiddleInitial + LastName).union.organization.projectIn(contact_ID).extend('name', OrganizationName)

The dot notation works as follows:
person: get the person table
.projectIn(contact_ID): tale the result so far and get rid of all the fields except
.extend(...): take the result so far and extend with a new field
etc.

The result is a table with contactID's and relevant names.
Nathan Allan said…
Freek,

I'm not sure how what you are suggesting solves the problem. The problem is how to a) enforce the constraints involved; and b) how to reason about and abstract conceptual cases of inheritance in a relational schema.

Certainly we can compute a common description attribute as you suggest using today's relational and pseudo-relational languages. A query in D4 for instance which does what you suggest might look like this:
Person { ID, First + " " + Last Description }
union (Organization { ID, Name Description })
Nathan Allan said…
Oops, forgot my third bullet: c) to provide polymorphism so as to enable a method for layering abstraction and extensibility.

Just as above a union may be used to hard-code the known contact sub-tables, every virtual method invocation in OO land could in fact be replaced with a switch/case statement given a concrete case, but the point that the target can change at run-time and that the target constitutes and extensibility contract. This form of abstraction allows code to be layered into smaller autonomous units. This form of modularity is, in my opinion, necessary in order to scale software complexity beyond a certain point.

Popular posts from this blog

Don't Repeat Yourself... Really!

Camtasia Studio Tips