Actively managing missing information through Condition Propagation

Few dead horses have been beaten as severely as the "missing information" one, yet I shall now drag it out and resume the thrashing. When arguing matters of null, discussion usually centers on "optional" columns in tables, but I wish to start on constructs that seem to me to be related, yet are rarely if ever discussed as such. Consider the following:

Points[10];

As I see it, there are four primary ways in which a system may deal with the range issues surrounding the indexer:

  1. Always give a point value even if this means interpreting memory beyond the actual items allocated for the array. This is the approach taken, at least in some cases, by many pointer-based languages.
  2. Implement the indexer as a condition check followed by exception propagation logic if the condition fails; in other words, at runtime stop processing at whatever level and raise an error. This is the approach taken by most modern languages.
  3. Implement the indexer with a runtime range check and return the null non-value (extending the core logic system to include null).
  4. Propagate the range condition as compiler meta-data and provide compiler level mechanisms for dealing with these conditions.

Option 1 is clearly problematic as it requires no less than perfect diligence to ensure correct results. In other words, the compiler and runtime are no help in ensuring that all range conditions are satisfied.

Option 2 has the benefit of ensuring range checks, but has disadvantages including:

  • Condition checks are often duplicated; the developer must check them to avoid exceptions, yet the runtime checks them again.
  • Exceptions only surface at run-time; no help is given at compile-time regarding a context's conditions.
  • Exception propagation can only be managed imperatively; there are no means, for example, to interpret exceptions within an expression.
  • What really is an exception anyway? In other words, if certain exceptions are actually anticipated, are they really exceptions? Where is the line between "planned" behavior and an exception? This sounds merely philosophical, but the ambiguity plagues today's pervasive languages with practical problems.

Option 3 is basically the stance taken by most "database" languages such as SQL and current D4. The approach does have the advantage of allowing a more optimistic design style, propagating missing information through the expression, but:

  • All missing information is lumped into a single "null" non-value; there is no visibility regarding where the null originated or why.
  • Nulls are forcefully "converted" at certain language boundaries and cause violations at others.
  • No compile-time information is provided as to where nulls might originate and whether those points have been addressed.

Option 4 has been alluded to in various discussions, but to my knowledge no concrete proposals or implementations have been produced. It is this option that I wish to explore. The heart of the concept is to treat missing information as a primary concern in the development process; something recognized by the compiler and considered by the developer, just like types, variables, and the rest of the language's elements.

Under Condition Propagation, missing information is explicitly noted and avoided in advance. Rather than extend the logic system or stop execution at run time, the compiler in effect ensures that an expression is never evaluated unless its conditions are satisfied. The logical implications of this are significant. Unlike null-like solutions, operators, and thus truth tables, remain logically unaffected by missing information. For example, the AND truth table remains in its traditional, four-entry form. Each primitive operator may publish compiler meta-data describing the conditions under which the operator is able to operate. With this information the compiler ensures that any conditions are resolved prior to evaluation. The compiler also propagates the condition meta-data so the developer can choose at what point, prior to evaluation, to resolve the condition. Logic, arithmetic, relational, and other operations can all be defined in their true logical form, unaware of missing information, because they will never be asked to deal with it.
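Since no such compiler exists, a rough TypeScript sketch may help make the shape of the idea concrete. The names below (Conditional, met, unmet, lift2, index) are purely illustrative, and in the real proposal this bookkeeping would be compile-time meta-data rather than a runtime wrapper; the only point of the sketch is that the operators themselves stay two-valued while a named condition travels alongside the value.

// Illustrative only: a runtime stand-in for what the proposal keeps as
// compile-time meta-data. A Conditional<T> is either a real value or a
// record of which named condition was not met.
type Conditional<T> =
  | { kind: "value"; value: T }
  | { kind: "unmet"; condition: string };

const met = <T>(value: T): Conditional<T> => ({ kind: "value", value });
const unmet = <T>(condition: string): Conditional<T> => ({ kind: "unmet", condition });

// Lift an ordinary two-argument operator over conditional arguments.
// The operator itself never sees a missing value; a Boolean operator such
// as AND keeps its traditional four-entry truth table.
function lift2<A, B, R>(
  op: (a: A, b: B) => R,
  a: Conditional<A>,
  b: Conditional<B>
): Conditional<R> {
  if (a.kind === "unmet") return a; // propagate the named condition
  if (b.kind === "unmet") return b;
  return met(op(a.value, b.value));
}

// An operator "publishes" the condition it originates; here, an indexer
// originates an in-range condition (using zero-based indexing).
function index<T>(items: T[], i: number, condition: string): Conditional<T> {
  return i >= 0 && i < items.length ? met(items[i]) : unmet(condition);
}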

For example's sake, assume a compiler supporting Condition Propagation and consider the case of Points[10]. The compiler "knows" that indexing into Points introduces a condition, namely that the array must contain at least 10 elements, and that the expression can only produce a value if that condition is satisfied. We'll call this condition PointsInRange. This information is noted by the compiler and propagated to the next operation in the expression. In the example, the next operation will be a call to Translate:

Translate(Points[10], 5, 5)
(Translate returns a point shifted by the given x and y deltas)

The Translate call inherits the condition PointsInRange. Stepping back, this makes sense: Translate in this context cannot provide a value unless the Points indexer provides a value. Let's now resolve this condition so that an unconditional result value is produced.

Translate(Points[10], 5, 5) resolve PointsInRange as Point(0, 0)
// result when indexer is out of range: Point(0, 0)
(hypothetical syntax for illustration)

Allowing the condition to propagate to the Translate call before being resolved has significantly different logical implications than addressing the condition at the indexer, as in:

Translate(Points[10] resolve as Point(0, 0), 5, 5)
// result when indexer is out of range: Point(5, 5)
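Using the Conditional sketch from earlier, the difference between the two resolution points can be made concrete. Point, translate, lift1, and resolve below are again hypothetical names of my own; the point is only that resolving after propagation defaults the whole result, while resolving at the indexer defaults only the indexer's value.

// Continuing the earlier sketch. Translate is written in its "true logical
// form": it never checks for missing input.
type Point = { x: number; y: number };
const translate = (p: Point, dx: number, dy: number): Point => ({ x: p.x + dx, y: p.y + dy });

// Apply an ordinary one-argument operation to a conditional value.
function lift1<A, R>(op: (a: A) => R, a: Conditional<A>): Conditional<R> {
  return a.kind === "value" ? met(op(a.value)) : a;
}

// "resolve ... as v": supply a default once the condition is the only thing
// standing between the expression and a value.
function resolve<T>(c: Conditional<T>, fallback: T): T {
  return c.kind === "value" ? c.value : fallback;
}

const Points: Point[] = []; // deliberately too short, so PointsInRange is unmet

// Translate(Points[10], 5, 5) resolve PointsInRange as Point(0, 0):
// the condition propagates through Translate, so the whole result is defaulted.
const late = resolve(
  lift1((p: Point) => translate(p, 5, 5), index(Points, 10, "PointsInRange")),
  { x: 0, y: 0 }
); // -> { x: 0, y: 0 }

// Translate(Points[10] resolve as Point(0, 0), 5, 5):
// the indexer is defaulted first, and the default is then translated.
const early = translate(
  resolve(index(Points, 10, "PointsInRange"), { x: 0, y: 0 }),
  5, 5
); // -> { x: 5, y: 5 }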

This type of propagation can be useful for many reasons and is often cited as an advantage of null-based systems. Primarily, the propagation allows for an optimistic coding style while still allowing conditions to be resolved. Consider this example:

Parse(String : String) : Integer

In a language with exceptions, an unparseable string would typically be handled by throwing a runtime exception. In a null-supporting language, the function might return null. Due to the rather restrictive nature of exceptions, and the realization that an unparseable string isn't really an exception but an anticipated outcome, another pattern is often advocated:

TryParse(String : String, out ID : UserID) : Boolean

This pattern does make the parseability of the string explicit as the return value, but suffers from being hopelessly imperative. Clearly the user ID cannot be used until a subsequent statement in a block. Note that this is the pattern recently utilized and advocated by Microsoft in the .NET Framework 2.0.
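For comparison, here is a rough TypeScript rendering of the two conventional patterns. TypeScript has no out parameters, so the TryParse shape is simulated with a result object, and parse and tryParse are my own stand-in names; the sketch only illustrates the complaint that the parsed value cannot be consumed until a subsequent, explicitly guarded statement.

// Exception style: an unparseable string is treated as "exceptional",
// even though it is a perfectly anticipated outcome.
function parse(s: string): number {
  const n = Number(s);
  if (s.trim() === "" || !Number.isInteger(n)) {
    throw new Error(`'${s}' is not an integer`);
  }
  return n;
}

// TryParse style: parseability is explicit in the return value, but usage
// is hopelessly imperative.
function tryParse(s: string): { ok: boolean; value: number } {
  const n = Number(s);
  return s.trim() !== "" && Number.isInteger(n)
    ? { ok: true, value: n }
    : { ok: false, value: 0 };
}

const result = tryParse("42");
if (result.ok) {
  // Only here, in a later statement, may the parsed value actually be used.
  console.log(result.value + 1);
}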

What happens if conditions are never resolved? There are several possible options, including:

  • The compiler could address unresolved conditions at expression boundaries (sketched below). It could, for instance:
      • Generate code to throw an exception; once out of the expressive realm, exceptions are more useful than they are within an expression.
      • Raise compiler warnings and/or errors.
  • The compiler could pass the condition meta-data on to imperative operations. For instance:
      • Assigning a conditional expression to an optional attribute could have the effect of clearing the optional attribute if the condition is not met.
      • A failed condition at run-time may turn the imperative operation into a no-op; or, perhaps more generally:
      • The operation could branch or switch based on conditions.
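In the runtime analogy, the first option, addressing unresolved conditions at an expression boundary, is just a narrowing step that converts a still-unmet condition into an exception once and for all; demand is my own name for it, reusing the Conditional sketch from earlier.

// At an expression boundary, an unresolved condition can be converted into
// an exception (or a compile-time warning or error, in the real proposal).
function demand<T>(c: Conditional<T>): T {
  if (c.kind === "unmet") {
    throw new Error(`Unresolved condition: ${c.condition}`);
  }
  return c.value;
}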

It seems possible that language support for Condition Propagation would significantly automate many processes developers manually handle without such support. A simple example might be:

var i := Parse(SomeString)
    when not StringParsable then
    begin
        ShowMessage("Invalid integer value, try again");
        return;
    end;

In this example, the "when…" clause is part of the assignment, allowing branching on unresolved conditions in the expression. If the string is not convertible to an integer, the assignment is not performed and the conditional block executes instead. The effect is that not only functional operators, but imperative ones as well, can be kept from ever encountering non-values.
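Without such support, the developer writes the same intent out by hand: evaluate the condition explicitly, branch, and only then assign. Here is a rough TypeScript equivalent of what the hypothetical when-clause automates; readInteger, stringParsable, and the regular-expression check are my own stand-ins.

// What is written by hand today in place of the "when" clause: the condition
// is evaluated explicitly, and the assignment is guarded by it.
function readInteger(someString: string): number | undefined {
  const stringParsable = /^-?\d+$/.test(someString); // the condition, checked up front
  if (!stringParsable) {
    console.log("Invalid integer value, try again");
    return undefined; // the assignment never happens
  }
  return parseInt(someString, 10); // only reached when the condition holds
}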

The language should also have the ability to operate on conditions directly, such as:

if Points[10] satisfies PointsInRange then ...
(Returns true if there are at least 10 points in the Points array)

The optimization possibilities for Condition Propagation seem immense. In the example above, the actual evaluation of Points[10] can be pruned from the execution plan entirely; only the PointsInRange condition needs to be evaluated.

Note that regardless of what expression the compiler must evaluate to check a condition, the condition itself is always dealt with as a simple Boolean value; the compiler generates the actual code that evaluates it.

It should be possible to preserve both the value and the conditions of an expression in an imperative context:

var p := Points[10];
select p satisfies PointsInRange;

In this last example, the compiler determines the type of p based on the expression Points[10], but what type could preserve the PointsInRange condition? This is a trick question: it isn't the type that maintains the condition, it is the variable. The "manual" variable declaration might look like this:

var PointsInRange : Boolean;
var p : Point when PointsInRange;

If p receives an unconditional value (p := Point(0, 0);), it is clear to the compiler that PointsInRange must now be satisfied, or p could not have received a value. The compiler can thus maintain the conditional nature of p even in an imperative context. If single variables can be conditional, what about attributes within a tuple (AKA row, record, struct)? Let's call such a structure a sparse row:

row
{
    Name : String;
    HasHair : Boolean;
    when HasHair
    {
        HairColor : Color;
        HairLength : Length;
    }
    when not HasHair
    {
        IsScalpWaxed : Boolean;
    }
}

The Name and HasHair attributes are required, whereas the other attributes are conditional on the HasHair attribute. When leaving the expressive realm, it becomes important to make conditions explicit data rather than just compiler meta-data; otherwise, the conditional information would be "hidden" and the resulting solution could not be deemed complete from a data management perspective. Note that conditional attributes like HairColor and HairLength are typically represented in systems without making the conditional relationship known to the system. Relational systems do provide a means to describe such conditional relationships, through cardinality relationships and constraints. It could be argued that because a relational representation of the design is possible, there is no need for the extra complexity of a sparse row. This might be true if the only useful representation of the information were relational, but that simply is not the case. The relational model itself builds upon tuples and scalar values, and in order for the Information Principle to hold, there must be relational representations of information for which there are also tuple and scalar representations. A sparse row can therefore be seen as another logical representation of relations having certain cardinality relationships. It is also worth noting that a sparse row is a much more concise shorthand for what would otherwise be a rather complex relational schema. I will save further discussion of this point for another day.
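For what it is worth, discriminated unions in some modern type systems already capture a slice of this idea for record types: the compiler refuses to let you touch a conditional attribute until its condition has been narrowed. Here is a TypeScript sketch of the sparse row above; it is an analogy only, with Color and Length simplified to stand-in types.

type Color = "black" | "brown" | "green";
type Length = number; // e.g. centimetres

// A rendering of the sparse row: HasHair is the discriminant, and each
// conditional attribute exists only on the variant where its condition holds.
type Person =
  | { Name: string; HasHair: true; HairColor: Color; HairLength: Length }
  | { Name: string; HasHair: false; IsScalpWaxed: boolean };

function describe(p: Person): string {
  // Referencing p.HairColor here would be a compile-time error: the
  // HasHair condition is still unresolved.
  if (p.HasHair) {
    return `${p.Name} has ${p.HairColor} hair`; // condition narrowed, attribute usable
  }
  return `${p.Name}: scalp waxed? ${p.IsScalpWaxed}`;
}

Returning to the hypothetical sparse-row syntax, consider reading and writing a conditional attribute of a variable MyPerson of the row type above: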

var c := MyPerson.HairColor;
MyPerson.HairColor := Color.Green;

"Reading" the value of a conditional attribute, such as HairColor in the example, introduces a condition to the compiler, but what happens when a conditional attribute is written to? The answer of course depends on how assignment is defined. A sophisticated language might support the notion of updating the condition to make it true. Updating through scalar expressions is trivial for cases such as an identifier reference and simple unary operations (such as NOT in the example). So the effect is that setting HairColor would ensure that HasHair was true. Other semantics are also possible so long as the language maintains the integrity of the structures.

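In the discriminated-union analogy above, "updating the condition to make it true" amounts to rebuilding the row in the variant whose condition holds. A possible rendering follows; setHairColor is my own name, and note that a real language would also need a policy for the sibling conditional attributes, which are crudely defaulted here.

// Assigning to a conditional attribute forces its condition: the result is
// always a Person for whom HasHair is true.
function setHairColor(p: Person, color: Color): Person {
  return p.HasHair
    ? { ...p, HairColor: color } // condition already held; just update the attribute
    : { Name: p.Name, HasHair: true, HairColor: color, HairLength: 0 }; // condition made true; HairLength defaulted arbitrarily
}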
The examples thus far have explored only a few cases; let's look at other areas plagued by missing information where Condition Propagation might be applied:

  • Row extractors (the extraction of a single tuple from a relation value containing only a single tuple). It is useful to define the extraction condition as two conditions: 1. the table must not contain more than one row; 2. the table must not contain zero rows. A relational compiler can determine whether more than one row is even possible by checking for the "empty" key (a key with no columns). Thus, row extraction would only need to originate a ">1" cardinality condition if the source table could not be shown, in this way, to be a singleton. The other condition, that the source table not be empty, could likewise be suppressed if a corresponding constraint could be identified.
  • Divide by zero. In computer math, it is most accurate to describe the result of a divide by zero as "not a value". Condition Propagation seems perfect for representing this: because expressions whose conditions are not satisfied are never evaluated, every expression that is evaluated has a value (see the sketch following this list).
  • Outer joins. The row type of a table resulting from an outer join can simply be a sparse row with the outer columns conditioned on an "existence" attribute. Note that D4 already has the ability to include such an existence attribute as part of an outer join operation.
  • Aggregate operators. Aggregate operators such as Sum, Min, and Count face two missing-information problems. One problem traditionally has been the possibility of missing information in the source; for instance, should Count count a null "value" or not? This problem is clearly addressed by Condition Propagation because, like all other operators, the aggregate can always assume "real" values to operate on. The other problem faced by aggregate operators is an empty set as a source: what, for instance, is the average of an empty set? Condition Propagation again seems to address the issue, because aggregate operators that require non-empty sets can originate this requirement as a condition (also sketched below).
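Reusing the Conditional sketch from earlier, the divide-by-zero and empty-aggregate cases both reduce to operators that simply publish the condition they require; divide, average, and the condition names are my own inventions for illustration.

// Division publishes its own condition rather than producing "not a value":
// the quotient is only computed when DenominatorNonZero is met.
function divide(n: number, d: number): Conditional<number> {
  return d !== 0 ? met(n / d) : unmet("DenominatorNonZero");
}

// Average originates a non-empty-source condition instead of inventing a
// value (or a null) for the empty case.
function average(xs: number[]): Conditional<number> {
  return xs.length > 0
    ? met(xs.reduce((a, b) => a + b, 0) / xs.length)
    : unmet("SourceNotEmpty");
}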

I know that those who closely monitor relational theory are likely squeamish at what might seem like "extensions" to the relational model in order to have tables based on such things as sparse rows. I can understand this reaction, but consider that, again, the "sparseness" is removed before the operators even operate. Thus, all relational operators will only be asked to operate on "regular" rows. Another way to look at Condition Propagation is that the ability for things to be conditional exists at a higher level of abstraction, another logic system if you will, superimposed over a more "pure and simple" realm of logic in which values are never missing when they are expected.

On the topic of usability: a compiler supporting Condition Propagation might "know" what conditions exist at each point in the code, but how is the developer to know them while coding? The compiler may emit errors and warnings, but this information would be much more helpful while the developer is actually formulating statements. I see this problem being addressed in essentially the same manner that modern development environments aid developers with data types: conditions could be visualized by a mechanism similar to IntelliSense, whereby the editor automatically displays annotations indicating the conditions in effect.

Optimization was briefly mentioned earlier, but it is worth discussing in a little more detail. Consider just how often applications perform duplicate bounds checking. Strongly typed, exception-raising languages, for instance, perform a range check every time the developer accesses an item in an array, in addition to the checks already written by the developer. The only way to avoid this in such languages is to invoke the operation, catch any exception, and perform the condition logic that way. In practice this is not viable because a) exceptions are purely imperative; b) the approach is very cumbersome; and c) exception propagation is complex by nature, so the resulting solution will be dramatically slower than a condition check.

Though there are clearly code-generation optimizations to be had, there seem to be even greater possibilities for logical optimizations. Nulls and exceptions introduce almost insurmountable obstacles to even basic logical optimizations; a tautology or contradiction check, for instance, fundamentally assumes two-valued logic (with emphasis on the word value).

Further research is obviously needed, and many issues are likely to arise during the design and implementation of a concrete language based on Condition Propagation. Questions might include: what happens to conditions in the presence of non-functional operations (side effects)? Must conditions be resolved before invoking such operators, or can they safely propagate through them? Can all conditions be deterministic? How should the compiler derive names for conditions and keep them from colliding? Questions aside, the approach seems promising, and I look forward to exploring it further.
