Saturday, July 24, 2010

Solving a 30 Year Puzzle

I have been interested in programming languages for over 30 years. One of the first programming languages I was interested in was Algol 68. Indeed I have always been annoyed that all subsequent languages get some things wrong that Algol 68 got right. More on that later.

One thing Algol 68 gets wrong is that it conflates variables and pointers. For clarity I’ll use more modern notation, only consider variables, and restrict arrays to 1 dimension: Var<X> for the type of X variables and Vect<X> for the type of 0-based 1-dimensional arrays of X. I’ll also show explicitly where parameterless functions are called.

What I have always liked about Algol 68 (that all other languages get wrong) is that identifiers stand for some sort of constant. The core declaration syntax is just “type identifier = value”.
  • Int three = 3;
  • Var<Int> i = loc<Int>();
  • Vect<Int> one2three = (1,2,3);
loc<Int> returns a Var<Int> entity (a box you can put an Int into) from the stack. The distinction between Int and Var<Int> is crucial for thinking clearly about programming. Note that Vect<Int> is a constant array of Int. We also note that the type denotation on the left is redundant, since it is just the type of the expression on the right, and a modern language would leave that out.

Constant arrays are undoubtedly useful, but mostly people will want traditional arrays that can be updated in place. This is where Algol 68 makes a delicate decision. The obvious decision is that an updateable array is a Vect<Var<X>>. However that seemed to imply passing around lists of addresses of variables. This seemed undesirable. So instead (actually: as well) they made special rules for Var<Vect<X>>. So if we have:
  • Var<Vect<Int>> vvi = loc<Vect<Int>>(3) := (1,2,3);
The rule is that normal Algol 68 dereferencing doesn’t happen when you write, for example, vvi[2]. Instead some magic happens and vvi[2] becomes a Var<Int> that allows you to update vvi in place.

This is really sad. Without this then we can easily regard an array as a procedure, with subscripting and slicing being normal procedure calls. Then a lot of stuff could be moved out of the core language into the standard library. It is always good if stuff can be done in the standard library, because it raises the chance that authors of other libraries will be able to achieve a good result.

Indeed it would be nice to get Var out of the core language. Var itself interacts horribly with type inclusion/coercion, being neither covariant nor contravariant in the underlying type. So if every X is a Y, and every Y is a Z, and a procedure takes Var<Y> parameter then you can’t consistently pass either a Var<X> or a Var<Z> to the procedure. Also standard variable semantics are unsuitable for the modern world where concurrency is exploited at all scales. It seems that we need multiple Var-like types with different concurrency guarantees.

Getting back to the problem. First we need to understand a couple of things.

Representing values at run time

How many bits do you need to represent the number 3 in a computer? If you didn't get 0 then try this one. How many bits does it take to represent a number that might be 17 or 18? Now it is more obvious that the answer is 1 bit: 0 for 17 and 1 for 18 (or vice versa). Of course it might result in a shorter and/or faster program if the number is represented by 17 for 17 and 18 for 18, but that might not be the case.

Maybe you think that this only applies when the values are known at compile time. But JIT compilers are becoming the norm, so that distinction is disappearing. In an Algol 68 style language where every identifier stands for a constant, it may not be necessary to actually materialize the value at run time at all. Indeed that is the norm for Var values, since they are normally just a fixed offset from the frame pointer, and machines typically support that conveniently in the hardware.

Assembly Language had it right

Most of my early programming was in various assembly languages. Assembly language is just a convenience veneer on the machines underlying instruction set. In assembler identifiers are created in two ways:
  • A define statement gives an identifier a fixed meaning as a bit string.
  • A label statement gives an identifier a fixed meaning as a memory address of some code or of some reserved memory.

This was right. The first languages (Fortran, Algol) used identifiers for changing values. Algol 68 tried to get back on track, but slightly got it wrong. After that it was downhill all the way in the mainstream. On the side we had functional programming, but that doesn’t solve the problem of handling Var types correctly, it just tries to eliminate them.

The right answer

Putting it together we see that we can implement a normal updateable array in the natural way as a Vect<Var<X>>. It can normally be a sequence of consecutive Var<X> boxes. The compiler can remember this and doesn’t need to materialize the addresses of the Var<X> boxes. We still want to write:
  • Vect<Var<Int>> vvi = loc<Vect<Var<Int>>>(3) := (1,2,3);
but now we need to enhance the := assignment operator to do parallel assignment.

But what happens when we want to pass a Vect<Var<X>> as a parameter to a procedure? Mostly we want to send that as a start memory address and a length, and perhaps a stride. But we don’t want to abandon the option to send a fully general list of Var<X>, i.e. of pointers to Xs in memory. Both these representations are valid and don’t change the semantics of Vect<Var<X>>.

We need to embrace this. Each type can have multiple representations defined in terms of other more low level types. At the bottom things have to be defined in terms of the capabilities of the raw machine, or some abstract machine. So for a given procedure there are a finite number of implementations covering the possible input representations, and indeed the possible output representations. There is no reason for this to get out of control in the era of JIT compilation. It is hard to believe that duplicated implementations of procedures can work out worse than the technique commonly used for traits/interfaces of passing a pointer to the object and a pointer to a list of implemented methods.

Well anyway it will be interesting to investigate. My concept of a nice programming language seems within reach for the first time. The name will be COPL, for Closure Oriented Programming Language.