At today's CRG meeting, we talked about what we want to do with KRoC -- and occam in general -- in the near future. The consensus was that we want to make occam a good tool for use on machines with multiple CPUs on a single chip. This means we need to get KRoC (or, more likely, its successor) compiling code that runs efficiently on SMP systems with some degree of shared cache, but also that we need to make KRoC perform well on sequential code. phw suggested that we should aim to produce better code than a sequential programmer can hand-craft using the vendor compiler.
The vendor compiler in the case of the Intel IA32 series is icc, which is well known for producing decent object code. (It's a proprietary product, and not a cheap one; there's also GCC, which comes pretty close and is free software.) My conjecture is that Intel's compiler works very well because its developers have access to both the excellent Intel processor manuals and the internal design documents of the chip -- they should be able to tell precisely what effect code will have on the chip's internals. We've got access to the public manuals, but we don't have the internal knowledge. We're also a fairly small team; Intel and the FSF have both had large teams of extremely skilled people working on their compilers for many years. We could write an optimiser as good as GCC's -- and probably one as good as icc's, with some guesswork -- but it'd take a long time and a lot of work, be extremely complex, and need doing all over again when we want to target a different processor.
So let's not bother. Instead, let's let the computer do the work.
A modern processor can be viewed as a black box containing a complex system. We provide some input -- object code -- and can measure some output -- how fast it executes (averaged over a number of iterations). This is the sort of problem that evolutionary algorithms are designed to solve.
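To make the black-box view concrete, here's a minimal sketch of the fitness measurement in Python (run_block is hypothetical -- it stands for whatever mechanism executes one compiled candidate chunk):

    import time

    def fitness(run_block, iterations=10000):
        # Score a candidate by its average execution time over many
        # iterations; lower is fitter. run_block is hypothetical: it
        # would execute the compiled chunk under test once.
        start = time.perf_counter()
        for _ in range(iterations):
            run_block()
        return (time.perf_counter() - start) / iterations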
Specifically, I propose that we chop our program up into chunks that perform sets of operations, with input and output data, and then optimise each chunk using an evolutionary algorithm. The evolutionary population must consist only of programs that are known to be correct, so our mutation operators must preserve correctness. That's easier than it sounds, because the optimisation best suited to this approach is instruction reordering: once we've built a dependency graph, we can easily reorder independent instructions without changing the semantics. Since we want to avoid our optimiser getting stuck in a local maximum, we'd need a mix of small and large sets of mutations; some children should have just one mutation, and some should have a large number. (This is all stuff that an expert on evolutionary computation would be able to tell us more about.)
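As a very rough sketch of that mutation operator -- in Python rather than anything KRoC-specific, and using a toy representation where each instruction is an (opcode, registers-read, registers-written) triple, which is an assumption of mine, not KRoC's actual intermediate form:

    import random

    # Toy instruction: (opcode, registers read, registers written),
    # with frozensets so instructions are hashable and comparable.
    def depends(a, b):
        # True if b must stay after a: b reads something a writes,
        # or writes something a reads or writes.
        _, ra, wa = a
        _, rb, wb = b
        return bool(wa & rb or ra & wb or wa & wb)

    def reorder_mutation(block, rng=random):
        # Swap two adjacent instructions, but only when no dependency
        # links them, so every child is still semantically correct.
        block = list(block)
        if len(block) < 2:
            return block
        i = rng.randrange(len(block) - 1)
        if not depends(block[i], block[i + 1]):
            block[i], block[i + 1] = block[i + 1], block[i]
        return block

    def mutate(block, rng=random):
        # Mix small and large mutations: most children get one swap,
        # some get many, to help escape local maxima.
        n = 1 if rng.random() < 0.7 else rng.randrange(2, 16)
        for _ in range(n):
            block = reorder_mutation(block, rng)
        return block

Adjacent swaps of independent instructions are enough on their own: repeated swaps can reach any ordering that respects the dependency graph.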
Measuring how fast code executes is an interesting problem, because doing it under real-world conditions requires sensible input data. One approach would be to compile a "dumb" (unoptimised) version of the program and run it on real inputs up to the point at which we want to try exchanging code, capturing the machine state there to use as benchmark input. This requires input data that exercises every code path (or, at least, every code path we want to optimise).
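In sketch form (recorded_states and time_chunk are hypothetical helpers: the first holds machine states captured from the dumb build at the chunk's entry point, the second runs a candidate once from such a state and times it):

    def realistic_fitness(candidate, recorded_states, time_chunk):
        # Average a candidate's timing over machine states captured at
        # the chunk's entry point while the unoptimised build ran on
        # real input data, so fitness reflects realistic inputs.
        return (sum(time_chunk(candidate, s) for s in recorded_states)
                / len(recorded_states))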
An extra operator that might be worth having for some processors would be to randomly insert NOPs or other placeholders that could later be replaced with instructions from the next block; I'm guessing that there are cases where inserting an extra instruction could prevent a pipeline stall?
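In the toy representation above, this operator is trivially safe, since a NOP reads and writes nothing:

    import random

    NOP = ("nop", frozenset(), frozenset())  # touches no registers

    def insert_nop_mutation(block, rng=random):
        # Insert a NOP at a random position; it can never violate a
        # dependency, and a later mutation could overwrite it with an
        # independent instruction pulled in from the next block.
        block = list(block)
        block.insert(rng.randrange(len(block) + 1), NOP)
        return block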
One problem is that you need to be able to run code on the processor you're targeting, so cross-compilation is awkward. It also generates code tuned to the specific processor you're testing on; I'd expect different processors in the same range to behave differently depending on the amount of cache, the pipeline design, and so on. (I remember reading an article a few years ago about evolutionary FPGA design, where the resulting design worked beautifully, but only on the particular set of FPGAs it had been tested on, and only in a particular temperature range! That was based on pure random mutation, though, so the resulting "design" depended on the analogue rather than the digital properties of the FPGA's gates...)
Another problem is that this could produce a seriously slow compiler. However, once we've optimised a block of code once, there's no need to do it again -- we can cache the result for future compilations of similar code. Even better, we could do this over the Internet, with a global repository of optimisations used by all the compiler's users -- true distributed compilation! This should be perfectly safe: since the compiler knows the set of safe mutations that can be performed on a block, it can verify that the optimised code has the same dataflow semantics as the original -- a check we probably want to perform anyway. So even if someone submits a "maliciously-optimised" bit of code to the repository, the compiler will spot that it doesn't do the same as the original code. This idea of a global cache could also be used to average out differences between processors; the fitness calculation that the evolutionary optimiser performs could take into account fitness measurements from other systems.
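Here's a sketch of that verification, reusing the depends test from the reordering sketch above and assuming, for simplicity, that no instruction appears twice in a block and that no NOP padding was added:

    from collections import Counter

    def same_semantics(original, candidate):
        # Accept a submitted optimisation only if it is a permutation
        # of the original block in which every dependent pair keeps
        # its order -- i.e. something our own safe mutations could
        # have produced.
        if Counter(original) != Counter(candidate):
            return False
        pos = {insn: i for i, insn in enumerate(candidate)}
        for i, a in enumerate(original):
            for b in original[i + 1:]:
                if depends(a, b) and pos[a] > pos[b]:
                    return False
        return True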
One of the nicest things about this approach is that it should be possible to make it work for different kinds of processors with a reasonably small amount of effort -- if the compiler can be made to produce unoptimised code, and is given a description of the semantics of each instruction sufficient to construct mutation operators, then the above optimisation approach can be used. Even if the performance isn't as good as that of the hand-crafted compilers, being able to use it on pretty much any new processor design -- and have it adapt automatically to new processors in the same family without needing to change the code at all -- would be extremely useful.
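For instance, a new target might need nothing more than a per-opcode table of read and write effects -- the fragment below is made up for illustration, not real IA32 semantics -- from which the triples the mutation operators work on can be built mechanically:

    # Hypothetical effects table for a few opcodes. Role names with no
    # matching operand ("flags") stand for implicit fixed resources.
    EFFECTS = {
        "mov": {"reads": ("src",),       "writes": ("dst",)},
        "add": {"reads": ("src", "dst"), "writes": ("dst", "flags")},
        "cmp": {"reads": ("src", "dst"), "writes": ("flags",)},
        "nop": {"reads": (),             "writes": ()},
    }

    def describe(opcode, operands):
        # Build the (opcode, reads, writes) triple the optimiser works
        # on, e.g. describe("add", {"src": "eax", "dst": "ebx"}).
        eff = EFFECTS[opcode]
        reads = frozenset(operands.get(r, r) for r in eff["reads"])
        writes = frozenset(operands.get(w, w) for w in eff["writes"])
        return (opcode, reads, writes)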
Of course, I haven't actually tried this, and it's hard to predict how effective it'd be in practice. It certainly seems like an interesting idea, though, and I'd be very interested to see whether it manages to discover optimisations that the author of a hand-crafted optimiser wouldn't think of, or processor behaviour that the CPU designers themselves weren't aware of...