Superoptimization was introduced by H. Massalin in his paper Superoptimizer: a look at the smallest program2 back in 1987. The idea is that for every sequence of instructions for a given computer architecture and a metric, there is an infinite number of sequences which perform the same task (are semantically equivalent to the original sequence), but one of them is the best one with respect to the metric.
Take for example the instruction sequence in RV32I:
1 2 3 4
addi x1, zero, 1 mv x2, zero add x2, x2, x1 add x2, x2, x1
This sets the register
x2 to 2. If we choose a metric that orders programs by increasing order of size, then the best possible program is:
addi x2, zero, 2
Now take another example in RV32IM (which includes the multiplication extension):
mul x1, x1, 2
This doubles the value in register
x1. Given the same metric, it is not possible to find a better program. However, if we change our metric to the number of clock cycles taken by the whole sequence of instructions and our micro-architecture takes 8 cycles for multiplications and 2 cycles for shifting then there’s a better program:
slli x1, x1, 1
Which shifts register
x1 by 1 bit to the left, in essence multiplying its value by two and taking only 2 cycles whereas the original example took 8 (under our micro-architectural assumptions).
For long sequence of instructions the state space that requires searching for the best sequence is enormous, which makes superoptimization a CPU intensive program. Research on superoptimization slowed down after the initial paper but has seen a resurgence in the last decade.
Will superoptimizers replace your compiler? No!
Production compilers are highly complex pieces of software that generate very good code, under very tight time constraints. On the other hand, superoptimizers are very slow since they need to search very large state spaces and therefore can take minutes, hours or even days to find the best possible instruction sequence.
The placement of the superoptimizer in the development toolchain is up to the user but, in general, it will be placed after the compilation step picking up where the compiler left off and trying to optimize the code it generated.
S10 was created out of the resurgence in superoptimization research mentioned earlier. Mangpo Phothilimthana from UC Berkeley published in 2016 a paper3 and technical report on Greenthumb. The initial prototype of S10 was built upon Greenthumb and once we verified the potential we decided to turn it into a commercial product — where at this point about 15% of code comes from Mangpo’s work.
Since most of the code was rewritten, including the interface between the backend architectures and the main modules, the architectures supported by Greenthumb are not supported by S10, namely arm, llvm ir, and greenarrays. We have therefore chosen to implement support initially for RISC-V, with other architectures on the roadmap.
S10 is written completely in Racket and implements three distinct algorithms that cooperate in a distributed setting. A symbolic3, stochastic1 and lens4 algorithms are distributed over several cores (or different machines) and communicate with each other through message passing finding incrementally better instruction sequences until it’s proved that the best sequence has already been found.
S10 includes a statistical profiler and debugging aids to ensure that any error is found quickly and any hotspots are first to be optimized. Due to the statistical nature of some methods it is also possible to automatically tune these parameters on an architecture independent way to ensure the best possible performance from the framework.
Another area we focused on, given the long times taken to find solutions to the optimization problem was partial caching. S10 is supported by a database which uses previous found solutions to the superoptimization problem to simplify future searches.
Superoptimization in practice
In practice waiting for the perfect solution is not always possible or realistic, that is why S10 will incrementally provide incrementally better instruction sequences until it proves there is none better.
An interaction with S10 for the first example I mentioned in this post might look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
$ riscv-s10 -j 8 --timeout 60 --verbose --progress add2.s S10 Superoptimization Framework, by Linki Tools UG https://linki.tools/s10 Booting up 8 cores (stochastic 3, lens 4, symbolic 1) Input score: 16 * stochastic found optimization (score: 12) mv x2, zero addi x2, x2, 1 addi x2, x2, 1 * lens found optimization (score: 4) addi x2, zero, 2 * symbolic found best addi x2, zero, 2 Shutting down cores Caching solution: addi x2, zero, 2
The superoptimizer will incrementally give information (due to the
--progress flag) about optimizations found until we can prove there’s no better solution, or a timeout is reached (which in this case was set to 60 seconds with
The metric for the previous example was instruction sequence size, but had our metric been performance the situation would have been similar with one exception. The metric has to be fast to execute because it is executed potentially millions of times. If a cycle accurate simulation takes a long time, then it is better to implement a fast heuristic and request the best few solutions found and run an accurate metric on those.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
$ riscv-s10 -j 8 --timeout 20 --verbose --progress \ --metric speed --accurate-metric-script sim.sh --run-metric-on 2 add2.s S10 Superoptimization Framework, by Linki Tools UG https://linki.tools/s10 Booting up 8 cores (stochastic 3, lens 4, symbolic 1) Input score: 16 * stochastic found optimization (score: 12) mv x2, zero addi x2, x2, 1 addi x2, x2, 1 * lens found optimization (score: 4) addi x2, zero, 2 Timeout reached. Shutting down cores Best two solutions:  mv x2, zero addi x2, x2, 1 addi x2, x2, 1  addi x2, zero, 2 Running sim.sh on ... Done Running sim.sh on ... Done Best solution (sim.sh):  addi x2, zero, 2 Caching solution: addi x2, zero, 2
S10 in this case ran against a tight timeout and at that point since it was given a script to calculate an accurate metric (with
--accurate-metric-script) it called it on the best two optimizations found at that point (due to
--run-metric-on). The metric scored solution
 as better and therefore that was the one which was returned and cached.
This is a simple example of what S10 can do. I have not mentioned how to extend the algorithms and internal structures, how to extend S10 to a new architecture or autotune it for better performance. I hope to touch those points on future posts.
Linki Tools is the first company to offer a commercial product for Superoptimization. Join us in superoptimizing your software! Contact us for more information.