The introduction of KMS and GEM into the i915 driver broke the i830/i845 chipsets, and a lot of hearts. But fear not! A decade after the chipsets were introduced, we finally have a driver that is not only stable, but capable of accelerating Firefox.
The problem?
The problem was, simply, that we could not find a way to enable dynamic video memory on the ancient i830/i845 chipsets without the command streamer eventually eating garbage. Since dynamic memory management was the raison d'être of GEM and critical for acceleration, it is a requirement of the current driver stack. The first cunning solution was simply never to reuse batch buffers, and to keep a small amount of memory reserved for our use. This stopped the command streamer from seeing the garbage, and my system has remained stable for many hours of thrashing. Daniel Vetter extended my solution to implement a kernel workaround whereby every batch would be copied into a reserved area before execution. In the end, we compromised: I could avoid that extra copy and assume responsibility in the driver for ensuring the batch was coherent, but the kernel would intervene for any non-cooperative driver.
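To make the shape of that compromise concrete, here is a toy model in Python. It is only a sketch of the policy described above, not the kernel code: the class names, the `driver_guarantees_coherency` flag and the 4096-byte reserved area are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """A batch of GPU commands (hypothetical stand-in for a batch buffer)."""
    commands: bytes
    # Hypothetical opt-out: the driver promises the batch is already coherent.
    driver_guarantees_coherency: bool = False


class ToyKernel:
    """Toy model of the workaround: copy batches into a reserved area."""

    def __init__(self, reserved_size: int = 4096):  # size is an assumption
        self.reserved = bytearray(reserved_size)
        self.copies = 0  # how many times the kernel had to intervene

    def submit(self, batch: Batch) -> bytes:
        """Return the bytes the command streamer would actually read."""
        if batch.driver_guarantees_coherency:
            # Cooperative driver: execute in place, no extra copy.
            return bytes(batch.commands)
        if len(batch.commands) > len(self.reserved):
            raise ValueError("batch does not fit in the reserved area")
        # Non-cooperative driver: copy into the reserved area first.
        n = len(batch.commands)
        self.reserved[:n] = batch.commands
        self.copies += 1
        return bytes(self.reserved[:n])


if __name__ == "__main__":
    kernel = ToyKernel()
    kernel.submit(Batch(b"\x01\x02\x03"))
    kernel.submit(Batch(b"\x04\x05", driver_guarantees_coherency=True))
    print("kernel copies performed:", kernel.copies)
```

The point of the sketch is only the branch in `submit`: a cooperating driver pays nothing, and everything else takes the copy.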
With these workarounds in place, we are finally able to run through the test suites. Which brought us to the next problem:

The sad fact is that UXA is inadequate for the challenge of accelerating the Render protocol.
If we compare with an architecture that was designed to accelerate cairo, SNA:

We find a much happier result. In all cases the performance is at least as good as using a software rasteriser in the X server, and often much better than if we avoided the Render protocol entirely and did the rasterisation in the client. With a little more tuning, we may be able to achieve parity even in the worst case – if we can win on an old GPU with an ancient CPU (single core, virtually no cache and even less memory bandwidth) we should be able to excel on more recent GPUs and CPUs, and be more efficient in the process.
Yet, the Render protocol is not the be-all and end-all of acceleration. We need to keep an eye on the basics as well, the copies, the fills and the uploads, to know if we are achieving our goals. The basic premise is that using the driver (and thus the GPU) is faster than just using the CPU for everything. (In reality, the choice is more complicated because we have to consider the efficacy of GPU offload for enabling the CPU to get on with other tasks and overall power efficiency.)
1: Baseline performance of Xvfb (absolute rate; columns 2 to 4 are ratios relative to it)
2: SNA with acceleration disabled (shadow)
3: UXA
4: SNA
       1       2       3       4  Operation
--------  ------  ------  ------  ---------
277000.0    2.04    1.06    4.58  Char in 80-char aa line (Charter 10)
265000.0    2.15    1.11    4.83  Char in 80-char rgb line (Charter 10)
312000.0    0.66    0.15    1.38  Copy 10x10 from window to window
  6740.0    0.90    1.56    1.75  Copy 100x100 from window to window
   382.0    0.92    1.30    1.36  Copy 500x500 from window to window
268000.0    0.74    0.17    1.50  Copy 10x10 from window to pixmap
  7260.0    0.87    1.43    1.85  Copy 100x100 from window to pixmap
   376.0    0.94    1.28    1.37  Copy 500x500 from window to pixmap
154000.0    0.74    0.69    0.86  PutImage 10x10 square
  1880.0    1.04    1.05    1.04  PutImage 100x100 square
    87.1    1.03    1.02    1.01  PutImage 500x500 square
308000.0    0.58    0.46    0.66  ShmPutImage 10x10 square
  6500.0    1.02    1.16    1.24  ShmPutImage 100x100 square
   380.0    1.00    1.24    1.28  ShmPutImage 500x500 square
So it appears that using the GPU for basic operations such as moving windows about is at most a marginal win over using a shadow buffer (and UXA often fails at even that). Overall, then, it seems that enabling UXA should bring nothing but misery.
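For anyone who wants to check that reading of the table, a small Python sketch follows. It assumes that columns 2 to 4 are ratios against the Xvfb baseline in column 1, and it divides the UXA and SNA figures by the shadow figure to see where the GPU path falls behind. The function names are mine; the numbers are copied from the table above.

```python
# (shadow, UXA, SNA) ratios copied from the table; assumed relative to Xvfb.
RESULTS = {
    "Char in 80-char aa line (Charter 10)": (2.04, 1.06, 4.58),
    "Char in 80-char rgb line (Charter 10)": (2.15, 1.11, 4.83),
    "Copy 10x10 from window to window": (0.66, 0.15, 1.38),
    "Copy 100x100 from window to window": (0.90, 1.56, 1.75),
    "Copy 500x500 from window to window": (0.92, 1.30, 1.36),
    "Copy 10x10 from window to pixmap": (0.74, 0.17, 1.50),
    "Copy 100x100 from window to pixmap": (0.87, 1.43, 1.85),
    "Copy 500x500 from window to pixmap": (0.94, 1.28, 1.37),
    "PutImage 10x10 square": (0.74, 0.69, 0.86),
    "PutImage 100x100 square": (1.04, 1.05, 1.04),
    "PutImage 500x500 square": (1.03, 1.02, 1.01),
    "ShmPutImage 10x10 square": (0.58, 0.46, 0.66),
    "ShmPutImage 100x100 square": (1.02, 1.16, 1.24),
    "ShmPutImage 500x500 square": (1.00, 1.24, 1.28),
}


def relative_to_shadow(results):
    """Map each operation to (UXA / shadow, SNA / shadow)."""
    return {op: (uxa / shadow, sna / shadow)
            for op, (shadow, uxa, sna) in results.items()}


def slower_than_shadow(results, column):
    """List operations where the given column (1 = UXA, 2 = SNA) loses to shadow."""
    return [op for op, row in results.items() if row[column] < row[0]]


if __name__ == "__main__":
    for op, (uxa, sna) in relative_to_shadow(RESULTS).items():
        print(f"{uxa:6.2f} {sna:6.2f}  {op}")
    print("UXA slower than shadow:", len(slower_than_shadow(RESULTS, 1)))
    print("SNA slower than shadow:", len(slower_than_shadow(RESULTS, 2)))
```

A value below 1.00 in either printed column means the accelerated path was slower than the shadow buffer for that operation.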