Monthly Archives: December 2012

Lots of numbers from running cairo-traces on an i5-2500, which has a Sandybridge GT1 GPU (also known, more confusingly, as HD2000), and on a Radeon HD5770 discrete GPU.

Performance results

The key results are the geometric means over all the traces, compared to using a software rasteriser:

UXA (snb): 1.6x slower
glamor (snb): 2.7x slower
SNA (snb): 1.6x faster
GL (snb): 2.9x slower

EXA (r600g): 3.8x slower
glamor (r600g): 3.0x slower
GL (r600g): 2.7x slower

fglrx (xlib): 4.7x slower
fglrx (GL): 32.0x slower [not shown as it makes the graphs even more difficult to read]

All bar one of the acceleration methods are worse (in performance, power, latency, by any metric) than simply using the CPU and rendering directly within the client. Note also that software rasterisation is currently faster than trying to use the GPU through the OpenGL driver stack.
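
For readers who want to reproduce this kind of summary from their own trace timings, here is a minimal sketch of how a geometric mean of per-trace speedups could be computed and printed in the "N.Nx faster/slower" form used above. The function names and the sample ratios are hypothetical and are not taken from the measurements in this post.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def describe(speedup):
    """Format a speedup ratio (software time / driver time) as text."""
    if speedup >= 1.0:
        return f"{speedup:.1f}x faster"
    return f"{1.0 / speedup:.1f}x slower"


# Hypothetical per-trace speedups, for illustration only (not measured data).
example = [0.5, 2.0, 0.25, 1.0]
print(describe(geometric_mean(example)))
```

The geometric mean is the usual choice for averaging ratios because a trace that runs twice as fast and one that runs twice as slow cancel out, which an arithmetic mean would not give you.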

(All software, except for the xserver which was held back to keep glamor working, from git as of 20121228.)

The introduction of KMS and GEM into the i915 driver broke the i830/i845 chipsets, and a lot of hearts. But fear not! A decade after its introduction, we finally have a driver that is not only stable, but capable of accelerating firefox.

The problem?

The problem was, simply, we could not find a way to enable dynamic video memory on the ancient i830/i845 chipsets without it eventually eating garbage. Since dynamic memory management was the raison d’etre of GEM and critical for acceleration, it is a requirement of the current driver stack. The first cunning solution was simply never to reuse batch buffers, and keep a small amount of memory reserved for our usage. This stopped the command streamer from seeing the garbage, and my system has remained stable for many hours of thrashing. Daniel Vetter extended my solution to implement a kernel workaround whereby every batch would be copied into a reserved area before execution. In the end, we compromised so that I could avoid that extra copy and assume responsibility in the driver for ensuring the batch was coherent, but the kernel would intervene for any non-cooperative driver.

With these workarounds in place, we are finally able to run through the test suites. Which brought us to the next problem:

UXA vs software rasterisation on 845g

The sad fact is that UXA is inadequate for the challenge of accelerating the Render protocol.

If we compare with an architecture that was designed to accelerate cairo, SNA:

SNA vs software rasterisation on 845g

We find a much happier result. In all cases the performance is at least as good as using a software rasteriser in the X server, and often much better than if we avoided the Render protocol entirely and did the rasterisation in the client. With a little more tuning, we may be able to achieve parity even in the worst case – if we can win on an old GPU with an ancient CPU (single core, virtually no cache and even less memory bandwidth) we should be able to excel on more recent GPUs and CPUs, and be more efficient in the process.

Yet, the Render protocol is not the be-all-and-end-all of acceleration. We need to keep an eye on the basics as well, the copies, the fills and the uploads, to know if we are achieving our goals. The basic premise is that using the driver (and thus the GPU) is faster than just using the CPU for everything. (In reality, the choice is more complicated because we have to consider the efficacy of GPU offload for enabling the CPU to get on with other tasks and overall power efficiency.)

1: Baseline performance of Xvfb
2: SNA with acceleration disabled (shadow)
3: UXA
4: SNA

     1        2       3       4    Operation
--------  ------  ------  ------   ---------
277000.0    2.04    1.06    4.58   Char in 80-char aa line (Charter 10) 
265000.0    2.15    1.11    4.83   Char in 80-char rgb line (Charter 10) 
312000.0    0.66    0.15    1.38   Copy 10x10 from window to window 
  6740.0    0.90    1.56    1.75   Copy 100x100 from window to window 
   382.0    0.92    1.30    1.36   Copy 500x500 from window to window 
268000.0    0.74    0.17    1.50   Copy 10x10 from window to pixmap 
  7260.0    0.87    1.43    1.85   Copy 100x100 from window to pixmap 
   376.0    0.94    1.28    1.37   Copy 500x500 from window to pixmap 
154000.0    0.74    0.69    0.86   PutImage 10x10 square 
  1880.0    1.04    1.05    1.04   PutImage 100x100 square 
    87.1    1.03    1.02    1.01   PutImage 500x500 square 
308000.0    0.58    0.46    0.66   ShmPutImage 10x10 square 
  6500.0    1.02    1.16    1.24   ShmPutImage 100x100 square 
   380.0    1.00    1.24    1.28   ShmPutImage 500x500 square 

So it appears that using the GPU for basic operations such as moving windows about is at most a marginal win over using a shadow buffer (and often UXA fails at even that). Overall, then, it seems that enabling UXA should bring nothing but misery.
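
The comparison against the shadow buffer can be checked mechanically. The sketch below transcribes columns 2 to 4 of the table and counts the operations on which each driver falls behind shadow. It assumes those columns are ratios relative to column 1 (the Xvfb baseline), so higher is better; the variable names are only illustrative.

```python
# (operation, shadow, UXA, SNA), transcribed from columns 2-4 of the table.
# Assumption: each value is a ratio against the Xvfb baseline in column 1.
ROWS = [
    ("Char in 80-char aa line (Charter 10)", 2.04, 1.06, 4.58),
    ("Char in 80-char rgb line (Charter 10)", 2.15, 1.11, 4.83),
    ("Copy 10x10 from window to window", 0.66, 0.15, 1.38),
    ("Copy 100x100 from window to window", 0.90, 1.56, 1.75),
    ("Copy 500x500 from window to window", 0.92, 1.30, 1.36),
    ("Copy 10x10 from window to pixmap", 0.74, 0.17, 1.50),
    ("Copy 100x100 from window to pixmap", 0.87, 1.43, 1.85),
    ("Copy 500x500 from window to pixmap", 0.94, 1.28, 1.37),
    ("PutImage 10x10 square", 0.74, 0.69, 0.86),
    ("PutImage 100x100 square", 1.04, 1.05, 1.04),
    ("PutImage 500x500 square", 1.03, 1.02, 1.01),
    ("ShmPutImage 10x10 square", 0.58, 0.46, 0.66),
    ("ShmPutImage 100x100 square", 1.02, 1.16, 1.24),
    ("ShmPutImage 500x500 square", 1.00, 1.24, 1.28),
]

# Operations on which each driver is strictly slower than the shadow buffer.
uxa_behind_shadow = [op for op, shadow, uxa, _sna in ROWS if uxa < shadow]
sna_behind_shadow = [op for op, shadow, _uxa, sna in ROWS if sna < shadow]

print(f"UXA slower than shadow on {len(uxa_behind_shadow)} of {len(ROWS)} operations")
print(f"SNA slower than shadow on {len(sna_behind_shadow)} of {len(ROWS)} operations")
```

On these figures UXA trails the shadow buffer on half of the operations, which is the "fails at even that" case noted above.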