Category Archives: 2D

So I have a new toy, an i7-4950hq processor. This little beast is one of the special Intel chips sporting an Iris Pro 5200, better known as Haswell GT3e. That GPU has 40 execution units and 128MiB of eDRAM to serve as a fourth-level cache for both the CPU and GPU.

Enough spiel, just how fast is it?

For context, here are some results comparing it with my old Sandybridge laptop (with an i5-2520m).

Comparing the processor using the single-threaded cairo-image:

Comparison of i7-4950hq to i5-2520m

and again comparing the GPUs, using SNA and cairo-xlib:

Comparison of i7-4950hq to i5-2520m

On the whole, we see a two-fold increase in both single-threaded CPU performance and GPU performance (for 2D graphics using cairo) in the jump from a Sandybridge i5-2520m to a Haswell i7-4950hq. In most cases SNA is limited by how fast the application can feed it commands, and so the performance increase is mostly due to the same improvement in CPU speed. (This increase is above and beyond the expected improvements due to IPC, so it is more likely due to the ability of the Haswell chip to turbo higher and longer thanks to improved thermals and cooling.)

And we can compare the relative merits of using OpenGL and a specialised 2D driver by comparing the various rendering backends available for the DDX. The results are normalized to the cairo-image results, and we have

  • none – a multithreaded CPU renderer inside the DDX
  • blt – disable the render acceleration, but allow the DDX to use the BLT engine to move data about i.e. copies and fills
  • sna – SNA render acceleration, default in xf86-video-intel-3.0
  • uxa – UXA render acceleration, current default
  • glamor – Glamor render acceleration, uses OpenGL to offload rendering operations onto the GPU

Comparison of DDX backends on an i5-2520m

Comparison of DDX backends on an i7-4950hq

The summary here is that Glamor offers a meagre improvement over UXA. However, both are still much slower on average than cairo-image, i.e. the performance attainable by using a single CPU core. It takes multiple threads inside the DDX to match the performance of cairo-image – this is due to the inherent inefficiencies of the current Render protocol. However, if we then utilize the render acceleration on the GPU (using SNA) we can indeed outperform cairo-image: on average SNA is about 2x faster than cairo-image and about 4x faster than UXA and Glamor. Thus SNA does deliver hardware acceleration that succeeds in offloading work onto the GPU (letting the CPU get on with other tasks) and performs faster than rendering everything with the CPU.


Søren Sandmann Pedersen, maintainer of the pixman library that is at the heart of the software rasteriser in Cairo and X, argues that, all things being equal, an integrated graphics processor should never be able to outperform the main processor for rendering Cairo. Ideally, when using either the GPU or CPU, we want memory to be the rate-limiting component. On the Sandybridge processor, the memory is attached to the system agent, along with the large L3 caches, and is shared between the CPU and GPU. So he argues that any performance advantage cairo-xlib might show over cairo-image whilst using Intel graphics is really an opportunity for improvement of the rasteriser and pixman kernels. (But also bear in mind that the GPU should be considerably more efficient and be able to saturate the memory bus at a fraction of the power cost of the CPU.) Using SSE2, we are indeed able to fully saturate the bus to L1, L2 and main memory for a variety of common operations in pixman. However, the bilinear scaling kernels take longer to compute their results than to write them to memory, and so fail to completely utilize the available bandwidth. Furthermore, for a slightly more complex transformation than a simple scale and translation, we do not yet have a fast SSE2 kernel that operates directly on the destination and so must incur some extra work. If we ignore, for the time being, making further optimisations to those kernels to improve performance, we can instead ask the question: if using one SSE2 unit is not enough to saturate the memory bus, what happens if we throw more cores at the problem?

To gain the most improvement from adding threads to cairo, you need to design a rasteriser and usage model with threading in mind. One such design is the vector renderer O by Øyvind Kolås. Despite being an experiment, it does show quite a bit of promise, but in its raw form just throwing threads at the problem does not beat using the SIMD compositing routines provided by pixman. However, it did raise the question of whether we can improve the existing image backend without affecting its immediate-mode nature, so that it could be used by existing applications without alteration. To preserve the existing semantics, we can break up the individual composite and scan conversion operations into small pieces and feed those to a pool of threads, and then wait for the threads to complete before returning back to the application. As such, we never run the threads for very long, and we risk the overhead of thread management outweighing any benefit from splitting the operation over multiple cores.
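The split, dispatch and wait pattern described above can be sketched as follows. This is only a structural illustration in Python, not the prototype cairo-image code: the names `split_bands`, `fill_rows` and `threaded_fill`, the band-splitting policy and the `min_rows` cut-off are all assumptions made for the example. (Because of CPython's global interpreter lock this will not actually run faster; it only shows how an immediate-mode call can farm out pieces and still return synchronously.)

```python
from concurrent.futures import ThreadPoolExecutor


def split_bands(y0, y1, n):
    """Split the row range [y0, y1) into at most n contiguous bands."""
    height = y1 - y0
    n = max(1, min(n, height))
    step, extra = divmod(height, n)
    bands, start = [], y0
    for i in range(n):
        end = start + step + (1 if i < extra else 0)
        bands.append((start, end))
        start = end
    return bands


def fill_rows(surface, x0, x1, y0, y1, value):
    """The per-thread piece: fill one horizontal band of the rectangle."""
    for y in range(y0, y1):
        surface[y][x0:x1] = [value] * (x1 - x0)


def threaded_fill(pool, surface, rect, value, n_threads=4, min_rows=8):
    """Fill rect = (x0, y0, x1, y1), returning only once every band is done."""
    x0, y0, x1, y1 = rect
    if y1 - y0 < min_rows:
        # Too small to be worth the dispatch overhead: do it inline.
        fill_rows(surface, x0, x1, y0, y1, value)
        return
    futures = [pool.submit(fill_rows, surface, x0, x1, a, b, value)
               for a, b in split_bands(y0, y1, n_threads)]
    for future in futures:
        future.result()  # wait here, and re-raise any worker exception


# Hypothetical 16x32 single-channel surface, stored as a list of rows.
surface = [[0] * 16 for _ in range(32)]
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded_fill(pool, surface, (2, 4, 10, 28), 0xFF)
```

The inline path for small rectangles is one way to limit the risk mentioned above, that thread management costs more than the split saves.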

To gain the greatest advantage from adding threads, I used a desktop Sandybridge processor (an i5-2500) for benchmarking this prototype cairo-image backend:

Comparison of multiple threads

About half the test cases regress by about 10%, but around one third are improved by 2-3x. Not bad – the regressions are unacceptable, but it does offer a tantalising suggestion that we can make some significant improvements with only a minor change. (Remember not all processors are equal, and thanks to Sandybridge’s turbo some cores are more equal than others. I tried running the test cases on a Sandybridge ultrabook, and as soon as more than one core was active, the entire package was throttled to keep it within its thermal constraints. The performance regression was dire.)

Having tuned the software backend to make better use of the average resources, what can we now say about the relative merits of the integrated graphics processor?

Comparison of multiple threads and SNA

Indeed, for the cases that are almost entirely GPU bound (for example the firefox-fishbowl, -fishtank, -paintball, -particles), we have virtually eliminated all the previous advantage that the GPU held. In a notable couple of cases, we have improved the image backend to outperform SNA, and for all cases now the threaded image backend beats UXA. However, as can be seen there is still plenty of room for improvement of the image backend, and we can’t let the hardware acceleration be merely equal to a software rasteriser…

Lots of numbers from running cairo-traces on an i5-2500, which has a Sandybridge GT1 GPU (more confusingly called HD2000), and also a Radeon HD5770 discrete GPU.

Performance results

The key results being the geometric mean of all the traces as compared to using a software rasteriser:

UXA (snb): 1.6x slower
glamor (snb): 2.7x slower
SNA (snb): 1.6x faster
GL (snb): 2.9x slower

EXA (r600g): 3.8x slower
glamor (r600g): 3.0x slower
GL (r600g): 2.7x slower

fglrx (xlib): 4.7x slower
fglrx (GL): 32.0x slower [not shown as it makes the graphs even more difficult to read]

All bar one of the acceleration methods are worse (performance, power, latency, by any metric) than simply using the CPU and rendering directly within the client. Note also that software rasterisation is currently more performant than trying to use the GPU through the OpenGL driver stack.
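For reference, a geometric mean of per-trace results like the figures quoted above can be computed along these lines. This is a minimal sketch: the helper names and the example speedups are hypothetical, not the measured data. The geometric mean is used for ratios because it treats "2x faster" and "2x slower" symmetrically, whereas an arithmetic mean would be dominated by a few large outliers.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of strictly positive ratios."""
    return math.exp(math.fsum(math.log(r) for r in ratios) / len(ratios))


def describe(speedup):
    """Format a speedup (> 1 means faster than the baseline)."""
    if speedup >= 1.0:
        return f"{speedup:.1f}x faster"
    return f"{1.0 / speedup:.1f}x slower"


# Hypothetical per-trace speedups (software time / backend time),
# NOT the measured results from the post.
example = [0.5, 2.0, 0.25, 1.0]
summary = describe(geometric_mean(example))
print(summary)  # prints "1.4x slower"
```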

(All software, except for the xserver which was held back to keep glamor working, from git as of 20121228.)

The introduction of KMS and GEM into the i915 driver broke the i830/i845 chipsets, and a lot of hearts. But fear not! A decade after its introduction, we finally have a driver that is not only stable, but capable of accelerating firefox.

The problem?

The problem was, simply, we could not find a way to enable dynamic video memory on the ancient i830/i845 chipsets without it eventually eating garbage. Since dynamic memory management was the raison d’etre of GEM and critical for acceleration, it is a requirement of the current driver stack. The first cunning solution was simply never to reuse batch buffers, and keep a small amount of memory reserved for our usage. This stopped the command streamer from seeing the garbage, and my system has remained stable for many hours of thrashing. Daniel Vetter extended my solution to implement a kernel workaround whereby every batch would be copied into a reserved area before execution. In the end, we compromised so that I could avoid that extra copy and assume responsibility in the driver for ensuring the batch was coherent, but the kernel would intervene for any non-cooperative driver.

With these workarounds in place, we are finally able to run through the test suites. Which brought us to the next problem:

UXA vs software rasterisation on 845g

The sad fact is that UXA is inadequate for the challenge of accelerating the Render protocol.

If we compare with an architecture that was designed to accelerate cairo, SNA:

SNA vs software rasterisation on 845g

We find a much happier result. In all cases the performance is at least as good as using a software rasteriser in the X server, and often much better than if we avoided the Render protocol entirely and did the rasterisation in the client. With a little more tuning, we may be able to achieve parity even in the worst case – if we can win on an old GPU with an ancient CPU (single core, virtually no cache and even less memory bandwidth) we should be able to excel on more recent GPUs and CPUs, and be more efficient in the process.

Yet, the Render protocol is not the be-all-and-end-all of acceleration. We need to keep an eye on the basics as well, the copies, the fills and the uploads, to know if we are achieving our goals. The basic premise is that using the driver (and thus the GPU) is faster than just using the CPU for everything. (In reality, the choice is more complicated because we have to consider the efficacy of GPU offload for enabling the CPU to get on with other tasks and overall power efficiency.)

1: Baseline performance of Xvfb
2: SNA with acceleration disabled (shadow)
3: UXA
4: SNA

     1        2       3       4    Operation
--------  ------  ------  ------   ---------
277000.0    2.04    1.06    4.58   Char in 80-char aa line (Charter 10) 
265000.0    2.15    1.11    4.83   Char in 80-char rgb line (Charter 10) 
312000.0    0.66    0.15    1.38   Copy 10x10 from window to window 
  6740.0    0.90    1.56    1.75   Copy 100x100 from window to window 
   382.0    0.92    1.30    1.36   Copy 500x500 from window to window 
268000.0    0.74    0.17    1.50   Copy 10x10 from window to pixmap 
  7260.0    0.87    1.43    1.85   Copy 100x100 from window to pixmap 
   376.0    0.94    1.28    1.37   Copy 500x500 from window to pixmap 
154000.0    0.74    0.69    0.86   PutImage 10x10 square 
  1880.0    1.04    1.05    1.04   PutImage 100x100 square 
    87.1    1.03    1.02    1.01   PutImage 500x500 square 
308000.0    0.58    0.46    0.66   ShmPutImage 10x10 square 
  6500.0    1.02    1.16    1.24   ShmPutImage 100x100 square 
   380.0    1.00    1.24    1.28   ShmPutImage 500x500 square 

So it appears that using the GPU for basic operations such as moving the windows about is only at most a marginal win over using a shadow buffer (and often times UXA fails at even that). Overall then it seems that enabling UXA should bring nothing but misery.
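To read the table above in absolute terms, each ratio can be multiplied by the baseline column. The sketch below assumes that column 1 is the baseline rate in operations per second and that columns 2 to 4 are ratios relative to it; the helper names are made up for the example.

```python
def parse_row(line, n_ratio_columns=3):
    """Split one table row into (baseline rate, ratios, operation label)."""
    fields = line.split()
    base = float(fields[0])
    ratios = [float(f) for f in fields[1:1 + n_ratio_columns]]
    label = " ".join(fields[1 + n_ratio_columns:])
    return base, ratios, label


def absolute_rates(line):
    """Return the label and the rate for every column, baseline first."""
    base, ratios, label = parse_row(line)
    return label, [base] + [base * r for r in ratios]


row = "312000.0    0.66    0.15    1.38   Copy 10x10 from window to window"
label, rates = absolute_rates(row)
# rates is approximately [312000, 205920, 46800, 430560]
# for Xvfb, SNA (shadow), UXA and SNA respectively.
```

Under that reading, UXA manages only about 46,800 small window copies per second against 312,000 for the unaccelerated baseline.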

Last week saw a new release of glamor, a library for use by the 2D display drivers translating Render drawing commands into OpenGL (and then feeding them into a 3D driver). This release packs in a couple of standout features for cairo: trapezoid shaders and the ability to handle textures larger than supported by hardware. The development emphasis has been on performance, and indeed glamor-0.5 is much improved over glamor-0.4 as measured on Intel’s latest and greatest IvyBridge architecture.

Performance improvement of glamor 0.5 over 0.4

This is a graph of absolute time for each trace, shorter bars are quicker, and as can be seen the improvement is pretty good.

However, to keep this in perspective, let's compare against the standard 2D driver:
Performance improvement of glamor 0.5 over 0.4

Still plenty of room for improvement.

Having looked at the impact of the move from XAA to UMS/EXA and then to KMS/UXA on performance of the core drawing operations, we can turn to look at the impact upon RENDER acceleration. One of the arguments for the reason behind dropping XAA support and writing a new architecture was to address the needs of accelerating the new advanced rendering protocol, RENDER. Did the claim really live up to reality, did the switch to EXA actually help?

Comparison of acceleration architectures on 965gm

In short, the switch to EXA was catastrophic in the beginning. The new acceleration architecture regressed performance in the core rendering operations and failed to deliver on the claim of improving RENDER acceleration. Fast forward a few years and through improvements to the RENDER protocol and many improvements to the driver and we reach UXA, where we start to actually see some benefit from enabling GPU acceleration. But it is not until we look at SNA that we reach a level of maturity and consistency in the driver to have all round performance that is finally at least as good as XAA again (effectively software rendering in these benchmarks).

An outstanding question that is regularly asked is what are the design differences between EXA, UXA and SNA?

UXA was originally EXA with the pixmap migration removed. It was argued that given a unified memory architecture such as found on IGP, there was no need to migrate pixmaps between the various GPU memory domains. Instead all pixmap allocations were to be made in the single GPU domain and if necessary the pixmap would be mapped and read by the CPU through the GTT (effectively an uncached readback, very very slow).

In hindsight, that decision was flawed. As it turns out, not only do we have both mappable and unmappable memory domains within the IGP, so that we cannot simply map any GPU pixmap without cost, but we can also have snoopable GPU memory (memory that can exist in the CPU cache). The single GPU memory domain argument was a fallacy from the start, and even more so with the advent of the shared last-level cache between the CPU and GPU. Also, it tends to be much, much faster to copy from the GPU pixmap into system memory, perform the operation on the copy in system memory and then copy it back, than it is to try to perform the operation through a GTT mapping of a GPU pixmap (more so if you take care to amortize the cost of the migration). Not to mention the asymmetry between upload and download speeds, and that we can exploit snoopable GPU memory to accelerate those transfers. So it turns out that having a very efficient pixmap migration strategy is the core of an acceleration architecture. In this regard SNA is very much like EXA.
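The amortisation argument can be made concrete with a toy cost model. Everything here is an assumption for illustration: the per-pixel costs are arbitrary units rather than measurements, and the function names are invented. The point is only that a one-off download and upload is repaid once enough CPU operations hit the same pixmap.

```python
import math


def cost_via_mapping(n_ops, pixels, uncached_cost):
    """Every operation touches the pixmap through the slow uncached mapping."""
    return n_ops * pixels * uncached_cost


def cost_via_migration(n_ops, pixels, cached_cost, download_cost, upload_cost):
    """Copy out once, operate in cached system memory, copy back once."""
    return pixels * (download_cost + upload_cost) + n_ops * pixels * cached_cost


def break_even_ops(cached_cost, uncached_cost, download_cost, upload_cost):
    """Smallest number of operations for which migrating is strictly cheaper.

    Returns None if the cached path is no cheaper per operation.
    """
    saving = uncached_cost - cached_cost
    if saving <= 0:
        return None
    return math.floor((download_cost + upload_cost) / saving) + 1


# Hypothetical per-pixel costs; download is deliberately pricier than
# upload to reflect the asymmetry mentioned in the text.
n = break_even_ops(cached_cost=1, uncached_cost=20, download_cost=30, upload_cost=10)
print(n)  # prints 3
```

With these made-up costs, two operations are still cheaper through the mapping, while three or more favour migrating the pixmap.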

From the design point of view, the only real difference between EXA and SNA is that EXA is a mere midlayer whose existence is to hinder the driver from doing the right thing, and SNA is a complete implementation. Since the devil is in the details (fallbacks are slow, don’t!), that difference turns out to be quite huge.

A few years ago, memory management for the graphics card was performed as a single static allocation in the X server, which was then carved up into surfaces and used for the scanout and important pixmaps such as renderbuffers for its DRI clients. The upside of this simple scheme was that all locations were known, allocation was very fast and we could always tell the GPU where its surfaces were. The downside was that the amount of video memory was therefore predetermined and could not be resized, not even if you added a second monitor and needed a new framebuffer, or if you were running a game and it wanted lots of textures. So X was relieved of its role in memory management and the task given to the kernel under the guise of the Graphics Execution Manager (GEM). Now userspace has no idea where its surfaces are and so needs to ask the kernel to patch up its command buffers to insert the correct addresses. This relocation of command buffers is a bottleneck in the new design, and from the outset people were complaining about the performance loss going from XAA to GEM/UXA.
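The relocation step can be pictured with a deliberately simplified model. This is not the real i915 execbuffer interface: the `Relocation` fields, the `relocate` function and the example addresses are all hypothetical. Userspace records which slots of the command buffer refer to which buffer objects, and the kernel fills in the addresses once it has decided where each object lives.

```python
from dataclasses import dataclass


@dataclass
class Relocation:
    offset: int  # index of the command word that needs an address
    handle: int  # buffer object that the command refers to
    delta: int   # byte offset inside that buffer object


def relocate(batch, relocations, placement):
    """Return a copy of batch with every relocation slot patched.

    placement maps a buffer handle to the address chosen for it at
    submission time, which userspace cannot know in advance.
    """
    patched = list(batch)
    for reloc in relocations:
        patched[reloc.offset] = placement[reloc.handle] + reloc.delta
    return patched


# Two made-up commands, each followed by a placeholder address word.
batch = [0x10000001, 0, 0x20000002, 0]
relocations = [Relocation(offset=1, handle=7, delta=0),
               Relocation(offset=3, handle=9, delta=64)]
placement = {7: 0x10000, 9: 0x40000}
patched = relocate(batch, relocations, placement)
```

Every submission pays for this walk over the relocation list, which is why it shows up as overhead compared with the old scheme of fixed, known addresses.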

A few years have passed and we’ve been gradually tuning all the parties, but the question remains: have we managed to recover the speed that we threw away so long ago?

I compiled Xorg-1.5 and xf86-video-intel-2.6 for my 965gm (a ThinkPad t61) by discarding anything that no longer compiled and was left with a very light shell for investigating XAA in its heyday. I then ran x11perf under the ancient XAA and EXA, and the modern UXA and SNA to see how things had changed.

By looking at the geometric mean of all the x11perf tests we can get a rough feel for the overall performance change against XAA:

EXA 1.117 (12% faster than XAA)

UXA -1.108 (11% slower than XAA)

SNA 2.284 (128% faster than XAA)

As with all averages of micro-benchmarks take this with an extremely large pinch of salt.
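The three figures above appear to use a signed-ratio convention: a positive value is how many times faster than XAA, a negative value how many times slower. That reading is an assumption inferred from the numbers and their annotations, and `describe_relative` is a made-up helper, but under it the percentages fall out as follows.

```python
def describe_relative(signed_ratio, baseline="XAA"):
    """Turn a signed ratio into the 'N% faster/slower' wording."""
    if signed_ratio >= 0:
        percent = (signed_ratio - 1.0) * 100.0
        direction = "faster"
    else:
        percent = (-signed_ratio - 1.0) * 100.0
        direction = "slower"
    return f"{round(percent)}% {direction} than {baseline}"


for name, value in [("EXA", 1.117), ("UXA", -1.108), ("SNA", 2.284)]:
    print(name, describe_relative(value))
# EXA 12% faster than XAA
# UXA 11% slower than XAA
# SNA 128% faster than XAA
```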

So Michel Dänzer just pushed some patches to enable using glamor from within the xf86-video-ati driver. As a quick recap, glamor is a generic Xorg driver library that translates the 2D requests used to render X into OpenGL commands. The argument being then that the driver teams need only concentrate on bringing up the OpenGL stack and gain a functioning display server in the process. The counter argument is that this compromise in saving engineering time penalises the performance of the display server.

To highlight that last point, we can look at the performance of the intel driver with and without glamor, and rendering directly with OpenGL:

glamor on SandyBridge

The centre baseline is the performance of simply using the CPU and pixman to render, above that we are faster and below slower. The first bar is the performance of using OpenGL directly, in theory this should provide the best performance of all, only being limited by hardware. Sadly, the graph shows the stark reality that undermines using glamor – one needs an OpenGL driver that has been optimized for 2D usage in order to maximise GPU performance with the Xorg workload. Note the areas where glamor does better than the direct usage in cairo-gl? This is where glamor itself attempts to mitigate against poor buffer management in the driver.

Enter a different GPU and a different driver. The whole balance of CPU to GPU power shifts along with the engineering focus. Everything changes.

Taking a look at the same workloads on the same computer, but using the discrete Radeon HD5770 rather than the integrated processor graphics:

glamor on Radeon HD5770

Perhaps the first thing that we notice is the raw power of the discrete graphics as exposed by using OpenGL directly from within cairo. Secondly, we notice the lacklustre performance of the existing EXA driver for the Radeon chipset – remember everything below the line implies that the GPU driver in Xorg is behaving worse than could be achieved just through client-side software rendering, that using RENDER acceleration is nothing of the sort. And then our attention turns to the newcomer, glamor on radeon. It is still notably much slower than both the CPU and using OpenGL directly. However, it is of very similar performance to the existing EXA driver, sometimes slower, sometimes faster (if you look at the relative x11perf, then it reveals some areas where the EXA driver could make major improvements).

glamor on Radeon HD5770

Not bad for the first patch with an immature library, and demonstrates that glamor can be used to reduce the development cost of bringing up a new chipset – yet does not reach the full potential of the system. Judging by the last graph, one does wonder whether glamor is even preferable to using xf86-video-modesetting in such cases, on a high performance multicore system, for the time being, at least. 😉

The idea of glamor is to leverage an existing OpenGL driver in order to implement the 2D acceleration required for an X server, so that in the future one need only create an OpenGL driver. But how well does the existing mesa/i965 driver handle the task of accelerating glamor for cairo – a task best suited for the 3D pipeline and pixel shaders of the GPU? With the first patches to implement accelerated trapezoids for glamor, I decided to find out.

I ran the usual cairo trace benchmarks, including a couple of new ones to highlight areas of poor performance that have come to our attention. These measure the throughput of various applications that use cairo against the glamor (with and without the trapezoid shader enabled), SNA and UXA DDX backends on a small selection of Core processors, from a lowly i3 Arrandale to the thoroughbred i7 IvyBridge. The results were then normalized to the performance of UXA on each system, so that we can directly compare the performance of each proposed backend to the current driver. A result above the centre-line means that the driver was faster than UXA, below slower.
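The normalisation described above amounts to dividing the baseline's elapsed time by each backend's elapsed time, trace by trace. A minimal sketch, in which the trace names and timings are invented purely for illustration and `normalise` is a hypothetical helper:

```python
def normalise(times, baseline="uxa"):
    """Convert elapsed times into performance relative to the baseline.

    times maps trace name -> {backend name: elapsed seconds}.
    A result above 1.0 means faster than the baseline, below 1.0 slower.
    """
    result = {}
    for trace, per_backend in times.items():
        base = per_backend[baseline]
        result[trace] = {backend: base / elapsed
                         for backend, elapsed in per_backend.items()}
    return result


# Invented timings in seconds, NOT measurements from the benchmarks.
times = {
    "trace-a": {"uxa": 10.0, "sna": 4.0, "glamor": 20.0},
    "trace-b": {"uxa": 8.0, "sna": 8.0, "glamor": 16.0},
}
relative = normalise(times)
# relative["trace-a"] == {"uxa": 1.0, "sna": 2.5, "glamor": 0.5}
```

Normalising per system in this way removes the raw speed of each machine from the comparison, leaving only how each backend fares against the current driver.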

glamor vs sna vs uxa on an i3-330m

glamor vs sna vs uxa on an i5-2520m

glamor vs sna vs uxa on an i7-3720qm

The impedance mismatch between the Render protocol and OpenGL, the fact that mesa has not been optimised for this workload, and the immaturity of glamor itself all belie the simplicity and appeal of glamor.

With a couple of embarrassing bug fixes and the incremental evolution of the MSAA compositor for the cairo-gl backend, it is almost time for a new bugfix release – just one more bug to squash first.

In the meantime, I looked at the performance of an older machine, a Pentium-III mobile processor with an 855gm (a gen2 Intel GPU). It is not a platform known for its high performance…

UXA vs SNA performance on 855gm

Outside of the singularly disappointing performance of the swap-happy ocitysmap run, SNA performs adequately, delivering performance at least as good as client-side rendering and often better. So even on these old machines we can run current applications fluidly, if not exactly with blistering performance. In contrast, with UXA reasonable performance is the exception rather than the rule, with many scenarios where the driver is a CPU hog killing interactivity.