
So Michel Dänzer just pushed some patches to enable using glamor from within the xf86-video-ati driver. As a quick recap, glamor is a generic Xorg driver library that translates the 2D requests used to render X into OpenGL commands. The argument is that driver teams then need only concentrate on bringing up the OpenGL stack, and they gain a functioning display server in the process. The counter-argument is that this compromise, made to save engineering time, penalises the performance of the display server.
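
For anyone wanting to experiment, glamor acceleration in the radeon ddx is opt-in. An xorg.conf fragment along these lines should select it – treat this as a sketch, as the exact option values depend on the driver build:

    Section "Device"
        Identifier "Radeon"
        Driver     "radeon"
        # Ask the ddx to use GL-based glamor acceleration
        # instead of the default EXA; this requires a working
        # OpenGL stack for the card.
        Option     "AccelMethod" "glamor"
    EndSection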

To highlight that last point, we can look at the performance of the intel driver with and without glamor, compared against rendering directly with OpenGL:

glamor on SandyBridge

The centre baseline is the performance of simply using the CPU and pixman to render; above that line we are faster, below it slower. The first bar is the performance of using OpenGL directly; in theory this should provide the best performance of all, being limited only by the hardware. Sadly, the graph shows the stark reality that undermines using glamor – one needs an OpenGL driver that has been optimized for 2D usage in order to maximise GPU performance with the Xorg workload. Notice the areas where glamor does better than direct usage via cairo-gl? Those are where glamor itself attempts to mitigate poor buffer management in the driver.

Enter a different GPU and a different driver. The whole balance of CPU to GPU power shifts along with the engineering focus. Everything changes.

Taking a look at the same workloads on the same computer, but using the discrete Radeon HD5770 rather than the integrated processor graphics:

glamor on Radeon HD5770

Perhaps the first thing we notice is the raw power of the discrete graphics, as exposed by using OpenGL directly from within Cairo. Secondly, we notice the lacklustre performance of the existing EXA driver for the Radeon chipset – remember, everything below the line implies that the GPU driver in Xorg is doing worse than could be achieved through client-side software rendering alone, and that its RENDER acceleration is therefore nothing of the sort. And then our attention turns to the newcomer, glamor on radeon. It is still notably slower than both the CPU and using OpenGL directly. However, its performance is very similar to that of the existing EXA driver, sometimes slower, sometimes faster (looking at the relative x11perf numbers reveals some areas where the EXA driver could make major improvements).

glamor on Radeon HD5770

Not bad for a first patch with an immature library, and it demonstrates that glamor can be used to reduce the development cost of bringing up a new chipset – yet it does not reach the full potential of the system. Judging by the last graph, one does wonder whether glamor is even preferable to using xf86-video-modesetting in such cases – on a high-performance multicore system, for the time being at least. 😉


The idea of glamor is to leverage an existing OpenGL driver in order to implement the 2D acceleration required for an X server, so that in the future one need only write an OpenGL driver. But how well does the existing mesa/i965 driver handle the task of accelerating glamor for Cairo – a task supposedly well suited to the 3D pipeline and pixel shaders of the GPU? With the first patches to implement accelerated trapezoids for glamor, I decided to find out.

I ran the usual cairo trace benchmarks, including a couple of new ones to highlight areas of poor performance that have come to our attention, measuring the throughput of various applications that use Cairo against the glamor (with and without the trapezoid shader enabled), SNA and UXA ddx backends on a small selection of Core processors, from a lowly i3 Arrandale to a thoroughbred i7 IvyBridge. The results were then normalized to the performance of UXA on each system, so that we can directly compare each proposed backend against the current driver. A result above the centre line means that the driver was faster than UXA; below, slower.

glamor vs sna vs uxa on an i3-330m

glamor vs sna vs uxa on an i5-2520m

glamor vs sna vs uxa on an i7-3720qm

Between the impedance mismatch between the Render protocol and OpenGL, the fact that Mesa has not been optimised for this workload, and glamor itself still being very immature, the results belie the simplicity and appeal of glamor.

With a couple of embarrassing bug fixes and the incremental evolution of the MSAA compositor for the cairo-gl backend, it is almost time for a new bugfix release – just one more bug to squash first.

In the meantime, I looked at the performance of an older machine, a Pentium-III mobile processor with an 855gm (a gen2 Intel GPU). It is not a platform known for its high performance…

UXA vs SNA performance on 855gm

Outside of the singularly disappointing performance of the swap-happy ocitysmap run, SNA performs adequately, delivering performance at least as good as client-side rendering and often better. So even on these old machines we can run current applications fluidly, if not exactly with blistering performance. In contrast, with UXA reasonable performance is the exception rather than the rule, with many scenarios where the driver is a CPU hog, killing interactivity.

A month in, and we’re starting to get some good feedback from people trying cairo-1.12. Unfortunately, it appears that we’ve exposed some bugs in a few drivers, though hopefully we will see driver updates to resolve those issues shortly. We also received a few bug reports against Cairo itself. The most bizarre, perhaps, was that LibreOffice failed to display its presentation slideshow. That turned out to be an inadvertent application bug caught by our improved error detection – though since those checks affected the established API, we had to relax them somewhat. Along with a few other bug fixes, Cairo 1.12.2 was released.

In the last round of benchmarking I performed, some of you noticed that the glamor backend for the Intel driver was not included. It was left out for the simple reason that it was not able to complete the task. However with the first stable release of glamor-0.4, it is able to complete a benchmarking run. And so without further ado, let’s see how all the current drivers fare on a SandyBridge i5-2500 desktop with GT1 integrated graphics and a Radeon HD 5770 discrete GPU with cairo-1.12.2.

Performance of cairo-1.12.2 on i5-2500

This time the results are normalized to the performance of Xvfb. Any driver that performs a test faster is above the centre; any that was slower, below. Again, the general state of the drivers leaves much to be desired, and despite the bold claims made for glamor, in my testing it fails to improve upon UXA. Early days, you might say.

I do have a SandyBridge desktop. This system is a bit special, as it is the one in which I can run a discrete card in direct comparison to the integrated processor graphics, i.e. run both GPUs simultaneously (though not yet co-operatively). In this system I put a Radeon HD5770, which at the time was a mid-range GPU offering good value for performance. So how does Cairo fare on this box? In particular, are the DDX drivers any better than pixman, the backend for Cairo’s software rasteriser?

Cairo performance on a Radeon HD5770

Once again the white baseline in the centre represents the performance of the image backend, how fast we expect to be able to render the contents ourselves. Above that baseline the driver is faster; below it, slower, laggy and power hungry. This is quite a busy graph, as it compares the performance of the proprietary fglrx driver and the open-source radeon driver along with the alternative acceleration methods for the integrated graphics. I’d recommend viewing it at full size, but the gist is that in many cases the proprietary driver lags behind the open-source effort. Neither is truly comparable to just using the CPU, and only the open-source effort is ever faster. In contrast, we have the integrated processor graphics. Even this GT1 desktop system (with only half the GPU capability of a GT2 system, as found in mobile and a few desktop chips) can outclass the CPU. When not limited by a poor driver, that is.

The only working Nvidia system I have at the moment is an antiquated, cheap little Ion netbook with a weak N270 Atom. So let’s see how it fares in the debate over whether it is preferable to use the image backend and push images to the server rather than send geometry via XRender.

Relative performance on Nvidia Ion

The white baseline in the centre is the performance of the image backend on the N270. Above that line the driver is faster and more responsive; below it, it consumes more CPU and more power, and lags more than if we were just to render everything ourselves.

What we can see is that, as a result of Nvidia investing engineering time in their driver, on the whole it performs better than we could by just using the CPU. Better performance means more responsive user interfaces that sip less power, meaning happier users for longer.

18 months is a long time without an update. So why all the excitement now? Well it is a little story of one or two features and lots of performance…

First up, let’s compare the relative performance of Cairo versions 1.10 and 1.12 on a recent SandyBridge machine, an i5-2520m, by looking at how long it takes to replay a selection of application traces.
Relative Cairo performance on SandyBridge

The white line across the centre represents the baseline performance of cairo-1.10. Above that line, the backend is faster with cairo-1.12 for that application; below it, slower.

Across the board, with one or two exceptions, Cairo version 1.12 is much faster than 1.10. (Those one or two exceptions are interesting and quite unexpected; analysing those regressions should help push Cairo further.) If we focus on the yellow bar, we can see that the performance of the baseline image backend has been improved consistently, peaking at almost 4 times faster and averaging around 50% faster. Note that this excludes any performance gains also made by pixman within the same time frame. The other bars compare the remaining backends: the xlib backend using the SNA and UXA acceleration methods, the pure CPU rasteriser (Xvfb), and the OpenGL backend.

That tells us that we have made good improvements to Cairo’s performance, but a question that is often asked is whether using XRender is worthwhile, and whether we would have better performance and more responsive user interfaces if we were just to use CPU rasterisation (the image backend) and push those images to the X server – see https://bugzilla.mozilla.org/show_bug.cgi?id=738937 for instance. To answer that we need to look beyond the relative performance of the same backend over time, and instead look at the relative performance of the backends against image. (Note this excludes any transfer overhead, which in practice is mostly negligible.)
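
To make the two strategies concrete, here is a minimal sketch in C of both approaches: rasterising client-side into an image surface and pushing the finished pixels to the window, versus handing the geometry to the server through the xlib backend. The drawing itself is an arbitrary example, and error handling is omitted for brevity.

    #include <cairo.h>
    #include <cairo-xlib.h>
    #include <X11/Xlib.h>
    #include <unistd.h>

    int main(void)
    {
        Display *dpy = XOpenDisplay(NULL);
        int scr = DefaultScreen(dpy);
        Window win = XCreateSimpleWindow(dpy, RootWindow(dpy, scr),
                                         0, 0, 256, 256, 0,
                                         0, WhitePixel(dpy, scr));
        XMapWindow(dpy, win);
        XSync(dpy, False);

        /* The window's surface: drawing here is sent via XRender. */
        cairo_surface_t *xlib = cairo_xlib_surface_create(
            dpy, win, DefaultVisual(dpy, scr), 256, 256);

        /* Strategy 1: rasterise client-side with pixman... */
        cairo_surface_t *image = cairo_image_surface_create(
            CAIRO_FORMAT_ARGB32, 256, 256);
        cairo_t *cr = cairo_create(image);
        cairo_set_source_rgb(cr, 0.2, 0.4, 0.8);
        cairo_arc(cr, 128, 128, 96, 0, 2 * 3.14159265);
        cairo_fill(cr);
        cairo_destroy(cr);

        /* ...then push the finished pixels to the server. */
        cr = cairo_create(xlib);
        cairo_set_source_surface(cr, image, 0, 0);
        cairo_paint(cr);
        cairo_destroy(cr);

        /* Strategy 2: send the geometry itself over XRender and
         * let the ddx rasterise it on our behalf. */
        cr = cairo_create(xlib);
        cairo_set_source_rgb(cr, 0.9, 0.9, 0.2);
        cairo_arc(cr, 128, 128, 48, 0, 2 * 3.14159265);
        cairo_fill(cr);
        cairo_destroy(cr);

        cairo_surface_flush(xlib);
        XSync(dpy, False);
        sleep(2);

        cairo_surface_destroy(image);
        cairo_surface_destroy(xlib);
        XCloseDisplay(dpy);
        return 0;
    }

Either way the Cairo API is identical; the benchmark question is simply which side of the wire does the rasterisation.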

Relative performance of X against the image backend on SandyBridge
 
The answer, then, is a resounding no, even on integrated graphics, provided you have a good driver. And that lead will be stretched further in favour of using X (and thus the GPU) as the capabilities of GPUs grow faster than those of CPUs. Whilst we remain GPU bound, at least! Conversely, it does say that without a good driver it would be preferable to perform client-side rasterisation into a shared-memory image, so as to avoid the performance loss due to passing trapezoids in the protocol.
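
For reference, the trapezoid passing in question looks roughly like this at the Xlib level: cairo tessellates each path into trapezoids client-side, and the xlib backend submits them with XRenderCompositeTrapezoids, leaving the driver to rasterise the mask. A sketch, with hypothetical coordinates and pictures:

    #include <X11/extensions/Xrender.h>

    /* Composite a single, hand-built trapezoid from src onto dst,
     * in the style of cairo's xlib backend. All coordinates are
     * in 16.16 fixed point. */
    static void composite_one_trapezoid(Display *dpy, Picture src, Picture dst)
    {
        XTrapezoid trap;

        trap.top    = XDoubleToFixed(10.0);     /* vertical extent */
        trap.bottom = XDoubleToFixed(50.0);
        trap.left.p1.x  = XDoubleToFixed(10.0); /* left edge as a line */
        trap.left.p1.y  = XDoubleToFixed(10.0);
        trap.left.p2.x  = XDoubleToFixed(20.0);
        trap.left.p2.y  = XDoubleToFixed(50.0);
        trap.right.p1.x = XDoubleToFixed(90.0); /* right edge */
        trap.right.p1.y = XDoubleToFixed(10.0);
        trap.right.p2.x = XDoubleToFixed(80.0);
        trap.right.p2.y = XDoubleToFixed(50.0);

        /* The server rasterises the trapezoid into an a8 mask and
         * composites src through that mask onto dst. */
        XRenderCompositeTrapezoids(dpy, PictOpOver, src, dst,
                                   XRenderFindStandardFormat(dpy, PictStandardA8),
                                   0, 0, &trap, 1);
    }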

Given that we are forced by the XRender protocol to use a sub-optimal rendering method, we should do even better if we cut out the indirect rendering via X and render directly onto the GPU using OpenGL. What happens when we try?

Relative performance of X against the image backend on SandyBridge

Not only is the current gl backend overshadowed by the xlib backend, it fails to even match the pure CPU rasteriser. There are features of the GPU that we are not yet using and that are being experimented with, such as multi-sample buffers and relying on the GPU to perform low-quality antialiasing, but as of today it applies the same methodology as the X driver uses internally and yet performs much worse.
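
For the curious, driving the gl backend looks something like the following; note that cairo-gl is still experimental and has to be enabled at build time. A minimal sketch, assuming an existing GLX context:

    #include <cairo.h>
    #include <cairo-gl.h>
    #include <GL/glx.h>

    /* Sketch: wrap an existing GLX context in a cairo device and
     * allocate an offscreen GL surface for cairo to render into,
     * bypassing X rendering entirely. */
    static cairo_surface_t *create_gl_surface(Display *dpy, GLXContext ctx,
                                              int width, int height)
    {
        cairo_device_t *device = cairo_glx_device_create(dpy, ctx);
        cairo_surface_t *surface = cairo_gl_surface_create(
            device, CAIRO_CONTENT_COLOR_ALPHA, width, height);

        /* The surface keeps its own reference to the device. */
        cairo_device_destroy(device);
        return surface;
    }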

The story is more or less the same across all the generations of Intel’s integrated graphics.

Going back a generation to IronLake integrated graphics:
Relative performance of X against the image backend on IronLake

And for the netbooks we use PineView integrated graphics:
Relative performance of X against the image backend on PineView

So the next time your machine feels slow or runs out of battery, it is more than likely to be the fault of your drivers.

The No. 1 contributor of kernel regressions!
Who wrote 2.6.37?

As a means of reducing my patch queue and allowing Eric to focus on making GL fast and featureful, I volunteered (some might say was volunteered) to take over maintaining the upstream branches for i915.ko. At the moment, there are a pair of trees on fd.o: one for regression fixes [-fixes] and one for feature work [-next]. If all goes to plan, these should be updated regularly and flow into the drm trees maintained by Dave Airlie. These trees will be used for QA regression testing. We are also discussing the viability of a third tree for proposed patches [-staging], so that we have a single location from which QA or bug reporters can pull to test patches and verify fixes before they are merged into the stable trees.

Here are the slides, at least. They are of course missing the fullscreen demos.

Cairo: 2D in a 3D World