Skip navigation

Category Archives: cairo

Søren Sandmann Pedersen, maintainer of the pixman library that is at the heart of the software rasteriser in Cairo and X, argues that with all things being equal an integrated graphics processor should never be able to outperform the main processor for rendering Cairo. Ideally, when using either the GPU or CPU, we want the memory being the rate-limiting component. On the Sandybridge processor, the memory is attached to the system agent, along with the large L3 caches, and is shared between the CPU and GPU. So he argues that any performance advantage cairo-xlib might show over cairo-image whilst using Intel graphics is really an opportunity for improvement of the rasteriser and pixman kernels. (But also bear in mind that the GPU should be considerably more efficient and be able to saturate the memory bus at a fraction of the power cost of the CPU). Using SSE2, we are indeed able to fully saturate the bus to L1, L2 and main memory for a variety of common operations in pixman. However, the bilinear scaling kernels take longer to compute than it does to write the results to memory, and so fail to completely utilize the available bandwidth. Furthermore, for a slightly more complex transformation than a simple scale and translation, we do not yet have a fast SSE2 kernel that operates directly on the destination and so must incur some extra work. If we ignore, for the time being, making further optimisations to those kernels to improve performance, we can instead ask the question: if using one SSE2 unit is not enough to saturate the memory bus, what happens if we throw more cores at the problem?

To gain the most improvement from adding threads to cairo, you need to design a rasteriser and usage model with threading in mind in. One such design is the vector renderer O by Øyvind Kolås. Despite being an experiment, it does show quite a bit of promise, but in its raw form just throwing threads at the problem does not beat using the SIMD compositing routines provided by pixman. However, it did raise the question whether we can make improvements to the existing image backend without impacting upon its immediate mode nature and so could be used by existing applications without alteration. To preserve the existing semantics, we can break up the individual composite and scan conversion operations into small pieces and feed those to a pool of threads, and then wait for the threads to complete before returning back to the application. As such we then never run the threads for very long, and risk that the overhead in thread management outweighs any benefit from splitting the operation over multiple cores.

To gain the greatest advantage from adding threads, I used a desktop Sandybridge processor (an i5-2500) for benchmarking this prototype cairo-image backend:

Comparison of multiple threads

About half the test cases regress by about 10%, but around one third are improved by 2-3x. Not bad – the regressions are unacceptable, but it does offer a tantalising suggestion that we can make some significant improvements with only a minor change. (Remember not all processors are equal, and thanks to Sandybridge’s turbo some cores are more equal than others. I tried running the test cases on a Sandybridge ultrabook, and as soon as more one core was active, the entire package was throttled to keep it within its thermal constraints. The performance regression was dire.)

Having tuned the software backend to make better use of the average resources, what can we now say about the relative merits of the integrated graphics processor?

Comparison of multiple threads and SNA

Indeed, for the cases that are almost entirely GPU bound (for example the firefox-fishbowl, -fishtank, -paintball, -particles), we have virtually eliminated all the previous advantage that the GPU held. In a notable couple of cases, we have improved the image backend to outperform SNA, and for all cases now the threaded image backend beats UXA. However, as can be seen there is still plenty of room for improvement of the image backend, and we can’t let the hardware acceleration be merely equal to a software rasteriser…

The Nvidia ion GPU was released a few years back to cater for the low power netbook market as a substantial upgrade for the anemic Intel GMA950 (or 945gm as it is better known to us) that shipped as the integrated graphics processor in the first Atoms. Those made quite nice machines with long runtimes, a touch on the slow side compared to the full grown laptops, but actually quite comparable to the current generation of premium tablets many years later. People still use these, and I get the occasional request to see how well they perform everytime Nvidia releases a new driver. So lets take a look:

Performance of Nvidia Ion with the 313.18 driver release

In nearly every test there is a small improvement, at around 2-3%. But for the paintball HTML5 demo, they have managed to fix some form of resource starvation and have been able to accelerate it, giving an order of magnitude speed increase. Strangely, the nvidia driver fares worse on average than the nouveau driver, despite it having substantially better performance in several cases. This is because the nouveau driver is, at least, consistently slow, whereas the nvidia driver also experience a few severe slowdowns and so has a much more mixed set of results.

As is always the case, everytime you try to compare implementations, you find completely unexpected bugs. The idea was simple: look at the performance of the last few generations and see how we’ve been improving.

To start with, I have a Core2 Quad, Q8400 at 2.66GHz (fixed frequency) and a venerable 3rd generation Intel GPU (a q35 to be precise). This is a 95W desktop beast, only to be beaten by a slightly older Q9550 Core2 Quad at 2.83GHz sporting a 4th generation GPU, a g45.

From the current generation of CPU design, I have two mobile chips, a Sandybridge i5-2520m at up to 3.2GHz and an Ivybridge i7-3720qm at up to 3.6GHz. These mobile chips run much cooler than their desktop brethen at 35W and 45W respectively, and both have the GT2 variants of the 6th and 7th generation Intel GPUs.

To put those two into perspective, I have also added the results from a desktop Sandybridge chip, the 95W i5-2500 which runs up to 3.7GHz. On the downside this processor only has a GT1 6th generation GPU.

Relative GPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by the Core2 Q8440:

Q9550: 1.6x faster
i5-2500: 2.3x faster
i5-2520m: 1.7x faster
i7-3720qm: 1.4x faster

The success story is that the 35W mobile processors really do deliver the performance of the 95W desktop beasts of the previous generation, which is quite frankly very impressive. Of course, they are still dwarfed by their desktop brethen – now I really do want to get my hands on a i7-3770k!

The oddity is then why is my Ivybridge underperforming? Its single thread performance which is under test here should not be as badly limited, the base clock is a tiny bit faster than the Sandybridge i5, and with the larger and improved caches it should acheive better IPC as well. In theory it should also be coupled to faster main memory as well. My guess is that it is being throttled due to poor cooling. Or that there is a glitch in the software, or perhaps a broken configuration, or perhaps the memory is underrated, et cetera.

But what of the graphics performance, I hear you cry! The situation here is a bit more complex. All of those processors where using the same SSE2 backend in pixman which will only be improved upon with the introduction of AVX2 in Haswell, and so we were directly comparing slight variations of processor design executing the same software. When we look at GPU performance, not only do we have a wide variation of processor design, feature set and instruction sets, we also by necessity have different software for each.

If we look at the current driver situation, that is using UXA:

Relative GPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by using SNA on the Core2 Q8440 with a q35 (a gen3 device like found in the Pineview netbook) GPU:

Q9550: 4.3x slower
i5-2500: 1.5x slower
i5-2520m: 3.0x slower
i7-3720qm: 2.3x slower

Despite almost a doubling of CPU power and an even greater increase in the GPU performance across the generations, the drivers continue to do a disservice to the hardware and ourselves.

Lots of numbers from running cairo-traces on a i5-2500, which has a Sandybridge GT1 GPU (or more confusingly called HD2000), and also a Radeon HD5770 discrete GPU.

Performance results

The key results being the geometric mean of all the traces as compared to using a software rasteriser:

UXA (snb): 1.6x slower
glamor (snb): 2.7x slower
SNA (snb): 1.6x faster
GL (snb): 2.9x slower

EXA (r600g): 3.8x slower
glamor (r600g): 3.0x slower
GL (r600g): 2.7x slower

fglrx (xlib): 4.7x slower
fglrx (GL): 32.0x slower [not shown as it makes the graphs even more difficult to read]

All bar one of the acceleration methods is worse (performance, power, latency, by any metric) than simply using the CPU and rendering directly within the client. Note also that software rasterisation is currently more performant than trying to use the GPU through the OpenGL driver stack.

(All software, except for the xserver which was held back to keep glamor working, from git as of 20121228.)

So Michal Danzer just pushed some patches to enable using glamor from within the xf86-video-ati driver. As a quick recap, glamor is a generic Xorg driver library that translates the 2D requests used to render X into OpenGL commands. The argument being then that the driver teams need only concentrate on bringing up the OpenGL stack and gain a functioning display server in the process. The counter argument is that this compromise in saving engineering time penalises the performance of the display server.

To highlight that last point, we can look at the performance of the intel driver with and without glamor, and rendering directly with OpenGL:

glamor on SandyBridge

The centre baseline is the performance of simply using the CPU and pixman to render, above that we are faster and below slower. The first bar is the performance of using OpenGL directly, in theory this should provide the best performance of all, only being limited by hardware. Sadly, the graph shows the stark reality that undermines using glamor – one needs an OpenGL driver that has been optimized for 2D usage in order to maximise GPU performance with the Xorg workload. Note the areas where glamor does better than the direct usage in cairo-gl? This is where glamor itself attempts to mitigate against poor buffer managment in the driver.

Enter a different GPU and a different driver. The whole balance of CPU to GPU power shifts along with the engineering focus. Everything changes.

Taking a look at the same workloads on the same computer, but using the discrete Radeon HD5770 rather than the integrated processor graphics:

glamor on Radeion HD5770

Perhaps the first thing that we notice is the raw power of the discrete graphics as exposed by using OpenGL directly from within cairo. Secondly, we notice the lack luster performance of the existing EXA driver for the Radeon chipset – remember everything below the lines implies that the GPU driver in Xorg is behaving worse than could be achieved just through client-side sowftware rendering, that using RENDER acceleration is nothing of the sort. And then our attention turns to the newcomer, glamor on radeon. It is still notably much slower than both the CPU and using OpenGL directly. However, it is of very similar performance to the existing EXA driver, sometimes slower, sometimes faster (if you look at the relative x11perf, then it reveals some areas where the EXA driver could do major improvements).

glamor on Radeion HD5770

Not bad for the first patch with an immature library, and demonstrates that glamor can be used to reduce the development cost of bringing up a new chipset – yet does not reach the full potential of the system. Judging by the last graph, one does wonder whether glamor is even preferable to using xf86-video-modesetting in such cases, on a high performance multicore system, for the time being, at least. ;-)

With a couple of embarassing bug fixes and the incremental evolution of the MSAA compositor for the cairo-gl backend, it is almost time for a new bugfix release – just one more bug to squash first.

In the meantime, I looked the performance of an older machine, a Pentium-III mobile processor with an 855gm (a gen2 Intel GPU). It is not a platform known for its high performance…

UXA vs SNA performance on 855gm

Outside of the singularly disappointing performance of the swap happy ocitysmap run, SNA performs adequately delivering performance at least as good as client-side rendering and often better. So even on these old machines we can run current applications fluidly, if not exactly with blistering performance. In contrast, with UXA reasonable performance is the exception rather than rule, with many scenarios where the driver is a CPU hog killing interactivity.

A month in, and we’re starting to get some good feedback from people trying cairo-1.12. Unfortunately, it appears that we’ve exposed some bugs in a few drivers, though hopefully we will see driver updates to resolve those issues shortly. We also received a few bug reports against Cairo itself. The most bizarre perhaps was that LibreOffice failed to display the presentation slideshow. That turned out to be an inadvertent bug caught by better error detection – though since that affected the established API, we had to relax the checks somewhat. Along with a few other bug fixes, Cairo 1.12.2 was released.

In the last round of benchmarking I performed, some of you noticed that the glamor backend for the Intel driver was not included. It was left out for the simple reason that it was not able to complete the task. However with the first stable release of glamor-0.4, it is able to complete a benchmarking run. And so without further ado, let’s see how all the current drivers fare on a SandyBridge i5-2500 desktop with GT1 integrated graphics and a Radeon HD 5770 discrete GPU with cairo-1.12.2.

Performance of cairo-1.12.2 on i5-2500

This time the results are normalized to the performance with Xvfb. Any driver that performs a test faster is above the centre, any that were slower below. Again the general state of the drivers leave much to be desired, and despite the bold claims for glamor, in my testing it fails to improve upon UXA. Early days you might say.

I do have a SandyBridge desktop. This system is a bit special as it is the one in which I can run a discrete card in direct comparison to the ingrated processor graphics, i.e. run both GPUs simultaneously (but not yet co-operatively). In this system I put a Radeon HD5770 which at the time was a mid-range GPU offering good value for performance. So how does Cairo fare on this box? In particular are the DDX drivers any better than pixman, the backend for Cairo’s software rasteriser?

Cairo performance on a Radeon HD5770

Once again the white baseline in the centre represents the performance of the image backend, how fast we expect to be able to render the contents ourselves. Above that baseline, the driver is faster; below slower, laggy and power hungry. This is quite a busy graph as it compares the performance of the propietary fglrx driver and the open-source radeon driver along with the alternative acceleration methods for the integrated graphics. I’d recommend viewing at full size, but the gist of it is that in many cases the propietary driver lags behind the open-source effort. Neither are truly comparable to just using the CPU and only the open-source effort is ever faster. In contrast, we have the integrated processor graphics. Even this GT1 desktop system (only half the GPU capability of a GT2 system such as found on mobiles and on a few desktop chips) can outclass the CPU. When not limited by a poor driver, that is.

The only working Nvidia system I have at the moment is an antiquated, cheap little Ion netbook with a weak N270 Atom. So let see how it fares in the debate whether it is preferrable to use the image backend and push images to the server rather than send geometry via XRender.

Relative performance on Nvidia Ion

The white baseline in the centre is the performance of the image backend on the N270. Above that line and the driver is faster and more responsive, below it consumes more CPU, more power and lags more than if we were just to render it ourselves.

What we can see is that as a result of Nvidia investing engineering time in their driver, on the whole, it performs better than we could by just using the CPU. Better performance means more responsive user interfaces that sip less power, meaning happier users for longer.

18 months is a long time without an update. So why all the excitement now? Well it is a little story of one or two features and lots of performance…

First up, lets compare the relative performance of Cairo versions 1.10 and 1.12 on a recent SandyBridge machine, a i5-2520m, by looking at how long it takes to replay a selection of application traces.
Relative Cairo peformance on Sandybridge

The white line across the centre represents the baseline performance of cairo-1.10. Above that line and that backend is faster with cairo-1.12 for that application, below slower.

Across the board, with one or two exceptions, Cairo version 1.12 is much faster than 1.10. (Those one or two exceptions are interesting and quite unexpected. Analysing those regressions should help push Cairo further.) If we focus on the yellow bar, you can see that the performance of the baseline image backend has been improved consistently, peaking at almost 4 times faster, but on average around 50% faster. Note this excludes any performance gains also made by pixman within the same time frame. The other bars compare the other backends, the xlib backend using the SNA and UXA acceleration methods along with the pure CPU rasteriser (xvfb), and the OpenGL backend.

That tells us that we have made good improvements to Cairo’s performance, but a question that is often asked is whether using XRender is worthwhile and whether we would have better performance and more responsive user interfaces if we were just to use CPU rasterisation (using the image backend) and push those images to the X server, https://bugzilla.mozilla.org/show_bug.cgi?id=738937 for instance. To answer that we need to look beyond the relative performance of the same backend over time, and instead look at the relative performance of the backends against image. (Note this excludes any transfer overhead, which in practice is mostly negligible.)

Relative performance of X against the image backend on SandyBridge
 
The answer is then a resounding no, even on integrated graphics, provided you have a good driver. And that lead will be stretched further in favour of using X (and thus the GPU) as the capabilities of the GPU grow faster than the CPU. Whilst we remain GPU bound at least! On the contrary, though it does say that without any driver it would be preferrable to perform client side rasterisation into a shared memory image to avoid he performance loss due to trapezoid passing in the protocol.

Given that we are forced by the XRender protocol to use a sub-optimal rendering method, we should do even better if we cut out the indirect rendering via X and rendered directly onto the GPU using OpenGL. What happens when we try?

Relative performance of X against the image backend on SandyBridge

Not only is the current gl backend overshadowed by the xlib backend, but it fails to even match the pure CPU rasteriser. There are features of the GPU that we are not using yet and are being experimented with, such as using multi-sample buffers and relying on the GPU to perform low quality antialiasing, but as for today it is applying the same methodology as the X driver uses internally and yet performs much worse.

The story is more less the same across all the generations of Intel’s integrated graphics.

Going back a generation to IronLake integrated graphics:
Relative performance of X against the image backend on IronLake

And for the netbooks we use PineView integrated graphics:
Relative performance of X against the image backend on PineView

So the next time your machine feels slow or runs out of battery, it is more than likely to be the fault of your drivers.

Follow

Get every new post delivered to your Inbox.