Category Archives: cairo

After checking on the progress of Glamor for Intel chipsets, it is time to have a look and see what the state of play is for a Radeon HD5770. This card is now a few years old and is sitting in a Sandybridge i5-2500 desktop.


Comparison of DDX rendering with a Radeon HD5770

The baseline I have chosen here is the performance of the Intel DDX using SNA but with acceleration disabled – that is, it renders entirely on the i5-2500 CPU.

In comparison, we find on average that

  • NoAccel is 1.8x slower
  • fglrx is 9.2x slower
  • EXA is 2.9x slower
  • Glamor is 2.0x slower

Or to put a positive spin on it, the new Glamor acceleration on this particular r600g device is about 50% faster than the existing EXA radeon driver. If you look closely, there are just a couple of traces where EXA performs better than Glamor; with those regressions fixed, Glamor would be a clear improvement for radeon. And almost as fast as not using Glamor at all! However, Glamor was not able to complete the benchmark run without crashing.
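The "about 50% faster" figure follows from the slowdown factors in the list above. A minimal sketch of that arithmetic, assuming each factor means "N times slower than the baseline" (the dictionary and function names are only illustrative):

```python
# Slowdown factors relative to the baseline, copied from the list above.
slowdown = {"NoAccel": 1.8, "fglrx": 9.2, "EXA": 2.9, "Glamor": 2.0}

def relative_speed(a, b):
    """How many times faster driver `a` is than driver `b`.

    A driver that is N times slower than the baseline runs at 1/N of the
    baseline speed, so the ratio of the two speeds is slowdown[b] / slowdown[a].
    """
    return slowdown[b] / slowdown[a]

glamor_vs_exa = relative_speed("Glamor", "EXA")
glamor_vs_noaccel = relative_speed("Glamor", "NoAccel")

print(f"Glamor vs EXA:     {glamor_vs_exa:.2f}x")
print(f"Glamor vs NoAccel: {glamor_vs_noaccel:.2f}x")
```

A ratio above 1 means the first driver is faster, below 1 that it is slower.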

That holds for this particular set of benchmarks, which is based on Cairo traces taken from real applications. If we look at synthetic benchmarks, Glamor is significantly faster in several key metrics than EXA, and fglrx is much faster again. Always take benchmarks with a pinch of salt.

So I have a new toy, an i7-4950hq processor. This little beast is one of the special Intel chips sporting an Iris Pro 5200, better known as Haswell GT3e. That GPU has 40 execution units and 128MiB of eDRAM to serve as a fourth-level cache for both the CPU and GPU.

Enough spiel, just how fast is it?

For context, here are some results comparing it with my old Sandybridge laptop (with an i5-2520m).

Comparing the processor using the single-threaded cairo-image:

Comparison of i7-4950hq to i5-2520m

and again comparing the GPUs, using SNA and cairo-xlib:

Comparison of i7-4950hq to i5-2520m

On the whole, we see a two-fold increase in both single-threaded CPU performance and GPU performance (for 2D graphics using cairo) in the jump from a Sandybridge i5-2520m to a Haswell i7-4950hq. In most cases SNA is being limited by how fast the application can feed it commands and so the performance increase is mostly due to the same improvement in CPU speed. (This increase is above and beyond the expected improvements due to IPC, so it is more likely the ability of the Haswell chip to turbo higher and longer thanks to improved thermals and cooling.)

And we can compare the relative merits of using OpenGL and a specialised 2D driver by comparing the various rendering backends available for the DDX. The results are normalized to the cairo-image results, and we have

  • none – a multithreaded CPU renderer inside the DDX
  • blt – disable the render acceleration, but allow the DDX to use the BLT engine to move data about i.e. copies and fills
  • sna – SNA render acceleration, default in xf86-video-intel-3.0
  • uxa – UXA render acceleration, current default
  • glamor – Glamor render acceleration, uses OpenGL to offload rendering operations onto the GPU

Comparison of DDX backends on an i5-2520m

Comparison of DDX backends on an i7-4950hq

The summary here is that Glamor offers a meagre improvement over UXA. However, both are still much slower on average than cairo-image, i.e. the performance attainable by using a single CPU core. It takes multiple threads inside the DDX to match the performance of cairo-image – this is due to the inherent inefficiencies of the current Render protocol. However, if we then utilize the render acceleration on the GPU (using SNA) we can indeed outperform cairo-image: on average SNA is about 2x faster than cairo-image and about 4x faster than UXA and Glamor. Thus SNA does deliver hardware acceleration that succeeds in offloading work onto the GPU (letting the CPU get on with other tasks) and performs faster than rendering everything with the CPU.

Søren Sandmann Pedersen, maintainer of the pixman library that is at the heart of the software rasteriser in Cairo and X, argues that, all things being equal, an integrated graphics processor should never be able to outperform the main processor for rendering Cairo. Ideally, when using either the GPU or CPU, we want memory to be the rate-limiting component. On the Sandybridge processor, the memory is attached to the system agent, along with the large L3 caches, and is shared between the CPU and GPU. So he argues that any performance advantage cairo-xlib might show over cairo-image whilst using Intel graphics is really an opportunity for improvement of the rasteriser and pixman kernels. (But also bear in mind that the GPU should be considerably more efficient and be able to saturate the memory bus at a fraction of the power cost of the CPU.)

Using SSE2, we are indeed able to fully saturate the bus to L1, L2 and main memory for a variety of common operations in pixman. However, the bilinear scaling kernels take longer to compute their results than it takes to write them to memory, and so fail to completely utilize the available bandwidth. Furthermore, for a slightly more complex transformation than a simple scale and translation, we do not yet have a fast SSE2 kernel that operates directly on the destination and so must incur some extra work. If we ignore, for the time being, making further optimisations to those kernels to improve performance, we can instead ask the question: if using one SSE2 unit is not enough to saturate the memory bus, what happens if we throw more cores at the problem?

To gain the most improvement from adding threads to cairo, you need to design a rasteriser and usage model with threading in mind. One such design is the vector renderer O by Øyvind Kolås. Despite being an experiment, it does show quite a bit of promise, but in its raw form just throwing threads at the problem does not beat using the SIMD compositing routines provided by pixman. However, it did raise the question of whether we can make improvements to the existing image backend without impacting upon its immediate mode nature, so that they could be used by existing applications without alteration. To preserve the existing semantics, we can break up the individual composite and scan conversion operations into small pieces and feed those to a pool of threads, and then wait for the threads to complete before returning to the application. As such we then never run the threads for very long, and risk that the overhead in thread management outweighs any benefit from splitting the operation over multiple cores.
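The split-and-wait scheme described above can be sketched in a few lines. This is not the prototype's code: the names `over`, `composite_rows` and `composite_threaded`, the band-per-task split and the tiny test surfaces are all assumptions made for illustration. In CPython the global interpreter lock also means this shows the control flow rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def over(src_px, dst_px):
    """Premultiplied-alpha OVER for one (r, g, b, a) pixel, channels 0..255."""
    inv = 255 - src_px[3]
    return tuple(s + (d * inv + 127) // 255 for s, d in zip(src_px, dst_px))

def composite_rows(src, dst, y0, y1):
    """Composite rows y0..y1-1 of src over dst, writing into dst in place."""
    for y in range(y0, y1):
        dst[y] = [over(s, d) for s, d in zip(src[y], dst[y])]

def composite_threaded(pool, src, dst, n_bands):
    """Split one composite into horizontal bands and wait for all of them."""
    height = len(dst)
    step = max(1, -(-height // n_bands))  # ceiling division
    futures = [
        pool.submit(composite_rows, src, dst, y, min(y + step, height))
        for y in range(0, height, step)
    ]
    for future in futures:
        # Block until every band is done, so the caller still sees a
        # finished surface on return (immediate-mode semantics).
        future.result()

# Hypothetical 8x5 surfaces: half-transparent red over opaque blue.
WIDTH, HEIGHT = 8, 5
src = [[(64, 0, 0, 128)] * WIDTH for _ in range(HEIGHT)]
dst = [[(0, 0, 255, 255)] * WIDTH for _ in range(HEIGHT)]

with ThreadPoolExecutor(max_workers=4) as pool:
    composite_threaded(pool, src, dst, n_bands=3)

print(dst[0][0])
```

Each band touches a disjoint set of rows, so no locking is needed. The cost is one submit and one wait per band on every operation, which is the thread-management overhead the paragraph above warns about.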

To gain the greatest advantage from adding threads, I used a desktop Sandybridge processor (an i5-2500) for benchmarking this prototype cairo-image backend:

Comparison of multiple threads

About half the test cases regress by about 10%, but around one third are improved by 2-3x. Not bad – the regressions are unacceptable, but it does offer a tantalising suggestion that we can make some significant improvements with only a minor change. (Remember not all processors are equal, and thanks to Sandybridge’s turbo some cores are more equal than others. I tried running the test cases on a Sandybridge ultrabook, and as soon as more than one core was active, the entire package was throttled to keep it within its thermal constraints. The performance regression was dire.)

Having tuned the software backend to make better use of the average resources, what can we now say about the relative merits of the integrated graphics processor?

Comparison of multiple threads and SNA

Indeed, for the cases that are almost entirely GPU bound (for example the firefox-fishbowl, -fishtank, -paintball, -particles), we have virtually eliminated all the previous advantage that the GPU held. In a couple of notable cases, we have improved the image backend to outperform SNA, and in all cases the threaded image backend now beats UXA. However, as can be seen there is still plenty of room for improvement of the image backend, and we can’t let the hardware acceleration be merely equal to a software rasteriser…

The Nvidia Ion GPU was released a few years back to cater for the low power netbook market as a substantial upgrade for the anemic Intel GMA950 (or 945gm as it is better known to us) that shipped as the integrated graphics processor in the first Atoms. Those made quite nice machines with long runtimes, a touch on the slow side compared to the full-grown laptops, but actually quite comparable to the current generation of premium tablets many years later. People still use these, and I get the occasional request to see how well they perform every time Nvidia releases a new driver. So let’s take a look:

Performance of Nvidia Ion with the 313.18 driver release

In nearly every test there is a small improvement, at around 2-3%. But for the paintball HTML5 demo, they have managed to fix some form of resource starvation and have been able to accelerate it, giving an order of magnitude speed increase. Strangely, the nvidia driver fares worse on average than the nouveau driver, despite it having substantially better performance in several cases. This is because the nouveau driver is, at least, consistently slow, whereas the nvidia driver also experiences a few severe slowdowns and so has a much more mixed set of results.
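A driver can win most traces and still lose on average. A small sketch with invented numbers shows how, assuming the average is a geometric mean of per-trace speed ratios (this post does not say which average it uses; the geometric mean is the one named elsewhere in this archive). The two lists are hypothetical and are not measurements of nouveau or nvidia.

```python
from statistics import geometric_mean

# Hypothetical per-trace speeds relative to a baseline (1.0 = same speed).
# These values are invented for illustration; they are not benchmark results.
consistently_slow = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
mixed = [2.0, 2.0, 2.0, 2.0, 0.02, 0.02]

slow_mean = geometric_mean(consistently_slow)
mixed_mean = geometric_mean(mixed)

# Count the traces on which the mixed driver is the faster of the two.
wins = sum(m > s for m, s in zip(mixed, consistently_slow))

print(f"consistently slow: {slow_mean:.3f}")
print(f"mixed:             {mixed_mean:.3f}")
print(f"mixed driver wins {wins} of {len(mixed)} traces")
```

Two severe slowdowns are enough to pull the mixed driver's mean below that of the driver that is merely slow everywhere, even though it wins four of the six traces.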

As is always the case, every time you try to compare implementations, you find completely unexpected bugs. The idea was simple: look at the performance of the last few generations and see how we’ve been improving.

To start with, I have a Core2 Quad, Q8400 at 2.66GHz (fixed frequency) and a venerable 3rd generation Intel GPU (a q35 to be precise). This is a 95W desktop beast, only to be beaten by a slightly older Q9550 Core2 Quad at 2.83GHz sporting a 4th generation GPU, a g45.

From the current generation of CPU design, I have two mobile chips, a Sandybridge i5-2520m at up to 3.2GHz and an Ivybridge i7-3720qm at up to 3.6GHz. These mobile chips run much cooler than their desktop brethren at 35W and 45W respectively, and both have the GT2 variants of the 6th and 7th generation Intel GPUs.

To put those two into perspective, I have also added the results from a desktop Sandybridge chip, the 95W i5-2500 which runs up to 3.7GHz. On the downside this processor only has a GT1 6th generation GPU.

Relative GPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by the Core2 Q8400:

Q9550: 1.6x faster
i5-2500: 2.3x faster
i5-2520m: 1.7x faster
i7-3720qm: 1.4x faster

The success story is that the 35W mobile processors really do deliver the performance of the 95W desktop beasts of the previous generation, which is quite frankly very impressive. Of course, they are still dwarfed by their desktop brethren – now I really do want to get my hands on an i7-3770k!

The oddity is then why is my Ivybridge underperforming? Its single-thread performance, which is what is under test here, should not be as badly limited: the base clock is a tiny bit faster than the Sandybridge i5, and with the larger and improved caches it should achieve better IPC as well. In theory it should also be coupled to faster main memory. My guess is that it is being throttled due to poor cooling. Or that there is a glitch in the software, or perhaps a broken configuration, or perhaps the memory is underrated, et cetera.

But what of the graphics performance, I hear you cry! The situation here is a bit more complex. All of those processors were using the same SSE2 backend in pixman, which will only be improved upon with the introduction of AVX2 in Haswell, and so we were directly comparing slight variations of processor design executing the same software. When we look at GPU performance, not only do we have a wide variation of processor design, feature set and instruction sets, we also by necessity have different software for each.

If we look at the current driver situation, that is using UXA:

Relative GPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by using SNA on the Core2 Q8400 with a q35 GPU (a gen3 device like that found in the Pineview netbook):

Q9550: 4.3x slower
i5-2500: 1.5x slower
i5-2520m: 3.0x slower
i7-3720qm: 2.3x slower

Despite almost a doubling of CPU power and an even greater increase in the GPU performance across the generations, the drivers continue to do a disservice to the hardware and ourselves.

Lots of numbers from running cairo-traces on an i5-2500, which has a Sandybridge GT1 GPU (more confusingly called HD2000), and also a Radeon HD5770 discrete GPU.

Performance results

The key results are the geometric means of all the traces, compared to using a software rasteriser:

UXA (snb): 1.6x slower
glamor (snb): 2.7x slower
SNA (snb): 1.6x faster
GL (snb): 2.9x slower

EXA (r600g): 3.8x slower
glamor (r600g): 3.0x slower
GL (r600g): 2.7x slower

fglrx (xlib): 4.7x slower
fglrx (GL): 32.0x slower [not shown as it makes the graphs even more difficult to read]

All bar one of the acceleration methods are worse (performance, power, latency, by any metric) than simply using the CPU and rendering directly within the client. Note also that software rasterisation is currently more performant than trying to use the GPU through the OpenGL driver stack.

(All software, except for the xserver which was held back to keep glamor working, from git as of 20121228.)

So Michel Dänzer just pushed some patches to enable using glamor from within the xf86-video-ati driver. As a quick recap, glamor is a generic Xorg driver library that translates the 2D requests used to render X into OpenGL commands. The argument is that the driver teams then need only concentrate on bringing up the OpenGL stack and gain a functioning display server in the process. The counter argument is that this compromise in saving engineering time penalises the performance of the display server.

To highlight that last point, we can look at the performance of the intel driver with and without glamor, and rendering directly with OpenGL:

glamor on SandyBridge

The centre baseline is the performance of simply using the CPU and pixman to render; above that we are faster and below slower. The first bar is the performance of using OpenGL directly. In theory this should provide the best performance of all, only being limited by hardware. Sadly, the graph shows the stark reality that undermines using glamor – one needs an OpenGL driver that has been optimized for 2D usage in order to maximise GPU performance with the Xorg workload. Note the areas where glamor does better than the direct usage in cairo-gl? This is where glamor itself attempts to mitigate poor buffer management in the driver.

Enter a different GPU and a different driver. The whole balance of CPU to GPU power shifts along with the engineering focus. Everything changes.

Taking a look at the same workloads on the same computer, but using the discrete Radeon HD5770 rather than the integrated processor graphics:

glamor on Radeon HD5770

Perhaps the first thing that we notice is the raw power of the discrete graphics as exposed by using OpenGL directly from within cairo. Secondly, we notice the lacklustre performance of the existing EXA driver for the Radeon chipset – remember everything below the line implies that the GPU driver in Xorg is behaving worse than could be achieved just through client-side software rendering, that using RENDER acceleration is nothing of the sort. And then our attention turns to the newcomer, glamor on radeon. It is still notably much slower than both the CPU and using OpenGL directly. However, it is of very similar performance to the existing EXA driver, sometimes slower, sometimes faster (if you look at the relative x11perf, then it reveals some areas where the EXA driver could make major improvements).

glamor on Radeon HD5770

Not bad for the first patch with an immature library, and demonstrates that glamor can be used to reduce the development cost of bringing up a new chipset – yet does not reach the full potential of the system. Judging by the last graph, one does wonder whether glamor is even preferable to using xf86-video-modesetting in such cases, on a high performance multicore system, for the time being, at least. ;-)

With a couple of embarrassing bug fixes and the incremental evolution of the MSAA compositor for the cairo-gl backend, it is almost time for a new bugfix release – just one more bug to squash first.

In the meantime, I looked at the performance of an older machine, a Pentium-III mobile processor with an 855gm (a gen2 Intel GPU). It is not a platform known for its high performance…

UXA vs SNA performance on 855gm

Outside of the singularly disappointing performance of the swap-happy ocitysmap run, SNA performs adequately, delivering performance at least as good as client-side rendering and often better. So even on these old machines we can run current applications fluidly, if not exactly with blistering performance. In contrast, with UXA reasonable performance is the exception rather than the rule, with many scenarios where the driver is a CPU hog killing interactivity.

A month in, and we’re starting to get some good feedback from people trying cairo-1.12. Unfortunately, it appears that we’ve exposed some bugs in a few drivers, though hopefully we will see driver updates to resolve those issues shortly. We also received a few bug reports against Cairo itself. The most bizarre perhaps was that LibreOffice failed to display the presentation slideshow. That turned out to be an inadvertent bug caught by better error detection – though since that affected the established API, we had to relax the checks somewhat. Along with a few other bug fixes, Cairo 1.12.2 was released.

In the last round of benchmarking I performed, some of you noticed that the glamor backend for the Intel driver was not included. It was left out for the simple reason that it was not able to complete the task. However with the first stable release of glamor-0.4, it is able to complete a benchmarking run. And so without further ado, let’s see how all the current drivers fare on a SandyBridge i5-2500 desktop with GT1 integrated graphics and a Radeon HD 5770 discrete GPU with cairo-1.12.2.

Performance of cairo-1.12.2 on i5-2500

This time the results are normalized to the performance with Xvfb. Any driver that performs a test faster is above the centre, any that is slower is below. Again the general state of the drivers leaves much to be desired, and despite the bold claims for glamor, in my testing it fails to improve upon UXA. Early days you might say.

I do have a SandyBridge desktop. This system is a bit special as it is the one in which I can run a discrete card in direct comparison to the integrated processor graphics, i.e. run both GPUs simultaneously (but not yet co-operatively). In this system I put a Radeon HD5770 which at the time was a mid-range GPU offering good value for performance. So how does Cairo fare on this box? In particular are the DDX drivers any better than pixman, the backend for Cairo’s software rasteriser?

Cairo performance on a Radeon HD5770

Once again the white baseline in the centre represents the performance of the image backend, how fast we expect to be able to render the contents ourselves. Above that baseline, the driver is faster; below slower, laggy and power hungry. This is quite a busy graph as it compares the performance of the proprietary fglrx driver and the open-source radeon driver along with the alternative acceleration methods for the integrated graphics. I’d recommend viewing at full size, but the gist of it is that in many cases the proprietary driver lags behind the open-source effort. Neither is truly comparable to just using the CPU and only the open-source effort is ever faster. In contrast, we have the integrated processor graphics. Even this GT1 desktop system (only half the GPU capability of a GT2 system such as found on mobiles and on a few desktop chips) can outclass the CPU. When not limited by a poor driver, that is.
