After checking on the progress of Glamor for Intel chipsets, it is time to have a look and see what the state of play is for a Radeon HD5770. This card is now a few years old and is sitting in a Sandybridge i5-2500 desktop.


Comparison of DDX rendering with a Radeon HD5770

The baseline I have chosen here is the performance of the Intel DDX using SNA but with acceleration disabled – that is, rendering is done entirely on the i5-2500 CPU.

In comparison, we find on average that

  • NoAccel is 1.8x slower
  • fglrx is 9.2x slower
  • EXA is 2.9x slower
  • Glamor is 2.0x slower

Or to put a positive spin on it, the new Glamor acceleration on this particular r600g device is about 50% faster than the existing EXA radeon driver. If you look closely, there are just a couple of traces where EXA performs better than Glamor; with those regressions fixed, Glamor would be a clear improvement for radeon. And almost as fast as not using Glamor at all! However, Glamor was not able to complete the benchmark run without crashing.

These results are for a particular set of benchmarks based on Cairo traces taken from real applications. If we look at synthetic benchmarks instead, Glamor is significantly faster than EXA in several key metrics, and fglrx is much faster again. Always take benchmarks with a pinch of salt.

So I have a new toy, an i7-4950hq processor. This little beast is one of the special Intel chips sporting an Iris Pro 5200, better known as Haswell GT3e. That GPU has 40 execution units and 128MiB of eDRAM to serve as a fourth-level cache for both the CPU and GPU.

Enough spiel, just how fast is it?

For context, here are some results comparing it with my old Sandybridge laptop (with an i5-2520m).

Comparing the processors using the single-threaded cairo-image:

Comparison of i7-4950hq to i5-2520m

and again comparing the GPUs, using SNA and cairo-xlib:

Comparison of i7-4950hq to i5-2520m

On the whole, we see a two-fold increase in both single-threaded CPU performance and GPU performance (for 2D graphics using cairo) in the jump from a Sandybridge i5-2520m to a Haswell i7-4950hq. In most cases SNA is being limited by how fast the application can feed it commands, and so the performance increase is mostly due to the same improvement in CPU speed. (This increase is above and beyond the expected improvements due to IPC, so it is more likely down to the ability of the Haswell chip to turbo higher and for longer thanks to improved thermals and cooling.)

We can also weigh the relative merits of using OpenGL against a specialised 2D driver by comparing the various rendering backends available for the DDX. The results are normalized to the cairo-image results, and we have

  • none – a multithreaded CPU renderer inside the DDX
  • blt – disable the render acceleration, but allow the DDX to use the BLT engine to move data about, i.e. copies and fills
  • sna – SNA render acceleration, default in xf86-video-intel-3.0
  • uxa – UXA render acceleration, current default
  • glamor – Glamor render acceleration, uses OpenGL to offload rendering operations onto the GPU

Comparison of DDX backends on an i5-2520m

Comparison of DDX backends on an i7-4950hq

The summary here is that Glamor offers a meagre improvement over UXA. However, both are still much slower on average than cairo-image, i.e. the performance attainable by using a single CPU core. It takes multiple threads inside the DDX to match the performance of cairo-image – this is due to the inherent inefficiencies of the current Render protocol. However, if we then utilize the render acceleration on the GPU (using SNA) we can indeed outperform cairo-image: on average about 2x faster than cairo-image, and about 4x faster than UXA and Glamor. Thus SNA does deliver hardware acceleration that succeeds in offloading work onto the GPU (letting the CPU get on with other tasks) and performs faster than rendering everything with the CPU.

Søren Sandmann Pedersen, maintainer of the pixman library that is at the heart of the software rasteriser in Cairo and X, argues that, all things being equal, an integrated graphics processor should never be able to outperform the main processor for rendering Cairo. Ideally, when using either the GPU or CPU, we want the memory to be the rate-limiting component. On the Sandybridge processor, the memory is attached to the system agent, along with the large L3 caches, and is shared between the CPU and GPU. So he argues that any performance advantage cairo-xlib might show over cairo-image whilst using Intel graphics is really an opportunity for improvement of the rasteriser and pixman kernels. (But also bear in mind that the GPU should be considerably more efficient, and be able to saturate the memory bus at a fraction of the power cost of the CPU.)

Using SSE2, we are indeed able to fully saturate the bus to L1, L2 and main memory for a variety of common operations in pixman. However, the bilinear scaling kernels take longer to compute than it does to write the results to memory, and so fail to completely utilize the available bandwidth. Furthermore, for a slightly more complex transformation than a simple scale and translation, we do not yet have a fast SSE2 kernel that operates directly on the destination and so must incur some extra work. If we ignore, for the time being, making further optimisations to those kernels, we can instead ask the question: if using one SSE2 unit is not enough to saturate the memory bus, what happens if we throw more cores at the problem?
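Before turning to threads, a rough illustration of the kind of kernel under discussion: a solid fill over a 32bpp scanline using SSE2 non-temporal stores. This is a simplified sketch, not pixman's actual code.

    #include <emmintrin.h> /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* Fill a 32bpp scanline with a solid colour. The non-temporal
     * stores bypass the cache, so the fill runs at close to full
     * memory bandwidth instead of polluting L1/L2 on the way out. */
    static void
    fill_scanline_sse2(uint32_t *dst, uint32_t pixel, size_t width)
    {
        __m128i xmm = _mm_set1_epi32(pixel);

        /* Align the destination to 16 bytes for the streaming stores. */
        while (((uintptr_t)dst & 15) && width) {
            *dst++ = pixel;
            width--;
        }

        while (width >= 4) {
            _mm_stream_si128((__m128i *)dst, xmm);
            dst += 4;
            width -= 4;
        }

        while (width) {
            *dst++ = pixel;
            width--;
        }

        _mm_sfence(); /* make the streaming stores globally visible */
    }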

To gain the most improvement from adding threads to cairo, you need to design the rasteriser and usage model with threading in mind. One such design is the vector renderer O by Øyvind Kolås. Despite being an experiment, it does show quite a bit of promise, but in its raw form just throwing threads at the problem does not beat using the SIMD compositing routines provided by pixman. However, it did raise the question of whether we can make improvements to the existing image backend without impacting upon its immediate-mode nature, so that it could be used by existing applications without alteration. To preserve the existing semantics, we can break up the individual composite and scan-conversion operations into small pieces and feed those to a pool of threads, then wait for the threads to complete before returning to the application. As such, the threads never run for very long, and we risk that the overhead of thread management outweighs any benefit from splitting the operation over multiple cores.
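A minimal sketch of that scheme, assuming a hypothetical composite_band() that performs the compositing over a horizontal band of the destination:

    #include <pthread.h>

    #define NUM_THREADS 4

    struct composite_op; /* opaque: source, mask, destination, operator, ... */

    /* Hypothetical routine that composites rows [y0, y1) of the
     * destination; stands in for the real per-band kernel. */
    extern void composite_band(const struct composite_op *op, int y0, int y1);

    struct band {
        const struct composite_op *op;
        int y0, y1;
    };

    static void *band_thread(void *arg)
    {
        struct band *b = arg;
        composite_band(b->op, b->y0, b->y1);
        return NULL;
    }

    /* Split the operation into horizontal bands, one per thread, and
     * block until they all complete so the immediate-mode semantics
     * seen by the application are unchanged. */
    static void
    composite_threaded(const struct composite_op *op, int height)
    {
        pthread_t tid[NUM_THREADS];
        struct band bands[NUM_THREADS];
        int i, step = (height + NUM_THREADS - 1) / NUM_THREADS;

        for (i = 0; i < NUM_THREADS; i++) {
            bands[i].op = op;
            bands[i].y0 = i * step < height ? i * step : height;
            bands[i].y1 = (i + 1) * step < height ? (i + 1) * step : height;
            pthread_create(&tid[i], NULL, band_thread, &bands[i]);
        }

        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);
    }

(Creating and joining threads for every operation is exactly the management overhead mentioned above; a real implementation would keep a persistent pool of workers and hand the bands over a queue.)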

To gain the greatest advantage from adding threads, I used a desktop Sandybridge processor (an i5-2500) for benchmarking this prototype cairo-image backend:

Comparison of multiple threads

About half the test cases regress by about 10%, but around one third are improved by 2-3x. Not bad – the regressions are unacceptable, but it does offer a tantalising suggestion that we can make some significant improvements with only a minor change. (Remember not all processors are equal, and thanks to Sandybridge’s turbo some cores are more equal than others. I tried running the test cases on a Sandybridge ultrabook, and as soon as more than one core was active, the entire package was throttled to keep it within its thermal constraints. The performance regression was dire.)

Having tuned the software backend to make better use of the available resources, what can we now say about the relative merits of the integrated graphics processor?

Comparison of multiple threads and SNA

Indeed, for the cases that are almost entirely GPU bound (for example the firefox-fishbowl, -fishtank, -paintball, -particles), we have virtually eliminated all the previous advantage that the GPU held. In a notable couple of cases, we have improved the image backend to outperform SNA, and for all cases now the threaded image backend beats UXA. However, as can be seen there is still plenty of room for improvement of the image backend, and we can’t let the hardware acceleration be merely equal to a software rasteriser…

The Nvidia Ion GPU was released a few years back to cater for the low-power netbook market, as a substantial upgrade over the anemic Intel GMA950 (or 945gm as it is better known to us) that shipped as the integrated graphics processor in the first Atoms. Those made quite nice machines with long runtimes, a touch on the slow side compared to full-grown laptops, but actually quite comparable to the current generation of premium tablets many years later. People still use them, and I get the occasional request to see how well they perform every time Nvidia releases a new driver. So let’s take a look:

Performance of Nvidia Ion with the 313.18 driver release

In nearly every test there is a small improvement, of around 2-3%. But for the paintball HTML5 demo, they have managed to fix some form of resource starvation and have been able to accelerate it, giving an order-of-magnitude speed increase. Strangely, the nvidia driver fares worse on average than the nouveau driver, despite having substantially better performance in several cases. This is because the nouveau driver is, at least, consistently slow, whereas the nvidia driver also experiences a few severe slowdowns and so has a much more mixed set of results.

As is always the case, every time you try to compare implementations, you find completely unexpected bugs. The idea was simple: look at the performance of the last few generations and see how we’ve been improving.

To start with, I have a Core2 Quad, Q8400 at 2.66GHz (fixed frequency) and a venerable 3rd generation Intel GPU (a q35 to be precise). This is a 95W desktop beast, only to be beaten by a slightly older Q9550 Core2 Quad at 2.83GHz sporting a 4th generation GPU, a g45.

From the current generation of CPU design, I have two mobile chips, a Sandybridge i5-2520m at up to 3.2GHz and an Ivybridge i7-3720qm at up to 3.6GHz. These mobile chips run much cooler than their desktop brethren at 35W and 45W respectively, and both have the GT2 variants of the 6th and 7th generation Intel GPUs.

To put those two into perspective, I have also added the results from a desktop Sandybridge chip, the 95W i5-2500 which runs up to 3.7GHz. On the downside this processor only has a GT1 6th generation GPU.

Relative CPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by the Core2 Q8400:

  • Q9550: 1.6x faster
  • i5-2500: 2.3x faster
  • i5-2520m: 1.7x faster
  • i7-3720qm: 1.4x faster

The success story is that the 35W mobile processors really do deliver the performance of the 95W desktop beasts of the previous generation, which is quite frankly very impressive. Of course, they are still dwarfed by their desktop brethren – now I really do want to get my hands on an i7-3770k!

The oddity is then: why is my Ivybridge underperforming? Its single-threaded performance, which is what is under test here, should not be so limited; the base clock is a tiny bit faster than the Sandybridge i5’s, and with the larger and improved caches it should achieve better IPC as well. In theory it should be coupled to faster main memory too. My guess is that it is being throttled due to poor cooling. Or that there is a glitch in the software, or perhaps a broken configuration, or perhaps the memory is underrated, et cetera.

But what of the graphics performance, I hear you cry! The situation here is a bit more complex. All of those processors were using the same SSE2 backend in pixman, which will only be improved upon with the introduction of AVX2 in Haswell, so we were directly comparing slight variations of processor design executing the same software. When we look at GPU performance, not only do we have a wide variation of processor design, feature set and instruction sets, we also by necessity have different software for each.

If we look at the current driver situation, that is using UXA:

Relative GPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by using SNA on the Core2 Q8400 with its q35 GPU (a gen3 device like that found in Pineview netbooks):

  • Q9550: 4.3x slower
  • i5-2500: 1.5x slower
  • i5-2520m: 3.0x slower
  • i7-3720qm: 2.3x slower

Despite almost a doubling of CPU power and an even greater increase in the GPU performance across the generations, the drivers continue to do a disservice to the hardware and ourselves.

Lots of numbers from running cairo-traces on an i5-2500, which has a Sandybridge GT1 GPU (more confusingly known as the HD 2000), and also a Radeon HD5770 discrete GPU.

Performance results

The key results are the geometric means over all the traces, as compared to using a software rasteriser:

  • UXA (snb): 1.6x slower
  • glamor (snb): 2.7x slower
  • SNA (snb): 1.6x faster
  • GL (snb): 2.9x slower

  • EXA (r600g): 3.8x slower
  • glamor (r600g): 3.0x slower
  • GL (r600g): 2.7x slower

  • fglrx (xlib): 4.7x slower
  • fglrx (GL): 32.0x slower [not shown as it makes the graphs even more difficult to read]

All bar one of the acceleration methods are worse (in performance, power and latency, by any metric) than simply using the CPU and rendering directly within the client. Note also that software rasterisation is currently more performant than trying to use the GPU through the OpenGL driver stack.

(All software, except for the xserver which was held back to keep glamor working, from git as of 20121228.)

The introduction of KMS and GEM into the i915 driver broke the i830/i845 chipsets, and a lot of hearts. But fear not! A decade after their introduction, we finally have a driver that is not only stable, but capable of accelerating firefox.

The problem?

The problem was, simply, we could not find a way to enable dynamic video memory on the ancient i830/i845 chipsets without it eventually eating garbage. Since dynamic memory management was the raison d’etre of GEM and critical for acceleration, it is a requirement of the current driver stack. The first cunning solution was simply never to reuse batch buffers, and keep a small amount of memory reserved for our usage. This stopped the command streamer from seeing the garbage, and my system has remained stable for many hours of thrashing. Daniel Vetter extended my solution to implement a kernel workaround whereby every batch would be copied into a reserved area before execution. In the end, we compromised so that I could avoid that extra copy and assume responsibility in the driver for ensuring the batch was coherent, but the kernel would intervene for any non-cooperative driver.
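The userspace side of that first workaround looks roughly like this – a sketch with hypothetical helper names, not the actual driver code:

    struct batch;
    struct device { int gen2_incoherent; /* set on i830/i845 */ };

    extern struct batch *pop_free_batch(struct device *dev);
    extern struct batch *alloc_batch_from_reserve(struct device *dev);

    /* On i830/i845 a batch buffer is never recycled once it has been
     * submitted, so the command streamer can never observe stale
     * (garbage) contents in a reused buffer; later chipsets are free
     * to reuse batches as normal. */
    static struct batch *get_batch(struct device *dev)
    {
        struct batch *batch;

        if (!dev->gen2_incoherent && (batch = pop_free_batch(dev)))
            return batch; /* safe to recycle on sane chipsets */

        /* Always hand out a fresh buffer, carved from the small
         * region we keep reserved for this purpose. */
        return alloc_batch_from_reserve(dev);
    }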

With these workarounds in place, we are finally able to run through the test suites. Which brought us to the next problem:

UXA vs software rasterisation on 845g

The sad fact is that UXA is inadequate for the challenge of accelerating the Render protocol.

If we compare with an architecture that was designed to accelerate cairo, SNA:

SNA vs software rasterisation on 845g

We find a much happier result. In all cases the performance is at least as good as using a software rasteriser in the X server, and often much better than if we avoided the Render protocol entirely and did the rasterisation in the client. With a little more tuning, we may be able to achieve parity even in the worst case – if we can win on an old GPU with an ancient CPU (single core, virtually no cache and even less memory bandwidth) we should be able to excel on more recent GPUs and CPUs, and be more efficient in the process.

Yet, the Render protocol is not the be-all-and-end-all of acceleration. We need to keep an eye on the basics as well, the copies, the fills and the uploads, to know if we are achieving our goals. The basic premise is that using the driver (and thus the GPU) is faster than just using the CPU for everything. (In reality, the choice is more complicated because we have to consider the efficacy of GPU offload for enabling the CPU to get on with other tasks and overall power efficiency.)

1: Baseline performance of Xvfb, in operations per second
2: SNA with acceleration disabled (shadow), speedup relative to Xvfb
3: UXA, speedup relative to Xvfb
4: SNA, speedup relative to Xvfb

     1        2       3       4    Operation
--------  ------  ------  ------   ---------
277000.0    2.04    1.06    4.58   Char in 80-char aa line (Charter 10) 
265000.0    2.15    1.11    4.83   Char in 80-char rgb line (Charter 10) 
312000.0    0.66    0.15    1.38   Copy 10x10 from window to window 
  6740.0    0.90    1.56    1.75   Copy 100x100 from window to window 
   382.0    0.92    1.30    1.36   Copy 500x500 from window to window 
268000.0    0.74    0.17    1.50   Copy 10x10 from window to pixmap 
  7260.0    0.87    1.43    1.85   Copy 100x100 from window to pixmap 
   376.0    0.94    1.28    1.37   Copy 500x500 from window to pixmap 
154000.0    0.74    0.69    0.86   PutImage 10x10 square 
  1880.0    1.04    1.05    1.04   PutImage 100x100 square 
    87.1    1.03    1.02    1.01   PutImage 500x500 square 
308000.0    0.58    0.46    0.66   ShmPutImage 10x10 square 
  6500.0    1.02    1.16    1.24   ShmPutImage 100x100 square 
   380.0    1.00    1.24    1.28   ShmPutImage 500x500 square 

So it appears that using the GPU for basic operations such as moving windows about is at most a marginal win over using a shadow buffer (and oftentimes UXA fails at even that). Overall, then, it seems that enabling UXA should bring nothing but misery.

Last week saw a new release of glamor, a library for use by 2D display drivers that translates Render drawing commands into OpenGL (and then feeds them into a 3D driver). This release packs in a couple of standout features for cairo: trapezoid shaders and the ability to handle textures larger than supported by the hardware. The development emphasis has been on performance, and indeed glamor-0.5 is much improved over glamor-0.4 as measured on Intel’s latest and greatest Ivybridge architecture.

Performance improvement of glamor 0.5 over 0.4

This is a graph of absolute time for each trace (shorter bars are quicker), and as can be seen the improvement is pretty good.

However, to keep this in perspective, let’s compare against the standard 2D driver:
Comparison of glamor 0.5 with the standard 2D driver

Still plenty of room for improvement.

Having looked at the impact of the move from XAA to UMS/EXA and then to KMS/UXA on the performance of the core drawing operations, we can turn to look at the impact upon RENDER acceleration. One of the arguments for dropping XAA support and writing a new architecture was to address the needs of accelerating the new advanced rendering protocol, RENDER. Did the claim live up to reality; did the switch to EXA actually help?

Comparison of acceleration architectures on 965gm

In short, the switch to EXA was catastrophic in the beginning. The new acceleration architecture regressed performance in the core rendering operations and failed to deliver the claimed improvement in RENDER acceleration. Fast forward a few years, through improvements to the RENDER protocol and many improvements to the driver, and we reach UXA, where we start to actually see some benefit from enabling GPU acceleration. But it is not until we look at SNA that we reach a level of maturity and consistency in the driver giving all-round performance that is finally at least as good as XAA again (effectively software rendering in these benchmarks).

An outstanding question that is regularly asked is: what are the design differences between EXA, UXA and SNA?

UXA was originally EXA with the pixmap migration removed. It was argued that, given a unified memory architecture such as that found on an IGP, there was no need to migrate pixmaps between the various GPU memory domains. Instead all pixmap allocations were to be made in the single GPU domain, and if necessary the pixmap would be mapped and read by the CPU through the GTT (effectively an uncached readback, very, very slow).

In hindsight, that decision was flawed. As it turns out, not only do we have both mappable and unmappable memory domains within the IGP, so we cannot simply map any GPU pixmap without cost, but we can also have snoopable GPU memory (memory that can exist in the CPU cache). The single GPU memory domain argument was a fallacy from the start, and even more so with the advent of the shared last-level cache between the CPU and GPU. Also, it tends to be much, much faster to copy from the GPU pixmap into system memory, perform the operation on the copy in system memory and then copy it back, than it is to try to perform the operation through a GTT mapping of a GPU pixmap (more so if you take care to amortize the cost of the migration). Not to mention the asymmetry between upload and download speeds, and that we can exploit snoopable GPU memory to accelerate those transfers. So it turns out that having a very efficient pixmap migration strategy is the core of an acceleration architecture. In this regard SNA is very much like EXA.
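The winning fallback pattern, sketched with hypothetical names:

    struct pixmap;

    extern void *migrate_to_system_memory(struct pixmap *pixmap);
    extern void migrate_back_to_gpu(struct pixmap *pixmap, void *pixels);

    /* Rather than operating in place through a slow, uncached GTT
     * mapping, migrate the pixmap: blit it into (cacheable, ideally
     * snoopable) system memory, run the operation at full CPU speed,
     * and copy the result back. The transfer cost is amortized if
     * several operations hit the copy before it has to move again. */
    static void
    cpu_fallback(struct pixmap *pixmap, void (*op)(void *pixels))
    {
        void *pixels = migrate_to_system_memory(pixmap);

        op(pixels); /* runs out of the CPU cache */

        migrate_back_to_gpu(pixmap, pixels);

        /* The UXA approach, op(gtt_map(pixmap)), reads through an
         * uncached mapping and is dramatically slower. */
    }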

From the design point of view, the only real difference between EXA and SNA is that EXA is a mere midlayer whose existence is to hinder the driver from doing the right thing, and SNA is a complete implementation. Since the devil is in the details (fallbacks are slow, don’t!), that difference turns out to be quite huge.

A few years ago, memory management for the graphics card was performed as a single static allocation in the X server, which was then carved up into surfaces and used for the scanout and for important pixmaps such as renderbuffers for its DRI clients. The upside of this simple scheme was that all locations were known, allocation was very fast and we could always tell the GPU where its surfaces were. The downside was that the amount of video memory was predetermined and could not be resized, not even if you added a second monitor and needed a new framebuffer, or if you were running a game and it wanted lots of textures. So X was relieved of its role in memory management and the task given to the kernel under the guise of GEM, the Graphics Execution Manager. Now userspace has no idea where its surfaces are and so needs to ask the kernel to patch up its command buffers to insert the correct addresses. This relocation of command buffers is a bottleneck in the new design, and from the outset people were complaining about the performance loss in going from XAA to GEM/UXA.
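The relocation entry itself comes from the real i915 uapi; the helper around it is a simplified sketch of how userspace asks for a batch dword to be patched:

    #include <stdint.h>
    #include <string.h>
    #include <drm/i915_drm.h>

    /* Ask the kernel to rewrite the dword at 'batch_offset' in our
     * batch buffer with the final GPU address of 'target' at
     * execbuffer time. If presumed_offset turns out to be correct,
     * the kernel can skip the rewrite entirely - the fast path that
     * all the tuning chases. */
    static void
    emit_reloc(struct drm_i915_gem_relocation_entry *reloc,
               uint64_t batch_offset, uint32_t target,
               uint64_t presumed_offset)
    {
        memset(reloc, 0, sizeof(*reloc));
        reloc->offset = batch_offset;             /* location within the batch */
        reloc->target_handle = target;            /* GEM handle of the surface */
        reloc->presumed_offset = presumed_offset; /* our guess at its address */
        reloc->read_domains = I915_GEM_DOMAIN_RENDER;
        reloc->write_domain = 0;                  /* read-only by the GPU */
    }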

A few years have passed and we’ve been gradually tuning all the parties, but the question remains have we managed to recover that speed which we threw away so long ago?

I compiled Xorg-1.5 and xf86-video-intel-2.6 for my 965gm (a ThinkPad t61) by discarding anything that no longer compiled and was left with a very light shell for investigating XAA in its heyday. I then ran x11perf under the ancient XAA and EXA, and the modern UXA and SNA to see how things had changed.

By looking at the geometric mean of all the x11perf tests we can get a rough feel for the overall performance change against XAA:

  • EXA: 1.117 (12% faster than XAA)
  • UXA: -1.108 (11% slower than XAA)
  • SNA: 2.284 (128% faster than XAA)
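For reference, the geometric mean of n per-test speedup ratios is the nth root of their product, computed more stably as the exponential of the mean of their logarithms (the negative sign in the summary above is just shorthand for a ratio below 1, i.e. 1.108x slower); a minimal sketch:

    #include <math.h>
    #include <stddef.h>

    /* Geometric mean of n speedup ratios: exp(mean(log(r_i))).
     * Summing logarithms avoids overflow/underflow from multiplying
     * hundreds of ratios together. */
    static double
    geometric_mean(const double *ratio, size_t n)
    {
        double sum = 0.0;
        size_t i;

        for (i = 0; i < n; i++)
            sum += log(ratio[i]);

        return exp(sum / n);
    }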

As with all averages of micro-benchmarks take this with an extremely large pinch of salt.