Skip navigation

Category Archives: 2D

A month in, and we’re starting to get some good feedback from people trying cairo-1.12. Unfortunately, it appears that we’ve exposed some bugs in a few drivers, though hopefully we will see driver updates to resolve those issues shortly. We also received a few bug reports against Cairo itself. The most bizarre perhaps was that LibreOffice failed to display the presentation slideshow. That turned out to be an inadvertent bug caught by better error detection – though since that affected the established API, we had to relax the checks somewhat. Along with a few other bug fixes, Cairo 1.12.2 was released.

In the last round of benchmarking I performed, some of you noticed that the glamor backend for the Intel driver was not included. It was left out for the simple reason that it was not able to complete the task. However with the first stable release of glamor-0.4, it is able to complete a benchmarking run. And so without further ado, let’s see how all the current drivers fare on a SandyBridge i5-2500 desktop with GT1 integrated graphics and a Radeon HD 5770 discrete GPU with cairo-1.12.2.

Performance of cairo-1.12.2 on i5-2500

This time the results are normalized to the performance with Xvfb. Any driver that performs a test faster is above the centre, any that were slower below. Again the general state of the drivers leave much to be desired, and despite the bold claims for glamor, in my testing it fails to improve upon UXA. Early days you might say.

I do have a SandyBridge desktop. This system is a bit special as it is the one in which I can run a discrete card in direct comparison to the ingrated processor graphics, i.e. run both GPUs simultaneously (but not yet co-operatively). In this system I put a Radeon HD5770 which at the time was a mid-range GPU offering good value for performance. So how does Cairo fare on this box? In particular are the DDX drivers any better than pixman, the backend for Cairo’s software rasteriser?

Cairo performance on a Radeon HD5770

Once again the white baseline in the centre represents the performance of the image backend, how fast we expect to be able to render the contents ourselves. Above that baseline, the driver is faster; below slower, laggy and power hungry. This is quite a busy graph as it compares the performance of the propietary fglrx driver and the open-source radeon driver along with the alternative acceleration methods for the integrated graphics. I’d recommend viewing at full size, but the gist of it is that in many cases the propietary driver lags behind the open-source effort. Neither are truly comparable to just using the CPU and only the open-source effort is ever faster. In contrast, we have the integrated processor graphics. Even this GT1 desktop system (only half the GPU capability of a GT2 system such as found on mobiles and on a few desktop chips) can outclass the CPU. When not limited by a poor driver, that is.

18 months is a long time without an update. So why all the excitement now? Well it is a little story of one or two features and lots of performance…

First up, lets compare the relative performance of Cairo versions 1.10 and 1.12 on a recent SandyBridge machine, a i5-2520m, by looking at how long it takes to replay a selection of application traces.
Relative Cairo peformance on Sandybridge

The white line across the centre represents the baseline performance of cairo-1.10. Above that line and that backend is faster with cairo-1.12 for that application, below slower.

Across the board, with one or two exceptions, Cairo version 1.12 is much faster than 1.10. (Those one or two exceptions are interesting and quite unexpected. Analysing those regressions should help push Cairo further.) If we focus on the yellow bar, you can see that the performance of the baseline image backend has been improved consistently, peaking at almost 4 times faster, but on average around 50% faster. Note this excludes any performance gains also made by pixman within the same time frame. The other bars compare the other backends, the xlib backend using the SNA and UXA acceleration methods along with the pure CPU rasteriser (xvfb), and the OpenGL backend.

That tells us that we have made good improvements to Cairo’s performance, but a question that is often asked is whether using XRender is worthwhile and whether we would have better performance and more responsive user interfaces if we were just to use CPU rasterisation (using the image backend) and push those images to the X server, for instance. To answer that we need to look beyond the relative performance of the same backend over time, and instead look at the relative performance of the backends against image. (Note this excludes any transfer overhead, which in practice is mostly negligible.)

Relative performance of X against the image backend on SandyBridge
The answer is then a resounding no, even on integrated graphics, provided you have a good driver. And that lead will be stretched further in favour of using X (and thus the GPU) as the capabilities of the GPU grow faster than the CPU. Whilst we remain GPU bound at least! On the contrary, though it does say that without any driver it would be preferrable to perform client side rasterisation into a shared memory image to avoid he performance loss due to trapezoid passing in the protocol.

Given that we are forced by the XRender protocol to use a sub-optimal rendering method, we should do even better if we cut out the indirect rendering via X and rendered directly onto the GPU using OpenGL. What happens when we try?

Relative performance of X against the image backend on SandyBridge

Not only is the current gl backend overshadowed by the xlib backend, but it fails to even match the pure CPU rasteriser. There are features of the GPU that we are not using yet and are being experimented with, such as using multi-sample buffers and relying on the GPU to perform low quality antialiasing, but as for today it is applying the same methodology as the X driver uses internally and yet performs much worse.

The story is more less the same across all the generations of Intel’s integrated graphics.

Going back a generation to IronLake integrated graphics:
Relative performance of X against the image backend on IronLake

And for the netbooks we use PineView integrated graphics:
Relative performance of X against the image backend on PineView

So the next time your machine feels slow or runs out of battery, it is more than likely to be the fault of your drivers.

So how do we diagnose bugs in the drivers when they all result in either a system crash or just a frozen display?

If it is a GPU hang (check dmesg or Xorg.log for a telltale error message), then the debugging process is gradually getting easier. First the kernel detects that the GPU has stopped processing data. It does this with a simple watch dog that polls the GPU for any activity since the last check. If the active head pointer has not advanced in the last two checks, we declare it hung and attempt to reset the GPU. Currently the reset procedure only works for the more recent chips, i965 and above, but in principle we might be able to recover at least the i915 generation as well. Importantly, just before we reset the GPU state, we record what instructions the GPU was executing at the time of the hang and some associated state. At present, this involves scanning the ringbuffer and list of known buffers for the last batch buffer executed and for wherever the GPU ended up. Those buffers are then copied, preserving them for later inspection, and the reset procedure is begun.

Some time afterwards it becomes apparent that the system is misbehaving. Unfortunately this has a tendency to be a frozen X server or even a crash, though the drivers are beginning to grow a little more resilient to their own bugs. Once noticed, we can copy the error report from /sys/kernel/debug/dri/0/i915_error_state (you might have to run “mount -tdebugfs none /sys/kernel/debug” to gain access to the debug information exported by your kernel, your distribution may even automate this by listening for the whimpers from the kernel) and begin a diagnosis.

Lets turn to a real example that I am currently puzzling over.

Running the i915_error_state through intel_error_decode (from the intel-gpu-tools package) prints:

Time: 1278512451 s 982711 us
PCI ID: 0x27a2
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x02000011
INSTDONE: 0x7ffffa41
busy: Bypass FIFO
busy: Color calculator
busy: Map filter
ACTHD: 0x0de88144
seqno: 0x0004d8fe
Buffers [40]:
0e33e000 16384 00000060 00000000 0004d8ff dirty purgeable
09747000 4096 00000003 00000000 0004d8ff
0de88000 16384 00000048 00000000 0004d8ff dirty purgeable
03dc4000 4096 00000006 00000000 0004d900 dirty purgeable
0e20a000 393216 00000002 00000000 0004d900 dirty purgeable
0e3a0000 16384 00000006 00000000 0004d900 dirty purgeable
01800000 8388608 00000002 00000000 0004d900 P X dirty (name: 64) (fence: 2)
01000000 8388608 00000006 00000000 0004d901 P X dirty (name: 1) (fence: 0)
0e360000 16384 00000060 00000000 0004d901 dirty purgeable
03dc5000 4096 00000002 00000002 0004d901 dirty purgeable
0e374000 16384 00000048 00000000 0004d901 dirty purgeable
03e00000 4096 00000002 00000002 0004d902 dirty
03dff000 4096 00000044 00000000 0004d902 dirty
07a74000 16384 00000060 00000000 0004d902 dirty purgeable
0e358000 16384 00000048 00000000 0004d902 dirty purgeable
0e335000 4096 00000045 00000000 0004d903 dirty
0e336000 4096 00000045 00000000 0004d903 dirty
03dc6000 4096 00000006 00000002 0004d903 dirty
0e35c000 16384 00000060 00000000 0004d903 dirty purgeable
03dc7000 229376 00000002 00000002 0004d903 dirty
0e3e3000 16384 00000048 00000000 0004d903 dirty purgeable
097ce000 229376 00000002 00000002 0004d904 dirty
0e37c000 16384 00000060 00000000 0004d904 dirty
0e388000 16384 00000048 00000000 0004d904 dirty
04800000 8388608 00000002 00000002 0004d905 P X dirty (name: 3) (fence: 6)
04000000 8388608 00000005 00000000 0004d905 X (name: 10)
07b81000 32768 00000060 00000000 0004d905 dirty purgeable
03548000 327680 00000006 00000000 0004d905 dirty (name: 8)
02193000 163840 00000006 00000000 0004d905 dirty (name: 12)
021bb000 163840 00000006 00000000 0004d905 dirty (name: 15)
03598000 196608 00000006 00000000 0004d905 dirty (name: 18)
0a000000 8388608 00000005 00000000 0004d905 X (name: 19)
0be87000 327680 00000006 00000000 0004d905 dirty (name: 63)
0be4f000 114688 00000006 00000000 0004d905 dirty (name: 79)
0be6b000 114688 00000006 00000000 0004d905 dirty (name: 82)
0f000000 8388608 00000005 00000000 0004d905 X (name: 86)
0fe86000 327680 00000006 00000000 0004d905 dirty (name: 49)
0fe36000 163840 00000006 00000000 0004d905 dirty (name: 52)
0fe5e000 163840 00000006 00000000 0004d905 dirty (name: 55)
07bdf000 16384 00000048 00000000 0004d905 dirty purgeable
batchbuffer at 0x0de88000:
0x0de88000: 0x66014140: 3DSTATE_ANTI_ALIASING
0x0de88004: 0x6ba008a1: 3DSTATE_INDEPENDENT_ALPHA_BLEND
0x0de88008: 0x7d990000: 3DSTATE_DEFAULT_DIFFUSE
0x0de8800c: 0x00000000: dword 1
0x0de88010: 0x7d9a0000: 3DSTATE_DEFAULT_SPECULAR
0x0de88014: 0x00000000: dword 1
0x0de88018: 0x7d980000: 3DSTATE_DEFAULT_Z
0x0de8801c: 0x00000000: dword 1
0x0de88020: 0x76fac688: 3DSTATE_COORD_SET_BINDINGS
0x0de88024: 0x6700a770: 3DSTATE_RASTERIZATION_RULES
0x0de88028: 0x6db3ffff: 3DSTATE_MODES_4
0x0de8802c: 0x7d040382: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0de88030: 0x00000000: S3
0x0de88034: 0x009020c0: S4
0x0de88038: 0x00000000: S5
0x0de8803c: 0x7c800002: 3DSTATE_SCISSOR_ENABLE
0x0de88040: 0x7d810001: 3DSTATE_SCISSOR_RECTANGLE
0x0de88044: 0x00000000: dword 1
0x0de88048: 0x00000000: dword 2
0x0de88050: 0x7d070000: 3DSTATE_LOAD_INDIRECT
0x0de88054: 0x00000000: MI_NOOP
0x0de88058: 0x7d830000: 3DSTATE_SPAN_STIPPLE
0x0de8805c: 0x00000000: dword 1
0x0de88060: 0x68000002: 3DSTATE_BACKFACE_STENCIL_OPS
0x0de88064: 0x00000000: MI_NOOP
0x0de88068: 0x7d000003: 3DSTATE_MAP_STATE
0x0de8806c: 0x00000001: mask
0x0de88070: 0x01000000: map 0 MS2
0x0de88074: 0x63f3fd92: map 0 MS3 [width=1280, height=800, tiling=X]
0x0de88078: 0xffe00000: map 0 MS4 [pitch=8192]
0x0de8807c: 0x7d010003: 3DSTATE_SAMPLER_STATE
0x0de88080: 0x00000001: mask
0x0de88084: 0x00000000: sampler 0 SS2
0x0de88088: 0x00004820: sampler 0 SS3
0x0de8808c: 0x00000000: sampler 0 SS4
0x0de88090: 0x7d8e0001: 3DSTATE_BUFFER_INFO
0x0de88094: 0x03000040: dword 1
0x0de88098: 0x03dc5000: dword 2
0x0de8809c: 0x7d850000: 3DSTATE_DEST_BUFFER_VARIABLES
0x0de880a0: 0x00880300: a8r8g8b8 format, early Z disabled
0x0de880a4: 0x7d800003: 3DSTATE_DRAWING_RECTANGLE
0x0de880a8: 0x00000000: dword 1
0x0de880ac: 0x00000000: dword 2
0x0de880b0: 0x00000000: dword 3
0x0de880b4: 0x00000000: dword 4
0x0de880b8: 0x7d040441: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0de880bc: 0xfffffff0: S2
0x0de880c0: 0x00008214: S6
0x0de880c4: 0x7d050008: 3DSTATE_PIXEL_SHADER_PROGRAM
0x0de880c8: 0x19083c00: PS000: DCL T0.xyzw
0x0de880cc: 0x00000000: PS000
0x0de880d0: 0x00000000: PS000
0x0de880d4: 0x19180000: PS001: DCL S0 2D
0x0de880d8: 0x00000000: PS001
0x0de880dc: 0x00000000: PS001
0x0de880e0: 0x15200000: PS002: TEXLD oC, S0, T0
0x0de880e4: 0x01000000: PS002
0x0de880e8: 0x00000000: PS002
0x0de880ec: 0x7d040031: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0de880f0: 0x0e33e000: S0
0x0de880f4: 0x04040000: S1
0x0de880f8: 0x7f9c0003: 3DPRIMITIVE sequential indirect RECTLIST, 3 starting from 0
0x0de880fc: 0x00000000: start
0x0de88100: 0x54f00006: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 0, dst tile 0)
0x0de88104: 0x03cc0080: format 8888, dst pitch 128, clipping disabled
0x0de88108: 0x00000000: dst (0,0)
0x0de8810c: 0x00160016: dst (22,22)
0x0de88110: 0x03dc6000: dst offset 0x03dc6000
0x0de88114: 0x00090396: src (918,9)
0x0de88118: 0x00000040: src pitch 64
0x0de8811c: 0x03dc5000: src offset 0x03dc5000
0x0de88120: 0x54f00006: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 0, dst tile 0)
0x0de88124: 0x03cc0080: format 8888, dst pitch 128, clipping disabled
0x0de88128: 0x00000000: dst (0,0)
0x0de8812c: 0x00160016: dst (22,22)
0x0de88130: 0x03dc6000: dst offset 0x03dc6000
0x0de88134: 0x00000000: src (0,0)
0x0de88138: 0x00000080: src pitch 128
0x0de8813c: 0x09747000: src offset 0x09747000
0x0de88140: 0x02000011: MI_FLUSH
0x0de88144: HEAD 0x05000000: MI_BATCH_BUFFER_END

The real question is what exactly that tells us and do we get any closer to finding the bug?
As can be seen, not a lot happens in this batch buffer. In fact just 3 operations. One copy using the 3D pipeline, with a lot of preamble, and two copies using the 2D blitter. If you look even closer, you will go mad.

Operation 1: 3D copy.
Destination: linear buffer 03dc5000, size 4096 bytes, pitch 64 bytes, drawing rectangle (0,0), (1, 1)
Source: X tiled buffer 01000000 size 8388608 bytes, pitch 8192 bytes, 1280×800 pixels. This looks like X’s front buffer.
What is actually copied to the 1×1 pixel is a mystery as we do not follow auxiliary buffers yet and so can not decode the vertices used. Presumably it is something a little more complicated than a mere blit as a texture sampler is required – though this may just be something as trivial as ensuring the alpha channel is initialized.

Operation 2: 2D copy.
Destination: linear buffer 03dc6000, size 4096 bytes, pitch 128 bytes.
Source: linear buffer 03dc5000, size 4096 bytes, pitch 64. From above we know its size to be 1×1 pixels.
Copy: (918, 9), (940, 31) to (0, 0), (22, 22).

Operation 3: 2D copy.
Destination: linear buffer 03dc6000, size 4096 bytes, pitch 128 bytes.
Source: linear buffer 09747000, size 4096 bytes, pitch 128 bytes.
Copy: (0, 0), (22, 22) to (0, 0), (22, 22).

So which operation appears entirely superfluous and yet accessed data far beyond the end of its buffer? That’s right the second copy. For a 1×1 surface, attempting to copy a 22×22 region is a bit strange, doing it 900 pixels outside the surface is just adding insult to injury. Not to mention that the user did not even want the data that we spent so much time fetching!

But where does that leave me? The impossible happened and now I have evidence to prove it! Oh, and a loathing for graphics libraries…