Skip navigation

So how do we diagnose bugs in the drivers when they all result in either a system crash or just a frozen display?

If it is a GPU hang (check dmesg or Xorg.log for a telltale error message), then the debugging process is gradually getting easier. First the kernel detects that the GPU has stopped processing data. It does this with a simple watch dog that polls the GPU for any activity since the last check. If the active head pointer has not advanced in the last two checks, we declare it hung and attempt to reset the GPU. Currently the reset procedure only works for the more recent chips, i965 and above, but in principle we might be able to recover at least the i915 generation as well. Importantly, just before we reset the GPU state, we record what instructions the GPU was executing at the time of the hang and some associated state. At present, this involves scanning the ringbuffer and list of known buffers for the last batch buffer executed and for wherever the GPU ended up. Those buffers are then copied, preserving them for later inspection, and the reset procedure is begun.

Some time afterwards it becomes apparent that the system is misbehaving. Unfortunately this has a tendency to be a frozen X server or even a crash, though the drivers are beginning to grow a little more resilient to their own bugs. Once noticed, we can copy the error report from /sys/kernel/debug/dri/0/i915_error_state (you might have to run “mount -tdebugfs none /sys/kernel/debug” to gain access to the debug information exported by your kernel, your distribution may even automate this by listening for the whimpers from the kernel) and begin a diagnosis.

Lets turn to a real example that I am currently puzzling over.

Running the i915_error_state through intel_error_decode (from the intel-gpu-tools package) prints:

Time: 1278512451 s 982711 us
PCI ID: 0x27a2
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x02000011
INSTDONE: 0x7ffffa41
busy: Bypass FIFO
busy: Color calculator
busy: Map filter
ACTHD: 0x0de88144
seqno: 0x0004d8fe
Buffers [40]:
0e33e000 16384 00000060 00000000 0004d8ff dirty purgeable
09747000 4096 00000003 00000000 0004d8ff
0de88000 16384 00000048 00000000 0004d8ff dirty purgeable
03dc4000 4096 00000006 00000000 0004d900 dirty purgeable
0e20a000 393216 00000002 00000000 0004d900 dirty purgeable
0e3a0000 16384 00000006 00000000 0004d900 dirty purgeable
01800000 8388608 00000002 00000000 0004d900 P X dirty (name: 64) (fence: 2)
01000000 8388608 00000006 00000000 0004d901 P X dirty (name: 1) (fence: 0)
0e360000 16384 00000060 00000000 0004d901 dirty purgeable
03dc5000 4096 00000002 00000002 0004d901 dirty purgeable
0e374000 16384 00000048 00000000 0004d901 dirty purgeable
03e00000 4096 00000002 00000002 0004d902 dirty
03dff000 4096 00000044 00000000 0004d902 dirty
07a74000 16384 00000060 00000000 0004d902 dirty purgeable
0e358000 16384 00000048 00000000 0004d902 dirty purgeable
0e335000 4096 00000045 00000000 0004d903 dirty
0e336000 4096 00000045 00000000 0004d903 dirty
03dc6000 4096 00000006 00000002 0004d903 dirty
0e35c000 16384 00000060 00000000 0004d903 dirty purgeable
03dc7000 229376 00000002 00000002 0004d903 dirty
0e3e3000 16384 00000048 00000000 0004d903 dirty purgeable
097ce000 229376 00000002 00000002 0004d904 dirty
0e37c000 16384 00000060 00000000 0004d904 dirty
0e388000 16384 00000048 00000000 0004d904 dirty
04800000 8388608 00000002 00000002 0004d905 P X dirty (name: 3) (fence: 6)
04000000 8388608 00000005 00000000 0004d905 X (name: 10)
07b81000 32768 00000060 00000000 0004d905 dirty purgeable
03548000 327680 00000006 00000000 0004d905 dirty (name: 8)
02193000 163840 00000006 00000000 0004d905 dirty (name: 12)
021bb000 163840 00000006 00000000 0004d905 dirty (name: 15)
03598000 196608 00000006 00000000 0004d905 dirty (name: 18)
0a000000 8388608 00000005 00000000 0004d905 X (name: 19)
0be87000 327680 00000006 00000000 0004d905 dirty (name: 63)
0be4f000 114688 00000006 00000000 0004d905 dirty (name: 79)
0be6b000 114688 00000006 00000000 0004d905 dirty (name: 82)
0f000000 8388608 00000005 00000000 0004d905 X (name: 86)
0fe86000 327680 00000006 00000000 0004d905 dirty (name: 49)
0fe36000 163840 00000006 00000000 0004d905 dirty (name: 52)
0fe5e000 163840 00000006 00000000 0004d905 dirty (name: 55)
07bdf000 16384 00000048 00000000 0004d905 dirty purgeable
batchbuffer at 0x0de88000:
0x0de88000: 0x66014140: 3DSTATE_ANTI_ALIASING
0x0de88004: 0x6ba008a1: 3DSTATE_INDEPENDENT_ALPHA_BLEND
0x0de88008: 0x7d990000: 3DSTATE_DEFAULT_DIFFUSE
0x0de8800c: 0x00000000: dword 1
0x0de88010: 0x7d9a0000: 3DSTATE_DEFAULT_SPECULAR
0x0de88014: 0x00000000: dword 1
0x0de88018: 0x7d980000: 3DSTATE_DEFAULT_Z
0x0de8801c: 0x00000000: dword 1
0x0de88020: 0x76fac688: 3DSTATE_COORD_SET_BINDINGS
0x0de88024: 0x6700a770: 3DSTATE_RASTERIZATION_RULES
0x0de88028: 0x6db3ffff: 3DSTATE_MODES_4
0x0de8802c: 0x7d040382: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0de88030: 0x00000000: S3
0x0de88034: 0x009020c0: S4
0x0de88038: 0x00000000: S5
0x0de8803c: 0x7c800002: 3DSTATE_SCISSOR_ENABLE
0x0de88040: 0x7d810001: 3DSTATE_SCISSOR_RECTANGLE
0x0de88044: 0x00000000: dword 1
0x0de88048: 0x00000000: dword 2
0x0de88050: 0x7d070000: 3DSTATE_LOAD_INDIRECT
0x0de88054: 0x00000000: MI_NOOP
0x0de88058: 0x7d830000: 3DSTATE_SPAN_STIPPLE
0x0de8805c: 0x00000000: dword 1
0x0de88060: 0x68000002: 3DSTATE_BACKFACE_STENCIL_OPS
0x0de88064: 0x00000000: MI_NOOP
0x0de88068: 0x7d000003: 3DSTATE_MAP_STATE
0x0de8806c: 0x00000001: mask
0x0de88070: 0x01000000: map 0 MS2
0x0de88074: 0x63f3fd92: map 0 MS3 [width=1280, height=800, tiling=X]
0x0de88078: 0xffe00000: map 0 MS4 [pitch=8192]
0x0de8807c: 0x7d010003: 3DSTATE_SAMPLER_STATE
0x0de88080: 0x00000001: mask
0x0de88084: 0x00000000: sampler 0 SS2
0x0de88088: 0x00004820: sampler 0 SS3
0x0de8808c: 0x00000000: sampler 0 SS4
0x0de88090: 0x7d8e0001: 3DSTATE_BUFFER_INFO
0x0de88094: 0x03000040: dword 1
0x0de88098: 0x03dc5000: dword 2
0x0de8809c: 0x7d850000: 3DSTATE_DEST_BUFFER_VARIABLES
0x0de880a0: 0x00880300: a8r8g8b8 format, early Z disabled
0x0de880a4: 0x7d800003: 3DSTATE_DRAWING_RECTANGLE
0x0de880a8: 0x00000000: dword 1
0x0de880ac: 0x00000000: dword 2
0x0de880b0: 0x00000000: dword 3
0x0de880b4: 0x00000000: dword 4
0x0de880b8: 0x7d040441: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0de880bc: 0xfffffff0: S2
0x0de880c0: 0x00008214: S6
0x0de880c4: 0x7d050008: 3DSTATE_PIXEL_SHADER_PROGRAM
0x0de880c8: 0x19083c00: PS000: DCL T0.xyzw
0x0de880cc: 0x00000000: PS000
0x0de880d0: 0x00000000: PS000
0x0de880d4: 0x19180000: PS001: DCL S0 2D
0x0de880d8: 0x00000000: PS001
0x0de880dc: 0x00000000: PS001
0x0de880e0: 0x15200000: PS002: TEXLD oC, S0, T0
0x0de880e4: 0x01000000: PS002
0x0de880e8: 0x00000000: PS002
0x0de880ec: 0x7d040031: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0de880f0: 0x0e33e000: S0
0x0de880f4: 0x04040000: S1
0x0de880f8: 0x7f9c0003: 3DPRIMITIVE sequential indirect RECTLIST, 3 starting from 0
0x0de880fc: 0x00000000: start
0x0de88100: 0x54f00006: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 0, dst tile 0)
0x0de88104: 0x03cc0080: format 8888, dst pitch 128, clipping disabled
0x0de88108: 0x00000000: dst (0,0)
0x0de8810c: 0x00160016: dst (22,22)
0x0de88110: 0x03dc6000: dst offset 0x03dc6000
0x0de88114: 0x00090396: src (918,9)
0x0de88118: 0x00000040: src pitch 64
0x0de8811c: 0x03dc5000: src offset 0x03dc5000
0x0de88120: 0x54f00006: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 0, dst tile 0)
0x0de88124: 0x03cc0080: format 8888, dst pitch 128, clipping disabled
0x0de88128: 0x00000000: dst (0,0)
0x0de8812c: 0x00160016: dst (22,22)
0x0de88130: 0x03dc6000: dst offset 0x03dc6000
0x0de88134: 0x00000000: src (0,0)
0x0de88138: 0x00000080: src pitch 128
0x0de8813c: 0x09747000: src offset 0x09747000
0x0de88140: 0x02000011: MI_FLUSH
0x0de88144: HEAD 0x05000000: MI_BATCH_BUFFER_END

The real question is what exactly that tells us and do we get any closer to finding the bug?
As can be seen, not a lot happens in this batch buffer. In fact just 3 operations. One copy using the 3D pipeline, with a lot of preamble, and two copies using the 2D blitter. If you look even closer, you will go mad.

Operation 1: 3D copy.
Destination: linear buffer 03dc5000, size 4096 bytes, pitch 64 bytes, drawing rectangle (0,0), (1, 1)
Source: X tiled buffer 01000000 size 8388608 bytes, pitch 8192 bytes, 1280×800 pixels. This looks like X’s front buffer.
What is actually copied to the 1×1 pixel is a mystery as we do not follow auxiliary buffers yet and so can not decode the vertices used. Presumably it is something a little more complicated than a mere blit as a texture sampler is required – though this may just be something as trivial as ensuring the alpha channel is initialized.

Operation 2: 2D copy.
Destination: linear buffer 03dc6000, size 4096 bytes, pitch 128 bytes.
Source: linear buffer 03dc5000, size 4096 bytes, pitch 64. From above we know its size to be 1×1 pixels.
Copy: (918, 9), (940, 31) to (0, 0), (22, 22).

Operation 3: 2D copy.
Destination: linear buffer 03dc6000, size 4096 bytes, pitch 128 bytes.
Source: linear buffer 09747000, size 4096 bytes, pitch 128 bytes.
Copy: (0, 0), (22, 22) to (0, 0), (22, 22).

So which operation appears entirely superfluous and yet accessed data far beyond the end of its buffer? That’s right the second copy. For a 1×1 surface, attempting to copy a 22×22 region is a bit strange, doing it 900 pixels outside the surface is just adding insult to injury. Not to mention that the user did not even want the data that we spent so much time fetching!

But where does that leave me? The impossible happened and now I have evidence to prove it! Oh, and a loathing for graphics libraries…


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: