Last week saw a new release of glamor, a library that lets 2D display drivers translate Render drawing commands into OpenGL (and then feed them into a 3D driver). This release packs in a couple of standout features for cairo: trapezoid shaders and the ability to handle textures larger than supported by hardware. The development emphasis has been on performance, and indeed glamor-0.5 is much improved over glamor-0.4 as measured on Intel’s latest and greatest IvyBridge architecture.
This is a graph of absolute time for each trace; shorter bars are quicker, and as can be seen, the improvement is pretty good.
However, to keep this in perspective, let’s compare against the standard 2D driver:
Still plenty of room for improvement.
Having looked at the impact of the move from XAA to UMS/EXA and then to KMS/UXA on performance of the core drawing operations, we can turn to look at the impact upon RENDER acceleration. One of the arguments for dropping XAA support and writing a new architecture was the need to accelerate the new advanced rendering protocol, RENDER. Did the claim really live up to reality? Did the switch to EXA actually help?
In short, the switch to EXA was catastrophic in the beginning. The new acceleration architecture both regressed performance in the core rendering operations and failed to deliver on the claim of improving RENDER acceleration. Fast forward a few years, through improvements to the RENDER protocol and many improvements to the driver, and we reach UXA, where we start to actually see some benefit from enabling GPU acceleration. But it is not until we look at SNA that we reach a level of maturity and consistency in the driver to have all-round performance that is finally at least as good as XAA again (effectively software rendering in these benchmarks).
An outstanding question that is regularly asked is: what are the design differences between EXA, UXA and SNA?
UXA was originally EXA with the pixmap migration removed. It was argued that given a unified memory architecture such as found on IGP, there was no need to migrate pixmaps between the various GPU memory domains. Instead all pixmap allocations were to be made in the single GPU domain and if necessary the pixmap would be mapped and read by the CPU through the GTT (effectively an uncached readback, very very slow).
In hindsight, that decision was flawed. As it turns out, not only do we have both mappable and unmappable memory domains within the IGP, so that we cannot simply map any GPU pixmap without cost, but we can also have snoopable GPU memory (memory that can exist in the CPU cache). The single GPU memory domain argument was a fallacy from the start, and even more so with the advent of the shared last-level-cache between the CPU and GPU. Also it tends to be much much faster to copy from the GPU pixmap into system memory, perform the operation on the copy in system memory and then copy it back, than it is to try and perform the operation through a GTT mapping of a GPU pixmap (more so if you take care to amortize the cost of the migration). Not to mention the asymmetry between upload and download speeds, and that we can exploit snoopable GPU memory to accelerate those transfers. So it turns out that having a very efficient pixmap migration strategy is the core of an acceleration architecture. In this regard SNA is very much like EXA.
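To make the amortization point concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the `Pixmap` class, its method names and the two byte buffers are hypothetical stand-ins invented for illustration, not code or APIs from EXA, UXA or SNA. It tracks which copy of a pixmap is current and migrates only when the other side needs the data, so a run of CPU fallbacks pays for a single download.

```python
class Pixmap:
    """Toy pixmap (hypothetical): a GPU copy plus a shadow copy in system memory."""

    def __init__(self, size):
        self.gpu = bytearray(size)   # stand-in for the GPU buffer
        self.cpu = bytearray(size)   # stand-in for the system-memory shadow
        self.valid_on = "gpu"        # which copy holds the latest contents
        self.downloads = 0           # GPU -> system memory copies performed
        self.uploads = 0             # system memory -> GPU copies performed

    def move_to_cpu(self):
        # Migrate only if the system-memory copy is stale.
        if self.valid_on == "gpu":
            self.cpu[:] = self.gpu
            self.downloads += 1
            self.valid_on = "cpu"

    def move_to_gpu(self):
        # Migrate only if the GPU copy is stale.
        if self.valid_on == "cpu":
            self.gpu[:] = self.cpu
            self.uploads += 1
            self.valid_on = "gpu"

    def cpu_fallback(self, index, value):
        # A software operation works on the system-memory copy.
        self.move_to_cpu()
        self.cpu[index] = value

    def gpu_invert(self):
        # An "accelerated" operation works on the GPU copy.
        self.move_to_gpu()
        self.gpu[:] = bytes(255 - b for b in self.gpu)


def demo():
    pixmap = Pixmap(16)
    pixmap.gpu_invert()              # rendered on the GPU, no migration needed
    for i in range(4):
        pixmap.cpu_fallback(i, 0)    # four fallbacks share one download
    pixmap.gpu_invert()              # one upload brings the result back
    return pixmap
```

In this model the four consecutive fallbacks in `demo()` cost one download and one upload in total, rather than one round trip per operation. A real driver has far more to consider (partial damage tracking, the upload/download asymmetry, snoopable buffers), none of which is modelled here.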
From the design point of view, the only real difference between EXA and SNA is that EXA is a mere midlayer whose existence is to hinder the driver from doing the right thing, and SNA is a complete implementation. Since the devil is in the details (fallbacks are slow, don’t!), that difference turns out to be quite huge.