Monthly Archives: January 2013

Søren Sandmann Pedersen, maintainer of the pixman library that is at the heart of the software rasteriser in Cairo and X, argues that, all things being equal, an integrated graphics processor should never be able to outperform the main processor for rendering Cairo. Ideally, when using either the GPU or CPU, we want memory to be the rate-limiting component. On the Sandybridge processor, the memory is attached to the system agent, along with the large L3 caches, and is shared between the CPU and GPU. So he argues that any performance advantage cairo-xlib might show over cairo-image whilst using Intel graphics is really an opportunity for improvement of the rasteriser and pixman kernels. (But also bear in mind that the GPU should be considerably more efficient and be able to saturate the memory bus at a fraction of the power cost of the CPU.) Using SSE2, we are indeed able to fully saturate the bus to L1, L2 and main memory for a variety of common operations in pixman. However, the bilinear scaling kernels take longer to compute their results than to write them to memory, and so fail to completely utilize the available bandwidth. Furthermore, for a slightly more complex transformation than a simple scale and translation, we do not yet have a fast SSE2 kernel that operates directly on the destination and so must incur some extra work. If we ignore, for the time being, making further optimisations to those kernels to improve performance, we can instead ask the question: if using one SSE2 unit is not enough to saturate the memory bus, what happens if we throw more cores at the problem?

To gain the most improvement from adding threads to cairo, you need to design a rasteriser and usage model with threading in mind. One such design is the vector renderer O by Øyvind Kolås. Despite being an experiment, it does show quite a bit of promise, but in its raw form just throwing threads at the problem does not beat using the SIMD compositing routines provided by pixman. However, it did raise the question of whether we can make improvements to the existing image backend that preserve its immediate mode nature, so that existing applications could benefit without alteration. To preserve the existing semantics, we can break up the individual composite and scan conversion operations into small pieces, feed those to a pool of threads, and then wait for the threads to complete before returning to the application. As a result, the threads never run for very long, and we risk the overhead of thread management outweighing any benefit from splitting the operation over multiple cores.
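As a rough illustration of that "split, dispatch, wait" pattern, here is a minimal sketch in Python. It is not the prototype's code: `fill_band` and `threaded_fill` are hypothetical names, the solid fill is only a stand-in for a real compositing kernel, and the band height of 64 rows is an arbitrary assumption. In CPython the global interpreter lock means this shows the control flow rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fill_band(dst, width, y0, y1, value):
    """Stand-in for a compositing kernel: fill rows y0..y1-1 of a flat buffer.

    Hypothetical helper; a real kernel would blend source pixels instead.
    """
    dst[y0 * width:y1 * width] = bytes([value]) * ((y1 - y0) * width)


def threaded_fill(pool, dst, width, height, value, band_height=64):
    """Split one operation into horizontal bands and block until all are done.

    Because we wait before returning, the caller still sees an ordinary
    immediate mode call: the buffer is fully written when this returns.
    """
    futures = [
        pool.submit(fill_band, dst, width, y, min(y + band_height, height), value)
        for y in range(0, height, band_height)
    ]
    wait(futures)
    for future in futures:
        future.result()  # re-raise any exception from a worker thread


# Usage: a long-lived pool, reused for every operation so that thread
# creation is not paid on each call.
with ThreadPoolExecutor(max_workers=4) as pool:
    surface = bytearray(100 * 37)  # 100x37 "pixels", one byte each
    threaded_fill(pool, surface, 100, 37, 0xFF, band_height=8)
```

The bands are disjoint, so the workers never write to the same bytes and need no locking. The smaller the band, the more evenly the work spreads, but the more the per-task overhead matters, which is exactly the trade-off described above.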

To gain the greatest advantage from adding threads, I used a desktop Sandybridge processor (an i5-2500) for benchmarking this prototype cairo-image backend:

Comparison of multiple threads

About half the test cases regress by about 10%, but around one third are improved by 2-3x. Not bad – the regressions are unacceptable, but it does offer a tantalising suggestion that we can make some significant improvements with only a minor change. (Remember not all processors are equal, and thanks to Sandybridge’s turbo some cores are more equal than others. I tried running the test cases on a Sandybridge ultrabook, and as soon as more than one core was active, the entire package was throttled to keep it within its thermal constraints. The performance regression was dire.)

Having tuned the software backend to make better use of the available resources, what can we now say about the relative merits of the integrated graphics processor?

Comparison of multiple threads and SNA

Indeed, for the cases that are almost entirely GPU bound (for example the firefox-fishbowl, -fishtank, -paintball, -particles), we have virtually eliminated all the previous advantage that the GPU held. In a couple of notable cases, we have improved the image backend to outperform SNA, and in all cases the threaded image backend now beats UXA. However, as can be seen there is still plenty of room for improvement of the image backend, and we can’t let the hardware acceleration be merely equal to a software rasteriser…

The Nvidia Ion GPU was released a few years back to cater for the low power netbook market as a substantial upgrade for the anemic Intel GMA950 (or 945gm as it is better known to us) that shipped as the integrated graphics processor in the first Atoms. Those made quite nice machines with long runtimes, a touch on the slow side compared to full-grown laptops, but actually quite comparable to the current generation of premium tablets many years later. People still use these, and I get the occasional request to see how well they perform every time Nvidia releases a new driver. So let’s take a look:

Performance of Nvidia Ion with the 313.18 driver release

In nearly every test there is a small improvement, at around 2-3%. But for the paintball HTML5 demo, they have managed to fix some form of resource starvation and have been able to accelerate it, giving an order of magnitude speed increase. Strangely, the nvidia driver fares worse on average than the nouveau driver, despite having substantially better performance in several cases. This is because the nouveau driver is, at least, consistently slow, whereas the nvidia driver also experiences a few severe slowdowns and so has a much more mixed set of results.
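To see how a handful of severe slowdowns can outweigh wins on most tests, consider a toy calculation. The numbers below are invented purely for illustration and are not measurements from either driver, and the use of a geometric mean of relative runtimes is an assumption; the averaging behind the chart above is not specified here.

```python
from statistics import geometric_mean

# Hypothetical runtimes relative to some baseline (lower is better).
# "consistent" is uniformly a little slow; "mixed" is faster on most
# tests but has a few pathological cases.
consistent = [1.25] * 10
mixed = [0.7] * 7 + [20.0] * 3

# Count the tests on which the mixed driver is the faster of the two.
wins_for_mixed = sum(m < c for m, c in zip(mixed, consistent))

print("mixed driver wins", wins_for_mixed, "of", len(mixed), "tests")
print("geometric mean, consistent:", round(geometric_mean(consistent), 2))
print("geometric mean, mixed:", round(geometric_mean(mixed), 2))
```

With these made-up figures the mixed driver wins most of the individual tests yet ends up with the worse average, because three runs that are 20x slower cost far more than seven modest wins can recover.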

As is always the case, every time you try to compare implementations, you find completely unexpected bugs. The idea was simple: look at the performance of the last few generations and see how we’ve been improving.

To start with, I have a Core2 Quad, Q8400 at 2.66GHz (fixed frequency) and a venerable 3rd generation Intel GPU (a q35 to be precise). This is a 95W desktop beast, only to be beaten by a slightly older Q9550 Core2 Quad at 2.83GHz sporting a 4th generation GPU, a g45.

From the current generation of CPU design, I have two mobile chips, a Sandybridge i5-2520m at up to 3.2GHz and an Ivybridge i7-3720qm at up to 3.6GHz. These mobile chips run much cooler than their desktop brethren at 35W and 45W respectively, and both have the GT2 variants of the 6th and 7th generation Intel GPUs.

To put those two into perspective, I have also added the results from a desktop Sandybridge chip, the 95W i5-2500 which runs up to 3.7GHz. On the downside this processor only has a GT1 6th generation GPU.

Relative CPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by the Core2 Q8400:

Q9550: 1.6x faster
i5-2500: 2.3x faster
i5-2520m: 1.7x faster
i7-3720qm: 1.4x faster

The success story is that the 35W mobile processors really do deliver the performance of the 95W desktop beasts of the previous generation, which is quite frankly very impressive. Of course, they are still dwarfed by their desktop brethren – now I really do want to get my hands on an i7-3770k!

The oddity is then why is my Ivybridge underperforming? Its single thread performance, which is what is under test here, should not be as badly limited: the base clock is a tiny bit faster than the Sandybridge i5, and with the larger and improved caches it should achieve better IPC as well. In theory it should also be coupled to faster main memory. My guess is that it is being throttled due to poor cooling. Or that there is a glitch in the software, or perhaps a broken configuration, or perhaps the memory is underrated, et cetera.

But what of the graphics performance, I hear you cry! The situation here is a bit more complex. All of those processors were using the same SSE2 backend in pixman, which will only be improved upon with the introduction of AVX2 in Haswell, and so we were directly comparing slight variations of processor design executing the same software. When we look at GPU performance, not only do we have a wide variation of processor design, feature set and instruction sets, we also by necessity have different software for each.

If we look at the current driver situation, that is, using UXA:

Relative GPU performance of Core2 vs Sandybridge

Compared to the baseline performance given by using SNA on the Core2 Q8400 with a q35 GPU (a gen3 device like that found in the Pineview netbook):

Q9550: 4.3x slower
i5-2500: 1.5x slower
i5-2520m: 3.0x slower
i7-3720qm: 2.3x slower

Despite almost a doubling of CPU power and an even greater increase in the GPU performance across the generations, the drivers continue to do a disservice to the hardware and ourselves.