X Acceleration that Finally Works

Abstract

Meaningful hardware acceleration within the X Window System is becoming a reality. We present the current state of the art of 2D accelerated rendering with the Intel 965 graphics device, showing speedups of up to 900 times over software rendering.

Presentation

Summary

Background

In the beginning, X provided support for graphics that by today's standards are extremely unattractive. But the graphics capabilities were perhaps not ill-suited for applications and graphics hardware at the time. For example, X provided bitwise raster operations like XOR and applications used XOR rubber-banding for window-frame resizing or selection rectangles. X provided non-antialiased line rendering, and hardware might have provided a Bresenham line implementation. X also provided an acceleration architecture (XAA) that exposed these original "core" rendering primitives to the drivers.

Then time passed. Some things changed, (what applications wanted to draw), while some things didn't, (what X and XAA provided). So for a time, any "modern" application resorted to manually constructing the graphical results it wanted, (in software), and simply sending the final result to the X server. As far as graphics was concerned, the X protocol was reduced to simple image transport and the hardware remained almost unused by applications, (except for a couple of notable operations such as solid fills and "blitting" or copying pixels for things like scrolling).

X graphics support was revitalized with the X Render Extension. This extension provides a small number of new primitives that are well-suited for the needs of modern applications. These primitives include image compositing (blending), support for client-side fonts, trapezoid rasterization and gradients. The Render extension also shipped with a remarkably slow software implementation in the X server.

Today, through cairo and similar systems, the standard toolkits (GTK+ and Qt) provide applications with easy-to-use drawing APIs that provide sophisticated effects and pipe everything into the X server through the Render extension. So, unlike the days of X being used just to transport the final image, we now have the opportunity to get the video hardware involved in rendering the things the application actually wants to draw.

The EXA acceleration architecture, (an internal implementation detail of the X server), allows X server drivers to implement hardware support for each primitive provided by the Render extension. So architecturally, everything is in place with EXA. Applications' rendering operations are making it all the way to the hardware driver, where the driver should be able to make everything go fast.

Problems

So if that's all in place, why isn't EXA blisteringly fast yet? Or why does adding the 'AccelMethod "EXA"' option to xorg.conf actually slow things down (and sometimes dramatically)? There are a variety of possible causes, but I'll discuss two here:

Migration refers to the problem that if a surface needs to be modified sometimes by a hardware operation and sometimes by software, then the X server needs to migrate the surface contents back and forth between video and system memory. Due to architectural issues in commodity hardware, reading back from video memory is painfully slow, often orders of magnitude slower than writing to video memory. Various clever migration strategies have been attempted within EXA, but it's clear that no amount of cleverness is going to prevent a significant amount of the overhead, (it turns out to be nearly impossible to predict in advance how a surface will be used next). A punch line here is that often a little bit of hardware acceleration is much worse than none at all, (as evidence, see many references to people dramatically improving system performance by disabling hardware acceleration in XAA with the XAANoOffscreenPixmaps option). The real answer for the migration problem is to make sure that drivers support everything needed and basically never fall back to software rendering.
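To make the "a little bit of acceleration is worse than none" point concrete, here is a toy cost model in Python. It is only a sketch: the cost constants, the operation names, and the total_cost helper are all assumptions invented for illustration, not measurements or X server code. The only property it tries to capture is that readback from video memory is far more expensive than upload.

```python
# Toy cost model for surface migration. All costs are made-up units (assumptions),
# chosen only so that readback is much more expensive than upload.
UPLOAD_COST = 1      # system memory -> video memory
READBACK_COST = 100  # video memory -> system memory
HW_OP_COST = 1       # one hardware-accelerated operation
SW_OP_COST = 10      # one software (fallback) operation


def total_cost(ops, accelerated):
    """Return the total cost of running `ops` on a single surface.

    `accelerated` is the set of operation names a hypothetical driver handles
    in hardware. Anything else falls back to software, which first forces the
    surface back into system memory.
    """
    location = "system"  # the surface starts out in system memory
    cost = 0
    for op in ops:
        if op in accelerated:
            if location == "system":
                cost += UPLOAD_COST
                location = "video"
            cost += HW_OP_COST
        else:
            if location == "video":
                cost += READBACK_COST
                location = "system"
            cost += SW_OP_COST
    return cost


# A workload that alternates between two kinds of operation.
ops = ["composite", "trapezoids"] * 4

no_accel = total_cost(ops, set())                             # pure software
partial_accel = total_cost(ops, {"composite"})                # one op accelerated
full_accel = total_cost(ops, {"composite", "trapezoids"})     # no fallbacks

print(no_accel, partial_accel, full_accel)
```

Under these assumed costs the partially accelerated case is the slowest of the three, because every fallback pays for a readback; the fully accelerated case uploads once and never migrates again.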

So this brings us to the second issue, which is that we don't yet have drivers that do everything we want. Fortunately, we now have video-hardware manufacturers that are cooperating by providing complete documentation for several devices, (and more and more devices all the time it appears). On the Intel side, documentation for the latest device, (the i965 or "gen4"), was released under a CC-Attribution-NoDerivs license during LCA. This device is interesting because more than any previous Intel device it should be quite capable of supporting anything that Render and EXA can throw at it. Also, the unified memory architecture it uses, (system memory is reused for video memory), should help with migration issues.

However, the currently-available upstream driver for the i965 is extremely uninteresting performance-wise. It's one of the drivers that will give a tremendous slowdown if EXA is enabled. The fundamental problem that the driver has is that it's using a single chunk of memory to set up all the state for each compositing operation. Then while the hardware pipeline is started up on that operation, the driver receives the next operation. But instead of stuffing this into the pipe and keeping the hardware busy, the driver currently spins in the CPU until the hardware is completely finished with the previous request. It does this because it can't modify that shared state object in memory while the previous operation is still using it. So the CPU stays extremely busy while doing nothing but waiting, and the GPU stays extremely idle, doing short bursts of work that occupy only a tiny fraction of the compute resources on the chip. Not a good state of affairs at all.
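The cost of that serialization can be sketched with a simple two-stage timing model. This is an illustration only: the function names and the time units are assumptions, and real hardware has additional costs (pipeline startup, flushes) that the model ignores.

```python
def serialized_time(n_ops, cpu_setup, gpu_exec):
    """Shared state object: the CPU cannot set up operation i+1 until the GPU
    has finished operation i, so setup and execution never overlap."""
    return n_ops * (cpu_setup + gpu_exec)


def pipelined_time(n_ops, cpu_setup, gpu_exec):
    """Independent state per operation: the CPU sets up the next operation
    while the GPU is still executing the previous one."""
    cpu_done = 0  # time at which the CPU finishes setting up the current op
    gpu_done = 0  # time at which the GPU finishes executing the current op
    for _ in range(n_ops):
        cpu_done += cpu_setup
        # The GPU starts an op once it is set up and the previous op is done.
        gpu_done = max(gpu_done, cpu_done) + gpu_exec
    return gpu_done


# Assumed, arbitrary time units: 2 per setup, 3 per execution, 1000 operations.
print(serialized_time(1000, 2, 3))
print(pipelined_time(1000, 2, 3))
```

In the pipelined case the total approaches the cost of the slower stage alone, while the serialized case always pays for both stages on every operation.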

Recent Work

Eric Anholt, Dave Airlie and I have been working to fix up the i965 driver to actually be sane. This work depends first on TTM, a new kernel-supported graphics memory manager implemented as part of DRM, (see Dave Airlie's talk at LCA 2008 for more on TTM). The fundamental primitives that TTM provides are buffer objects, (kernel-allocated chunks of video memory for the driver to use), and fences, which allow the driver to set up operations and receive interrupts when operations are complete rather than busy-waiting.

As it turns out, antialiased text rendering is one of the most difficult things to accelerate in hardware. Currently, the driver will see an independent composite operation for every glyph and the glyph image might be as small as a 10x10 surface or smaller. With surfaces that small, any per-composite-operation overhead in the driver becomes quite significant. And, of course, text is one of the most fundamental operations in 2D interfaces, so it's important to not render it extremely slowly. Because of this, over the past several months we've been focusing on characterizing, profiling, and optimizing the i965 driver with text as the primary benchmark. Other operations, (like image scaling and blending), will all benefit even more from the work we've been doing specifically for text.
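Why tiny glyph surfaces are so sensitive to per-operation overhead can be shown with a little arithmetic. The sketch below assumes a fixed overhead per composite plus a cost proportional to the number of pixels; the helper name and both cost values are hypothetical and chosen only for illustration.

```python
def overhead_fraction(width, height, per_op_overhead, per_pixel_cost):
    """Fraction of one composite operation's time spent on fixed overhead,
    assuming time = per_op_overhead + width * height * per_pixel_cost."""
    pixel_work = width * height * per_pixel_cost
    return per_op_overhead / (per_op_overhead + pixel_work)


# Assumed costs in arbitrary units: 1000 per operation, 1 per pixel.
glyph = overhead_fraction(10, 10, 1000, 1)       # a 10x10 glyph
image = overhead_fraction(1000, 1000, 1000, 1)   # a 1000x1000 image

print(f"glyph: {glyph:.1%} overhead, image: {image:.3%} overhead")
```

With these assumed numbers the glyph composite is dominated by overhead while the large image barely notices it, which is why text makes a demanding benchmark for driver overhead.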

When profiling text rendering, we immediately noticed that all operations were falling back to software rendering. This was because the X server was using special system-memory storage for all cached glyph images. We changed the X server to use ordinary pixmaps instead, which allows the glyph images to live in video memory and be composited in hardware. Next we discovered that the i965 driver was claiming it didn't support the necessary render-to-8-bit-alpha-mask operation that text rendering needed, so that was also forcing fallbacks. Fortunately, no real code was needed to fix that; we just changed the driver to properly report its capabilities.

Those were all good improvements, but with those alone, the performance of text only got worse with the i965 device, (now orders of magnitude slower than software). This is because now every single, tiny glyph rendering was subject to the bug described earlier where the driver would spin the CPU while waiting for each separate operation to complete in the GPU. So now the need for a proper implementation of compositing in the driver was much more important.

Dave Airlie made the initial change of the i965 driver to use batch buffers rather than using a single, shared chunk of memory for the graphics state for each operation. He also changed it so that the driver uses TTM to allocate these batch buffers. With this in place, we did several optimizations so that the driver didn't needlessly re-initialize any more state objects than strictly necessary. All of our code is currently available in the intel-batchbuffer branch of the xf86-video-intel driver.
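The general idea of batching operations and skipping redundant state setup can be sketched as follows. This is a toy model, not the driver's code: the BatchBuffer class, its methods, the capacity, and the assumption that state must be re-emitted at the start of every batch are all hypothetical.

```python
class BatchBuffer:
    """Toy batch buffer: queues commands and hands them over in batches."""

    def __init__(self, capacity):
        assert capacity >= 2, "need room for one state + one composite command"
        self.capacity = capacity
        self.commands = []         # commands queued in the current batch
        self.flushed = []          # batches handed to the (simulated) hardware
        self.current_state = None  # state already emitted in the current batch

    def composite(self, state, rect):
        # Reserve room for the worst case (state + composite) so that a
        # composite is never separated from the state it depends on.
        if len(self.commands) + 2 > self.capacity:
            self.flush()
        # Only emit state when it differs from what this batch already set up.
        if state != self.current_state:
            self.commands.append(("state", state))
            self.current_state = state
        self.commands.append(("composite", rect))

    def flush(self):
        if self.commands:
            self.flushed.append(self.commands)
            self.commands = []
            # Assumption: each new batch must set up its own state again.
            self.current_state = None


# 100 small glyph composites that all share the same state.
batch = BatchBuffer(capacity=64)
for i in range(100):
    batch.composite("glyph-state", (i * 10, 0, 10, 10))
batch.flush()

state_commands = sum(1 for b in batch.flushed for kind, _ in b if kind == "state")
print(len(batch.flushed), state_commands)
```

In this model the 100 composites are submitted in two batches with one state command each, instead of 100 separate submissions that each rebuild their state.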

Results

The current results can be summarized as follows. Here we are showing the performance difference of the upstream "master" branch of the driver compared to the "intel-batchbuffer" branch. In both cases we are using EXA, but reporting numbers as the speedup compared to XAA, (so higher numbers are better and numbers less than 1 are performance regressions).

                 Speedup compared to XAA

Operation        EXA (master)   EXA (intel-batchbuffer)
---------        ------------   -----------------------
aa10text                  0.3                       0.6
Blend                     9.4                     101.6
.5 scale                  7.4                      34.3
2x scale                 23.5                     200.3
General scale            20.2                     946.1

Measurements made with "x11perf -aa10text" and with renderbench.

So, with our work, i965 EXA antialiased text for small glyphs is now twice as fast as it was before, but still only about 60% of the speed of XAA text. That's 109,000 glyphs/second for EXA/intel-batchbuffer compared to 186,000 for XAA. So it's still a respectable speed for text, but it could be improved. Eric believes that the remaining problems preventing text from being faster are cache flushing issues within TTM itself.

Meanwhile, image blending with various scale factors is greatly improved. The intel-batchbuffer branch makes EXA on the i965 perform roughly 5 to 50 times faster than upstream EXA and, in the best case, more than 900 times faster than XAA. Obviously, this is a very good result, and it shows the incidental improvements we achieved while looking closely only at text performance.
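The ratios quoted in this section can be re-derived from the numbers in the table and the glyph rates above. The short Python sketch below only repeats that arithmetic; the variable names are ours.

```python
# Speedup relative to XAA, copied from the table: (EXA master, EXA intel-batchbuffer).
speedup_vs_xaa = {
    "aa10text": (0.3, 0.6),
    "Blend": (9.4, 101.6),
    ".5 scale": (7.4, 34.3),
    "2x scale": (23.5, 200.3),
    "General scale": (20.2, 946.1),
}

# Improvement of intel-batchbuffer over master for each operation.
improvement = {
    op: batchbuffer / master
    for op, (master, batchbuffer) in speedup_vs_xaa.items()
}
for op, ratio in improvement.items():
    print(f"{op:14s} {ratio:5.1f}x faster than master")

# Text throughput quoted above: EXA/intel-batchbuffer versus XAA, in glyphs/second.
text_vs_xaa = 109_000 / 186_000
print(f"text: {text_vs_xaa:.0%} of XAA speed")
```

The blend and scale improvements come out between about 4.6x and 46.8x, consistent with the rounded "5 to 50 times" figure, and the glyph rates give roughly 59% of XAA speed, consistent with the 0.6 entry in the table.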

Future Work

We are currently in the process of merging this work into the master branch of the upstream Intel driver and plan for it to be part of the upcoming Intel driver release scheduled for June 2008.

Beyond that, we plan to add support for hardware-accelerated gradients, trapezoid rasterization, and perhaps polygon rasterization. Obviously there's similar EXA acceleration work needed for other drivers as well. Please come join us in the fun!