Earlier this week I isolated some bugs that are currently causing a 4x slowdown with EXA and the i965 driver compared to using the NoAccel option of the X server.
Some people have wondered if the discouraging results I have found so far suggest that we should give up on hardware acceleration or that EXA as an acceleration architecture is doomed. I think the answer is no on both points. I think we're just seeing typical behavior of new code that needs some optimization.
EXA without acceleration
The first experiment is a very simple one to ensure that the 4x slowdown isn't an unavoidable aspect of having EXA enabled. In this experiment I first disabled the accelerated-compositing functions in the i965 driver, then I disabled EXA migration. The net result of this experiment is that the X server will still go through the EXA paths, but will basically use all the same software-fallbacks for compositing that are used in the case of NoAccel. The performance with this patch can be compared to the NoAccel case here. (Again, click through to my blog if you're just getting a list of numbers, not a colorful bar chart.)
It's worth pointing out that with this change, everything still renders correctly. Basically, we're using the same rendering code as in the NoAccel case, but we're using EXA to get there. And we can see that there is some overhead to EXA seen here, (a 10% slowdown), but nothing like the 400% slowdown seen before. There's certainly no indication here that EXA is doomed to be horribly slow, for example.
Now, this experiment does miss overhead in EXA having to do with managing video memory, (since I disabled all migration so everything lives in system memory). We'll be able to see this additional overhead below.
Emulating future i965 speedups
The above experiment is still pretty boring---it's still just measuring software-fallback performance. A much more interesting experiment allows us to start exploring where will be able to get with some un-broken hardware acceleration.
My previous post highlighted two significant problems preventing the current code from having good performance:
Time lost migrating pixmaps with memcpy
Time wasted while the driver busy-waited between operations
Here's a run-down of what I could find about progress on solving these two problems:
Excessive migration
When I looked closer at what was causing the pixmap migration, I found that much of it was due to glyph images being pinned to system memory, (recall that the benchmark I'm using is Mozilla Firefox on a page consisting of mostly text). I asked Keith Packard about why these glyph images are being pinned to system memory, and he explained that what was preventing the glyphs from migrating is that the X server has not been using straight Pixmaps for glyphs, but something slightly different.
Keith is already mostly finished with a change to make the server use Pixmaps for glyphs. Apparently there is one slight snag in that Pixmaps are a per-screen resource while glyphs are not. For now, that could be worked around by using one Pixmap per screen, (until Pixmaps and other resources can be made global within the server).
So, hopefully that glyph pinning problem will be fixed. Meanwhile, it's fairly silly that there's a bunch of memcpy operations to migrate things from "system" to "video" memory on the i965 anyway. This card doesn't have dedicated video memory, but just uses system memory anyway, (all that's needed is for some entries to be set in the GART table, and for some alignment constraints to be satisfied). So it should be possible to eliminate all of this memcpy time anyway.
I'm told that the long-awaited memory management work, (TTM), is what will solve this. I don't know what the status of that work is, but hopefully it will be ready soon. does anyone have some pointers for more information on TTM status?
Synchronous compositing
I characterized this problem fairly well in my previous post. Eric Anholt suggested a first quick step toward improving the situation would be to use an array of state buffers. With N buffers we could make the waiting happen only 1/N as frequently as it's currently happening. So that's something that even someone like me without any detailed documentation on the i965 could do.
And with a little more smarts, (from someone with more information), we could presumably reclaim buffers that the hardware was done with without having to do any waiting at all.
So it shouldn't be too long before the waiting can be eliminated or reduced to an arbitrarily small amount of time.
Results
Given these identified solutions for the current known problems, (and much of the work in progress already), the next question I want to ask is what will things look like when these are solved?
I implemented quick patches to both EXA and the i965 driver to emulate the time being spent on migration and compositing going to zero. That's not totally realistic, but is at least a best-case look at where we'll be with these problems fixed. And here's what it looks like (with the previous results repeated for comparison):
Note that in this experiment, rendered results are not at all correct, (basically, no text appears, for example).
And, still, things aren't faster than NoAccel, but there's definitely still lots of room for improvement. For example, the pixman profile shows compositing, (fbCombineInU and fbFetch_a1) that should be moved to the hardware, (particularly when the hardware is infinitely fast like it is in my emulation here!).
After that, pixman's rasterization would be at the top of the pixman profile. I've been wanting rasterization to show up at the top of a profile for a long time so I could have an excuse to implement some ideas I have for much faster software rasterization, (and to explore using the hardware for rasterization as well). And, for some applications doing much more than just rendering text, rasterization might already be a lot closer to the top.
So that shows what software operations aren't hooked up to be accelerated yet. What else is here? As I pointed out before, (and is much easier to see in this chart than the one from earlier this week), libxul is mysteriously getting slower once the i965 gets involved, but libxul really shouldn't care. So that will be something to investigate by actually building mozilla with debug symbols.
Also, there's also significantly more overhead in libexa in this chart compared to those above. So there's some room for improvement there, (ExaOffscreenMarkUsed is at the top of the profile, and as I've mentioned before it looks ripe for improvement).
Finally, the i965 driver is still burning a lot of time in its wait
function here. I'm not sure what the cause of that is this time since
I've eliminated all calls to the wait function from
i965_prepare_composite
and i965_composite
in this experiment.
Oh, and the big libc time in this chart is from gettimeofday, (which I showed how to eliminate earlier). That patch hasn't been accepted upstream yet, and it wasn't included in this run.
As always, I've tried to make as much data available as possible, (you can even change the .oprofile extensions on the links to .callgraph for more data---but I often can't make sense of oprofile callgraph reports myself). So I'd be glad for anybody to dig in deeper and provide any useful feedback.