Synchronous compositing in the i965 driver

A couple of weeks ago I aired my confusion about a situation in which using EXA with an i965 card was 4 times slower than using NoAccel, but that there weren't any large bottlenecks visible in the profile at all, (nothing over 5%). And that didn't seem likely at all.

As is often the case this wasn't a wrong answer, but was instead a problem with me not understanding the question I had asked. I had been profiling over enttire runs of mozilla's Trender benchmark, but the 4x slowdown was based on the results it reported which measured only the inner loop of rendering itself. So my profiles including extraneous computation such as page loading that Trender was explicitly not measuring. This was actually easy to determine by simply measuring the total runtime of Trender and seeing that it slowed down by only 1.5x instead of 4x.

So to get better profiling data, I changed strategy slightly and started using an on-demand Trender technique that Vladimir offered me. It makes it easy to run a one-off Trender loop for hundreds of iterations so I can profile exactly the rendering and nothing else.

So I used that approach to measure the rendering performance of a worthwhile web page, with both NoAccel and EXA, (again with an Intel i965 card). I used my standard profile script to generate piles of data.

The trickiest part was then coming up with a reasonable way to visualize the data to try to figure out what was going on. My brother, Richard, pulled his usual HTML+CSS magic and came up with the following for me.

Note: If you click through to my blog, you'll get to see these charts
in their colorful glory. Otherwise, if you're reading this
somewhere like http://planet.gnome.org then you'll probably
just see the boring list version. But one really cool thing
about Richard's hack here is that in the absence of CSS
styling this chart does degrade nicely to just list with all the data
available. So these charts should be much more accessible than things
I've done previously with PNG bar chart images. Plus, I get to
embed hyperlinks at all the intuitive spots as well. Anyway, Richard
totally rocks for having put this together.

NoAccel (14 ms.) system profile symbols profile

pixman 42%
libxul 16%
vmlinux 16%
Xorg 7%
libc 5%
other 14%

EXA (77 ms.) system profile symbols profile

libc 30%
intel_drv 19%
vmlinux 16%
libxul 11%
libexa 6%
pixman 5%
Xorg 4%
other 9%

So there are some big percentages there---much nicer than the little 4% and 5% things I was seeing a couple of weeks ago. So, yes, it definitely helps to know that you're asking the question you mean to be. There are several links in the chart above to different profile reports. If you drill down the libc profile, you'll see that there's a lot of memcpy going on.

An easy guess is that the memcpy is due to pixmap migration due to software fallbacks. To verify this guess, I first disabled all pixmap migration from video memory to system memory, (which was as simple as if (! can_accel) return; in exaDoMigration. This actually got rid of most of the memcpy problem, and didn't cause any incorrect rendering. Next I enabled DEBUG_TRACE_FALL in exa_priv.h to identify all fallbacks that were getting hit by the benchmark and I put early returns in place to disable them. Unsuprisingly, this did cause a lot of incorrect rendering.

And of course, what we really want to do is to find what is triggering those fallbacks and accelerate them. But for now, I wanted to see how performance would be if we were already at the point that we had no software fallbacks at all. Here's what I got:

EXA-no-fallbacks (58 ms.) system profile symbols profile

vmlinux 23%
intel_drv 22%
libc 16%
libxul 13%
pixman 9%
libexa 4%
Xorg 4%
other 9%

So, as hoped, a lot of the libc time went away. But there's still a heck of a lot left, as well as kernel system-call time. That's clearly in gettimeofday and that's easy to track down to the I830WaitLpRing function. Fortunately, it's also extremely easy to eliminate that system call almost entirely there. The only reason it's being called is to ensure that we don't busy-wait forever, (due to a video card lockup). So we could call gettimeofday only once every several thousand iterations and still detect lockup very quickly. Here's a patch to do exactly that.

So, now take a look at the performance with this patch attached:

EXA-no-gettimeofday (54 ms.) system profile symbols profile

intel_drv 49%
libxul 16%
libc 10%
pixman 7%
libexa 4%
Xorg 2%
vmlinux 2%
other 10%

Again, a lot of the libc time was eliminated, and a huge amount of the kernel time was eliminated as well. But, sadly, and dramatically, the overall performance did not improve much at all. The time spent in the Intel driver increased not only in percentage, but also increased in absolute timing, taking up almost all of the time saved from libc and the kernel.

What's going on here? What's happening is that the driver is spending an awful lot of time busy-waiting, and all we did here was to help it busy-wait even faster. Now, we know that there aren't any software fallbacks happening here, so what's all the waiting about? It appears to be caused primarily by i965_prepare_composite which includes the following comment and code:

/* Because we only have a single static buffer for our state currently,
 * we have to sync before updating it every time.
 */
i830WaitSync(pScrn);

That is, when we want to render multiple composite operations in sequence, instead of doing some nice, pipelined fire-and-forget operations, the driver is currently waiting for the previous composite operation to entirely complete before starting off another. And it's doing that waiting with CPU-burning busy-waiting.

I'll need to learn a bit more about this hardware before I can know the right way to fix this, (my, but documentation would certainly be helpful about now). For example, it would be easy to setup a circular buffer of state structures. But can I get the hardware to tell me when composite operations complete so that I can efficiently reuse those? Another approach would be to stuff the necessary compositing state into the LP ring the same way the compositing commands are being sent. Is there any reason not to do that?

Meanwhile, it looks like there's a bunch of state in the strucure that needn't be sent every time, (much of it is always constant). So there are likely easy ways to improve that as well. I'm definitely looking forward to people with more Intel-hardware-specific knowledge helping me out here, (or just plain fixing the thing).

Meanwhile, even if the driver time were completely eliminated, the rest of the time in this profile would still exceed what was spent in the NoAccel case. Some things are plainly working well, such as the fact that the pixman time has gone down. Some things are not working well, such as the new time being spent in libexa, (which appears to be mostly an inefficient approach for computing an LRU heuristic as I mentioned earlier). Some of the increased time, such as that of libc, might be related to driver problems that we hope to fix.

But one item in particular looks very strange. The libxul code is spending signficantly more time in the EXA case than in the NoAccel case, when we wouldn't expect it to be aware of the difference at all. That will be an interesting thing to track down, (and it sure would be nice if that were the biggest thing to worry about right now). But clearly we've got a lot of work to do to get the i965 performing properly with EXA.