After I recently posted some surprising
profiles, I received useful feedback from
Michel Dänzer, Adam Jackson, and Eric Anholt. There was general
agreement that the i965_prepare_composite
function is generally stupid
about acting synchronously in order to reuse a single state buffer,
and that this shouldn't be too hard to optimize. Options include using
a ring of buffers and synchronizing only when wrapping around and also
optimizing to not send redundant data.
I had mentioned earlier that I had tried eliminating the i830WaitSync
calls, but hadn't noticed any performance change. Well, one problem
was that I had edited the files on the wrong machine, (I'm still not a
true X hacker since I'm not totally in the groove of the two-machine
debugging yet). It certainly did make a difference when I removed
these calls from the code actually executing, (all the text appears in
arbitrary colors, giving me more psychedelia than I actually need on
my desktop). But the performance really didn't improve at all.
Then I received a very helpful email from Roland Dreier, (thanks Roland!), cluing me in to opannotate. The results I had posted before were from opeport which gives profiling reports with function-level granularity. The opannotate utility gives a similar report, but at the granularity of either lines of source code or assembly instructions.
So these reports make it clear that sometimes there is more going on than meets the eye by simple examination of the source code. For example, much of the i965_prepare_composite function looks like simple assignments such as these:
memset (cc_viewport, 0, sizeof (*cc_viewport));
cc_viewport->min_depth = -1.e35;
cc_viewport->max_depth = 1.e35;
/* Color calculator state */
memset(cc_state, 0, sizeof(*cc_state));
cc_state->cc0.stencil_enable = 0; /* disable stencil */
cc_state->cc2.depth_test = 0; /* disable depth test */
But now take a look at the same assignments annotated by
opannotate. The first two columns are sample counts and percentage of
total time attributed to each line of code, (recall that we're trying
to determine why i965_prepare_composite
is using more than 25% of
the total time in the test):
274 0.0098 : memset (cc_viewport, 0, sizeof (*cc_viewport));
124 0.0044 : cc_viewport->min_depth = -1.e35;
122 0.0044 : cc_viewport->max_depth = 1.e35;
:
: /* Color calculator state */
861 0.0307 : memset(cc_state, 0, sizeof(*cc_state));
18559 0.6623 : cc_state->cc0.stencil_enable = 0; /* disable stencil */
17836 0.6365 : cc_state->cc2.depth_test = 0; /* disable depth test */
Clearly, not all assignments are created equal as the final two assignments are a couple of orders of magnitude slower than the first two. For a closer look, here's a chunk of the annotated assembly code showing some very expensive operations:
: cc_state->cc2.depth_test = 0; /* disable depth test */
: cc_state->cc2.logicop_enable = 0; /* disable logic op */
: cc_state->cc3.ia_blend_enable = 1; /* blend alpha just like colors */
: cc_state->cc3.blend_enable = 1; /* enable color blend */
1 3.6e-05 : 33277: movzbl 0xd(%ecx),%eax
18168 0.6484 : 3327b: andb $0x7f,0x3(%ecx)
17836 0.6365 : 3327f: andb $0x7f,0x9(%ecx)
12306 0.4392 : 33283: andb $0xfe,0x8(%ecx)
7307 0.2608 : 33287: or $0x30,%eax
So, we've got some bitfields being used here. Is this uncached memory that's causing it to be so expensive?
If I'm as fortunate as I was with my last post, hopefully someone will drop a handy note into my inbox telling me how to make this function go blisteringly fast. I'm really looking forward to that.