Using opannotate to make sense of profiles

After I recently posted some surprising profiles, I received useful feedback from Michel Dänzer, Adam Jackson, and Eric Anholt. There was general agreement that the i965_prepare_composite function is generally stupid about acting synchronously in order to reuse a single state buffer, and that this shouldn't be too hard to optimize. Options include using a ring of buffers and synchronizing only when wrapping around and also optimizing to not send redundant data.

I had mentioned earlier that I had tried eliminating the i830WaitSync calls, but hadn't noticed any performance change. Well, one problem was that I had edited the files on the wrong machine, (I'm still not a true X hacker since I'm not totally in the groove of the two-machine debugging yet). It certainly did make a difference when I removed these calls from the code actually executing, (all the text appears in arbitrary colors, giving me more psychedelia than I actually need on my desktop). But the performance really didn't improve at all.

Then I received a very helpful email from Roland Dreier, (thanks Roland!), cluing me in to opannotate. The results I had posted before were from opeport which gives profiling reports with function-level granularity. The opannotate utility gives a similar report, but at the granularity of either lines of source code or assembly instructions.

So these reports make it clear that sometimes there is more going on than meets the eye by simple examination of the source code. For example, much of the i965_prepare_composite function looks like simple assignments such as these:

memset (cc_viewport, 0, sizeof (*cc_viewport));
cc_viewport->min_depth = -1.e35;
cc_viewport->max_depth = 1.e35;

/* Color calculator state */
memset(cc_state, 0, sizeof(*cc_state));
cc_state->cc0.stencil_enable = 0;   /* disable stencil */
cc_state->cc2.depth_test = 0;       /* disable depth test */

But now take a look at the same assignments annotated by opannotate. The first two columns are sample counts and percentage of total time attributed to each line of code, (recall that we're trying to determine why i965_prepare_composite is using more than 25% of the total time in the test):

  274  0.0098 :    memset (cc_viewport, 0, sizeof (*cc_viewport));
  124  0.0044 :    cc_viewport->min_depth = -1.e35;
  122  0.0044 :    cc_viewport->max_depth = 1.e35;
              :
              :    /* Color calculator state */
  861  0.0307 :    memset(cc_state, 0, sizeof(*cc_state));
18559  0.6623 :    cc_state->cc0.stencil_enable = 0;   /* disable stencil */
17836  0.6365 :    cc_state->cc2.depth_test = 0;       /* disable depth test */

Clearly, not all assignments are created equal as the final two assignments are a couple of orders of magnitude slower than the first two. For a closer look, here's a chunk of the annotated assembly code showing some very expensive operations:

              :    cc_state->cc2.depth_test = 0;       /* disable depth test */
              :    cc_state->cc2.logicop_enable = 0;   /* disable logic op */
              :    cc_state->cc3.ia_blend_enable = 1;  /* blend alpha just like colors */
              :    cc_state->cc3.blend_enable = 1;     /* enable color blend */
    1 3.6e-05 :   33277:    movzbl 0xd(%ecx),%eax
18168  0.6484 :   3327b:    andb   $0x7f,0x3(%ecx)
17836  0.6365 :   3327f:    andb   $0x7f,0x9(%ecx)
12306  0.4392 :   33283:    andb   $0xfe,0x8(%ecx)
 7307  0.2608 :   33287:    or     $0x30,%eax

So, we've got some bitfields being used here. Is this uncached memory that's causing it to be so expensive?

If I'm as fortunate as I was with my last post, hopefully someone will drop a handy note into my inbox telling me how to make this function go blisteringly fast. I'm really looking forward to that.