On Intel Driver Stability

It's hard for me to believe that my last technical post on the progress of the Intel graphics driver was nearly a year ago. Certainly, a lot has happened in that year, and recent developments have been covered well by Keith, Eric, and Jesse. So I won't go into any of those details, but I do want to give my impression of where things stand.

As my colleagues have described so well, a major restructuring of the driver is now complete, (KMS, GEM, dri2, removal of XAA, EXA, dri1, etc.). This restructuring caused months of upheaval, followed by months of stabilization during which many bugs were shaken out. The end result is that the primary developers are all optimistic about the current state of the driver. Things seem better than ever.

In the meantime, however, it's also obvious that some users are not happy with the driver, and some have gotten the impression that the quality is getting consistently worse with time. Some have theorized that no quality-assurance testing is happening, or that the developers just plain don't care about regressions. Neither of those theories are true, so why is there such a disconnect between the developers and the users?

One reason for unhappy users is the lag between development happening and users getting their hands on the code. So the phase of upheaval and instability might be behind the developers, but some users are just entering into it. This is exacerbated by the fact that the major restructuring puts several important components of the driver into the kernel. It used to be that getting the latest Intel driver just meant updating an xf86-video-intel module or upgrading an xserver-xorg-video-intel package. But now it's essential to get a recent kernel as well, (note that almost all of the major fixes that Eric describes happened in the kernel, not the user-level driver). And these fixes aren't in a major kernel release until 2.6.30 which appeared only today.

In addition, distributions and users have a tendency to update the kernel less frequently than user-level components. So when a user checks out the latest xf86-video-intel from git, (without a kernel upgrade), not only is that user not getting the "latest driver" as recommended by the developers, but they might also be running a combination that the developers and QA never saw nor tested.

For example, I recently attempted to run the latest-from-git Intel driver with a kernel version as installed by current Debian installation CDs. This kernel is ancient with respect to our driver development (2.6.25 I think). What I found was that the X server would not start at all. Obviously, we had neglected to test this particular combination, (so yes, QA isn't perfect---but it also can't cover the infinite number of possible combinations). Fortunately, the fix was simple, and appears in the snapshot I just released (today!) so will appear in the 2.8.0 release very soon.

The packagers of X for Ubuntu have been working hard to address this exact problem of the lag between developers and users. They've setup a repository of packages named xorg-edgers where users can easily get packages built directly from git. I don't think it includes the kernel yet, as it will soon hopefully, but this is definitely a step in the right direction. Update: It does include a 2.6.30 kernel now---maybe all that time I spent writing this blog post wasn't wasted after all.

But what about the users that did upgrade their kernel and driver and are still seeing some major problems? In this case, we really want to hear from you. We take pride in our work and are committed to doing our best to make quality releases without regressions. When you hit some awful behavior, (GPU hang, scrambled mode, abysmal performance), it may be that you're just hitting a combination of components or application behavior that we've never seen before. We want those details and we want to fix the problem. File good bug reports and please make use of the "regression" keyword as appropriate. We pay attention to that.

Performance regressions are an especially important issue. In fact, performance is so important that I'll make a separate post about that. I've got some exciting news to share about performance measurement.