Oh the pain

Where to start? Well, I have an HP laptop. I had previously owned one of HP’s consumer models (Pavilion), but in fall 2010 I went for a business model: the HP EliteBook 8740w. I didn’t regret it for a year and a half. Then, in March 2012, the system suddenly started crashing sporadically with a bug-check code that clearly pointed to the hardware (0x124), PCI Express to be exact. I turned to the Icelandic HP partner (Opin Kerfi) for help, to no avail. They claimed to have tested it and couldn’t reproduce the crash. Well, it’s a sporadic crash, so no surprise there. Other than a week without a laptop and three nice bike rides, this didn’t give me anything. How long would they possibly spend on a single machine anyway …

Since I am a Windows driver developer, I know how to interpret crash dumps, and I’ve seen my share of them. Unfortunately all the dumps are corrupt and the WHEA record is missing. Switching to full dumps (yeah, that’s 16 GiB) didn’t make a difference either. The basic information in the dumps I got was clearly related to hardware issues (0x124 and very rarely 0x116), though. And now I am frequently getting these WHEA warnings in the event log, too.

Event 17, WHEA-Logger
Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)
Bus:Device:Function: 0x0:0x3:0x0
Vendor ID:Device ID: 0x8086:0xd138
Class Code: 0x30400

The device, as I found out by means of lspci (on Linux), is:

00:03.0 PCI bridge: Intel Corporation Core Processor PCI Express Root Port 1 (rev 11)

… actually the “component” part already stated that. Silly me 😉
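As a small aside, the WHEA event and lspci write the same address and IDs in different notations. The following is a minimal Python sketch of the conversion, assuming the event fields are colon-separated hex values as shown above; the function names `whea_to_lspci` and `whea_ids_to_lspci` are made up for illustration.

```python
def whea_to_lspci(bdf: str) -> str:
    """Convert a WHEA-style 'bus:device:function' string of hex values
    (e.g. '0x0:0x3:0x0') into lspci's 'bb:dd.f' notation."""
    bus, device, function = (int(part, 16) for part in bdf.split(":"))
    return f"{bus:02x}:{device:02x}.{function:x}"


def whea_ids_to_lspci(ids: str) -> str:
    """Convert 'vendor:device' hex values (e.g. '0x8086:0xd138') into the
    'vvvv:dddd' form that lspci uses for numeric IDs."""
    vendor, device = (int(part, 16) for part in ids.split(":"))
    return f"{vendor:04x}:{device:04x}"


print(whea_to_lspci("0x0:0x3:0x0"))        # 00:03.0
print(whea_ids_to_lspci("0x8086:0xd138"))  # 8086:d138
```

The second value can be passed to `lspci -d 8086:d138` to list just that device.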

Keep in mind that bug-check 0x124 is named WHEA_UNCORRECTABLE_ERROR, i.e. it is the ugly little stepsister of the WHEA warning above. The difference is that the warning means the error could be corrected, while the bug-check means it couldn’t. Parameter1 was consistently 0x4 (i.e. “An uncorrectable PCI Express error occurred”).
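To keep track of which bug-checks show up and how often, a small lookup plus a tally is enough. This is only a rough Python sketch: the tables are deliberately incomplete (just the two bug-check codes mentioned in this post and two Parameter1 values), the helper `describe` is a made-up name, and the sample list is invented for illustration rather than taken from the actual dumps.

```python
from collections import Counter

# Only the two bug-check codes that appear in this post.
BUGCHECK_NAMES = {
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x116: "VIDEO_TDR_FAILURE",
}

# Parameter1 of bug-check 0x124 names the error source.
# Deliberately incomplete: just two of the documented values.
WHEA_PARAM1 = {
    0x0: "machine check exception",
    0x4: "uncorrectable PCI Express error",
}


def describe(code, param1=None):
    """Return a short human-readable label for a bug-check."""
    name = BUGCHECK_NAMES.get(code, "unknown bug-check")
    text = f"{code:#x} {name}"
    if code == 0x124 and param1 is not None:
        text += f" ({WHEA_PARAM1.get(param1, 'other error source')})"
    return text


# Hypothetical crash list (code, Parameter1), NOT real data from the dumps.
sample = [(0x124, 0x4), (0x124, 0x4), (0x116, None), (0x124, 0x4)]

tally = Counter(describe(code, p1) for code, p1 in sample)
for label, count in tally.most_common():
    print(f"{count:3d} x {label}")
```

A tally dominated by a single code and Parameter1 value points to one consistent hardware error source rather than random corruption.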

Although HP support led me through the same routine every time, which was a bit annoying, it yielded results once I had the eCare Pack. I had purchased it mainly because I wanted the “next business day” option, as I was under time constraints the first time around. However, I saw a note in my support ticket that said approximately “the customer is unwilling to reinstall the system” (even though I had done exactly that, after first taking a clean backup of my system state). But it’s true that I was unwilling to use the recovery DVD (and unable, because it doesn’t exist any more)! The very same support person had also told me that Windows 7 x64 Ultimate wasn’t a supported OS for my model, while Windows 7 x64 Pro would be. A ridiculous comment if you know a little about the changes in Windows, particularly since Vista SP1.

The eCare Pack seems to explain the difference in how this was handled compared to here.

Before the first repair attempt I had tried hard to find a way to reliably reproduce the crash so that I could demonstrate it at will. Unfortunately to no avail. I also ran the full Memtest86+ to verify that the memory wasn’t the culprit. Besides, past experience with (hardware) memory errors tells me that the symptoms in such cases are often very random, unlike in my case, where I consistently got 0x124 bug-checks with Parameter1 == 0x4.

Now, the first repair actually increased the crash frequency, with exactly the same bug-checks as before: the very rare 0x116 and the frequent 0x124. That first time around (end of June) the WiFi card and the motherboard were exchanged. Unfortunately I had to leave the country, literally, right after that repair, so by the time I figured out that the repair attempt had failed it was too late for any corrections. So I reported back to HP support that the issue still existed and that the frequency had increased.

For most of the time since then I didn’t use the laptop at all, because any work is at risk with crashes that frequent. So for all practical purposes I used it to watch a movie every now and then, but had “decommissioned” it as my main workstation from the beginning of July to the end of August.

For giggles I tried another reinstall of Windows 7 x64 (Pro! – not a problem as an MSDN subscriber) at the end of August and the bug-checks were – unsurprisingly – still occurring. Ten days ago on Saturday I had more than a dozen crashes (0x124) in one day. When I reinstalled the system two weeks ago, I didn’t even manage to install the latest updates before the first crash – but I wasted a whole lot of time, of course.

Friday last week: the HP field service technician was here to exchange the graphics card. Unfortunately the graphics card sent as a spare came without the back plate, the part that contains the screw holes to fasten the heat sink. He spent some time here, but nothing changed. Obviously the weekend didn’t bring any change in behavior.

This week on Monday: the technician showed up again, a tad later (the appointment with me was squeezed in on short notice). This time he had verified that the graphics card came with the back plate before he showed up. He exchanged the graphics card. The system worked fine until two hours before midnight (i.e. for approx. nine hours), when I had the first bug-check. Surprisingly it was a 0x116. Subsequent crashes (three more before I went to bed) were 0x124 again. I sent him an SMS; he called yesterday to tell me he had reported back to HP and his contact had approved exchanging the CPU.

Today: got a call from the technician asking what time would be convenient. Told him any time was fine; he decided half an hour later would be best. He showed up and exchanged the CPU. The system booted fine, and we ran FurMark at 1600×1200 for approx. 18 min without problems. Note: previous attempts at stressing the CPU and/or GPU had yielded nothing either. So the problem could never be reproduced reliably.

The technician had just left, I turned on the machine to check my email and during login – bang, my old friend bug-check 0x124 again. Since then I’ve had two more, one while writing this comment.

Right during the boot-up after the first crash I called the technician again, he told me he’d call back. After a few minutes he called back and told me he had suggested to HP to replace the machine completely. Now I’ll see what the near future brings.

Don’t get me wrong. HP support was rather professional and they were indeed trying to be helpful. The same holds for the field service technician, who went out of his way to enable me to give feedback directly to him. However, the problem is still unresolved, and it’s hard to work with a machine that crashes randomly and so frequently.

// Oliver
