27 August 2012

Power Saving, PSODs, ESXi 5, and Intel E7 CPUs

Strangely enough, I have not found much on the Internet about this particular issue:

Why is that strange?  Primarily because I have a hard time believing we are the only ones running ESXi 5.0 U1 on HP DL580 G7 servers, which appears to be one of the combinations that triggers this issue.  And actually, the problem is apparently more widespread than that: rumor has it that it affects any server with an Intel Xeon E7 or 75xx CPU in it.  By now you surely want more details.  Read on...

At a client site, we started receiving new HP DL580 G7 servers.  The first use was for VDI and in that environment we weren't (and still aren't) running ESXi 5 yet.  Those happened to have the Intel X7560 CPUs.  Let's ignore that for now.  Later we started deploying new HP DL580 G7 servers with the Intel E7-L8867 in the server environment, which was already running ESXi 5.0 U1.  This is when the problems began.  At seemingly random intervals, we'd receive the PSOD shown at the top of this post.  Time for an SR to VMware.

In short, new power management features (in the form of CPU throttling, as I understand it) were added in ESXi 5, and ESXi attempts to use them by default.  In our case, the BIOS is set to a mode that is supposed to deny the OS (ESXi) any control over the CPU settings, but that does not seem to matter here.  Per VMware, even in that scenario SMIs (System Management Interrupts) are still generated and the CPUs do not respond to them in a timely fashion (hence the "didn't have a heartbeat" messages in the PSOD).  That is when VMware triggers the PSOD.

So at this point I felt like VMware was pointing the finger at HP because they said their internal PR only listed HP DL580 G7s as the affected servers.  Later on it seemed HP was pointing the finger back at VMware.  I, of course, had a support case open with both vendors and at one point HP released a new BIOS update that was rumored to address the problem.  The release notes read:

Problems Fixed:
Addressed a processor issue with Intel Xeon 7500-series Processors and Intel Xeon E7-series Processors that may result in unpredictable system behavior including application level errors, system hangs, Windows blue screens, Linux kernel panics, or a VMware ESX Purple Screen of Death (PSoD).  This issue is not unique to HP ProLiant servers and could impact any system using affected processors.  This revision of the System ROM contains an updated version of Intel's microcode for affected processors that addresses this issue.  The fix for this issue does not impact performance.  Due to the potential severity of the issue addressed in this revision of the System ROM, this System ROM upgrade is considered a critical fix.  HP strongly recommends an immediate update to firmware revisions with required critical fixes.

At this point I was very excited, thinking that someone was actually addressing the issue.  Unfortunately, I received a PSOD on one of the servers less than 24 hours after applying this BIOS update.  Maybe the issue with the Intel CPUs that the release notes speak of was something entirely different.

From the beginning, both VMware and HP had a workaround I could apply, but as our servers were not in production yet, I wanted to spend the extra time troubleshooting the issue in hopes of a proper, long-term fix.  But, fast-forward to the end: VMware eventually started getting reports of other vendors' servers being affected, also with the Intel Xeon E7 CPU.

So the bottom line is, VMware has decided that they are going to just disable the power management features that are causing this issue as of ESXi 5.0 U2 when it is released.  Supposedly the code is being re-written for ESXi 6.0 and they are confident the issue will be resolved.  In addition, I was told they aren't even completely sure it will only affect the aforementioned Intel Xeon E7 / 75xx CPUs, so we've made the decision to disable the power management feature on all of our servers.

The setting to disable is an advanced VMkernel boot setting that can be found under the "Software > Advanced Settings > VMkernel > Boot" section and is named "VMkernel.Boot.usePCC" (PCC stands for Processor Clocking Control, the interface behind HP's Collaborative Power Control BIOS option).  Simply uncheck that option if using the vSphere Client to make the change.  VMware claims this requires a reboot to take effect, though I've seen no indication that this is the case.  Of course, vNugglets.com uses PowerCLI to make the change at build time with the following one-liner:

Set-VMHostAdvancedConfiguration -VMHost <ESXi host name> -Name "VMkernel.Boot.usePCC" -Value $false
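
For hosts that are already built, the same change can be made in bulk.  The following is only a rough sketch under a few assumptions: an existing Connect-VIServer session, a placeholder cluster name ("MyCluster", not from our environment), and a PowerCLI release new enough to include the Get-AdvancedSetting / Set-AdvancedSetting cmdlets.  On older releases, looping over the one-liner above should accomplish the same thing.

```powershell
# Assumes Connect-VIServer has already been run.
# "MyCluster" is a placeholder cluster name; substitute your own.
$vmhosts = Get-Cluster -Name "MyCluster" | Get-VMHost

# Disable PCC on every host in the cluster
$vmhosts |
    Get-AdvancedSetting -Name "VMkernel.Boot.usePCC" |
    Set-AdvancedSetting -Value $false -Confirm:$false

# Report the resulting value per host so the change can be eyeballed
$vmhosts |
    Get-AdvancedSetting -Name "VMkernel.Boot.usePCC" |
    Select-Object Entity, Name, Value
```

The second pipeline is there purely as a sanity check; it changes nothing and can be run on its own beforehand to see which hosts still have the option enabled.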

However you prefer to do it, if you use Host Profiles, you'll need to keep one more thing in mind.  By default, the Host Profile setting "Power system > CPU Policy" is set to "Balanced" if the profile was created from a default build of ESXi 5.  If you make the above change first and then create the Host Profile from that host, it will instead be set to "User must explicily choose the policy option" (and yes, that's a VMware typo, not mine).  If you are going back and retroactively disabling PCC on all your hosts, you may find they are out of compliance with the attached Host Profile, since the profile still expects whatever power management policy you had previously specified and that feature is essentially disabled now.  As a matter of fact, if you tried to apply a Host Profile with a "Power system > CPU Policy" set to a host with PCC disabled, you'd see an error message like so:

To get around this, you'll need to either update the Host Profile from the modified reference host, or simply edit the Host Profile and change the "Power system > CPU Policy" value to "User must explicily choose the policy option" as hinted at earlier.  Then you should be able to re-apply the Host Profile to the affected hosts and it'll disable the option for you, or you can just script it with PowerCLI and verify the hosts are now compliant with your freshly updated Host Profile.
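
As a minimal sketch of that last verification step, again assuming a connected PowerCLI session and the placeholder cluster name "MyCluster", something like this should check each host against whatever Host Profile is attached to it.  As I understand it, Test-VMHostProfileCompliance returns an entry for each host it finds out of compliance, so no output is the result you are hoping for.

```powershell
# Assumes Connect-VIServer has already been run.
# "MyCluster" is a placeholder cluster name; substitute your own.
Get-Cluster -Name "MyCluster" |
    Get-VMHost |
    Test-VMHostProfileCompliance
```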

Hopefully this blog post helps someone else who may run into this particular issue with the Intel E7 CPU on ESXi 5.0 U1.