27 August 2012

Power Saving, PSODs, ESXi 5, and Intel E7 CPUs

Strangely enough, I have not found much on the Internet about this particular issue:


Why is that strange?  Primarily because I have a hard time believing we are the only ones using ESXi 5.0 U1 on HP DL580 G7 servers, which appears to be one of the combinations where you would have this issue.  And actually, the problem is apparently more widespread than that.  Rumor is that it affects any server with the Intel Xeon E7 or 75xx CPU in it.  By now you are surely wanting more details.  Read on...

At a client site, we started receiving new HP DL580 G7 servers.  The first use was for VDI and in that environment we weren't (and still aren't) running ESXi 5 yet.  Those happened to have the Intel X7560 CPUs.  Let's ignore that for now.  Later we started deploying new HP DL580 G7 servers with the Intel E7-L8867 in the server environment, which was already running ESXi 5.0 U1.  This is when the problems began.  At seemingly random intervals, we'd receive the PSOD shown at the top of this post.  Time for an SR to VMware.

In short, new power management features (in the form of CPU throttling, as I understand it) were added with ESXi 5, and VMware attempts to use them by default.  In our case, we have our BIOS set to a mode that is not supposed to allow the OS (ESXi) any control over the CPU settings, but that does not seem to matter here.  Per VMware, even in that scenario, SMIs (System Management Interrupts) are still generated and the CPUs do not respond to them in a timely fashion (hence the "didn't have a heartbeat" messages in the PSOD), and that's when VMware triggers the PSOD.

So at this point I felt like VMware was pointing the finger at HP because they said their internal PR only listed HP DL580 G7s as the affected servers.  Later on it seemed HP was pointing the finger back at VMware.  I, of course, had a support case open with both vendors and at one point HP released a new BIOS update that was rumored to address the problem.  The release notes read:

Problems Fixed:
Addressed a processor issue with Intel Xeon 7500-series Processors and Intel Xeon E7-series Processors that may result in unpredictable system behavior including application level errors, system hangs, Windows blue screens, Linux kernel panics, or a VMware ESX Purple Screen of Death (PSoD).  This issue is not unique to HP ProLiant servers and could impact any system using affected processors.  This revision of the System ROM contains an updated version of Intel's microcode for affected processors that addresses this issue.  The fix for this issue does not impact performance.  Due to the potential severity of the issue addressed in this revision of the System ROM, this System ROM upgrade is considered a critical fix.  HP strongly recommends an immediate update to firmware revisions with required critical fixes.

At this point I was very excited, thinking that someone was actually addressing the issue.  Unfortunately, I received a PSOD on one of the servers less than 24 hours after applying this BIOS update.  Maybe the issue with the Intel CPUs that the release notes speak of was something entirely different.

From the beginning, both VMware and HP had a workaround I could apply, but as our servers were not in production yet, I wanted to spend the extra time troubleshooting the issue in hopes of a proper, long-term fix.  But, fast-forward to the end, VMware eventually started getting reports of other vendors' servers being affected--also with the Intel Xeon E7 CPU.

So the bottom line is, VMware has decided that they are going to just disable the power management features that are causing this issue as of ESXi 5.0 U2 when it is released.  Supposedly the code is being re-written for ESXi 6.0 and they are confident the issue will be resolved.  In addition, I was told they aren't even completely sure it will only affect the aforementioned Intel Xeon E7 / 75xx CPUs, so we've made the decision to disable the power management feature on all of our servers.

The setting to disable is an advanced VMkernel boot setting that can be found under the "Software > Advanced Settings > VMkernel > Boot" section and is named "VMkernel.Boot.usePCC" (PCC stands for Processor Clocking Control, the interface behind what HP calls Collaborative Power Control in the BIOS).  Simply uncheck that option if using the vSphere Client to make the change.  VMware claims this requires a reboot to take effect, though I've seen no indication that this is the case.  Of course, vNugglets.com uses PowerCLI to make the change at build time with the following one-liner:

Set-VMHostAdvancedConfiguration -VMHost <ESXi host name> -Name "VMkernel.Boot.usePCC" -Value $false
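
For more than one host, the same change can be looped and then read back to confirm the new value.  What follows is only a sketch: it assumes an existing Connect-VIServer session, the cluster name "Prod-Cluster" is a made-up placeholder, and it sticks with the same cmdlet family as the one-liner above (newer PowerCLI releases deprecate these in favor of Get-AdvancedSetting / Set-AdvancedSetting).

```powershell
# Sketch only: assumes Connect-VIServer has already been run.
# "Prod-Cluster" is a hypothetical cluster name; substitute your own.
$vmHosts = Get-Cluster -Name "Prod-Cluster" | Get-VMHost

foreach ($vmHost in $vmHosts) {
    # Same setting as the one-liner above, applied to each host in turn
    Set-VMHostAdvancedConfiguration -VMHost $vmHost -Name "VMkernel.Boot.usePCC" -Value $false | Out-Null

    # Read the value back; the cmdlet returns a hashtable keyed by setting name
    $settings = Get-VMHostAdvancedConfiguration -VMHost $vmHost -Name "VMkernel.Boot.usePCC"
    "{0}: VMkernel.Boot.usePCC = {1}" -f $vmHost.Name, $settings["VMkernel.Boot.usePCC"]
}
```

Since this is a boot-time option, it is probably safest to follow VMware's guidance and schedule a reboot before counting on the change.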

However you prefer to do it, if you use Host Profiles, you'll need to keep one more thing in mind.  By default, the Host Profile setting "Power system > CPU Policy" is set to "Balanced" if the profile was created from a default build of ESXi 5, but if you make the above change and then create the Host Profile from that host, it will be set to "User must explicily choose the policy option" (and yes, that's a VMware typo, not mine).  If you are going back and retroactively changing all your hosts to disable PCC, you may find your hosts are out of compliance with the attached Host Profile, because it expects whatever power management policy you had previously specified and that feature is now essentially disabled.  As a matter of fact, if you tried to apply a Host Profile with a "Power system > CPU Policy" set against a host with PCC disabled, you'd see an error message like so:


To get around this, you'll need to either update the Host Profile from the modified reference host, or simply edit the Host Profile and change the "Power system > CPU Policy" value to "User must explicily choose the policy option" as hinted at earlier.  Then you should be able to re-apply the Host Profile to the affected hosts and it'll disable the option for you, or you can just script it with PowerCLI and verify the hosts are now compliant with your freshly updated Host Profile.
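
The compliance check mentioned above could look something like the following.  This is a hedged sketch, not a tested procedure: it assumes a connected PowerCLI session, and "Prod-Cluster" is a hypothetical cluster name.  With no profile specified, Test-VMHostProfileCompliance checks each host against the Host Profile attached to it and, as far as I know, returns objects only for hosts that are out of compliance.

```powershell
# Sketch only: assumes Connect-VIServer has already been run.
# "Prod-Cluster" is a hypothetical cluster name; substitute your own.
$vmHosts = Get-Cluster -Name "Prod-Cluster" | Get-VMHost

# With no -Profile given, each host is tested against its attached Host Profile
$results = Test-VMHostProfileCompliance -VMHost $vmHosts

if ($results) {
    # List the failing hosts and the profile elements that did not match
    $results | Select-Object VMHost, VMHostProfile, IncomplianceElementList
} else {
    "No non-compliant hosts reported."
}
```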

Hopefully this blog post helps someone else that may run into this particular issue with the Intel E7 CPU on ESXi 5.0 U1.

8 comments:

  1. Brought to this page by a search for "balanced cpu policy" while googling the issue when applying a host profile. We are having the same "random PSOD" with Dell R710 X5680 and R720 E5-2690. We are disabling C-States and C1E, and setting CPU Power management to Maximum Performance in the BIOS. Had calls with VMware and Dell; all seemed a bit vague.
    Just noticed there is a C1E field in the same location as mentioned, guess I'll have to disable it there as well.
    All seems like a little too much guesswork!

  2. We see very similar effects:
    Using DL385 G7 and DL585 G7 (AMD CPUs) in two clusters with ESXi 5, we have also seen many PSODs, even on U2. Updating the ESXi hosts to that version did not seem to improve anything regarding PSODs.
    VMware still points the finger at HP and recommends disabling power management at the BIOS level and updating the machine's BIOS. I've never updated a server's BIOS this often before; those servers are now on their 4th or 5th version! This is very frustrating: it generates work and even consumes more power than needed. G6 models with older AMD CPUs never had these issues, even with ESXi 5.
    BUT: HP released new BIOS versions within the last few days, dated December 2012, for AMD and Intel G7 servers. We'll give that a try.

    The Intel universe also generates headaches here:
    Our DL580 G7 (SQL Server on Windows) with E7 CPUs shows WHEA errors (ECC) in its event log, which is documented by HP here:
    http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c03282091
    The August BIOS release notes say to "disable power management on the Intel Scalable Memory Interconnect (SMI) link", which is also quite unsatisfying regarding power consumption.
    In the last few days a new BIOS was released:
    "Optimized the memory settings to improve the reliability of the memory system."
    "Resolved an issue where Low Latency Configurations would not function properly."
    This gives us hope of fixing the event log issue (we still have a case open with HP on that); maybe it also addresses the issues regarding VMware.

    Replies
    1. So I'm going to reply to myself:
      After updating the DL585 G7s to the latest BIOS that day, both crashed only a few hours later with PSODs. One of the two hosts crashed three times in total, so we downgraded its BIOS back to the last working version... It was never that obvious.
      The DL385 G7s are still running, but we updated our VMware case with the DL585s' trouble.

    2. We have been running the April 20th, 2011 BIOS version for quite some time now and have our HP power profile set in the BIOS to Maximum Performance, which turns off most (all?) of the Intel CPU power saving features. We've still had PSOD issues as well, but not due to this PCC issue. Instead, our latest issue is with the QLogic NetXen NICs, which come both as the four integrated NICs in a DL580 G7 and in the HP NC522SFP NICs we are using. They keep releasing driver update after driver update, though we typically don't notice until we open a support case with VMware for a PSOD and they inform us. We are currently in the midst of applying ESXi 5.0 U2 as well as the Oct. 2012 HP SPP firmware updates (which bring the BIOS to Aug. 2012) and also updating the nx_nic driver (QLogic NetXen) to 5.0.626 (from 5.0.619). Hopefully this will help us a bit. Will find out in the coming weeks I'm sure. Needless to say, the G7s have been a rough road for us and the QLogic NIC isn't helping...

    3. VMware has now come to the following conclusions:
      - The errors on our DL585 G7s are related to their 522 and 375 NICs, as AC already posted. We were advised to upgrade their firmware months ago and it's still up to date; now a VMware-internal document mentions that a firmware downgrade might fix this, but customers should contact HP.
      We are still waiting for feedback from HP regarding this.
      - Our DL385 G7s' error(s) are related to the PCC issue, and we were pointed to the same workaround mentioned above and in this KB article:
      http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2000091

  3. VMware support acknowledged that PCC has been inactive by default since U2 and cannot be turned on again, so now they have to spend more time reading our 385 G7s' logs, since we still experience this issue with U2.

    There is a later firmware for the QLogic NICs in the DL585 G7s that can be found on SPP 10/2012, which curiously was not installed. We were advised to install it and apply January's BIOS again. This worked for three days, so we tried that combination on our second host too. Two hours later both crashed again, and today three times each, even after having set all BIOS options to "Maximized Power". One of them left "Uncorrectable Machine Check Exception" on all four sockets in its IML.
    HP's Bulgarian support engineer supplied us with useless advisories referring to non-existent BIOS options, so we decided to go back to BIOS 08/15/2012 while HP tries to read the latest VMware logs we initially generated for VMware.

    If we don't get our DL585s stable now we will try to downgrade ESXi5 back to U1 or earlier.

  4. We have 24 ESX hosts (DL360 and DL380 G7s) with similar RAM/processor configs and only one has had this issue. Even after processor swaps and a motherboard swap, it continued. We did manage to get it stable by pulling one of the two CPUs, but upon replacing the "bad" CPU with an RMAed part the problem continued.

    HP asked us to disable Collaborative Power Control in the BIOS. Might help others who are experiencing problems.

  5. We set the Maximum Performance power profile and disable collaborative power in the BIOS. That is what we use, and we have no problems with BL660G8, BL680G7, BL460G6/7 on ESX 5.0 and 5.1U1.

