I recently encountered an issue while running an ESXi 4.1 environment: I was getting a Purple Screen of Death at almost exactly the 24-hour uptime mark. Like most technologists, I first hit Google with the error code, and had no luck. I did find a similar error code, which suggested that the problem had something to do with storage I/O. The issue did not seem driver related, so I began to look into firmware and BIOS updates for the server.
After some investigation, I realized that there were four BIOS/firmware components in my SR1550 that needed updating. After trying to do them all manually, I found a nifty utility called the Intel Deployment Assistant. The Intel DA utility is a bootable CD that lets you configure many different features, including the ability to connect to the Internet, check for and download the latest BIOS/firmware updates, and install them automatically. This worked great, and I found out that the system was running a two-year-old BIOS and firmware while the current builds were much newer. So, I let the utility do its thing, but one update kept failing. I tried it on all three of the SR1550s and had the same issue on all of them: the HSC/backplane firmware would not update.
At this point, I started looking for older revisions of the Hot Swap Controller firmware, figuring that the jump from revision 1.41 to 2.15 was too great. There were many revisions between 1.41 and 2.15, but they were not available on Intel’s website. Finally, after a lot of searching, I found an Intel document that confirmed my suspicion: I needed to upgrade to an intermediate revision before jumping to 2.15. Alas, that meant I had to get in contact with Intel about the issue.
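As a side note, if you are tracking firmware revisions like these in a script, compare them as numbers rather than as plain strings. Here is a minimal sketch; the function names `parse_version` and `needs_update` are my own for illustration and are not part of any Intel tool, and it assumes revisions are simple dotted numbers like the ones above.

```python
def parse_version(text):
    """Turn a dotted revision string such as "2.15" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(installed, latest):
    """Return True when the installed revision is older than the latest one."""
    return parse_version(installed) < parse_version(latest)


# The HSC firmware revisions from this post: installed 1.41, latest 2.15.
print(needs_update("1.41", "2.15"))
```

Comparing tuples of integers avoids the trap where a plain string comparison would rank a revision like "1.9" above "1.41".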
At this point, I’m sitting in the server room, it’s already after six, and my stomach is growling. I had no phone reception in the server room, so I clicked the button to chat with an Intel support representative. Much to my surprise, after about 10 minutes, the guy I was in contact with had a solution. Although he could not locate the firmware revision I needed, he found an internal document stating that if I disconnected the RMM2 and then proceeded with the upgrade to 2.15, it should work. So I popped open the server, disconnected the RMM2 module, booted up, manually launched the update from a DR-DOS boot disk, and it worked!
For those of you who do not know what the Intel RMM is, it is a Remote Management Module that gives you out-of-band control of the server. As a matter of fact, it offers an IP-based KVM over SSL through a Java applet, and it even lets you remotely mount ISO images and boot the server from them. How awesome is that? After that little bugger gave me all that headache, I decided to give it a try, and it is great. You use a little utility called psetup to configure the IP and login settings, and then you’re good to go!
Anyway, yes, after updating all of that firmware, I now have over 180 days of uptime on the ESXi boxes. I hope this helps any other SR15xx or SR25xx users out there who may be encountering stability issues with ESXi.