Thursday, September 25, 2008

Replacing System Boards On Sun Mx000 Series Servers

Hey there,

Shifting gears again, today we're going to take a look at doing some hardware maintenance on Sun's (or, technically, Fujitsu's) new Mx000 series servers. At this point, I think there are only 4 variants available: the M4000, M5000, M8000 and M9000. The numbers roughly equate to how much "better" one is than the other, with the highest number being the best. (This is a subjective point, though. Depending on your needs, an M4000 may be much better for you than an M9000.)

I wanted to take a look at Dynamic Reconfiguration (DR) on the Mx000 series, and this seemed like as good an example as any. One thing to keep in mind is that you can't hot-replace a system board this way on the midrange servers, since replacing that system board means replacing a motherboard unit (MBU), which can't be done on-the-fly. Why does this matter? Because "Enterprise" is the branding on every Mx000, but Sun's documentation classes the M4000 and M5000 as midrange and the M8000 and M9000 as high-end. If you're on one of the smaller two, check the DR User's Guide before you count on swapping a board without an outage ;)

The first thing you'll want to do is to log into the XSCF shell (akin to the Domain Shell or System Controller that we looked at in our old posts on working with Sun 6800 and 6900 Series Servers).

After that, you'll need to check the status of the domain with the "showdcl" command. You just need to pass it one option ( -d ) to identify the domain you want to check out (note the similarities to the 6800/6900 server DR operation. A lot of the commands are identical. That's the last time I'll refer back to those humungous machines. I promise :)

XSCF> showdcl -d 0
DID   LSB   XSB   Status
00                Running
      00    00-0
      01    01-0
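If you'd rather not eyeball that output every time, here's a minimal Python sketch that turns a pasted "showdcl" transcript into a dictionary. The function name parse_showdcl and the idea of pasting the transcript into a string are my own assumptions for illustration; nothing here talks to the XSCF, and it only expects the simple layout shown above.

```python
import re

# An XSB looks like "01-0": two digits, a dash, one digit.
XSB_PATTERN = re.compile(r"^\d{2}-\d$")


def parse_showdcl(text):
    """Parse pasted `showdcl -d <domain>` output (hypothetical helper)."""
    result = {"did": None, "status": None, "lsb_to_xsb": {}}
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, the XSCF> prompt line and the column header.
        if len(tokens) < 2 or not tokens[0].isdigit():
            continue
        if len(tokens) == 2 and XSB_PATTERN.match(tokens[1]):
            # An "LSB XSB" row, e.g. "01 01-0".
            result["lsb_to_xsb"][tokens[0]] = tokens[1]
        else:
            # The "DID Status" row, e.g. "00 Running".
            result["did"] = tokens[0]
            result["status"] = " ".join(tokens[1:])
    return result


sample = """XSCF> showdcl -d 0
DID LSB XSB Status
00 Running
00 00-0
01 01-0
"""

dcl = parse_showdcl(sample)
print(dcl)
# {'did': '00', 'status': 'Running', 'lsb_to_xsb': {'00': '00-0', '01': '01-0'}}
```

It splits on whitespace rather than on column positions, so it doesn't care how the console happened to pad the columns.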


Then, you'll need (or maybe just want) to check the status of the board that needs to be replaced. This can be done with the "showboards" (So familiar, but I promised not to go there anymore ;) command.

It's important to note that, if the board itself doesn't support the DR board deletion command, then, even if you're on an Enterprise system that supports DR, you won't be able to use DR to replace the board. Disregarding other, more eccentric problems that rarely happen (and are outside the scope of this post), the thing to look at here is the "Assignment" column. If a board shows as "Assigned" and meets all the other criteria (too Byzantine and awkward to expound upon here), it fits the definition of "doesn't support DR board deletion" mentioned above. You'll know for sure that it doesn't work when the command fails (which is another good reason to take an outage no matter how "resilient" your hardware uptime solution is). This is usually a very easy problem to fix, however. All you typically need to do is add one step before the next (to "unassign" the board) and it will magically support DR board deletion :) We'll group it in with the next step, just to keep things neat and tidy :)

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    y    y    Passed  Normal
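The same trick works for "showboards". This is a rough sketch, assuming every column value is a single word with no spaces in it (true for the output in this post, but I haven't checked every status the XSCF can print). The names COLUMNS and parse_showboards are mine, not Sun's.

```python
import re

# Column names copied from the header line of the output above.
COLUMNS = ["XSB", "DID(LSB)", "Assignment", "Pwr", "Conn", "Conf", "Test", "Fault"]


def parse_showboards(text):
    """Return one dict per board row found in pasted `showboards` output."""
    boards = []
    for line in text.splitlines():
        tokens = line.split()
        # A board row has one token per column and starts with an XSB like 01-0.
        if len(tokens) == len(COLUMNS) and re.match(r"^\d{2}-\d$", tokens[0]):
            boards.append(dict(zip(COLUMNS, tokens)))
    return boards


sample = """XSCF> showboards 01-0
XSB DID(LSB) Assignment Pwr Conn Conf Test Fault
-----------------------------------------------------------------
01-0 00(01) Assigned y y y Passed Normal
"""

boards = parse_showboards(sample)
print(boards[0]["Assignment"], boards[0]["Conn"], boards[0]["Conf"])
# Assigned y y
```

The header and the dashed separator are skipped automatically, since neither starts with something that looks like an XSB.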


Now, we'll delete the system board using the following command:

XSCF> deleteboard -c disconnect 01-0


Note that if the board is Assigned and doesn't support DR, you'll need to run this variant of the "deleteboard" command before the one above (to unassign it). Note, also, that it doesn't hurt to do this even if the board does support DR:

XSCF> deleteboard -c unassign 01-0


No sweat :)
Now, you'll want to check the status with "showboards" again. (We're going to pretend that the "Assigned" status is OK, like it usually is, from now on.)

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    n    n    Passed  Normal


You'll notice that the Conn (Connected) and Conf (Configured) columns now show n (no). This is good, since it means you've (logically) deleted the board from the domain configuration.
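To make the before-and-after comparison less of a squinting exercise, here's a small sketch that reports which columns changed between two "showboards" rows. The helper names board_row and changed_columns are made up for this post, and the two rows are simply the ones from the outputs above.

```python
COLUMNS = ["XSB", "DID(LSB)", "Assignment", "Pwr", "Conn", "Conf", "Test", "Fault"]


def board_row(line):
    """Split one `showboards` data row into a column-name -> value dict."""
    tokens = line.split()
    if len(tokens) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(tokens)}")
    return dict(zip(COLUMNS, tokens))


def changed_columns(before, after):
    """Map each column whose value differs to its (before, after) pair."""
    return {c: (before[c], after[c]) for c in COLUMNS if before[c] != after[c]}


before = board_row("01-0 00(01) Assigned y y y Passed Normal")
after = board_row("01-0 00(01) Assigned y n n Passed Normal")

print(changed_columns(before, after))
# {'Conn': ('y', 'n'), 'Conf': ('y', 'n')}
```

If anything other than Conn and Conf shows up in that result after a disconnect, that's your cue to stop and look harder before pulling hardware.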

Next, you'll need to get your hands dirty and physically replace the board. Actually, you probably won't if you've purchased Sun support (or wear a good sturdy pair of pleather gloves ;), since Sun won't let you touch it if you want them to come back out, at no additional charge, ever again should something actually be "wrong" with the replacement board they send you. We won't go into the boring details of hot-replacing the board, since it's (again) outside the scope of this increasingly long post, and should be performed by a Sun FE if you have no idea how to do it!

Once that's all over with, simply type

XSCF> replacefru


to complete the software part of replacing the "field replaceable unit," and check the status of the system board again. This time, also run:

XSCF> showboards -d 0


to ensure that all the system boards are still registered in the DCL (Domain Component List: basically a list of all the boards that make up the domain, which is domain 0 in our case today).

If the system board configuration has changed (like the division type has changed from Uni to Quad for some reason... like you figured out a way to sneak in a system upgrade or something ;), you may need to run the "setupfru" command. You most likely won't, since you're replacing your board with another board that's exactly the same as the old board, except it works ;)

If the replacement system board isn't registered in the DCL, double check to make sure it hasn't assigned itself to a different domain (I've never seen this happen) using:

XSCF> showboards -v -a


In any event, since it's not in the DCL for your domain, you'll just need to add it back by running:

XSCF> setdcl -d 0 -a 01=01-0


The -d flag is for the domain, and -a maps the LSB number to the XSB (both are listed in your "showboards" output).

Now, you should be on the road to all-the-way-good. But you should check and make sure, just in case:

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    n    n    Passed  Normal


Now, you'll want to check the status of the domain (basically to determine if you want to reboot it or not, which you don't, or you'll be directly contradicting everything DR stands for ;)

XSCF> showdcl -d 0
DID   LSB   XSB   Status
00                Running
      00    00-0
      01    01-0


and then, finally, you'll add the "new" board back to the domain and "configure" it, as well ("adding" will set the Conn column to y and "configuring" will set the Conf column to y).

XSCF> addboard -c configure -d 0 01-0


Then (and you're almost done - just being really cautious...) check the domain component list status again to make sure everything's cool:

XSCF> showdcl -d 0
DID   LSB   XSB   Status
00                Running
      00    00-0
      01    01-0


and run "showboards" on that new board to make sure everything is peachy (the words Assigned, Passed and Normal, plus a few letter y's, are excellent indicators that all is well :)

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    y    y    Passed  Normal
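That "Assigned, Passed, Normal and a few y's" rule of thumb is easy to write down as a check. This is only a sketch of the rule as stated in this post (the function name board_problems is my own invention), and it assumes the eight-column row layout shown above.

```python
def board_problems(row):
    """List every column in a `showboards` row that isn't in its happy state."""
    fields = row.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {row!r}")
    _xsb, _did_lsb, assignment, pwr, conn, conf, test, fault = fields
    # Column name -> (actual value, value we hope to see after the replacement).
    expected = {
        "Assignment": (assignment, "Assigned"),
        "Pwr": (pwr, "y"),
        "Conn": (conn, "y"),
        "Conf": (conf, "y"),
        "Test": (test, "Passed"),
        "Fault": (fault, "Normal"),
    }
    return [
        f"{name} is {actual!r}, expected {wanted!r}"
        for name, (actual, wanted) in expected.items()
        if actual != wanted
    ]


print(board_problems("01-0 00(01) Assigned y y y Passed Normal"))
# []
print(board_problems("01-0 00(01) Assigned y n n Passed Normal"))
# ["Conn is 'n', expected 'y'", "Conf is 'n', expected 'y'"]
```

An empty list means the board looks the way this post says it should. Anything else tells you which column to go stare at.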


Congratulations! You've just completed your DR system board replacement on an Mx000 series server. Now that you know how to do it, re-read these instructions and be amazed that it actually takes you longer to plod through this post than it does to do an actual board replacement ;)
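As a recap, here's a tiny sketch that prints the main command sequence from this post for whatever domain and XSB you plug in. The function name replacement_plan is hypothetical and it only builds strings; it doesn't run anything on the XSCF. It also leaves out the optional detours ("deleteboard -c unassign", "setupfru" and "setdcl"), and it can't do the physical board swap for you, which happens around the "replacefru" step.

```python
import re


def replacement_plan(domain, xsb):
    """Return the XSCF commands from this walkthrough, in order, as strings."""
    if not re.fullmatch(r"\d{2}-\d", xsb):
        raise ValueError(f"XSB should look like '01-0', got {xsb!r}")
    return [
        f"showdcl -d {domain}",                       # check the domain
        f"showboards {xsb}",                          # check the board
        f"deleteboard -c disconnect {xsb}",           # logically remove it
        f"showboards {xsb}",                          # Conn and Conf should be n
        "replacefru",                                 # software side of the swap
        f"showboards -d {domain}",                    # boards still in the DCL?
        f"showboards {xsb}",                          # check the new board
        f"showdcl -d {domain}",                       # domain still running?
        f"addboard -c configure -d {domain} {xsb}",   # connect and configure
        f"showdcl -d {domain}",                       # final domain check
        f"showboards {xsb}",                          # final board check
    ]


for step, command in enumerate(replacement_plan(0, "01-0"), start=1):
    print(f"{step:2d}. XSCF> {command}")
```

Printing it as a numbered checklist makes it easy to tick steps off while someone else is elbow-deep in the chassis.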

For further perusal, enjoyment and possible confusion, check out The Official DR User's Guide For The Mx000 Series and The Mx000 Server Glossary. They're both fascinating reads that double as powerful sleep-aids ;)

Cheers,

Mike



