Using the Simplified Remote Restart capability on Power8 Scale Out Servers

A few weeks ago I had to work on simplified remote restart. Because of some political decisions in my company, I’m not lucky enough yet to have access to any E880 or E870; we just have a few scale-out machines to play with (S814). For some critical applications we will need to be able to restart a virtual machine elsewhere if the system hosting it has failed (hardware problem). A couple of months ago we decided not to use remote restart because it required a reserved storage pool device, and this mandatory storage made it too hard to manage. We now have enough P8 boxes to try and understand the new version of remote restart, called simplified remote restart, which does not need any reserved storage pool device. For those who want to understand what remote restart is, I strongly recommend checking my previous blog post about remote restart on two P7 boxes: Configuration of a remote restart partition. For the others, here is what I learned about the simplified version of this awesome feature.

Please keep in mind that the FSP of the machine must be up to perform a simplified remote restart operation. This means that if, for instance, you lose one of your datacenters or the link between your two datacenters, you cannot use simplified remote restart to restart your partitions on the main/backup site. Simplified Remote Restart only protects you against a hardware failure of your machine. Maybe this will change in the near future, but for the moment it is the most important thing to understand about simplified remote restart.

Updating to the latest version of firmware

I was very surprised when I got my Power8 machines. After deploying these boxes I decided to give simplified remote restart a try, but it was just not possible. When the Power8 Scale Out servers were released they were NOT simplified remote restart capable. The release of the SV830 firmware now enables Simplified Remote Restart on Power8 Scale Out machines. Please note that there is nothing about it in the patch notes, so chmod666.org is the only place where you can get this information :-). Here are the patch notes: here. One last word: you will find on the internet that you need Power8 to use simplified remote restart. That is only partially true. YOU NEED A P8 MACHINE WITH AT LEAST 820 FIRMWARE.

The first thing to do is to update your firmware to the SV830 version (on both systems participating in the simplified remote restart operation):

# updlic -o u -t sys -l latest -m p814-1 -r mountpoint -d /home/hscroot/SV830_048 -v
[..]
# lslic -m p814-1 -F activated_spname,installed_level,ecnumber
FW830.00,48,01SV830
# lslic -m p814-2 -F activated_spname,installed_level,ecnumber
FW830.00,48,01SV830

You can check the firmware version directly from the Hardware Management Console or in the ASMI:

[Screenshots: firmware level shown in the Hardware Management Console and in the ASMI]

After the firmware upgrade, verify that the Simplified Remote Restart capability is now set to true.

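[Screenshot: Simplified Remote Restart capability of the managed system in the Hardware Management Console]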

# lssyscfg -r sys -F name,powervm_lpar_simplified_remote_restart_capable
p720-1,0
p814-1,1
p720-2,0
p814-2,1
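
If you manage many systems, the check above can be filtered so that only the capable systems are listed. Here is a minimal sketch, assuming the output format shown above ("name,flag"). The function name srr_capable_systems is mine, and the sample data is simply the output above. I assume you run it from a workstation and pipe the real lssyscfg output into it, for instance through ssh to your HMC.

```shell
#!/bin/sh
# Print the names of the managed systems that report the capability.
# Input on stdin: "name,flag" lines as produced by
#   lssyscfg -r sys -F name,powervm_lpar_simplified_remote_restart_capable
# (srr_capable_systems is an illustrative name, not an HMC command)
srr_capable_systems() {
    awk -F, '$2 == 1 { print $1 }'
}

# Sample data copied from the output above.
printf '%s\n' 'p720-1,0' 'p814-1,1' 'p720-2,0' 'p814-2,1' | srr_capable_systems
```

With the sample data this prints p814-1 and p814-2, one per line.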

Prerequisites

These prerequisites apply ONLY to Scale Out systems:

  • To update to firmware SV830_048 you need the latest Hardware Management Console release, which is v8r8.3.0 plus the MH01514 PTF.
  • On Scale Out systems, SV830_048 is the minimum firmware requirement.
  • The minimum Virtual I/O Server level is 2.2.3.4 (for both source and destination systems).
  • PowerVM Enterprise (to be confirmed).

Enabling simplified remote restart of an existing partition

You will probably want to enable simplified remote restart after an LPM migration/evacuation. After migrating your virtual machine(s) to a Power8 system with the Simplified Remote Restart capability, you have to enable this capability on each virtual machine. This can only be done when the partition is shut down, so after the live partition mobility move you first have to stop the virtual machines if you want to enable SRR. In other words, it can’t be done without an outage of the virtual machine:

  • List the partitions currently running on the system and check which ones are “simplified remote restart capable” (here only one is):
    # lssyscfg -r lpar -m p814-1 -F name,simplified_remote_restart_capable
    vios1,0
    vios2,0
    lpar1,1
    lpar2,0
    lpar3,0
    lpar4,0
    lpar5,0
    lpar6,0
    lpar7,0
    
  • For each lpar that is not simplified remote restart capable, change the simplified_remote_restart_capable attribute using the chsyscfg command. Please note that you can’t do this using the Hardware Management Console GUI: in the latest 8r8.3.0, when you try to enable it from the GUI, it tells you that you need a reserved storage device, which is needed by the Remote Restart capability and not by the simplified version of remote restart. You have to use the command line! (check the screenshots below)
  • You can’t change this attribute while the machine is running:
    [Screenshot: the GUI refusing the change on a running partition]

  • You can’t do it with the GUI after the machine is shut down:
    [Screenshots: the GUI asking for a reserved storage device on a shut down partition]

  • The only way to enable this attribute is to use the Hardware Management Console command line (please note in the output below that a running lpar cannot be changed):
    # for i in lpar2 lpar3 lpar4 lpar5 lpar6 lpar7 ; do chsyscfg -r lpar -m p824-2 -i "name=$i,simplified_remote_restart_capable=1" ; done
    An error occurred while changing the partition named lpar6.
    HSCLA9F8 The remote restart capability of the partition can only be changed when the partition is shutdown.
    An error occurred while changing the partition named lpar7.
    HSCLA9F8 The remote restart capability of the partition can only be changed when the partition is shutdown.
    # lssyscfg -r lpar -m p824-1 -F name,simplified_remote_restart_capable,lpar_env | grep -v vioserver
    lpar1,1,aixlinux
    lpar2,1,aixlinux
    lpar3,1,aixlinux
    lpar4,1,aixlinux
    lpar5,1,aixlinux
    lpar6,0,aixlinux
    lpar7,0,aixlinux
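
To avoid the HSCLA9F8 errors shown above, you can build the list of chsyscfg commands only for the partitions that are shut down. This is a dry-run sketch: it prints the commands and does not run them. The function name srr_enable_commands is mine. I assume that you add the state attribute to the lssyscfg -F list and that a shut down partition is reported as "Not Activated". The state column in the sample data is invented for illustration.

```shell
#!/bin/sh
# Dry run: print (do not execute) one chsyscfg command per partition that
# still needs the attribute and is currently shut down.
# $1: managed system name
# stdin: "name,simplified_remote_restart_capable,lpar_env,state" lines
# ASSUMPTION: a shut down partition has the state "Not Activated".
srr_enable_commands() {
    awk -F, -v sys="$1" '
        $3 == "vioserver"     { next }   # never touch the Virtual I/O Servers
        $2 == 1               { next }   # already capable
        $4 != "Not Activated" { printf "# skipped %s (state: %s)\n", $1, $4; next }
        { printf "chsyscfg -r lpar -m %s -i \"name=%s,simplified_remote_restart_capable=1\"\n", sys, $1 }
    '
}

# Hypothetical sample data (the state column is invented for this example).
printf '%s\n' \
    'vios1,0,vioserver,Running' \
    'lpar1,1,aixlinux,Running' \
    'lpar2,0,aixlinux,Not Activated' \
    'lpar6,0,aixlinux,Running' | srr_enable_commands p814-1
```

Review the printed commands before running them on the HMC. The skipped lines tell you which partitions still have to be stopped first.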
    

Remote restarting

If you try a live partition mobility operation back to a P7 or P8 box without the simplified remote restart capability, it will not be possible. Enabling simplified remote restart forces the virtual machine to stay on P8 boxes with the simplified remote restart capability. This is one of the reasons why most customers are not doing it:

# migrlpar -o v -m p814-1 -t p720-1 -p lpar2
Errors:
HSCLB909 This operation is not allowed because managed system p720-1 does not support PowerVM Simplified Partition Remote Restart.

[Screenshot: LPM validation refused in the GUI because the target system is not simplified remote restart capable]

On the Hardware Management Console you can see that the virtual machine is simplified remote restart capable by checking its properties:

[Screenshot: partition properties showing the simplified remote restart capability]

You can now try to remote restart your virtual machines to another server. As always, the state of the source server has to be different from Operating (Power Off, Error, Error – Dump in progress, Initializing). My advice is to validate before restarting:

# rrstartlpar -o validate -m p824-1 -t p824-2 -p lpar1
# echo $?
0
# rrstartlpar -o restart -m p824-1 -t p824-2 -p lpar1
HSCLA9CE The managed system is not in a valid state to support partition remote restart operations.
# lssyscfg -r sys -F name,state
p824-2,Operating
p824-1,Power Off
# rrstartlpar -o restart -m p824-1 -t p824-2 -p lpar1
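
The validate-then-restart habit can be wrapped in a small function. This is only a sketch under assumptions: safe_remote_restart and the RR_CMD variable are names I made up, and the script must run somewhere rrstartlpar is reachable (or with RR_CMD pointing to your own wrapper that goes through ssh). As the output above shows, a successful validation does not guarantee that the restart is accepted, so the function returns the exit status of the restart itself.

```shell
#!/bin/sh
# RR_CMD lets you swap the real HMC command for a stub or an ssh wrapper.
# ASSUMPTION: rrstartlpar is reachable from where this runs.
RR_CMD=${RR_CMD:-rrstartlpar}

# Restart partition $3 from source system $1 on target system $2,
# but only if the validation step succeeds.
safe_remote_restart() {
    src=$1 dst=$2 lpar=$3
    if ! $RR_CMD -o validate -m "$src" -t "$dst" -p "$lpar"; then
        echo "validation failed for $lpar, not restarting" >&2
        return 1
    fi
    # The restart can still be refused (for example when the source system
    # is still Operating), so its exit status is what the caller gets back.
    $RR_CMD -o restart -m "$src" -t "$dst" -p "$lpar"
}
```

Usage would look like: safe_remote_restart p824-1 p824-2 lpar1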

After a remote restart operation the partition boots automatically. You can check in the errpt that, in most cases, the partition ID has changed (proving that you are on another machine):

# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A6DF45AA   0618170615 I O RMCdaemon      The daemon is started.
1BA7DF4E   0618170615 P S SRC            SOFTWARE PROGRAM ERROR
CB4A951F   0618170615 I S SRC            SOFTWARE PROGRAM ERROR
CB4A951F   0618170615 I S SRC            SOFTWARE PROGRAM ERROR
D872C399   0618170615 I O sys0           Partition ID changed and devices recreat

Be very careful with the ghostdev attribute of sys0. Every remote restarted VM needs to have ghostdev set to 0 to avoid an ODM wipe (if you remote restart an lpar with ghostdev set to 1 you will lose all ODM customization).

# lsattr -El sys0 -a ghostdev
ghostdev 0 Recreate ODM devices on system change / modify PVID True
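
If you want to check this on many partitions before enabling remote restart, the lsattr output can be parsed. Here is a minimal sketch, assuming the output format shown above. The function names ghostdev_value and ghostdev_is_safe are mine, and the sample line is copied from the output above.

```shell
#!/bin/sh
# Extract the ghostdev value from "lsattr -El sys0 -a ghostdev" output
# (second column of the line starting with "ghostdev").
ghostdev_value() {
    awk '$1 == "ghostdev" { print $2 }'
}

# Return 0 when the value read on stdin is 0, which is the safe setting
# for a remote restarted partition.
ghostdev_is_safe() {
    [ "$(ghostdev_value)" = "0" ]
}

# Sample line copied from the output above; inside the partition you would run
#   lsattr -El sys0 -a ghostdev | ghostdev_is_safe
if echo 'ghostdev 0 Recreate ODM devices on system change / modify PVID True' | ghostdev_is_safe; then
    echo "ghostdev is 0: ODM customization is kept after a remote restart"
else
    echo "ghostdev is not 0: consider chdev -l sys0 -a ghostdev=0"
fi
```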

Once the source machine is back up and running, you have to clean up the old definition of the remote restarted lpar by launching a cleanup operation. This will wipe the old lpar definition:

# rrstartlpar -o cleanup -m p814-1 -p lpar1

The RRmonitor (modified version)

IBM delivers a script called rrMonitor. It watches the state of a Power system and, if the system is in a particular state, restarts a specific virtual machine. This script is not really usable by a normal user because it has to be executed directly on the HMC (you need a pesh password to put the script on the HMC), and it only checks one particular virtual machine. I had to modify this script to ssh to the HMC and check every lpar on the machine, not just one in particular. You can download my modified version here: rrMonitor. Here is what the script does:

  • It checks the state of the source machine.
  • If the state is not “Operating”, the script searches for every remote restartable lpar on the machine.
  • The script launches remote restart operations to restart all these partitions.
  • The script tells the user which command to run to clean up the old lpar once the source machine is running again.
# ./rrMonitor p814-1 p814-2 all 60 myhmc
Getting remote restartable lpars
lpar1 is rr simplified capable
lpar1 rr status is Remote Restartable
lpar2 is rr simplified capable
lpar2 rr status is Remote Restartable
lpar3 is rr simplified capable
lpar3 rr status is Remote Restartable
lpar4 is rr simplified capable
lpar4 rr status is Remote Restartable
Checking for source server state....
Source server state is Operating
Checking for source server state....
Source server state is Operating
Checking for source server state....
Source server state is Power Off In Progress
Checking for source server state....
Source server state is Power Off
It's time to remote restart
Remote restarting lpar1
Remote restarting lpar2
Remote restarting lpar3
Remote restarting lpar4
Thu Jun 18 20:20:40 CEST 2015
Source server p814-1 state is Power Off
Source server has crashed and hence attempting a remote restart of the partition lpar1 in the destination server p814-2
Thu Jun 18 20:23:12 CEST 2015
The remote restart operation was successful
The cleanup operation has to be executed on the source server once the server is back to operating state
The following command can be used to execute the cleanup operation,
rrstartlpar -m p814-1 -p lpar1 -o cleanup
Thu Jun 18 20:23:12 CEST 2015
Source server p814-1 state is Power Off
Source server has crashed and hence attempting a remote restart of the partition lpar2 in the destination server p814-2
Thu Jun 18 20:25:42 CEST 2015
The remote restart operation was successful
The cleanup operation has to be executed on the source server once the server is back to operating state
The following command can be used to execute the cleanup operation,
rrstartlpar -m p814-1 -p lpar2 -o cleanup
Thu Jun 18 20:25:42 CEST 2015
[..]
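
The decision logic of such a monitor can be sketched in a few lines. This is not the rrMonitor code, just an illustration under assumptions: the function names are mine, the states are fed on stdin (one line per poll) instead of being read from the HMC, and the state strings are spelled as in this article with a plain hyphen in "Error - Dump in progress". The loop stops at the first state that allows a restart.

```shell
#!/bin/sh
# Return 0 when the source system state allows a remote restart.
# ASSUMPTION: state strings are spelled as in this article.
state_allows_restart() {
    case "$1" in
        "Power Off"|"Error"|"Error - Dump in progress"|"Initializing") return 0 ;;
        *) return 1 ;;
    esac
}

# Read one state per line (one line per poll) and stop at the first state
# that allows a restart. Returns 1 if the input ends before that happens.
wait_for_failure() {
    while IFS= read -r state; do
        echo "Source server state is $state"
        if state_allows_restart "$state"; then
            echo "It's time to remote restart"
            return 0
        fi
    done
    return 1
}

# Simulated polling sequence, similar to the run shown above.
printf '%s\n' 'Operating' 'Power Off In Progress' 'Power Off' 'Power Off' | wait_for_failure
```

Note that "Power Off In Progress" does not trigger anything: the restart is only attempted once the system has reached one of the listed states.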

Conclusion

As you can see, the simplified version of the remote restart feature is simpler than the normal one. My advice is to create all your lpars with the simplified remote restart attribute. It’s that easy :). If you plan to LPM back to a P6 or P7 box, don’t use simplified remote restart. I think this functionality will become more popular when all the old P7 and P6 boxes have been replaced by P8. As always, I hope it helps.

Here are a couple of links to good documentation about Simplified Remote Restart:

  • Simplified Remote Restart Whitepaper: here
  • Original rrMonitor: here
  • Materials about the latest HMC release and a couple of videos related to Simplified Remote Restart: here

13 thoughts on “Using the Simplified Remote Restart capability on Power8 Scale Out Servers”

  1. Is simplified remote restart supported on E880 class machines? We are at SC830_048, HMC 8830 with MH01514 and VIOS 2.2.3.52. Do you have any other presentation materials?

    Alan Wilcox awilcox@us.ibm.com

    • Hi Alan,

      YES! Simplified Remote Restart is supported on all Power8 Enterprise class machines (E870|E880). You need at least 820 firmware. You meet the requirements to use SRR on your P8 boxes.
      As this feature is pretty new it’s not so easy to find information on the internet. You can find good things on this page (deep dive SRR for PowerVC https://www.ibm.com/developerworks/community/wikis/home?lang=en-us#!/wiki/Wc1c29d23e0fd_4346_b509_f1c00a2099f0/page/PowerVM%20Remote%20Restart%20Functional%20Deep%20Dive).

      You can have a look at my previous post about Remote Restart (not the simplified one) … most of that post still applies to SRR: here

      Tell me if you need anything more.

      Regards,

      B.

      • Thank you B: It looks like SRR is enabled BOTH on the managed system AND on the LPAR levels ? is this correct or not ? So to enable SRR on a remote target machine ( and avoid LPM compatibility problems you refer to above) do you simply set SRR ON at the managed system property level ? Do you have to restart the (target) managed system frame to do this ? For LPM the vm guest profile can only be in one place at a time … so how to avoid a catch 22 ?

        • 1/ For the managed system, you don’t have to enable anything; this is a PowerVM Enterprise feature. All Power8 HE systems (870|880) are delivered with PowerVM Enterprise, so no worries about that. (Just check in the capabilities tab that “Simplified Remote Restart capable” is True.)
          2/ You have to enable SRR for each lpar you want to remote restart. This can only be done when the lpar is shut down (and by using the HMC command line). With the HMC GUI you can set SRR to true only at lpar creation. You can’t LPM back to a Power7 machine with SRR enabled on an lpar. If you want to go back to a P7 system you have to shut down the lpar, disable SRR and then move back to P7.
          3/ SRR is a P8-only feature with firmware >= 820: for HE systems (870|880) check that you have a firmware >= SC820, for SCO systems (812|822|814|824) check that you have a firmware >= SV830. Nothing to “enable” in the properties.
          4/ You obviously have to restart the managed system if a firmware upgrade is needed. If you don’t have to upgrade your firmware, you don’t need any reboot.
          5/ To perform an SRR operation:
          – The source machine must be in Error, Error – Dump in progress, or Power Off state.
          – You run an SRR operation on every VM to move the lpars to another box.
          – When the machine is back online you’ll see all the VMs in shutdown state.
          Be careful not to reboot these lpars, and run a cleanup operation to remove the old profiles and avoid a “catch 22”
          # rrstartlpar -o cleanup -m source_system
          So my advice is to note every SRR operation you are doing and run a cleanup operation when the machine comes back. For the moment this is not done automatically by PowerVM. Maybe something like that will be done in the future (maybe in PowerVC … I actually don’t know).
          As far as I know, the STG LPM automation tool can do this for you; I have to ask the team if you are interested in such a thing.

          Hope it’s understandable and clear enough.

          Regards,

          B.

          • Thank you Benoit…. do you know when / if SRR will support multiple HMCs? (origin and target managed systems are on different HMCs) I did not see this capability in the rrstartlpar man page. Thanks!

          • I think it will probably be supported in the future, but do not know when exactly, this is not a problem if you have cross-site HMCs :-)

  2. Pingback: Tips and tricks for PowerVC 1.2.3 (PVID, ghostdev, clouddev, rest API, growing volumes, deleting boot volume) | PowerVC 1.2.3 Redbook | chmod666

  3. Hello,
    I already tested the modified rrMonitor script, it works very well, but I have a couple of comments to it, please correct me if I am wrong in any comment:

    a. Script will remote restart all LPAR’s even the “Not Activated” LPAR’s.
    b. after completing all rrestart tasks, the script goes into loop with below output which after sometime might fill the log filesystem if not stopped or cleaned up:
    Checking for source server state….
    Source server state is Power Off
    Checking for source server state….
    Source server state is Power Off
    Checking for source server state….
    Source server state is Power Off

    c. the script is scanning LPAR’s only one time at the beginning, so if you perform any LPM for some LPARs after starting this monitoring, it will not be reflected into the script.

    aside from that, the script works fine, so it needs some fine tuning :)

    Thanks
    Tamer

  4. Imagine I have 2 sites with one P8 on each site P8A and P8B
    – site A has active production lpars
    – site B has active dev lpars
    in case of P8A failure I can SRR all P8A lpars on P8B, since P8A is in error state or power off
    this will not affect lpars already active on P8B

    when P8A is repaired and back again if I want my migrated lpars back from P8B to P8A
    does it mean that :
    1. I have to put P8B in poweroff state too (impacting dev lpars on p8B) ?
    2. I have to cleanup old lpars on P8A
    3. i can now do a SRR of my lpars from P8B back to P8A

    i would like other lpars active on site B not to be impacted by SRR operations ie that SRR has only an lpar level granularity.

    note that the latest SRR accepts the HMC to see the P8 in DISCONNECTED state for SRR operations.

    • In this case.
      1. Do the cleanup on P8A.
      2. Do a LPM operation from P8B to P8A for remote restarted machine (the one you had on P8A before failure).
      3. P8B partition will never be impacted by this.

      • Hello thanks for your answer.
        OK so since LPM is not an option for me, due to the san structure exposed below (I have to manually failback the san storage to site A before migrating partitions), I think I can use IPM (inactive partition mobility) to get back to site A.

        I wanted to be sure mixing SRR and IPM operation has no particular impact on the configurations.

  5. suppose the following setup :
    – site A with P8A and san storageA
    – site B with P8B and san storageB
    – luns from storageA are replicated at san level to storageB – at any time one of the 2 replicated lun is RW while the other is RO
    – lunA is npiv attached to lparA on lparA vfc primary wwns

    Now I do the following :
    – put P8A off
    – manually reverse luns roles on san storage to make luns RO on site A and their replicates RW on site B
    – do a SRR of the lpar to P8B

    I wonder how AIX reconfigures its vfc wwns on P8B ?
    1- still have its vfc primary wwns used on P8A and access luns with them ?
    2- like 1 but use the vfc secondary wwns defined on P8A (like with lpm) to access luns ?
    3- reconfigure an entirely new set of vfc wwns (and I have to zone and map them on storage) ?

    what will happen to these vfc wwns if I SRR back and forth several times on the same 2 P8 A and B ? will they change each time of AIX will maintain just 2 sets of vfc wwns ?

    • Hi I’m almost sure this setup will not work.
      You have to wait for another IBM product capable of doing that. Can’t talk about it right now ;-).
