Live Partition Mobility : Diagnostics and debugging

I use Live Partition Mobility intensively and move lpars every week, and a lot of problems can occur while moving partitions. The source and destination frames, the source and destination Virtual I/O Servers, the moving lpar and the Hardware Management Console all take part in a mobility operation, so finding the source of a problem can really be a brainteaser. This post is a practical how-to based on my own experience with Live Partition Mobility. I'm sure it does not cover all the problems, but it can be a starting point before raising a PMR. Some of these tips come directly from IBM support, many thanks to them for their efficiency.

Run the mobility operation from command line

Running a mobility operation from the Hardware Management Console GUI doesn't give you the maximum verbosity level. My advice is to run the operation from the command line, specifying debug and verbose output. Gathering all these errors and warnings can really help you spot the problem:

# migrlpar --ip 10.10.20.20 -u hscroot -o m -m SOURCE-FRAME -t DESTINATION-FRAME -p lpar1 -i 'source_msp_id=5,source_msp_ipaddr=10.10.20.21,dest_msp_id=14, dest_msp_ipaddr=10.10.20.22,virtual_fc_mappings="10/vios1/14//fcs0,11/vios2/15//fcs0"' -d 5 -v
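
If you want to catch most problems before anything actually moves, the same command can be run as a validation only. This is a minimal sketch assuming the same frames and credentials as above; the -o v operation asks the HMC to validate the mobility operation without performing it, and the -d and -v flags should be accepted here as well:

# migrlpar --ip 10.10.20.20 -u hscroot -o v -m SOURCE-FRAME -t DESTINATION-FRAME -p lpar1 -d 5 -v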

Clear and change cfglog verbosity on all servers

On all servers taking part in the mobility operation, clear the cfglog and increase its verbosity (on the Virtual I/O Servers do this as root). Dynamic reconfiguration output is logged in /var/adm/ras/cfglog. A mobility operation is a dynamic reconfiguration operation, so all its output will be logged into the cfglog file. cfglog is an alog file and has to be read and modified with the alog command:

# rm /var/adm/ras/cfglog 
# echo "Log Cleared  - $(date)" | alog -t cfg
Log Cleared  - Mon Jul 15 15:21:46 CEST 2013

Set the highest verbosity level for Dynamic reconfiguration by editing /etc/drlog.cmd :

# cat /etc/drlog.cmd
CFGLOG=timestamp,detail,verbosity:9
# alog -t cfg -o
Log Cleared  - Mon Jul 15 15:21:46 CEST 2013
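
While the mobility operation is running I like to keep an eye on the cfglog. alog has no follow mode, so a simple polling loop does the job (a minimal sketch; the interval and number of lines are arbitrary):

# while true; do clear; alog -t cfg -o | tail -n 30; sleep 5; done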

After editing /etc/drlog.cmd the cfglog output will be timestamped (it can really help you a lot). One useful thing you can check is the list of migmgr commands run on the server itself. These commands can be re-run by hand, and you can look for problems by checking their output. Here is an example on a source Virtual I/O Server. Do not forget to check these logs on the partition itself, on the mover service partition and on the destination Virtual I/O Servers too. My advice is to call the support before re-running these commands by hand, they will help you understand the output:

# alog -t cfg -o
Log Cleared  - Mon Jul 15 15:21:46 CEST 2013
CS 10748024 7536712 15:30:22 drlog.c 104 /usr/sbin/migmgr -f xchg_capabilities -c 1C -d 1
[..]
CS 10748026 7536712 15:30:24 drlog.c 104 /usr/sbin/migmgr -f xchg_capabilities -c 1C -d 0
[..]
CS 17170526 7536712 15:30:24 drlog.c 104 /usr/sbin/migmgr -f get_adapter -t vscsi -s U9119.FHB.84F55B6-V15-C21 -w 13857705817061523674 -W 13857705817061523675 -d 0
C4 17170526 15:30:24 migmgr.c 357 Running method '/usr/lib/methods/mig_vscsi'
CS 17170526 7536712 15:30:24 mig_vscsi.c 620 /usr/sbin/migmgr -f get_adapter -t vscsi -s U9119.FHB.84F55B6-V15-C21 -w 13857705817061523674 -W 13857705817061523675 -d 0
[..]
# /usr/sbin/migmgr -f xchg_capabilities -c 1C -d 1
req_cap=0x1c, my_cap=0x1c, rtn_cap=0x1c
VIOS_CAPABILITIES=0x1c
VIOS_MIN_LEVEL=1
VIOS_MAX_LEVEL=1
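
A quick way to find the migmgr commands that were run during a failed operation (the ones worth re-running by hand with the support), together with any logged errors, is simply to grep the cfglog (a minimal sketch):

# alog -t cfg -o | grep migmgr
# alog -t cfg -o | grep -i -E "error|fail"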

If using NPIV, use devscan to check LUN masking and zoning

Most of my lpars use NPIV and virtual fibre channel adapters, and the most recurring problems are LUN masking and zoning problems. For example, when adding a new LUN to an lpar, my SAN team sometimes forgets to add the LUN for the special Live Partition Mobility WWPNs. A couple of months ago IBM released a tool called devscan, you can find it at this address: http://www-01.ibm.com/support/docview.wss?uid=aixtools74886e0c. My advice is to deploy this tool on all your lpars; it can really help you a lot, and not just for Live Partition Mobility. One very cool feature was just added to the latest version of devscan: it allows you to check that all LUNs are correctly masked behind the Live Partition Mobility WWPNs. I personally run it on all the destination Virtual I/O Servers, here is an example:

  • From the HMC, get the Live Partition Mobility WWPNs (these are always the odd ones; in our case: c0507604f208016b and c0507604f208016d)
  • # lssyscfg -r prof -m SOURCE-FRAME -F name virtual_fc_adapters | grep ^lpar1
    lpar1 """10/client/9/vios1/5/c0507604f208016a,c0507604f208016b/0"",""11/client/10/vios2/5/c0507604f208016c,c0507604f208016d/0"""
    
  • On the destination Virtual I/O Servers, run devscan with the NPIV option to check how many LUNs are masked and to get their IDs. Check this for all the fibre channel adapters and all the Live Partition Mobility WWPNs. In the output below we can see that devscan has found 8 LUNs, with SCSI IDs 12091c, 12091d and so on:
  • # devscan -t f -n c0507604f208016b --dev=fscsi2
    
    devscan v1.0.3
    Copyright (C) 2010-2012 IBM Corp., All Rights Reserved
    
    cmd: devscan -t f -n c0507604f208016b --dev=fscsi2
    Current time: 2013-07-12 12:49:24.323496 GMT
    Running on host: vios3
    
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Processing FC device:
        Adapter driver: fcs2
        Protocol driver: fscsi2
        Connection type: fabric
        Link State: up
        Current link speed: 8 Gbps
        Local SCSI ID: 0x120005
        Local WWPN: 0x10000000c9cc1b18
        Local WWNN: 0x20000000c9cc1b18
        NPIV SCSI ID: 0x1209c0
        NPIV WWPN: 0xc0507604f208016b
        Device ID: 0xdf1000f114108a03
        Microcode level: 202307
    
    SCSI ID LUN ID           WWPN             WWNN
    -----------------------------------------------------------
    12091c  0000000000000000 5000097500025d21 5000097500025c00
    12091d  0000000000000000 5000097500025d29 5000097500025c00
    12091e  0000000000000000 5000097500025d51 5000097500025c00
    12091f  0000000000000000 5000097500025d59 5000097500025c00
    130c1c  0000000000000000 500009750001f521 500009750001f400
    130c21  0000000000000000 500009750001f551 500009750001f400
    130c22  0000000000000000 500009750001f529 500009750001f400
    130c27  0000000000000000 500009750001f559 500009750001f400
    
    8 targets found, reporting 0 LUNs,
    0 of which responded to SCIOLSTART.
    Elapsed time this adapter: 00.638549 seconds
    
    Cleaning up...
    Total elapsed time: 00.642269 seconds
    Completed successfully
    
  • Run devscan on the moving lpar, and compare the output with the one you get on the Virtual I/O Server :
  • # devscan --dev=fscsi0 --concise | awk -F '|' '{print $2}' | sort -n | uniq
         SCSI/SAS ID
    0000000000130c1c
    0000000000130c21
    0000000000130c22
    0000000000130c27
    000000000012091c
    000000000012091d
    000000000012091e
    000000000012091f
    

If you get any difference, for example a different number of LUNs or different IDs, you can be sure that you have a zoning or a LUN masking problem, and you have to raise an emergency call to your SAN team. (In my opinion this is another good reason for using vscsi rather than NPIV …. :-)). devscan can also be used to check reservation status (both SCSI-2 and SCSI-3); I'm sure you are aware that all your LUNs must have reserve_policy set to no_reserve before trying to perform any mobility operation. One last thing about devscan: it is a non-disruptive, harmless tool and can be run at any time (it can be compared to a config manager (cfgmgr) call). Use it!
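
To avoid comparing the two devscan outputs by eye, the LUN lists can be normalized and diffed. This is only a sketch based on the output layouts shown above (the SCSI ID is the first column of the Virtual I/O Server table, and the client side reports zero-padded IDs), and /tmp/luns.vios and /tmp/luns.lpar are just scratch files; adjust the awk and sed filters to your devscan version. On the destination Virtual I/O Server, keep only the SCSI ID column of the table:

# devscan -t f -n c0507604f208016b --dev=fscsi2 | awk 'NF==4 && $1 ~ /^[0-9a-fA-F]+$/ {print $1}' | sort > /tmp/luns.vios

On the moving lpar, strip the padding zeros and keep only the hexadecimal IDs:

# devscan --dev=fscsi0 --concise | awk -F '|' '{print $2}' | sed 's/^ *0*//' | grep -iE '^[0-9a-f]+$' | sort > /tmp/luns.lpar

Any line reported by diff is a LUN seen on one side only, which points to a masking or zoning problem:

# diff /tmp/luns.vios /tmp/luns.lpar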

Be prepared : pedbg on HMC, snap on lpar, ctsnap on Virtual I/O Server

Most of the time you will not find the real problem by yourself, so my advice is to be prepared for a support call. Everybody knows that, as an end user of the Hardware Management Console, you can't investigate yourself without a pesh password, so run a pedbg and send it to the support. At the same time, run snap on the lpar and ctsnap on the Virtual I/O Servers:

  • On the HMC as hscpe user run a pedbg :
  • # pedbg -c -q 4
    
  • On the moving lpar and on the Virtual I/O Servers run a snap :
  • # snap -r
    # snap -ac
    
  • On both the source and destination Virtual I/O Servers and on the Mover Service Partitions, run a ctsnap; do the same on the client lpar:
  • # ctsnap -x runrpttr
    

Checklist

Here are all the points I check when I have a mobility error; be sure you can answer all of them before opening a call to IBM support (a short sequence of commands covering the AIX-side checks follows the list):

  • Are all my reserve_policy attributes set to no_reserve?
  • Are all my VLANs propagated and configured on all my Virtual I/O Servers?
  • Do I have failed or missing paths on my moving partition? (If so, remove the failed and missing paths, run a config manager, and correct any remaining errors.)
  • Do I have defined adapters or defined devices on my moving partition? (If so, remove the defined devices or adapters and run a config manager.)
  • Using EMC PowerPath: run a powermt check before running the mobility operation!
  • If I’m using a remote HMC, are my ssh keys correctly shared? (Use the mkauthkeys command on the HMC to check key sharing.)
  • Are there any RMC connection problems between the HMC and the Virtual I/O Servers, the Mover Service Partition or the moving lpar? (Check with lspartition -dlpar on the HMC.)
  • If I’m using NPIV, are all my LUNs masked to the NPIV WWPNs (devscan check)?
  • If I’m using vscsi, are all my LUNs masked to all my Virtual I/O Servers?
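
Most of the AIX-side points in this list can be checked with a few standard commands on the moving partition (as root). This is a minimal sketch: the first three commands report disks whose reserve_policy is not no_reserve, failed or missing paths, and Defined devices; the last three show how to correct them, and the device names used there (hdisk12, vscsi3) are just examples:

# lsdev -Cc disk -F name | while read d; do lsattr -El $d -a reserve_policy -F value | grep -qv no_reserve && echo "$d: reserve_policy is not no_reserve"; done
# lspath | grep -iE "failed|missing"
# lsdev -C | grep -w Defined
# chdev -l hdisk12 -a reserve_policy=no_reserve
# rmdev -dl vscsi3
# cfgmgr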

Hope this helps.

4 thoughts on “Live Partition Mobility : Diagnostics and debugging”

  1. Great article as usual :)
    I just disagree on the vscsi versus NPIV argument.
    Having all LUNs on a VIOS without reserve configured by 5 teams of 5 to 10 people, working at the same time, you’ll dislike vscsi ;)
    And trust me, having vscsi everywhere doesn’t fix the masking/zoning issues, they are just different :)

  2. Good work again ! Thanks a lot.
    When you write (at the end) “Are all my reserve_policy attributes set to no_reserve?”, is that only for the case where you use vSCSI? There is no need to set reserve_policy to no_reserve when running full NPIV, is there?

    • Hi Erw,

      You are right; with full NPIV the reserve_policy does not matter. Anyway, I’m using different multipath software: HDLM, PowerPath, SDDPCM, and so on, so it’s an old habit to set my reserves to no_reserve in all cases ;). Be very careful when using HDLM and LPM with reserve policies …

      Once again thanks for your support. I appreciate.

      Benoit.

  3. Hi there

    I am new to AIX and still in the learning stage. I have two p6 520 frames with PowerVM Enterprise enabled, one Brocade fibre switch with NPIV enabled, and a DS4700 SAN. I read a lot about the LPM feature but could not find any particular document with steps on how to zone the partition (to move) on the switch and the SAN. I have two VIOS installed on each server and one LPAR with NIM on it on one server. If any particular document is available and you can share it with me, I will be highly obliged to complete this LPM. Regards, ALEE
