Adventures in IBM Systems Director in System P environment. Part 4: Playing with errnotify and genevent.sh

For reasons I don’t understand, my client does not want to install monitoring agents (Tivoli) on the VIO Servers. I can’t argue with that decision and have to work with it. Still, in my opinion VIO Servers are among the most important parts of a virtualized pSeries environment, and they have to be monitored. One of our recurring problems comes from Ethernet ports that keep flapping, which results in Shared Ethernet Adapter failovers (become primary, then become backup, and so on):

  • ent1 is flapping on this VIO Server :
  • # errlog | more 
    E136EAFA   1009203212 I H ent7           BECOME PRIMARY
    F3931284   1009203212 I H ent1           ETHERNET NETWORK RECOVERY MODE
    0B41DD00   1009103812 I H ent7           ADAPTER FAILURE
    EC0BCCD4   1009103812 T H ent1           ETHERNET DOWN
    E136EAFA   1009103812 I H ent7           BECOME PRIMARY
    F3931284   1009103812 I H ent1           ETHERNET NETWORK RECOVERY MODE
    0B41DD00   1009022212 I H ent7           ADAPTER FAILURE
    EC0BCCD4   1009022212 T H ent1           ETHERNET DOWN
    E136EAFA   1009022212 I H ent7           BECOME PRIMARY
    
  • The SEA on the second VIO Server becomes backup, then primary:
  • 40D97644   1009203212 I H ent7           BECOME BACKUP
    E136EAFA   1009103912 I H ent7           BECOME PRIMARY
    40D97644   1009103812 I H ent7           BECOME BACKUP
    E136EAFA   1009022212 I H ent7           BECOME PRIMARY
    40D97644   1009022212 I H ent7           BECOME BACKUP
    E136EAFA   1009022112 I H ent7           BECOME PRIMARY
    40D97644   1009022112 I H ent7           BECOME BACKUP
    E136EAFA   1009022112 I H ent7           BECOME PRIMARY
    40D97644   1009022112 I H ent7           BECOME BACKUP
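
To get a quick feel for how often each adapter flaps, the summary output above can be counted per resource. The following is only a minimal sketch: the function name count_link_down is my own, and it assumes the usual errpt/errlog summary layout (identifier, timestamp, type, class, resource name, description).

```shell
#!/bin/sh
# Count "ETHERNET DOWN" entries per resource in errpt/errlog summary output.
# Assumed columns: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION...
# count_link_down is a hypothetical helper, not an AIX or VIOS command.
count_link_down() {
    awk '
        # Field 5 is the resource name; the description starts at field 6.
        $6 == "ETHERNET" && $7 == "DOWN" { down[$5]++ }
        END { for (r in down) print r, down[r] }
    ' | sort
}
```

It reads the summary on standard input, so something like `errpt | count_link_down` (as root, after oem_setup_env) should print one line per adapter with its number of link-down entries.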
    

This problem can be serious on production hosts and has to be detected in near real time. Trapping errpt errors and running a script when an error is logged is possible with errnotify, and it is even more useful if the error is also raised as an event in IBM Systems Director. Here is the method I use to get an e-mail every time a port is flapping:

errnotify configuration

The first thing to do is to set up errnotify to run a script every time an Ethernet link goes down or comes back up on a VIO Server:

  • Identify the error codes: on my VIO Server I have to identify which error codes are logged when a link goes up or down:
  • This list may not be exhaustive, but these are the error codes I found in errpt when a link goes:
    • Down (MSNENT_LINK_DOWN and GOENT_LINK_DOWN) :
    • # oem_setup_env
      # errpt -t | grep ABB8A22B
      ABB8A22B MSNENT_LINK_DOWN    TEMP H  ETHERNET DOWN
      # errpt -t | grep EC0BCCD4
      EC0BCCD4 GOENT_LINK_DOWN     TEMP H  ETHERNET DOWN
      
    • Up (MSNENT_RCVRY_EXIT and GOENT_RCVRY_EXIT) :
    • # oem_setup_env
      # errpt -t | grep 4969AE33
      4969AE33 MSNENT_RCVRY_EXIT   INFO H  ETHERNET NETWORK RECOVERY MODE
      # errpt -t |grep F3931284
      F3931284 GOENT_RCVRY_EXIT    INFO H  ETHERNET NETWORK RECOVERY MODE
      
  • With these errpt labels, create a file of ODM stanzas and add it to the ODM:
  • # vi errnotify.odmadd
    errnotify:
        en_pid = 0
        en_name = "MSNENT_RCVRY_EXIT"
        en_persistenceflg = 1
        en_label = "MSNENT_RCVRY_EXIT"
        en_class = H
        en_crcid = 0
        en_method = "/usr/lib/ras/link_state.notify $1 $2 $3 $4 $5 $6 $7 $8 $9"
    errnotify:
        en_pid = 0
        en_name = "MSNENT_LINK_DOWN"
        en_persistenceflg = 1
        en_label = "MSNENT_LINK_DOWN"
        en_class = H
        en_crcid = 0
        en_method = "/usr/lib/ras/link_state.notify $1 $2 $3 $4 $5 $6 $7 $8 $9"
    errnotify:
        en_pid = 0
        en_name = "GOENT_LINK_DOWN"
        en_persistenceflg = 1
        en_label = "GOENT_LINK_DOWN"
        en_class = H
        en_crcid = 0
        en_method = "/usr/lib/ras/link_state.notify $1 $2 $3 $4 $5 $6 $7 $8 $9"
    errnotify:
        en_pid = 0
        en_name = "GOENT_RCVRY_EXIT"
        en_persistenceflg = 1
        en_label = "GOENT_RCVRY_EXIT"
        en_class = H
        en_crcid = 0
        en_method = "/usr/lib/ras/link_state.notify $1 $2 $3 $4 $5 $6 $7 $8 $9"
    # odmadd errnotify.odmadd
    
  • Check that the entries were correctly added to the ODM (en_name appears truncated in the output, en_label holds the full label):
  • # odmget errnotify | tail -16
    errnotify:
            en_pid = 0
            en_name = "GOENT_RCVRY_EXI"
            en_persistenceflg = 1
            en_label = "GOENT_RCVRY_EXIT"
            en_crcid = 0
            en_class = "H"
            en_type = ""
            en_alertflg = ""
            en_resource = ""
            en_rtype = ""
            en_rclass = ""
            en_symptom = ""
            en_err64 = ""
            en_dup = ""
            en_method = "/usr/lib/ras/link_state.notify $1 $2 $3 $4 $5 $6 $7 $8 $9"
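
The four stanzas above differ only by their label, so the file can also be generated instead of typed by hand. This is a sketch under that assumption; the function names errnotify_stanza and gen_link_stanzas are mine, and the field values are simply copied from the stanzas shown earlier.

```shell
#!/bin/sh
# Handler command line used in this article; the $1..$9 are kept literal
# because errnotify substitutes them when it calls the method.
METHOD='/usr/lib/ras/link_state.notify $1 $2 $3 $4 $5 $6 $7 $8 $9'

# Print one errnotify stanza for the given error label (hypothetical helper).
errnotify_stanza() {
    stanza_label=$1
    printf 'errnotify:\n'
    printf '    en_pid = 0\n'
    printf '    en_name = "%s"\n' "$stanza_label"
    printf '    en_persistenceflg = 1\n'
    printf '    en_label = "%s"\n' "$stanza_label"
    printf '    en_class = H\n'
    printf '    en_crcid = 0\n'
    printf '    en_method = "%s"\n' "$METHOD"
}

# Emit the four stanzas used in this article, in the same order.
gen_link_stanzas() {
    for l in MSNENT_RCVRY_EXIT MSNENT_LINK_DOWN GOENT_LINK_DOWN GOENT_RCVRY_EXIT; do
        errnotify_stanza "$l"
    done
}
```

Running `gen_link_stanzas > errnotify.odmadd` followed by `odmadd errnotify.odmadd` should give the same result as the hand-written file, and adding a label later only means extending the list in the loop.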
    

Writing script called by errnotify. Using genevent.sh

For my own use I want to write a line to a file every time a link goes up or down, so that is part of the script. But what I want first is to generate an event in Systems Director. This can be done with a script delivered with the Common Agent, called genevent.sh. Let’s have a look at my link_state.notify script:

# cat /usr/lib/ras/link_state.notify
#!/bin/ksh
echo "`date` | $1 $2 $3 $4 $5 $6 $7 $8 $9" >> /home/padmin/vio_mon/vio_mon.errnotify
/var/opt/tivoli/ep/runtime/agent/subagents/director/genevent.sh /type:"Managed Resource.Managed System Resource.Logical Resource.Logical Device.Logical Port.Network Port.Ethernet Port" /text:"Link $6 $9" /sev:0

The first line, as mentioned above, is for my own use. The second one calls genevent.sh with:

  • /type: the event type generated in Systems Director, in my case: Managed Resource.Managed System Resource.Logical Resource.Logical Device.Logical Port.Network Port.Ethernet Port.
  • /text: the event text. $6 is the Ethernet adapter (resource name) and $9 is the errpt error label (GOENT_RCVRY_EXIT, etc.).
  • /sev: the event severity, 0 for fatal, 1 for critical, and so on.
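
The raw error label in the event text is not very readable in a mail subject. As a possible refinement, the label can be translated into a link state before it is passed to /text. The sketch below is not part of my script: link_state and event_text are hypothetical helpers, and the mapping only knows the four labels listed earlier.

```shell
#!/bin/sh
# Map an errpt error label to a link state.
# Only the *_LINK_DOWN and *_RCVRY_EXIT labels from this article are
# recognised; anything else is reported as "unknown".
link_state() {
    case $1 in
        *_LINK_DOWN)  echo down ;;
        *_RCVRY_EXIT) echo up ;;
        *)            echo unknown ;;
    esac
}

# Build an event text from the nine arguments errnotify passes to the method:
#   $1 sequence number   $2 error ID        $3 class
#   $4 type              $5 alert flag      $6 resource name
#   $7 resource type     $8 resource class  $9 error label
event_text() {
    echo "Link $6 $(link_state "$9") ($9)"
}
```

Inside link_state.notify, the genevent.sh call could then use `/text:"$(event_text "$@")"`, which would produce texts such as "Link ent1 down (GOENT_LINK_DOWN)".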

Create Event filter, Event Action and Event Automation Plan

As shown on the image below, an error will be raised on the Systems Director when a link is flapping :

Right click on this event to create an Event Filter, mine is called “Link Down on VIO Server” :

Create a new Event Action. In my case I want to send an e-mail to my mailbox to notify me of a link down (IP and e-mail addresses are hidden in this screenshot):

With this Event Filter and Event Action, create an Event Automation Plan that sends a mail when a link is flapping. When creating the Event Automation Plan, select the newly created Event Filter and Event Action:

  • Event filter choice :
  • Event action choice :
  • Event automation plan creation summary :

Testing

Do not forget to test the newly created Event Automation Plan: ask your network team to shut/no shut a Shared Ethernet Adapter port. You should receive a new mail in your mailbox:

Link ent1 MSNENT_RCVRY_EXIT

Event Text     Link ent1 MSNENT_RCVRY_EXIT
Date           10/18/2012 6:21 PM CEST
Severity       Fatal
Event Type     Managed Resource.Managed System Resource.Logical Resource.Logical Device.Logical Port.Network Port.Ethernet Port
System Name    vio35
Sender Name    vio35
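
Before involving the network team, the script and the automation plan can probably be exercised by calling the notify script by hand with the same nine arguments errnotify would pass. This is only a sketch: simulate_link_event is a hypothetical helper, and every argument except the resource name ($6) and the error label ($9) is a placeholder value I made up.

```shell
#!/bin/sh
# Path of the handler from this article; override NOTIFY_SCRIPT to test a copy.
NOTIFY_SCRIPT=${NOTIFY_SCRIPT:-/usr/lib/ras/link_state.notify}

# Call the notify script as errnotify would, for a given resource and label.
# With DRY_RUN=1 the command line is only printed, not executed.
simulate_link_event() {
    resource=$1
    label=$2
    # Placeholders for sequence number, error ID, class, type, alert flag,
    # resource type and resource class; only $6 and $9 are used by the script.
    set -- 0 00000000 H TEMP FALSE "$resource" NONE NONE "$label"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$NOTIFY_SCRIPT $*"
    else
        "$NOTIFY_SCRIPT" "$@"
    fi
}
```

For example, `DRY_RUN=1 simulate_link_event ent1 GOENT_LINK_DOWN` shows the command that would run. Without DRY_RUN it calls the real script, so a line should appear in the log file and an event in Systems Director. This does not replace a real shut/no shut test, since it bypasses errpt and errnotify entirely.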

Hope this can help.