new utility for decoding salinfo records

From: Ben Woodard <woodard_at_redhat.com>
Date: 2005-01-12 02:46:28
Excuse me if this ends up being a duplicate. I mailed this out last
night but for some reason, it hasn't come through. It is not in the
archives nor have I seen it come back through to my mail box.

Here is a new utility for looking into salinfo records. It does several
things differently than salinfo_decode. We have found that it helps
considerably in understanding problems on our Itanium servers. The
attached patch applies to salinfo-0.7 and does not modify
salinfo_decode's functioning in any way. In fact, the only files that are
modified are the Makefile and the spec file.

Here is the man page which tries to illustrate some of the features
which were designed into salinfo_decode2.

SALINFO_DECODE2(8)      Decode Itanium SAL Error Records      SALINFO_DECODE2(8)
 
NAME
       salinfo_decode2 - decode Itanium SAL error records
 
SYNOPSIS
       salinfo_decode2 [OPTION]... [FILE | DIRECTORY]...
 
DESCRIPTION
       salinfo_decode2 decodes CMC/CPE/MCA/INIT records obtained from the SAL.
       It will take a list of files or directories and print out the requested
       information  about  the salinfo records that are contained within those
       files. This is notably different than the salinfo_decode program  which
       processes  only a single record at a time. Experience has shown that it
       can be difficult to identify a hardware failure of the  type  found  in
       the  salinfo  logs  because the failure results in many salinfo records
       being created. salinfo_decode2 allows a system administrator to  glance
       at  a  directory  full  of errors or some subset of files and obtain an
       overall impression of how meaningful the errors are.  This is  done  by
       turning  down  the  verbosity  and  generalizing  what  is  there. More
       experienced  administrators  can  turn  up  the   verbosity   and   get
       progressively more detailed information.
 
       salinfo_decode2  also  has  the  capability  to generate output that is
       designed to be easily parsed by a machine. This is useful when you want
       to  automate  monitoring  of  large  numbers  of machines. For example,
       instead of having scripts notify you every time an ignorable single bit
       memory  error  occurs,  the  monitoring scripts can easily ignore those
       errors and only point out higher priority error conditions.
 
       If no files or directories are specified on the command line, stdin  is
       read and is assumed to be a SAL record.
 
       salinfo_decode2  also  has the advantage that a SAL record from an ia64
       can be inspected and analyzed on a non-ia64, non-little endian machine.
       For  example,  a  system  administrator  using  an ia32 workstation can
       inspect SAL records from an ia64 cluster.  The design of  the  original
       salinfo_decode's  internal  architecture  precludes this kind of cross-
       platform utilization.
 
OPTIONS
       -h, --help
              Print usage and exit
 
       -V, --version
              Print version information and exit
 
       -c, --cmc
              Only print cmc records
 
       -p, --cpe
              Only print cpe records
 
       -m, --mca
              Only print mca records
 
       -i, --init
              Only print init records
                                                                                
       -d, --dimm-offset
              Count dimms starting at 1 not 0. This is  useful  when  the  SAL
              reports  failures  starting with 0 but the numbers silk screened
              on  the  motherboard  begin  with  1.  This  helps reduce system
              administrator confusion when replacing the memory DIMM.
 
       -o, --cpu-offset
              Count  cpus  starting  at  1  not 0. This is useful when the SAL
              reports failures starting with 0 but the numbers  silk  screened
              on  the  motherboard  begin  with  1.  This  helps reduce system
              administrator confusion when replacing CPUs.
 
       --tiger4
              The same as -d & -o. The Intel Tiger 4 motherboard's  silkscreen
              counts  both CPUs and DIMMs beginning with 1 rather than 0 which
              is what the SAL returns.
 
       -f, --forgiving
              Be forgiving of errors when opening files and reading data
 
       -r, --recursive
              When a database is a directory, traverse its sub-directories
 
       -v, --verbosity
              Specify the verbosity to print records. Verbosity  can  be  1-6.
              However,  as  the  verbosity  increases, the likelihood that the
              printing of the detailed information hasn't been implemented yet
              also  increases.  Patches  to  remedy this situation are eagerly
              accepted.  The goal with the progressive levels of verbosity  is
              to  facilitate  understanding  of records, not just to blurt out
              every scrap of  available  information.  Since  verbosity  6  is
              largely  not  implemented  yet,  if  you need all of the available
              information, use the original salinfo_decode.
 
       -s, --scriptable
              Output in  a  machine  readable  format.  This  is  designed  to
              facilitate quick and easy shell scripting with the output. Refer
              to the examples section for intended use.
 
EXAMPLES
       Pointing salinfo_decode2 at a  directory  of  a  few  errors  with  the
       verbosity   set   very  low  shows  that  all  the  errors  are  mainly
       inconsequential:
 
       $ ./salinfo_decode2 -v1 tigertest/
       cpe with severity "corrected" occurred at 12:03:08 on Apr 1 2004
       cpe with severity "corrected" occurred at 12:03:10 on Apr 1 2004
       cpe with severity "corrected" occurred at 12:32:14 on Apr 1 2004
       cpe with severity "corrected" occurred at 17:24:44 on Apr 1 2004
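 
       When a directory holds many records, the -v1 lines can be  tallied  to
       get the overall impression described above. The following is a minimal
       sketch only: tally_records is a hypothetical helper that is  not  part
       of  salinfo_decode2,  and  it  assumes every -v1 line has the form
       shown above, with a one-word severity in double quotes.

```shell
# Hypothetical helper (not shipped with salinfo_decode2): count -v1 summary
# lines by record type and severity.
# ASSUMPTION: each line looks like
#   cpe with severity "corrected" occurred at 12:03:08 on Apr 1 2004
# Lines that do not match this shape are skipped.
tally_records() {
    awk '$2 == "with" && $3 == "severity" {
             gsub(/"/, "", $4)      # strip the quotes around the severity
             n[$1 " " $4]++         # key is "<type> <severity>"
         }
         END { for (k in n) print n[k], k }' | sort -k2
}
```

       It would be used as "./salinfo_decode2 -v1 tigertest/ | tally_records",
       printing one "count type severity" line per combination.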
 
       Here is an example of how different levels  of  verbosity  present  the
       same SAL record differently:
 
       $ ./salinfo_decode2 -v1 sample_data/tdev2-2004-04-01-12:03:08-cpu1-cpe0
       cpe with severity "corrected" occurred at 12:03:08 on Apr 1 2004
                                                                                
       $ ./salinfo_decode2 -v2 sample_data/tdev2-2004-04-01-12:03:08-cpu1-cpe0
       record 612413502631444488 contains the following sections: (PCI component) (PCI component) (PCI component) (PCI component) (memory) (platform specific)
 
       $ ./salinfo_decode2 -v3 sample_data/tdev2-2004-04-01-12:03:08-cpu1-cpe0
       record 612413502631444488 contains the following sections:
       PCI component with (vend/dev) 8086/500 at (Seg/Bus/Dev/Func) 0/255/24/0 reported a fault
       PCI component with (vend/dev) 8086/501 at (Seg/Bus/Dev/Func) 0/255/24/1 reported a fault
       PCI component with (vend/dev) 8086/502 at (Seg/Bus/Dev/Func) 0/255/24/2 reported a fault
       PCI component with (vend/dev) 8086/503 at (Seg/Bus/Dev/Func) 0/255/24/3 reported a fault
       Memory fault at (node/card/module/bank/device) 0/0/8/0/0
       OEM component with id 0x44fc4766d807e40f reported a fault
 
       Here is an example of how to use the scriptable interface to change the
       formatting of the output and to select one record  out  of  many  which
       match specific criteria.
 
       $ ./salinfo_decode2 -v1 -s sample_data/ | while read -r line; do
       > eval "$line"
       > if [ "$severity" != "corrected" ]; then
       >    echo "$month/$day/$year"
       > fi
       > done
       4/1/2004
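 
       The same loop can serve as a monitoring filter that stays quiet  about
       corrected  errors.  This  is  a  sketch  under stated assumptions, not
       something the patch provides: report_uncorrected  is  a  hypothetical
       function,  and  the field name "type" is a guess. Only severity, month,
       day and year appear in the example above, so check the actual -s output
       before relying on any other name.

```shell
# Hypothetical helper: read salinfo_decode2 -s style lines on stdin, print
# an alert for every record whose severity is not "corrected", then a total.
# ASSUMPTION: each line is a set of shell assignments, for example
#   type=cpe severity="corrected" month=4 day=1 year=2004
# The "type" field name is assumed; the others come from the example above.
report_uncorrected() {
    count=0
    while IFS= read -r line; do
        # Clear the fields so values from the previous record cannot leak.
        type= severity= month= day= year=
        # eval runs the line as shell code: only feed it trusted output.
        eval "$line"
        if [ "$severity" != "corrected" ]; then
            count=$((count + 1))
            echo "ALERT: $type severity=$severity on $month/$day/$year"
        fi
    done
    echo "uncorrected=$count"
}
```

       It would be used as "./salinfo_decode2 -v1 -s  sample_data/  |
       report_uncorrected".  Because  eval  executes its input, the function
       should only ever read output that comes directly from salinfo_decode2.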
 
BUGS
       Many  levels  of  verbosity  for  many  types  of  errors  are  not yet
       implemented. The project reached a state where it did  what  the  users
       needed  it  to do and then I was asked to work on other things. Patches
       are gratefully accepted.
 
AUTHOR
       Ben Woodard <woodard@redhat.com>
 
SEE ALSO
       salinfo_decode(8)
                                                                                
Linux                             Jan 6, 2005                SALINFO_DECODE2(8)




Received on Tue Jan 11 10:50:29 2005
