
Korea Oracle User Group


Collecting monitoring data with the HP-UX Glance adviser

 

Administrators running HP-UX systems generally use glance for monitoring, since it is the tool HP provides.

Once you actually run glance, however, you will notice that the glance process itself consumes a fair amount of CPU.

With a single administrator this is hardly worth worrying about. But when several people monitor the same system, the glance sessions alone can push CPU utilization up considerably.

In that case it is more resource-efficient to have one glance instance collect the data and let the administrators read the collected data instead.

 

The sections below show how to implement and use this approach.

 

1. Using the HP-UX Glance adviser to log per-APPLICATION CPU, memory, and disk I/O statistics at one-minute intervals

 

It is driven by the following three files.

 

monitor_app.adv : the glance adviser script

start.sh : a script that starts the glance adviser

stop.sh : a script that stops the glance adviser

 

The contents of each file are as follows.

 

monitor_app.adv

PRINT "======================================================================="
PRINT "DATE / TIME: ", GBL_STATDATE, " - ", GBL_STATTIME, " TOT_CPU_USE: ", GBL_CPU_TOTAL_UTIL
PRINT "======================================================================="
PRINT "APP name |totalCPU|sysCPU|userCP|phyDSK|logRd |logWr | MEM "
PRINT "======================================================================="
APPLICATION LOOP {
  PRINT APP_NAME, "|",
        APP_CPU_TOTAL_UTIL, "|",
        APP_CPU_SYS_MODE_UTIL, "|",
        APP_CPU_USER_MODE_UTIL, "|",
        APP_DISK_PHYS_IO_RATE, "|",
        APP_DISK_LOGL_READ_RATE, "|",
        APP_DISK_LOGL_WRITE_RATE, "|",
        APP_MEM_RES
}

 

start.sh 

DATE=`date "+%y%m%d%H%M%S"` 
nohup glance -j 60 -adviser_only -syntax ./monitor_app.adv 1>> ./log.$DATE 2>/dev/null & 

 

stop.sh 

kill -9 $(ps -ef | grep adviser_only | grep monitor_app.adv | awk '{print $2}')

 

Below is sample output produced by running the glance adviser with the files above.

 

log.140714144714

=======================================================================
DATE / TIME: 07/14/2014 - 14:47:19
=======================================================================
APP name |totalCPU|sysCPU|userCP|phyDSK|logRd |logWr | MEM 
=======================================================================
other            |  70.6|  19.0|  51.6|  11.3|  98.3|   8.3|    18.9gb
network          |   0.0|   0.0|   0.0|   0.0|   0.0|   0.0|    19.8mb
memory_management|   0.0|   0.0|   0.0|   2.4|   0.0|   0.0|    32.0mb
other_user_root  |   0.0|   0.0|   0.0|   0.2|   0.8|   1.0|   353.8mb
=======================================================================
DATE / TIME: 07/14/2014 - 14:48:19
=======================================================================
APP name |totalCPU|sysCPU|userCP|phyDSK|logRd |logWr | MEM 
=======================================================================
other            |  50.1|   5.8|  44.3|  31.6|  25.2|  10.6|    18.9gb
network          |   0.0|   0.0|   0.0|   0.0|   0.0|   0.0|    19.8mb
memory_management|   0.0|   0.0|   0.0|   3.3|   0.0|   0.0|    32.0mb
other_user_root  |   0.3|   0.2|   0.0|   0.2|   5.6|   0.4|   354.0mb
=======================================================================
DATE / TIME: 07/14/2014 - 14:49:19
=======================================================================
APP name |totalCPU|sysCPU|userCP|phyDSK|logRd |logWr | MEM 
=======================================================================
other            |  50.5|   6.0|  44.5|  10.0|  27.6|   9.2|    18.9gb
network          |   0.0|   0.0|   0.0|   0.0|   0.0|   0.0|    19.8mb
memory_management|   0.0|   0.0|   0.0|   3.4|   0.0|   0.0|    32.0mb
other_user_root  |   0.4|   0.3|   0.1|   0.5|  88.7|   3.5|   354.0mb
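
Because the adviser output is pipe-delimited, it can be post-processed with standard tools instead of attaching more glance sessions. The sketch below is illustrative, not part of the original scripts: the `extract_cpu` helper name, the exact-match choice, and the inline sample lines are all assumptions; real input would be a log.<timestamp> file.

```shell
# extract_cpu APP_NAME: read adviser log lines on stdin and print the
# totalCPU column (field 2) for the application whose name matches exactly.
# Exact matching avoids "other" also matching "other_user_root".
extract_cpu() {
  awk -F'|' -v app="$1" '
    { name = $1; gsub(/ +$/, "", name) }          # trim padding from the name column
    name == app { gsub(/ /, "", $2); print $2 }'  # strip spaces, print totalCPU
}

# Example on two lines in the log format shown above:
printf '%s\n' \
  'other            |  70.6|  19.0|  51.6|  11.3|  98.3|   8.3|    18.9gb' \
  'other_user_root  |   0.0|   0.0|   0.0|   0.2|   0.8|   1.0|   353.8mb' \
  | extract_cpu other
# → 70.6
```

Running the same helper over a full log file yields a per-minute time series for one application without touching glance again.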

 

2. Collecting information on processes using more than 30% CPU

 

monitor_cpu.adv 

PRINT "======================================================================="
PRINT "DATE / TIME: ", GBL_STATDATE, " - ", GBL_STATTIME, " TOT_CPU_USE: ",GBL_CPU_TOTAL_UTIL 
PRINT "=======================================================================" 
PRINT "PROCESS name   |PROCESS id|    CPU Usage"
PRINT "=======================================================================" 
PROCESS LOOP { 
 if PROC_CPU_TOTAL_UTIL > 30 then {
  PRINT PROC_PROC_NAME|24, PROC_PROC_ID|10," ", PROC_CPU_TOTAL_UTIL|12
 }
} 

 

start_cpu.sh

DATE=`date "+%y%m%d%H%M%S"` 
nohup glance -j 60 -adviser_only -syntax ./monitor_cpu.adv 1>> ./log.$DATE 2>/dev/null & 

 

stop_cpu.sh

kill -9 $(ps -ef | grep adviser_only | grep monitor_cpu.adv | awk '{print $2}')

 

Below is sample output from running the glance adviser with these files.

 

log.140714155022

=======================================================================
DATE / TIME: 07/14/2014 - 15:50:27
=======================================================================
PROCESS name          |PROCESS id|      CPU Usage
=======================================================================
glance                       23679         70.6
oracleMVNOT                   7027        100.8
oracleMVNOT                   6243         98.9
oracleMVNOT                   7032        101.3
=======================================================================
DATE / TIME: 07/14/2014 - 15:50:32
=======================================================================
PROCESS name          |PROCESS id|      CPU Usage
=======================================================================
glance                       23679         65.2
glance                       29478         65.0
oracleMVNOT                   7027         99.0
oracleMVNOT                   6243         99.0
glance                       22084         64.5
oracleMVNOT                   7032         98.3
=======================================================================
DATE / TIME: 07/14/2014 - 15:50:36
=======================================================================
PROCESS name          |PROCESS id|      CPU Usage
=======================================================================
glance                       23679         68.8
oracleMVNOT                   7027         99.2
oracleMVNOT                   6243        101.1
oracleMVNOT                   7032        102.1
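
The process log above can likewise be summarized with standard tools. This is a sketch: the `top_offenders` helper name and the inline sample lines are illustrative assumptions. It counts how many intervals each process name spent above the 30% threshold:

```shell
# top_offenders: read adviser process-log lines on stdin and print, per
# process name, how many intervals it appeared in, most frequent first.
top_offenders() {
  awk '/^=/ { next }                              # skip ==== separator lines
       $1 == "DATE" || $1 == "PROCESS" { next }   # skip header lines
       NF >= 3 { count[$1]++ }                    # tally by process name
       END { for (p in count) print count[p], p }' \
  | sort -rn
}

printf '%s\n' \
  'glance                       23679         70.6' \
  'oracleMVNOT                   7027        100.8' \
  'oracleMVNOT                   6243         98.9' \
  | top_offenders
# prints:
# 2 oracleMVNOT
# 1 glance
```

Note that in the sample output the glance processes themselves appear above the threshold, which is exactly the overhead this article sets out to avoid.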

 

3. Adviser syntax worth keeping for reference

 

Most of it covers the alarms that appear at the bottom of the screen when glance is running.

$ cat adviser.syntax 
 
# The following symptoms are used by the default Alarm Window
# Bottleneck alarms.  They are re-evaluated every interval and
# the probabilities are summed.  These summed probabilities are
# checked by the bottleneck alarms.  The buttons on the gpm
# main window will turn yellow when a probability exceeds 50%
# for an interval, and red when a probability exceeds 90% for
# an interval.  You may edit these rules to suit your environment:
 
symptom CPU_Bottleneck type=CPU
rule GBL_CPU_TOTAL_UTIL        >   75  prob 25
rule GBL_CPU_TOTAL_UTIL        >   85  prob 25
rule GBL_CPU_TOTAL_UTIL        >   90  prob 25
rule GBL_PRI_QUEUE             >    3  prob 25
 
symptom Disk_Bottleneck type=DISK
rule GBL_DISK_UTIL_PEAK        >   50  prob GBL_DISK_UTIL_PEAK
rule GBL_DISK_SUBSYSTEM_QUEUE  >    3  prob 25
 
symptom Memory_Bottleneck type=MEMORY
rule GBL_MEM_QUEUE             >    2  prob 20
rule GBL_MEM_PAGEOUT_RATE      >    5  prob 20
rule GBL_MEM_PAGEOUT_RATE      >   50  prob 20
rule GBL_DISK_VM_WRITE_RATE    >    5  prob 20
rule GBL_DISK_VM_WRITE_RATE    >   50  prob 20
rule GBL_MEM_SWAPOUT_RATE      >    1  prob 35
rule GBL_MEM_SWAPOUT_RATE      >    4  prob 50
 
# this symptom definition is only available for 11.0
symptom Network_Bottleneck type=NETWORK
rule GBL_NET_OUTQUEUE          >    0  prob 10
rule GBL_NET_OUTQUEUE          >    1  prob 25 
rule GBL_NFS_CALL_RATE         >  500  prob 10
rule GBL_NET_COLLISION_PCT     >   10  prob 10
rule GBL_NET_COLLISION_PCT     >   25  prob 20
rule GBL_NET_COLLISION_PCT     >   50  prob 30
rule GBL_NET_PACKET_RATE       >  500  prob 10
rule GBL_NET_PACKET_RATE       > 1000  prob 10
rule GBL_NET_PACKET_RATE       > 3000  prob 20
rule GBL_NET_PACKET_RATE       > 5000  prob 20
rule GBL_NET_PACKET_RATE       > 9000  prob 20
 
 
# Below are the primary CPU, Disk, Memory, and Network Bottleneck alarms.
# For each area, a calculated bottleneck symptom probability is used
# to define yellow or red alerts.
 
 
alarm CPU_Bottleneck > 50 for 2 minutes
  start 
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  repeat every 10 minutes
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  end 
    reset alert "End of CPU Bottleneck Alert"
 
 
alarm Disk_Bottleneck > 50 for 2 minutes
  start 
    if Disk_Bottleneck > 90 then
      red alert "Disk Bottleneck probability= ", Disk_Bottleneck, "%"
    else
      yellow alert "Disk Bottleneck probability= ", Disk_Bottleneck, "%"
  repeat every 10 minutes
    if Disk_Bottleneck > 90 then
      red alert "Disk Bottleneck probability= ", Disk_Bottleneck, "%"
    else
      yellow alert "Disk Bottleneck probability= ", Disk_Bottleneck, "%"
  end 
    reset alert "End of Disk Bottleneck Alert"
 
 
alarm Memory_Bottleneck > 50 for 2 minutes
  start 
    if Memory_Bottleneck > 90 then
      red alert "Memory Bottleneck probability= ", Memory_Bottleneck, "%"
    else
      yellow alert "Memory Bottleneck probability= ", Memory_Bottleneck, "%"
  repeat every 10 minutes
    if Memory_Bottleneck > 90 then
      red alert "Memory Bottleneck probability= ", Memory_Bottleneck, "%"
    else
      yellow alert "Memory Bottleneck probability= ", Memory_Bottleneck, "%"
  end 
    reset alert "End of Memory Bottleneck Alert"
 
 
alarm Network_Bottleneck > 50 for 2 minutes
  start 
    if Network_Bottleneck > 90 then
      red alert "Network Bottleneck probability= ", Network_Bottleneck, "%"
    else
      yellow alert "Network Bottleneck probability= ", Network_Bottleneck, "%"
  repeat every 10 minutes
    if Network_Bottleneck > 90 then
      red alert "Network Bottleneck probability= ", Network_Bottleneck, "%"
    else
      yellow alert "Network Bottleneck probability= ", Network_Bottleneck, "%"
  end 
    reset alert "End of Network Bottleneck Alert"
 
# We will alarm according to the percentage of errors only when the packet
# rate exceeds a threshold. The values may need to be modified for your
# environment.
alarm  (GBL_NET_PACKET_RATE > 100) and
      ((GBL_NET_IN_ERROR_PCT > 4) or
       (GBL_NET_OUT_ERROR_PCT > 2))
  start
    yellow alert "Network error rate exceeded threshold"
  end
    reset alert "End of network error rate alert"
 
 
# The following are system table alarms.  If gpm overhead is a concern, and
# you think you will not have system table shortage problems, you may wish
# to delete these alarms.
 
# Global swap space utilization alarm:
alarm GBL_SWAP_SPACE_UTIL > 95
  start
    red alert "Global swap space is nearly full"
  end
    reset alert "End of global swap space full condition"
 
# Shared memory table alarm:
alarm TBL_SHMEM_TABLE_UTIL > 90
  start 
    red alert "Shared memory table is nearly full"
  end
    reset alert "End of shared memory table full condition"
                      
# Semaphore table alarm:
alarm TBL_SEM_TABLE_UTIL > 90
  start 
    red alert "Semaphore table is nearly full"
  end 
    reset alert "End of semaphore table full condition"
                      
# Message queue table alarm:
alarm TBL_MSG_TABLE_UTIL > 90
  start 
    red alert "Message queue table is nearly full"
  end 
    reset alert "End of message queue full condition"
                      
# Process table alarm:
alarm TBL_PROC_TABLE_UTIL > 90
  start
    red alert "Process table is nearly full"
  end 
    reset alert "End of process table full condition"
 
# File table alarm:
alarm TBL_FILE_TABLE_UTIL > 90
  start
    red alert "File table is nearly full"
  end 
    reset alert "End of file table full condition"
 
# File lock table alarm:
alarm TBL_FILE_LOCK_UTIL > 90
  start
    red alert "File lock table is nearly full"
  end 
    reset alert "End of file lock table full condition"
 
# This alarm tests for Transaction Tracker overflows.  If you have old
# transactions then restarting the ttd will free up that memory.  Otherwise,
# you may need to restart the midaemon with the -smdvss parm to increase
# midaemon capacity.
alarm  GBL_TT_OVERFLOW_COUNT > 0
  start
    yellow alert "Transaction Tracker overflow - restart ttd or midaemon - see man pages"
  repeat every 30 minutes
    yellow alert "Transaction Tracker overflow"
 
# This alarm tests for lost MI trace buffers by the kernel instrumentation.
# If this value has increased during the interval, then this alarm triggers.
# Intermittent lost buffers can be expected on busy systems, however
# consistent buffer loss can lead to incorrect performance information being
# reported by the tools.  If this alarm triggers often, you may wish to
# log a call with your local HP Response Center.
 
# initiallost variable used to keep track of how many lost buffers there
# were when glance was first invoked:
initiallost = initiallost
if initiallost == 0 then
  initiallost = GBL_LOST_MI_TRACE_BUFFERS
 
# lostbufs variable tracks increases in the cumulative count of lost buffers
lostbufs = lostbufs
alarm (lostbufs < GBL_LOST_MI_TRACE_BUFFERS) and
      (initiallost < GBL_LOST_MI_TRACE_BUFFERS)
  start {
    yellow alert "MI trace buffer loss detected"
    lostbufs = GBL_LOST_MI_TRACE_BUFFERS
  }
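
This syntax file can be run headless in the same way as the adviser scripts above, so the alarm messages are written to a log instead of the gpm window. A minimal sketch mirroring start.sh; the alarm_log.$DATE file name is an illustrative assumption:

```shell
# Start glance headless with the alarm syntax file, logging alerts
# (same pattern as start.sh; stop it the same way as stop.sh).
DATE=$(date "+%y%m%d%H%M%S")
nohup glance -j 60 -adviser_only -syntax ./adviser.syntax 1>> ./alarm_log.$DATE 2>/dev/null &
```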

 

4. Conclusion

 

When several people each run glance and the monitoring overhead starts to outweigh its benefit, a better approach is to run the glance adviser and log just the metrics you need. Reading the logged output with tail or similar tools keeps resource consumption low.

Logging the metrics you need this way brings further advantages as well, such as keeping a history of resource usage.

 

 

 
