监控系统监控到RAC 的一个实例异常关闭 ,时间是凌晨1点多,还好没有影响到业务

之后就是分析原因

这套RAC搭建在虚拟化环境OS SUSE11

查看oracel alert log信息

 Mon Dec 04 01:16:20 2017
Thread 1 advanced to log sequence 26371 (LGWR switch)
Current log# 2 seq# 26371 mem# 0: +DATA/mgr/onlinelog/group_2.258.887111011
Current log# 2 seq# 26371 mem# 1: +FRA/mgr/onlinelog/group_2.258.887111015
Mon Dec 04 01:16:20 2017
Archived Log entry 52713 added for thread 1 sequence 26370 ID 0x6c454da0 dest 1:
Mon Dec 04 01:24:14 2017
LMS0 (ospid: 17259) has not called a wait for 97 secs. <<<<<<<<<<<===========
Mon Dec 04 01:24:29 2017
Errors in file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_lmhb_17273.trc (incident=252137):
ORA-29770: ?????? LMS0 (OSID 17259) ???????? 70 ?
Incident details in: /oracle/product/diag/rdbms/mgr/mgr1/incident/incdir_252137/mgr1_lmhb_17273_i252137.trc <<<<<<<<<<<===========
Mon Dec 04 01:24:40 2017
ERROR: Some process(s) is not making progress. <<<<<<<<<<<===========
LMHB (ospid: 17273) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior <<<<<<<<<<<===========
ERROR: Some process(s) is not making progress.
LMHB (ospid: 17273): terminating the instance due to error 29770 <<<<<<<<<<<===========
Mon Dec 04 01:24:41 2017
System state dump requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_diag_17245.trc
Mon Dec 04 01:24:41 2017
ORA-1092 : opitsk aborting process
Dumping diagnostic data in directory=[cdmp_20171204012441], requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination].
Mon Dec 04 01:24:46 2017
ORA-1092 : opitsk aborting process
Termination issued to instance processes. Waiting for the processes to exit
Mon Dec 04 01:24:53 2017
ORA-1092 : opitsk aborting process
Mon Dec 04 01:24:53 2017
License high water mark = 43
Mon Dec 04 01:24:54 2017
Instance termination failed to kill one or more processes
Instance terminated by LMHB, pid = 17273
USER (ospid: 3513): terminating the instance
Instance terminated by USER, pid = 3513
Mon Dec 04 01:25:18 2017
Starting ORACLE instance (normal)

从日志信息来看,LMS0进程1:24:14时已经等待97s,之后生成trace文件,之后报错ORA-29770,之后实例被终止。

科普一下LMS进程:

LMSn进程会维护在Global Resource Directory (GRD)中的数据文件以及每个cached block的状态。LMSn用于在RAC的实例间进行message以及数据块的传输,这个对应的服务也就是GCS(Global Cache Service),LMS是Cache Fusion的一个重要部分。LMS进程可以说是RAC上最活跃的后台进程,会消耗较多的CPU.一般每个实例会有多个LMS进程,每个Oracle版本的默认的LMS进程数目会有所不同,大部分版本的默认值是:MIN(CPU_COUNT/2, 2))

ORACLE报错信息

ORA-29770: global enqueue process LMS0 (OSID 17259) is hung for more than 70 seconds

报错生成了trace文件

Relevant Information Collection
---------------------------------------
System name: Linux
Node name: mgrdb01
Release: 3.0.76-0.11-default
Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
Machine: x86_64
VM name: VMWare Version: 6 <<<<<<<=============
Instance name: mgr1
Redo thread mounted by this instance: 1
Oracle process number: 17
Unix process pid: 17273, image: oracle@mgrdb01 (LMHB) <<<<<<<============= *** 2017-12-04 01:23:14.315
*** SESSION ID:(1599.1) 2017-12-04 01:23:14.315
*** CLIENT ID:() 2017-12-04 01:23:14.315
*** SERVICE NAME:(SYS$BACKGROUND) 2017-12-04 01:23:14.315
*** MODULE NAME:() 2017-12-04 01:23:14.315
*** ACTION NAME:() 2017-12-04 01:23:14.315 *** TRACE FILE RECREATED AFTER BEING REMOVED *** *** 2017-12-04 01:23:14.314
==============================
LMS0 (ospid: 17259) has not moved for 41 sec (1512321794.1512321753) <<<<<<<=============
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2
: Not in wait; last wait ended 37 secs ago.
: last wait_id 570620887 at 'gcs remote message'. <<<<<<<=============
==============================
Dumping PROCESS LMS0 (ospid: 17259) States
==============================
===[ Callstack ]=== *** 2017-12-04 01:23:14.316
Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259,
pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 01:23:13 ]
NOTE: scheduling delay has not been sampled for 0.935154 secs 0.000000 secs from [ 01:23:09 - 01:23:14 ], 5 sec avg
0.000000 secs from [ 01:22:15 - 01:23:14 ], 1 min avg
0.000000 secs from [ 01:18:15 - 01:23:14 ], 5 min avg
loadavg : 22.91 7.20 2.77
Memory (Avail / Total) = 92815.82M / 129069.95M
Swap (Avail / Total) = 16384.00M / 16384.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= lms0 is in status 'D'.uninterruptible I/O . *** 2017-12-04 01:23:19.322
Short stack dump: ORA-32516: ?????? 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)' ??? ORADEBUG ?? 'SHORT_STACK'; ?????? 4920 ?? -------------------------------------------------------------------------------
Process diagnostic dump actual duration=5.010000 sec
(max dump time=5.000000 sec) *** 2017-12-04 01:23:19.322
kjgcr_SlaveReqBegin: message queued to slave
kjgcr_Main: KJGCR_ACTION - id 3
CPU is high. Top oracle users listed below: <<<<<<<============= CPU is high , but no oracle user consuming high CPU .
Session Serial CPU
1882 1 0
1 28043 0
95 1 0
97 24343 0
189 1 0 *** 2017-12-04 01:23:24.390
kjgcr_Main: Reset called for action high cpu, identify users, count 0 *** 2017-12-04 01:23:24.391
kjgcr_Main: Reset called for action high cpu, kill users, count 0 *** 2017-12-04 01:23:24.391
kjgcr_Main: Reset called for action high cpu, activate RM plan, count 0 *** 2017-12-04 01:23:24.391
kjgcr_Main: Reset called for action high cpu, set BG into RT, count 0 *** 2017-12-04 01:23:34.391
==============================
LMS0 (ospid: 17259) has not moved for 61 sec (1512321814.1512321753) <<<<<<<=============
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2
: Not in wait; last wait ended 57 secs ago.
: last wait_id 570620887 at 'gcs remote message'.
...... *** 2017-12-04 01:23:54.401
Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259,
pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 01:23:53 ]
NOTE: scheduling delay has not been sampled for 0.859464 secs 0.000000 secs from [ 01:23:49 - 01:23:54 ], 5 sec avg
0.000000 secs from [ 01:22:55 - 01:23:54 ], 1 min avg
0.000000 secs from [ 01:18:55 - 01:23:54 ], 5 min avg
loadavg : 31.82 11.43 4.38
Memory (Avail / Total) = 92795.89M / 129069.95M
Swap (Avail / Total) = 16384.00M / 16384.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= still 'D' *** 2017-12-04 01:23:59.409
Short stack dump: ORA-32515: ???? ORADEBUG ?? 'SHORT_STACK' ??? 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)'; ???????????? 4930 ?? -------------------------------------------------------------------------------
Process diagnostic dump actual duration=5.010000 sec
(max dump time=5.000000 sec) *** 2017-12-04 01:23:59.410
kjgcr_Main: KJGCR_ACTION - id 5 *** 2017-12-04 01:24:14.409
==============================
LMS0 (ospid: 17259) has not moved for 101 sec (1512321854.1512321753)
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 6
==================================================
=== LMS0 (ospid: 17259) Heartbeat Report
==================================================
LMS0 (ospid: 17259) has no heartbeats for 101 sec. (threshold 70 sec)
: Not in wait; last wait ended 97 secs ago.
: last wait_id 570620887 at 'gcs remote message'.
==============================
Dumping PROCESS LMS0 (ospid: 17259) States <<<<<<<=============
==============================
===[ System Load State ]===
CPU Total 32 Core 32 Socket 4
Load normal: Cur 9144 Highmark 40960 (35.71 160.00)
===[ Latch State ]===
Not in Latch Get
===[ Session State Object ]===
----------------------------------------
SO: 0xb98ef85d8, type: 4, owner: 0xb88bf79d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0xb88bf79d0, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 1223 ser: 1 trans: (nil), creator: 0xb88bf79d0
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 1 min 37 sec ago <<<<<<<============= not in wait.
Wait State:
fixed_waits=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
elapsed time of 1 min 37 sec since last wait
0: waited for 'gcs remote message'<<<<<<<=============
waittime=0x1, poll=0x0, event=0x0
wait_id=570620887 seq_num=7644 snap_id=1
wait times: snap=0.001200 sec, exc=0.001200 sec, total=0.001200 sec
wait times: max=0.010000 sec
wait counts: calls=1 os=1
occurred after 0.000218 sec of elapsed time
1: waited for 'gcs remote message'
waittime=0x1, poll=0x0, event=0x0
wait_id=570620886 seq_num=7643 snap_id=1
wait times: snap=0.006362 sec, exc=0.006362 sec, total=0.006362 sec
wait times: max=0.010000 sec
wait counts: calls=1 os=1
occurred after 0.000022 sec of elapsed time

通过trace文件可以看到,lms0 的状态是 'D'

D是什么意思?

Linux进程状态:D (TASK_UNINTERRUPTIBLE),不可中断的睡眠状态。

引起D状态的根本原因是由于IO等待,若你对某个磁盘的IO操作特别频繁,就会造成后续的IO操作处于等待状态,即处于D状态。

也就是说LMS0进程在不可中断的睡眠状态,同时CPU很高,但是不是ORACLE用户占用的。

到这里就可以怀疑,问题可能出现在OS层面,去查看OSwatcher监控信息

zzz ***Mon Dec 4 01:21:02 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 0 0 95113492 1518640 31129980 0 0 3 5 0 0 1 0 99 0 0
0 0 0 95107952 1518640 31130044 0 0 80 1 6505 8514 2 1 97 0 0
0 0 0 95107652 1518640 31130020 0 0 80 34 4694 7057 1 0 99 0 0
zzz ***Mon Dec 4 01:21:32 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 95120764 1518644 31130072 0 0 3 5 0 0 1 0 99 0 0
1 0 0 95116052 1518644 31130136 0 0 80 1 6378 8855 2 1 97 0 0
0 0 0 95115644 1518644 31130132 0 0 16 34 5174 7705 1 0 99 0 0
zzz ***Mon Dec 4 01:24:53 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 6 0 96085040 1518752 31135792 0 0 3 5 0 0 1 0 99 0 0
2 0 0 96155748 1518772 31120520 0 0 0 9642 11662 9591 9 5 84 3 0
1 0 0 97170696 1518780 30175204 0 0 0 2749 5527 6026 3 2 95 0 0
zzz ***Mon Dec 4 01:25:23 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 117056400 1518796 10318440 0 0 3 5 0 0 1 0 99 0 0
0 0 0 117052664 1518796 10318456 0 0 0 230 3792 4556 0 1 99 0 0
1 0 0 116783816 1518796 10587120 0 0 0 841 3856 4530 1 2 97 0 0
vmstat 采样间隔为30秒,但是从01:21:32 之后一直到01:24:53之间的记录都没有 。说明OS在此期间性能很差。 
性能差到什么程度,简单的top,iostat,ps都无法生成,直接导致lms进程通信障碍。
问题定位到OS层面,很可惜我们抛开了oracle的嫌疑,但是OS工程师并没有找到问题的根因。

最新文章

  1. TortoiseSVN客户端重新设置用户名和密码
  2. 关于 jsp 解析特殊字符的问题
  3. 实操UNITY3D接入91SDK安卓版
  4. node.js在linux下的安装
  5. 3、JS中的语句
  6. jvm中加载类的全过程
  7. 《javascript高级程序设计》 第24章 最佳实践 Best Practices
  8. bzoj3668: [Noi2014]起床困难综合症
  9. CityEngine2012(32位)安装
  10. 【Python】实践笔记
  11. sql server 的JDBC 配置
  12. Ubuntu 使用apt-get时提示错误:无法获得锁 /var/lib/dpkg/lock
  13. [连载]JavaScript讲义(03)--- JavaScript面向对象编程
  14. EditText输入长度动态控制,最大长度为16位,小数点后面最大为2位,输入整数只能为13位
  15. shell全备份脚本(借鉴别人的,在其基础上修复完善了bug)
  16. 2.10 工具使用 after effects(图形视频处理软件)
  17. 通过DNS传输后门来绕过杀软
  18. Java单链表实现
  19. javascript面向对象系列第四篇——OOP中的常见概念
  20. foreach next 操作数组指针移动问题

热门文章

  1. 洛谷——P1331 海战
  2. luogu P1510 精卫填海
  3. SnakeYaml使用
  4. maven命令行创建project
  5. Laravel composer自定义命令空间,方便引用第三方库
  6. java单测时的等待模块awaitility
  7. jsp网页在浏览器中不显示图片_eclipse环境下配置tomcat中jsp项目的虚拟路径
  8. C# 操作摄像头
  9. AngularJS的ng-class示例
  10. binary-tree-maximum-path-sum——二叉树任意一条路径上的最大值