RAC 单节点实例异常关闭，关键报错ORA--29770

监控系统监控到RAC 的一个实例异常关闭，时间是凌晨1点多，还好没有影响到业务

之后就是分析原因

这套RAC搭建在虚拟化环境OS SUSE11

查看oracel alert log信息

 Mon Dec 04 01:16:20 2017

 Thread 1 advanced to log sequence 26371 (LGWR switch)

 Current log# 2 seq# 26371 mem# 0: +DATA/mgr/onlinelog/group_2.258.887111011

 Current log# 2 seq# 26371 mem# 1: +FRA/mgr/onlinelog/group_2.258.887111015

 Mon Dec 04 01:16:20 2017

 Archived Log entry 52713 added for thread 1 sequence 26370 ID 0x6c454da0 dest 1:

 Mon Dec 04 01:24:14 2017

 LMS0 (ospid: 17259) has not called a wait for 97 secs. <<<<<<<<<<<===========

 Mon Dec 04 01:24:29 2017

 Errors in file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_lmhb_17273.trc (incident=252137):

 ORA-29770: ?????? LMS0 (OSID 17259) ???????? 70 ?

 Incident details in: /oracle/product/diag/rdbms/mgr/mgr1/incident/incdir_252137/mgr1_lmhb_17273_i252137.trc <<<<<<<<<<<===========

 Mon Dec 04 01:24:40 2017

 ERROR: Some process(s) is not making progress. <<<<<<<<<<<===========

 LMHB (ospid: 17273) is terminating the instance.

 Please check LMHB trace file for more details.

 Please also check the CPU load, I/O load and other system properties for anomalous behavior <<<<<<<<<<<===========

 ERROR: Some process(s) is not making progress.

 LMHB (ospid: 17273): terminating the instance due to error 29770 <<<<<<<<<<<===========

 Mon Dec 04 01:24:41 2017

 System state dump requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination].

 System State dumped to trace file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_diag_17245.trc

 Mon Dec 04 01:24:41 2017

 ORA-1092 : opitsk aborting process

 Dumping diagnostic data in directory=[cdmp_20171204012441], requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination].

 Mon Dec 04 01:24:46 2017

 ORA-1092 : opitsk aborting process

 Termination issued to instance processes. Waiting for the processes to exit

 Mon Dec 04 01:24:53 2017

 ORA-1092 : opitsk aborting process

 Mon Dec 04 01:24:53 2017

 License high water mark = 43

 Mon Dec 04 01:24:54 2017

 Instance termination failed to kill one or more processes

 Instance terminated by LMHB, pid = 17273

 USER (ospid: 3513): terminating the instance

 Instance terminated by USER, pid = 3513

 Mon Dec 04 01:25:18 2017

 Starting ORACLE instance (normal)

从日志信息来看，LMS0进程1:24:14时已经等待97s，之后生成trace文件，之后报错ORA-29770，之后实例被终止。

科普一下LMS进程：

LMSn进程会维护在Global Resource Directory (GRD)中的数据文件以及每个cached block的状态。LMSn用于在RAC的实例间进行message以及数据块的传输，这个对应的服务也就是GCS(Global Cache Service)，LMS是Cache Fusion的一个重要部分。LMS进程可以说是RAC上最活跃的后台进程，会消耗较多的CPU.一般每个实例会有多个LMS进程，每个Oracle版本的默认的LMS进程数目会有所不同，大部分版本的默认值是：MIN(CPU_COUNT/2, 2))

ORACLE报错信息

ORA-29770: global enqueue process LMS0 (OSID 17259) is hung for more than 70 seconds

报错生成了trace文件

Relevant Information Collection

---------------------------------------

System name: Linux

Node name: mgrdb01

Release: 3.0.76-0.11-default

Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)

Machine: x86_64

VM name: VMWare Version: 6 <<<<<<<=============

Instance name: mgr1

Redo thread mounted by this instance: 1

Oracle process number: 17

Unix process pid: 17273, image: oracle@mgrdb01 (LMHB) <<<<<<<============= 

*** 2017-12-04 01:23:14.315

*** SESSION ID:(1599.1) 2017-12-04 01:23:14.315

*** CLIENT ID:() 2017-12-04 01:23:14.315

*** SERVICE NAME:(SYS$BACKGROUND) 2017-12-04 01:23:14.315

*** MODULE NAME:() 2017-12-04 01:23:14.315

*** ACTION NAME:() 2017-12-04 01:23:14.315 

*** TRACE FILE RECREATED AFTER BEING REMOVED *** 

*** 2017-12-04 01:23:14.314

==============================

LMS0 (ospid: 17259) has not moved for 41 sec (1512321794.1512321753) <<<<<<<=============

kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2

: Not in wait; last wait ended 37 secs ago.

: last wait_id 570620887 at 'gcs remote message'. <<<<<<<=============

==============================

Dumping PROCESS LMS0 (ospid: 17259) States

==============================

===[ Callstack ]=== 

*** 2017-12-04 01:23:14.316

Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259,

pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

0.000000 secs at [ 01:23:13 ]

NOTE: scheduling delay has not been sampled for 0.935154 secs 0.000000 secs from [ 01:23:09 - 01:23:14 ], 5 sec avg

0.000000 secs from [ 01:22:15 - 01:23:14 ], 1 min avg

0.000000 secs from [ 01:18:15 - 01:23:14 ], 5 min avg

loadavg : 22.91 7.20 2.77

Memory (Avail / Total) = 92815.82M / 129069.95M

Swap (Avail / Total) = 16384.00M / 16384.00M

F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD

0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= lms0 is in status 'D'.uninterruptible I/O . 

*** 2017-12-04 01:23:19.322

Short stack dump: ORA-32516: ?????? 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)' ??? ORADEBUG ?? 'SHORT_STACK'; ?????? 4920 ?? 

-------------------------------------------------------------------------------

Process diagnostic dump actual duration=5.010000 sec

(max dump time=5.000000 sec) 

*** 2017-12-04 01:23:19.322

kjgcr_SlaveReqBegin: message queued to slave

kjgcr_Main: KJGCR_ACTION - id 3

CPU is high. Top oracle users listed below: <<<<<<<============= CPU is high , but no oracle user consuming high CPU .

Session Serial CPU

1882 1 0

1 28043 0

95 1 0

97 24343 0

189 1 0 

*** 2017-12-04 01:23:24.390

kjgcr_Main: Reset called for action high cpu, identify users, count 0 

*** 2017-12-04 01:23:24.391

kjgcr_Main: Reset called for action high cpu, kill users, count 0 

*** 2017-12-04 01:23:24.391

kjgcr_Main: Reset called for action high cpu, activate RM plan, count 0 

*** 2017-12-04 01:23:24.391

kjgcr_Main: Reset called for action high cpu, set BG into RT, count 0 

*** 2017-12-04 01:23:34.391

==============================

LMS0 (ospid: 17259) has not moved for 61 sec (1512321814.1512321753) <<<<<<<=============

kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2

: Not in wait; last wait ended 57 secs ago.

: last wait_id 570620887 at 'gcs remote message'.

...... 

*** 2017-12-04 01:23:54.401

Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259,

pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

0.000000 secs at [ 01:23:53 ]

NOTE: scheduling delay has not been sampled for 0.859464 secs 0.000000 secs from [ 01:23:49 - 01:23:54 ], 5 sec avg

0.000000 secs from [ 01:22:55 - 01:23:54 ], 1 min avg

0.000000 secs from [ 01:18:55 - 01:23:54 ], 5 min avg

loadavg : 31.82 11.43 4.38

Memory (Avail / Total) = 92795.89M / 129069.95M

Swap (Avail / Total) = 16384.00M / 16384.00M

F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD

0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= still 'D' 

*** 2017-12-04 01:23:59.409

Short stack dump: ORA-32515: ???? ORADEBUG ?? 'SHORT_STACK' ??? 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)'; ???????????? 4930 ?? 

-------------------------------------------------------------------------------

Process diagnostic dump actual duration=5.010000 sec

(max dump time=5.000000 sec) 

*** 2017-12-04 01:23:59.410

kjgcr_Main: KJGCR_ACTION - id 5 

*** 2017-12-04 01:24:14.409

==============================

LMS0 (ospid: 17259) has not moved for 101 sec (1512321854.1512321753)

kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 6

==================================================

=== LMS0 (ospid: 17259) Heartbeat Report

==================================================

LMS0 (ospid: 17259) has no heartbeats for 101 sec. (threshold 70 sec)

: Not in wait; last wait ended 97 secs ago.

: last wait_id 570620887 at 'gcs remote message'.

==============================

Dumping PROCESS LMS0 (ospid: 17259) States <<<<<<<=============

==============================

===[ System Load State ]===

CPU Total 32 Core 32 Socket 4

Load normal: Cur 9144 Highmark 40960 (35.71 160.00)

===[ Latch State ]===

Not in Latch Get

===[ Session State Object ]===

----------------------------------------

SO: 0xb98ef85d8, type: 4, owner: 0xb88bf79d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

proc=0xb88bf79d0, name=session, file=ksu.h LINE:12624, pg=0

(session) sid: 1223 ser: 1 trans: (nil), creator: 0xb88bf79d0

flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-

flags2: (0x409) -/-/INC

DID: , short-term DID:

txn branch: (nil)

oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS

ksuxds FALSE at location: 0

service name: SYS$BACKGROUND

Current Wait Stack:

Not in wait; last wait ended 1 min 37 sec ago <<<<<<<============= not in wait.

Wait State:

fixed_waits=0 flags=0x21 boundary=(nil)/-1

Session Wait History:

elapsed time of 1 min 37 sec since last wait

0: waited for 'gcs remote message'<<<<<<<=============

waittime=0x1, poll=0x0, event=0x0

wait_id=570620887 seq_num=7644 snap_id=1

wait times: snap=0.001200 sec, exc=0.001200 sec, total=0.001200 sec

wait times: max=0.010000 sec

wait counts: calls=1 os=1

occurred after 0.000218 sec of elapsed time

1: waited for 'gcs remote message'

waittime=0x1, poll=0x0, event=0x0

wait_id=570620886 seq_num=7643 snap_id=1

wait times: snap=0.006362 sec, exc=0.006362 sec, total=0.006362 sec

wait times: max=0.010000 sec

wait counts: calls=1 os=1

occurred after 0.000022 sec of elapsed time

通过trace文件可以看到，lms0 的状态是 'D'

D是什么意思？

Linux进程状态：D (TASK_UNINTERRUPTIBLE)，不可中断的睡眠状态。

引起D状态的根本原因是由于IO等待，若你对某个磁盘的IO操作特别频繁，就会造成后续的IO操作处于等待状态，即处于D状态。

也就是说LMS0进程在不可中断的睡眠状态，同时CPU很高，但是不是ORACLE用户占用的。

到这里就可以怀疑，问题可能出现在OS层面，去查看OSwatcher监控信息

zzz ***Mon Dec 4 01:21:02 CST 2017

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

5 0 0 95113492 1518640 31129980 0 0 3 5 0 0 1 0 99 0 0

0 0 0 95107952 1518640 31130044 0 0 80 1 6505 8514 2 1 97 0 0

0 0 0 95107652 1518640 31130020 0 0 80 34 4694 7057 1 0 99 0 0

zzz ***Mon Dec 4 01:21:32 CST 2017

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

4 0 0 95120764 1518644 31130072 0 0 3 5 0 0 1 0 99 0 0

1 0 0 95116052 1518644 31130136 0 0 80 1 6378 8855 2 1 97 0 0

0 0 0 95115644 1518644 31130132 0 0 16 34 5174 7705 1 0 99 0 0

zzz ***Mon Dec 4 01:24:53 CST 2017

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

8 6 0 96085040 1518752 31135792 0 0 3 5 0 0 1 0 99 0 0

2 0 0 96155748 1518772 31120520 0 0 0 9642 11662 9591 9 5 84 3 0

1 0 0 97170696 1518780 30175204 0 0 0 2749 5527 6026 3 2 95 0 0

zzz ***Mon Dec 4 01:25:23 CST 2017

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

4 0 0 117056400 1518796 10318440 0 0 3 5 0 0 1 0 99 0 0

0 0 0 117052664 1518796 10318456 0 0 0 230 3792 4556 0 1 99 0 0

1 0 0 116783816 1518796 10587120 0 0 0 841 3856 4530 1 2 97 0 0

vmstat 采样间隔为30秒，但是从01:21:32 之后一直到01:24:53之间的记录都没有 。说明OS在此期间性能很差。 
性能差到什么程度，简单的top，iostat，ps都无法生成，直接导致lms进程通信障碍。
问题定位到OS层面，很可惜我们抛开了oracle的嫌疑，但是OS工程师并没有找到问题的根因。

巴特西

RAC 单节点实例异常关闭，关键报错ORA--29770

最新文章

热门文章