由 BUG 引发 ORA-15064 进而导致数据库实例意外中止
今天早上刚到公司一会,就接到南京客户打来的电话,说他们的核心会员数据库宕机了,让我远程帮忙查看一下。还没来得急看报纸的我赶紧打开电脑,远程连接到客户的服务器进行诊断。
客户的生产环境是AIX 6.1上的Oracle 11.2.0.3.0,在凌晨1:40分左右发生的故障。为了不涉及泄露客户隐私,把数据库实例名进行了替换。下面看具体的分析:
1. 数据库alert.log分析
Mon Jan 05 01:40:50 2015
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_ora_16581034.trc:
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 288 Serial number: 5649
NOTE: deferred map free for map id 4422
Mon Jan 05 01:40:55 2015
NOTE: ASMB terminating
Mon Jan 05 01:40:55 2015
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64460))
WARNING: inbound connection timed out (ORA-3136)
Mon Jan 05 01:40:55 2015
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Mon Jan 05 01:40:55 2015
***********************************************************************
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64530))
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
Fatal NI connect error 12170.
WARNING: inbound connection timed out (ORA-3136)
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
Tns error struct:
ASMB (ospid: 5898342): terminating the instance due to error 15064
ns main err code: 12535
TNS-12535: TNS:operation timed out
Mon Jan 05 01:40:55 2015
***********************************************************************
ns secondary err code: 12606
Fatal NI connect error 12170.
nt main err code: 0
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
nt secondary err code: 0
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
nt OS err code: 0
Mon Jan 05 01:40:56 2015
System state dump requested by (instance=1, osid=5898342 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_diag_20578640.trc
Mon Jan 05 01:40:59 2015
ORA-1092 : opitsk aborting process
Mon Jan 05 01:41:00 2015
ORA-1092 : opitsk aborting process
Instance terminated by ASMB, pid = 5898342
Mon Jan 05 08:46:45 2015
Starting ORACLE instance (normal)
由于在故障点之前也会报一些ORA-3136的错误,故这里不做说明,与本次错误干系不大。从alert日志中发现,数据库实例在1:40:55时,被ASMB(进程号:5898342)进程终止,直到8点46分左右才开始启动(客户自己手动重启)。
2.相关trace文件分析
test5_ora_16581034.trc:
*** 2015-01-05 01:40:50.900
*** SESSION ID:(424.19913) 2015-01-05 01:40:50.900
*** CLIENT ID:() 2015-01-05 01:40:50.900
*** SERVICE NAME:(SYS$USERS) 2015-01-05 01:40:50.900
*** MODULE NAME:(backup incr datafile) 2015-01-05 01:40:50.900
*** ACTION NAME:(0000097 STARTED16) 2015-01-05 01:40:50.900
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 288 Serial number: 5649
基本与alert日志中报的错误相同,就是获知在1:40分这个故障点,数据库与ASM实例的连接中断了
test5_asmb_5898342.trc:
*** 2015-01-05 01:40:55.870
NOTE: ASMB terminating
*** 2015-01-05 01:40:55.922
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5124<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112
----- End of Abridged Call Stack Trace -----
*** 2015-01-05 01:40:55.954
ASMB (ospid: 5898342): terminating the instance due to error 15064
ksuitm: waiting up to [5] seconds before killing DIAG(20578640)
从这个trace文件可以获知,由于遭遇ora-15064错误,ASMB进程终止了数据库实例。
test5_diag_20578640.trc:
Session Wait History:
elapsed time of 1.543354 sec since last wait
0: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881966 seq_num=30034 snap_id=1
wait times: snap=0.000422 sec, exc=0.000422 sec, total=0.000422 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000016 sec of elapsed time
1: waited for 'ASM file metadata operation'
msgop=0x2, locn=0x3, =0x0
wait_id=881964 seq_num=30033 snap_id=2
wait times: snap=0.000082 sec, exc=0.000127 sec, total=4 min 4 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
2: waited for 'latch: shared pool'
address=0x700000000107048, number=0x133, tries=0x0
wait_id=881965 seq_num=30032 snap_id=1
wait times: snap=4 min 4 sec, exc=4 min 4 sec, total=4 min 4 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
3: waited for 'ASM file metadata operation'
msgop=0x2, locn=0x3, =0x0
wait_id=881964 seq_num=30031 snap_id=1
wait times: snap=0.000045 sec, exc=0.000045 sec, total=0.000045 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000019 sec of elapsed time
4: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881963 seq_num=30030 snap_id=1
wait times: snap=3.285876 sec, exc=3.285876 sec, total=3.285876 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000011 sec of elapsed time
5: waited for 'ASM file metadata operation'
msgop=0xc, locn=0x3, =0x0
wait_id=881962 seq_num=30029 snap_id=1
wait times: snap=0.000034 sec, exc=0.000034 sec, total=0.000034 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000019 sec of elapsed time
6: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881961 seq_num=30028 snap_id=1
wait times: snap=5.000528 sec, exc=5.000528 sec, total=5.000528 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000009 sec of elapsed time
7: waited for 'ASM file metadata operation'
msgop=0x0, locn=0x3, =0x0
wait_id=881960 seq_num=30027 snap_id=1
wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000016 sec of elapsed time
8: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881959 seq_num=30026 snap_id=1
wait times: snap=5.000493 sec, exc=5.000493 sec, total=5.000493 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000011 sec of elapsed time
9: waited for 'ASM file metadata operation'
msgop=0x0, locn=0x3, =0x0
wait_id=881958 seq_num=30025 snap_id=1
wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000017 sec of elapsed time
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[120 samples, 01:38:56 - 01:40:55]
waited for 'latch: shared pool', seq_num: 30032
p1: 'address'=0x700000000107048
p2: 'number'=0x133
p3: 'tries'=0x0
time_waited: >= 119 sec (still in wait)
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'latch: shared pool'
[120 samples, 01:38:56 - 01:40:55]
time_waited: >= 119 sec (still in wait)
在ASMB进程的的dump文件中,发现有2个4分4秒的等待,waited for 'ASM file metadata operation'和waited for 'latch: shared pool',而其他的相关操作都是只有几微秒,甚至0秒的
3.ASM实例的alert日志:
Mon Jan 05 01:40:56 2015
***********************************************************************
TNS-12537: TNS:connection closed
Fatal NI connect error 12537, connecting to:
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
ns secondary err code: 12560
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
nt main err code: 0
Time: 05-JAN-2015 01:40:56
nt secondary err code: 0
Tracing not turned on.
nt OS err code: 0
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Mon Jan 05 04:42:23 2015
NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds
Mon Jan 05 08:46:53 2015
NOTE: client test5:test5 registered, osid 23593052, mbr 0x0
Mon Jan 05 08:47:24 2015
Starting background process ASMB
Mon Jan 05 08:47:24 2015
ASMB started with pid=26, OS id=20185292
Mon Jan 05 08:47:25 2015
NOTE: client +ASM:+ASM registered, osid 16384406, mbr 0x1
在故障点之前,ASM的alert日志中未报过TNS错误,但故障点的时候,ASM中也出现了TNS连接关闭的错误,因此,也不排除网络因素引起的ASM实例故障。4点多的时候有一个NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds,根据文档“NOTE ASMB process exiting due to lack of ASM file activity (文档 ID 754110.1)”,这个NOTE可以忽略。到了8:46的时候,ASM实例随着数据库的重启也正常启动了
最终,找到了一篇MOS文档,是和这次遇到的故障相关的,应该就是Bug 13914613导致的这个问题:
Bug 13914613 Excessive time holding shared pool latch in kghfrunp with auto memory management
This note gives a brief overview of bug 13914613.The content was last updated on: 17-DEC-2014
Click here for details of each of the sections below.
Affects:
Product (Component) | Oracle Server (Rdbms) |
Range of versions believed to be affected | (Not specified) |
Versions confirmed as being affected | |
Platforms affected | Generic (all / most platforms affected) |
Fixed:
The fix for 13914613 is first included in |
Description
相关推荐
XPY 2020-11-11
CoderToy 2020-11-16
技术之博大精深 2020-10-16
emmm00 2020-11-17
bianruifeng 2020-11-16
云中舞步 2020-11-12
世樹 2020-11-11
暗夜之城 2020-11-11
张荣珍 2020-11-12
amienshxq 2020-11-14
ASoc 2020-11-14
yungpheng 2020-10-19
loveyouluobin 2020-09-29
尘封飞扬 2020-09-29
Coder技术文摘 2020-09-29
lbyd0 2020-11-17
BigYellow 2020-11-16
sushuanglei 2020-11-12