由 BUG 引发 ORA-15064 进而导致数据库实例意外中止

今天早上刚到公司一会,就接到南京客户打来的电话,说他们的核心会员数据库宕机了,让我远程帮忙查看一下。还没来得急看报纸的我赶紧打开电脑,远程连接到客户的服务器进行诊断。

客户的生产环境是AIX 6.1上的Oracle 11.2.0.3.0,在凌晨1:40分左右发生的故障。为了不涉及泄露客户隐私,把数据库实例名进行了替换。下面看具体的分析:
 
1. 数据库alert.log分析
 
Mon Jan 05 01:40:50 2015
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_ora_16581034.trc:
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 288 Serial number: 5649
NOTE: deferred map free for map id 4422
Mon Jan 05 01:40:55 2015
NOTE: ASMB terminating
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
 
Fatal NI connect error 12170.
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64460))
WARNING: inbound connection timed out (ORA-3136)
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
 
Fatal NI connect error 12170.
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64530))
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
 
Fatal NI connect error 12170.
WARNING: inbound connection timed out (ORA-3136)
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
  Tns error struct:
ASMB (ospid: 5898342): terminating the instance due to error 15064
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
    ns secondary err code: 12606
 
Fatal NI connect error 12170.
    nt main err code: 0
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    nt secondary err code: 0
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
    nt OS err code: 0
Mon Jan 05 01:40:56 2015
System state dump requested by (instance=1, osid=5898342 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_diag_20578640.trc
Mon Jan 05 01:40:59 2015
ORA-1092 : opitsk aborting process
Mon Jan 05 01:41:00 2015
ORA-1092 : opitsk aborting process
Instance terminated by ASMB, pid = 5898342
Mon Jan 05 08:46:45 2015
Starting ORACLE instance (normal)
 
由于在故障点之前也会报一些ORA-3136的错误,故这里不做说明,与本次错误干系不大。从alert日志中发现,数据库实例在1:40:55时,被ASMB(进程号:5898342)进程终止,直到8点46分左右才开始启动(客户自己手动重启)。
 
2.相关trace文件分析
 
test5_ora_16581034.trc:
 
*** 2015-01-05 01:40:50.900
*** SESSION ID:(424.19913) 2015-01-05 01:40:50.900
*** CLIENT ID:() 2015-01-05 01:40:50.900
*** SERVICE NAME:(SYS$USERS) 2015-01-05 01:40:50.900
*** MODULE NAME:(backup incr datafile) 2015-01-05 01:40:50.900
*** ACTION NAME:(0000097 STARTED16) 2015-01-05 01:40:50.900
 
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 288 Serial number: 5649
 
基本与alert日志中报的错误相同,就是获知在1:40分这个故障点,数据库与ASM实例的连接中断了
 
test5_asmb_5898342.trc:
 
*** 2015-01-05 01:40:55.870
NOTE: ASMB terminating
 
*** 2015-01-05 01:40:55.922
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5124<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112 
----- End of Abridged Call Stack Trace -----
 
*** 2015-01-05 01:40:55.954
ASMB (ospid: 5898342): terminating the instance due to error 15064
ksuitm: waiting up to [5] seconds before killing DIAG(20578640)
 
从这个trace文件可以获知,由于遭遇ora-15064错误,ASMB进程终止了数据库实例。
 
test5_diag_20578640.trc:
 
Session Wait History:
        elapsed time of 1.543354 sec since last wait
     0: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881966 seq_num=30034 snap_id=1
        wait times: snap=0.000422 sec, exc=0.000422 sec, total=0.000422 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000016 sec of elapsed time
     1: waited for 'ASM file metadata operation'
        msgop=0x2, locn=0x3, =0x0
        wait_id=881964 seq_num=30033 snap_id=2
        wait times: snap=0.000082 sec, exc=0.000127 sec, total=4 min 4 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     2: waited for 'latch: shared pool'
        address=0x700000000107048, number=0x133, tries=0x0
        wait_id=881965 seq_num=30032 snap_id=1
        wait times: snap=4 min 4 sec, exc=4 min 4 sec, total=4 min 4 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     3: waited for 'ASM file metadata operation'
        msgop=0x2, locn=0x3, =0x0
        wait_id=881964 seq_num=30031 snap_id=1
        wait times: snap=0.000045 sec, exc=0.000045 sec, total=0.000045 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000019 sec of elapsed time
     4: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881963 seq_num=30030 snap_id=1
        wait times: snap=3.285876 sec, exc=3.285876 sec, total=3.285876 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000011 sec of elapsed time
     5: waited for 'ASM file metadata operation'
        msgop=0xc, locn=0x3, =0x0
        wait_id=881962 seq_num=30029 snap_id=1
        wait times: snap=0.000034 sec, exc=0.000034 sec, total=0.000034 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000019 sec of elapsed time
     6: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881961 seq_num=30028 snap_id=1
        wait times: snap=5.000528 sec, exc=5.000528 sec, total=5.000528 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000009 sec of elapsed time
     7: waited for 'ASM file metadata operation'
        msgop=0x0, locn=0x3, =0x0
        wait_id=881960 seq_num=30027 snap_id=1
        wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000016 sec of elapsed time
     8: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881959 seq_num=30026 snap_id=1
        wait times: snap=5.000493 sec, exc=5.000493 sec, total=5.000493 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000011 sec of elapsed time
     9: waited for 'ASM file metadata operation'
        msgop=0x0, locn=0x3, =0x0
        wait_id=881958 seq_num=30025 snap_id=1
        wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000017 sec of elapsed time
 
sample interval: 1 sec, max history 120 sec
    ---------------------------------------------------
      [120 samples,                                        01:38:56 - 01:40:55]
        waited for 'latch: shared pool', seq_num: 30032
          p1: 'address'=0x700000000107048
          p2: 'number'=0x133
          p3: 'tries'=0x0
          time_waited: >= 119 sec (still in wait)
    ---------------------------------------------------
    Sampled Session History Summary:
      longest_non_idle_wait: 'latch: shared pool'
      [120 samples, 01:38:56 - 01:40:55]
          time_waited: >= 119 sec (still in wait)
 
在ASMB进程的的dump文件中,发现有2个4分4秒的等待,waited for 'ASM file metadata operation'和waited for 'latch: shared pool',而其他的相关操作都是只有几微秒,甚至0秒的
 
3.ASM实例的alert日志:
 
Mon Jan 05 01:40:56 2015
 
 
***********************************************************************
TNS-12537: TNS:connection closed
 
Fatal NI connect error 12537, connecting to:
 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
    ns secondary err code: 12560
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    nt main err code: 0
  Time: 05-JAN-2015 01:40:56
    nt secondary err code: 0
  Tracing not turned on.
    nt OS err code: 0
  Tns error struct:
    ns main err code: 12537
   
TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
Mon Jan 05 04:42:23 2015
NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds
Mon Jan 05 08:46:53 2015
NOTE: client test5:test5 registered, osid 23593052, mbr 0x0
Mon Jan 05 08:47:24 2015
Starting background process ASMB
Mon Jan 05 08:47:24 2015
ASMB started with pid=26, OS id=20185292 
Mon Jan 05 08:47:25 2015
NOTE: client +ASM:+ASM registered, osid 16384406, mbr 0x1
 
在故障点之前,ASM的alert日志中未报过TNS错误,但故障点的时候,ASM中也出现了TNS连接关闭的错误,因此,也不排除网络因素引起的ASM实例故障。4点多的时候有一个NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds,根据文档NOTE ASMB process exiting due to lack of ASM file activity (文档 ID 754110.1)”,这个NOTE可以忽略。到了8:46的时候,ASM实例随着数据库的重启也正常启动了
 
最终,找到了一篇MOS文档,是和这次遇到的故障相关的,应该就是Bug 13914613导致的这个问题:
 

Bug 13914613  Excessive time holding shared pool latch in kghfrunp with auto memory management

 This note gives a brief overview of bug 13914613. 
 The content was last updated on: 17-DEC-2014
 Click here for details of each of the sections below.

Affects:

Product (Component)Oracle Server (Rdbms)
Range of versions believed to be affected(Not specified)
Versions confirmed as being affected
Platforms affectedGeneric (all / most platforms affected)

Fixed:

The fix for 13914613 is first included in
 

Description

相关推荐