GoldenGate Extract进程hang问题解决一例

一直运行正常的extract进程突然hang住了,起初怀疑是不是找不到归档所致,但是细细的检查了下,extract所需的归档都还在,而且日志中也没报任何错。而且将extract进程停掉后,也能正常起来,没有任何错误或者告警:

// ggserr.log
2015-05-25 11:03:27 INFO OGG-01513 Oracle GoldenGate Capture for Oracle, elis.prm: Positioning to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626.
2015-05-25 11:03:27 INFO OGG-01516 Oracle GoldenGate Capture for Oracle, elis.prm: Positioned to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626, May 20, 2015 4:50:25 PM.
2015-05-25 11:03:27 INFO OGG-01513 Oracle GoldenGate Capture for Oracle, elis.prm: Positioning to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356.
2015-05-25 11:03:27 INFO OGG-01516 Oracle GoldenGate Capture for Oracle, elis.prm: Positioned to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356, May 20, 2015 4:50:19 PM.
2015-05-25 11:03:27 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, elis.prm: EXTRACT ELIS started.

//extract进程的rpt文件
2015-05-25 11:03:27 INFO OGG-00546 Default thread stack size: 196608.

2015-05-25 11:03:27 INFO OGG-00547 Increasing thread stack size from 196608 to 1048576.

2015-05-25 11:03:27 INFO OGG-01513 Positioning to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626.

2015-05-25 11:03:27 INFO OGG-01516 Positioned to (Thread 1) Sequence 5824, RBA 290544656, SCN 1415.2504802626, May 20, 2015 4:50:25 PM.

2015-05-25 11:03:27 INFO OGG-01513 Positioning to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356.

2015-05-25 11:03:27 INFO OGG-01516 Positioned to (Thread 2) Sequence 4798, RBA 18781712, SCN 1415.2504799356, May 20, 2015 4:50:19 PM.

2015-05-25 11:03:27 INFO OGG-01517 Position of first record processed for Thread 1, Sequence 5824, RBA 290544656, SCN 1415.2504802623, May 20, 2015 4:50:25 PM.

2015-05-25 11:03:27 INFO OGG-01644 BOUNDED RECOVERY: COMPLETE: for object pool 1: p33619994_Redo Thread 1 at SeqNo: 5824, RBA: 290545168, SCN: 1415.2504802626 (6079883526466), Thread: 1.

2015-05-25 11:03:27 INFO OGG-01055 Recovery initialization completed for target file /ogg/ggate/data/dirext/lis/ea002534, at RBA 1109.

2015-05-25 11:03:27 INFO OGG-01478 Output file /ogg/ggate/data/dirext/lis/ea is using format RELEASE 11.2.

2015-05-25 11:03:27 INFO OGG-01517 Position of first record processed for Thread 2, Sequence 4798, RBA 18781712, SCN 1415.2504799356, May 20, 2015 4:50:19 PM.

2015-05-25 11:03:27 INFO OGG-01026 Rolling over remote file /ogg/ggate/data/dirext/lis/ea002534.

2015-05-25 11:03:27 INFO OGG-01053 Recovery completed for target file /ogg/ggate/data/dirext/lis/ea002535, at RBA 1109.

2015-05-25 11:03:27 INFO OGG-01057 Recovery completed for all targets.
从日志中看,extract进程没有问题,但是extract进程实际上是运行不正常的:


GGSCI (hxddlis01) 8> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
JAGENT RUNNING
EXTRACT RUNNING ELIS 96:14:58 00:00:04
EXTRACT RUNNING PL2IS 00:00:00 00:00:00
EXTRACT RUNNING PLIS 00:00:00 00:00:09
extract进程的lag越来越大,而且trail文件也不再增长。这情况真是让人一头雾水。我们先来看看当前系统的运行环境:


GoldenGate:11.2.1.0.27
Database: 11.2.0.3 两节点的RAC,使用ASM
OS: AIX 6.1
然后搜索Oracle Support,找到一篇文章1432994.1,和碰到的情形有点相像。在我们的环境中也是使用了DBLOGREADER,但是不同的是DBLOGREADERBUFSIZE已经配置成1M:


...
TRANLOGOPTIONS BUFSIZE 1048576
TRANLOGOPTIONS DBLOGREADER, DBLOGREADERBUFSIZE 1048576
--TRANLOGOPTIONS _NOREADAHEAD ANY
EXTTRAIL /ogg/ggate/data/dirext/lis/ea
...
这也是AIX下能配置的最大值。难道该值还是太大的缘故?所以就尝试将该值改成了512K:


...
TRANLOGOPTIONS BUFSIZE 1048576
TRANLOGOPTIONS DBLOGREADER, DBLOGREADERBUFSIZE 524288
--TRANLOGOPTIONS _NOREADAHEAD ANY
EXTTRAIL /ogg/ggate/data/dirext/lis/ea
...
然后重启extract进程,结果就正常了。估计使用DBLOGREADER还是有不少的bug。

相关推荐