eygle.com   eygle.com
eygle.com eygle
eygle.com  
 

« 珍惜当下,命运无常 - 纪念一位朋友的离开 | Blog首页 | Oracle Developer Forum -开发者大会 »

ORA-600 [2103]错误及CF enqueue竞争

昨天,客户的一套Oracle 10.2.0.3 RAC环境遇到了一个严重故障,数据库最后以ORA-600 [2103]错误中止了一个实例。

日志信息大致如下:
Tue Dec  2 17:21:06 2008
Errors in file /u01/admin/erpdb/bdump/erpdb2_lgwr_127968.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
Tue Dec  2 17:21:08 2008
Trace dumping is performing id=[cdmp_20081202172108]
Tue Dec  2 17:21:11 2008
Errors in file /u01/admin/erpdb/bdump/erpdb2_lgwr_127968.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []

这里显示LGWR进程中止,出现故障,错误就是ORA-00600 [2103]号错误。
进一步的跟踪文件里的信息如下:
*** 2008-12-02 17:21:06.631
TIMEOUT ON CONTROL FILE ENQUEUE
mode=X, type=0, wait=1, eqt=900
*** 2008-12-02 17:21:06.631
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []

这个错误是说,CONTROL FILE ENQUEUE等待超时,超时时间是900秒,也就是错误信息后面的参数,900秒杀15分钟,也就是说,在数据库解决这个队列冲突之前,RAC hang住了15分钟,这15分钟是一个要命的时间。

一个内部参数可以控制这个超时时间,这个参数是:_controlfile_enqueue_timeout,其缺省值是900秒:
SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
  2    FROM SYS.x$ksppi x, SYS.x$ksppcv y
  3  WHERE x.indx = y.indx
  4    AND x.ksppinm LIKE '%&par%'
  5  /
Enter value for par: controlfile_enqueue
old  4:    AND x.ksppinm LIKE '%&par%'
new  4:    AND x.ksppinm LIKE '%controlfile_enqueue%'

NAME                                    VALUE      DESCRIB
---------------------------------------- ---------- ------------------------------------------------------------
_controlfile_enqueue_timeout            900        control file enqueue timeout in seconds
_controlfile_enqueue_holding_time        120        control file enqueue max holding time in seconds
_controlfile_enqueue_dump                FALSE      dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker        TRUE      enable killing controlfile enqueue blocker on timeout

但是调整这个参数需要验证和慎重。

ORA-600 [2103]相关的Bug有很多,只能提醒大家的是,遇到这个错误就要注意了!


历史上的今天...
    >> 2010-12-03文章:
    >> 2007-12-03文章:
    >> 2005-12-03文章:
           2005年的第一场雪

无觅

By eygle on 2008-12-03 14:23 | Comments (16) | Case | 2101 |

16 Comments

这个错误是如何产生,如何解决的呢?

yes,怎么解决的这个问题?

为什么出现超时??

此时有做备份吗?

这个bug不是说在10.2.0.2时解决掉了么?

在本子上做个记录

这个故障我还没有完全研究完,回头再和大家分享。

系统在做rman备份?还是日志切换太频繁?

这么巧,我今天也碰到一个,哈哈

lala,描述一下你的情况,我对这个案例比较感兴趣!

等待解决办法

db每个早上8点作rman backup,那天早上到公司查邮件收到告警,发出时间是8:23

RMAN-06900: WARNING: unable to generate V$RMAN_STATUS or V$RMAN_OUTPUT row
RMAN-06901: WARNING: disabling update of the V$RMAN_STATUS and V$RMAN_OUTPUT rows
ORACLE error from target database:
ORA-03113: end-of-file on communication channel

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00554: initialization of internal recovery manager package failed
RMAN-06003: ORACLE error from target database:
ORA-03114: not connected to ORACLE

查看日志,在8点15,8点19,8点23生成了2个archiver的trace
Fri Dec 5 08:15:47 2008
GES: Potential blocker (pid=29589) on resource CF-00000000-00000000;
enqueue info in file /u01/oracle/admin/lala/bdump/lala_arc1_25055.trc
Killing enqueue blocker (pid=29589) on resource CF-00000000-00000000
by killing session 27.38552
Fri Dec 5 08:19:56 2008
GES: Potential blocker (pid=29589) on resource CF-00000000-00000000;
enqueue info in file /u01/oracle/admin/lala/bdump/lala_arc0_25053.trc
Killing enqueue blocker (pid=29589) on resource CF-00000000-00000000
by killing session 27.38552
Fri Dec 5 08:23:21 2008
GES: Potential blocker (pid=29589) on resource CF-00000000-00000000;
enqueue info in file /u01/oracle/admin/lala/bdump/lala_arc1_25055.trc
Killing enqueue blocker (pid=29589) on resource CF-00000000-00000000
by terminating the process

后来我又试了一下rman target /
就出现了ora-600 2103
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []

9点21的时候有一个log switch, 接着arc1就被发现死了
Fri Dec 5 09:21:49 2008
ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
Fri Dec 5 09:24:00 2008
Errors in file /u01/oracle/admin/lala/bdump/lala_ckpt_24869.trc:
ORA-00445: background process "ARC1" did not start after 120 seconds

看第一个arc1的trace,原来早上4点就有记录了
*** 2008-10-30 04:00:39.752
*** SERVICE NAME:(SYS$BACKGROUND) 2008-10-30 04:00:39.746
*** SESSION ID:(35.1) 2008-10-30 04:00:39.746
Control file resized from 866 to 868 blocks
kccrsd_append: rectype = 11, lbn = 433, recs = 28
*** 2008-12-05 08:15:47.486
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=29589 sid=27 sser=38552 time_held=942 secs (ges mode req=6 held=4)
Killing blocker (pid=29589) on resource CF-00000000-00000000
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
possible owner[16.29589]
Dumping process info of pid[16.29589] requested by pid[29.25055]
Dumping process 16.29589 info:
*** 2008-12-05 08:15:47.494
Dumping diagnostic information for ospid 29589:
OS pid = 29589
loadavg : 1.62 1.69 1.94

看日志从前天晚上10点到当天早上6点没有记录,6点12到7点52有很频繁的log switch,最频繁的间隔1分钟,最久的间隔40分钟

遭遇过相同情况,这个问题关注,到现在遇到的唯一没搞明白的问题,当然那是因为遇到的问题太少了
O(∩_∩)O哈哈~

应该是切换太快了,rac无法支撑

造成队列等待的问题原因是bug还是什么?

show parameter control_file_record_keep_time


CopyRight © 2004~2020 云和恩墨,成就未来!, All rights reserved.
数据恢复·紧急救援·性能优化 云和恩墨 24x7 热线电话:400-600-8755 业务咨询:010-59007017-7040 or 7037 业务合作: marketing@enmotech.com