What's Mean "reliable message"?

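Today a customer's RAC environment ran into trouble.
In this two-node RAC environment, one node hung because of lock contention and could not be started again after shutdown.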

I was on the road when the failure occurred, so I hurried home to deal with it.
Once it was resolved, I went looking for the cause.

The AWR data from that period shows the following under Top 5 Timed Events:

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
reliable message                        354          89    251  219.4      Other
CPU time                                             32          78.3
db file sequential read               2,223          12      6   30.3   User I/O
control file sequential read         29,151           8      0   20.9 System I/O
db file scattered read                   36           2     62    5.5   User I/O
          -------------------------------------------------------------

The most prominent event here is reliable message, which Metalink explains as follows:

When you send a message using the 'KSR' intra-instance broadcast
service, the message publisher waits on this wait-event until
all subscribers have consumed the 'reliable message' just sent.
The publisher waits on this wait-event for three seconds and
then re-tests if all subscribers have consumed the message, or
until posted.

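In other words, when a message is sent across instances, the sender expects an acknowledgement from the subscribers; if it does not get a reliable reply, it keeps waiting. The wait is retried in 3-second cycles until replies have arrived from all subscribers or the sender is posted.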
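The wait pattern in the Metalink note can be sketched as a small Python model. This is only an illustration of "wait up to three seconds, re-test, or wake when posted"; it is not Oracle's KSR implementation, and the class name, method names, and the max_cycles parameter are all hypothetical.

```python
import threading


class ReliableMessage:
    """Toy model of a publisher waiting for all subscribers (hypothetical names)."""

    def __init__(self, subscribers):
        # Subscribers that have not yet consumed the message.
        self._pending = set(subscribers)
        self._cond = threading.Condition()

    def consume(self, subscriber):
        """A subscriber consumes the message and posts the publisher."""
        with self._cond:
            self._pending.discard(subscriber)
            self._cond.notify_all()

    def wait_until_consumed(self, retest_interval=3.0, max_cycles=None):
        """Wait in retest_interval-second cycles until nobody is pending.

        Returns (all_consumed, cycles_waited). max_cycles is an assumption
        added so the sketch can give up; the note describes no such limit.
        """
        cycles = 0
        with self._cond:
            while self._pending:
                if max_cycles is not None and cycles >= max_cycles:
                    return False, cycles
                # Sleeps until posted or until the interval expires,
                # then the loop re-tests the pending set.
                self._cond.wait(timeout=retest_interval)
                cycles += 1
            return True, cycles
```

If one subscriber never calls consume(), the publisher keeps accumulating wait cycles, which is the kind of steadily growing wait time the AWR report shows for this event.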

So in this environment, communication between the two nodes had already broken down: one node was not getting replies from the other.
This is a frightening failure, and reliable message is a real headache of an event.
As rocx123 describes:
Although this is an old issue it just happened to in a test RAC. "reliable message" is really not to worry for but if some sessions are waiting and the wait time (secs) is increasing you may look at parameter aq_tm_processes: it should not be ZERO. If it is, set it to at least 2.
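rocx123's suggestion can be turned into a quick check. The sketch below assumes a DB-API style cursor (for example one obtained from the python-oracledb driver) connected with enough privilege to read v$parameter; the function name is hypothetical, and it only returns the statement to review rather than running it.

```python
# Query for the current setting; v$parameter stores values as strings.
AQ_PARAM_QUERY = "SELECT value FROM v$parameter WHERE name = :pname"


def suggest_aq_tm_processes_fix(cursor, minimum=2):
    """Return an ALTER SYSTEM statement if aq_tm_processes is 0, else None.

    `cursor` is assumed to be a DB-API cursor on an Oracle connection.
    The default of 2 follows the "at least 2" advice quoted above.
    """
    cursor.execute(AQ_PARAM_QUERY, {"pname": "aq_tm_processes"})
    row = cursor.fetchone()
    if row is None:
        # Parameter not visible; nothing to suggest.
        return None
    if int(row[0]) != 0:
        # Already non-zero, so the advice does not apply.
        return None
    return f"ALTER SYSTEM SET aq_tm_processes = {minimum}"
```

Returning the statement instead of executing it leaves the decision to the DBA, since the comment itself says the event is often nothing to worry about unless the wait time keeps increasing.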

-The End-

By eygle on 2008-02-22 22:58 | Comments (13)

13 Comments

木匠 | February 23, 2008 1:38 AM

So you mean there was a network communication problem?

Then the fix should start with the network connection.

eygle | February 23, 2008 11:10 AM

Not necessarily a network problem. It may be that communication between the CRS stacks is failing, which means CRS itself may be at fault.

路千里 | February 23, 2008 3:19 PM

Shouldn't the title be phrased like this?
What's "reliable message"?
or
What does "reliable message" mean?

eygle | February 24, 2008 1:20 AM

How embarrassing! As long as the meaning gets across, I'm satisfied!

xwqj | February 28, 2008 4:54 PM

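This has been giving me a headache for the past two days. It is a 10.2.0.3 two-node cluster on HP-UX 11.23, and the heartbeat is a single crossover cable directly connecting two NICs. The alert keeps warning about time spent in Other events, sometimes reaching 100%. Running ADDM found no problems, and the AWR records show exactly this reliable message. What should I do? I have no leads at all.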

qq | February 28, 2008 4:58 PM

xwqj | February 28, 2008 5:16 PM

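This has been giving me a headache for the past two days. It is a 10.2.0.3 two-node cluster on HP-UX 11.23, and the heartbeat is a single crossover cable directly connecting two NICs. The alert keeps warning about time spent in Other events, sometimes reaching 100%. Running ADDM found no problems, and the AWR records show exactly this reliable message. What should I do? I have no leads at all. I could cry.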

eygle | February 29, 2008 10:10 PM

Check whether the network has shown any anomalies, for example any problems with traffic.

Also check CPU consumption and the like.

xwqj | March 3, 2008 2:01 PM

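My machines are newly purchased hp8640 servers, each node with 8 Itanium CPUs and 32 GB of memory. They currently hold only 2-3 GB of data, with just a few dozen users. After looking at this on the 28th I immediately moved the heartbeat onto a switch. The alert still warns about time spent in Other events, but reliable message no longer shows up much in the AWR Top 5. I have been busy these past few days, though, and have not paid close attention.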
Top 5 Timed Events
Event                                                     Waits  Time(s) Avg Wait(ms) % Total Call Time Wait Class
CPU time                                                              26                           98.0
Streams AQ: qmn coordinator waiting for slave to start        4       24        6,034              90.4      Other
CGS wait for IPC msg                                    405,145        3            0              12.3      Other
gc current block busy                                        34        2           69               8.8    Cluster
gcs log flush sync                                          171        2           10               6.7      Other

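I regret not buying fiber NICs for the heartbeat in the first place. How can I check whether this gc problem is caused by the network?

I don't know whether it could be caused by the application, because this company's application is really poor, but I have to rule out database problems first before I can argue with them. Frustrating.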

yb | June 24, 2011 10:10 AM

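Hello, Mr. Gai!
Was a solution to this problem ever found? Recently, when I manually created a snapshot on one node, this wait event appeared, with the wait time reaching more than 800 seconds.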

rocx123 | August 24, 2011 4:29 PM

Although this is an old issue it just happened to in a test RAC.
"reliable message" is really not to worry for but if some sessions are waiting and the wait time (secs) is increasing you may look at parameter aq_tm_processes: it should not be ZERO. If it is, set it to at least 2.

eygle | August 29, 2011 3:29 PM

Thanks for the addition.