eygle.com   eygle.com
eygle.com eygle
eygle.com  
 

« Oracle数据恢复:文件 数据错误(循环冗余检查) 解决 | Blog首页 | ALL_OBJECTS与X$KGLDP、X$KZSPR的复杂执行计划 »

诊断案例:PMON failed acquire latch QMNC Holder

在有些情况下,数据库关闭时会遇到PMON进程阻塞的情况,如果PMON进程不能及时获得Latch锁资源,就无法及时去清理事务,关闭进程,就会导致数据库无法关闭的情况。

这种情况在数据库运行状况下也可能发生,一旦这类情况出现,数据库就会出现严重的阻塞,失败的进程也无法获得清理和恢复,是严重的故障情况。

昨天一个客户的数据库就遇到这样的问题,在数据库关闭时出现如下错误。
首先是Latch无法获得,这里提示可能的阻塞者是 13648 号操作系统进程:

* SESSION ID:(333.1) 2011-06-23 00:48:48.250
PMON unable to acquire latch  c00000002000a558 slave class create level=0
        Location from where latch is held: ksvcreate:
        Context saved from call: 0
        state=busy, wlstate=free
    waiters [orapid (seconds since: put on list, posted, alive check)]:
     18 (53275, 1308761328, 53275)
     68 (3218, 1308761328, 334)
     waiter count=2
    gotten 11442 times wait, failed first 1324 sleeps 1325
    gotten 0 times nowait, failed: 0
  possible holder pid = 34 ospid=13648
----------------------------------------
而34号数据库进程,13648号系统进程,其详细进程信息随后输出:

SO: c000000436007f78, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
  (process) Oracle pid=34, calls cur/top: c0000004362407c8/c0000004362407c8, flag: (2) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 9
              last post received-location: ksqrcl
              last process to post me: c000000436003808 1 2
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: c000000436000868 1 6
  (latch info) wait_event=0 bits=1
    holding    (efd=3) c00000002000a558 slave class create level=0
        Location from where latch is held: ksvcreate:
        Context saved from call: 0
        state=busy, wlstate=free
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         18 (53275, 1308761328, 53275)
         68 (3218, 1308761328, 334)
         waiter count=2
    Process Group: DEFAULT, pseudo proc: c00000003b3adad8
    O/S info: user: oracle, term: UNKNOWN, ospid: 13648
    OSD pid info: Unix process pid: 13648, image: oracle@wfrb1 (QMNC)
    Short stack dump:
Dump of memory from 0xC00000003B39BA58 to 0xC00000003B39BC60
C00000003B39BA50                   00000005 00000000          [........]
C00000003B39BA60 C0000004 3927CAF8 00000010 0003139D  [....9'..........]
C00000003B39BA70 C0000004 362407C8 00000003 0003139D  [....6$..........]
C00000003B39BA80 C0000004 362EED70 0000000B 0003139D  [....6..p........]
C00000003B39BA90 C0000004 36224388 00000004 00031291  [....6"C.........]
C00000003B39BAA0 C0000004 36300C28 0000000D 0003139D  [....60.(........]
C00000003B39BAB0 00000000 00000000 00000000 00000000  [................]
  Repeat 26 times

可以看到进程是QMNC后台进程,QMNC进程是用于AQ队列的监控,为Oracle的流复制所使用:
qmnc -A queue monitor process which monitors the message queues. Used by Oracle Streams Advanced Queuing.


QMNC进程对于AQ表来说就相当于CJQ0进程之于作业表。QMNC进程会监视高级队列,并警告从队列中删除等待消息的"出队进程 "(dequeuer):已经有一个消息变为可用。QMNC和Qnnn还要负责队列传播(propagation),也就是说,能够将在一个数据库中入队(增加)的消息移到另一个数据库的队列中,从而实现出队(dequeueing)。


Qnnn进程对于QMNC进程就相当于Jnnn进程与CJQ0进程的关系。QMNC进程要通知Qnnn进程需要完成什么工作,Qnnn进程则会处理这些工作。


QMNC和Qnnn进程是可选的后台进程。参数AQTMPROCESSES 可以指定最多创建10个这样的进程(分别名为Q000,...,Q009),以及一个QMNC进程。如果AQTMPROCESSES设置为0,就没有 QMNC或Qnnn进程。不同于作业队列所用的Jnnn进程,Qnnn进程是持久的。如果将 AQTMPROCESSES设置为10,数据库启动时可以看到10个Qnnn进程和一个QMNC进程,而且在实例的整个生存期中这些进程都存在。

这个进程可以安全的Kill掉,以消除阻塞,在客户的这个系统中:
kill -9 13648

就释放了这个Latch占用。

根据Bug 5069930: QMNC PROCESS IS SPINNING AND CONSUMING HIGH CPU 的描述,在Oracle 10g中,可能存在QMNC进程SPIN空耗CPU的问题。

BUG描述如下:
PROBLEM:
--------
QMN process oraqmnc<SID> is taking upto 99% of CPU.
Customer is on 10.2.0.1.0.

DIAGNOSTIC ANALYSIS:
--------------------

+ looks like the oraqmnc<SID> process is not doing anything.
+ tried to take a 10046 trace using oradebug, but does not dump anything.
+ tried to find out if any SQLs executing using
DBMSSYSTEM.SETSQLTRACEINSESION, but again no SQLs found
+ errorstack shows, this process is apparently spinning.
+ system call trace (using strace) shows it is spinning on "times(NULL) =
489352606" system call.


WORKAROUND:
-----------
NONE

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------
reproducing at production, development and test environments of the customer.

TEST CASE:
----------

STACK TRACE:
------------
SQL> oradebug setospid 8692
Oracle pid: 24, Unix process pid: 8692, image:
oracle@ftibprod-db01.aozora.lan (QMNC)
SQL> oradebug short
stack
ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+90<-<0x3e13b0c320>
SQL> oradebug shortstack
ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+90<-<0x3e13b0c320>
SQL> oradebug short
stack
ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+90<-<0x3e13b0c320>
SQL> oradebug short_stack
ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+90<-<0x3e13b0c320>


这里通过short Stack跟踪堆栈的方法非常有助于判断问题的故障点。通常情况下,如果没有使用Streams选件,直接Kill掉QMNC进程就可以了。




历史上的今天...
    >> 2010-06-23文章:
    >> 2008-06-23文章:
    >> 2006-06-23文章:
           陌上花开,可缓缓归矣

无觅

By eygle on 2011-06-23 11:41 | Comments (0) | Case | 2822 |


CopyRight © 2004~2020 云和恩墨,成就未来!, All rights reserved.
数据恢复·紧急救援·性能优化 云和恩墨 24x7 热线电话:400-600-8755 业务咨询:010-59007017-7040 or 7037 业务合作: marketing@enmotech.com