January 19, 2006
作者:eygle
Oracle10g引入ASM以来,关于ASM的担心就从来没有停止过.因为ASM引入了一个新的ASM Instance,新的实例的稳定性一度成为了关注的焦点.
我们看一下这个新的ASM实例引入的后台进程:
[oracle@danaly bdump]$ ps -ef|grep ASM|grep -v grep oracle 3720 1 0 14:38 ? 00:00:00 asm_pmon_+ASM oracle 3722 1 0 14:38 ? 00:00:00 asm_psp0_+ASM oracle 3724 1 0 14:38 ? 00:00:00 asm_mman_+ASM oracle 3726 1 0 14:38 ? 00:00:00 asm_dbw0_+ASM oracle 3728 1 0 14:38 ? 00:00:00 asm_lgwr_+ASM oracle 3730 1 0 14:38 ? 00:00:00 asm_ckpt_+ASM oracle 3732 1 0 14:38 ? 00:00:00 asm_smon_+ASM oracle 3734 1 0 14:38 ? 00:00:00 asm_rbal_+ASM oracle 3736 1 0 14:38 ? 00:00:00 asm_gmon_+ASM oracle 3748 1 0 14:38 ? 00:00:00 asm_o000_+ASM oracle 3781 1 0 14:38 ? 00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
|
这些进程和数据库的后台进程同样重要(甚至可以说更为重要),如果ASM实例Crash,数据库将会立刻中止.
今天这个10g的数据库就遇到了这样的问题.第一次,Oracle10gR2的ASM挂了。
检查数据库的alert文件,获得如下信息:
Thu Jan 19 13:34:11 2006 WARNING: inbound connection timed out (ORA-3136) Thu Jan 19 13:52:47 2006 Errors in file /opt/oracle/admin/danaly/bdump/danaly_asmb_4154.trc: ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel Thu Jan 19 13:52:47 2006 ASMB: terminating instance due to error 15064 Instance terminated by ASMB, pid = 4154
|
看来是ASM实例出现问题,导致数据库down掉。
trace文件中记录了如下错误:
*** 2006-01-19 13:52:47.922 *** SERVICE NAME:(SYS$BACKGROUND) 2006-01-19 13:52:47.917 *** SESSION ID:(157.1) 2006-01-19 13:52:47.917 error 15064 detected in background process ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel
|
毫无疑问,都是ASM惹的祸。
检查ASM的日志文件记录了如下错误,显示磁盘组故障导致ASM终止:
Thu Jan 19 13:52:47 2006 Errors in file /opt/oracle/admin/+ASM/bdump/+asm_gmon_3948.trc: ORA-29702: error occurred in Cluster Group Service operation ORA-29702: error occurred in Cluster Group Service operation Thu Jan 19 13:52:47 2006 GMON: terminating instance due to error 29702 Instance terminated by GMON, pid = 3948
|
trace文件里记录了如下信息:
*** 2006-01-19 13:52:47.912 2006-01-19 13:52:47.912: [ CSSCLNT]clsssRecvMsg: comm error received, comrc 11, con (0xcc9fb48), msg (0xbfffe5d8), msgl 144 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: clsssRecvMsg failed 3 -1073748296) 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: returning 8 kgxgnpstat: received ABORT event from CLSS Group services Error [NM abort event ] @ 28019:719 error 29702 detected in background process ORA-29702: error occurred in Cluster Group Service operation ORA-29702: error occurred in Cluster Group Service operation
|
启动数据库,经过恢复,成功加载ASM磁盘组,随之数据库得以正常启动:
Thu Jan 19 14:38:16 2006 SQL> ALTER DISKGROUP ALL MOUNT Thu Jan 19 14:38:16 2006 NOTE: cache registered group ORADG number=1 incarn=0x728b66fe Thu Jan 19 14:38:16 2006 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface Thu Jan 19 14:38:16 2006 NOTE: Hbeat: instance first (grp 1) Thu Jan 19 14:38:20 2006 NOTE: start heartbeating (grp 1) NOTE: cache opening disk 0 of grp 1: VOL1 label:VOL1 Thu Jan 19 14:38:20 2006 NOTE: F1X0 found on disk 0 fcn 0.0 NOTE: cache opening disk 1 of grp 1: VOL2 label:VOL2 NOTE: cache opening disk 2 of grp 1: VOL3 label:VOL3 NOTE: cache opening disk 3 of grp 1: VOL4 label:VOL4 NOTE: cache mounting (first) group 1/0x728B66FE (ORADG) NOTE: starting recovery of thread=1 ckpt=17.8535 NOTE: advancing ckpt for thread=1 ckpt=17.8535 NOTE: cache recovered group 1 to fcn 0.401368 Thu Jan 19 14:38:21 2006 NOTE: opening chunk 1 at fcn 0.401368 ABA NOTE: seq=18 blk=8536 Thu Jan 19 14:38:21 2006 NOTE: cache mounting group 1/0x728B66FE (ORADG) succeeded SUCCESS: diskgroup ORADG was mounted Thu Jan 19 14:38:22 2006 NOTE: recovering COD for group 1/0x728b66fe (ORADG) SUCCESS: completed COD recovery for group 1/0x728b66fe (ORADG) Thu Jan 19 14:38:38 2006 Starting background process ASMB ASMB started with pid=12, OS id=3744
|
我们注意到,日志中几次出现heartbeat字样,原来heartbeat无处不在.
顺便把数据库的恢复信息也记录一下:
Thu Jan 19 14:38:43 2006 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 2 Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST WARNING: db_recovery_file_dest is same as db_create_file_dest Autotune of undo retention is turned on. IMODE=BR ILAT =18 LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.1.0. System parameters with non-default values: processes = 150 __shared_pool_size = 75497472 __large_pool_size = 4194304 __java_pool_size = 4194304 __streams_pool_size = 0 spfile = +ORADG/danaly/spfiledanaly.ora sga_target = 943718400 control_files = +ORADG/danaly/controlfile/current.256.600173845, +ORADG/danaly/controlfile/current.257.600173845 db_block_checksum = FULL db_block_size = 8192 __db_cache_size = 851443712 compatible = 10.2.0.1.0 db_file_multiblock_read_count= 16 db_create_file_dest = +ORADG db_recovery_file_dest = +ORADG db_recovery_file_dest_size= 2147483648 undo_management = AUTO undo_tablespace = UNDOTBS2 recyclebin = ON remote_login_passwordfile= EXCLUSIVE db_domain = dispatchers = (PROTOCOL=TCP) (SERVICE=danalyXDB) job_queue_processes = 10 background_dump_dest = /opt/oracle/admin/danaly/bdump user_dump_dest = /opt/oracle/admin/danaly/udump core_dump_dest = /opt/oracle/admin/danaly/cdump audit_file_dest = /opt/oracle/admin/danaly/adump db_name = danaly open_cursors = 300 pga_aggregate_target = 314572800 PMON started with pid=2, OS id=3752 PSP0 started with pid=3, OS id=3754 MMAN started with pid=4, OS id=3756 DBW0 started with pid=5, OS id=3758 LGWR started with pid=6, OS id=3760 CKPT started with pid=7, OS id=3762 SMON started with pid=8, OS id=3764 RECO started with pid=9, OS id=3766 CJQ0 started with pid=10, OS id=3768 MMON started with pid=11, OS id=3770 Thu Jan 19 14:38:44 2006 starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... MMNL started with pid=12, OS id=3772 Thu Jan 19 14:38:44 2006 starting up 1 shared server(s) ... Thu Jan 19 14:38:44 2006 ALTER DATABASE MOUNT Thu Jan 19 14:38:44 2006 Starting background process ASMB ASMB started with pid=16, OS id=3779 Starting background process RBAL RBAL started with pid=17, OS id=3783 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface Thu Jan 19 14:38:52 2006 SUCCESS: diskgroup ORADG was mounted Thu Jan 19 14:38:57 2006 Setting recovery target incarnation to 1 Thu Jan 19 14:38:57 2006 Successful mount of redo thread 1, with mount id 3945133236 Thu Jan 19 14:38:57 2006 Database mounted in Exclusive Mode Completed: ALTER DATABASE MOUNT Thu Jan 19 14:38:57 2006 ALTER DATABASE OPEN Thu Jan 19 14:38:57 2006 Beginning crash recovery of 1 threads parallel recovery started with 3 processes Thu Jan 19 14:38:58 2006 Started redo scan Thu Jan 19 14:38:58 2006 Completed redo scan 59 redo blocks read, 14 data blocks need recovery Thu Jan 19 14:38:58 2006 Started redo application at Thread 1: logseq 5834, block 72749 Thu Jan 19 14:38:58 2006 Recovery of Online Redo Log: Thread 1 Group 2 Seq 5834 Reading mem 0 Mem# 0 errs 0: +ORADG/danaly/onlinelog/group_2.260.600173851 Mem# 1 errs 0: +ORADG/danaly/onlinelog/group_2.261.600173853 Thu Jan 19 14:38:58 2006 Completed redo application Thu Jan 19 14:38:59 2006 Completed crash recovery at Thread 1: logseq 5834, block 72808, scn 16042818137 14 data blocks read, 14 data blocks written, 59 redo blocks read Thu Jan 19 14:39:00 2006 Thread 1 advanced to log sequence 5835 Thread 1 opened at log sequence 5835 Current log# 3 seq# 5835 mem# 0: +ORADG/danaly/onlinelog/group_3.262.600173855 Current log# 3 seq# 5835 mem# 1: +ORADG/danaly/onlinelog/group_3.263.600173857 Successful open of redo thread 1 Thu Jan 19 14:39:00 2006 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Thu Jan 19 14:39:00 2006 SMON: enabling cache recovery Thu Jan 19 14:39:01 2006 Successfully onlined Undo Tablespace 15. Thu Jan 19 14:39:01 2006 SMON: enabling tx recovery Thu Jan 19 14:39:01 2006 Database Characterset is ZHS16GBK replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC QMNC started with pid=22, OS id=3796 Thu Jan 19 14:39:09 2006 db_recovery_file_dest_size of 2048 MB is 16.75% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup. Thu Jan 19 14:39:09 2006 Completed: ALTER DATABASE OPEN
|
这样的错误在Metalink上都被当做Bug处理,很难找到真正的原因,如果不能再现,就只能做为一次异常处理了.
Posted by eygle at 6:07 PM
| Comments (0)
作者:eygle
在Oracle中有一个事件叫Heartbeat,这个词在很多地方被提及,并且有着不同的含义(比如RAC中),我们这里要讨论的是CKPT的Heartbeat机制。
Oracle通过CKPT进程每3秒将Heartbeat写入控制文件,以减少故障时的恢复时间(这个我们后面再详细阐述)。
我们可以通过如下方法验证这个过程。
1.首先在系统级启用10046时间跟踪
并重新启动数据库使之生效
[oracle@jumper oracle]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Thu Jan 19 09:24:04 2006
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production With the Partitioning option JServer Release 9.2.0.4.0 - Production
SQL> alter system set event='10046 trace name context forever,level 12' scope=spfile;
System altered.
SQL> shutdown immediate; Database closed. Database dismounted. ORACLE instance shut down.
SQL> startup ORACLE instance started.
Total System Global Area 114365800 bytes Fixed Size 451944 bytes Variable Size 50331648 bytes Database Buffers 62914560 bytes Redo Buffers 667648 bytes Database mounted. Database opened. SQL> exit Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production With the Partitioning option JServer Release 9.2.0.4.0 - Production
|
2.检查bdump目录下生成的跟踪文件
目录在$ORACLE_HOME/admin/$ORACLE_SID/bdump目录下,每个后台进程都会生成一个跟踪文件。
[oracle@jumper bdump]$ ls 20050424_alert_conner.log conner_arc0_2569.trc conner_dbw0_2559.trc conner_reco_2567.trc alert_conner.log conner_arc1_2571.trc conner_lgwr_2561.trc conner_smon_2565.trc a.sql conner_ckpt_2563.trc conner_pmon_2557.trc
|
3.检查CKPT进程的跟踪文件
我们可以很容易的发现CKPT进程每3秒都对控制文件进行一次写入
[oracle@jumper bdump]$ tail -f conner_ckpt_2563.trc WAIT #0: nam='rdbms ipc message' ela= 2994710 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2442 p1=3 p2=3 p3=3 WAIT #0: nam='rdbms ipc message' ela= 2995171 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2586 p1=3 p2=3 p3=3 WAIT #0: nam='rdbms ipc message' ela= 2994962 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2582 p1=3 p2=3 p3=3 WAIT #0: nam='rdbms ipc message' ela= 2995020 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2455 p1=3 p2=3 p3=3 WAIT #0: nam='rdbms ipc message' ela= 2995188 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2412 p1=3 p2=3 p3=3 WAIT #0: nam='rdbms ipc message' ela= 2995187 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2463 p1=3 p2=3 p3=3 WAIT #0: nam='rdbms ipc message' ela= 2995095 p1=300 p2=0 p3=0 WAIT #0: nam='control file parallel write' ela= 2448 p1=3 p2=3 p3=3
|
4.检查控制文件的变更
通过2次dump控制文件,比较其trace文件输出可以比较得到不同之处,我们发现,Oracle仅仅更新了Heartbeat这个数值。
[oracle@jumper udump]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Wed Jan 18 22:44:10 2006
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production With the Partitioning option JServer Release 9.2.0.4.0 - Production
SQL> alter session set events 'immediate trace name CONTROLF level 10';
Session altered.
SQL> exit Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production With the Partitioning option JServer Release 9.2.0.4.0 - Production [oracle@jumper udump]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Wed Jan 18 22:44:18 2006
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production With the Partitioning option JServer Release 9.2.0.4.0 - Production
SQL> alter session set events 'immediate trace name CONTROLF level 10' ;
Session altered.
SQL> exit Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production With the Partitioning option JServer Release 9.2.0.4.0 - Production
[oracle@jumper udump]$ ls conner_ora_21896.trc conner_ora_21898.trc [oracle@jumper udump]$ diff conner_ora_21896.trc conner_ora_21898.trc 1c1 < /opt/oracle/admin/conner/udump/conner_ora_21896.trc --- > /opt/oracle/admin/conner/udump/conner_ora_21898.trc 14c14 < Unix process pid: 21896, image: oracle@jumper.hurray.com.cn (TNS V1-V3) --- > Unix process pid: 21898, image: oracle@jumper.hurray.com.cn (TNS V1-V3) 16c16 < *** SESSION ID9.813) 2006-01-18 22:44:14.314 --- > *** SESSION ID9.815) 2006-01-18 22:44:21.569
63c63 < heartbeat: 579991793 mount id: 3191936000 --- > heartbeat: 579991796 mount id: 3191936000 [oracle@jumper udump]$
|
Steve曾经这样描述CKPT Heartbeat:
In 8.0.5 a heartbeat mechanism was included in CKPT's timeout action (every 3 seconds) to update the checkpoint progress record for the thread in the controlfile.
Posted by eygle at 12:36 PM
| Comments (0)