Oracle10gR2 ASM 值得信赖么?

« 为何而心跳-Oracle Heartbeat研究 | Blog首页 | 2006年我们有更多的时间能浪费 »

Oracle10g引入ASM以来,关于ASM的担心就从来没有停止过.因为ASM引入了一个新的ASM Instance,新的实例的稳定性一度成为了关注的焦点.

我们看一下这个新的ASM实例引入的后台进程:

[oracle@danaly bdump]$ ps -ef|grep ASM|grep -v grep
oracle    3720     1  0 14:38 ?        00:00:00 asm_pmon_+ASM
oracle    3722     1  0 14:38 ?        00:00:00 asm_psp0_+ASM
oracle    3724     1  0 14:38 ?        00:00:00 asm_mman_+ASM
oracle    3726     1  0 14:38 ?        00:00:00 asm_dbw0_+ASM
oracle    3728     1  0 14:38 ?        00:00:00 asm_lgwr_+ASM
oracle    3730     1  0 14:38 ?        00:00:00 asm_ckpt_+ASM
oracle    3732     1  0 14:38 ?        00:00:00 asm_smon_+ASM
oracle    3734     1  0 14:38 ?        00:00:00 asm_rbal_+ASM
oracle    3736     1  0 14:38 ?        00:00:00 asm_gmon_+ASM
oracle    3748     1  0 14:38 ?        00:00:00 asm_o000_+ASM
oracle    3781     1  0 14:38 ?        00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

这些进程和数据库的后台进程同样重要(甚至可以说更为重要),如果ASM实例Crash,数据库将会立刻中止.

今天这个10g的数据库就遇到了这样的问题.第一次，Oracle10gR2的ASM挂了。

检查数据库的alert文件，获得如下信息:

Thu Jan 19 13:34:11 2006
WARNING: inbound connection timed out (ORA-3136)
Thu Jan 19 13:52:47 2006
Errors in file /opt/oracle/admin/danaly/bdump/danaly_asmb_4154.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Jan 19 13:52:47 2006
ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 4154

看来是ASM实例出现问题,导致数据库down掉。
trace文件中记录了如下错误:

*** 2006-01-19 13:52:47.922
*** SERVICE NAME:(SYS$BACKGROUND) 2006-01-19 13:52:47.917
*** SESSION ID:(157.1) 2006-01-19 13:52:47.917
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

毫无疑问，都是ASM惹的祸。

检查ASM的日志文件记录了如下错误，显示磁盘组故障导致ASM终止:

Thu Jan 19 13:52:47 2006
Errors in file /opt/oracle/admin/+ASM/bdump/+asm_gmon_3948.trc:
ORA-29702: error occurred in Cluster Group Service operation
ORA-29702: error occurred in Cluster Group Service operation
Thu Jan 19 13:52:47 2006
GMON: terminating instance due to error 29702
Instance terminated by GMON, pid = 3948

trace文件里记录了如下信息：

*** 2006-01-19 13:52:47.912
2006-01-19 13:52:47.912: [ CSSCLNT]clsssRecvMsg: comm error received,
  comrc 11, con (0xcc9fb48), msg (0xbfffe5d8), msgl 144
2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus:  clsssRecvMsg failed 3 -1073748296)
2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: returning 8
kgxgnpstat: received ABORT event from CLSS
Group services Error [NM abort event ] @ 28019:719
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
ORA-29702: error occurred in Cluster Group Service operation

启动数据库，经过恢复，成功加载ASM磁盘组，随之数据库得以正常启动:

Thu Jan 19 14:38:16 2006
SQL> ALTER DISKGROUP ALL MOUNT
Thu Jan 19 14:38:16 2006
NOTE: cache registered group ORADG number=1 incarn=0x728b66fe
Thu Jan 19 14:38:16 2006
Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface
Thu Jan 19 14:38:16 2006
NOTE: Hbeat: instance first (grp 1)
Thu Jan 19 14:38:20 2006
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: VOL1 label:VOL1
Thu Jan 19 14:38:20 2006
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 1: VOL2 label:VOL2
NOTE: cache opening disk 2 of grp 1: VOL3 label:VOL3
NOTE: cache opening disk 3 of grp 1: VOL4 label:VOL4
NOTE: cache mounting (first) group 1/0x728B66FE (ORADG)
NOTE: starting recovery of thread=1 ckpt=17.8535
NOTE: advancing ckpt for thread=1 ckpt=17.8535
NOTE: cache recovered group 1 to fcn 0.401368
Thu Jan 19 14:38:21 2006
NOTE: opening chunk 1 at fcn 0.401368 ABA
NOTE: seq=18 blk=8536
Thu Jan 19 14:38:21 2006
NOTE: cache mounting group 1/0x728B66FE (ORADG) succeeded
SUCCESS: diskgroup ORADG was mounted
Thu Jan 19 14:38:22 2006
NOTE: recovering COD for group 1/0x728b66fe (ORADG)
SUCCESS: completed COD recovery for group 1/0x728b66fe (ORADG)
Thu Jan 19 14:38:38 2006
Starting background process ASMB
ASMB started with pid=12, OS id=3744

我们注意到,日志中几次出现heartbeat字样,原来heartbeat无处不在.

顺便把数据库的恢复信息也记录一下:

Thu Jan 19 14:38:43 2006
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST
WARNING: db_recovery_file_dest is same as db_create_file_dest
Autotune of undo retention is turned on. 
IMODE=BR
ILAT =18
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  processes                = 150
  __shared_pool_size       = 75497472
  __large_pool_size        = 4194304
  __java_pool_size         = 4194304
  __streams_pool_size      = 0
  spfile                   = +ORADG/danaly/spfiledanaly.ora
  sga_target               = 943718400
  control_files            = +ORADG/danaly/controlfile/current.256.600173845, 
                             +ORADG/danaly/controlfile/current.257.600173845
  db_block_checksum        = FULL
  db_block_size            = 8192
  __db_cache_size          = 851443712
  compatible               = 10.2.0.1.0
  db_file_multiblock_read_count= 16
  db_create_file_dest      = +ORADG
  db_recovery_file_dest    = +ORADG
  db_recovery_file_dest_size= 2147483648
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS2
  recyclebin               = ON
  remote_login_passwordfile= EXCLUSIVE
  db_domain                = 
  dispatchers              = (PROTOCOL=TCP) (SERVICE=danalyXDB)
  job_queue_processes      = 10
  background_dump_dest     = /opt/oracle/admin/danaly/bdump
  user_dump_dest           = /opt/oracle/admin/danaly/udump
  core_dump_dest           = /opt/oracle/admin/danaly/cdump
  audit_file_dest          = /opt/oracle/admin/danaly/adump
  db_name                  = danaly
  open_cursors             = 300
  pga_aggregate_target     = 314572800
PMON started with pid=2, OS id=3752
PSP0 started with pid=3, OS id=3754
MMAN started with pid=4, OS id=3756
DBW0 started with pid=5, OS id=3758
LGWR started with pid=6, OS id=3760
CKPT started with pid=7, OS id=3762
SMON started with pid=8, OS id=3764
RECO started with pid=9, OS id=3766
CJQ0 started with pid=10, OS id=3768
MMON started with pid=11, OS id=3770
Thu Jan 19 14:38:44 2006
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=3772
Thu Jan 19 14:38:44 2006
starting up 1 shared server(s) ...
Thu Jan 19 14:38:44 2006
ALTER DATABASE   MOUNT
Thu Jan 19 14:38:44 2006
Starting background process ASMB
ASMB started with pid=16, OS id=3779
Starting background process RBAL
RBAL started with pid=17, OS id=3783
Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface
Thu Jan 19 14:38:52 2006
SUCCESS: diskgroup ORADG was mounted
Thu Jan 19 14:38:57 2006
Setting recovery target incarnation to 1
Thu Jan 19 14:38:57 2006
Successful mount of redo thread 1, with mount id 3945133236
Thu Jan 19 14:38:57 2006
Database mounted in Exclusive Mode
Completed: ALTER DATABASE   MOUNT
Thu Jan 19 14:38:57 2006
ALTER DATABASE OPEN
Thu Jan 19 14:38:57 2006
Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
Thu Jan 19 14:38:58 2006
Started redo scan
Thu Jan 19 14:38:58 2006
Completed redo scan
 59 redo blocks read, 14 data blocks need recovery
Thu Jan 19 14:38:58 2006
Started redo application at
 Thread 1: logseq 5834, block 72749
Thu Jan 19 14:38:58 2006
Recovery of Online Redo Log: Thread 1 Group 2 Seq 5834 Reading mem 0
  Mem# 0 errs 0: +ORADG/danaly/onlinelog/group_2.260.600173851
  Mem# 1 errs 0: +ORADG/danaly/onlinelog/group_2.261.600173853
Thu Jan 19 14:38:58 2006
Completed redo application
Thu Jan 19 14:38:59 2006
Completed crash recovery at
 Thread 1: logseq 5834, block 72808, scn 16042818137
 14 data blocks read, 14 data blocks written, 59 redo blocks read
Thu Jan 19 14:39:00 2006
Thread 1 advanced to log sequence 5835
Thread 1 opened at log sequence 5835
  Current log# 3 seq# 5835 mem# 0: +ORADG/danaly/onlinelog/group_3.262.600173855
  Current log# 3 seq# 5835 mem# 1: +ORADG/danaly/onlinelog/group_3.263.600173857
Successful open of redo thread 1
Thu Jan 19 14:39:00 2006
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Jan 19 14:39:00 2006
SMON: enabling cache recovery
Thu Jan 19 14:39:01 2006
Successfully onlined Undo Tablespace 15.
Thu Jan 19 14:39:01 2006
SMON: enabling tx recovery
Thu Jan 19 14:39:01 2006
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=22, OS id=3796
Thu Jan 19 14:39:09 2006
db_recovery_file_dest_size of 2048 MB is 16.75% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jan 19 14:39:09 2006
Completed: ALTER DATABASE OPEN