« 为何而心跳-Oracle Heartbeat研究 |
Blog首页
| 2006年 我们有更多的时间能浪费 »
Oracle10gR2 ASM 值得信赖么?
Oracle10g引入ASM以来,关于ASM的担心就从来没有停止过.因为ASM引入了一个新的ASM Instance,新的实例的稳定性一度成为了关注的焦点.
我们看一下这个新的ASM实例引入的后台进程:
    
        
            | [oracle@danaly bdump]$ ps -ef|grep ASM|grep -v greporacle    3720     1  0 14:38 ?        00:00:00 asm_pmon_+ASM
 oracle    3722     1  0 14:38 ?        00:00:00 asm_psp0_+ASM
 oracle    3724     1  0 14:38 ?        00:00:00 asm_mman_+ASM
 oracle    3726     1  0 14:38 ?        00:00:00 asm_dbw0_+ASM
 oracle    3728     1  0 14:38 ?        00:00:00 asm_lgwr_+ASM
 oracle    3730     1  0 14:38 ?        00:00:00 asm_ckpt_+ASM
 oracle    3732     1  0 14:38 ?        00:00:00 asm_smon_+ASM
 oracle    3734     1  0 14:38 ?        00:00:00 asm_rbal_+ASM
 oracle    3736     1  0 14:38 ?        00:00:00 asm_gmon_+ASM
 oracle    3748     1  0 14:38 ?        00:00:00 asm_o000_+ASM
 oracle    3781     1  0 14:38 ?        00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
 | 
    
这些进程和数据库的后台进程同样重要(甚至可以说更为重要),如果ASM实例Crash,数据库将会立刻中止. 
今天这个10g的数据库就遇到了这样的问题.第一次,Oracle10gR2的ASM挂了。
检查数据库的alert文件,获得如下信息:
    
        
            | Thu Jan 19 13:34:11 2006WARNING: inbound connection timed out (ORA-3136)
 Thu Jan 19 13:52:47 2006
 Errors in file /opt/oracle/admin/danaly/bdump/danaly_asmb_4154.trc:
 ORA-15064: communication failure with ASM instance
 ORA-03113: end-of-file on communication channel
 Thu Jan 19 13:52:47 2006
 ASMB: terminating instance due to error 15064
 Instance terminated by ASMB, pid = 4154
 | 
    
看来是ASM实例出现问题,导致数据库down掉。
trace文件中记录了如下错误: 
    
        
            | *** 2006-01-19 13:52:47.922*** SERVICE NAME:(SYS$BACKGROUND) 2006-01-19 13:52:47.917
 *** SESSION ID:(157.1) 2006-01-19 13:52:47.917
 error 15064 detected in background process
 ORA-15064: communication failure with ASM instance
 ORA-03113: end-of-file on communication channel
 | 
    
毫无疑问,都是ASM惹的祸。
检查ASM的日志文件记录了如下错误,显示磁盘组故障导致ASM终止:
    
        
            | Thu Jan 19 13:52:47 2006Errors in file /opt/oracle/admin/+ASM/bdump/+asm_gmon_3948.trc:
 ORA-29702: error occurred in Cluster Group Service operation
 ORA-29702: error occurred in Cluster Group Service operation
 Thu Jan 19 13:52:47 2006
 GMON: terminating instance due to error 29702
 Instance terminated by GMON, pid = 3948
 | 
    
trace文件里记录了如下信息: 
    
        
            | *** 2006-01-19 13:52:47.9122006-01-19 13:52:47.912: [ CSSCLNT]clsssRecvMsg: comm error received,
 comrc 11, con (0xcc9fb48), msg (0xbfffe5d8), msgl 144
 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus:  clsssRecvMsg failed 3 -1073748296)
 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: returning 8
 kgxgnpstat: received ABORT event from CLSS
 Group services Error [NM abort event ] @ 28019:719
 error 29702 detected in background process
 ORA-29702: error occurred in Cluster Group Service operation
 ORA-29702: error occurred in Cluster Group Service operation
 | 
    
启动数据库,经过恢复,成功加载ASM磁盘组,随之数据库得以正常启动: 
    
        
            | Thu Jan 19 14:38:16 2006SQL> ALTER DISKGROUP ALL MOUNT
 Thu Jan 19 14:38:16 2006
 NOTE: cache registered group ORADG number=1 incarn=0x728b66fe
 Thu Jan 19 14:38:16 2006
 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface
 Thu Jan 19 14:38:16 2006
 NOTE: Hbeat: instance first (grp 1)
 Thu Jan 19 14:38:20 2006
 NOTE: start heartbeating (grp 1)
 NOTE: cache opening disk 0 of grp 1: VOL1 label:VOL1
 Thu Jan 19 14:38:20 2006
 NOTE: F1X0 found on disk 0 fcn 0.0
 NOTE: cache opening disk 1 of grp 1: VOL2 label:VOL2
 NOTE: cache opening disk 2 of grp 1: VOL3 label:VOL3
 NOTE: cache opening disk 3 of grp 1: VOL4 label:VOL4
 NOTE: cache mounting (first) group 1/0x728B66FE (ORADG)
 NOTE: starting recovery of thread=1 ckpt=17.8535
 NOTE: advancing ckpt for thread=1 ckpt=17.8535
 NOTE: cache recovered group 1 to fcn 0.401368
 Thu Jan 19 14:38:21 2006
 NOTE: opening chunk 1 at fcn 0.401368 ABA
 NOTE: seq=18 blk=8536
 Thu Jan 19 14:38:21 2006
 NOTE: cache mounting group 1/0x728B66FE (ORADG) succeeded
 SUCCESS: diskgroup ORADG was mounted
 Thu Jan 19 14:38:22 2006
 NOTE: recovering COD for group 1/0x728b66fe (ORADG)
 SUCCESS: completed COD recovery for group 1/0x728b66fe (ORADG)
 Thu Jan 19 14:38:38 2006
 Starting background process ASMB
 ASMB started with pid=12, OS id=3744
 | 
    
我们注意到,日志中几次出现heartbeat字样,原来heartbeat无处不在. 
顺便把数据库的恢复信息也记录一下:
    
        
            | Thu Jan 19 14:38:43 2006Starting ORACLE instance (normal)
 LICENSE_MAX_SESSION = 0
 LICENSE_SESSIONS_WARNING = 0
 Picked latch-free SCN scheme 2
 Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST
 WARNING: db_recovery_file_dest is same as db_create_file_dest
 Autotune of undo retention is turned on.
 IMODE=BR
 ILAT =18
 LICENSE_MAX_USERS = 0
 SYS auditing is disabled
 ksdpec: called for event 13740 prior to event group initialization
 Starting up ORACLE RDBMS Version: 10.2.0.1.0.
 System parameters with non-default values:
 processes                = 150
 __shared_pool_size       = 75497472
 __large_pool_size        = 4194304
 __java_pool_size         = 4194304
 __streams_pool_size      = 0
 spfile                   = +ORADG/danaly/spfiledanaly.ora
 sga_target               = 943718400
 control_files            = +ORADG/danaly/controlfile/current.256.600173845,
 +ORADG/danaly/controlfile/current.257.600173845
 db_block_checksum        = FULL
 db_block_size            = 8192
 __db_cache_size          = 851443712
 compatible               = 10.2.0.1.0
 db_file_multiblock_read_count= 16
 db_create_file_dest      = +ORADG
 db_recovery_file_dest    = +ORADG
 db_recovery_file_dest_size= 2147483648
 undo_management          = AUTO
 undo_tablespace          = UNDOTBS2
 recyclebin               = ON
 remote_login_passwordfile= EXCLUSIVE
 db_domain                =
 dispatchers              = (PROTOCOL=TCP) (SERVICE=danalyXDB)
 job_queue_processes      = 10
 background_dump_dest     = /opt/oracle/admin/danaly/bdump
 user_dump_dest           = /opt/oracle/admin/danaly/udump
 core_dump_dest           = /opt/oracle/admin/danaly/cdump
 audit_file_dest          = /opt/oracle/admin/danaly/adump
 db_name                  = danaly
 open_cursors             = 300
 pga_aggregate_target     = 314572800
 PMON started with pid=2, OS id=3752
 PSP0 started with pid=3, OS id=3754
 MMAN started with pid=4, OS id=3756
 DBW0 started with pid=5, OS id=3758
 LGWR started with pid=6, OS id=3760
 CKPT started with pid=7, OS id=3762
 SMON started with pid=8, OS id=3764
 RECO started with pid=9, OS id=3766
 CJQ0 started with pid=10, OS id=3768
 MMON started with pid=11, OS id=3770
 Thu Jan 19 14:38:44 2006
 starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
 MMNL started with pid=12, OS id=3772
 Thu Jan 19 14:38:44 2006
 starting up 1 shared server(s) ...
 Thu Jan 19 14:38:44 2006
 ALTER DATABASE   MOUNT
 Thu Jan 19 14:38:44 2006
 Starting background process ASMB
 ASMB started with pid=16, OS id=3779
 Starting background process RBAL
 RBAL started with pid=17, OS id=3783
 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface
 Thu Jan 19 14:38:52 2006
 SUCCESS: diskgroup ORADG was mounted
 Thu Jan 19 14:38:57 2006
 Setting recovery target incarnation to 1
 Thu Jan 19 14:38:57 2006
 Successful mount of redo thread 1, with mount id 3945133236
 Thu Jan 19 14:38:57 2006
 Database mounted in Exclusive Mode
 Completed: ALTER DATABASE   MOUNT
 Thu Jan 19 14:38:57 2006
 ALTER DATABASE OPEN
 Thu Jan 19 14:38:57 2006
 Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
 Thu Jan 19 14:38:58 2006
 Started redo scan
 Thu Jan 19 14:38:58 2006
 Completed redo scan
 59 redo blocks read, 14 data blocks need recovery
 Thu Jan 19 14:38:58 2006
 Started redo application at
 Thread 1: logseq 5834, block 72749
 Thu Jan 19 14:38:58 2006
 Recovery of Online Redo Log: Thread 1 Group 2 Seq 5834 Reading mem 0
 Mem# 0 errs 0: +ORADG/danaly/onlinelog/group_2.260.600173851
 Mem# 1 errs 0: +ORADG/danaly/onlinelog/group_2.261.600173853
 Thu Jan 19 14:38:58 2006
 Completed redo application
 Thu Jan 19 14:38:59 2006
 Completed crash recovery at
 Thread 1: logseq 5834, block 72808, scn 16042818137
 14 data blocks read, 14 data blocks written, 59 redo blocks read
 Thu Jan 19 14:39:00 2006
 Thread 1 advanced to log sequence 5835
 Thread 1 opened at log sequence 5835
 Current log# 3 seq# 5835 mem# 0: +ORADG/danaly/onlinelog/group_3.262.600173855
 Current log# 3 seq# 5835 mem# 1: +ORADG/danaly/onlinelog/group_3.263.600173857
 Successful open of redo thread 1
 Thu Jan 19 14:39:00 2006
 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
 Thu Jan 19 14:39:00 2006
 SMON: enabling cache recovery
 Thu Jan 19 14:39:01 2006
 Successfully onlined Undo Tablespace 15.
 Thu Jan 19 14:39:01 2006
 SMON: enabling tx recovery
 Thu Jan 19 14:39:01 2006
 Database Characterset is ZHS16GBK
 replication_dependency_tracking turned off (no async multimaster replication found)
 Starting background process QMNC
 QMNC started with pid=22, OS id=3796
 Thu Jan 19 14:39:09 2006
 db_recovery_file_dest_size of 2048 MB is 16.75% used. This is a
 user-specified limit on the amount of space that will be used by this
 database for recovery-related files, and does not reflect the amount of
 space available in the underlying filesystem or ASM diskgroup.
 Thu Jan 19 14:39:09 2006
 Completed: ALTER DATABASE OPEN
 | 
    
这样的错误在Metalink上都被当做Bug处理,很难找到真正的原因,如果不能再现,就只能做为一次异常处理了.
历史上的今天...
    >>
2012-01-19文章:
    >>
2009-01-19文章:
    >>
2008-01-19文章:
    >>
2007-01-19文章:
    >>
2005-01-19文章:
 By eygle on 2006-01-19 18:07 |
Comments (0) |
 
Oracle12c/11g | 645 |
    ASMOracle