« 为何而心跳-Oracle Heartbeat研究 |
Blog首页
| 2006年 我们有更多的时间能浪费 »
Oracle10gR2 ASM 值得信赖么?
Oracle10g引入ASM以来,关于ASM的担心就从来没有停止过.因为ASM引入了一个新的ASM Instance,新的实例的稳定性一度成为了关注的焦点.
我们看一下这个新的ASM实例引入的后台进程:
    
        
            
            [oracle@danaly bdump]$ ps -ef|grep ASM|grep -v grep oracle    3720     1  0 14:38 ?        00:00:00 asm_pmon_+ASM oracle    3722     1  0 14:38 ?        00:00:00 asm_psp0_+ASM oracle    3724     1  0 14:38 ?        00:00:00 asm_mman_+ASM oracle    3726     1  0 14:38 ?        00:00:00 asm_dbw0_+ASM oracle    3728     1  0 14:38 ?        00:00:00 asm_lgwr_+ASM oracle    3730     1  0 14:38 ?        00:00:00 asm_ckpt_+ASM oracle    3732     1  0 14:38 ?        00:00:00 asm_smon_+ASM oracle    3734     1  0 14:38 ?        00:00:00 asm_rbal_+ASM oracle    3736     1  0 14:38 ?        00:00:00 asm_gmon_+ASM oracle    3748     1  0 14:38 ?        00:00:00 asm_o000_+ASM oracle    3781     1  0 14:38 ?        00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))  
             | 
        
    
这些进程和数据库的后台进程同样重要(甚至可以说更为重要),如果ASM实例Crash,数据库将会立刻中止. 
今天这个10g的数据库就遇到了这样的问题.第一次,Oracle10gR2的ASM挂了。
检查数据库的alert文件,获得如下信息:
    
        
            
            Thu Jan 19 13:34:11 2006 WARNING: inbound connection timed out (ORA-3136) Thu Jan 19 13:52:47 2006 Errors in file /opt/oracle/admin/danaly/bdump/danaly_asmb_4154.trc: ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel Thu Jan 19 13:52:47 2006 ASMB: terminating instance due to error 15064 Instance terminated by ASMB, pid = 4154 
             | 
        
    
看来是ASM实例出现问题,导致数据库down掉。
trace文件中记录了如下错误: 
    
        
            
            *** 2006-01-19 13:52:47.922 *** SERVICE NAME:(SYS$BACKGROUND) 2006-01-19 13:52:47.917 *** SESSION ID:(157.1) 2006-01-19 13:52:47.917 error 15064 detected in background process ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel  
             | 
        
    
毫无疑问,都是ASM惹的祸。
检查ASM的日志文件记录了如下错误,显示磁盘组故障导致ASM终止:
    
        
            
            Thu Jan 19 13:52:47 2006 Errors in file /opt/oracle/admin/+ASM/bdump/+asm_gmon_3948.trc: ORA-29702: error occurred in Cluster Group Service operation ORA-29702: error occurred in Cluster Group Service operation Thu Jan 19 13:52:47 2006 GMON: terminating instance due to error 29702 Instance terminated by GMON, pid = 3948  
             | 
        
    
trace文件里记录了如下信息: 
    
        
            
            *** 2006-01-19 13:52:47.912 2006-01-19 13:52:47.912: [ CSSCLNT]clsssRecvMsg: comm error received,   comrc 11, con (0xcc9fb48), msg (0xbfffe5d8), msgl 144 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus:  clsssRecvMsg failed 3 -1073748296) 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: returning 8 kgxgnpstat: received ABORT event from CLSS Group services Error [NM abort event ] @ 28019:719 error 29702 detected in background process ORA-29702: error occurred in Cluster Group Service operation ORA-29702: error occurred in Cluster Group Service operation  
             | 
        
    
启动数据库,经过恢复,成功加载ASM磁盘组,随之数据库得以正常启动: 
    
        
            
            Thu Jan 19 14:38:16 2006 SQL> ALTER DISKGROUP ALL MOUNT Thu Jan 19 14:38:16 2006 NOTE: cache registered group ORADG number=1 incarn=0x728b66fe Thu Jan 19 14:38:16 2006 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface Thu Jan 19 14:38:16 2006 NOTE: Hbeat: instance first (grp 1) Thu Jan 19 14:38:20 2006 NOTE: start heartbeating (grp 1) NOTE: cache opening disk 0 of grp 1: VOL1 label:VOL1 Thu Jan 19 14:38:20 2006 NOTE: F1X0 found on disk 0 fcn 0.0 NOTE: cache opening disk 1 of grp 1: VOL2 label:VOL2 NOTE: cache opening disk 2 of grp 1: VOL3 label:VOL3 NOTE: cache opening disk 3 of grp 1: VOL4 label:VOL4 NOTE: cache mounting (first) group 1/0x728B66FE (ORADG) NOTE: starting recovery of thread=1 ckpt=17.8535 NOTE: advancing ckpt for thread=1 ckpt=17.8535 NOTE: cache recovered group 1 to fcn 0.401368 Thu Jan 19 14:38:21 2006 NOTE: opening chunk 1 at fcn 0.401368 ABA NOTE: seq=18 blk=8536 Thu Jan 19 14:38:21 2006 NOTE: cache mounting group 1/0x728B66FE (ORADG) succeeded SUCCESS: diskgroup ORADG was mounted Thu Jan 19 14:38:22 2006 NOTE: recovering COD for group 1/0x728b66fe (ORADG) SUCCESS: completed COD recovery for group 1/0x728b66fe (ORADG) Thu Jan 19 14:38:38 2006 Starting background process ASMB ASMB started with pid=12, OS id=3744  
             | 
        
    
我们注意到,日志中几次出现heartbeat字样,原来heartbeat无处不在. 
顺便把数据库的恢复信息也记录一下:
    
        
            
            Thu Jan 19 14:38:43 2006 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 2 Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST WARNING: db_recovery_file_dest is same as db_create_file_dest Autotune of undo retention is turned on.  IMODE=BR ILAT =18 LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.1.0. System parameters with non-default values:   processes                = 150   __shared_pool_size       = 75497472   __large_pool_size        = 4194304   __java_pool_size         = 4194304   __streams_pool_size      = 0   spfile                   = +ORADG/danaly/spfiledanaly.ora   sga_target               = 943718400   control_files            = +ORADG/danaly/controlfile/current.256.600173845,                               +ORADG/danaly/controlfile/current.257.600173845   db_block_checksum        = FULL   db_block_size            = 8192   __db_cache_size          = 851443712   compatible               = 10.2.0.1.0   db_file_multiblock_read_count= 16   db_create_file_dest      = +ORADG   db_recovery_file_dest    = +ORADG   db_recovery_file_dest_size= 2147483648   undo_management          = AUTO   undo_tablespace          = UNDOTBS2   recyclebin               = ON   remote_login_passwordfile= EXCLUSIVE   db_domain                =    dispatchers              = (PROTOCOL=TCP) (SERVICE=danalyXDB)   job_queue_processes      = 10   background_dump_dest     = /opt/oracle/admin/danaly/bdump   user_dump_dest           = /opt/oracle/admin/danaly/udump   core_dump_dest           = /opt/oracle/admin/danaly/cdump   audit_file_dest          = /opt/oracle/admin/danaly/adump   db_name                  = danaly   open_cursors             = 300   pga_aggregate_target     = 314572800 PMON started with pid=2, OS id=3752 PSP0 started with pid=3, OS id=3754 MMAN started with pid=4, OS id=3756 DBW0 started with pid=5, OS id=3758 LGWR started with pid=6, OS id=3760 CKPT started with pid=7, OS id=3762 SMON started with pid=8, OS id=3764 RECO started with pid=9, OS id=3766 CJQ0 started with pid=10, OS id=3768 MMON started with pid=11, OS id=3770 Thu Jan 19 14:38:44 2006 starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... MMNL started with pid=12, OS id=3772 Thu Jan 19 14:38:44 2006 starting up 1 shared server(s) ... Thu Jan 19 14:38:44 2006 ALTER DATABASE   MOUNT Thu Jan 19 14:38:44 2006 Starting background process ASMB ASMB started with pid=16, OS id=3779 Starting background process RBAL RBAL started with pid=17, OS id=3783 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface Thu Jan 19 14:38:52 2006 SUCCESS: diskgroup ORADG was mounted Thu Jan 19 14:38:57 2006 Setting recovery target incarnation to 1 Thu Jan 19 14:38:57 2006 Successful mount of redo thread 1, with mount id 3945133236 Thu Jan 19 14:38:57 2006 Database mounted in Exclusive Mode Completed: ALTER DATABASE   MOUNT Thu Jan 19 14:38:57 2006 ALTER DATABASE OPEN Thu Jan 19 14:38:57 2006 Beginning crash recovery of 1 threads  parallel recovery started with 3 processes Thu Jan 19 14:38:58 2006 Started redo scan Thu Jan 19 14:38:58 2006 Completed redo scan  59 redo blocks read, 14 data blocks need recovery Thu Jan 19 14:38:58 2006 Started redo application at  Thread 1: logseq 5834, block 72749 Thu Jan 19 14:38:58 2006 Recovery of Online Redo Log: Thread 1 Group 2 Seq 5834 Reading mem 0   Mem# 0 errs 0: +ORADG/danaly/onlinelog/group_2.260.600173851   Mem# 1 errs 0: +ORADG/danaly/onlinelog/group_2.261.600173853 Thu Jan 19 14:38:58 2006 Completed redo application Thu Jan 19 14:38:59 2006 Completed crash recovery at  Thread 1: logseq 5834, block 72808, scn 16042818137  14 data blocks read, 14 data blocks written, 59 redo blocks read Thu Jan 19 14:39:00 2006 Thread 1 advanced to log sequence 5835 Thread 1 opened at log sequence 5835   Current log# 3 seq# 5835 mem# 0: +ORADG/danaly/onlinelog/group_3.262.600173855   Current log# 3 seq# 5835 mem# 1: +ORADG/danaly/onlinelog/group_3.263.600173857 Successful open of redo thread 1 Thu Jan 19 14:39:00 2006 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Thu Jan 19 14:39:00 2006 SMON: enabling cache recovery Thu Jan 19 14:39:01 2006 Successfully onlined Undo Tablespace 15. Thu Jan 19 14:39:01 2006 SMON: enabling tx recovery Thu Jan 19 14:39:01 2006 Database Characterset is ZHS16GBK replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC QMNC started with pid=22, OS id=3796 Thu Jan 19 14:39:09 2006 db_recovery_file_dest_size of 2048 MB is 16.75% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup. Thu Jan 19 14:39:09 2006 Completed: ALTER DATABASE OPEN  
             | 
        
    
这样的错误在Metalink上都被当做Bug处理,很难找到真正的原因,如果不能再现,就只能做为一次异常处理了.
历史上的今天...
    >>
2012-01-19文章:
    >>
2009-01-19文章:
    >>
2008-01-19文章:
    >>
2007-01-19文章:
    >>
2005-01-19文章:
  
By eygle on 2006-01-19 18:07 |
Comments (0) |
 
Oracle12c/11g | 645 |
    ASMOracle