eygle.com   eygle.com
eygle.com  
 

« January 18, 2006 | Blog首页 | January 20, 2006 »



January 19, 2006

Oracle10gR2 ASM 值得信赖么?

作者:eygle

出处:http://blog.eygle.com

Oracle10g引入ASM以来,关于ASM的担心就从来没有停止过.因为ASM引入了一个新的ASM Instance,新的实例的稳定性一度成为了关注的焦点.

我们看一下这个新的ASM实例引入的后台进程:

[oracle@danaly bdump]$ ps -ef|grep ASM|grep -v grep
oracle    3720     1  0 14:38 ?        00:00:00 asm_pmon_+ASM
oracle    3722     1  0 14:38 ?        00:00:00 asm_psp0_+ASM
oracle    3724     1  0 14:38 ?        00:00:00 asm_mman_+ASM
oracle    3726     1  0 14:38 ?        00:00:00 asm_dbw0_+ASM
oracle    3728     1  0 14:38 ?        00:00:00 asm_lgwr_+ASM
oracle    3730     1  0 14:38 ?        00:00:00 asm_ckpt_+ASM
oracle    3732     1  0 14:38 ?        00:00:00 asm_smon_+ASM
oracle    3734     1  0 14:38 ?        00:00:00 asm_rbal_+ASM
oracle    3736     1  0 14:38 ?        00:00:00 asm_gmon_+ASM
oracle    3748     1  0 14:38 ?        00:00:00 asm_o000_+ASM
oracle    3781     1  0 14:38 ?        00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) 

这些进程和数据库的后台进程同样重要(甚至可以说更为重要),如果ASM实例Crash,数据库将会立刻中止. 

今天这个10g的数据库就遇到了这样的问题.第一次,Oracle10gR2的ASM挂了。

检查数据库的alert文件,获得如下信息:

Thu Jan 19 13:34:11 2006
WARNING: inbound connection timed out (ORA-3136)
Thu Jan 19 13:52:47 2006
Errors in file /opt/oracle/admin/danaly/bdump/danaly_asmb_4154.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Jan 19 13:52:47 2006
ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 4154

看来是ASM实例出现问题,导致数据库down掉。
trace文件中记录了如下错误: 

*** 2006-01-19 13:52:47.922
*** SERVICE NAME:(SYS$BACKGROUND) 2006-01-19 13:52:47.917
*** SESSION ID:(157.1) 2006-01-19 13:52:47.917
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel 

毫无疑问,都是ASM惹的祸。

检查ASM的日志文件记录了如下错误,显示磁盘组故障导致ASM终止:

Thu Jan 19 13:52:47 2006
Errors in file /opt/oracle/admin/+ASM/bdump/+asm_gmon_3948.trc:
ORA-29702: error occurred in Cluster Group Service operation
ORA-29702: error occurred in Cluster Group Service operation
Thu Jan 19 13:52:47 2006
GMON: terminating instance due to error 29702
Instance terminated by GMON, pid = 3948 

trace文件里记录了如下信息: 

*** 2006-01-19 13:52:47.912
2006-01-19 13:52:47.912: [ CSSCLNT]clsssRecvMsg: comm error received,
  comrc 11, con (0xcc9fb48), msg (0xbfffe5d8), msgl 144
2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus:  clsssRecvMsg failed 3 -1073748296)
2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: returning 8
kgxgnpstat: received ABORT event from CLSS
Group services Error [NM abort event ] @ 28019:719
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
ORA-29702: error occurred in Cluster Group Service operation 

启动数据库,经过恢复,成功加载ASM磁盘组,随之数据库得以正常启动: 

Thu Jan 19 14:38:16 2006
SQL> ALTER DISKGROUP ALL MOUNT
Thu Jan 19 14:38:16 2006
NOTE: cache registered group ORADG number=1 incarn=0x728b66fe
Thu Jan 19 14:38:16 2006
Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface
Thu Jan 19 14:38:16 2006
NOTE: Hbeat: instance first (grp 1)
Thu Jan 19 14:38:20 2006
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: VOL1 label:VOL1
Thu Jan 19 14:38:20 2006
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 1: VOL2 label:VOL2
NOTE: cache opening disk 2 of grp 1: VOL3 label:VOL3
NOTE: cache opening disk 3 of grp 1: VOL4 label:VOL4
NOTE: cache mounting (first) group 1/0x728B66FE (ORADG)
NOTE: starting recovery of thread=1 ckpt=17.8535
NOTE: advancing ckpt for thread=1 ckpt=17.8535
NOTE: cache recovered group 1 to fcn 0.401368
Thu Jan 19 14:38:21 2006
NOTE: opening chunk 1 at fcn 0.401368 ABA
NOTE: seq=18 blk=8536
Thu Jan 19 14:38:21 2006
NOTE: cache mounting group 1/0x728B66FE (ORADG) succeeded
SUCCESS: diskgroup ORADG was mounted
Thu Jan 19 14:38:22 2006
NOTE: recovering COD for group 1/0x728b66fe (ORADG)
SUCCESS: completed COD recovery for group 1/0x728b66fe (ORADG)
Thu Jan 19 14:38:38 2006
Starting background process ASMB
ASMB started with pid=12, OS id=3744 

我们注意到,日志中几次出现heartbeat字样,原来heartbeat无处不在. 

顺便把数据库的恢复信息也记录一下:

Thu Jan 19 14:38:43 2006
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST
WARNING: db_recovery_file_dest is same as db_create_file_dest
Autotune of undo retention is turned on.
IMODE=BR
ILAT =18
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  processes                = 150
  __shared_pool_size       = 75497472
  __large_pool_size        = 4194304
  __java_pool_size         = 4194304
  __streams_pool_size      = 0
  spfile                   = +ORADG/danaly/spfiledanaly.ora
  sga_target               = 943718400
  control_files            = +ORADG/danaly/controlfile/current.256.600173845,
                             +ORADG/danaly/controlfile/current.257.600173845
  db_block_checksum        = FULL
  db_block_size            = 8192
  __db_cache_size          = 851443712
  compatible               = 10.2.0.1.0
  db_file_multiblock_read_count= 16
  db_create_file_dest      = +ORADG
  db_recovery_file_dest    = +ORADG
  db_recovery_file_dest_size= 2147483648
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS2
  recyclebin               = ON
  remote_login_passwordfile= EXCLUSIVE
  db_domain                =
  dispatchers              = (PROTOCOL=TCP) (SERVICE=danalyXDB)
  job_queue_processes      = 10
  background_dump_dest     = /opt/oracle/admin/danaly/bdump
  user_dump_dest           = /opt/oracle/admin/danaly/udump
  core_dump_dest           = /opt/oracle/admin/danaly/cdump
  audit_file_dest          = /opt/oracle/admin/danaly/adump
  db_name                  = danaly
  open_cursors             = 300
  pga_aggregate_target     = 314572800
PMON started with pid=2, OS id=3752
PSP0 started with pid=3, OS id=3754
MMAN started with pid=4, OS id=3756
DBW0 started with pid=5, OS id=3758
LGWR started with pid=6, OS id=3760
CKPT started with pid=7, OS id=3762
SMON started with pid=8, OS id=3764
RECO started with pid=9, OS id=3766
CJQ0 started with pid=10, OS id=3768
MMON started with pid=11, OS id=3770
Thu Jan 19 14:38:44 2006
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=3772
Thu Jan 19 14:38:44 2006
starting up 1 shared server(s) ...
Thu Jan 19 14:38:44 2006
ALTER DATABASE   MOUNT
Thu Jan 19 14:38:44 2006
Starting background process ASMB
ASMB started with pid=16, OS id=3779
Starting background process RBAL
RBAL started with pid=17, OS id=3783
Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface
Thu Jan 19 14:38:52 2006
SUCCESS: diskgroup ORADG was mounted
Thu Jan 19 14:38:57 2006
Setting recovery target incarnation to 1
Thu Jan 19 14:38:57 2006
Successful mount of redo thread 1, with mount id 3945133236
Thu Jan 19 14:38:57 2006
Database mounted in Exclusive Mode
Completed: ALTER DATABASE   MOUNT
Thu Jan 19 14:38:57 2006
ALTER DATABASE OPEN
Thu Jan 19 14:38:57 2006
Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
Thu Jan 19 14:38:58 2006
Started redo scan
Thu Jan 19 14:38:58 2006
Completed redo scan
 59 redo blocks read, 14 data blocks need recovery
Thu Jan 19 14:38:58 2006
Started redo application at
 Thread 1: logseq 5834, block 72749
Thu Jan 19 14:38:58 2006
Recovery of Online Redo Log: Thread 1 Group 2 Seq 5834 Reading mem 0
  Mem# 0 errs 0: +ORADG/danaly/onlinelog/group_2.260.600173851
  Mem# 1 errs 0: +ORADG/danaly/onlinelog/group_2.261.600173853
Thu Jan 19 14:38:58 2006
Completed redo application
Thu Jan 19 14:38:59 2006
Completed crash recovery at
 Thread 1: logseq 5834, block 72808, scn 16042818137
 14 data blocks read, 14 data blocks written, 59 redo blocks read
Thu Jan 19 14:39:00 2006
Thread 1 advanced to log sequence 5835
Thread 1 opened at log sequence 5835
  Current log# 3 seq# 5835 mem# 0: +ORADG/danaly/onlinelog/group_3.262.600173855
  Current log# 3 seq# 5835 mem# 1: +ORADG/danaly/onlinelog/group_3.263.600173857
Successful open of redo thread 1
Thu Jan 19 14:39:00 2006
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Jan 19 14:39:00 2006
SMON: enabling cache recovery
Thu Jan 19 14:39:01 2006
Successfully onlined Undo Tablespace 15.
Thu Jan 19 14:39:01 2006
SMON: enabling tx recovery
Thu Jan 19 14:39:01 2006
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=22, OS id=3796
Thu Jan 19 14:39:09 2006
db_recovery_file_dest_size of 2048 MB is 16.75% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jan 19 14:39:09 2006
Completed: ALTER DATABASE OPEN 

这样的错误在Metalink上都被当做Bug处理,很难找到真正的原因,如果不能再现,就只能做为一次异常处理了.

Posted by eygle at 6:07 PM | Comments (0)


为何而心跳-Oracle Heartbeat研究

作者:eygle

出处:http://blog.eygle.com

在Oracle中有一个事件叫Heartbeat,这个词在很多地方被提及,并且有着不同的含义(比如RAC中),我们这里要讨论的是CKPT的Heartbeat机制。

Oracle通过CKPT进程每3秒将Heartbeat写入控制文件,以减少故障时的恢复时间(这个我们后面再详细阐述)。

我们可以通过如下方法验证这个过程。

1.首先在系统级启用10046时间跟踪

并重新启动数据库使之生效

[oracle@jumper oracle]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Thu Jan 19 09:24:04 2006
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning option
JServer Release 9.2.0.4.0 - Production
SQL> alter system set event='10046 trace name context forever,level 12' scope=spfile;
System altered.
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area  114365800 bytes
Fixed Size                   451944 bytes
Variable Size              50331648 bytes
Database Buffers           62914560 bytes
Redo Buffers                 667648 bytes
Database mounted.
Database opened.
SQL> exit
Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning option
JServer Release 9.2.0.4.0 - Production 

2.检查bdump目录下生成的跟踪文件

目录在$ORACLE_HOME/admin/$ORACLE_SID/bdump目录下,每个后台进程都会生成一个跟踪文件。

[oracle@jumper bdump]$ ls
20050424_alert_conner.log  conner_arc0_2569.trc  conner_dbw0_2559.trc  conner_reco_2567.trc
alert_conner.log           conner_arc1_2571.trc  conner_lgwr_2561.trc  conner_smon_2565.trc
a.sql                      conner_ckpt_2563.trc  conner_pmon_2557.trc

3.检查CKPT进程的跟踪文件

我们可以很容易的发现CKPT进程每3秒都对控制文件进行一次写入

[oracle@jumper bdump]$ tail -f conner_ckpt_2563.trc 
WAIT #0: nam='rdbms ipc message' ela= 2994710 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2442 p1=3 p2=3 p3=3
WAIT #0: nam='rdbms ipc message' ela= 2995171 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2586 p1=3 p2=3 p3=3
WAIT #0: nam='rdbms ipc message' ela= 2994962 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2582 p1=3 p2=3 p3=3
WAIT #0: nam='rdbms ipc message' ela= 2995020 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2455 p1=3 p2=3 p3=3
WAIT #0: nam='rdbms ipc message' ela= 2995188 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2412 p1=3 p2=3 p3=3
WAIT #0: nam='rdbms ipc message' ela= 2995187 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2463 p1=3 p2=3 p3=3
WAIT #0: nam='rdbms ipc message' ela= 2995095 p1=300 p2=0 p3=0
WAIT #0: nam='control file parallel write' ela= 2448 p1=3 p2=3 p3=3

4.检查控制文件的变更

通过2次dump控制文件,比较其trace文件输出可以比较得到不同之处,我们发现,Oracle仅仅更新了Heartbeat这个数值。

[oracle@jumper udump]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Wed Jan 18 22:44:10 2006
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning option
JServer Release 9.2.0.4.0 - Production
SQL> alter session set events 'immediate trace name CONTROLF level 10';
Session altered.
SQL> exit
Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning option
JServer Release 9.2.0.4.0 - Production
[oracle@jumper udump]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Wed Jan 18 22:44:18 2006
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning option
JServer Release 9.2.0.4.0 - Production
SQL> alter session set events 'immediate trace name CONTROLF level 10' ;
Session altered.
SQL> exit
Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning option
JServer Release 9.2.0.4.0 - Production
[oracle@jumper udump]$ ls
conner_ora_21896.trc conner_ora_21898.trc
[oracle@jumper udump]$ diff conner_ora_21896.trc conner_ora_21898.trc
1c1
< /opt/oracle/admin/conner/udump/conner_ora_21896.trc
---
> /opt/oracle/admin/conner/udump/conner_ora_21898.trc
14c14
< Unix process pid: 21896, image: oracle@jumper.hurray.com.cn (TNS V1-V3)
---
> Unix process pid: 21898, image: oracle@jumper.hurray.com.cn (TNS V1-V3)
16c16
< *** SESSION ID9.813) 2006-01-18 22:44:14.314
---
> *** SESSION ID9.815) 2006-01-18 22:44:21.569
63c63
< heartbeat: 579991793 mount id: 3191936000
---
> heartbeat: 579991796 mount id: 3191936000
[oracle@jumper udump]$ 

Steve曾经这样描述CKPT Heartbeat:

In 8.0.5 a heartbeat mechanism was included in CKPT's timeout action (every 3 seconds) to update the checkpoint progress record for the thread in the controlfile.

Posted by eygle at 12:36 PM | Comments (0)



CopyRight © 2004-2008 eygle.com, All rights reserved.